SEED: Customize Large Language Models With Sample-Efficient Adaptation For Code Generation

major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome its own shortcomings and thus achieving efficient learning. Specifically, SEED involves identifying error code generated by LLMs, employing SELF-REVISE for code revision, optimizing the model with the revised code, and iterating this process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of SELF-REVISE, which generates revised code that optimizes the model more efficiently than the code samples from the datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.

Index Terms—Code Generation, Sample-Efficient Adaptation, Large Language Model.

Fig. 1. The performance of direct generation, fine-tuning, and our proposed SEED on the MBPP dataset with few training samples (Pass@1 for CodeGen-2B and CodeLlama-7B, under both full fine-tuning and LoRA). The numbers on the bars indicate the training samples used by different approaches.
I. INTRODUCTION

Code generation is an important technology that can improve the efficiency and quality of software development. Given a human requirement expressed in natural language, code generation allows machines to generate executable programs that satisfy this requirement. Code generation has been a research hotspot in the fields of artificial intelligence, software engineering, and natural language processing. Recently, code generation technologies have made significant advancements in both academia and industry [1]–[3]. In particular, large language models (LLMs) demonstrate great potential in code generation tasks [4]–[9]. However, LLMs still face significant challenges in code generation for some specific scenarios or domains [10], [11].

For specific code generation scenarios, fine-tuning is an essential adaptation approach to ensure LLMs fulfill particular needs [12]–[15]. However, in these specific scenarios, it is difficult to obtain sufficient training samples for fine-tuning LLMs, due to common reasons such as industry secrecy and scarcity of resources. For example, in safety-critical scenarios like aerospace, medical devices, transportation, and the financial industry, the generated code must adhere to specific security specifications, and accessing relevant data is often extremely difficult due to high confidentiality and strict access control.

Under the circumstance of limited data, mainstream fine-tuning approaches might not enable LLMs to achieve the desired code generation performance and may even lead to a degradation in model performance [16]–[18], as shown in Figure 1. Consequently, how to effectively adapt LLMs to specific scenarios with limited data available is a major challenge for code generation in practice.

Mainstream fine-tuning approaches use a large number of samples gathered under specific scenarios for training [19]. They enable the model to exhaustively learn the features present in these samples and thus adapt to the specific scenarios. However, they have two disadvantages. First, compelling LLMs to relearn the entire code data of these new scenarios is inefficient. Considering that LLMs are pre-trained on large-scale and diverse data, it is reasonable to assume that they already possess a certain level of general knowledge and lack only the particular information needed for application in specific scenarios. Second, when faced with insufficient data volume or data drift, the model may learn certain undesirable features (such as inaccurate or irrelevant programming knowledge and patterns), thereby affecting its learning efficiency and negatively impacting its final performance.

To overcome the disadvantages of these mainstream fine-tuning approaches, we take inspiration from the error-driven learning observed in humans. First, error-driven learning requires learners to identify their errors through testing. It helps learners recognize what they have mastered and what they still need to learn, allowing them to narrow the scope of learning and avoid wasting effort on irrelevancies. Second, through error revision, learners can understand their deficiencies and make targeted improvements, thus enhancing learning efficiency and effectiveness. This motivates us to explore approaches that achieve sample-efficient adaptation of LLMs for code generation guided by error-driven learning.

In this paper, we propose SEED, a Sample-Efficient adaptation based on Error-Driven learning for code generation. SEED aims to alleviate the problem of poor code generation performance when fine-tuning LLMs in scenarios with few training samples. Following error-driven learning, our approach proceeds in four steps: ❶ Error Code Collection. We identify and collect error codes generated by LLMs, aiming to mine the weaknesses of LLMs. ❷ Automatic Code Revision. To obtain revisions of error codes in a low-cost way, we design SELF-REVISE to realize automatic revision leveraging information in the original dataset and code execution feedback. ❸ Model Optimization. We optimize the LLMs using the revised code, making them focus on learning the revision of these critical errors, thereby improving the learning efficiency of LLMs. ❹ Iterative Adaptation. We adopt an iterative strategy, which involves repeating the preceding three steps, to continuously optimize and improve the performance of LLMs. Extensive experimental results demonstrate the superiority and generalizability of SEED in the sample-efficient adaptation of LLMs for specific code generation scenarios.

To summarize, the main contributions of this paper are:
• We introduce error-driven learning for LLMs. It utilizes revisions of the model's erroneous outputs for training and has higher training efficiency compared to using samples in the dataset.
• We propose a sample-efficient adaptation approach of LLMs for code generation, named SEED. Based on the principle of error-driven learning, SEED continuously optimizes the model by identifying and revising error codes generated by LLMs, enabling effective adaptation to specific code generation scenarios with limited samples.
• SEED outperforms other mainstream fine-tuning approaches on three public code generation benchmarks across various LLMs with few training samples.

II. MOTIVATION EXAMPLE

Aligning LLMs with specific scenarios and addressing their unique challenges by learning samples in the dataset is difficult, especially when training samples are limited. We present a motivation example in Figure 2 to clarify the advantages of using error-driven learning in the LLM adaptation process for code generation.

Requirement: Write a function to multiply all the numbers in a list and divide with the length of the list.

(a) Output of the LLM:

    def multiply_num(numbers):
        """
        :type numbers: List[int]
        :rtype: float
        """
        if not numbers:
            return 0
        return reduce(lambda x, y: x * y, numbers) / len(numbers)

(b) Revision to the output:

    def multiply_num(numbers):
        """
        :type numbers: List[int]
        :rtype: float
        """
        from functools import reduce
        if not numbers:
            return 0
        return reduce(lambda x, y: x * y, numbers) / len(numbers)

(c) Sample in the dataset:

    def multiply_num(numbers):
        total = 1
        for x in numbers:
            total *= x
        return total / len(numbers)

Fig. 2. A motivation example of SEED.

By observing the output (a) generated by the LLM, we find that the LLM generates basically correct code that adopts the commonly-used function ‘reduce’ [20]. However, this code still fails due to a critical error: it does not take dependencies into account, i.e., it does not import the correct libraries and functions. This observation demonstrates that the LLM has most of the capabilities needed to solve the problem, but it also reveals a shortcoming in dealing with dependencies, which is related to the code generation scenario¹. This shortcoming can be overcome simply by bootstrapping the LLM to import the correct dependencies, as shown in revision (b). However, in traditional fine-tuning approaches, it is challenging to overcome this shortcoming by learning samples in the dataset. Because sample (c) in the dataset proposes a different solution from output (a) and does not use the ‘reduce’ function, the LLM needs to put in more effort to learn the new solution from scratch and also misses the opportunity to overcome its shortcoming. Furthermore, there is a potential risk when training LLMs with sample (c): the LLM may incorrectly infer that sample (c) is the optimal solution for this requirement, resulting in the omission of the Guard Clause "if not numbers:\n return 0" present in output (a). Omitting the Guard Clause is an inadvisable programming pattern, which is undesirable to learn. Due to the absence of the Guard Clause as a safeguard for handling edge cases, an error could occur in the edge case where the input list is empty. Therefore, using revision (b) to train LLMs is a better choice, which allows LLMs to focus on and learn to solve the critical error, while simultaneously maintaining the advantages of the LLM's original output.

¹ Making LLMs generate code with dependencies that match the development environment can be viewed as a code generation scenario. The required dependencies are usually different in different development environments. For example, if the development environment is Python 2, "reduce" is a built-in function, but if it is Python 3, it must be imported from the standard library "functools" in order to be used.
Further, we explore the effectiveness of adopting error-driven learning from the perspective of model optimization. We consider the potentially significant discrepancy between the model-generated output and the sample in the dataset. By learning the revisions of the model's erroneous outputs, we can find more effective navigation in the optimization process. This might provide a shorter, smoother path to a good local minimum than learning from samples in the dataset, which attempts to direct the model toward a distant area that may not align well with its existing knowledge or biases. We conduct a statistical analysis of the discrepancies in the model's latent representations². The findings reveal that the average distance between the model's erroneous outputs and the dataset's samples is 12.35, whereas the average distance between the erroneous outputs and their revisions is significantly lower, at 6.39. These experimental results suggest that, within the model's representation space, revised codes are closer and more similar to the erroneous output codes than the original code samples are. This evidence lends support to our hypothesis of why the error-driven learning method is more efficient.

² Specifically, on the MBPP dataset, we obtain erroneous outputs of CodeGen-2B, revisions of those outputs, and samples in MBPP. We concatenate the requirements with their code, input them into CodeGen-2B, and extract the hidden representations from the model's final layer. Then, we compute the Euclidean distances within the model's representational space to quantify the disparities between these three elements.
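As a concrete illustration of this measurement, the following sketch shows one way such representation distances could be computed with the Hugging Face transformers library. It is a minimal sketch rather than the paper's analysis script: the checkpoint name, the mean pooling over the final hidden layer, and the truncation length are our assumptions.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumed checkpoint; the footnote only specifies "CodeGen-2B".
    MODEL_NAME = "Salesforce/codegen-2B-mono"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    def embed(requirement: str, code: str) -> torch.Tensor:
        """Concatenate a requirement with its code and return a pooled
        final-layer hidden representation (mean pooling is an assumption)."""
        inputs = tokenizer(requirement + "\n" + code,
                           return_tensors="pt", truncation=True, max_length=1024)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state   # shape: (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0)

    def euclidean(a: torch.Tensor, b: torch.Tensor) -> float:
        """Euclidean distance between two pooled representations."""
        return torch.dist(a, b, p=2).item()

    # For one problem, compare the erroneous output against the dataset sample
    # and against its revision; averaging over all problems yields the reported
    # 12.35 vs. 6.39 comparison.
    # d_sample   = euclidean(embed(r, error_code), embed(r, dataset_code))
    # d_revision = euclidean(embed(r, error_code), embed(r, revised_code))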
Therefore, our work explores the use of error-driven learning to achieve a sample-efficient adaptation approach, aimed at enhancing the performance of LLMs in specific code generation scenarios.

III. METHODOLOGY

In this section, we present the detailed methodology of SEED. Specifically, we are given a code generation scenario/domain with a limited-sample training dataset Dtrain = {(r, c)}, where each sample pair (r, c) consists of an input requirement r and an associated example of the desired output code c. For a pre-trained LLM Mθ with parameters θ, we aim to adapt Mθ to the specific scenario of Dtrain. SEED achieves sample-efficient adaptation of LLMs through four steps: Error Code Collection (§III-A), Automatic Code Revision (§III-B), Model Optimization (§III-C), and Iterative Adaptation (§III-D). The overview of SEED and its differences from traditional fine-tuning are shown in Figure 3.

Fig. 3. An overview of the proposed SEED and its differences from traditional fine-tuning approaches.

A. Error Code Collection

In this step, we systematically identify and collect the erroneous outputs of the LLM, using testing as the criterion.

We employ rejection sampling [21] to draw error code samples from the distribution produced by Mθ. For each requirement r ∈ Dtrain, we sample

    c′ ∼ Mθ(r) | ¬f,    (1)

where we sample multiple times and employ the criterion function f to determine the retention of the code sample. Specifically, the error code sample c′ is retained when f(r, c′) = 0, where f(r, c′) = 0 if the rejection condition is satisfied and 1 otherwise.

We define f as a test evaluation function, since testing is the criterion for judging the correctness of code in practice:

    TESTEVAL(r, c′) = { 0, if c′ fails Sr; 1, otherwise },    (2)

where Sr is a suite of test cases for the requirement r and is provided by the code generation dataset. When collecting error codes from test failures, we can keep the failed test cases and error messages simultaneously for further error diagnosis.
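The criterion f of Eq. (2) can be sketched as a small helper that executes a candidate against the test cases bundled with the dataset (assert statements, as in MBPP) and also returns the execution feedback needed later. This is a simplified, in-process sketch; the helper name is ours, and a real implementation would sandbox the execution and add a timeout.

    def test_eval(code: str, test_cases: list[str]) -> tuple[int, str]:
        """Return (1, "") if `code` passes every test case in S_r,
        otherwise (0, error message of the first failure).

        Each test case is an executable assert statement, e.g.
        "assert multiply_num([1, 2, 3]) == 2.0".
        """
        namespace: dict = {}
        try:
            exec(code, namespace)            # define the candidate function(s)
            for test in test_cases:
                exec(test, namespace)        # raises AssertionError on failure
        except Exception as exc:             # failed test, runtime error, or missing import
            return 0, f"{type(exc).__name__}: {exc}"
        return 1, ""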
To gain insights into the propensity of Mθ to make certain errors, it is advisable to select an error code sample c′ for which the model demonstrates relatively high confidence. Therefore, among the multiple error codes collected for the same r, we select the one with the highest generation probability³.

³ We determine the probability of the generated code by averaging the probabilities of its generated tokens.
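Putting Eq. (1), Eq. (2), and the confidence-based selection together, the collection step can be sketched as follows. The sampling interface `generate_with_score` is a hypothetical helper that returns a generated program together with its average token probability (the confidence measure of footnote 3); the number of samples is a placeholder.

    def collect_error_code(model, requirement: str, test_cases: list[str],
                           num_samples: int = 10):
        """Rejection sampling of Eq. (1): keep only candidates that fail their
        tests, then return the failing candidate with the highest average token
        probability, together with its execution feedback."""
        failures = []
        for _ in range(num_samples):
            code, avg_token_prob = generate_with_score(model, requirement)  # hypothetical helper
            passed, error_msg = test_eval(code, test_cases)
            if passed == 0:                  # rejection condition: keep failures only
                failures.append((avg_token_prob, code, error_msg))
        if not failures:
            return None                      # the model already solves this requirement
        failures.sort(key=lambda item: item[0], reverse=True)
        _, code, error_msg = failures[0]
        return code, error_msg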
B. Automatic Code Revision

In this step, we perform automatic revision of error codes using our SELF-REVISE approach. Based on the LLM Mθ itself, SELF-REVISE revises the error code by providing the information in the original dataset and pointing out the error with code execution feedback. Our objective is to derive a revised code that is as close as possible to the error code. Note that although there is already a correct code c in the dataset, it may differ significantly from the error code, as illustrated by examples (a), (b), and (c) in Figure 3. The pipeline of automatic code revision is shown in Figure 4.

Fig. 4. Illustration of automatic code revision. (The template assembles the requirement, correct solution, error code, error messages, and failed test cases into a prompt; the prompt is fed to MRevise under either the Self-Revise (FT) or the Self-Revise (FSP) setting, and the revised code is obtained from the resulting candidates via acceptance sampling.)

Specifically, we leverage the following parts as the input of SELF-REVISE:
• Requirement (r): Clarify the requirement that needs to be addressed;
• Correct Solution (g): Provide a correct solution as a reference to reduce the difficulty of revision. The correct solution used here is the code sample c in the training dataset;
• Error Code (c′): Give the error code that needs to be revised. The error code is generated by Mθ under r;
• Error Messages (m) and Failed Test Cases (t): Point out the error messages received during execution and the specific test cases where the error code fails, allowing for more focused troubleshooting and revision.

These parts are combined as the input of SELF-REVISE according to the template

    T = Template(r, g, c′, m, t),    (3)

where Template is shown in Figure 4.
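A plain-text rendering of this template could look like the sketch below. The field names follow Figure 4; the exact formatting, separators, and the helper name are assumptions, and the trailing field is left empty so that the model completes it with the revised code.

    def build_revision_prompt(requirement: str, correct_solution: str,
                              error_code: str, error_messages: str,
                              failed_test_cases: str) -> str:
        """Instantiate T = Template(r, g, c', m, t) from Eq. (3)."""
        return (
            f"Requirement: {requirement}\n"
            f"Correct Solution: {correct_solution}\n"
            f"Error Code: {error_code}\n"
            f"Error Messages: {error_messages}\n"
            f"Failed Test Cases: {failed_test_cases}\n"
            f"Revised Code:"
        )

For the few-shot prompting setting described next, k such templates, each followed by its revised code, would be concatenated in front of the query template to form the prompt P.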
Following previous work [22], [23], we use two settings for SELF-REVISE, i.e., fine-tuning (FT) and few-shot prompting (FSP), to obtain the model MRevise for revising error codes.

SELF-REVISE (FT) fine-tunes Mθ with a small number of samples for the purpose of automatic code revision. The training objective is to minimize the following loss function:

    L(θ) = Σi lce(Mθ(Ti), c∗i),    (4)

where lce represents the standard cross-entropy loss, and we update the parameters initialized with Mθ to obtain the model MRevise in SELF-REVISE (FT).

SELF-REVISE (FSP) adopts the few-shot prompting technique and leverages k examples of Ti and c∗i to construct the prompt P for aligning Mθ to automatic code revision. In SELF-REVISE (FSP), MRevise(·) is defined as Mθ(P || ·), where || denotes the concatenation operation.

In contrast to the preceding error code collection step, for each error code c′, we construct T and use acceptance sampling to obtain the revised code c∗:

    c∗ ∼ MRevise(T) | f,    (5)

where c∗ is retained if TESTEVAL(r, c∗) = 1 in Eq. (2), i.e., the revised code c∗ passes its test cases. We sample multiple times, and it is sufficient if MRevise correctly revises the error code once. To prevent MRevise from simply replicating the provided correct solution g, we exclude outputs identical to g. Subsequently, for each requirement r, we select the version that is most similar to the error code among the remaining code revisions.
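Acceptance sampling in Eq. (5) mirrors the earlier rejection sampling with the criterion inverted, followed by the similarity-based selection described above. The sketch below reuses the test_eval helper; the generation interface and the use of difflib as the similarity measure are assumptions, since the paper does not specify the metric.

    import difflib

    def collect_revised_code(revise_model, prompt: str, error_code: str,
                             correct_solution: str, test_cases: list[str],
                             num_samples: int = 10):
        """Acceptance sampling of Eq. (5): keep revisions that pass the tests,
        drop exact copies of the provided correct solution g, and return the
        accepted revision most similar to the original error code c'."""
        accepted = []
        for _ in range(num_samples):
            candidate = generate_revision(revise_model, prompt)   # hypothetical helper
            if candidate.strip() == correct_solution.strip():
                continue                                          # reject plain copies of g
            passed, _ = test_eval(candidate, test_cases)
            if passed == 1:
                accepted.append(candidate)
        if not accepted:
            return None
        # Select the revision closest to the error code (similarity metric assumed).
        return max(accepted,
                   key=lambda c: difflib.SequenceMatcher(None, error_code, c).ratio())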
C. Model Optimization

In this step, we employ pairs of the requirement r and its revised code c∗ to further fine-tune the model Mθ. This process leads to the enhanced version of Mθ, referred to as Mθ∗, in the specific scenario of dataset Dtrain.

For full fine-tuning [24], we update the entire parameter set θ of the LLM as

    θ∗ = arg minθ Σ(r,c∗) lce(Mθ(r), c∗).    (6)

When the computational resources are insufficient, we employ Low-Rank Adaptation (LoRA) [25] to fine-tune LLMs. For a weight matrix W ∈ Rd×k, LoRA represents its update with a low-rank decomposition:

    W + ∆W = W + α Wdown Wup,    (7)

where α is a tunable scalar hyperparameter, Wdown ∈ Rd×r, Wup ∈ Rr×k, and r << min(d, k). In LoRA, we update the parameters as

    θ∗ = θ + ∆θ,    (8)

    ∆θ = arg min∆θ Σ(r,c∗) lce(Mθ+∆θ(r), c∗).    (9)
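The low-rank update of Eq. (7) can be made concrete with a minimal PyTorch module that freezes the original weight and trains only the decomposition. This is an illustrative sketch under our own naming and initialization choices, not the implementation used in the paper, which could equally rely on an off-the-shelf LoRA library.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wrap a frozen nn.Linear so that its effective weight becomes
        W + alpha * W_down W_up, as in Eq. (7); only W_down and W_up are trained."""

        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():   # theta stays fixed; only the update trains (Eq. (9))
                p.requires_grad = False
            d, k = base.in_features, base.out_features   # rank << min(d, k)
            self.w_down = nn.Parameter(torch.randn(d, rank) * 0.01)
            self.w_up = nn.Parameter(torch.zeros(rank, k))   # zero init: the update starts at 0
            self.alpha = alpha

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.alpha * (x @ self.w_down) @ self.w_up

In practice such a wrapper would replace selected linear layers of the LLM (for example, the attention projections), so that only the low-rank parameters are updated by Eq. (9).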
D. Iterative Adaptation

The preceding three steps can go through multiple iterations, until a certain number of rounds is reached or the performance no longer improves.

For the l-th iteration with l > 1, we initialize its model Mθl as the enhanced model of the previous iteration, Mθl−1∗. Based on Mθl, we repeat the error code collection and automatic code revision steps to sample error codes {c′}l and revised codes {c∗}l, respectively. We then use the union of {(r, c∗)}1:l, i.e., the revised pairs accumulated over all iterations so far (Eq. (10)), for the subsequent model optimization step. Algorithm 1 summarizes the overall procedure of SEED.

Algorithm 1 Pseudocode of the proposed SEED.
Require: Few-sample dataset Dtrain = {(r, c)}, initial LLM Mθ.
Ensure: LLM Mθ∗.
 1: Initialize iteration index l = 0 and Mθl+1 = Mθ.
 2: # Iterative Adaptation
 3: repeat
 4:    Update l = l + 1.
 5:    # Error Code Collection
 6:    Perform rejection sampling to collect error codes {c′}l based on Mθl via Eq. (1) and (2).
 7:    # Automatic Code Revision
 8:    Perform acceptance sampling to collect revised codes {c∗}l based on Mθl and SELF-REVISE via Eq. (2), (3), and (5).
 9:    Calculate the union of {(r, c∗)}1:l via Eq. (10).
10:    # Model Optimization
11:    Fine-tune Mθl to yield Mθl∗ via Eq. (6) if the computational resources are sufficient, otherwise via Eq. (7), (8), and (9).
12:    Update Mθl+1 = Mθl∗.
13: until the end condition is satisfied
14: return Mθl∗
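Read procedurally, Algorithm 1 corresponds to a driver loop like the sketch below, which reuses the helpers sketched in the previous subsections. The builder for MRevise, the fine-tuning routine, and the fixed iteration budget standing in for the stopping condition are all placeholders.

    def seed_adapt(model, build_revise_model, train_set, max_iters: int = 3):
        """Iterative adaptation (Algorithm 1): collect error codes, revise them,
        and fine-tune on the accumulated (requirement, revised code) pairs."""
        revised_pairs = []                                # union {(r, c*)}_{1:l} (line 9)
        for _ in range(1, max_iters + 1):                 # simplified stopping condition
            revise_model = build_revise_model(model)      # Self-Revise, FT or FSP setting
            for requirement, dataset_code, test_cases in train_set:
                collected = collect_error_code(model, requirement, test_cases)
                if collected is None:
                    continue                              # no failing sample for this requirement
                error_code, error_msg = collected
                prompt = build_revision_prompt(
                    requirement, dataset_code, error_code, error_msg,
                    failed_test_cases="\n".join(test_cases))   # simplified: all tests shown
                revision = collect_revised_code(revise_model, prompt, error_code,
                                                dataset_code, test_cases)
                if revision is not None:
                    revised_pairs.append((requirement, revision))
            model = finetune_on_pairs(model, revised_pairs)   # Eq. (6), or LoRA via Eq. (7)-(9)
        return model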
IV. EVALUATION

We present extensive experiments that span three representative code generation datasets, two fine-tuning settings, and five different LLMs of varying sizes or series. We aim to investigate six research questions:
• RQ1: How does SEED perform compared to the mainstream adaptation approaches?
• RQ2: How does SEED perform when applied to various LLMs?
• RQ3: What kind of training sample has the better training effect?
• RQ4: How does the number of iterations affect the effectiveness of SEED?
• RQ5: What is the impact of implementing the automatic code revision component of SEED in conjunction with alternative LLMs?
• RQ6: How does each input component of SELF-REVISE contribute to the effectiveness?

A. Datasets

MBPP [26] contains crowd-sourced Python programming problems covering programming fundamentals. We select the version used in the work [27], which consists of 276 problems and some generated error codes alongside their human-revised counterparts, thus facilitating subsequent experiments.

HumanEval [1] is a widely-used code generation benchmark proposed by OpenAI, containing 164 handwritten programming problems. Each programming problem includes a function signature, a natural language description, use cases, a correct solution in Python, and several test cases.

DS-pandas [28] comprises 291 data science problems utilizing the Pandas library, sourced from real-world problems posted by developers on StackOverflow. This dataset can evaluate the ability of LLMs to utilize specific data-analysis libraries for code generation.

Due to the scarcity and privacy of domain data, it is difficult to obtain such data, and there are no readily available datasets. Therefore, we use MBPP, HumanEval, and DS-pandas to simulate specific scenarios with limited samples. We choose the first 100 problems of each dataset as the test set Dtest. The remaining problems (up to 200 problems) serve as the training set Dtrain. The statistics of all datasets are presented in Table I.

Fig. 5. Cases for two settings of SELF-REVISE, where "-" and "+" respectively indicate lines of code before and after revision. (The cases cover requirements such as "Write a function to perform index wise addition of tuple elements in the given two nested tuples.", "Write a function to check if the common elements between two given lists are in the same order or not.", "Write a function to extract all the pairs which are symmetric in the given tuple list.", and "Write a function to remove odd characters in a string.")

the prompt. In contrast, SELF-REVISE (FT) is capable of making minimal yet effective modifications to the model's initial error code outputs, thereby generating the correct code. Based on these observations, SELF-REVISE (FT) is recommended as the preferable approach for automatic code revision within SEED.

Summary of RQ5: The effectiveness of automatic code revision improves with the use of more powerful LLMs. However, using the LLM consistent with the final optimized LLM and under the fine-tuning setting is most beneficial for

components, i.e., correct solution, failed test cases, and error messages. By removing these components individually, we observe their specific impact on the performance of automatic code revision and the final model, and thus evaluate the effectiveness of these components.

TABLE VII
RESULTS OF THE ABLATION STUDY ON SELF-REVISE INPUT.
Approach | MRevise: Pass@1 / Pass@10 / Pass@any | Mθ∗: Pass@1 / Pass@10