
SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation


Xue Jiang, Yihong Dong, Zhi Jin, Ge Li
Key Laboratory of High Confidence Software Technologies (Peking University),
Ministry of Education; School of Computer Science, Peking University, Beijing, China
{jiangxue, dongyh}@stu.pku.edu.cn, {zhijin, lige}@pku.edu.cn

arXiv:2403.00046v2 [cs.SE] 23 Mar 2024

Abstract—Although Large Language Models (LLMs) have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training samples available in practice lead to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with few training samples is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named SEED, which stands for Sample-Efficient adaptation with Error-Driven learning for code generation. SEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome its own shortcomings, thus achieving efficient learning. Specifically, SEED involves identifying error code generated by LLMs, employing SELF-REVISE for code revision, optimizing the model with revised code, and iteratively adapting the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, SEED achieves superior performance with few training samples, showing an average relative improvement of 54.7% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of SELF-REVISE, which generates revised code that optimizes the model more efficiently compared to the code samples from datasets. Moreover, SEED consistently demonstrates strong performance across various LLMs, underscoring its generalizability.

Index Terms—Code Generation, Sample-Efficient Adaptation, Large Language Model.

[Fig. 1. The performance of direct generation, fine-tuning, and our proposed SEED on the MBPP dataset with few training samples (Pass@1; CodeGen-2B with full fine-tuning and CodeLlama-7B with LoRA). The numbers on the bars indicate the training samples used by the different approaches.]

I. INTRODUCTION

Code generation is an important technology that can improve the efficiency and quality of software development. Given a human requirement expressed in natural language, code generation allows machines to generate executable programs that satisfy this requirement. Code generation has been a research hot topic in the fields of artificial intelligence, software engineering, and natural language processing. Recently, code generation technologies have made significant advancements in both academia and industry [1]–[3]. In particular, large language models (LLMs) demonstrate great potential in code generation tasks [4]–[9]. However, LLMs still face significant challenges in code generation for some specific scenarios or domains [10], [11].

For specific code generation scenarios, fine-tuning is an essential adaptation approach to ensure LLMs fulfill particular needs [12]–[15]. However, in these specific scenarios, it is difficult to obtain sufficient training samples for fine-tuning LLMs, due to common reasons such as industry secrecy and scarcity of resources. For example, in safety-critical scenarios like aerospace, medical devices, transportation, and financial industries, the generated code must adhere to specific security specifications, and accessing relevant data is often extremely difficult due to high confidentiality and strict access control.

Under the circumstance of limited data, mainstream fine-tuning approaches might not enable LLMs to achieve the desired code generation performance and may even lead to a degradation in model performance [16]–[18], as shown in Figure 1. Consequently, how to effectively adapt LLMs to specific scenarios with limited data available is a major challenge for code generation in practice.

The mainstream fine-tuning approaches use a large number of samples gathered under specific scenarios for training [19]. They enable the model to exhaustively learn the features present in these samples and thus adapt to the specific scenarios. However, they have two disadvantages. First, compelling LLMs to relearn the entire code data of these new scenarios is inefficient. Considering that LLMs are pre-trained on large-scale and diverse data, it is reasonable to assume that they possess a certain level of general knowledge, lacking only particular information for application in specific scenarios. Second, when faced with insufficient data volume or data drift, the model may learn certain undesirable features (such as inaccurate or irrelevant programming knowledge and patterns), thereby affecting its learning efficiency and negatively impacting its final performance.
To overcome the disadvantages of these mainstream fine-tuning approaches, we take inspiration from the error-driven learning observed in humans. First, error-driven learning requires learners to identify their errors through testing. It helps learners to identify what they have mastered and what they still need to learn, allowing them to narrow the scope of learning and avoid wasting effort on irrelevancies. Second, through error revision, learners can understand their deficiencies and make targeted improvements, thus enhancing learning efficiency and effectiveness. This motivates us to explore approaches to achieve sample-efficient adaptation of LLMs for code generation guided by error-driven learning.

In this paper, we propose SEED, a Sample-Efficient adaptation based on Error-Driven learning for code generation. SEED aims to alleviate the problem of poor code generation performance of fine-tuned LLMs in scenarios with few training samples. Following error-driven learning, our approach proceeds in four steps: ❶ Error Code Collection. We identify and collect error codes generated by LLMs, aiming to mine the weaknesses of LLMs. ❷ Automatic Code Revision. To obtain revisions of error codes in a low-cost way, we design SELF-REVISE to realize automatic revision leveraging information in the original dataset and code execution feedback. ❸ Model Optimization. We optimize the LLMs using the revised code, making them focus on learning the revision of these critical errors, thereby improving the learning efficiency of LLMs. ❹ Iterative Adaptation. We adopt an iterative strategy, which involves repeating the preceding three steps, to continuously optimize and improve the performance of LLMs. Extensive experimental results demonstrate the superiority and generalizability of SEED in the sample-efficient adaptation of LLMs for specific code generation scenarios.

To summarize, the main contributions of this paper are:
• We introduce error-driven learning for LLMs. It utilizes revisions of the model's erroneous outputs for training and has higher training efficiency compared to using samples in the dataset.
• We propose a sample-efficient adaptation approach of LLMs for code generation, named SEED. Based on the principle of error-driven learning, SEED continuously optimizes the model by identifying and revising error codes generated by LLMs, enabling effective adaptation to specific code generation scenarios with limited samples.
• SEED outperforms other mainstream fine-tuning approaches on three public code generation benchmarks across various LLMs with few training samples.

II. MOTIVATION EXAMPLE

Aligning LLMs with specific scenarios and addressing their unique challenges by learning samples in the dataset is difficult, especially when training samples are limited. We present a motivation example in Figure 2 to clarify the advantages of using error-driven learning in the LLMs adaptation process of code generation.

Fig. 2. A motivation example of SEED.
Requirement: Write a function to multiply all the numbers in a list and divide with the length of the list.
(a) Output of the LLM:
    def multiply_num(numbers):
        """
        :type numbers: List[int]
        :rtype: float
        """
        if not numbers:
            return 0
        return reduce(lambda x, y: x * y, numbers) / len(numbers)
(b) Revision to the output:
    def multiply_num(numbers):
        """
        :type numbers: List[int]
        :rtype: float
        """
        from functools import reduce
        if not numbers:
            return 0
        return reduce(lambda x, y: x * y, numbers) / len(numbers)
(c) Sample in the dataset:
    def multiply_num(numbers):
        total = 1
        for x in numbers:
            total *= x
        return total / len(numbers)

By observing the output (a) generated by the LLM, we can find that the LLM generates a basically correct code that adopts the commonly-used function 'reduce' [20]. However, this code still fails due to a critical error: it does not take dependencies into account, i.e., it does not import the correct libraries and functions. This observation demonstrates that the LLM has most of the capabilities required to solve the problem, but it also reveals a shortcoming in dealing with dependencies, which is related to the code generation scenario¹. This shortcoming can be overcome just by bootstrapping the LLM to import the correct dependencies, as shown in revision (b). However, in traditional fine-tuning approaches, it is challenging to overcome this shortcoming through learning samples in the dataset, because sample (c) in the dataset proposes a different solution from output (a): it does not use the 'reduce' function. The LLM needs to put in more effort to learn the new solution from scratch and also misses the opportunity to overcome its shortcoming. Furthermore, there is a potential risk when training LLMs with sample (c): LLMs may incorrectly infer that sample (c) is the optimal solution for this requirement, resulting in the omission of the Guard Clause "if not numbers:\n return 0" in output (a). Omitting the Guard Clause is an inadvisable programming pattern, which is undesirable to learn. Due to the absence of the Guard Clause as a safeguard for handling edge cases, an error could occur in the edge case where the input list is empty. Therefore, using revision (b) to train LLMs is a better choice, which allows LLMs to focus on and learn to solve the critical error, while simultaneously maintaining the advantages of the LLM's original output.

¹ Making LLMs generate code with dependencies that match the development environment can be viewed as a code generation scenario. The required dependencies are usually different in different development environments. For example, if the development environment is Python 2, "reduce" is a built-in function, but if it is Python 3, it must be imported from the standard library "functools" in order to be used.
[Fig. 3. An overview of the proposed SEED and its differences from traditional fine-tuning approaches. Traditional fine-tuning trains Mθ directly on dataset samples, whereas SEED iterates Error Code Collection, Automatic Code Revision, and Model Optimization, updating parameters to yield Mθ*. For the requirement r: "Remove the k'th element from a given list", the figure shows
(a) the error code generated by the LLM, c': def remove_kth_element(list, k): list.pop(k); return list
(b) the revision of the error code,       c*: def remove_kth_element(list, k): list.pop(k-1); return list
(c) the code example in the dataset,      c:  def remove_kth_element(list, k): return list[:k-1] + list[k:] ]

Further, we explore the effectiveness of adopting error-driven learning from the perspective of model optimization. We consider the potential significant discrepancy between the model-generated output and the sample in the dataset. By learning the revisions of the model's erroneous outputs, we can find more effective navigation in the optimization process. This might provide a shorter, smoother path to a good local minimum compared to learning from samples in the dataset, rather than attempting to direct the model toward a distant area that may not align well with its existing knowledge or biases. We conduct a statistical analysis of the discrepancies in the model's latent representations². The findings reveal that the average distance between the model's erroneous outputs and the dataset's samples is 12.35, whereas the average distance between the erroneous outputs and their revisions is significantly lower, at 6.39. These experimental results suggest that, within the model's representation space, revised codes are closer and more similar to the erroneous output codes than the original code samples. This evidence lends support to our hypothesis of why the error-driven learning method is more efficient.

Therefore, our work is determined to explore the use of error-driven learning to achieve a sample-efficient adaptation approach, aimed at enhancing the performance of LLMs in specific code generation scenarios.

² Specifically, on the MBPP dataset, we obtain erroneous outputs of CodeGen-2B, revisions of the outputs, and samples in MBPP. We concatenate the requirements with their code, input them into CodeGen-2B, and extract the hidden representations from the model's final layer. Then, we compute the Euclidean distances within the model's representational space to quantify the disparities between these three elements.

III. METHODOLOGY

In this section, we present the detailed methodology of SEED. Specifically, we consider a code generation scenario/domain with a limited-sample training dataset Dtrain = {(r, c)}, where each sample pair (r, c) consists of an input requirement r and an associated example of desired output code c. For a pre-trained LLM Mθ with parameters θ, we aim to adapt Mθ to the specific scenario of Dtrain. SEED achieves sample-efficient adaptation of LLMs through four steps: Error Code Collection (§III-A), Automatic Code Revision (§III-B), Model Optimization (§III-C), and Iterative Adaptation (§III-D). The overview of SEED and its differences from traditional fine-tuning are demonstrated in Figure 3.

A. Error Code Collection

In this step, we systematically identify and collect erroneous outputs of LLMs using testing as the criterion. We employ rejection sampling [21] to draw error code samples from the distribution produced by Mθ. For each requirement r ∈ Dtrain, we sample

c′ ∼ Mθ(r) | ¬f,    (1)

where we sample multiple times and employ the criterion function f to determine the retention of the code sample. Specifically, the error code sample c′ is retained when f(r, c′) = 0, where f(r, c′) = 0 if the rejection condition is satisfied, and 1 otherwise. We define f as a test evaluation function, since testing is the criterion for judging the correctness of code in practice:

TESTEVAL(r, c′) = { 0, if c′ fails Sr;  1, otherwise },    (2)

where Sr is a suite of test cases under the requirement r and is provided by the code generation dataset. When collecting the error codes for test failures, we keep the failed test cases and error messages simultaneously for further error diagnosis. To gain insight into the propensity of Mθ to make certain errors, it is advisable to select error code samples c′ for which the model demonstrates relatively high confidence. Therefore, among multiple error codes collected for the same r, we select the one with the highest generation probability³.

³ We determine the probability of the generated code by averaging the probabilities of each generated token.
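To make this step concrete, the following is a minimal sketch of Eq. (1) and (2) in Python. The `generate` callable is a hypothetical stand-in for sampling from Mθ (returning candidate programs together with their average token log-probabilities); it is not part of the paper's released code. The test evaluation simply executes a candidate against the dataset's test cases.

    def test_eval(code: str, test_cases: list) -> int:
        """Eq. (2): return 1 if the candidate passes every test case in S_r, else 0."""
        namespace = {}
        try:
            exec(code, namespace)              # define the candidate function
            for assertion in test_cases:       # e.g. "assert multiply_num([1, 2, 3]) == 2.0"
                exec(assertion, namespace)
        except Exception:
            return 0
        return 1

    def collect_error_code(requirement, test_cases, generate, n_samples=5, temperature=0.8):
        """Eq. (1): keep only failing candidates, then return the one the model
        was most confident about (highest average token probability)."""
        failing = [(avg_logprob, code)
                   for code, avg_logprob in generate(requirement, n=n_samples,
                                                     temperature=temperature)
                   if test_eval(code, test_cases) == 0]
        return max(failing)[1] if failing else None   # None: the model already solves r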
B. Automatic Code Revision

In this step, we perform automatic revision of error codes using our SELF-REVISE approach. Based on the LLM Mθ itself, SELF-REVISE revises the error code by providing the information in the original dataset and pointing out the error with code execution feedback. Our objective is to derive a revised code that is as close as possible to the error code. Note that although there is already a correct code c in the dataset, it may differ significantly from the error code, as illustrated by examples (a), (b), and (c) in Figure 3. The pipeline of automatic code revision is shown in Figure 4. Specifically, we leverage the following parts as the input of SELF-REVISE:
• Requirement (r): Clarify the requirement that needs to be addressed;
• Correct Solution (g): Provide one type of correct solution as a reference to reduce the difficulty of revision. The correct solution used here is the code sample c in the training dataset;
• Error Code (c′): Give the error code that needs to be revised. The error code is generated by Mθ under r;
• Error Messages (m) and Failed Test Cases (t): Point out the error messages received during execution and the specific test cases where the error code fails, allowing for more focused troubleshooting and revision.

These parts are combined as the input of SELF-REVISE according to the template:

T = Template(r, g, c′, m, t),    (3)

where Template is shown in Figure 4.

[Fig. 4. Illustration of automatic code revision. A template Ti lists the fields Requirement {ri}, Correct Solution {gi}, Error Code {c′i}, Error Messages {mi}, Failed Test Cases {ti}, and an empty Revised Code slot. In SELF-REVISE (FT), MRevise is obtained by fine-tuning Mθ on templates paired with revised code c∗i; in SELF-REVISE (FSP), k filled examples form a prompt P and Mθ(P || T) produces a set of revised-code candidates, from which acceptance sampling selects the revised code c∗.]

Following previous work [22], [23], we use two settings for SELF-REVISE, i.e., fine-tuning (FT) and few-shot prompting (FSP), to obtain the model MRevise for revising error codes.

SELF-REVISE (FT) entails fine-tuning Mθ with a small number of samples for the purpose of automatic code revision. The training objective is to minimize the following loss function:

L(θ) = Σ_i l_ce(Mθ(Ti), c∗_i),    (4)

where l_ce represents the standard cross-entropy loss, and we update the parameters initialized with Mθ to obtain the model MRevise in SELF-REVISE (FT).

SELF-REVISE (FSP) adopts the few-shot prompting technique and leverages k examples of Ti and c∗_i to construct the prompt P for aligning Mθ to automatic code revision. In SELF-REVISE (FSP), MRevise(·) is defined as Mθ(P || ·), where || denotes the concatenation operation.

In contrast to the previous error code collection step, for each error code c′, we construct T and use acceptance sampling to obtain the revised code c∗:

c∗ ∼ MRevise(T) | f,    (5)

where c∗ is retained if TESTEVAL(r, c∗) = 1 in Eq. (2), i.e., the revised code c∗ passes its test cases. We sample multiple times, and it is sufficient if MRevise can correctly revise the error code once. To prevent MRevise from simply replicating the provided correct solution g, we exclude outputs identical to g. Subsequently, for each requirement r, we select the version that is most similar to the error code among the remaining code revisions.
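The following sketch illustrates Eq. (3) and the acceptance-sampling filter of Eq. (5). The exact prompt wording is an assumption (the paper only specifies the fields of Figure 4), and `candidates`/`passes_tests` are hypothetical stand-ins for sampling from MRevise and for TESTEVAL; the selection logic, which drops verbatim copies of g, keeps only revisions that pass the tests, and returns the candidate with the minimum Levenshtein distance to the error code, follows §III-B and §IV-B.

    def build_template(r, g, c_err, error_messages, failed_tests):
        """Eq. (3): combine the five input parts into a single SELF-REVISE input."""
        return (f"Requirement: {r}\n"
                f"Correct Solution: {g}\n"
                f"Error Code: {c_err}\n"
                f"Error Messages: {error_messages}\n"
                f"Failed Test Cases: {failed_tests}\n"
                f"Revised Code:")

    def levenshtein(a: str, b: str) -> int:
        """Edit distance [33], used to keep the revision closest to the error code."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def select_revision(g, c_err, candidates, passes_tests):
        """Eq. (5): acceptance sampling over revisions sampled from M_Revise."""
        accepted = [c for c in candidates
                    if c.strip() != g.strip()     # exclude verbatim copies of the correct solution g
                    and passes_tests(c)]          # TESTEVAL(r, c) = 1, i.e. all test cases pass
        return min(accepted, key=lambda c: levenshtein(c, c_err)) if accepted else None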
C. Model Optimization

In this step, we employ pairs of the requirement r and its revised code c∗ to further fine-tune the model Mθ. This process leads to the enhanced version of Mθ, referred to as Mθ∗, in the specific scenario of dataset Dtrain. For full fine-tuning [24], we update the entire parameter set θ of the LLM as

θ∗ = arg min_θ Σ_{(r, c∗)} l_ce(Mθ(r), c∗).    (6)
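A minimal, illustrative sketch of the full-parameter update in Eq. (6) is shown below, using the Hugging Face transformers Trainer with the hyperparameters reported in §IV-B. This is not the authors' released code: the checkpoint name (the public CodeGen-2B mono checkpoint), the way each (r, c∗) pair is concatenated into a single training sequence, and the toy pair are all assumptions.

    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    class RevisedCodeDataset(Dataset):
        """Each example concatenates a requirement r with its revised code c*."""
        def __init__(self, pairs, tokenizer, max_len=256):
            self.encoded = [tokenizer(r + "\n" + c_star, truncation=True,
                                      max_length=max_len, return_tensors="pt")
                            for r, c_star in pairs]
        def __len__(self):
            return len(self.encoded)
        def __getitem__(self, idx):
            ids = self.encoded[idx]["input_ids"][0]
            return {"input_ids": ids, "labels": ids.clone()}  # causal-LM cross-entropy

    name = "Salesforce/codegen-2B-mono"                        # assumed public checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    pairs = [("Write a function to ...", "def solution(...): ...")]  # (r, c*) from SELF-REVISE
    args = TrainingArguments(output_dir="seed-ckpt",           # settings reported in Section IV-B
                             learning_rate=5e-6, adam_beta2=0.9,
                             per_device_train_batch_size=1,
                             gradient_accumulation_steps=32,
                             num_train_epochs=10)
    Trainer(model=model, args=args,
            train_dataset=RevisedCodeDataset(pairs, tokenizer)).train()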
When the computational resources are insufficient, we employ Low-Rank Adaptation (LoRA) [25] to fine-tune LLMs. For a weight matrix W ∈ R^{d×k}, LoRA represents its update with a low-rank decomposition:

W + ΔW = W + α W_down W_up,    (7)

where α is a tunable scalar hyperparameter, W_down ∈ R^{d×r}, W_up ∈ R^{r×k}, and r ≪ min(d, k). In LoRA, we update the parameters θ∗ as

θ∗ = θ + Δθ,    (8)

Δθ = arg min_{Δθ} Σ_{(r, c∗)} l_ce(Mθ+Δθ(r), c∗).    (9)
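Eq. (7) is easy to illustrate with NumPy. This is only a sketch of the low-rank parameterization under the paper's notation (using the r = 128 and α = 8 values from §IV-B, and an arbitrary d = k = 2048); it does not implement LoRA training itself.

    import numpy as np

    d, k, r, alpha = 2048, 2048, 128, 8           # r << min(d, k)
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d, k))                   # frozen pre-trained weight
    W_down = rng.normal(scale=0.01, size=(d, r))  # trainable low-rank factor
    W_up = np.zeros((r, k))                       # zero init, so dW starts at 0

    def lora_forward(x):
        """Eq. (7): y = x (W + dW) with dW = alpha * W_down @ W_up."""
        return x @ W + alpha * (x @ W_down) @ W_up

    y = lora_forward(rng.normal(size=(4, d)))     # a batch of 4 activations
    assert y.shape == (4, k)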
D. Iterative Adaptation

The preceding three steps can go through multiple iterations, until a certain number of rounds is reached or the performance no longer improves. For the l-th iteration with l > 1, we initialize its initial model Mθ_l as the enhanced model of the previous iteration, Mθ_{l−1}∗. Based on Mθ_l, we repeat the error code collection and automatic code revision steps to sample error codes {c′}_l and revised codes {c∗}_l, respectively. We use the union of {(r, c∗)}_{1:l}, that is,

{(r, c∗)}_1 ∪ · · · ∪ {(r, c∗)}_i ∪ · · · ∪ {(r, c∗)}_l,    (10)

to update parameters in the model optimization step, thereby yielding the enhanced model of the l-th iteration, Mθ_l∗. At each iteration, the model is trained to convergence. This iteration is a step-by-step optimization designed to continuously improve the adaptability of models to the specific scenario. The complete process of SEED is summarized in Algorithm 1.

Algorithm 1 Pseudocode of the proposed SEED.
Require: Few-sample dataset Dtrain = {(r, c)}, initial LLM Mθ.
Ensure: LLM Mθ∗.
1:  Initialize iteration index l = 0 and Mθ_{l+1} = Mθ.
2:  # Iterative Adaptation
3:  repeat
4:    Update l = l + 1.
5:    # Error Code Collection
6:    Perform rejection sampling to collect error codes {c′}_l based on Mθ_l via Eq. (1) and (2).
7:    # Automatic Code Revision
8:    Perform acceptance sampling to collect revised codes {c∗}_l based on Mθ_l and SELF-REVISE via Eq. (2), (3), and (5).
9:    Calculate the union of {(r, c∗)}_{1:l} via Eq. (10).
10:   # Model Optimization
11:   Fine-tune Mθ_l to yield Mθ_l∗ via Eq. (6) if the computational resources are sufficient, otherwise via Eq. (7), (8), and (9).
12:   Update Mθ_{l+1} = Mθ_l∗.
13: until the end condition is satisfied
14: return Mθ_l∗
IV. EVALUATION

We present extensive experiments that span three representative code generation datasets, two fine-tuning settings, and five different LLMs of varying sizes or series. We aim to investigate six research questions:
• RQ1: How does SEED perform compared to the mainstream adaptation approaches?
• RQ2: How does SEED perform when applied to various LLMs?
• RQ3: What kind of training sample has the better training effect?
• RQ4: How does the number of iterations affect the effectiveness of SEED?
• RQ5: What is the impact of implementing the automatic code revision component of SEED in conjunction with alternative LLMs?
• RQ6: How does each input component of SELF-REVISE contribute to the effectiveness?

A. Datasets

MBPP [26] contains crowd-sourced Python programming problems, covering programming fundamentals. We selected the version in the work [27], which consists of 276 problems and some generated error codes alongside their human-revised counterparts, thus facilitating subsequent experiments.

HumanEval [1] is a widely-used code generation benchmark, containing 164 handwritten programming problems, proposed by OpenAI. Each programming problem includes a function signature, a natural language description, use cases, a correct solution in Python, and several test cases.

DS-pandas [28] comprises 291 data science problems utilizing Pandas libraries, sourced from real-world problems posted by developers on StackOverflow. This dataset can evaluate the ability of LLMs to utilize specific data-analysis libraries for code generation.

Due to the scarcity and privacy of domain data, it is difficult to obtain such data, and there are no available datasets. Therefore, we use MBPP, HumanEval, and DS-pandas to simulate the specific scenario with limited samples. We choose the first 100 problems from each dataset as the test dataset Dtest. The remaining problems (up to 200 problems) serve as the training dataset Dtrain. The statistics of all datasets are presented in Table I.

TABLE I
STATISTICS OF MBPP, HUMANEVAL, AND DS-PANDAS.

Dataset    | Dtrain | Dtest | Avg. Requirement Length | Avg. Code Length | Avg. Num Test Cases
MBPP       | 176    | 100   | 15.7                    | 32.5             | 3
HumanEval  | 64     | 100   | 23.0                    | 30.5             | 8
DS-pandas  | 191    | 100   | 184.8                   | 26.2             | 2
B. Implementation Details

We use a single A6000 GPU to conduct all experiments. We select CodeGen-2B [5] as our base model by default, which is a well-known open-source LLM for code and is suitable for full fine-tuning within our computational resource constraints. Mθ is initialized to our base model, and MRevise is derived from Mθ through SELF-REVISE (§III-B).

For full-parameter fine-tuning, i.e., Fine-tuning (Full) [24], we use the AdamW optimizer [29], with hyperparameters β1 = 0.9 and β2 = 0.9, accompanied by a linear learning rate schedule. The initial learning rate is set to 5e-6, with a batch size of 1 and gradient accumulation of 32 steps, training across 10 epochs. For parameter-efficient fine-tuning, i.e., Fine-tuning (LoRA) [25], the learning rate is set to 2e-4. Additionally, the rank r is set to 128, and the scaling factor α is set to 8. All other hyperparameters remain aligned with Fine-tuning (Full). For few-shot prompting [30], we set the number of examples in the prompt to 4. All baselines in the experiments use consistent settings.

In the error code collection step (§III-A) and the automatic code revision step (§III-B), we use temperature sampling [31], [32] to generate multiple samples: 5 samples in the former and 30 in the latter, with the temperature set to 0.8. To obtain the final revised code in the automatic code revision step, we choose the revised code exhibiting the minimum Levenshtein distance [33] to the error code. The number of iterative adaptations is set to 2. The maximum generation length is uniformly limited to 256 tokens.

C. Metrics

Following the practice of real software development, which utilizes testing for evaluation [34], [35], we employ the Pass@k [36] metric to measure the functional correctness of the generated code by executing test cases. We use the unbiased version [1] of Pass@k, where n ≥ k samples are generated for each problem, the number of correct samples c ≤ n that pass the test cases is counted, and the following estimator is calculated:

Pass@k = E_{Problems} [ 1 − C(n−c, k) / C(n, k) ],    (11)

where C(·, ·) denotes the binomial coefficient.

For automatic code revision, we add the Pass@any metric, which refers to the percentage of tasks for which the model generates at least one correct code that passes all test cases. In the final evaluation of this paper, we set the temperature to 0.8 and generate n = 50 samples, which are then used to calculate unbiased Pass@k [1] via Eq. (11) in all experiments. All evaluation results are averaged over five test runs.
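For reference, Eq. (11) can be computed exactly with binomial coefficients; the following is an illustrative helper rather than the authors' evaluation script.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased Pass@k for one problem: n samples, c of them correct (Eq. (11))."""
        if n - c < k:
            return 1.0                        # fewer than k incorrect samples: always a hit
        return 1.0 - comb(n - c, k) / comb(n, k)

    def mean_pass_at_k(results, k: int) -> float:
        """Average over problems; `results` is a list of (n, c) pairs."""
        return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

    # Example: 50 samples per problem, with 12, 0, and 50 correct samples respectively.
    print(mean_pass_at_k([(50, 12), (50, 0), (50, 50)], k=10))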
D. RQ1: Comparison of SEED and Mainstream Adaptation Approaches

Baselines. In this section, we evaluate the effectiveness of SEED by comparing it against four mainstream approaches used as baselines:
• Direct Generation: We evaluate the LLM directly to demonstrate the performance of the original LLM.
• Fine-tuning (Full): We employ full-parameter fine-tuning of the LLM on Dtrain.
• Fine-tuning (LoRA): We fine-tune the LLM on Dtrain using LoRA [25], which greatly reduces the number of trainable parameters.
• Few-shot Prompting: We use a 4-shot prompt [30] to align LLMs with the input-output format of Dtrain, where the 4 examples in the prompt are randomly selected from Dtrain.

Considering that the baselines involve full-parameter fine-tuning, CodeGen-2B is uniformly selected as the base model in this experiment. For SEED, we use 30% of the data of Dtrain for SELF-REVISE (FT)⁴, while the remaining 70% of Dtrain is employed for model optimization, where we use full-parameter fine-tuning.

⁴ In addition to the MBPP dataset, for the other two datasets (i.e., HumanEval and DS-pandas), we generate one error code per sample in a subset comprising 30% of the training set, using CodeGen-2B. Subsequently, the authors collaboratively apply the minimal necessary revisions to correct these error codes.

TABLE II
PASS@K (%) OF SEED AND BASELINES ON THE MBPP, HUMANEVAL, AND DS-PANDAS BENCHMARKS; THE NUMBER AFTER ↑ DENOTES THE RELATIVE IMPROVEMENT OF SEED OVER FINE-TUNING (FULL).

Benchmark  | Approach           | Pass@1          | Pass@5 | Pass@10
MBPP       | Direct Generation  | 15.6%           | 31.4%  | 40.2%
           | Fine-tuning (Full) | 25.8%           | 46.2%  | 57.6%
           | Fine-tuning (LoRA) | 19.8%           | 39.8%  | 55.2%
           | Few-shot Prompting | 24.4%           | 38.0%  | 49.4%
           | SEED               | 32.8% (↑ 27.2%) | 46.8%  | 64.0%
HumanEval  | Direct Generation  | 24.8%           | 44.7%  | 51.8%
           | Fine-tuning (Full) | 29.8%           | 47.9%  | 56.4%
           | Fine-tuning (LoRA) | 27.4%           | 46.9%  | 53.9%
           | Few-shot Prompting | 25.2%           | 45.8%  | 53.1%
           | SEED               | 38.6% (↑ 32.9%) | 54.7%  | 62.2%
DS-pandas  | Direct Generation  | 0.8%            | 3.1%   | 5.6%
           | Fine-tuning (Full) | 2.6%            | 6.5%   | 9.6%
           | Fine-tuning (LoRA) | 2.2%            | 6.0%   | 8.9%
           | Few-shot Prompting | 1.9%            | 4.5%   | 5.7%
           | SEED               | 5.3% (↑ 103.9%) | 9.5%   | 12.3%

Results. We conducted experiments on three public benchmarks, i.e., MBPP, HumanEval, and DS-pandas. The experimental results are summarized in Table II. This comparison yielded four insightful observations: 1) Significant superiority of SEED: Our proposed SEED performs significantly better than the other four baselines on the three benchmarks. Notably, SEED exhibits significant relative improvements of 27.2%, 32.9%, and 103.9%, respectively, when compared to the best-performing baseline, Fine-tuning (Full). 2) Worst performance of Direct Generation: The performance of Direct Generation is significantly lower than the Fine-tuning (Full), Fine-tuning (LoRA), and Prompting baselines. This result suggests that directly applying LLMs without adaptation may be less suitable for specific scenarios, resulting in performance differences. 3) Fine-tuning (LoRA) is less effective than Fine-tuning (Full): Although Fine-tuning (LoRA) offers the advantage of reduced computational resource requirements for fine-tuning LLMs, it trades off performance. 4) Less improvement of Few-shot Prompting: Few-shot Prompting generally underperforms compared to the Fine-tuning approaches. It only outperforms Fine-tuning (LoRA) in the MBPP benchmark's Pass@1 metric. However, in all other situations, the efficacy of Few-shot Prompting is the lowest, with the exception of Direct Generation.
Summary of RQ1: In code generation scenarios with limited training samples, SEED exhibits improvements compared to the mainstream adaptation approaches, achieving relative improvements between 27.2% and 103.9%.

E. RQ2: SEED with Different LLMs

Baselines. We use the following types of well-known LLMs to perform SEED:
• CodeGen-2B and CodeGen-6B [5] are code LLMs trained on natural language and programming data for conversation-based program synthesis.
• Llama-7B [37] is a general-purpose foundational LLM developed by Meta that has been trained on diverse data to handle a wide range of tasks.
• CodeLlama-7B [3] is an open foundational LLM for code generation tasks, derived from continued training and fine-tuning based on the Llama architecture [37].
Among them, CodeGen-2B uses full fine-tuning, and the remaining LLMs use parameter-efficient fine-tuning with LoRA. Each LLM has the same baselines as RQ1 (§IV-D), i.e., Direct Generation, Fine-tuning, and Few-shot Prompting.

TABLE III
PASS@K (%) OF SEED AND BASELINES WITH DIFFERENT LLMS; THE NUMBER AFTER ↑ DENOTES THE RELATIVE IMPROVEMENT OF SEED OVER FINE-TUNING.

Model         | Approach           | Pass@1          | Pass@5 | Pass@10
CodeGen-2B    | Direct Generation  | 15.6%           | 31.4%  | 40.2%
              | Fine-tuning (Full) | 25.8%           | 46.2%  | 57.6%
              | Few-shot Prompting | 24.4%           | 38.0%  | 49.4%
              | SEED (Full)        | 32.8% (↑ 27.2%) | 46.8%  | 64.0%
CodeGen-6B    | Direct Generation  | 19.6%           | 40.2%  | 60.8%
              | Fine-tuning (LoRA) | 26.6%           | 46.8%  | 63.0%
              | Few-shot Prompting | 26.2%           | 45.2%  | 60.2%
              | SEED (LoRA)        | 33.4% (↑ 25.6%) | 47.4%  | 67.6%
Llama-7B      | Direct Generation  | 13.4%           | 29.8%  | 37.4%
              | Fine-tuning (LoRA) | 15.2%           | 27.4%  | 34.0%
              | Few-shot Prompting | 16.6%           | 26.2%  | 33.8%
              | SEED (LoRA)        | 22.0% (↑ 44.7%) | 30.4%  | 40.8%
CodeLlama-7B  | Direct Generation  | 20.4%           | 43.8%  | 52.8%
              | Fine-tuning (LoRA) | 19.9%           | 42.4%  | 53.2%
              | Few-shot Prompting | 27.8%           | 46.6%  | 64.8%
              | SEED (LoRA)        | 34.8% (↑ 74.9%) | 49.2%  | 65.8%

Results. The results of applying SEED to different LLMs are shown in Table III. The results demonstrate that SEED consistently achieves significant improvements across all these LLMs, outperforming the Direct Generation, Fine-tuning, and Few-shot Prompting baselines. Moreover, we also find that the performance of code LLMs is better than that of pure LLMs, and there is a clear trend indicating that as the number of parameters in the LLMs increases, the efficacy of the Direct Generation and Few-shot Prompting approaches increases steadily. At the same time, the effectiveness of SEED on these LLMs also continuously improves.

Summary of RQ2: SEED consistently enhances performance across all LLMs, outperforming the Direct Generation, Fine-tuning, and Few-shot Prompting baselines, showcasing the generalizability of SEED.

F. RQ3: The Effect of Training Sample Variants

Baselines. We investigate the influence of different training samples on the final adapted model Mθ∗ to validate the effectiveness of using revisions of the model's erroneous output for training. The different variants of training samples include:
• W/o Training: Direct generation without any training samples.
• Raw Dtrain: All samples in Dtrain, which are the raw requirement and code sample pairs in the dataset.
• Dtrain ∩ SEED: The samples in Dtrain for the same problems as SEED.
• SEED ∪ Dtrain: Includes not only samples of problems obtained through SELF-REVISE, but also samples of the other problems in Dtrain.
• Human-revised Dtrain: Samples obtained through human revision, which are the pairs of requirement and revised code of the LLM's erroneous output.
• SEED: Samples obtained through SELF-REVISE, which are the pairs of requirement and revised code of the LLM's erroneous output.

TABLE IV
COMPARISON OF THE EFFECT OF DIFFERENT TRAINING SAMPLE VARIANTS.

Variants              | Pass@1 | Pass@5 | Pass@10
W/o Training          | 15.6%  | 31.4%  | 40.2%
Raw Dtrain            | 25.8%  | 46.2%  | 57.6%
Dtrain ∩ SEED         | 22.4%  | 33.8%  | 42.8%
SEED ∪ Dtrain         | 29.2%  | 44.2%  | 58.0%
Human-revised Dtrain  | 28.0%  | 46.2%  | 59.8%
SEED                  | 32.8%  | 46.8%  | 64.0%

Results. As shown in Table IV, we discover that: 1) SEED exceeds Raw Dtrain, despite Raw Dtrain having more training samples. This proves that training using revisions produced by SELF-REVISE is more efficient than using samples in the dataset. 2) The effect of Dtrain ∩ SEED is comparatively weaker, which reveals that SEED is not simply improved by selecting better problems. 3) SEED ∪ Dtrain is not as effective as SEED, which shows that some samples in Dtrain have negative effects when samples are limited. 4) The performance of SEED surpasses that of Human-revised Dtrain. This finding may be attributed to a disconnect between the revisions made by humans and the model's learning expectations. While human revisions are applied to all code samples in Dtrain, some samples may inherently be challenging for the current model. As such, forced learning from these samples may have counterproductive effects, highlighting a potential limitation of human-revised Dtrain.
Summary of RQ3: The revisions of error code through SELF-REVISE surpass other training sample variants, including both samples in the dataset and samples revised by humans, in terms of training efficiency and effectiveness.

G. RQ4: The Effect of Iterations

Baselines. We study the effect of iterations on SEED. We analyze the progression of SEED's effectiveness across different iterations, starting from 0 iterations (i.e., generating directly with LLMs) and extending to one, and up to four, iterations.

TABLE V
PERFORMANCE OF SEED WITH DIFFERENT NUMBERS OF ITERATIONS.

Iterations | Pass@1 | Pass@5 | Pass@10 | Num. of Revised Code
0          | 15.6%  | 31.4%  | 40.2%   | -
1          | 31.6%  | 46.3%  | 60.6%   | 31 (+31)
2          | 32.8%  | 46.8%  | 64.0%   | 41 (+10)
3          | 33.0%  | 46.7%  | 62.6%   | 43 (+2)
4          | 33.2%  | 47.1%  | 64.0%   | 44 (+1)

Results. We conduct this experiment on the MBPP dataset, and its results are displayed in Table V. From the results, we can observe a trend: as the number of iteration rounds increases, the performance of SEED first increases and then gradually stabilizes. At the same time, the amount of revised code grows with each iteration, indicating that errors are continuously discovered, corrected, and learned. Considering that Pass@10 oscillates from the 2nd iteration to the 4th iteration, we choose to end after the second iteration as the final performance of SEED.

Summary of RQ4: As iteration rounds increase, the performance of SEED initially improves and then stabilizes, and over 98% of the performance can be achieved within two iterations.

H. RQ5: Automatic Code Revision Based on Other Models

Baselines. We evaluate the performance of automatic code revision, and its impact on the final model Mθ∗ obtained through SEED, when using alternative LLMs to substitute the base model as MRevise. The base model and alternative LLMs are as follows:
• Base Model: We use CodeGen-2B [5] as the base model for automatic code revision, i.e., SELF-REVISE.
• CodeGen-6B [5]: A variant in the same series as our base model but with a larger parameter count.
• Llama-7B [37]: An LLM with training data and architecture different from the base model.
• CodeLlama-7B [3]: A code LLM with training data and architecture different from the base model.
• ChatGPT [38]: A powerful AI chatbot developed based on LLMs. Since ChatGPT is closed-source and cannot be fine-tuned, we employ ChatGPT only for SELF-REVISE (FSP).
We obtain MRevise in both the fine-tuning and few-shot prompting settings for comparison. In this experiment, Mθ∗ is consistently fixed as CodeGen-2B.

TABLE VI
COMPARISON OF AUTOMATIC CODE REVISION BASED ON DIFFERENT LLMS IN BOTH FINE-TUNING AND FEW-SHOT PROMPTING SETTINGS, WHERE THE MRevise COLUMNS REPORT RAW REVISION RESULTS AND Mθ∗ IS FINE-TUNED WITH THE FILTERED RESULTS AS DESCRIBED IN §III-A.

Few-shot Prompting                 | MRevise Pass@1 | MRevise Pass@10 | MRevise Pass@any | Mθ∗ Pass@1 | Mθ∗ Pass@10
CodeGen-6B                         | 19.4%          | 60.1%           | 70.8%            | 26.8%      | 59.0%
Llama-7B                           | 23.5%          | 67.7%           | 81.9%            | 20.8%      | 54.2%
CodeLlama-7B                       | 20.2%          | 64.9%           | 75.0%            | 25.2%      | 59.6%
ChatGPT                            | 61.4%          | 87.3%           | 92.1%            | 27.0%      | 62.4%
Base Model (SELF-REVISE (FSP))     | 18.9%          | 57.1%           | 69.4%            | 26.2%      | 58.2%

Fine-tuning                        | MRevise Pass@1 | MRevise Pass@10 | MRevise Pass@any | Mθ∗ Pass@1 | Mθ∗ Pass@10
CodeGen-6B                         | 5.0%           | 20.3%           | 26.6%            | 29.4%      | 64.2%
Llama-7B                           | 2.7%           | 8.5%            | 12.6%            | 23.2%      | 58.4%
CodeLlama-7B                       | 5.1%           | 21.0%           | 34.6%            | 24.0%      | 60.2%
Base Model (SELF-REVISE (FT))      | 3.9%           | 18.9%           | 24.6%            | 32.8%      | 64.0%

Results. Table VI illustrates the experimental results of automatic code revision based on different models, and we can observe that: 1) SELF-REVISE (FT) employing the same model as the base model yields the best performance of Mθ∗. For baselines using other LLMs in the fine-tuning setting, CodeLlama exhibits superior performance in terms of Pass@k for MRevise, but its final effectiveness is somewhat compromised. This limitation is attributed to the divergence in training data and architecture between CodeLlama and the base model, leading to inconsistencies between the revised code and the base model's expectations. In contrast, CodeGen-6B, which belongs to the same series as the base model but has more parameters, demonstrates slightly lower Pass@k for MRevise than CodeLlama but still achieves commendable results for Mθ∗. 2) Although the Pass@k of SELF-REVISE (FSP) is higher than that of SELF-REVISE (FT) for MRevise, it does not perform as well on the ultimate Mθ∗. We find this discrepancy may be due to SELF-REVISE (FSP)'s tendency to learn superficial forms, i.e., it often resorts to directly copying code from the correct solution provided in the prompt, even when explicitly instructed not to. Using ChatGPT as MRevise results in substantially higher Pass@k compared to using the base model, but does not significantly enhance the final model Mθ∗.

Qualitative Examples. We use a case study to qualitatively assess the effectiveness of automatic code revision (§III-B), i.e., SELF-REVISE (FSP) and SELF-REVISE (FT) employed by SEED; examples are presented in Figure 5. Upon manual inspection of the outcomes produced by SELF-REVISE (FSP), two prevalent modification patterns are identified. First, the removal of redundant code is a common alteration. This includes the deletion of unnecessary blocks such as "if __name__ == '__main__'" and other test code, which are often extraneous in the context of the desired output. Second, SELF-REVISE (FSP) exhibits a tendency to directly copy correct code samples from the prompt. In contrast, SELF-REVISE (FT) is capable of making minimal yet effective modifications to the model's initial error code outputs, thereby generating the correct code. Based on these observations, SELF-REVISE (FT) is recommended as the preferable approach for automatic code revision within SEED.
[Fig. 5. Cases for the two settings of SELF-REVISE, where "-" and "+" respectively indicate lines of code before and after revision. The SELF-REVISE (FSP) cases (Case I: "Write a function to perform index wise addition of tuple elements in the given two nested tuples"; Case II: "Write a function to check if the common elements between two given lists are in the same order or not") show the revision replacing the model's original attempt wholesale and, in Case II, copying the sample code from the dataset verbatim. The SELF-REVISE (FT) cases (Case I: "Write a function to extract all the pairs which are symmetric in the given tuple list"; Case II: "Write a function to remove odd characters in a string") show minimal edits to the error code, e.g., replacing the inner pair-matching loop with a reversed-tuple membership check and flipping the parity condition "if i % 2 == 0" to "if not i % 2 == 0", while the corresponding dataset samples solve the problems in different ways.]

Summary of RQ5: The effectiveness of automatic code revision improves with the use of more powerful LLMs. However, using the LLM consistent with the final optimized LLM, under the fine-tuning setting, is most beneficial for the final performance.

I. RQ6: Ablation Study on SELF-REVISE Input

Baselines. We further perform an ablation study to investigate the effectiveness of each input component of SELF-REVISE. Requirements and error codes are the indispensable basic inputs for performing automatic code revision. Therefore, we perform ablation experiments on the remaining three components, i.e., the correct solution, the failed test cases, and the error messages. By removing these components individually, we observe their specific impact on the performance of automatic code revision and the final model, and thus evaluate the effectiveness of these components.

TABLE VII
RESULTS OF THE ABLATION STUDY ON SELF-REVISE INPUT.

Approach            | MRevise Pass@1 | MRevise Pass@10 | MRevise Pass@any | Mθ∗ Pass@1 | Mθ∗ Pass@10
SEED                | 3.9%           | 18.9%           | 24.6%            | 32.8%      | 64.0%
- Correct Solution  | 3.4%           | 15.4%           | 19.8%            | 30.1%      | 61.9%
- Error Messages    | 3.1%           | 14.2%           | 17.3%            | 28.6%      | 58.7%
- Failed Test Cases | 2.3%           | 5.1%            | 6.3%             | 26.1%      | 47.6%

Results. We conduct the ablation study on the MBPP dataset, as shown in Table VII. First, we find that removing the failed test cases results in the largest drop in performance across all metrics. Failed test cases demonstrate the inconsistency between the model-generated code output and the desired output, allowing LLMs to reason about and correct erroneous operations. The experimental results show that this component is the most helpful for automatic code revision. Second, removing the error messages or the correct code solution also results in a loss of performance. Error messages directly indicate surface errors in the generated code (such as syntax errors and runtime errors) and the location of the errors, which is also helpful for LLMs to revise the code. The correct code samples in the dataset can provide a reference for revising the errors of LLMs, thus further reducing the difficulty of correction.
Summary of RQ6: The analysis of the ablation study indicates that all of the input components of SELF-REVISE positively contribute to the effectiveness of automatic code revision and the final model.

V. RELATED WORK

A. Adaptation of LLMs

Many tasks rely on adapting LLMs to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. Since LLMs contain a large number of model parameters and performing full-parameter tuning would be very expensive [39], a number of parameter-efficient fine-tuning approaches have been developed, including Adapter Tuning [40], [41], Prompt Tuning [42], [43], Prefix Tuning [44], [45], and Low-rank Adaptation [25]. These lightweight tuning approaches typically achieve near-full fine-tuning performance while reducing the number of parameters required for training. They primarily optimize the efficiency of training the model parameters but are not directly targeted at improving the efficiency of data sample usage.

Another adaptation approach that does not require training is prompting [46], which depends on in-context learning [47], [48]. By utilizing natural language instructions and a few examples of a task, this approach enables LLMs to recognize and execute new tasks without explicit gradient updates. However, a limitation of this approach is that the models often merely mimic the surface form of the prompt, struggling to deeply understand or adapt to complex and abstract task requirements.

Our approach is orthogonal to the aforementioned adaptation techniques, allowing for its concurrent application with these approaches to enhance overall effectiveness.

B. Code Generation

In the era of deep learning, the field of code generation has undergone significant evolution. Initially, these technologies treated code similarly to natural language [49]–[51], then gradually integrated structure specific to code, such as Abstract Syntax Trees (AST) [52]–[56] and API calls [57]–[59]. Notably, Mukherjee et al. [60] proposed a code generation model that incorporates static analysis tools. Concurrently, Dong et al. [61] devised a PDA-based methodology to ensure the syntactic correctness of code generation.

The rise of pre-training techniques has brought new momentum to the field of code generation. Against this backdrop, LLMs such as Codex [1], InCoder [6], CodeGen [62], AlphaCode [36], and CodeGeeX [4] have emerged, greatly enhancing the performance of code generation. For LLM-based code generation, there are approaches that correct the LLMs' errors, such as Self-edit [22], Self-debug [63], and Self-collaboration [23], which use error correction to improve the predictions of the LLM without making any improvement to the model itself. Recently, Chen et al. [27] proposed an ILF approach focused on using human feedback to refine model results. However, this approach necessitates continuous human involvement and the provision of feedback throughout the model's training phase, which incurs significant costs in practical applications. Further, Chen et al. [64] proposed a distillation approach that employs ChatGPT [38] to generate a large amount of refinements to train small models. However, this approach presents two primary limitations. First, it necessitates a highly performant "teacher" model, significantly surpassing the capabilities of the "student" model. Second, commercial constraints and other factors likely prohibit its implementation.

In contrast, our work is an adaptation approach, focusing on obtaining a better model adapted to specific domains with limited data. By exploiting the inherent potential of LLMs to achieve error-driven learning, we significantly improve the learning efficiency of LLMs.

VI. THREATS TO VALIDITY

There are three major threats to the validity of our work.

1) Threats to external validity concern the quality of the experimental datasets and the generalizability of our results. First, we simulate specific code generation scenarios with three public code generation datasets, which are mainstream benchmarks and have been used in many related works [65]–[70]. Second, SEED can be applied to any LLM, and we choose four well-known LLMs [71]–[74] of different sizes, training data, and architectures for our experiments.

2) Threats to internal validity involve the impact of hyperparameters. Deep learning models are known to be sensitive to hyperparameters. For our proposed SEED, we only perform a small-range grid search on hyperparameters, including the number of SEED iterations, learning rates, and training epochs, leaving the other hyperparameters the same as those in previous studies [3], [5], [25], [30], [37], [75], which have explored effective settings of the hyperparameters through extensive experiments. For the baselines, their settings are consistent with our approach.

3) Threats to construct validity pertain to the reliability of evaluation metrics. To address this threat, we employ Pass@k [36] as the evaluation metric, which leverages test cases to gauge the functional correctness of code. Additionally, we employ the unbiased version of Pass@k [1] to diminish evaluation errors that arise from sampling. Pass@k is the mainstream metric for code generation and is widely used in previous studies [4], [76]–[81]. On this basis, each experiment is run five times, and its average result is reported.
VII. CONCLUSION

In this work, we have proposed SEED, a Sample-Efficient adaptation with Error-Driven learning for code generation, improving the code generation performance of LLMs in specific scenarios with limited samples. SEED outperforms mainstream adaptation approaches (e.g., full-parameter fine-tuning, LoRA, and few-shot prompting) with limited training samples on three public code generation benchmarks. Our analysis indicates that LLMs can learn more efficiently by learning from the revisions of their errors than by traditionally learning from code examples in datasets. Extensive results show that SEED yields large enhancements on five different LLMs of varying sizes or series, which proves its generalizability. We believe that training LLMs adaptively with few samples may have broader application scenarios. We leave this area of exploration for future work.

REFERENCES

[1] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," CoRR, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374
[2] S. Shen, X. Zhu, Y. Dong, Q. Guo, Y. Zhen, and G. Li, "Incorporating domain knowledge through task augmentation for front-end javascript code generation," in ESEC/SIGSOFT FSE. ACM, 2022, pp. 1533–1543.
[3] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, "Code llama: Open foundation models for code," CoRR, vol. abs/2308.12950, 2023.
[4] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang, "Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x," CoRR, vol. abs/2303.17568, 2023.
[5] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "Codegen: An open large language model for code with multi-turn program synthesis," in ICLR. OpenReview.net, 2023.
[6] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W. Yih, L. Zettlemoyer, and M. Lewis, "Incoder: A generative model for code infilling and synthesis," CoRR, vol. abs/2204.05999, 2022.
[7] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J. Lou, and W. Chen, "CodeT: Code generation with generated tests," in ICLR, 2023.
[8] T. Zhang, T. Yu, T. Hashimoto, M. Lewis, W. Yih, D. Fried, and S. Wang, "Coder reviewer reranking for code generation," in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 41832–41846.
[9] X. Jiang, Y. Dong, L. Wang, Q. Shang, and G. Li, "Self-planning code generation with large language model," CoRR, vol. abs/2303.06689, 2023.
[10] T. Ahmed, C. Bird, P. Devanbu, and S. Chakraborty, "Studying llm performance on closed- and open-source data," arXiv preprint arXiv:2402.15100, 2024.
[11] M. Chen, H. Zhang, C. Wan, Z. Wei, Y. Xu, J. Wang, and X. Gu, "On the effectiveness of large language models in domain-specific code generation," CoRR, vol. abs/2312.01639, 2023.
[12] E. Shi, Y. Wang, H. Zhang, L. Du, S. Han, D. Zhang, and H. Sun, "Towards efficient fine-tuning of pre-trained code models: An experimental study and beyond," in ISSTA. ACM, 2023, pp. 39–51.
[13] S. Liu, J. Keung, Z. Yang, F. Liu, Q. Zhou, and Y. Liao, "Delving into parameter-efficient fine-tuning in code change learning: An empirical study," CoRR, vol. abs/2402.06247, 2024.
[14] S. Chakraborty, T. Ahmed, Y. Ding, P. T. Devanbu, and B. Ray, "Natgen: generative pre-training by "naturalizing" source code," in ESEC/SIGSOFT FSE. ACM, 2022, pp. 18–30.
[15] M. Ciniselli, N. Cooper, L. Pascarella, A. Mastropaolo, E. Aghajani, D. Poshyvanyk, M. D. Penta, and G. Bavota, "An empirical study on the usage of transformer models for code completion," IEEE Trans. Software Eng., vol. 48, no. 12, pp. 4818–4837, 2022.
[16] A. Aghajanyan, A. Shrivastava, A. Gupta, N. Goyal, L. Zettlemoyer, and S. Gupta, "Better fine-tuning by reducing representational collapse," in ICLR. OpenReview.net, 2021.
[17] R. Xu, F. Luo, Z. Zhang, C. Tan, B. Chang, S. Huang, and F. Huang, "Raise a child in large language model: Towards effective and generalizable fine-tuning," in EMNLP (1). Association for Computational Linguistics, 2021, pp. 9514–9528.
[18] H. Zhang, G. Li, J. Li, Z. Zhang, Y. Zhu, and Z. Jin, "Fine-tuning pre-trained language models effectively by optimizing subnetworks adaptively," in NeurIPS, 2022.
[19] H. Xu, S. Ebner, M. Yarmohammadi, A. S. White, B. Van Durme, and K. Murray, "Gradual fine-tuning for low-resource domain adaptation," in Proceedings of the Second Workshop on Domain Adaptation for NLP, 2021, pp. 214–221.
[20] J. Dean and S. Ghemawat, "Mapreduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.
[21] G. Casella, C. P. Robert, and M. T. Wells, "Generalized accept-reject sampling schemes," Lecture Notes-Monograph Series, pp. 342–347, 2004.
[22] K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin, "Self-edit: Fault-aware code editor for code generation," in ACL (1). Association for Computational Linguistics, 2023, pp. 769–787.
[23] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration code generation via chatgpt," CoRR, vol. abs/2304.07590, 2023.
[24] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186.
[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," in ICLR. OpenReview.net, 2022.
[26] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, "Program synthesis with large language models," CoRR, vol. abs/2108.07732, 2021.
[27] A. Chen, J. Scheurer, T. Korbak, J. A. Campos, J. S. Chan, S. R. Bowman, K. Cho, and E. Perez, "Improving code generation by training with natural language feedback," CoRR, vol. abs/2303.16749, 2023.
[28] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W. Yih, D. Fried, S. I. Wang, and T. Yu, "DS-1000: A natural and reliable benchmark for data science code generation," in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 18319–18345.
[29] I. Loshchilov and F. Hutter, "Fixing weight decay regularization in adam," CoRR, vol. abs/1711.05101, 2017.
[30] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in NeurIPS, 2020.
[31] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, "The curious case of neural text degeneration," in ICLR. OpenReview.net, 2020.
[32] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for boltzmann machines," Cogn. Sci., vol. 9, no. 1, pp. 147–169, 1985.
[33] V. I. Levenshtein et al., "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet physics doklady, vol. 10, no. 8. Soviet Union, 1966, pp. 707–710.
[34] N. B. Ruparelia, "Software development lifecycle models," ACM SIGSOFT Softw. Eng. Notes, vol. 35, no. 3, pp. 8–13, 2010.
[35] P. Abrahamsson, O. Salo, J. Ronkainen, and J. Warsta, "Agile software development methods: Review and analysis," 2002.
[36] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., "Competition-level code generation with alphacode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
[37] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, et al., "Llama 2: Open foundation and fine-tuned chat models," CoRR, vol. abs/2307.09288, 2023.
[55] Z. Sun, Q. Zhu, Y. Xiong, Y. Sun, L. Mou, and L. Zhang, "Treegen: A tree-based transformer architecture for code generation," in AAAI. AAAI Press, 2020, pp. 8984–8991.
[56] P. Yin and G. Neubig, "TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation," in EMNLP (Demonstration). Association for Computational Linguistics, 2018, pp. 7–12.
[57] V. Raychev, M. T. Vechev, and E. Yahav, "Code completion with statistical language models," in PLDI. ACM, 2014, pp. 419–428.
[58] X. Gu, H. Zhang, D. Zhang, and S. Kim, "Deep API learning," in SIGSOFT FSE. ACM, 2016, pp. 631–642.
R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, [59] ——, “Deepam: Migrate apis with multi-modal sequence to sequence
A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, learning,” in IJCAI. ijcai.org, 2017, pp. 3675–3681.
and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,”
[60] R. Mukherjee, Y. Wen, D. Chaudhari, T. W. Reps, S. Chaudhuri, and
CoRR, vol. abs/2307.09288, 2023.
C. M. Jermaine, “Neural program generation modulo static analysis,” in
[38] OpenAI. (2022) ChatGPT. [Online]. Available: https://openai.com/blog/ NeurIPS, 2021, pp. 18 984–18 996.
chatgpt/ [61] Y. Dong, G. Li, and Z. Jin, “CODEP: grammatical seq2seq model for
[39] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, general-purpose code generation,” in ISSTA. ACM, 2023, pp. 188–198.
C. Chan, W. Chen, J. Yi, W. Zhao, X. Wang, Z. Liu, H. Zheng, J. Chen, [62] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese,
Y. Liu, J. Tang, J. Li, and M. Sun, “Parameter-efficient fine-tuning of and C. Xiong, “Codegen: An open large language model for code with
large-scale pre-trained language models,” Nat. Mac. Intell., vol. 5, no. 3, multi-turn program synthesis,” arXiv preprint arXiv:2203.13474, 2022.
pp. 220–235, 2023.
[63] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language
[40] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, models to self-debug,” CoRR, vol. abs/2304.05128, 2023.
A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
[64] H. Chen, A. Saha, S. C. H. Hoi, and S. Joty, “Personalised distillation:
learning for NLP,” in ICML, ser. Proceedings of Machine Learning
Empowering open-sourced llms with adaptive learning for code gener-
Research, vol. 97. PMLR, 2019, pp. 2790–2799.
ation,” CoRR, vol. abs/2310.18628, 2023.
[41] Z. Hu, L. Wang, Y. Lan, W. Xu, E. Lim, L. Bing, X. Xu, S. Poria,
[65] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin,
and R. K. Lee, “Llm-adapters: An adapter family for parameter-efficient
and D. Jiang, “Wizardcoder: Empowering code large language models
fine-tuning of large language models,” in EMNLP. Association for
with evol-instruct,” CoRR, vol. abs/2306.08568, 2023.
Computational Linguistics, 2023, pp. 5254–5276.
[66] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,
[42] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for
M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y.
parameter-efficient prompt tuning,” in EMNLP (1). Association for
Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro,
Computational Linguistics, 2021, pp. 3045–3059.
O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, L. K. Umapathi,
[43] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “GPT J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. M. V, J. Stillerman,
understands, too,” CoRR, vol. abs/2103.10385, 2021. S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Moustafa-
[44] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas,
for generation,” in ACL/IJCNLP (1). Association for Computational M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding,
Linguistics, 2021, pp. 4582–4597. C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu,
[45] X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, “P-tuning v2: Prompt J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy,
tuning can be comparable to fine-tuning universally across scales and D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf,
tasks,” CoRR, vol. abs/2110.07602, 2021. A. Guha, L. von Werra, and H. de Vries, “Starcoder: may the source be
[46] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre- with you!” CoRR, vol. abs/2305.06161, 2023.
train, prompt, and predict: A systematic survey of prompting methods [67] J. P. Inala, C. Wang, M. Yang, A. Codas, M. Encarnación, S. K.
in natural language processing,” ACM Comput. Surv., vol. 55, no. 9, pp. Lahiri, M. Musuvathi, and J. Gao, “Fault-aware neural code rankers,” in
195:1–195:35, 2023. NeurIPS, 2022.
[47] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, [68] D. Huang, Q. Bu, J. Zhang, X. Xie, J. Chen, and H. Cui, “Bias
L. Li, and Z. Sui, “A survey for in-context learning,” CoRR, vol. assessment and mitigation in llm-based code generation,” arXiv preprint
abs/2301.00234, 2023. arXiv:2309.14345, 2023.
[48] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, [69] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- “Planning with large language models for code generation,” in ICLR.
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, OpenReview.net, 2023.
J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, [70] X. Wei, S. K. Gonugondla, S. Wang, W. U. Ahmad, B. Ray, H. Qian,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, X. Li, V. Kumar, Z. Wang, Y. Tian, Q. Sun, B. Athiwaratkun, M. Shang,
and D. Amodei, “Language models are few-shot learners,” in NeurIPS, M. K. Ramanathan, P. Bhatia, and B. Xiang, “Towards greener yet
2020. powerful code generation via quantization: An empirical study,” in
[49] W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kociský, ESEC/SIGSOFT FSE. ACM, 2023, pp. 224–236.
F. Wang, and A. W. Senior, “Latent predictor networks for code [71] D. Zan, B. Chen, Z. Lin, B. Guan, Y. Wang, and J. Lou, “When language
generation,” in ACL (1). The Association for Computer Linguistics, model meets private library,” in EMNLP (Findings). Association for
2016. Computational Linguistics, 2022, pp. 277–288.
[50] R. Jia and P. Liang, “Data recombination for neural semantic parsing,” [72] Y. Wen, P. Yin, K. Shi, H. Michalewski, S. Chaudhuri, and A. Polozov,
in ACL (1). The Association for Computer Linguistics, 2016. “Grounding data science code generation with input-output specifica-
[51] B. Wei, G. Li, X. Xia, Z. Fu, and Z. Jin, “Code generation as a dual tions,” CoRR, vol. abs/2402.08073, 2024.
task of code summarization,” in NeurIPS, 2019, pp. 6559–6569. [73] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by
[52] M. Rabinovich, M. Stern, and D. Klein, “Abstract syntax networks for chatgpt really correct? rigorous evaluation of large language models for
code generation and semantic parsing,” in ACL (1). Association for code generation,” in NeurIPS, 2023.
Computational Linguistics, 2017, pp. 1139–1149. [74] D. Zan, B. Chen, D. Yang, Z. Lin, M. Kim, B. Guan, Y. Wang, W. Chen,
[53] P. Yin and G. Neubig, “A syntactic neural model for general-purpose and J. Lou, “CERT: continual pre-training on sketches for library-
code generation,” in ACL (1). Association for Computational Linguis- oriented code generation,” in IJCAI. ijcai.org, 2022, pp. 2369–2375.
tics, 2017, pp. 440–450. [75] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix,
[54] Z. Sun, Q. Zhu, L. Mou, Y. Xiong, G. Li, and L. Zhang, “A grammar- B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin,
based structural CNN decoder for code generation,” in AAAI. AAAI E. Grave, and G. Lample, “Llama: Open and efficient foundation
Press, 2019, pp. 7055–7062. language models,” CoRR, vol. abs/2302.13971, 2023.
[76] Y. Dong, J. Ding, X. Jiang, Z. Li, G. Li, and Z. Jin, “Codescore:
Evaluating code generation by learning code execution,” CoRR, vol.
abs/2301.09043, 2023.
[77] G. Orlanski, K. Xiao, X. Garcia, J. Hui, J. Howland, J. Malmaud,
J. Austin, R. Singh, and M. Catasta, “Measuring the impact of pro-
gramming language distribution,” in ICML, ser. Proceedings of Machine
Learning Research, vol. 202. PMLR, 2023, pp. 26 619–26 645.
[78] X. Wang, X. Liu, P. Zhou, Q. Liu, J. Liu, H. Wu, and X. Cui, “Test-driven
multi-task learning with functionally equivalent code transformation for
neural code generation,” in ASE. ACM, 2022, pp. 188:1–188:6.
[79] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin,
D. Pinckney, M. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha,
M. Greenberg, and A. Jangda, “Multipl-e: A scalable and polyglot ap-
proach to benchmarking neural code generation,” IEEE Trans. Software
Eng., vol. 49, no. 7, pp. 3675–3691, 2023.
[80] A. Ni, S. Iyer, D. Radev, V. Stoyanov, W. Yih, S. I. Wang, and
X. V. Lin, “LEVER: learning to verify language-to-code generation with
execution,” in ICML, ser. Proceedings of Machine Learning Research,
vol. 202. PMLR, 2023, pp. 26 106–26 128.
[81] Y. Dong, X. Jiang, H. Liu, G. Li, and Z. Jin, “Generalization or
memorization: Data contamination and trustworthy evaluation for large
language models,” CoRR, vol. abs/2402.15938, 2024.