
Code Generation Tools (Almost) for Free?

A Study of Few-Shot, Pre-Trained Language Models on Code


Patrick Bareiß (University of Stuttgart, Germany)
Beatriz Souza (Federal University of Pernambuco, Brazil)
Marcelo d’Amorim (Federal University of Pernambuco, Brazil)
Michael Pradel (University of Stuttgart, Germany)

arXiv:2206.01335v2 [cs.SE] 12 Jun 2022

ABSTRACT

Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range of code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow obtaining different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par with (test oracle generation), or even outperform (test case generation) their respective traditionally built tool, while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input ("prompt") to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.

1 INTRODUCTION

Various software engineering tools assist developers by generating source code. One group of approaches reasons about existing code and modifies it in a way suitable to achieve some goal. For example, code mutation tools [33, 43] introduce mistakes to measure the effectiveness of test suites, and automated program repair tools [37, 41] suggest how to fix programming mistakes. Another group of approaches generates new code from scratch, given some existing code that the new code is supposed to relate to. For example, test case generators [17, 21, 42] automatically create tests that exercise a given method under test, and code completion tools [14, 27, 47] generate code that completes an existing code snippet in a suitable way. Finally, a third group of code generation tools does not require any existing code as an input, but instead generates new code given some natural language artifact. For example, some approaches generate test oracles based on informal API documentation [10, 11, 22], infer API usage protocols [57], or suggest missing type annotations [39].

The traditional way of creating such code manipulation tools is based on program analysis combined with various rules and heuristics. Program analysis can, at least in principle, ensure that the generated code is guaranteed to have certain properties, e.g., to be type-correct or to pass a given set of test cases. Hand-coded rules and heuristics are typically required to enable a technique to be effective and efficient on real-world software. More recently, learning-based approaches have started to complement traditional program analysis-based code generation tools [44]. Typically, these approaches formulate the specific code generation task as a supervised learning problem and require large amounts of training data to obtain an effective machine learning model. A commonality of both traditional program analyses and learning-based approaches is that creating a new code generation tool involves significant human effort. Even worse, this effort often must be repeated for each new combination of a task to achieve and a programming language to target.

A recent trend in the natural language processing (NLP) community promises a form of "general intelligence" that remedies many of the problems of building task-specific techniques: few-shot learning with large-scale, pre-trained language models [13], henceforth abbreviated as FSLMs. These models are trained on huge amounts of data without focusing on a specific downstream task. Instead, the training is based on generic pseudo-tasks for which it is trivial to obtain sufficient training data, e.g., predicting masked words or whether two sentences belong together. Once trained, FSLMs are effective at various question answering and text generation tasks, e.g., reading comprehension, trivia quizzes, translation between languages, and text completion [13].

Applying FSLMs to code is still a relatively sparsely explored area. While recent work employs pre-training of models of code as a means to reduce the amount of required training examples [3, 20, 24, 38], these approaches still fine-tune a model for a specific purpose and hence require moderately large amounts of labeled training examples. Noteworthy exceptions include GitHub's Copilot code completion system (https://copilot.github.com/), which is based on the Codex FSLM [15], and the recently released, open-source PolyCoder model family [55].


While the results of these models are impressive, code completion is only one of many code generation tasks. Do the abilities of FSLMs generalize to other software engineering tasks that traditionally have been addressed by special-purpose code generation techniques? In case of a positive answer, FSLMs offer the potential to obtain code generation tools (almost) for free, as an FSLM gets trained once and can then be applied to many different tasks. Despite this potential and the strong interest of the software engineering community in automated code generation techniques, there currently is no systematic study of the abilities of FSLMs on such tasks.

This paper presents the first systematic study of FSLMs as the key ingredient for creating code generation tools. We describe a general framework for creating a code generation tool based on an existing FSLM, apply it to three popular tasks that are representative for different kinds of code generation problems, and compare the FSLM-based approach against traditionally developed state-of-the-art tools. Instantiating our framework for a specific code generation task involves three steps. First, develop an extractor of code or natural language information to use in a query to the model. Second, design a suitable prompt, i.e., a template of how to present the input to the model, which then gets instantiated for each given example. Finally, develop a lightweight post-processing module, which, e.g., removes generated code that fails to compile. We argue that these steps are lightweight compared to designing and implementing a traditional program generation technique, as they leave the most challenging parts of the tasks to the FSLM. As a result, the approach offers an almost-for-free way of obtaining a code generation tool.

We instantiate these ideas for three code generation tasks: code mutation, test oracle generation, and test case generation. These tasks have received significant interest from the software engineering community, and hence, offer state-of-the-art tools to compare against. The tasks also cover different levels of granularity of the generated code, ranging from manipulating a few tokens in code mutation to generating entire test cases. Finally, the selected tasks are based on different kinds of input: code mutation and test case generation are based on existing code, whereas test oracle generation is based on natural language documentation. Table 1 shows two representative example outputs that FSLM-based tools produce for each of these tasks. The examples follow the format x |==> y, where x and y denote, respectively, the input and output of the prompt for the given task.

For each task, we instantiate our general framework to create an FSLM-based code generation tool and then apply the tool to real-world software. We then systematically compare the results produced by the FSLM-based tool against an existing, state-of-the-art tool built specifically for the same purpose: the Major [34] code mutation tool, the MeMo [11] test oracle extraction tool, and the Randoop [42] test case generator. We measure the effectiveness of each tool using metrics of success suitable for the task, e.g., code coverage for test case generation, and precision/recall w.r.t. a ground truth for test oracle generation.

Our key findings include:

• FSLM-based tools are similarly effective as, and sometimes even more effective than, existing special-purpose tools. For example, for oracle generation, we measure an F1 score of 0.59 and 0.60 for MeMo [11] and an FSLM-based tool, respectively. For test generation, Randoop achieves 10% coverage, whereas a simple FSLM-based tool achieves 14%.
• FSLM-based and traditionally developed tools often complement each other. For example, our FSLM-based code mutation tool creates various mutants that Major cannot generate. The complementary nature of the two kinds of tools shows the potential of combining traditional and FSLM-based approaches. For example, combining Randoop-generated and FSLM-generated test cases yields 16% coverage, i.e., it exceeds both approaches individually.
• FSLM-based tools do not come completely for free. To be effective, they need specifically designed prompts and suitable inputs extracted from the given code or natural language. Yet, the effort required to create an FSLM-based tool is clearly lower than that for building special-purpose code generation tools from scratch.

In summary, this paper contributes the following:

• The first systematic study of FSLM-based code generation tools.
• We are the first to address code mutation, test oracle generation, and test case generation in an end-to-end manner with general-purpose FSLMs.
• Insights that show the potential and challenges of building FSLM-based code generation tools, providing guidance for future work.

2 BACKGROUND

A generative language model is designed to predict the next token given some previous tokens. For example, if such a model is given the input "I am Barack Obama. I used to be the president of the United States of", such a language model might predict "America" as the next token. This can be used to generate text by repeatedly sampling for the next token. When using such a model for downstream tasks that differ from the next-token prediction objective, the step of initial training is often referred to as pre-training.

A pre-trained model can be adapted to a specific downstream task via fine-tuning, i.e., an additional training step based on labeled data for the downstream task. A recently proposed alternative is few-shot learning [13], which refers to the ability to perform a task without any fine-tuning, but given only very few (typically, between one and ten) examples as part of the query to the model. We utilize generative language models as few-shot learners, which we refer to as few-shot learning with large-scale, pre-trained language models (FSLM). We use OpenAI's Codex [15] model, which is trained on a large set of GitHub projects. We access the model through its API. Alternative generative models exist, e.g., GPT-NeoX [9].

The input provided to an FSLM is referred to as the prompt. Prompts typically contain a few examples of inputs with their desired outputs, followed by the input for which the model should provide an answer. For the above example, a prompt could start by giving a few example pairs of heads of state and the corresponding country, and then "Barack Obama", to which the model might respond with "United States". Prompts are, in principle, unstructured text, and what exactly is provided in a prompt may strongly influence the results. When querying an FSLM with a prompt, the user can select the temperature, which intuitively speaking controls the creativity or randomness of the model's responses. A higher temperature means the model will generate more diverse responses, but be less factual and precise. Repeatedly querying a model may return different results, especially when using a higher temperature.
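To make this concrete, the following is a minimal sketch of a few-shot query to a Codex-style completion endpoint. It is not the authors' implementation: the prompt, the model identifier, and the use of the plain HTTP completions endpoint are illustrative assumptions.

import os
import requests

# Few-shot prompt mirroring the head-of-state example above: two solved
# examples followed by the query the model should complete.
FEW_SHOT_PROMPT = (
    "Angela Merkel -> Germany\n"
    "Emmanuel Macron -> France\n"
    "Barack Obama ->"
)

def complete(prompt, temperature=0.0, max_tokens=16):
    # Query a completions endpoint; the model name "code-davinci-002" is assumed.
    response = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]},
        json={
            "model": "code-davinci-002",
            "prompt": prompt,
            "temperature": temperature,  # higher values: more diverse, less precise output
            "max_tokens": max_tokens,
            "stop": ["\n"],              # stop before the model starts a new example
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"].strip()

print(complete(FEW_SHOT_PROMPT))  # expected completion: "United States"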

Table 1: Representative examples of results obtained with FSLM-based code generation tools.

Code mutation, example #1:
  parsed = (parsed + "000000000").substring(0, 9);
  |==> parsed = (parsed + "000000").substring(0, 9);
Code mutation, example #2:
  ofSeconds(divRem[0].longValue(), divRem[1].intValue())
  |==> ofSeconds(divRem[0].longValue() / divRem[1].intValue(), divRem[1].intValue())

Oracle generation, example #1:
  The returned implementation is immutable, thread-safe and Serializable. It is equivalent to tick(system(zone), Duration.ofSeconds(1)).
  |==> java.time.Clock.tick(java.time.Clock.system(zone), java.time.Duration.ofSeconds(1)) IS EQUIVALENT TO java.time.Clock.tickSeconds(zone)
Oracle generation, example #2:
  Note for Java 7 and later: this method should be treated as deprecated; use the equivalent Long#compare method instead.
  |==> com.google.common.primitives.Longs.compare(a, b) IS EQUIVALENT TO java.lang.Long.compare(a, b)

Test generation, example #1:
  public Vector1D scalarMultiply(double a) { return new Vector1D(a * x); }
  |==> public void testScalarMultiply() { Vector1D v1 = new Vector1D(1); Vector1D v2 = new Vector1D(2); Assert.assertEquals(v1.scalarMultiply(2), v2); }
Test generation, example #2:
  public boolean contains(Object o) { return indexOf(o) >= 0; }
  |==> public void testContains() { ArrayList<String> list = new ArrayList<String>(); list.add("a"); list.add("b"); list.add("c"); Assert.assertTrue(list.contains("b")); Assert.assertFalse(list.contains("d")); }

3 METHODOLOGY

Figure 1 shows a general framework for producing code generation tools for a diverse set of tasks. The framework relies on a large-scale language model pre-trained on code, such as Codex [15]. The input to the framework is a textual representation of a software artifact, e.g., source code or documentation. The output is a set of generated code snippets, e.g., a modified version of the given source code, an executable specification, or a test case. The framework is organized in three main steps, which we briefly describe in the following.

(1) Instance extraction. The first step is responsible for extracting parts of a given software artifact that are relevant for the code generation task. We refer to an extracted part as an instance. For example, for code mutation, the instance extraction takes in source code and extracts code lines for which we want to generate mutants. The rationale for not simply passing in the entire raw software artifact is two-fold. First, FSLMs impose a maximum input size, e.g., 4,096 tokens for the Codex model series. Second, larger inputs take longer to process, i.e., the instance extraction reduces the overall time to generate code.

(2) Prompt design. The second step to use our framework is designing an effective prompt, which is perhaps the most difficult part of creating an FSLM-based code generation tool. The prompt contains (i) an instance, as extracted in the previous step, and (ii) contextual information, such as examples for addressing the code generation task and/or a natural language description of the task. The prompts we use in our study include a part that is invariant across all instances (e.g., a natural language description of the task) and a part that is instance-specific (e.g., the line of code to mutate). Given a prompt for a specific instance, the approach passes the prompt to the FSLM and then obtains a completion of it.

(3) Post-processing. Finally, the third step is to post-process the raw output produced by the model in order to obtain the final code generation results. The post-processing may filter the completions, e.g., to ensure that the generated code compiles, or copy the predicted code into a task-specific code template.

Sections 3.2, 3.3, and 3.4 describe the code generation tasks that this paper focuses on according to the three above steps.
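To make the three steps concrete, here is a minimal sketch of the pipeline for the code mutation task. It is not the authors' implementation; the helper names, the prompt layout, and the complete() function (e.g., as sketched in Section 2) are assumptions.

def extract_instances(java_source):
    # Step 1: for code mutation, an instance is a single (non-empty) line of code.
    return [line for line in java_source.splitlines() if line.strip()]

def build_prompt(task_description, examples, instance):
    # Step 2: combine the invariant part (task description, examples) with the
    # instance-specific part, using the [[Code]]/[[Mutations]] markers of Figure 2.
    return (task_description + "\n" + examples +
            "\n[[Code]]\n" + instance + "\n[[Mutations]]\n")

def run_tool(java_source, task_description, examples, complete, post_process):
    # Steps 1-3 chained together: extract, prompt, query the model, post-process.
    results = []
    for instance in extract_instances(java_source):
        raw = complete(build_prompt(task_description, examples, instance), temperature=0.0)
        results.extend(post_process(raw, instance))
    return results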
3.1 Research Questions

The overall goal of this study is to understand the strengths and limitations of FSLM-based code generation tools. To this end, we investigate the following research questions.

RQ1. Accuracy: How accurate are the model's predictions compared to existing tools?

RQ2. Impact of Prompt: What kinds of prompts are most effective at producing accurate results?

RQ3. Impact of Model Size: How much does the size of the FSLM influence the accuracy?

The motivation for RQ1 is that building traditional tools by hand imposes significant human costs. Understanding to what extent a single general-purpose language model could replace these tools may help reduce the cost of creating new tools. The motivation for RQ2 is that the prompt to query a pre-trained model is the main "knob" to control the quality of the model's predictions. Understanding what prompts are effective (or not) helps in making the best use of the existing models. Finally, the motivation for RQ3 is that state-of-the-art language models are trained on huge datasets using enormous computational resources. Understanding the impact of model size on the model's effectiveness will help appropriately allocate computational resources to train models.

[Figure 1 depicts the overall pipeline: a huge code corpus is used for large-scale training of a pre-trained language model; given source code, documentation, etc., the approach performs (1) instance extraction, (2) prompt construction, and (3) post-processing of the raw completions into the final code generation results.]

Figure 1: Overview of a general framework for generating code analysis tools using few-shot, pre-trained language models.

3.2 Task 1: Code Mutation

We address our research questions by studying them on three popular code generation tasks. The first task is code mutation, a popular technique to assess the quality of a test suite by estimating its ability to detect injected faults. Code mutation modifies a given piece of code by injecting a programming mistake. As a simple example, a code mutation tool may change a comparison x > 5 into x < 5.

3.2.1 Baseline Tool. We study the effectiveness of an FSLM-based code mutation tool by comparing it against Major [34], a popular code mutation tool for Java. Major applies different built-in mutation operators and ensures that all created mutants compile.

3.2.2 Instance Extraction. To create an FSLM-based code mutation tool, the first step is extracting code snippets to modify. Since mutation operators typically are local code transformations, an instance for this task consists of a single line of code. The instance extractor takes a Java file as input and returns a list of lines of code that we then try to mutate via the FSLM. For a fair comparison, we focus our experiments on those lines where Major applies a mutation.

3.2.3 Prompt. To generate mutants via an FSLM, we design a prompt that asks the model to modify one code line at a time. Figure 2 shows the default prompt for our study. The prompt contains a brief natural language description of the task to perform, followed by a short list of examples. To help the model understand the different sections of the prompt, we mark them, e.g., via brackets as in "[[Code]]". Each example consists of the line of code to modify ("[[Code]]") and a few mutants to generate based on it ("[[Mutations]]"). Since mutants are small, we ask the model to suggest multiple mutants at once. Thus, the temperature can be set low for consistent but not as diverse results. At the end, the prompt provides the code line we wish to mutate, leaving the task of completing it with suitable mutants to the model. For the example in Figure 2, the model suggests a mutant that replaces the expression classVal passed as parameter in the call lhsDist.get() with classVal + 1.

Generate mutations for the following snippets of code.
[[Code]]
long biasedExp = (longBits & DoubleConsts.EXP_BIT_MASK) >> (DoubleConsts.SIGNIFICAND_WIDTH - 1);
[[Mutations]]
- longBits & DoubleConsts.EXP_BIT_MASK |==> longBits | DoubleConsts.EXP_BIT_MASK
- longBits & DoubleConsts.EXP_BIT_MASK) >> (DoubleConsts.SIGNIFICAND_WIDTH - 1) |==> longBits & DoubleConsts.EXP_BIT_MASK) << (DoubleConsts.SIGNIFICAND_WIDTH - 1)
- 1 |==> 0
- DoubleConsts.SIGNIFICAND_WIDTH - 1 |==> DoubleConsts.SIGNIFICAND_WIDTH % 1
[[Code]]
...
(3 more examples)
...
[[Code]]
WeightMass mass = lhsDist.get(classVal);
[[Mutations]]
- classVal |==> classVal + 1
- classVal |==> 0

Figure 2: Prompt used for mutant generation. We show the natural language description in purple, instance-independent parts of the prompt (e.g., general examples, separators, etc.) in black, inputs that are specific to a concrete instance (i.e., the part that changes when the instance changes) in blue, and parts that the FSLM model generates in response to that input in green.

3.2.4 Post-processing. Once the model completes the prompt, we post-process the raw completion using simple regular expressions to extract the mutations suggested by the model. Because an FSLM does not guarantee to produce code that is syntactically or semantically valid, we filter out any suggested mutants that do not compile. All remaining code snippets are our final set of mutants generated by the FSLM.

3.2.5 Benchmarks. As code files to mutate, we randomly select 32 classes from the projects listed in Table 2. The smallest class has 19 lines of code, while the longest has 1,346. In total, the classes have 6,747 lines of code. Across the 32 classes, our instance extractor yields 1,194 instances (lines of code) to generate mutants for.

Table 2: Projects used in our study.

Project         Version  # Classes
Colt            1.2.0    297
ElasticSearch   6.1.1    2,821
GWT             2.5.1    3,178
GraphStream     1.3      233
Guava           19.0     464
Hibernate       5.4.2    3,393
JDK             8        4,240
Commons Math    3.6.1    918
Weka            3.8.0    1,648
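As an illustration of the post-processing just described, here is a minimal sketch. The regular expression and the javac-based compile check are assumptions based on the description above, not the authors' code.

import re
import subprocess
import tempfile
from pathlib import Path

# Matches lines of the form "- original |==> replacement" in the raw completion.
MUTATION_LINE = re.compile(r"^- (?P<original>.+?) \|==> (?P<replacement>.+)$")

def parse_mutations(raw_completion):
    # Extract (original, replacement) pairs suggested by the model.
    pairs = []
    for line in raw_completion.splitlines():
        match = MUTATION_LINE.match(line.strip())
        if match:
            pairs.append((match.group("original"), match.group("replacement")))
    return pairs

def compiles(java_source, class_name):
    # Keep only mutants that still compile; here via a plain javac invocation.
    # A real setup would also pass the project classpath to javac.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / (class_name + ".java")
        path.write_text(java_source)
        result = subprocess.run(["javac", "-d", tmp, str(path)], capture_output=True)
        return result.returncode == 0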
3.3 Task 2: Generating Oracles from Natural Language Documentation

As a second code generation task, we consider the problem of generating test oracles from natural language documentation. This task represents a class of tasks where an FSLM translates natural language to code. Specifically, we focus on the task of extracting metamorphic test oracles for API methods from Javadoc comments.

A metamorphic test oracle states that two inputs that are in a specific relationship are expected to lead to outputs that are in another known relationship [16]. In the context of testing API methods, this typically means that some API usage is equivalent (or in some other relationship) to some other API usage. As an example of an oracle we aim to generate, consider this excerpt from the Javadoc of the Arrays.toString method: "The value returned by this method is equal to the value that would be returned by Arrays.asList(a).toString(), unless a is null, in which case "null" is returned.". The equivalence described in this documentation could be specified as an executable test oracle that states that Arrays.toString(a) yields the same as Arrays.asList(a).toString() if a != null.

3.3.1 Baseline Tool. Extracting test oracles from natural language documentation is an active area of research [10, 11, 22]. As a baseline to compare an FSLM-based tool against, we use the recently proposed MeMo [11]. MeMo extracts metamorphic test oracles by first identifying natural language sentences that could contain such oracles using simple heuristics, and then translating those sentences into code. This translation, which is the most intricate part of MeMo, decomposes the sentence using a dependency parser, and then converts the parsed sentence into code based on a set of hard-coded rules and heuristics. Because of the inherent imprecision and diversity of natural language, the second step has to cover many edge cases to be effective. Our study investigates whether an FSLM-based tool could replace or complement this second step of MeMo, i.e., replacing the hard-coded rules by queries to a pre-trained model.

3.3.2 Instance Extraction. In the context of this task, we define an instance to be a method whose description probably contains an oracle. For a fair comparison with the baseline tool, and because extracting such sentences is comparatively simple, we use MeMo to identify sentences that likely contain an oracle. We then pass the entire comment containing such a sentence into our prompt, which provides the FSLM with some context.

3.3.3 Prompt. We design the prompt to be a short list of examples of what the model is supposed to achieve, as shown in Figure 3. Each example consists of four parts: (1) the signature of the method for which to generate an oracle ("### Signature"), (2) the natural language description of the method's behavior, as extracted from the available Javadoc ("### Comment"), (3) a small section of natural language explanation about how the equivalence manifests itself in the example ("### Analysis"); this part is motivated by the observation that letting the model explain its reasoning before generating the result itself may increase its effectiveness [54]; and (4) the Java code of the metamorphic oracle, which consists of a conditional followed by two expressions separated by the symbol <->, denoting "equivalent to" ("### Equivalence"). After providing a small number of such examples (four by default), we provide the signature and comment of the instance we are interested in, and then let the model complete the prompt by providing an analysis and the oracle. For this task, the temperature is set to zero, as we observe the model to produce too imprecise predictions otherwise.

### Signature
public static java.lang.String toString(java.lang.Object[] a)
### Comment
...
The value returned by this method is equal to the value that would be returned by Arrays.asList(a).toString(), unless a is null, in which case "null" is returned.
### Analysis
This method returns the same thing as the expression Arrays.asList(a).toString(), therefore they are equivalent (at least as long as a is not null).
### Equivalence
if (a != null) toString(a) <-> Arrays.asList(a).toString() ;
...
(3 more examples)
...
### Signature
public double norm2(cern.colt.matrix.DoubleMatrix1D x)
### Comment
Returns the two-norm (aka euclidean norm) of vector x; equivalent to mult(x,x).
### Analysis
This method is equivalent to the expression mult(x,x).
### Equivalence
norm2(x) <-> mult(x,x);

Figure 3: The prompt used for oracle extraction.

3.3.4 Post-processing. Given the raw completion produced by the model in response to our prompt, we extract the generated test oracle. The extraction is based on simple regular expressions, e.g., anchored around the special <-> symbol. Next, we check whether the predicted condition (if any) and the code snippets compile properly. Finally, the approach expands names of classes, e.g., Math to java.lang.Math, using JavaSymbolSolver (http://javaparser.org/).

3.3.5 Benchmarks. To measure the effectiveness of the FSLM-based tool, we use a ground truth dataset available from MeMo's artifacts [11]. The dataset is based on 5,000 methods from nine open-source Java projects, from which 299 metamorphic test oracles have been manually extracted. The oracles are diverse and vary in length: the natural language descriptions range between 3 and 500 words, with a mean of 44.3. The code of the oracles ranges between 3 and 81 tokens, with a mean of 21.6.
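The extraction step of this post-processing can be illustrated with a small sketch; the regular expression below is an assumption derived from the prompt format, not the exact one used by the authors, and the class-name expansion via JavaSymbolSolver is omitted.

import re

# Matches the line following "### Equivalence", e.g.
#   if (a != null) toString(a) <-> Arrays.asList(a).toString() ;
EQUIVALENCE = re.compile(
    r"^(?:if\s*\((?P<condition>.+?)\)\s*)?(?P<lhs>.+?)\s*<->\s*(?P<rhs>.+?)\s*;?\s*$"
)

def extract_oracle(raw_completion):
    # Return (condition, lhs, rhs); condition is None for unconditional oracles.
    lines = iter(raw_completion.splitlines())
    for line in lines:
        if line.strip() == "### Equivalence":
            match = EQUIVALENCE.match(next(lines, "").strip())
            if match:
                return match.group("condition"), match.group("lhs"), match.group("rhs")
    return None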
3.4 Task 3: Test Case Generation

As the third code generation task, we consider the problem of generating unit tests. This task represents a class of tasks where the FSLM generates a method, i.e., a larger portion of code compared with the previous examples. Test case generation is a labor-intensive task in software testing [6], and several techniques have been proposed to automate unit test case generation [49].

3.4.1 Baseline Tool. There are many automatic test case generation tools available. Randoop [42] and EvoSuite [21] are popular representatives of such tools. We use Randoop in our study. To generate test cases with Randoop, for each method under test, we invoke its main class randoop.main.Main, passing the gentests command and the --methodlist=filename.txt and --generated-limit=100 arguments. The file filename.txt contains the method under test, as well as helper methods it depends on. We select helper methods with a minimum amount of dependencies to include. The generated-limit argument defines the maximum number of test method candidates generated internally. For a fair comparison, we let Randoop and the FSLM generate the same number (100) of test cases per method under test.
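For reference, such a Randoop invocation looks roughly as follows on the command line; the jar name and project classpath are placeholders, while the command and options are the ones named above:

java -classpath randoop-all.jar:<project-classes> randoop.main.Main gentests --methodlist=filename.txt --generated-limit=100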


3.4.2 Instance Extraction. For unit test case generation, we consider an instance to be a method under test. That is, the instance extractor takes a Java class as its input, produces a list of public methods, and randomly selects a method from the list to be tested.

3.4.3 Prompt. Figure 4 shows an example of the (default) prompt that we use for unit test case generation. The prompt starts with a brief natural language description of the task. Next, we provide one example of the task. The reason for showing only one example is that state-of-the-art FSLMs only support a bounded prompt size. The example consists of three parts: (1) a list of helper methods to assist in the creation of values ("Helper constructors and methods:"), (2) the method under test itself, and (3) a test case that invokes the method under test. After the example, the instance, consisting of the code of the method (as explained in Section 3.4.2 "Instance Extraction"), is provided, leaving the task of generating a test case to the model. Since the prompt contains only a single example, selecting this example potentially has a large impact on the generated test. Section 4.2 compares different strategies for selecting the example, e.g., selecting another method under test from the same class and selecting another method under test at random. Because each query yields only one test case, we make multiple queries while varying the temperature parameter from 0.0 to 0.9, in steps of 0.1. For each temperature, we make 10 queries. This way, the model predicts a total of 100 test case candidates.

Suggest a test for a method with the DoubleArrayList quantiles(DoubleArrayList percentages) signature.
Helper constructors and methods:
DynamicBin1D()
DoubleArrayList()
Method: public synchronized double max() {
  if (!isIncrementalStatValid) updateIncrementalStats();
  return this.max;
}
Test: public static void testMax() {
  double[] temp = new double[2];
  temp[0] = 8.9;
  temp[1] = 1;
  DenseDoubleMatrix1D d1Double = new DenseDoubleMatrix1D(temp);
  hep.aida.bin.DynamicBin1D d1ynamicBin = cern.colt.matrix.doublealgo.Statistic.bin(d1Double);
  double max = d1ynamicBin.max();
  System.out.println("max=" + max);
}
—
Method: public DoubleArrayList quantiles(DoubleArrayList percentages) {
  return Descriptive.quantiles(sortedElements_unsafe(), percentages);
}
Test: public static void testQuantiles() {
  double[] temp = new double[2];
  temp[0] = 8.9;
  temp[1] = 1;
  DenseDoubleMatrix1D d1Double = new DenseDoubleMatrix1D(temp);
  hep.aida.bin.DynamicBin1D d1ynamicBin = cern.colt.matrix.doublealgo.Statistic.bin(d1Double);
  DoubleArrayList quantiles = d1ynamicBin.quantiles(new DoubleArrayList(new double[] {0.5, 0.75}));
  System.out.println("quantiles=" + quantiles);
}

Figure 4: Test generation prompt for the method DoubleArrayList quantiles(DoubleArrayList percentages), declared in the class DynamicBin1D from project Colt.

3.4.4 Post-processing. To post-process the raw completions of the model, we inject each test case candidate into a template of a test class, which contains the necessary scaffolding to yield an executable test case. Similar to the previous tasks, we discard candidates that do not compile. We also remove any duplicates that may result from querying the model multiple times.
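The querying and post-processing just described can be sketched as follows; the test-class template, the compiles() check, and the complete() helper are assumptions, not the authors' implementation.

def generate_tests(prompt, complete, compiles):
    # Query the model at temperatures 0.0, 0.1, ..., 0.9, ten times each (100 candidates),
    # wrap each candidate in a test-class template, and keep compilable, de-duplicated tests.
    candidates = []
    for step in range(10):
        temperature = step / 10.0
        for _ in range(10):
            candidates.append(complete(prompt, temperature=temperature).strip())
    tests, seen = [], set()
    for candidate in candidates:
        if candidate in seen:
            continue  # drop duplicates across queries
        seen.add(candidate)
        source = ("import org.junit.Assert;\nimport java.util.*;\n\n"
                  "public class GeneratedTest {\n" + candidate + "\n}\n")
        if compiles(source, "GeneratedTest"):
            tests.append(source)
    return tests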
3.4.5 Benchmarks. As methods under test we use the 18 methods that Table 5 shows. We select them by randomly identifying two public methods from each of the 9 projects in Table 2.

4 RESULTS

This section presents answers to the three research questions posed in Section 3.1. Section 5 discusses the results and their broader impact.

4.1 RQ1: Accuracy

This section presents results on the accuracy of FSLM-based code generation compared to traditionally built tools.

4.1.1 Code Mutation. Table 3 summarizes our results for the code mutation task. Given the 1,194 instances extracted from 32 classes (Section 3.2.5), our FSLM-based tool generates a total of 2,721 mutants, whereas the baseline Major tool generates 2,810 mutants. Because the model does not guarantee to generate valid code, only 62.5% of the FSLM-generated mutants are compilable, giving a total of 1,701 usable mutants. On average, our tool changes 3.97 tokens of the original code, which roughly equals the 4.28 tokens changed by Major. Besides the raw amount of mutants generated, it is also important to understand whether the generated mutants are useful. We address this question both quantitatively and qualitatively. As a quantitative answer, we compute how many of the FSLM-generated mutants exactly match one of the Major-generated mutants. We observe an overlap of around 18% of the FSLM-generated mutants. Under the assumption that Major-generated mutants are useful, this means that at least 18% of the FSLM-generated mutants are also useful.

As a qualitative answer, we manually inspect a random sample of 30 of the compilable mutants our tool generates. For each sampled mutant, we carefully inspect the code and determine whether the mutation changes the runtime behavior of the code, as opposed to being an equivalent mutant that preserves the semantics of the original code. The inspection shows that 90% of the mutants certainly change the behavior, whereas the remaining 10% either preserve the semantics or we could not clearly determine their effects.

To better understand the mutants generated by the model and Major, we automatically classify them based on the kind of code transformation. We distinguish four classes, as shown in the four right-most columns of Table 3: (i) deleting a statement, (ii) replacing one operator with another, (iii) replacing one value with another, and (iv) some other transformation. The table shows that the distribution of mutants that the FSLM and Major generate clearly differs:

Table 3: Mutants generated by our FSLM-based tool and by Major [34].

                     Generated mutants                          Kind of transformation
                     Total   Overlap w/ Major  Compilable       Delete statement  Replace operator  Replace value  Other
FSLM                 2,721   18.4%             62.5%            31.8%             9.0%              34.7%          24.5%
FSLM (NL-only)       0       -                 -                -                 -                 -              -
FSLM (Ex-only)       2,645   17.7%             57.6%            35.2%             7.0%              36.3%          21.6%
FSLM (Bad-ex)        2,595   19.0%             64.5%            29.4%             9.1%              37.1%          24.4%
FSLM (Small model)   2,487   15.4%             53.8%            30.0%             4.2%              33.1%          32.8%
Major                2,810   100.0%            100.0%           4.9%              48.6%             35.8%          0.0%

While Major mostly replaces operators and values, the model generates a more diverse set of mutants, suggesting that the two tools complement each other.

Finally, we manually study another random sample of 30 mutants produced by each tool to get qualitative insights into the differences between the two tools. We make two interesting observations:

• The FSLM model generates mutants that Major cannot generate based on its built-in mutation operators [34]. For example, these FSLM-generated mutants include adding a constant to an integer (e.g., turning nanos into nanos + 1) and changing methods to semantically similar ones (e.g., turning Math.min into Math.max).
• A relatively large number of the FSLM-generated mutants (7/30 = 23%) replace an expression with null. While this yields mutants that change the semantics, the high number is still surprising.

Overall, these results show that our FSLM-based tool, while not generating exactly the same mutants as an existing tool, nonetheless creates a large number of useful mutants with minimal effort.

4.1.2 Generating Oracles from Natural Language Documentation. This section compares the accuracy of (metamorphic) test oracle generators, namely, the state-of-the-art MeMo [11] and its FSLM-based counterpart. To measure accuracy, we compare all generated oracles against a ground truth consisting of 299 test oracles that we wish to extract from the documentation of methods from the projects listed in Table 2. Specifically, we measure precision (Pr) and recall (Re) as follows:

Pr = (# of correctly generated oracles) / (# of all generated oracles)
Re = (# of correctly generated oracles) / (# of all ground truth oracles)

In addition, we report the F1-score, defined as the harmonic mean of precision and recall.

Table 4: Effectiveness of test oracle generation.

                  FSLM                         MeMo [11]
Project           Precision   Recall   F1      Precision   Recall   F1
JDK               0.84        0.54     0.66    0.50        0.55     0.52
Colt              0.65        0.42     0.51    0.61        0.42     0.50
Guava             0.88        0.52     0.65    0.83        0.67     0.74
GWT               0.79        0.27     0.40    0.65        0.27     0.38
Graphstream       1.00        0.70     0.82    1.00        0.70     0.82
Apache commons    0.79        0.58     0.67    0.66        0.82     0.73
Hibernate         0.00        0.00     0.00    0.00        0.00     0.00
Elasticsearch     0.00        0.00     0.00    0.00        0.00     0.00
Weka              0.50        0.50     0.50    0.43        0.50     0.46
Total             0.82        0.47     0.60    0.64        0.54     0.59
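As a quick sanity check of how these metrics combine, consider the FSLM totals from Table 4 (the numbers are the paper's own): with Pr = 0.82 and Re = 0.47, F1 = 2 · Pr · Re / (Pr + Re) = 2 · 0.82 · 0.47 / (0.82 + 0.47) = 0.7708 / 1.29 ≈ 0.60, matching the "Total" row.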
Table 4 shows the results for each of the studied libraries. Across all projects, the FSLM-based oracle generator achieves an F1-score of 0.60, which slightly outperforms MeMo's F1-score of 0.59. Comparing precision and recall shows that the model tends to generate oracles much more precisely, with a precision of 0.82 instead of MeMo's precision of 0.64.

To understand the strengths and weaknesses of the two approaches, we manually study some of the oracles. On the one hand, we inspect those oracles that the model predicts correctly while MeMo misses them, which are nine oracles in total. Three of the nine oracles are cases where there exist multiple oracles for a single method, and the model discovers one, whereas MeMo discovers the other. This highlights a limitation of our prompt design, which enables the model to predict only one oracle per method. One could remedy this limitation by prompting for a list of oracles, similar to the prompt for code mutations, or by querying the model multiple times with a higher temperature, similar to what we do for test generation. The remaining six oracles are all related to MeMo incorrectly capturing longer or nested pieces of code. For example, the documentation "Calling getSource() is equivalent to calling getSnapshot().getSource()" is translated by MeMo to an equivalence between getSource() and getSnapshot(), which is incorrect. In contrast, the model correctly predicts the equivalence between getSource() and getSnapshot().getSource().

On the other hand, we also inspect the six instances where the model misses an oracle that MeMo can predict. For two of these oracles, the model "invents" code seemingly out of thin air. For example, the documentation "This is equivalent to, but not necessarily implemented as, !(Float.isInfinite(value) || Float.isNaN(value))." leads to an incorrect prediction of the model saying that com.google.common.primitives.Float.isFinite and java.lang.Float.isFinite are equivalent.

Overall, the FSLM-based oracle generator achieves results that are on par with, and even slightly better than, those of a state-of-the-art tool based on a set of hard-coded rules and heuristics.

4.1.3 Test Case Generation. Table 5 summarizes the results of generating test cases with our FSLM-based approach and with Randoop [2] on 18 methods. The table reports the number of compilable tests (column "CT"), the average size of tests in number of lines of code (column "TS"), and the line coverage that the tests achieve (column "LC"). We measure coverage using JaCoCo [1].
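For context, JaCoCo line coverage is typically collected by attaching the JaCoCo Java agent while the tests run and then turning the recorded execution data into a report; a rough sketch (jar names and paths are placeholders):

java -javaagent:jacocoagent.jar=destfile=jacoco.exec -classpath <tests-and-project-classes> org.junit.runner.JUnitCore GeneratedTest
java -jar jacococli.jar report jacoco.exec --classfiles <project-classes> --sourcefiles <project-sources> --csv coverage.csv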

Table 5: Analysis of the test cases generated by our FSLM-based test generator and Randoop for the considered methods. The table presents (1) the number of compilable test cases (CT), (2) the average test size (TS) of the generated tests, and (3) the line coverage (LC) achieved by them. Each row lists, per project and method, CT, TS, and LC for the FSLM, then CT, TS, and LC for Randoop, and finally the combined LC.

Project  Method  [FSLM: CT  TS  LC]  [Randoop: CT  TS  LC]  [Combined: LC]
Colt DoubleArrayList quantiles(DoubleArrayList percentages) 12 11 29% 29 15 26% 39%
Colt double moment(int k, double c) 7 13 17% 49 16 15% 19%
ElasticSearch String parent() 4 4 8% 1 202 5% 8%
ElasticSearch IndexRequest source(Map source, XContentType contentType) 8 10 13% 66 35 5% 13%
GWT boolean isClient() 3 7 6% 1 7 6% 6%
GWT UncaughtExceptionHandler getUncaughtExceptionHandler() 0 0 0% 1 7 6% 6%
Graphstream boolean contains(Edge edge) 1 12 12% 49 85 9% 13%
Graphstream boolean equals(Path p) 42 17 31% 44 112 11% 31%
Guava HashCode fromLong(long hash) 6 8 12% 28 12 7% 12%
Guava int writeBytesTo(byte[] dest, int offset, int maxLength) 37 10 32% 50 23 17% 34%
Hibernate Short toShort(Boolean value) 3 4 24% 3 6 20% 24%
Hibernate Boolean fromString(String string) 17 7 22% 35 19 22% 22%
JDK Object remove(int index) 13 15 5% 70 19 3% 5%
JDK boolean contains(Object o) 7 9 3% 44 26 1% 3%
Math Vector1D scalarMultiply(double a) 28 7 30% 58 10 24% 32%
Math double distanceSq(Vector1D p1, Vector1D p2) 9 6 23% 51 17 18% 24%
Weka AlgVector add(AlgVector other) 5 10 28% 47 35 17% 28%
Weka Instance getAsInstance(Instances model, Random random) 0 0 0% 56 15 8% 8%
Total 202 11 14% 682 31 10% 16%

We notice from these results that, overall, the model achieves higher code coverage than Randoop (14% vs. 10%). This result is particularly remarkable as Randoop generates more than three times the number of tests the model generates (682 vs. 202 tests). Moreover, on average, the tests generated by the model are much smaller than the tests generated by Randoop (11 vs. 31 lines).

On a closer analysis of the tests generated by each approach for each of the 18 methods, we can see that Randoop successfully generates tests for all 18 methods under test. In contrast, the model successfully generates tests for only 16 of them. More specifically, (i) for 14 methods, the tests generated by the model achieve higher coverage than the tests generated by Randoop; (ii) for two methods, the tests generated by both approaches achieve the same coverage; (iii) for two methods, the tests generated by Randoop achieve higher coverage than the tests generated by the model. These are exactly the two methods for which the model fails to generate any compilable tests.

These results provide initial evidence indicating that FSLM-based tools can outperform state-of-the-art test generation tools. We also calculate the coverage achieved by combining the tests generated by both approaches. The results can be seen in the last column of Table 5. Interestingly, the coverage achieved by the combination of the tests (16%) is superior to the coverage achieved by the tests of each approach individually. As an example, the coverage achieved by the combination of the tests is considerably higher when considering the quantiles method of the Colt project. In this case, individually, the tests generated by the model achieve 29% line coverage and the tests generated by Randoop achieve 26% line coverage. Combined, the tests generated by both approaches achieve 39% line coverage.

Summary of RQ1: FSLM-based tools perform surprisingly well across the different tasks, being on par with or complementary to, and for test generation even better than, the handcrafted tools we compare against.

4.2 RQ2: Impact of Prompt

By default, our prompts contain both natural language task descriptions and input-output examples. This section reports on the impact of using different prompt variants. For each of the tasks, we consider the following prompt variants: only a natural language description (NL-only); only input-output examples (Ex-only); poorly chosen examples (Bad-ex).

4.2.1 Code Mutation. For mutant generation, NL-only means the prompt includes only the natural language text at the top of Figure 2, Ex-only means we keep everything but the NL description, and Bad-ex means we include additional examples where our FSLM-based tool should not generate mutants. For example, we add an import statement as an example, but leave the mutants section empty. The idea is to test how robust the model is to adversarial or poorly chosen examples.

The middle rows in Table 3 show the results obtained with these variants. Running NL-only does not produce promising results, since it is missing the guiding output format from the examples. We attempt to "fix" the prompt by including more detailed descriptions of how to format the output (i.e., we add "Return the result in the format original |==> replacement as part of a list numbered using '-'." to the prompt), but the output format remains inconsistent, giving no results. This means examples play a large part in solving this task with an FSLM. Looking at the results for Ex-only reveals that fewer of the generated mutants compile, with a margin of 5%. This is interesting, as the textual description is only a single sentence in this task, and it shows an easy way to improve performance over using a prompt without it. Moreover, we observe the following behavior for the Bad-ex variant of the prompt: the overlap with Major and the percentage of mutants that compile are actually slightly higher than for our default approach. This is surprising in that a deliberate attempt to worsen the predictions instead slightly improves the results.

4.2.2 Generating Oracles from Natural Language Documentation. For oracle generation, NL-only means we only add a natural language description of the task and some information about formatting (e.g., "Extract equivalent pieces of code from the following comments. Format the result as Code snippet A <-> Code snippet B."). For Ex-only we remove the part of the prompt that describes the task in NL (see the purple text in Figure 3). This is different from the style employed for mutant generation though, as in the oracle extraction prompt the natural language description is part of each example and not just a general task description. For Bad-ex, we once again add examples designed to throw off the FSLM by including examples that the model should not generate anything for. For example, we add a method with the comment "Returns the largest value of the given array." and leave the oracle section empty.

Figure 5 shows results of the FSLM for the oracle generation task when using different prompt variations. The accuracy is not significantly affected by the different styles of prompt used, except for NL-only. As for mutant generation, NL-only yields incorrectly formatted responses, giving no usable results. Again, examples appear necessary to be able to successfully use FSLMs for this task. Considering the prompt variant where we remove the NL description, Ex-only, we observe that the difference in performance is negligible compared to the default prompt, indicating that the helping text is not as important as it was for mutation generation. Considering the prompt variant Bad-ex, we observe that the use of bad examples performs worse compared to other types of prompts. This indicates that the quality of examples for this task is more important than for mutant generation.

[Figure 5 shows a bar chart comparing precision, recall, and F-score for MeMo and the FSLM variants (default, NL-only, Ex-only, Bad-ex, Small model).]

Figure 5: Oracle generation results. Comparison of prompt variations.

A likely explanation for this discrepancy across the tasks is that the way natural language descriptions are used in the second task differs from how it is used in the other two tasks (Section 3.3.3). Consequently, to more uniformly compare the tasks, we also run an experiment with a prompt where the natural language description is in the form of a task description. This prompt yields an F1 score of 0.54, i.e., substantially worse than our default prompt. These results suggest that the quality of the examples is relatively more important than the NL description for this task.

4.2.3 Test Case Generation. In this task, the method under test and a list of helper constructors and methods are always provided in the prompt. Therefore, for NL-only we remove the input-output example, for Ex-only we remove the natural language description, and for Bad-ex we provide a different input-output example, which we randomly select from a different project.

Table 6 reports the results of these experiments. Overall, we can see that, regarding line coverage, the default prompt achieves the highest line coverage of 14%, followed by variation NL-only with 12% coverage, then variation Ex-only with 9% coverage, and finally variation Bad-ex with only 8% coverage. These results indicate that a natural language description can be even more important than an input-output example for test generation (12% vs. 9%). Moreover, an input-output example more related to the method under test, i.e., from the same class in our case, can add more value than a random example unrelated to the method under test (14% vs. 8%).

Table 6: Line coverage achieved by the tests generated by our FSLM-based test generator with different prompts and a smaller model.

Variant of the approach              Line coverage
FSLM                                 14%
FSLM w/o example (NL-only)           12%
FSLM w/o NL descr. (Ex-only)         9%
FSLM w/ random example (Bad-ex)      8%
FSLM w/ small model                  10%

Summary of RQ2: Adding a brief natural language description is an easy way to help (or at least not hurt) the FSLM-based tools. Furthermore, we find that providing suitable examples is crucial for the model to make effective predictions.

4.3 RQ3: Impact of Model Size

Training larger language models on more data often results in performance improvements for downstream tasks [13]. By default, we use the "Davinci" model of Codex, which currently is the largest model offered via the OpenAI API. Since larger models come with a hefty computational price [28], we also measure the impact of using a smaller model. To this end, we repeat our experiments with the "Cushman" model of Codex, which is a derivative of a small model trained by Chen et al. [15].

4.3.1 Code Mutation. The "FSLM (Small model)" row of Table 3 shows the impact of using a smaller model on code mutation. Several metrics of success clearly drop, e.g., the total number of generated mutants (from 2,721 to 2,487) and the percentage of mutants that compile (from 62.5% to 53.8%). These results show that using a larger model is beneficial for this task.

that compile (from 62.5% to 52.8%). These results show that using a 6 RELATED WORK
larger model is beneficial for this task. Studies of neural models of code. As neural models of code be-
4.3.2 Generating Oracles from Natural Language Documentation. come more popular for a diverse set of tasks, many, similar to us,
When running the test oracle generation using a smaller model, we have begun investigating the details of these models. This comes in
discover, surprisingly, that the results we obtain are nearly identical multiple forms, such as evaluating a series of similar models [55] or
to the larger model with an F1 score of 0.58 (as compared to 0.6). models with the same architecture but differing size [15]. Another
Hence, it seems some tasks can be handled by smaller models almost approach is to apply a model of code to multiple downstream tasks
as well as with larger models. and compare its performance, e.g., by fine-tuning a transformer
4.3.3 Test Case Generation. For test case generation, we observe a significant drop in effectiveness when using the smaller model (Table 6). The line coverage drops to 10%, i.e., four percentage points less than with the larger model and about the same as with the tests generated by Randoop.

Summary of RQ3: Increasing the model size improves effectiveness, or at least does not negatively affect it, for all three code generation tasks. For one of the three tasks (oracle generation), the effect is small, though. Given the computational cost of large models, carefully selecting the model for each task is recommended.

5 DISCUSSION

Prompt design. Designing "good" prompts is central to the creation of FSLM-based tools. When answering RQ2, we observe that examples are very important in prompt design and that natural language descriptions are often helpful. There are, however, questions that remain to be evaluated, including (i) how to mine good examples to create prompts, (ii) whether alternating through examples is useful when the user queries the model multiple times, and (iii) how sensitive the prompts are to the data format.
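To make the role of examples concrete, the sketch below shows one way a few-shot prompt for the mutant generation task could be assembled and sampled. The prompt layout, the example pairs, and the query_model helper are illustrative assumptions, not the exact format used by our tools.

```python
# Illustrative sketch of few-shot prompt construction for mutant generation.
# EXAMPLES, the prompt layout, and query_model are hypothetical placeholders.

EXAMPLES = [
    ("return a + b;", "return a - b;"),
    ("if (x > 0) { y = 1; }", "if (x >= 0) { y = 1; }"),
]

TASK = "// Task: mutate the Java code so that it still compiles but behaves differently.\n"

def build_prompt(target_code: str) -> str:
    """Concatenate a task description, a few original/mutated pairs, and the query."""
    parts = [TASK]
    for original, mutated in EXAMPLES:
        parts.append(f"// Original:\n{original}\n// Mutated:\n{mutated}\n")
    parts.append(f"// Original:\n{target_code}\n// Mutated:\n")
    return "\n".join(parts)

def generate_mutants(target_code: str, query_model, num_samples: int = 5) -> list[str]:
    """Sample several completions and keep distinct, non-empty candidates."""
    prompt = build_prompt(target_code)
    completions = [query_model(prompt) for _ in range(num_samples)]
    return sorted({c.strip() for c in completions if c.strip()})
```

Varying which pairs appear in EXAMPLES, their order, and the natural language task description corresponds directly to the open questions (i)-(iii) above.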
Model size. Training large-scale models of code may easily cost hundreds of thousands, or even millions, of dollars [28]. Additionally, these large-scale models are hard to use due to their sheer size, or they are not open to the public in the first place. For our work, we find these models to be effective, but obtaining the same results with an improved, smaller, open model would make the tools more accessible in the long run.

Integrating FSLM-based and traditional tools. The combination of the low effort required to create new code generation tools and the promising results we obtain indicates that integrating FSLM-based tools with existing tools can be helpful. For example, the results for the oracle generation task (Table 4) show different precision-recall tradeoffs for the two tools. Blending FSLM-based and traditional techniques therefore seems a promising direction to explore in the future.
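As a hypothetical illustration of such a blend, the sketch below merges oracle candidates produced by an FSLM-based tool and a traditional tool, prioritizing candidates on which both tools agree; the function names and the filtering strategy are assumptions for illustration, not the implementation evaluated in this paper.

```python
# Hypothetical sketch of combining oracle candidates from two tools.
# blend_oracles and the caller-supplied `compiles` check are illustrative only.

def blend_oracles(fslm_oracles: list[str], traditional_oracles: list[str],
                  compiles) -> list[str]:
    """Rank candidates: agreement first (high precision), then FSLM-only
    candidates that pass a compile check, then the traditional tool's rest."""
    fslm = {o.strip() for o in fslm_oracles if o.strip()}
    trad = {o.strip() for o in traditional_oracles if o.strip()}

    agreed = sorted(fslm & trad)
    fslm_only = sorted(o for o in fslm - trad if compiles(o))
    trad_only = sorted(trad - fslm)

    return agreed + fslm_only + trad_only
```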
Threats to Validity. We do not compare our results across different models (except by size), potentially limiting the generalizability of our findings. While we try to evaluate on a diverse set of tasks, there are obviously many more code generation tasks not studied here. The fact that the FSLM-based approach is able to provide promising results on the three tasks we study gives at least some indication of its potential for other tasks. Finally, we only evaluated Java-based tools, i.e., our results might not generalize beyond this language. Prior research, however, shows that large-scale models perform well across many different languages [15].

6 RELATED WORK

Studies of neural models of code. As neural models of code become more popular for a diverse set of tasks, many researchers have, similar to us, begun investigating the details of these models. This comes in multiple forms, such as evaluating a series of similar models [55] or models with the same architecture but differing sizes [15]. Another approach is to apply a model of code to multiple downstream tasks and compare its performance, e.g., by fine-tuning a transformer model to perform tasks similar to the ones we explore in our research [40]. What sets this paper apart is that (1) we investigate few-shot learning, which requires less training data than fine-tuning, (2) we compare against commonly used traditional tools, while others compare neural approaches against each other, and (3) we target a different set of tasks.

Language models in software engineering. Degiovanni and Papadakis [18] use a pre-trained language model for mutation testing by masking one token at a time and asking the model to predict an alternative, which is then considered a mutation. Instead, we study using a generative model for end-to-end mutant generation, which often changes multiple tokens at a time. Several papers [7, 15] study language model-based code generation from short natural language descriptions. In contrast to our work, they offer no comparison to traditional tools and focus only on this single task. Jain et al. [32] use generative language models for program synthesis given a natural language description of the desired functionality and some code examples that are likely similar to the expected code. They propose a "context bank" of examples to provide in the prompt, an idea one could also adapt for our tasks.

Generative language models in general. Since the introduction of Transformers [51], generative language modeling has seen huge progress. Large models, such as GPT-2 [46], have shown that generative language models perform well across different tasks when fine-tuned or in a few-shot setting [13, 50]. Predictions of future performance suggest that these models have the potential to improve their abilities even further [31, 35]. While these models are evaluated on various tasks, we are not aware of any other systematic study of few-shot models on different code generation tasks.

Neural software analysis. Our study is part of a larger stream of work on neural models of software [44]. An important question is how to embed code into a vector representation. Several approaches have been proposed, e.g., based on AST paths [5], control flow graphs [52], ASTs [56], and a combination of token sequences and a graph representation of code [29]. The general-purpose generative model used here does not explicitly embed code into a vector representation, but instead relies on the ability of transformers [51] to reason about long-range dependencies. Neural models of code address a wide range of problems, e.g., code completion [4, 8, 36], type prediction [26, 39, 45, 53], program repair [19, 25], code search [23, 48], and making predictions about code changes [12, 30]. All of the above approaches address a specific problem with a model designed for this problem. Instead, our work studies how successful a general-purpose model is at competing with non-neural code manipulation tools.
7 CONCLUSIONS

This paper studies the strengths and limitations of few-shot, pre-trained language models for three popular code generation tasks. By systematically comparing the recently proposed Codex model [15] against three traditionally built tools, we find that our model-based tools complement, are on par with, or even exceed the baseline tools. At the same time, creating a new FSLM-based tool based on our methodology is relatively simple. While our study shows promising results, we believe these are only first steps in applying few-shot learning to software engineering problems.
[18] Renzo Degiovanni and Mike Papadakis. 2022. 𝜇 BERT: Mutation Testing using
REFERENCES Pre-Trained Language Models. CoRR abs/2203.03289 (2022). https://doi.org/10.
[1] 2022. JaCoCo Java Code Coverage Library (2022). https://www.jacoco.org/jacoco/ 48550/arXiv.2203.03289 arXiv:2203.03289
Accessed: 2022-05-02. [19] Elizabeth Dinella, Hanjun Dai, Ziyang Li, Mayur Naik, Le Song, and Ke Wang.
[2] 2022. Randoop: Automatic unit test generation for Java (2022). https://randoop. 2020. Hoppity: Learning Graph Transformations to Detect and Fix Bugs in
github.io/randoop/ Accessed: 2022-05-02. Programs. In 8th International Conference on Learning Representations, ICLR 2020,
[3] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.
2021. Unified Pre-training for Program Understanding and Generation. CoRR net/forum?id=SJeqs6EFvB
abs/2103.06333 (2021). arXiv:2103.06333 https://arxiv.org/abs/2103.06333 [20] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong,
[4] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT:
Sequences from Structured Representations of Code. In 7th International Con- A Pre-Trained Model for Programming and Natural Languages. In Findings of
ference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20
2019. https://openreview.net/forum?id=H1gKYo09tX November 2020 (Findings of ACL, Vol. EMNLP 2020), Trevor Cohn, Yulan He, and
[5] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: learning Yang Liu (Eds.). Association for Computational Linguistics, 1536–1547. https:
distributed representations of code. Proc. ACM Program. Lang. 3, POPL (2019), //doi.org/10.18653/v1/2020.findings-emnlp.139
40:1–40:29. https://doi.org/10.1145/3290353 [21] Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: automatic test suite generation
[6] Saswat Anand, Edmund K. Burke, Tsong Yueh Chen, John Clark, Myra B. Cohen, for object-oriented software. In SIGSOFT/FSE’11 19th ACM SIGSOFT Symposium
Wolfgang Grieskamp, Mark Harman, Mary Jean Harrold, Phil McMinn, Antonia on the Foundations of Software Engineering (FSE-19) and ESEC’11: 13th European
Bertolino, J. Jenny Li, and Hong Zhu. 2013. An orchestrated survey of methodolo- Software Engineering Conference (ESEC-13), Szeged, Hungary, September 5-9, 2011.
gies for automated software test case generation. Journal of Systems and Software 416–419.
86, 8 (2013), 1978–2001. https://doi.org/10.1016/j.jss.2013.02.061 [22] Alberto Goffi, Alessandra Gorla, Michael D. Ernst, and Mauro Pezzè. 2016. Au-
[7] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk tomatic generation of oracles for exceptional behaviors. In Proceedings of the
Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, 25th International Symposium on Software Testing and Analysis, ISSTA 2016, Saar-
and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRR brücken, Germany, July 18-20, 2016, Andreas Zeller and Abhik Roychoudhury
abs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732 (Eds.). ACM, 213–224. https://doi.org/10.1145/2931037.2931061
[8] Gareth Ari Aye and Gail E. Kaiser. 2020. Sequence Model Design for Code [23] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In
Completion in the Modern IDE. CoRR abs/2004.05249 (2020). arXiv:2004.05249 Proceedings of the 40th International Conference on Software Engineering, ICSE 2018,
https://arxiv.org/abs/2004.05249 Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic,
[9] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Marsha Chechik, and Mark Harman (Eds.). ACM, 933–944. https://doi.org/10.
Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. 2022. 1145/3180155.3180167
GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv preprint [24] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long
arXiv:2204.06745 (2022). Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun
[10] Arianna Blasi, Alberto Goffi, Konstantin Kuznetsov, Alessandra Gorla, Michael D. Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang,
Ernst, Mauro Pezzè, and Sergio Delgado Castellanos. 2018. Translating code and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with
comments to procedure specifications. In Proceedings of the 27th ACM SIGSOFT Data Flow. In 9th International Conference on Learning Representations, ICLR 2021,
International Symposium on Software Testing and Analysis, ISSTA 2018, Amsterdam, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/
The Netherlands, July 16-21, 2018, Frank Tip and Eric Bodden (Eds.). ACM, 242–253. forum?id=jLoC4ez43PZ
https://doi.org/10.1145/3213846.3213872 [25] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish K. Shevade. 2017. DeepFix:
[11] Arianna Blasi, Alessandra Gorla, Michael D. Ernst, Mauro Pezzè, and Antonio Fixing Common C Language Errors by Deep Learning. In Proceedings of the Thirty-
Carzaniga. 2021. MeMo: Automatically identifying metamorphic relations in First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco,
Javadoc comments for test automation. Journal of Systems and Software (2021). California, USA, Satinder P. Singh and Shaul Markovitch (Eds.). AAAI Press,
[12] Shaked Brody, Uri Alon, and Eran Yahav. 2020. A Structural Model for Contextual 1345–1351. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14603
Code Changes. In OOPSLA. [26] Vincent J. Hellendoorn, Christian Bird, Earl T. Barr, and Miltiadis Allamanis. 2018.
[13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Deep learning type inference. In Proceedings of the 2018 ACM Joint Meeting on
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda European Software Engineering Conference and Symposium on the Foundations of
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, No-
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, vember 04-09, 2018, Gary T. Leavens, Alessandro Garcia, and Corina S. Pasareanu
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin (Eds.). ACM, 152–162. https://doi.org/10.1145/3236024.3236051
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya [27] Vincent J. Hellendoorn, Sebastian Proksch, Harald C. Gall, and Alberto Bacchelli.
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. 2019. When Code Completion Fails: a Case Study on Real-World Completions.
In Advances in Neural Information Processing Systems 33: Annual Conference on In ICSE.
Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, [28] Vincent J. Hellendoorn and Anand Ashok Sawant. 2022. The growing cost
virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina of deep learning for source code. Commun. ACM 65, 1 (2022), 31–33. https:
Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/ //doi.org/10.1145/3501261
hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html [29] Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and
[14] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples David Bieber. 2020. Global Relational Models of Source Code. In 8th International
to improve code completion systems. In European Software Engineering Conference Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April
and International Symposium on Foundations of Software Engineering (ESEC/FSE). 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=B1lnbRNtwr
ACM, 213–222. [30] Thong Hoang, Hong Jin Kang, David Lo, and Julia Lawall. 2020. CC2Vec: Dis-
[15] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de tributed Representations of Code Changes. In ICSE.
Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, [31] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya,
Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes
Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Welbl, Aidan Clark, et al. 2022. Training Compute-Optimal Large Language
[32] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2022. Jigsaw: Large Language Models meet Program Synthesis. In ICSE.
[33] Yue Jia and Mark Harman. 2010. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering 37, 5 (2010), 649–678.
[34] René Just. 2014. The Major mutation framework: efficient and scalable mutation analysis for Java. In International Symposium on Software Testing and Analysis, ISSTA '14, San Jose, CA, USA - July 21 - 26, 2014, Corina S. Pasareanu and Darko Marinov (Eds.). ACM, 433–436. https://doi.org/10.1145/2610384.2628053
[35] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[36] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2021. Code Prediction by Feeding Trees to Transformers. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 150–162. https://doi.org/10.1109/ICSE43902.2021.00026
[37] Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair. Commun. ACM 62, 12 (2019), 56–65. https://doi.org/10.1145/3318162
[38] Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-Task Learning based Pre-Trained Language Model for Code Completion. In ASE.
[39] Rabee Sohail Malik, Jibesh Patra, and Michael Pradel. 2019. NL2Type: Inferring JavaScript function types from natural language information. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019. 304–315. https://doi.org/10.1109/ICSE.2019.00045
[40] Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader-Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021. Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks. In 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 336–347. https://doi.org/10.1109/ICSE43902.2021.00041
[41] Martin Monperrus. 2018. Automatic software repair: a bibliography. ACM Computing Surveys (CSUR) 51, 1 (2018), 1–24.
[42] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007. Feedback-Directed Random Test Generation. In International Conference on Software Engineering (ICSE). IEEE, 75–84.
[43] Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. 2019. Mutation testing advances: an analysis and survey. In Advances in Computers. Vol. 112. Elsevier, 275–378.
[44] Michael Pradel and Satish Chandra. 2022. Neural software analysis. Commun. ACM 65, 1 (2022), 86–96. https://doi.org/10.1145/3460348
[45] Michael Pradel, Georgios Gousios, Jason Liu, and Satish Chandra. 2020. TypeWriter: Neural Type Prediction with Search-based Validation. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020. 209–220. https://doi.org/10.1145/3368089.3409715
[46] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[47] Veselin Raychev, Martin T. Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, Edinburgh, United Kingdom - June 09 - 11, 2014. 44.
[48] Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. ACM, 31–41.
[49] Domenico Serra, Giovanni Grano, Fabio Palomba, Filomena Ferrucci, Harald C. Gall, and Alberto Bacchelli. 2019. On the Effectiveness of Manual and Automatic Unit Test Generation: Ten Years Later. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). 121–125. https://doi.org/10.1109/MSR.2019.00028
[50] Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022).
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 6000–6010. http://papers.nips.cc/paper/7181-attention-is-all-you-need
[52] Yu Wang, Fengjuan Gao, Linzhang Wang, and Ke Wang. 2020. Learning Semantic Program Embeddings with Graph Interval Neural Network. In OOPSLA.
[53] Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. 2020. LambdaNet: Probabilistic Type Inference using Graph Neural Networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=Hkx6hANtwH
[54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 (2022).
[55] Frank F. Xu, Uri Alon, Graham Neubig, and Vincent J. Hellendoorn. 2022. A Systematic Evaluation of Large Language Models of Code. CoRR abs/2202.13169 (2022). arXiv:2202.13169 https://arxiv.org/abs/2202.13169
[56] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A Novel Neural Source Code Representation based on Abstract Syntax Tree. In ICSE.
[57] Hao Zhong, Lu Zhang, Tao Xie, and Hong Mei. 2009. Inferring Resource Specifications from Natural Language API Documentation. In International Conference on Automated Software Engineering (ASE). 307–318.