Code Generation Tools (Almost) For Free? A Study of Few-Shot, Pre-Trained Language Models On Code
ABSTRACT

Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range of code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow us to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par with (test oracle generation), or even outperform (test case generation) their respective traditionally built tools, while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input ("prompt") to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.

1 INTRODUCTION

Various software engineering tools assist developers by generating source code. One group of approaches reasons about existing code and modifies it in a way suitable to achieve some goal. For example, code mutation tools [33, 43] introduce mistakes to measure the effectiveness of test suites, and automated program repair tools [37, 41] suggest how to fix programming mistakes. Another group of approaches generates new code from scratch, given some existing code that the new code is supposed to relate to. For example, test case generators [17, 21, 42] automatically create tests that exercise a given method under test, and code completion tools [14, 27, 47] generate code that completes an existing code snippet in a suitable way. Finally, a third group of code generation tools does not require any existing code as an input, but instead generates new code given some natural language artifact. For example, some approaches generate test oracles based on informal API documentation [10, 11, 22], infer API usage protocols [57], or suggest missing type annotations [39].

The traditional way of creating such code manipulation tools is based on program analysis combined with various rules and heuristics. Program analysis can, at least in principle, ensure that the generated code has certain properties, e.g., that it is type-correct or passes a given set of test cases. Hand-coded rules and heuristics are typically required to make a technique effective and efficient on real-world software. More recently, learning-based approaches have started to complement traditional program analysis-based code generation tools [44]. Typically, these approaches formulate the specific code generation task as a supervised learning problem and require large amounts of training data to obtain an effective machine learning model. A commonality of both traditional program analyses and learning-based approaches is that creating a new code generation tool involves significant human effort. Even worse, this effort often must be repeated for each new combination of a task to achieve and a programming language to target.

A recent trend in the natural language processing (NLP) community promises a form of "general intelligence" that remedies many of the problems of building task-specific techniques: few-shot learning with large-scale, pre-trained language models [13], henceforth abbreviated as FSLMs. These models are trained on huge amounts of data without focusing on a specific downstream task. Instead, the training is based on generic pseudo-tasks for which it is trivial to obtain sufficient training data, e.g., predicting masked words or whether two sentences belong together. Once trained, FSLMs are effective at various question answering and text generation tasks, e.g., reading comprehension, trivia quizzes, translation between languages, and text completion [13].

Applying FSLMs to code is still a relatively sparsely explored area. While recent work employs pre-training of models of code as a means to reduce the amount of required training examples [3, 20, 24, 38], these approaches still fine-tune a model for a specific purpose and hence require moderately large amounts of labeled training examples. Noteworthy exceptions include GitHub's Copilot code completion system (https://copilot.github.com/), which is based on the Codex FSLM [15], and the recently released, open-source PolyCoder model family [55]. While the results of these models are impressive, code completion is only one of many code generation tasks.
Do the abilities of FSLMs generalize to other software engineering tasks that traditionally have been addressed by special-purpose code generation techniques? In case of a positive answer, FSLMs offer the potential to obtain code generation tools (almost) for free, as an FSLM gets trained once and can then be applied to many different tasks. Despite this potential and the strong interest of the software engineering community in automated code generation techniques, there currently is no systematic study of the abilities of FSLMs on such tasks.

This paper presents the first systematic study of FSLMs as the key ingredient for creating code generation tools. We describe a general framework for creating a code generation tool based on an existing FSLM, apply it to three popular tasks that are representative of different kinds of code generation problems, and compare the FSLM-based approach against traditionally developed state-of-the-art tools. Instantiating our framework for a specific code generation task involves three steps. First, develop an extractor of code or natural language information to use in a query to the model. Second, design a suitable prompt, i.e., a template of how to present the input to the model, which then gets instantiated for each given example. Finally, develop a lightweight post-processing module, which, e.g., removes generated code that fails to compile. We argue that these steps are lightweight compared to designing and implementing a traditional program generation technique, as they leave the most challenging parts of the tasks to the FSLM. As a result, the approach offers an almost-for-free way of obtaining a code generation tool.

We instantiate these ideas for three code generation tasks: code mutation, test oracle generation, and test case generation. These tasks have received significant interest from the software engineering community and, hence, offer state-of-the-art tools to compare against. The tasks also cover different levels of granularity of the generated code, ranging from manipulating a few tokens in code mutation to generating entire test cases. Finally, the selected tasks are based on different kinds of input: code mutation and test case generation are based on existing code, whereas test oracle generation is based on natural language documentation. Table 1 shows two representative example outputs that FSLM-based tools produce for each of these tasks. The examples follow the format x |==> y, where x and y denote, respectively, the input and output of the prompt for the given task.

For each task, we instantiate our general framework to create an FSLM-based code generation tool and then apply the tool to real-world software. We then systematically compare the results produced by the FSLM-based tool against an existing, state-of-the-art tool built specifically for the same purpose: the Major [34] code mutation tool, the MeMo [11] test oracle extraction tool, and the Randoop [42] test case generator. We measure the effectiveness of each tool using metrics of success suitable for the task, e.g., code coverage for test case generation, and precision/recall w.r.t. a ground truth for test oracle generation.

Our key findings include:

• FSLM-based tools are similarly and sometimes even more effective than existing, special-purpose tools. For example, for oracle generation, we measure an F1 score of 0.59 and 0.60 for MeMo [11] and an FSLM-based tool, respectively. For test generation, Randoop achieves 10% coverage, whereas a simple FSLM-based tool achieves 14%.

• FSLM-based and traditionally developed tools often complement each other. For example, our FSLM-based code mutation tool creates various mutants that Major cannot generate. The complementary nature of the two kinds of tools shows the potential of combining traditional and FSLM-based approaches. For example, combining Randoop-generated and FSLM-generated test cases yields 16% coverage, i.e., it exceeds both approaches individually.

• FSLM-based tools do not come completely for free. To be effective, they need specifically designed prompts and suitable inputs extracted from the given code or natural language. Yet, the effort required to create an FSLM-based tool is clearly lower than that for building special-purpose code generation tools from scratch.

In summary, this paper contributes the following:

• The first systematic study of FSLM-based code generation tools.

• We are the first to address code mutation, test oracle generation, and test case generation in an end-to-end manner with general-purpose FSLMs.

• Insights that show the potential and challenges of building FSLM-based code generation tools, providing guidance for future work.

2 BACKGROUND

A generative language model is designed to predict the next token given some previous tokens. For example, if such a model is given the input "I am Barack Obama. I used to be the president of the United States of", it might predict "America" as the next token. This can be used to generate text by repeatedly sampling the next token. When using such a model for downstream tasks that differ from the next-token prediction objective, the initial training step is often referred to as pre-training.

A pre-trained model can be adapted to a specific downstream task via fine-tuning, i.e., an additional training step based on labeled data for the downstream task. A recently proposed alternative is few-shot learning [13], which refers to the ability to perform a task without any fine-tuning, given only very few (typically, between one and ten) examples as part of the query to the model. We utilize generative language models as few-shot learners, which we refer to as few-shot learning with large-scale, pre-trained language models (FSLM). We use OpenAI's Codex [15] model, which is trained on a large set of GitHub projects, and access it through its API. Alternative generative models exist, e.g., GPT-NeoX [9].

The input provided to an FSLM is referred to as the prompt. Prompts typically contain a few examples of inputs with their desired outputs, followed by the input for which the model should provide an answer. For the above example, a prompt could start by giving a few example pairs of heads of state and the corresponding country, and then "Barack Obama", to which the model might respond with "United States". Prompts are, in principle, unstructured text, and what exactly is provided in a prompt may strongly influence the results.
Table 1: Representative examples of results obtained with FSLM-based code generation tools.

Oracle generation:

The returned implementation is immutable, thread-safe and Serializable. It is equivalent to tick(system(zone), Duration.ofSeconds(1)).
|==>
java.time.Clock.tick(java.time.Clock.system(zone), java.time.Duration.ofSeconds(1)) IS EQUIVALENT TO java.time.Clock.tickSeconds(zone)

Note for Java 7 and later: this method should be treated as deprecated; use the equivalent Long#compare method instead.
|==>
com.google.common.primitives.Longs.compare(a, b) IS EQUIVALENT TO java.lang.Long.compare(a, b)
When querying an FSLM with a prompt, the user can select the temperature, which, intuitively speaking, controls the creativity or randomness of the model's responses. A higher temperature means the model will generate more diverse responses, but be less factual and precise. Repeatedly querying a model may return different results, especially when using a higher temperature.
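For illustration, a minimal Java sketch of such a query is shown below. The endpoint, the model name (code-davinci-002), and the API-key environment variable are assumptions about a Codex-style completion API, not details prescribed by this paper.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;

public class FslmQuery {
  // Hypothetical helper: sends one prompt to a Codex-style completion API.
  static String complete(String prompt, double temperature) throws Exception {
    // Assumed endpoint and model name; adjust to the actual API in use.
    String body = String.format(Locale.ROOT,
        "{\"model\": \"code-davinci-002\", \"prompt\": %s, " +
        "\"temperature\": %.1f, \"max_tokens\": 256}",
        toJsonString(prompt), temperature);
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.openai.com/v1/completions"))
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    return response.body(); // raw JSON; the completion text must still be extracted
  }

  // Minimal JSON string escaping for the prompt.
  static String toJsonString(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                   .replace("\n", "\\n") + "\"";
  }
}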
3 METHODOLOGY

Figure 1 shows a general framework for producing code generation tools for a diverse set of tasks. The framework relies on a large-scale language model pre-trained on code, such as Codex [15]. The input to the framework is a textual representation of a software artifact, e.g., source code or documentation. The output is a set of generated code snippets, e.g., a modified version of the given source code, an executable specification, or a test case. The framework is organized in three main steps, which we briefly describe in the following.

(1) Instance extraction. The first step is responsible for extracting parts of a given software artifact that are relevant for the code generation task. We refer to an extracted part as an instance. For example, for code mutation, the instance extraction takes in source code and extracts code lines for which we want to generate mutants. The rationale for not simply passing in the entire raw software artifact is two-fold. First, FSLMs impose a maximum input size, e.g., 4,096 tokens for the Codex model series. Second, larger inputs take longer to process, i.e., the instance extraction reduces the overall time to generate code.

(2) Prompt design. The second step to use our framework is designing an effective prompt, which is perhaps the most difficult part of creating an FSLM-based code generation tool. The prompt contains (i) an instance, as extracted in the previous step, and (ii) contextual information, such as examples for addressing the code generation task and/or a natural language description of the task. The prompts we use in our study include a part that is invariant across all instances (e.g., a natural language description of the task) and a part that is instance-specific (e.g., the line of code to mutate). Given a prompt for a specific instance, the approach passes the prompt to the FSLM and then obtains a completion of it.

(3) Post-processing. Finally, the third step is to post-process the raw output produced by the model in order to obtain the final code generation results. The post-processing may filter the completions, e.g., to ensure that the generated code compiles, or copy the predicted code into a task-specific code template. (The sketch below summarizes these three steps as interfaces.)
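To make the division of labor concrete, the following sketch expresses the three steps as Java interfaces; the names and signatures are illustrative assumptions, not part of the paper's implementation.

import java.util.List;

// Hypothetical interfaces mirroring the three framework steps.
interface InstanceExtractor<A, I> {
  // Step 1: extract task-relevant instances from a software artifact.
  List<I> extract(A artifact);
}

interface PromptBuilder<I> {
  // Step 2: embed an instance into the invariant prompt template.
  String build(I instance);
}

interface PostProcessor<R> {
  // Step 3: turn raw completions into validated results,
  // e.g., by discarding code that does not compile.
  List<R> process(List<String> rawCompletions);
}

A concrete tool then composes an implementation of each interface with a query to the FSLM.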
Sections 3.2, 3.3, and 3.4 describe the code generation tasks that this paper focuses on according to the three above steps.

3.1 Research Questions

The overall goal of this study is to understand the strengths and limitations of FSLM-based code generation tools. To this end, we investigate the following research questions.

RQ1. Accuracy: How accurate are the model's predictions compared to existing tools?

RQ2. Impact of Prompt: What kinds of prompts are most effective at producing accurate results?

RQ3. Impact of Model Size: How much does the size of the FSLM influence the accuracy?

The motivation for RQ1 is that building traditional tools by hand imposes significant human costs. Understanding to what extent a single general-purpose language model could replace these tools may help reduce the cost of creating new tools. The motivation for RQ2 is that the prompt used to query a pre-trained model is the main "knob" to control the quality of the model's predictions. Understanding what prompts are effective (or not) helps make the best use of existing models. Finally, the motivation for RQ3 is that state-of-the-art language models are trained on huge datasets using enormous computational resources. Understanding the impact of model size on the model's effectiveness will help appropriately allocate computational resources to train models.
Figure 1: Overview of a general framework for generating code analysis tools using few-shot, pre-trained language models. (The diagram shows source code, documentation, etc. flowing through (1) instance extraction, (2) prompt construction for a language model pre-trained via large-scale training, and (3) post-processing of the raw completions into code generation results.)
3.2 Task 1: Code Mutation

We address our research questions by studying them on three popular code generation tasks. The first task is code mutation, a popular technique to assess the quality of a test suite by estimating its ability to detect injected faults. Code mutation modifies a given piece of code by injecting a programming mistake. As a simple example, a code mutation tool may change a comparison x > 5 into x < 5.

3.2.1 Baseline Tool. We study the effectiveness of an FSLM-based code mutation tool by comparing it against Major [34], a popular code mutation tool for Java. Major applies different built-in mutation operators and ensures that all created mutants compile.

3.2.2 Instance Extraction. To create an FSLM-based code mutation tool, the first step is extracting code snippets to modify. Since mutation operators typically are local code transformations, an instance for this task consists of a single line of code. The instance extractor takes a Java file as input and returns a list of lines of code that we then try to mutate via the FSLM. For a fair comparison, we focus our experiments on those lines where Major applies a mutation.

3.2.3 Prompt. To generate mutants via an FSLM, we design a prompt that asks the model to modify one code line at a time. Figure 2 shows the default prompt for our study. The prompt contains a brief natural language description of the task to perform, followed by a short list of examples. To help the model understand the different sections of the prompt, we mark them, e.g., via brackets as in "[[Code]]". Each example consists of the line of code to modify ("[[Code]]") and a few mutants to generate based on it ("[[Mutations]]"). Since mutants are small, we ask the model to suggest multiple mutants at once. Thus, the temperature can be set low for consistent but not as diverse results. At the end, the prompt provides the code line we wish to mutate, leaving the task of completing it with suitable mutants to the model. For the example in Figure 2, the model suggests a mutant that replaces the expression classVal passed as parameter in the call lhsDist.get() with classVal + 1.

Generate mutations for the following snippets of code.
[[Code]]
long biasedExp = (longBits & DoubleConsts.EXP_BIT_MASK) >> (DoubleConsts.SIGNIFICAND_WIDTH - 1);
[[Mutations]]
- longBits & DoubleConsts.EXP_BIT_MASK |==> longBits | DoubleConsts.EXP_BIT_MASK
- longBits & DoubleConsts.EXP_BIT_MASK) >> (DoubleConsts.SIGNIFICAND_WIDTH - 1) |==> longBits & DoubleConsts.EXP_BIT_MASK) << (DoubleConsts.SIGNIFICAND_WIDTH - 1)
- 1 |==> 0
- DoubleConsts.SIGNIFICAND_WIDTH - 1 |==> DoubleConsts.SIGNIFICAND_WIDTH % 1
[[Code]]
...
(3 more examples)
...
[[Code]]
WeightMass mass = lhsDist.get(classVal);
[[Mutations]]
- classVal |==> classVal + 1
- classVal |==> 0

Figure 2: Prompt used for mutant generation. We show the natural language description in purple, general parts of the prompt (e.g., examples, separators, etc.) in black, inputs that are specific to a concrete instance (i.e., the part that changes when the instance changes) in blue, and parts that the FSLM generates in response to that input in green.

3.2.4 Post-processing. Once the model completes the prompt, we post-process the raw completion using simple regular expressions to extract the mutations suggested by the model. Because an FSLM does not guarantee to produce code that is syntactically or semantically valid, we filter out any suggested mutants that do not compile. All remaining code snippets are our final set of mutants generated by the FSLM.
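A minimal sketch of this post-processing step, assuming the completion lists mutations in the "original |==> replacement" format of Figure 2; the compile check is left to the caller because it depends on the build setup.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MutantExtractor {
  // Matches lines of the form "- original |==> replacement".
  private static final Pattern MUTATION =
      Pattern.compile("^- (.+?) \\|==> (.+)$", Pattern.MULTILINE);

  // Extracts (original, replacement) pairs from a raw completion and
  // applies each to the target line; callers then filter by compilability.
  static List<String> extractMutatedLines(String completion, String codeLine) {
    List<String> mutatedLines = new ArrayList<>();
    Matcher m = MUTATION.matcher(completion);
    while (m.find()) {
      String original = m.group(1).trim();
      String replacement = m.group(2).trim();
      if (codeLine.contains(original)) {
        mutatedLines.add(codeLine.replace(original, replacement));
      }
    }
    return mutatedLines;
  }
}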
3.2.5 Benchmarks. As code files to mutate, we randomly select 32 classes from the projects listed in Table 2. The smallest class has 19 lines of code, while the longest has 1,346. In total, the classes have 6,747 lines of code. Across the 32 classes, our instance extractor yields 1,194 instances (lines of code) to generate mutants for.

Table 2: Projects used in our study.

Project         Version   # Classes
Colt            1.2.0     297
ElasticSearch   6.1.1     2,821
GWT             2.5.1     3,178
GraphStream     1.3       233
Guava           19.0      464
Hibernate       5.4.2     3,393
JDK             8         4,240
Commons Math    3.6.1     918
Weka            3.8.0     1,648

3.3 Task 2: Generating Oracles from Natural Language Documentation

As a second code generation task, we consider the problem of generating test oracles from natural language documentation. This task represents a class of tasks where an FSLM translates natural language to code. Specifically, we focus on the task of extracting metamorphic test oracles for API methods from Javadoc comments. A metamorphic test oracle states that two inputs that are in a specific relationship are expected to lead to outputs that are in another known relationship [16].
In the context of testing API methods, this typically means that some API usage is equivalent (or in some other relationship) to some other API usage. As an example of an oracle we aim to generate, consider this excerpt from the Javadoc of the Arrays.toString method: "The value returned by this method is equal to the value that would be returned by Arrays.asList(a).toString(), unless a is null, in which case "null" is returned.". The equivalence described in this documentation could be specified as an executable test oracle that states that Arrays.toString(a) yields the same as Arrays.asList(a).toString() if a != null.
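To illustrate, such an oracle can be turned into a concrete, self-contained check; the class and method names below are illustrative, not part of the study's artifacts.

import java.util.Arrays;

public class ToStringOracleTest {
  // Checks the metamorphic oracle: for non-null arrays,
  // Arrays.toString(a) equals Arrays.asList(a).toString().
  static void testToStringEquivalence() {
    Object[] a = {"x", 42, null};
    if (a != null) {
      String lhs = Arrays.toString(a);
      String rhs = Arrays.asList(a).toString();
      if (!lhs.equals(rhs)) {
        throw new AssertionError(lhs + " != " + rhs);
      }
    }
  }

  public static void main(String[] args) {
    testToStringEquivalence();
  }
}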
3.3.1 Baseline Tool. Extracting test oracles from natural language documentation is an active area of research [10, 11, 22]. As a baseline to compare an FSLM-based tool against, we use the recently proposed MeMo [11]. MeMo extracts metamorphic test oracles by first identifying natural language sentences that could contain such oracles using simple heuristics, and then translating those sentences into code. This translation, which is the most intricate part of MeMo, decomposes the sentence using a dependency parser, and then converts the parsed sentence into code based on a set of hard-coded rules and heuristics. Because of the inherent imprecision and diversity of natural language, the second step has to cover many edge cases to be effective. Our study investigates whether an FSLM-based tool could replace or complement this second step of MeMo, i.e., replacing the hard-coded rules by queries to a pre-trained model.

3.3.2 Instance Extraction. In the context of this task, we define an instance to be a method whose description probably contains an oracle. For a fair comparison with the baseline tool, and because extracting such sentences is comparatively simple, we use MeMo to identify sentences that likely contain an oracle. We then pass the entire comment containing such a sentence into our prompt, which provides the FSLM with some context.

3.3.3 Prompt. We design the prompt to be a short list of examples of what the model is supposed to achieve, as shown in Figure 3. Each example consists of four parts: (1) the signature of the method for which to generate an oracle ("### Signature"); (2) the natural language description of the method's behavior, as extracted from the available Javadoc ("### Comment"); (3) a small section of natural language explanation of how the equivalence manifests itself in the example ("### Analysis"), motivated by the observation that letting the model explain its reasoning before generating the result may increase its effectiveness [54]; and (4) the Java code of the metamorphic oracle, which consists of a conditional followed by two expressions separated by the symbol <->, denoting "equivalent to" ("### Equivalence"). After providing a small number of such examples (four by default), we provide the signature and comment of the instance we are interested in, and then let the model complete the prompt by providing an analysis and the oracle. For this task, the temperature is set to zero, as we observe the model to produce too imprecise predictions otherwise.

### Signature
public static java.lang.String toString(java.lang.Object[] a)
### Comment
...
The value returned by this method is equal to the value that would be returned by Arrays.asList(a).toString(), unless a is null, in which case "null" is returned.
### Analysis
This method returns the same thing as the expression Arrays.asList(a).toString(), therefore they are equivalent (at least as long as a is not null).
### Equivalence
if (a != null) toString(a) <-> Arrays.asList(a).toString() ;
...
(3 more examples)
...
### Signature
public double norm2(cern.colt.matrix.DoubleMatrix1D x)
### Comment
Returns the two-norm (aka euclidean norm) of vector x; equivalent to mult(x,x).
### Analysis
This method is equivalent to the expression mult(x,x).
### Equivalence
norm2(x) <-> mult(x,x);

Figure 3: The prompt used for oracle extraction.

3.3.4 Post-processing. Given the raw completion produced by the model in response to our prompt, we extract the generated test oracle. The extraction is based on simple regular expressions, e.g., anchored around the special <-> symbol. Next, we check whether the predicted condition (if any) and the code snippets compile properly. Finally, the approach expands names of classes, e.g., Math to java.lang.Math, using JavaSymbolSolver (http://javaparser.org/).
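A minimal sketch of the extraction step, assuming the completion contains an "### Equivalence" section in the format of Figure 3; compilation checking and name expansion are elided.

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OracleExtractor {
  // Matches "[if (cond)] lhs <-> rhs ;" in the completion.
  private static final Pattern EQUIVALENCE = Pattern.compile(
      "(?:if \\((.+?)\\))?\\s*(.+?)\\s*<->\\s*(.+?)\\s*;");

  // Returns {cond, lhs, rhs} of the first predicted oracle, if any;
  // cond is null when the oracle is unconditional.
  static Optional<String[]> extract(String completion) {
    Matcher m = EQUIVALENCE.matcher(completion);
    if (!m.find()) return Optional.empty();
    return Optional.of(new String[] { m.group(1), m.group(2), m.group(3) });
  }
}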
3.3.5 Benchmarks. To measure the effectiveness of the FSLM-based tool, we use a ground truth dataset available from MeMo's artifacts [11]. The dataset is based on 5,000 methods from nine open-source Java projects, from which 299 metamorphic test oracles have been manually extracted. The oracles are diverse and vary in length: the natural language descriptions range between 3 and 500 words, with a mean of 44.3. The code of the oracles ranges between 3 and 81 tokens, with a mean of 21.6.

3.4 Task 3: Test Case Generation

As the third code generation task, we consider the problem of generating unit tests. This task represents a class of tasks where the FSLM generates a method, i.e., a larger portion of code compared with the previous examples. Test case generation is a labor-intensive task in software testing [6], and several techniques have been proposed to automate unit test case generation [49].

3.4.1 Baseline Tool. There are many automatic test case generation tools available. Randoop [42] and EvoSuite [21] are popular representatives of such tools. We use Randoop in our study. To generate test cases with Randoop, for each method under test, we invoke its main class randoop.main.Main, passing the gentests command and the --methodlist=filename.txt and --generated-limit=100 arguments. The file filename.txt contains the method under test, as well as helper methods it depends on. We select helper methods with a minimum amount of dependencies to include. The generated-limit argument defines the maximum number of test method candidates generated internally. For a fair comparison, we let Randoop and the FSLM generate the same number (100) of test cases per method under test.
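For illustration, the described Randoop invocation could be scripted as follows; this is a sketch assuming Randoop is on the given classpath, with paths as placeholders.

import java.util.List;

class RandoopRunner {
  // Launches Randoop as described above: the gentests command with a
  // method list and an internal limit of 100 generated test candidates.
  static Process run(String classpath, String methodListFile) throws Exception {
    List<String> cmd = List.of(
        "java", "-cp", classpath,
        "randoop.main.Main", "gentests",
        "--methodlist=" + methodListFile,
        "--generated-limit=100");
    return new ProcessBuilder(cmd).inheritIO().start();
  }
}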
5
Conference’17, July 2017, Washington, DC, USA Patrick Bareiß, Beatriz Souza, Marcelo d’Amorim, and Michael Pradel
3.4.2 Instance Extraction. For unit test case generation, we consider an instance to be a method under test. That is, the instance extractor takes a Java class as its input, produces a list of public methods, and randomly selects a method from the list to be tested.

3.4.3 Prompt. Figure 4 shows an example of the (default) prompt that we use for unit test case generation. The prompt starts with a brief natural language description of the task. Next, we provide one example of the task. The reason for showing only one example is that state-of-the-art FSLMs only support a bounded prompt size. The example consists of three parts: (1) a list of helper methods to assist in the creation of values ("Helper constructors and methods:"), (2) the method under test itself, and (3) a test case that invokes the method under test. After the example, the instance, consisting of the code of the method (as explained in Section 3.4.2 "Instance Extraction"), is provided, leaving the task of generating a test case to the model. Since the prompt contains only a single example, selecting this example potentially has a large impact on the generated test. Section 4.2 compares different strategies for selecting the example, e.g., selecting another method under test from the same class and selecting another method under test at random. Because each query yields only one test case, we make multiple queries while varying the temperature parameter from 0.0 to 0.9, in steps of 0.1. For each temperature, we make 10 queries. This way, the model predicts a total of 100 test case candidates. (A sketch of this query loop is shown after Figure 4.)

Suggest a test for a method with the DoubleArrayList quantiles(DoubleArrayList percentages) signature.
Helper constructors and methods:
DynamicBin1D()
DoubleArrayList()
Method: public synchronized double max() {
  if (!isIncrementalStatValid) updateIncrementalStats();
  return this.max;
}
Test: public static void testMax() {
  double[] temp = new double[2];
  temp[0] = 8.9;
  temp[1] = 1;
  DenseDoubleMatrix1D d1Double = new DenseDoubleMatrix1D(temp);
  hep.aida.bin.DynamicBin1D d1ynamicBin = cern.colt.matrix.doublealgo.Statistic.bin(d1Double);
  double max = d1ynamicBin.max();
  System.out.println("max=" + max);
}
—
Method: public DoubleArrayList quantiles(DoubleArrayList percentages) {
  return Descriptive.quantiles(sortedElements_unsafe(), percentages);
}
Test: public static void testQuantiles() {
  double[] temp = new double[2];
  temp[0] = 8.9;
  temp[1] = 1;
  DenseDoubleMatrix1D d1Double = new DenseDoubleMatrix1D(temp);
  hep.aida.bin.DynamicBin1D d1ynamicBin = cern.colt.matrix.doublealgo.Statistic.bin(d1Double);
  DoubleArrayList quantiles = d1ynamicBin.quantiles(new DoubleArrayList(new double[] {0.5, 0.75}));
  System.out.println("quantiles=" + quantiles);
}

Figure 4: Test generation prompt for the method DoubleArrayList quantiles(DoubleArrayList percentages), declared in the class DynamicBin1D from project Colt.
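The temperature sweep described in Section 3.4.3 could look as follows; a sketch reusing the hypothetical FslmQuery.complete helper sketched in Section 2.

import java.util.ArrayList;
import java.util.List;

class TestCandidateGenerator {
  // Queries the model 10 times at each temperature 0.0, 0.1, ..., 0.9,
  // yielding 100 raw test case candidates for one method under test.
  static List<String> generateCandidates(String prompt) throws Exception {
    List<String> candidates = new ArrayList<>();
    for (int t = 0; t <= 9; t++) {
      double temperature = t / 10.0;
      for (int i = 0; i < 10; i++) {
        candidates.add(FslmQuery.complete(prompt, temperature));
      }
    }
    return candidates;
  }
}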
3.4.4 Post-processing. To post-process the raw completions of the model, we inject each test case candidate into a template of a test class, which contains the necessary scaffolding to yield an executable test case. Similar to the previous tasks, we discard candidates that do not compile. We also remove any duplicates that may result from querying the model multiple times.
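A minimal sketch of this step, assuming a test class template with a single placeholder; the compile check is stubbed out because it depends on the project's classpath.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TestPostProcessor {
  // Hypothetical template; the real scaffolding also includes imports etc.
  private static final String TEMPLATE =
      "public class GeneratedTest {\n%s\n}\n";

  // Wraps each candidate in the template, drops duplicates, and keeps
  // only candidates that compile.
  static Set<String> process(List<String> candidates) {
    Set<String> tests = new LinkedHashSet<>(); // the set removes duplicates
    for (String candidate : candidates) {
      String testClass = String.format(TEMPLATE, candidate);
      if (compiles(testClass)) {
        tests.add(testClass);
      }
    }
    return tests;
  }

  // Stub: in practice, invoke the Java compiler API or javac here.
  private static boolean compiles(String source) {
    return !source.isEmpty();
  }
}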
3.4.5 Benchmarks. As methods under test, we use the 18 methods that Table 5 shows. We select them by randomly identifying two public methods from each of the 9 projects in Table 2.

4 RESULTS

This section presents answers to the three research questions posed in Section 3.1. Section 5 discusses the results and their broader impact.

4.1 RQ1: Accuracy

This section presents results on the accuracy of FSLM-based code generation compared to traditionally built tools.

4.1.1 Code Mutation. Table 3 summarizes our results for the code mutation task. Given the 1,194 instances extracted from 32 classes (Section 3.2.5), our FSLM-based tool generates a total of 2,721 mutants, whereas the baseline Major tool generates 2,810 mutants. Because the model does not guarantee to generate valid code, only 62.5% of the FSLM-generated mutants are compilable, giving a total of 1,701 usable mutants. On average, our tool changes 3.97 tokens of the original code, which roughly equals the 4.28 tokens changed by Major. Besides the raw number of mutants generated, it is also important to understand whether the generated mutants are useful. We address this question both quantitatively and qualitatively. As a quantitative answer, we compute how many of the FSLM-generated mutants exactly match one of the Major-generated mutants. We observe an overlap of around 18% of the FSLM-generated mutants. Under the assumption that Major-generated mutants are useful, this means that at least 18% of the FSLM-generated mutants are also useful.
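This exact-match overlap amounts to a simple set intersection; a sketch assuming mutants are compared as normalized source strings (a simplification of whatever canonical form the study actually used).

import java.util.HashSet;
import java.util.Set;

class OverlapAnalysis {
  // Fraction of FSLM-generated mutants that exactly match a Major mutant.
  static double overlap(Set<String> fslmMutants, Set<String> majorMutants) {
    Set<String> common = new HashSet<>(fslmMutants);
    common.retainAll(majorMutants); // keep only exact matches
    return (double) common.size() / fslmMutants.size();
  }
}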
As a qualitative answer, we manually inspect a random sample of 30 of the compilable mutants our tool generates. For each sampled mutant, we carefully inspect the code and determine whether the mutation changes the runtime behavior of the code, as opposed to being an equivalent mutant that preserves the semantics of the original code. The inspection shows that 90% of the mutants certainly change the behavior, whereas the remaining 10% either preserve the semantics or have effects we could not clearly determine.

To better understand the mutants generated by the model and Major, we automatically classify them based on the kind of code transformation. We distinguish four classes, as shown in the four right-most columns of Table 3: (i) deleting a statement, (ii) replacing one operator with another, (iii) replacing one value with another, and (iv) some other transformation. The table shows that the distributions of mutants that the FSLM and Major generate clearly differ:
Table 5: Analysis of the test cases generated by our FSLM-based test generator and Randoop for the considered methods. The table presents (1) the number of compilable test cases (CT); (2) the average test size (TS) of the generated tests; and (3) the line coverage (LC) achieved by them.
Overall, these results show that the model achieves higher code coverage than Randoop (14% vs. 10%). This result is particularly remarkable as Randoop generates more than three times the number of tests the model generates (202 vs. 682 tests). Moreover, on average, the tests generated by the model are much smaller than the tests generated by Randoop (11 vs. 31 lines).

On a closer analysis of the tests generated by each approach for each of the 18 methods, we can see that Randoop successfully generates tests for all 18 methods under test. In contrast, the model successfully generates tests for only 16 of them. More specifically, (i) for 14 methods, the tests generated by the model achieve higher coverage than the tests generated by Randoop; (ii) for two methods, the tests generated by both approaches achieve the same coverage; and (iii) for two methods, the tests generated by Randoop achieve higher coverage than the tests generated by the model. These are exactly the two methods for which the model fails to generate any compilable tests.

These results provide initial evidence indicating that FSLM-based tools can outperform state-of-the-art test generation tools. We also calculate the coverage achieved by combining the tests generated by both approaches. The results can be seen in the last column of Table 5. Interestingly, the coverage achieved by the combination of the tests (16%) is superior to the coverage achieved by the tests of each approach individually. As an example, the coverage achieved by the combination of the tests is considerably higher when considering the quantiles method of the Colt project. In this case, individually, the tests generated by the model achieve 29% line coverage and the tests generated by Randoop achieve 26% line coverage. Combined, the tests generated by both approaches achieve 39% line coverage.

Summary of RQ1: FSLM-based tools perform surprisingly well across the different tasks, being on par with or complementary to the handcrafted tools we compare against, and, for test generation, even better.

4.2 RQ2: Impact of Prompt

By default, our prompts contain both natural language task descriptions and input-output examples. This section reports on the impact of using different prompt variants. For each of the tasks, we consider the following prompt variants: only a natural language description (NL-only); only input-output examples (Ex-only); and poorly chosen examples (Bad-ex).

4.2.1 Code Mutation. For mutant generation, NL-only means the prompt includes only the natural language text at the top of Figure 2, Ex-only means we keep everything but the NL description, and Bad-ex means we include additional examples where our FSLM-based tool should not generate mutants. For example, we add an import statement as an example, but leave the mutants section empty. The idea is to test how robust the model is to adversarial or poorly chosen examples.

The middle rows in Table 3 show the results obtained with these variants. Running NL-only does not produce promising results, since the prompt is missing the guiding output format from the examples. We attempt to "fix" the prompt by including more detailed descriptions of how to format the output (i.e., we add "Return the result in the format original |==> replacement as part of a list numbered using '-'." to the prompt), but the output format remains inconsistent, yielding no results. This means examples play a large part in solving this task using an FSLM. Looking at the results for Ex-only reveals
Table 6: Line coverage achieved by the tests generated by our FSLM-based test
generator with different prompts and a smaller model.