Codexity: Secure AI-assisted Code Generation

Sung Yong Kim, National University of Singapore, Singapore
Zhiyu Fan, National University of Singapore, Singapore
Yannic Noller, Singapore University of Technology and Design, Singapore
Abhik Roychoudhury, National University of Singapore, Singapore

arXiv:2405.03927v1 [cs.SE] 7 May 2024
ABSTRACT

Despite the impressive performance of Large Language Models (LLMs) in software development activities, recent studies show the concern of introducing vulnerabilities into software codebases by AI programming assistants (e.g., Copilot, CodeWhisperer). In this work, we present Codexity, a security-focused code generation framework integrated with five LLMs. Codexity leverages the feedback of static analysis tools such as Infer and CppCheck to mitigate security vulnerabilities in LLM-generated programs. Our evaluation on a real-world benchmark with 751 automatically generated vulnerable subjects demonstrates that Codexity can prevent 60% of the vulnerabilities from being exposed to the software developer.

• Video: https://youtu.be/om6CYGfbNnM
• Tool and Data: https://github.com/Codexity-APR/Codexity, https://doi.org/10.5281/zenodo.10572275

KEYWORDS

security, code generation, large language model

1 INTRODUCTION

The recent advance of foundational Large Language Models (LLMs) for code has shown surprising results in many software engineering applications like code completion, code summarization, etc. Consequently, AI programming assistants like Copilot and CodeWhisperer have reshaped the way of software development. Despite their advantages and conveniences, researchers have expressed concerns that LLMs can introduce security vulnerabilities into the auto-generated code that developers could overlook [10, 15, 22]. Without a secure integration mechanism for this LLM-generated code, it could thus compromise the entire software system. Automated Program Repair (APR) [16] is a technique that aims to fix bugs automatically, with many successful applications in fixing security vulnerabilities [11, 14, 19, 24, 26]. However, existing APR approaches primarily focus on fixing human-written programs and typically assume timeouts of ≥ 1 hour [20]. Such assumptions might fit CI workflows or nightly-build scenarios but do not satisfy the tight requirements of the in-time vulnerability prevention needed for automated code generation/completion inside the IDE.

To tackle these challenges, we present Codexity, the first security-focused code generation framework. Codexity integrates LLMs with static analyzers [6, 18] to establish security awareness and act as the first guard to prevent potential vulnerabilities introduced by AI programming assistants. In our experiments with 990 real-world code completion attempts, we demonstrate that, compared to ChatGPT, Codexity prevents the generation of 60% of the vulnerabilities. Our core contributions are as follows:

• Codexity, a security-focused code generation framework providing timely vulnerability-free LLM-generated programs for developers. Additionally, Codexity employs a flexible architecture that allows user-customized LLM deployment.
• A real-world benchmark for LLM-generated programs, consisting of 751 vulnerable subjects from 90 publicly available prompts.

2 MOTIVATING EXAMPLE

Listing 1 displays a program generated by an LLM. It is meant to answer a StackOverflow inquiry¹ about storing user-input strings in variables. Our assembly of Static Application Security Testing (SAST) tools identified a potential vulnerability (highlighted in red) that is related to CWE-119:

"CWE-119 identified at lines 7 and 9. The triggered concern: scanf() without field width limits can crash with huge input data."

To prevent such vulnerabilities from being exposed to the software developer, we provide this additional feedback to the LLM in the form of a structured prompt, emulating a dialogue with the goal of crafting code devoid of these flaws (see Figure 1). Consequently, the LLM responded by producing the revised code, eliminating the buffer overflow issue, as evidenced in Listing 2. As highlighted in green, the code is no longer vulnerable because the format specifier %49s is used in both scanf calls, meaning that scanf will write at most 49 characters (plus the terminating NUL) into the 50-byte arrays. By specifying the maximum number of characters, the code prevents buffer overflow vulnerabilities if the input data exceeds the allocated buffer size.

¹ https://stackoverflow.com/questions/19063213/string-ask-and-store
1  #include <stdio.h>
2  #include <errno.h>
3  #include <string.h>
4  int main(void) {
5      char reference[50], query[50];
6      printf("\nEnter reference genome file name: ");
7      scanf("%s", reference);
8      printf("\nEnter query genome file name: ");
9      scanf("%s", query);
10     return 0;
11 }

Listing 1: Vulnerable code generated by LLM

1  #include <stdio.h>
2  #include <errno.h>
3  #include <string.h>
4  int main(void) {
5      char reference[50], query[50];
6      printf("\nEnter reference genome file name: ");
7      scanf("%49s", reference);
8      printf("\nEnter query genome file name: ");
9      scanf("%49s", query);
10     return 0;
11 }

Listing 2: Safe code generated by LLM
Figure 1: Conversational Prompt (Vulnerable Code is a placeholder for the code in Listing 1)

Figure 2: Codexity's Repair Strategies. (a) Iteration Repair. (b) Preshot Repair.
3 DESIGN AND METHODOLOGY

This section introduces the workflow of Codexity and its two vulnerability-preventing repair strategies.

3.1 Codexity Workflow

In Codexity's workflow, the user first needs to select a repair strategy in the configuration settings to activate the system. Codexity currently offers two strategies, Iteration Repair and Preshot Repair, catering to different requirements in computation power. Then, the user can invoke Codexity to complete their code while programming. Codexity takes the existing code snippet to initiate a prompt and generates an initial completion with the selected LLM. The completed code is then routed to a vulnerability detection phase backed by a series of static analysis tools. Codexity integrates the two state-of-the-art static analyzers CppCheck [18] and Infer [6]. We selected these two tools to examine a wide range of vulnerabilities: like Arusoaie et al. [5], we found CppCheck to be a generally good candidate, and we added Infer because of its strong ability to find memory-related bugs. If the static analysis tools report any vulnerability, Codexity extracts the error/warning message and location information and, together with the vulnerable program, formulates a vulnerability-exposing prompt (see Figure 1). Finally, Codexity sends the vulnerability-exposing prompt to the LLM in the background and requests a vulnerability-free program.
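To make the detection-and-prompting step concrete, the sketch below illustrates how a completed snippet could be checked with CppCheck and how the resulting warnings could be folded into a vulnerability-exposing prompt. This is a minimal Python illustration, not Codexity's actual implementation: the CppCheck flags shown are standard CLI options, while the prompt template merely mimics the structure of Figure 1.

import os
import subprocess
import tempfile

def analyze_with_cppcheck(code: str) -> list[str]:
    """Run CppCheck on a C snippet and return its warning lines."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # --enable=all activates additional checks; CppCheck prints its
        # diagnostics on stderr by default.
        result = subprocess.run(
            ["cppcheck", "--enable=all", path],
            capture_output=True, text=True,
        )
    finally:
        os.unlink(path)
    return [line for line in result.stderr.splitlines() if line.strip()]

def build_repair_prompt(code: str, warnings: list[str]) -> str:
    """Fold analyzer feedback into a vulnerability-exposing prompt."""
    report = "\n".join(warnings)
    return (
        "The following C program contains security issues:\n"
        f"{code}\n"
        "Static analysis reported:\n"
        f"{report}\n"
        "Please return a fixed version of the complete program "
        "that does not contain these issues."
    )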
3.2 Strategy 1: Iteration Repair

We now describe the default vulnerability-preventing strategy in Codexity. The Iteration Repair strategy employs commercial LLMs to generate vulnerability-free programs and provides an interface that allows users to specify the exact background LLM engine. The goal is to leverage the most powerful LLM that the user can access to provide the most secure code generation service. Moreover, API-accessible LLMs do not require users to be locally equipped with a high-end GPU for model inference. After generating a code completion, if a vulnerability is detected, the vulnerable code is sent back to the commercial LLM using its API. We repeat this until the LLM generates a program for which the static analyzers report no vulnerability.
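A minimal sketch of this loop, reusing the helpers sketched above, might look as follows. The complete argument stands in for a call to the chosen commercial LLM's API and is an assumption for illustration, not Codexity's exact interface.

MAX_ITERATIONS = 3  # the bound used in our evaluation (see Section 4)

def iteration_repair(prompt: str, complete) -> tuple[str, bool]:
    """Query the remote LLM until the analyzers are silent or the
    iteration budget is exhausted; `complete` maps a prompt to code."""
    code = complete(prompt)
    for _ in range(MAX_ITERATIONS):
        warnings = analyze_with_cppcheck(code)
        if not warnings:
            return code, True  # analyzers report no vulnerability
        # Feed the vulnerable code plus the analyzer report back to the LLM.
        code = complete(build_repair_prompt(code, warnings))
    return code, not analyze_with_cppcheck(code)  # final check after budget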
3.3 Strategy 2: Preshot Repair

Iteration Repair aims to generate the best result by iteratively querying the model until a secure answer is found or the maximum number of allowed queries is reached. Although commercial LLMs provide better results, they may introduce high costs for users. To balance the trade-off between cost and quality, we present an alternative vulnerability-preventing strategy called Preshot Repair, shown in Figure 2b. We hypothesize that a weaker local (and hence cheaper) LLM is more prone to generate vulnerabilities; such vulnerability information can be included in the query to the more costly and powerful LLM. Consequently, we aim to anticipate and prevent the flaws likely to surface in our target LLM's output, hence the term Preshot Repair. We therefore first query a local LLM to generate an initial (possibly vulnerable) completion, employ the static analyzers to detect any vulnerability, and subsequently use the analysis report in the prompt generation for the target LLM.
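The preshot variant needs only one remote call: the local model's draft is analyzed first, and the report is attached to the single query sent to the target LLM. Again a hedged sketch reusing the helpers above; local_complete and remote_complete are placeholders for a local quantized model and a commercial API, respectively.

def preshot_repair(prompt: str, local_complete, remote_complete) -> str:
    """Anticipate likely flaws with a cheap local draft, then query the
    powerful target LLM once, with the analyzer report included."""
    draft = local_complete(prompt)      # cheap, possibly vulnerable
    warnings = analyze_with_cppcheck(draft)
    if warnings:
        # Warn the target LLM up front about the flaws the draft exposed.
        prompt = build_repair_prompt(draft, warnings)
    return remote_complete(prompt)      # single commercial call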
4 EVALUATION

We show Codexity's effectiveness in securely handling developers' code completion requests based on two real-world datasets, and we compare Codexity's performance with FootPatch [25] and GitHub Copilot [1]. In particular, we explore three research questions:

RQ1: How effective is Codexity in preventing the generation of vulnerable programs?
RQ2: How does Codexity perform compared to FootPatch and GitHub Copilot?
RQ3: What are the benefits/drawbacks of the proposed workflows?

4.1 Evaluation Setup

Benchmark Construction. To simulate real-world scenarios of how AI-assisted programming may help software developers, we curated a benchmark of developer prompts that are prone to generate vulnerable programs from ShareGPT [2] and StackOverflow [3]. ShareGPT is a platform where LLM users can post their experiences in formulating prompts and interacting with LLMs, whereas StackOverflow is a popular programming-oriented question-answering platform. Figure 3 shows the benchmark construction process.

Figure 3: Dataset Construction

We first collected 403 user posts relevant to C programming queries by filtering all posts in the ShareGPT dataset and the first 700 posts on StackOverflow using 'c' and 'int main' as keywords. Each user post consists of a question and an LLM response. We then employed a two-round vulnerable-prompt detection strategy. In the first round, we extracted the code snippet from each LLM response and ran the static analyzers in Codexity on the extracted code snippets of the 403 programming posts. As a result, we identified 124 posts containing vulnerabilities in their response.

In the second round, we extracted the non-vulnerable part of the code snippet from the 124 posts and asked ChatGPT to complete the programming questions (using two temperature configurations) to confirm further whether they are prone to generate vulnerable programs. First, ChatGPT generated one completion for each prompt at temperature 0. Then, we executed ChatGPT again to generate ten completions for each prompt at temperature 0.8. We then reran the static analyzers on the 124×11 = 1,364 LLM-generated code completions. We classified a prompt as vulnerable if (1) ChatGPT produces a vulnerable program when the temperature is set to 0, or (2) ChatGPT produces at least one vulnerable program in one of the ten runs when the temperature is set to 0.8. Finally, we ended up with 90 vulnerable prompts and generated 751 vulnerable code completions with 1,645 vulnerabilities.
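The second-round classification criterion can be summarized in a few lines. The sketch below is illustrative only; generate stands for a hypothetical ChatGPT call at a given temperature, and is_vulnerable for a run of the static analyzers in Codexity.

def prompt_is_vulnerable(prompt, generate, is_vulnerable) -> bool:
    """Round-two criterion: a prompt counts as vulnerable if the greedy
    completion is vulnerable, or if any of ten samples at temperature
    0.8 is vulnerable (11 completions per prompt in total)."""
    if is_vulnerable(generate(prompt, temperature=0.0)):
        return True
    return any(
        is_vulnerable(generate(prompt, temperature=0.8))
        for _ in range(10)
    )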
Table 1 shows the details of our benchmark, categorized by the vulnerabilities identified by CppCheck and Infer. We employed CppCheck for common CWEs and Infer for specific types like memory leaks. Note that subjects can have multiple vulnerabilities. We defaulted to the CWE classification when vulnerabilities overlapped.

Table 1: Vulnerability Categories

Vulnerability        Total | Vulnerability           Total
Null Dereference      384  | CWE-686                   15
Nullptr Dereference   363  | Integer Overflow L2       13
Resource Leak         196  | CWE-758                   12
Buffer Overrun L2     119  | CWE-197                   12
Memory Leak           109  | CWE-562                   11
CWE-119                90  | Buffer Overrun L1         11
CWE-457                81  | CWE-467                   10
CWE-401                66  | Use After Lifetime         8
Buffer Overrun L3      62  | CWE-685                    7
CWE-775                32  | CWE-476                    4
CWE-788                22  | Use After Free             2
Buffer Overrun S2      15  | Inferbo Alloc Is Zero      1

Baseline Tools. We use FootPatch and GitHub Copilot as baseline tools in our comparison. We selected FootPatch because it also uses a SAST tool, Infer, to detect vulnerabilities. We also decided to compare with GitHub Copilot because it is one of the most popular code completion tools and runs with a similar GPT model.

Models. Throughout the experiments, we used the latest ChatGPT (gpt-3.5-turbo) as the commercial LLM code completion and fixing engine for Codexity. For the preshot repair, we used two different specialized code completion models: StarCoder [17] in GPTQ [9] format and SantaCoder [4] in GGML [12] format. These formats allow quantization, making the models suitable for CPUs or less powerful GPUs. The models were run with a maximum prediction length of 1024 tokens, to be able to complete the whole code, and with a temperature of 0.2 (following the same evaluation setup as the authors of StarCoder). For the GPTQ model, we used the publicly available model in the Hugging Face repository². Hugging Face is the equivalent of GitHub for sharing models. We obtained the GGML model by following the instructions in their GitHub repository.

² https://huggingface.co/TheBloke/starcoder-GPTQ
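To give a sense of the local-model setup, the sketch below shows how a quantized completion model could be driven with the generation parameters above (at most 1024 new tokens, temperature 0.2). This is an illustrative sketch only: it assumes the ctransformers package for GGML inference, and the model path and model type are placeholders rather than the exact configuration used in our experiments.

# Illustrative sketch (assumption: the `ctransformers` package is installed
# and a quantized GGML model file has been downloaded locally).
from ctransformers import AutoModelForCausalLM

# Placeholder path; obtain the quantized weights as described in the text.
llm = AutoModelForCausalLM.from_pretrained(
    "models/santacoder-ggml.bin",
    model_type="starcoder",  # architecture family for SantaCoder/StarCoder
)

prompt = "#include <stdio.h>\nint main(void) {"
completion = llm(
    prompt,
    max_new_tokens=1024,  # enough budget to complete the whole program
    temperature=0.2,      # low temperature, as in the StarCoder evaluation
)
print(completion)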
Table 2: Number of Vulnerable Codes Generated

Tool                 # of vulnerable codes   % of vulnerable codes   Avg Time (s)   LoC
ChatGPT              751/990                 75.9                    39.2           55.2
IR-ChatGPT           157/990                 15.9                    82.8           60.6
PR-StarCoder-15.5B   389/990                 39.3                    45.4           54.9
PR-SantaCoder-1.1B   459/990                 46.4                    72.4           44.4

Technical Setup. Our experiments were conducted on a MacBook Pro 2019 (Intel Core i9 with 16GB RAM) for the iteration repair and for the preshot repair with the GGML model. For the GPTQ model in the preshot strategy, we used an Ubuntu 18.04 server with an RTX 4090.

4.2 Experiment Results

RQ1: Vulnerability Prevention. We evaluated our tool on the 90 prompts that led to a vulnerable code generation and measured the number of vulnerable codes generated. We set a temperature of 0 for the repairs in the iteration repair, as a high temperature can lead to random output, as mentioned in the OpenAI documentation³. A code is determined vulnerable when the SAST tools detect a vulnerability. We also set the maximum number of iterations to 3 for the iteration repair. Table 2 shows the results of using Codexity to complete the vulnerable prompts of our benchmark. As mentioned in Section 4.1, the baseline tool ChatGPT generated 751 vulnerable programs (751/990 = 75.9%) with 1,645 detected vulnerabilities for the 990 code completion attempts in our dataset. In the iteration repair setting, Codexity only generated 157 (157/990 = 15.9%) vulnerable programs with 338 detected vulnerabilities. It therefore achieves a reduction of 60% in vulnerable programs. The results in Table 2 also show that preshot repair decreases the production of vulnerable codes by 36.6% when using StarCoder and by 29.5% for SantaCoder. In some cases (6.1% for StarCoder and 8.5% for SantaCoder), the tool did not output code but a comment, e.g., asking a question about the prompt. Because it is impossible to say whether the code is secure in this case, we considered the output vulnerable.

³ https://platform.openai.com/docs/api-reference/chat/create

RQ2: Comparison with FootPatch and GitHub Copilot. We compare Codexity with FootPatch on the 751 vulnerable subjects detected by Infer. FootPatch's Infer found 20 vulnerabilities, including 19 null dereferences and one memory leak, but patched none. The analysis of the log data revealed that FootPatch failed to address null dereferences when capturing the variable name:

1 Found error: NULL_DEREFERENCE
2 [+] Patchable error: [vulnerable_code.c]:[96]:[pointer newNode last assigned on line 95 could be null and is dereferenced at line 96, column 9]
3 [+] Patch generation routine started for bug "NULL_DEREF".
4 [+] Looking for pvar last in pname addSpecific
5 [=] I found these typs for pvar last
6 [-] No type for pvar found

Instead of extracting the pointer "newNode", FootPatch extracted "last", leading to an incorrect variable identification and a subsequent failure in the patch development. Similarly, Infer identified a vulnerability in the category Memory Leak but omitted the variable's name, complicating the patch process. This behavior of FootPatch is described in [25] as the case where the tool fails to resolve the program variable. We also compared Codexity with GitHub Copilot using its chat functionality. We asked Copilot to "write the next lines of the code" on our 90 prompts and took the first answer. Infer and CppCheck found 76 (84.44%) of the programs to be vulnerable.

RQ3: Tradeoffs. The iteration repair avoids generating vulnerabilities with a high success rate but requires multiple commercial LLM calls, increasing costs and generation time. On average, there are 1.8 repair iterations when a vulnerability is detected, and such a generation takes 98.6 seconds (see Table 2). This can be explained by the LLM introducing new vulnerabilities during an iteration, leading to additional iterations. On the other hand, preshot repair utilizes a local LLM, minimizing API calls and time. For the GPTQ model, the repair adds about 6 seconds. For the GGML model, the generation time is slower because the model runs on a CPU. Moreover, the vulnerable code generation rate is higher than for the iteration repair across all models because we do not give the preshot repair a second chance when a vulnerability is detected after the commercial LLM generation. The iteration repair strategy is optimal for users prioritizing secure code generation, albeit with a longer generation time. Conversely, the preshot repair strategy benefits those needing swift code generation while still obtaining a heightened level of security. For instance, preshot repair with ChatGPT coupled with StarCoder offers an average generation time similar to plain ChatGPT yet provides improved security.

5 RELATED WORK

He et al. [13] introduced an innovative learning strategy for improved secure code generation. Other research includes works by Olausson et al. [21], Pearce et al. [23], and Charalambous et al. [7], each exploring LLMs' efficacy in fixing code vulnerabilities. However, these studies often adopt vulnerability detection techniques that are not suitable for IDE-integrated code completion tasks. Pearce et al. [23] used CodeQL, a vulnerability detection tool that scales to large codebases but is inefficient for analyzing short, single code snippets. Charalambous et al. [7] implemented the Bounded Model Checking method, limiting the vulnerability spectrum. Conversely, Olausson et al. [21] chose to engage the model itself for analyzing the error message; adding an extra call to a model can increase the required detection time. Chen et al. [8] discussed using multiple LLMs to boost result quality. In contrast to these existing works, we integrate LLMs and SAST tools specifically for tackling vulnerabilities, while considering the efficiency and practicality of code completion tasks in the IDE.
6 CONCLUSION

We introduced Codexity, a tool that addresses vulnerabilities in LLM-generated code. It deploys two distinct repair strategies: iteration repair and preshot repair. Our evaluation shows that they can reduce the generation of vulnerable code by 60%. In the future, it will be essential to improve the code generation efficiency, e.g., by combining Codexity with LLM fine-tuning to reduce the need for repair iterations. Furthermore, it will be interesting to explore Codexity for other popular programming languages, such as Java or Python.

ACKNOWLEDGMENTS

This work was partially supported by a Singapore Ministry of Education (MoE) Tier 3 grant "Automated Program Repair", MOE-MOET32021-0001.

REFERENCES

[1] [n. d.]. GitHub Copilot. https://github.com/features/copilot, last accessed on 01/01/24.
[2] [n. d.]. ShareGPT. https://sharegpt.com/, last accessed on 01/01/24.
[3] [n. d.]. Stack Overflow. https://stackoverflow.com/, last accessed on 01/01/24.
[4] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988 (2023).
[5] Andrei Arusoaie, Stefan Ciobâca, Vlad Craciun, Dragos Gavrilut, and Dorel Lucanu. 2017. A Comparison of Open-Source Static Analysis Tools for Vulnerability Detection in C/C++ Code. In 2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). 161–168. https://doi.org/10.1109/SYNASC.2017.00035
[6] Cristiano Calcagno and Dino Distefano. 2011. Infer: An Automatic Program Verifier for Memory Safety of C Programs. In NASA Formal Methods, Mihaela Bobaru, Klaus Havelund, Gerard J. Holzmann, and Rajeev Joshi (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 459–465.
[7] Yiannis Charalambous, Norbert Tihanyi, Ridhi Jain, Youcheng Sun, Mohamed Amine Ferrag, and Lucas C. Cordeiro. 2023. A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification. arXiv preprint arXiv:2305.14752 (2023).
[8] Lingjiao Chen, Matei Zaharia, and James Zou. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv preprint arXiv:2305.05176 (2023).
[9] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323 (2022).
[10] Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, and Jiaxin Yu. 2023. Security Weaknesses of Copilot Generated Code in GitHub. arXiv preprint arXiv:2310.02059 (2023).
[11] Xiang Gao, Bo Wang, Gregory J. Duck, Ruyi Ji, Yingfei Xiong, and Abhik Roychoudhury. 2021. Beyond Tests: Program Vulnerability Repair via Crash Constraint Extraction. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–27.
[12] Georgi Gerganov. [n. d.]. GGML - AI at the edge. https://ggml.ai/, last accessed on 23/10/23.
[13] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. https://arxiv.org/abs/2302.05319
[14] Zhen Huang, David Lie, Gang Tan, and Trent Jaeger. 2019. Using Safety Properties to Generate Vulnerability Patches. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 539–554.
[15] Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How Secure is Code Generated by ChatGPT? arXiv preprint arXiv:2304.09655 (2023).
[16] Claire Le Goues, Michael Pradel, Abhik Roychoudhury, and Satish Chandra. 2021. Automatic Program Repair. IEEE Software 38, 4 (2021), 22–27. https://doi.org/10.1109/MS.2021.3072577
[17] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[18] Daniel Marjamaki. [n. d.]. Cppcheck: A Tool for Static C/C++ Code Analysis. https://sourceforge.net/projects/cppcheck/, last accessed on 18/10/23.
[19] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: Program Repair via Semantic Analysis. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 772–781.
[20] Yannic Noller, Ridwan Shariffdeen, Xiang Gao, and Abhik Roychoudhury. 2022. Trust Enhancement Issues in Program Repair. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). Association for Computing Machinery, New York, NY, USA, 2228–2240. https://doi.org/10.1145/3510003.3510040
[21] Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Demystifying GPT Self-Repair for Code Generation. arXiv preprint arXiv:2306.09896 (2023).
[22] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP). 754–768. https://doi.org/10.1109/SP46214.2022.9833571
[23] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining Zero-Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy (SP). 2339–2356. https://doi.org/10.1109/SP46215.2023.10179324
[24] Ridwan Shariffdeen, Yannic Noller, Lars Grunske, and Abhik Roychoudhury. 2021. Concolic Program Repair. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI). 390–405.
[25] Rijnard van Tonder and Claire Le Goues. 2018. Static Automated Program Repair for Heap Properties. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). Association for Computing Machinery, New York, NY, USA, 151–162. https://doi.org/10.1145/3180155.3180250
[26] Yuntong Zhang, Xiang Gao, Gregory J. Duck, and Abhik Roychoudhury. 2022. Program Vulnerability Repair via Inductive Inference. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 691–702.
A DEMO WALKTHROUGH

The demonstration of our tool is showcased in our video⁴, where we discuss the motivation, the various strategies, and the practical use of Codexity. Codexity's demonstration involves the code completion for the snippet depicted in Figure 4a. We incorporated our tool into a Visual Studio Code extension for practical usage. In our demonstration, we showcase three ways of automated code completion in Visual Studio Code:

(1) utilizing GitHub Copilot,
(2) utilizing Codexity with the Iteration Repair strategy, and
(3) utilizing Codexity with the Preshot Repair strategy.

Launching Codexity is similar to GitHub Copilot. The user needs to highlight the prompt that should be completed (e.g., a code comment or some partial code snippet) and then launch the extension, which triggers our guided code completion using ChatGPT in the background. We conclude the demonstration by comparing the outcomes of the three generated code variants. The code generated by GitHub Copilot exhibited a potential buffer overflow issue, as illustrated in Figure 4b at line 9, due to the possibility of the input exceeding the maximum size of the variable 'w'. In contrast, the iteration and preshot repair methods produce code that does not contain vulnerabilities (see Figures 4c and 4d).

Figure 4: Demo Screenshots. (a) User Prompt. (b) Code generated by GitHub Copilot. (c) Code generated by Codexity with Iteration Repair. (d) Code generated by Codexity with Preshot Repair.

⁴ https://www.youtube.com/watch?v=om6CYGfbNnM