Codexity: Secure AI-assisted Code Generation

Sung Yong Kim, National University of Singapore, Singapore
Zhiyu Fan, National University of Singapore, Singapore
Yannic Noller, Singapore University of Technology and Design, Singapore
Abhik Roychoudhury, National University of Singapore, Singapore

arXiv:2405.03927v1 [cs.SE] 7 May 2024
ABSTRACT

Despite the impressive performance of Large Language Models (LLMs) in software development activities, recent studies show the concern of introducing vulnerabilities into software codebases by AI programming assistants (e.g., Copilot, CodeWhisperer). In this work, we present Codexity, a security-focused code generation framework integrated with five LLMs. Codexity leverages the feedback of static analysis tools such as Infer and CppCheck to mitigate security vulnerabilities in LLM-generated programs. Our evaluation on a real-world benchmark with 751 automatically generated vulnerable subjects demonstrates that Codexity can prevent 60% of the vulnerabilities from being exposed to the software developer.

• Video: https://youtu.be/om6CYGfbNnM
• Tool and Data: https://github.com/Codexity-APR/Codexity, https://doi.org/10.5281/zenodo.10572275

KEYWORDS

security, code generation, large language model

1 INTRODUCTION

The recent advance of foundational Large Language Models (LLMs) for code has shown surprising results in many software engineering applications like code completion, code summarization, etc. Consequently, AI programming assistants like Copilot and CodeWhisperer have reshaped the way of software development. Despite their advantages and conveniences, researchers have expressed concerns that LLMs can introduce security vulnerabilities into the auto-generated code that developers could overlook [10, 15, 22]. Without a secure integration mechanism for this LLM-generated code, it could thus compromise the entire software system. Automated Program Repair (APR) [16] is a technique that aims to fix bugs automatically, with many successful applications in fixing security vulnerabilities [11, 14, 19, 24, 26]. However, existing APR approaches primarily focus on fixing human-written programs and typically assume timeouts of ≥ 1 hour [20]. Such assumptions might fit CI workflows or nightly-build scenarios but do not satisfy the tight requirements of the in-time vulnerability prevention needed for automated code generation/completion inside the IDE.

To tackle these challenges, we present Codexity, the first security-focused code generation framework. Codexity integrates LLMs with static analyzers [6, 18] to establish security awareness and act as the first guard to prevent potential vulnerabilities introduced by AI programming assistants. In our experiments with 990 real-world code completion attempts, we demonstrate that, compared to ChatGPT, Codexity prevents the generation of 60% of the vulnerabilities. Our core contributions are as follows:

• Codexity, a security-focused code generation framework providing timely vulnerability-free LLM-generated programs for developers. Additionally, Codexity employs a flexible architecture that allows user-customized LLM deployment.
• A real-world benchmark for LLM-generated programs, consisting of 751 vulnerable subjects from 90 publicly available prompts.

2 MOTIVATING EXAMPLE

Listing 1 displays a program generated by an LLM. It is meant to answer a StackOverflow inquiry¹ about storing user-input strings in variables. Our assembly of Static Application Security Testing (SAST) tools identified a potential vulnerability (highlighted in red) that is related to CWE-119:

"CWE-119 identified at lines 7 and 9. The triggered concern: scanf() without field width limits can crash with huge input data."

To prevent such vulnerabilities from being exposed to the software developer, we provide this additional feedback to the LLM in the form of a structured prompt, emulating a dialogue with the goal of crafting code devoid of these flaws (see Figure 1). Consequently, the LLM responded by producing the revised code, eliminating the buffer overflow issue, as evidenced in Listing 2. As highlighted in green, the code is no longer vulnerable because the format specifier %49s is used in both scanf calls, meaning that scanf will write at most 49 characters (plus the terminating NUL) into the 50-byte arrays. By specifying the maximum number of characters, the code prevents buffer overflow vulnerabilities if the input data exceeds the allocated buffer size.

¹ https://stackoverflow.com/questions/19063213/string-ask-and-store
1  #include <stdio.h>
2  #include <errno.h>
3  #include <string.h>
4  int main(void) {
5      char reference[50], query[50];
6      printf("\nEnter reference genome file name: ");
7      scanf("%s", reference);
8      printf("\nEnter query genome file name: ");
9      scanf("%s", query);
10     return 0;
11 }

Listing 1: Vulnerable code generated by LLM

1  #include <stdio.h>
2  #include <errno.h>
3  #include <string.h>
4  int main(void) {
5      char reference[50], query[50];
6      printf("\nEnter reference genome file name: ");
7      scanf("%49s", reference);
8      printf("\nEnter query genome file name: ");
9      scanf("%49s", query);
10     return 0;
11 }

Listing 2: Safe code generated by LLM
Figure 1: Conversational Prompt (Vulnerable Code is a placeholder for the code in Listing 1)

Figure 2: Codexity's Repair Strategies. (a) Iteration Repair. (b) Preshot Repair.
3 DESIGN AND METHODOLOGY

This section introduces the workflow of Codexity and its two vulnerability-preventing repair strategies.

3.1 Codexity Workflow

In Codexity's workflow, the user first needs to select a repair strategy in the configuration settings to activate the system. Codexity currently offers two strategies, Iteration Repair and Preshot Repair, catering to different requirements in computation power. Then, the user can invoke Codexity to complete their code while programming. Codexity takes the existing code snippet to initiate a prompt and generates an initial completion with the selected LLM. The completed code is then routed to a vulnerability detection phase backed by a series of static analysis tools. Codexity integrates the two state-of-the-art static analyzers CppCheck [18] and Infer [6]. We selected these two tools to examine a wide range of vulnerabilities: like Arusoaie et al. [5], we found CppCheck to be a generally good candidate, and we added Infer because of its strong ability to find memory-related bugs. If the static analysis tools report any vulnerability, Codexity extracts the error/warning message and location information and, together with the vulnerable program, formulates a vulnerability-exposing prompt (see Figure 1). Finally, Codexity sends the vulnerability-exposing prompt to the LLM in the background and requests a vulnerability-free program.
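To make the detection-and-prompting step concrete, the sketch below illustrates how a completed snippet could be checked with CppCheck and how the resulting warnings could be folded into a vulnerability-exposing prompt. This is a minimal Python illustration, not Codexity's actual implementation: the CppCheck flags shown are standard CLI options, while the prompt template merely mimics the structure of Figure 1.

import os
import subprocess
import tempfile

def analyze_with_cppcheck(code: str) -> list[str]:
    """Run CppCheck on a C snippet and return its warning lines."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".c", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # --enable=all activates additional checks; CppCheck prints its
        # diagnostics on stderr by default.
        result = subprocess.run(
            ["cppcheck", "--enable=all", path],
            capture_output=True, text=True,
        )
    finally:
        os.unlink(path)
    return [line for line in result.stderr.splitlines() if line.strip()]

def build_repair_prompt(code: str, warnings: list[str]) -> str:
    """Fold analyzer feedback into a vulnerability-exposing prompt."""
    report = "\n".join(warnings)
    return (
        "The following C program contains security issues:\n"
        f"{code}\n"
        "Static analysis reported:\n"
        f"{report}\n"
        "Please return a fixed version of the complete program "
        "that does not contain these issues."
    )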
3.2 Strategy 1: Iteration Repair

We now describe the default vulnerability-preventing strategy in Codexity. The Iteration Repair strategy employs commercial LLMs to generate vulnerability-free programs and provides an interface that allows users to specify the exact background LLM engine. The goal is to leverage the most powerful LLM that the user can access to provide the most secure code generation service. Moreover, API-accessible LLMs do not require users to be locally equipped with a high-end GPU for model inference. After generating a code completion, if a vulnerability is detected, the vulnerable code is sent back to the commercial LLM using its API. We repeat this until the LLM generates a program for which the static analyzers report no vulnerability.
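A minimal sketch of this loop, reusing the helpers sketched above, might look as follows. The complete argument stands in for a call to the chosen commercial LLM's API and is an assumption for illustration, not Codexity's exact interface.

MAX_ITERATIONS = 3  # the bound used in our evaluation (see Section 4)

def iteration_repair(prompt: str, complete) -> tuple[str, bool]:
    """Query the remote LLM until the analyzers are silent or the
    iteration budget is exhausted; `complete` maps a prompt to code."""
    code = complete(prompt)
    for _ in range(MAX_ITERATIONS):
        warnings = analyze_with_cppcheck(code)
        if not warnings:
            return code, True  # analyzers report no vulnerability
        # Feed the vulnerable code plus the analyzer report back to the LLM.
        code = complete(build_repair_prompt(code, warnings))
    return code, not analyze_with_cppcheck(code)  # final check after budget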
3.3 Strategy 2: Preshot Repair

Iteration Repair aims to generate the best result by iteratively querying the model until a secure answer is found or the maximum number of allowed queries is reached. Although commercial LLMs provide better results, they may introduce high costs for users. To balance the trade-off between cost and quality, we present an alternative vulnerability-preventing strategy called Preshot Repair, shown in Figure 2b. We hypothesize that a weaker local (and hence cheaper) LLM is more prone to generate vulnerabilities; such vulnerability information can be included in the query to the more costly and powerful LLM. Consequently, we aim to anticipate and prevent the flaws likely to surface in our target LLM's output, hence the term Preshot Repair. We therefore first query a local LLM to generate an initial (possibly vulnerable) completion, employ the static analyzers to detect any vulnerability, and subsequently use the analysis report in the prompt generation for the target LLM.
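The preshot variant needs only one remote call: the local model's draft is analyzed first, and the report is attached to the single query sent to the target LLM. Again a hedged sketch reusing the helpers above; local_complete and remote_complete are placeholders for a local quantized model and a commercial API, respectively.

def preshot_repair(prompt: str, local_complete, remote_complete) -> str:
    """Anticipate likely flaws with a cheap local draft, then query the
    powerful target LLM once, with the analyzer report included."""
    draft = local_complete(prompt)      # cheap, possibly vulnerable
    warnings = analyze_with_cppcheck(draft)
    if warnings:
        # Warn the target LLM up front about the flaws the draft exposed.
        prompt = build_repair_prompt(draft, warnings)
    return remote_complete(prompt)      # single commercial call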
4 EVALUATION

We show Codexity's effectiveness in securely handling developers' code completion requests based on two real-world datasets, and we compare Codexity's performance with FootPatch [25] and GitHub Copilot [1]. In particular, we explore three research questions:

RQ1: How effective is Codexity in preventing the generation of vulnerable programs?
RQ2: How does Codexity perform compared to FootPatch and GitHub Copilot?
RQ3: What are the benefits/drawbacks of the proposed workflows?

4.1 Evaluation Setup

Benchmark Construction. To simulate real-world scenarios of how AI-assisted programming may help software developers, we curated a benchmark of developer prompts that are prone to generate vulnerable programs from ShareGPT [2] and StackOverflow [3]. ShareGPT is a platform where LLM users can post their experiences in formulating prompts and interacting with LLMs, whereas StackOverflow is a popular programming-oriented question-answering platform. Figure 3 shows the benchmark construction process.

Figure 3: Dataset Construction

We first collected 403 user posts relevant to C programming queries by filtering all posts in the ShareGPT dataset and the first 700 posts on StackOverflow using 'c' and 'int main' as keywords. Each user post consists of a question and an LLM response. We then employed a two-round vulnerable-prompt detection strategy. In the first round, we extracted the code snippet from each LLM response and ran the static analyzers in Codexity on the extracted code snippets of the 403 programming posts. As a result, we identified 124 posts containing vulnerabilities in their response.

In the second round, we extracted the non-vulnerable part of the code snippet from the 124 posts and asked ChatGPT to complete the programming questions (using two temperature configurations) to confirm further whether they are prone to generate vulnerable programs. First, ChatGPT generated one completion for each prompt at temperature 0. Then, we executed ChatGPT again to generate ten completions for each prompt at temperature 0.8. We then reran the static analyzers on the 124×11 = 1,364 LLM-generated code completions. We classified a prompt as vulnerable if (1) ChatGPT produces a vulnerable program when the temperature is set to 0, or (2) ChatGPT produces at least one vulnerable program in one of the ten runs when the temperature is set to 0.8. Finally, we ended up with 90 vulnerable prompts and generated 751 vulnerable code completions with 1,645 vulnerabilities.
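The second-round classification criterion can be summarized in a few lines. The sketch below is illustrative only; generate stands for a hypothetical ChatGPT call at a given temperature, and is_vulnerable for a run of the static analyzers in Codexity.

def prompt_is_vulnerable(prompt, generate, is_vulnerable) -> bool:
    """Round-two criterion: a prompt counts as vulnerable if the greedy
    completion is vulnerable, or if any of ten samples at temperature
    0.8 is vulnerable (11 completions per prompt in total)."""
    if is_vulnerable(generate(prompt, temperature=0.0)):
        return True
    return any(
        is_vulnerable(generate(prompt, temperature=0.8))
        for _ in range(10)
    )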
Table 1 shows the details of our benchmark, categorized by the vulnerabilities identified by CppCheck and Infer. We employed CppCheck for common CWEs and Infer for specific types like memory leaks. Note that subjects can have multiple vulnerabilities. We defaulted to the CWE classification when vulnerabilities overlapped.

Table 1: Vulnerability Categories

Vulnerability        Total | Vulnerability           Total
Null Dereference      384  | CWE-686                   15
Nullptr Dereference   363  | Integer Overflow L2       13
Resource Leak         196  | CWE-758                   12
Buffer Overrun L2     119  | CWE-197                   12
Memory Leak           109  | CWE-562                   11
CWE-119                90  | Buffer Overrun L1         11
CWE-457                81  | CWE-467                   10
CWE-401                66  | Use After Lifetime         8
Buffer Overrun L3      62  | CWE-685                    7
CWE-775                32  | CWE-476                    4
CWE-788                22  | Use After Free             2
Buffer Overrun S2      15  | Inferbo Alloc Is Zero      1

Baseline Tools. We use FootPatch and GitHub Copilot as baseline tools in our comparison. We selected FootPatch because it also uses a SAST tool, Infer, to detect vulnerabilities. We also decided to compare with GitHub Copilot because it is one of the most popular code completion tools and runs with a similar GPT model.

Models. Throughout the experiments, we used the latest ChatGPT (gpt-3.5-turbo) as the commercial LLM code completion and fixing engine for Codexity. For the preshot repair, we used two different specialized code completion models: StarCoder [17] in GPTQ [9] format and SantaCoder [4] in GGML [12] format. These formats allow quantization, making the models suitable for CPUs or less powerful GPUs. The models were run with a maximum prediction length of 1024 tokens, to be able to complete the whole code, and with a temperature of 0.2 (following the same evaluation setup as the authors of StarCoder). For the GPTQ model, we used the publicly available model in the Hugging Face repository². Hugging Face is the equivalent of GitHub for sharing models. We obtained the GGML model by following the instructions in their GitHub repository.

² https://huggingface.co/TheBloke/starcoder-GPTQ
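To give a sense of the local-model setup, the sketch below shows how a quantized completion model could be driven with the generation parameters above (at most 1024 new tokens, temperature 0.2). This is an illustrative sketch only: it assumes the ctransformers package for GGML inference, and the model path and model type are placeholders rather than the exact configuration used in our experiments.

# Illustrative sketch (assumption: the `ctransformers` package is installed
# and a quantized GGML model file has been downloaded locally).
from ctransformers import AutoModelForCausalLM

# Placeholder path; obtain the quantized weights as described in the text.
llm = AutoModelForCausalLM.from_pretrained(
    "models/santacoder-ggml.bin",
    model_type="starcoder",  # architecture family for SantaCoder/StarCoder
)

prompt = "#include <stdio.h>\nint main(void) {"
completion = llm(
    prompt,
    max_new_tokens=1024,  # enough budget to complete the whole program
    temperature=0.2,      # low temperature, as in the StarCoder evaluation
)
print(completion)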
Table 2: Number of Vulnerable Codes Generated

Tool                 # of vulnerable codes   % of vulnerable codes   Avg Time (s)   LoC
ChatGPT              751/990                 75.9                    39.2           55.2
IR-ChatGPT           157/990                 15.9                    82.8           60.6
PR-StarCoder-15.5B   389/990                 39.3                    45.4           54.9
PR-SantaCoder-1.1B   459/990                 46.4                    72.4           44.4

Technical Setup. Our experiments were conducted on a MacBook Pro 2019 (Intel Core i9 with 16GB RAM) for the iteration repair and for the preshot repair with the GGML model. For the GPTQ model in the preshot strategy, we used an Ubuntu 18.04 server with an RTX 4090.

4.2 Experiment Results

RQ1: Vulnerability Prevention. We evaluated our tool on the 90 prompts that led to a vulnerable code generation and measured the number of vulnerable codes generated. We set a temperature of 0 for the repairs in the iteration repair, as a high temperature can lead to random output, as mentioned in the OpenAI documentation³. A code is determined vulnerable when the SAST tools detect a vulnerability. We also set the maximum number of iterations to 3 for the iteration repair. Table 2 shows the results of using Codexity to complete the vulnerable prompts of our benchmark. As mentioned in Section 4.1, the baseline tool ChatGPT generated 751 vulnerable programs (751/990 = 75.9%) with 1,645 detected vulnerabilities for the 990 code completion attempts in our dataset. In the iteration repair setting, Codexity only generated 157 (157/990 = 15.9%) vulnerable programs with 338 detected vulnerabilities. It therefore achieves a reduction of 60% in vulnerable programs. The results in Table 2 also show that preshot repair decreases the production of vulnerable codes by 36.6% when using StarCoder and by 29.5% for SantaCoder. In some cases (6.1% for StarCoder and 8.5% for SantaCoder), the tool did not output code but a comment, e.g., asking a question about the prompt. Because it is impossible to say whether the code is secure in this case, we considered the output vulnerable.

³ https://platform.openai.com/docs/api-reference/chat/create

RQ2: Comparison with FootPatch and GitHub Copilot. We compare Codexity with FootPatch on the 751 vulnerable subjects detected by Infer. FootPatch's Infer found 20 vulnerabilities, including 19 null dereferences and one memory leak, but patched none. The analysis of the log data revealed that FootPatch failed to address null dereferences when capturing the variable name:

1 Found error: NULL_DEREFERENCE
2 [+] Patchable error: [vulnerable_code.c]:[96]:[pointer newNode last assigned on line 95 could be null and is dereferenced at line 96, column 9]
3 [+] Patch generation routine started for bug "NULL_DEREF".
4 [+] Looking for pvar last in pname addSpecific
5 [=] I found these typs for pvar last
6 [-] No type for pvar found

Instead of extracting the pointer "newNode", FootPatch extracted "last", leading to an incorrect variable identification and a subsequent failure in the patch development. Similarly, Infer identified a vulnerability in the category Memory Leak but omitted the variable's name, complicating the patch process. This behavior of FootPatch is described in [25] as the case where the tool fails to resolve the program variable. We also compared Codexity with GitHub Copilot using its chat functionality. We asked Copilot to "write the next lines of the code" on our 90 prompts and took the first answer. Infer and CppCheck found 76 (84.44%) of the programs to be vulnerable.

RQ3: Tradeoffs. The iteration repair avoids generating vulnerabilities with a high success rate but requires multiple commercial LLM calls, increasing costs and generation time. On average, there are 1.8 repair iterations when a vulnerability is detected, and such a generation takes 98.6 seconds (see Table 2). This can be explained by the LLM introducing new vulnerabilities during an iteration, leading to additional iterations. On the other hand, preshot repair utilizes a local LLM, minimizing API calls and time. For the GPTQ model, the repair adds about 6 seconds. For the GGML model, the generation time is slower because the model runs on a CPU. Moreover, the vulnerable code generation rate is higher than for the iteration repair across all models because we do not give the preshot repair a second chance when a vulnerability is detected after the commercial LLM generation. The iteration repair strategy is optimal for users prioritizing secure code generation, albeit with a longer generation time. Conversely, the preshot repair strategy benefits those needing swift code generation while still obtaining a heightened level of security. For instance, preshot repair with ChatGPT coupled with StarCoder offers an average generation time similar to plain ChatGPT yet provides improved security.

5 RELATED WORK

He et al. [13] introduced an innovative learning strategy for improved secure code generation. Other research includes works by Olausson et al. [21], Pearce et al. [23], and Charalambous et al. [7], each exploring LLMs' efficacy in fixing code vulnerabilities. However, these studies often adopt vulnerability detection techniques that are not suitable for IDE-integrated code completion tasks. Pearce et al. [23] used CodeQL, a vulnerability detection tool that scales to large codebases but is inefficient for analyzing short, single code snippets. Charalambous et al. [7] implemented the Bounded Model Checking method, limiting the vulnerability spectrum. Conversely, Olausson et al. [21] chose to engage the model itself for analyzing the error message; adding an extra call to a model can increase the required detection time. Chen et al. [8] discussed using multiple LLMs to boost result quality. In contrast to these existing works, we integrate LLMs and SAST tools specifically for tackling vulnerabilities, while considering the efficiency and practicality of code completion tasks in the IDE.
6 CONCLUSION

We introduced Codexity, a tool that addresses vulnerabilities in LLM-generated code. It deploys two distinct repair strategies: iteration repair and preshot repair. Our evaluation shows that they can reduce the generation of vulnerable code by 60%. In the future, it will be essential to improve the code generation efficiency, e.g., by combining Codexity with LLM fine-tuning to reduce the need for repair iterations. Furthermore, it will be interesting to explore Codexity for other popular programming languages, such as Java or Python.

ACKNOWLEDGMENTS

This work was partially supported by a Singapore Ministry of Education (MoE) Tier 3 grant "Automated Program Repair", MOE-MOET32021-0001.

REFERENCES

[1] [n. d.]. GitHub Copilot. https://github.com/features/copilot, last accessed on 01/01/24.
[2] [n. d.]. ShareGPT. https://sharegpt.com/, last accessed on 01/01/24.
[3] [n. d.]. Stack Overflow. https://stackoverflow.com/, last accessed on 01/01/24.
[4] Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. 2023. SantaCoder: don't reach for the stars! arXiv preprint arXiv:2301.03988 (2023).
[5] Andrei Arusoaie, Stefan Ciobâca, Vlad Craciun, Dragos Gavrilut, and Dorel Lucanu. 2017. A Comparison of Open-Source Static Analysis Tools for Vulnerability Detection in C/C++ Code. In 2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). 161–168. https://doi.org/10.1109/SYNASC.2017.00035
[6] Cristiano Calcagno and Dino Distefano. 2011. Infer: An Automatic Program Verifier for Memory Safety of C Programs. In NASA Formal Methods, Mihaela Bobaru, Klaus Havelund, Gerard J. Holzmann, and Rajeev Joshi (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 459–465.
[7] Yiannis Charalambous, Norbert Tihanyi, Ridhi Jain, Youcheng Sun, Mohamed Amine Ferrag, and Lucas C. Cordeiro. 2023. A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification. arXiv preprint arXiv:2305.14752 (2023).
[8] Lingjiao Chen, Matei Zaharia, and James Zou. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv preprint arXiv:2305.05176 (2023).
[9] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv preprint arXiv:2210.17323 (2022).
[10] Yujia Fu, Peng Liang, Amjed Tahir, Zengyang Li, Mojtaba Shahin, and Jiaxin Yu. 2023. Security Weaknesses of Copilot Generated Code in GitHub. arXiv preprint arXiv:2310.02059 (2023).
[11] Xiang Gao, Bo Wang, Gregory J. Duck, Ruyi Ji, Yingfei Xiong, and Abhik Roychoudhury. 2021. Beyond Tests: Program Vulnerability Repair via Crash Constraint Extraction. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 1–27.
[12] Georgi Gerganov. [n. d.]. GGML - AI at the edge. https://ggml.ai/, last accessed on 23/10/23.
[13] Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. https://arxiv.org/abs/2302.05319
[14] Zhen Huang, David Lie, Gang Tan, and Trent Jaeger. 2019. Using Safety Properties to Generate Vulnerability Patches. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 539–554.
[15] Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, and Baba Mamadou Camara. 2023. How Secure is Code Generated by ChatGPT? arXiv preprint arXiv:2304.09655 (2023).
[16] Claire Le Goues, Michael Pradel, Abhik Roychoudhury, and Satish Chandra. 2021. Automatic Program Repair. IEEE Software 38, 4 (2021), 22–27. https://doi.org/10.1109/MS.2021.3072577
[17] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[18] Daniel Marjamaki. [n. d.]. Cppcheck: A Tool for Static C/C++ Code Analysis. https://sourceforge.net/projects/cppcheck/, last accessed on 18/10/23.
[19] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: Program Repair via Semantic Analysis. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 772–781.
[20] Yannic Noller, Ridwan Shariffdeen, Xiang Gao, and Abhik Roychoudhury. 2022. Trust Enhancement Issues in Program Repair. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). Association for Computing Machinery, New York, NY, USA, 2228–2240. https://doi.org/10.1145/3510003.3510040
[21] Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Demystifying GPT Self-Repair for Code Generation. arXiv preprint arXiv:2306.09896 (2023).
[22] Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. In 2022 IEEE Symposium on Security and Privacy (SP). 754–768. https://doi.org/10.1109/SP46214.2022.9833571
[23] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining Zero-Shot Vulnerability Repair with Large Language Models. In 2023 IEEE Symposium on Security and Privacy (SP). 2339–2356. https://doi.org/10.1109/SP46215.2023.10179324
[24] Ridwan Shariffdeen, Yannic Noller, Lars Grunske, and Abhik Roychoudhury. 2021. Concolic Program Repair. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI). 390–405.
[25] Rijnard van Tonder and Claire Le Goues. 2018. Static Automated Program Repair for Heap Properties. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). Association for Computing Machinery, New York, NY, USA, 151–162. https://doi.org/10.1145/3180155.3180250
[26] Yuntong Zhang, Xiang Gao, Gregory J. Duck, and Abhik Roychoudhury. 2022. Program Vulnerability Repair via Inductive Inference. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 691–702.
A DEMO WALKTHROUGH

The demonstration of our tool is showcased in our video⁴, where we discuss the motivation, the various strategies, and the practical use of Codexity. Codexity's demonstration involves the code completion for the snippet depicted in Figure 4a. We incorporated our tool into a Visual Studio Code extension for practical usage. In our demonstration, we showcase three ways of automated code completion in Visual Studio Code:

(1) utilizing GitHub Copilot,
(2) utilizing Codexity with the Iteration Repair strategy, and
(3) utilizing Codexity with the Preshot Repair strategy.

Launching Codexity is similar to GitHub Copilot. The user needs to highlight the prompt that should be completed (e.g., a code comment or some partial code snippet) and then launch the extension, which triggers our guided code completion using ChatGPT in the background. We conclude the demonstration by comparing the outcomes of the three generated code variants. The code generated by GitHub Copilot exhibited a potential buffer overflow issue, as illustrated in Figure 4b at line 9, due to the possibility of the input exceeding the maximum size of the variable 'w'. In contrast, the iteration and preshot repair methods produce code that does not contain vulnerabilities (see Figures 4c and 4d).

Figure 4: Demo Screenshots. (a) User Prompt. (b) Code generated by GitHub Copilot. (c) Code generated by Codexity with Iteration Repair. (d) Code generated by Codexity with Preshot Repair.

⁴ https://www.youtube.com/watch?v=om6CYGfbNnM