(Research) Can LLMs Patch Security Issues?
1) Use Parameterized Queries: Parameterized queries ensure that user input is treated as a literal value rather than
executable code. Most database libraries provide a way to create these queries, also known as prepared statements.
2) Manually Escape and Quote Table Names: Since parameterized queries do not support table or column names, you
can manually ensure that table names are valid, using a whitelist approach where only approved table names are used.
This strategy can be risky; use it with caution, and only when other strategies are not applicable.
3) Use an ORM (Object-Relational Mapping) Library: ORMs provide an abstraction over SQL by allowing you to interact
with the database using your programming language's constructs, which mitigates the risk of SQL injection. Libraries such
as SQLAlchemy for Python handle escaping and quoting internally in a secure manner.
Figure 1: Overview of our approach: the model first generates code. This code is then analyzed with Bandit, a static
code analysis tool, to determine whether it contains any security issues. Feedback on any identified issues is then
given to the model to generate possible solutions for resolving them. Finally, each proposed solution is sent back to
the model for code refinement.
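To make strategies (1) and (2) above concrete, the following is a minimal sketch (ours, for illustration only; the table allow-list and the status column are hypothetical) of a SQLite query that binds the user-supplied value through a placeholder and validates the table name against an allow-list, since identifiers cannot be passed as query parameters:

    import sqlite3

    ALLOWED_TABLES = {"users", "orders"}  # hypothetical allow-list (strategy 2)

    def count_rows_with_status(database_name, table_name, status):
        if table_name not in ALLOWED_TABLES:
            raise ValueError("unexpected table name: " + table_name)
        conn = sqlite3.connect(database_name)
        try:
            cursor = conn.cursor()
            # Strategy 1: the value is bound as a parameter, so it is treated
            # as data rather than as executable SQL.
            cursor.execute(
                "SELECT COUNT(*) FROM {} WHERE status = ?".format(table_name),
                (status,),
            )
            return cursor.fetchone()[0]
        finally:
            conn.close()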
an extensive dataset from real-world applications that includes natural language prompts, each paired with a unit test. These tests are designed to verify the correctness of the generated and refined code.

In summary, this work:

• We introduce Feedback-Driven Security Patching (FDSP), a technique that enhances LLMs' ability to generate potential solutions for addressing security issues in code by receiving feedback from Bandit and LLMs.

• We evaluate the abilities of the most advanced LLMs, including GPT-4, GPT-3.5, and CodeLlama, to generate and refine insecure code. We use three benchmarks and employ five baseline techniques for this evaluation.

• We present PythonSecurityEval, a dataset for evaluating the ability of LLMs to generate secure code. Our dataset includes natural language prompts paired with unit tests.

• We evaluate the generated code using Bandit to determine whether it has any security issues. We report the percentage of secure code for each dataset and approach in Table 1.

2 Related work

Language models for code: Applying deep learning methods to source code has demonstrated remarkable effectiveness across various coding tasks, including code generation (Zhou et al., 2023), debugging (Alrashedy et al., 2023), and code repair (Shypula et al., 2023). The first step in our experiment is to generate code from natural language, as in the text-to-code generation task. Several works have proposed pre-trained models specialized for code generation (Wang et al., 2020; Scholak et al., 2021). Other works have shown that LLMs achieve state-of-the-art performance on code generation tasks without fine-tuning (Nijkamp et al., 2023; Athiwaratkun et al., 2023). In our work, we consider two powerful LLMs to generate and refine code.

Refinement of LLMs: Recent studies have demonstrated that LLMs can refine their own output or adapt based on feedback from external tools or human input. Madaan et al. (2023) introduced Self-Refine, in which the model produces an initial output that is then fed back into the model to generate feedback; this feedback is subsequently used to improve the initial output. The paper presents a comprehensive evaluation of the approach across seven tasks and demonstrates significant improvements in refining the output. A similar technique, Self-Debugging, has the model generate code and then feed the generated code back into itself to produce an explanation; this explanation is used as feedback, together with compiler errors, to refine the generated code (Chen et al., 2023). Another study (Gou et al., 2023) introduced CRITIC, which enables the model to engage with external tools to obtain feedback, thereby improving its ability to refine its output. These studies feed feedback from either the model or external tools directly back to the model to improve the output. In our work, we feed the model with the output and the feedback from the external tool, and instruct the model to generate potential solutions to fix the security issues.
Algorithm 1 Feedback-Driven Security Patching (FDSP) algorithm
Require: Input x, model M, prompts {pVF, pBF}, number of potential solutions K, number of iterations N
Ensure: Refined output C from the model M
1: Initialize output C from M(x)                      ▷ Generate initial code
2: FDSP ← M(C, pVF, pBF)                              ▷ Generate potential solutions (Eqn. 2)
3: for each potential solution pt ∈ FDSP do           ▷ Iterate over potential solutions
4:     for n ← 1 to N do
5:         Refine_code ← M(C || (pVF, pBF, pt))
6:         if Bandit(Refine_code) is secure then      ▷ Stop condition
7:             Return Refine_code
8:         end if
9:     end for
10: end for
11: Return C
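A minimal Python sketch of the FDSP loop in Algorithm 1 is shown below; the callables model() (a text-in/text-out LLM call) and bandit_scan() (returning Bandit's report for a code string, or an empty string when no issue is found), as well as the prompt wording, are our own stand-ins and not the paper's implementation:

    def fdsp(task, model, bandit_scan, K=5, N=3):
        # Initial generation, followed by a Bandit check.
        code = model("Write a Python function for the following task:\n" + task)
        report = bandit_scan(code)
        if not report:                       # already secure, nothing to patch
            return code

        # Eq. (1): verbalize Bandit's terse report into a detailed explanation.
        verbalized = model(
            "Code:\n" + code + "\n\nBandit feedback:\n" + report +
            "\n\nExplain these security issues in detail."
        )

        # Eq. (2): ask for K distinct candidate fixes, separated by '---' lines.
        proposals = model(
            "Code:\n" + code + "\n\nBandit feedback:\n" + report +
            "\n\nExplanation:\n" + verbalized +
            f"\n\nPropose {K} distinct solutions to fix the issues, "
            "separated by lines containing only '---'."
        )

        for solution in proposals.split("\n---\n")[:K]:   # each candidate fix
            for _ in range(N):                            # up to N refinement rounds
                refined = model(
                    "Code:\n" + code + "\n\nBandit feedback:\n" + report +
                    "\n\nExplanation:\n" + verbalized +
                    "\n\nApply this fix and return the full function:\n" + solution
                )
                if not bandit_scan(refined):              # stop condition: Bandit is clean
                    return refined
        return code                                       # no candidate passed; keep original

The nested loops mirror lines 3-9 of Algorithm 1: each of the K candidate fixes gets up to N refinement attempts, and the first version that passes Bandit is returned.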
Source of feedback: There are various methods for procuring feedback to improve a model's output. While human feedback is often the most accurate, it is also quite costly and time-intensive (Elgohary et al., 2021; Yuntao Bai, 2023). Another method has the model self-generate feedback to enhance its output, as seen in (Chen et al., 2023) and (Madaan et al., 2023). Additionally, some research illustrates how more powerful models like GPT-4 can provide feedback to smaller models to refine their outputs (Olausson et al., 2023). In this paper, we receive feedback from both the external tool and the model, and then use this feedback to enhance the model's ability to generate solutions for fixing the buggy code.

3 Methodology

3.1 Background

Recent research has shown that LLMs can refine and enhance their outputs through feedback, whether self-generated, from external tools, or from human input. As demonstrated in (Madaan et al., 2023), the LLM generates an initial output, which is then sent back to the LLM for feedback to enhance the output. This iterative process of generating the initial output and receiving feedback helps the LLM improve its performance. The paper shows improved results on seven different tasks, including two related to coding: code optimization and code readability. Another study shows that an LLM can generate and debug code by itself, a technique named self-debugging (Chen et al., 2023). The concept involves sending the generated code back to the LLM itself to create an explanation of the code. This explanation, along with compiler error messages, is then fed back to the LLM as feedback.

Studies also highlight the role of external tools in improving model outputs. Given LLMs' capabilities not only to refine outputs based on feedback but also to generate code, unit tests, and documentation, and to review and complete code, our interest is in exploring how LLMs can generate potential solutions for addressing security issues in code.

3.2 Feedback-Driven Security Patching (FDSP)

The core idea of Feedback-Driven Security Patching (FDSP) is to enable the model to generate multiple potential solutions for addressing vulnerabilities. The input to the model includes (1) an insecure code snippet, (2) feedback from Bandit, a static analysis tool designed for Python code, and (3) feedback from the LLM that verbalizes Bandit's feedback with additional explanatory details. The model then proposes K potential solutions to address the identified vulnerabilities. Each solution is repeatedly fed back into the model along with the insecure code and Bandit's feedback across multiple iterations, denoted as N, with the objective of fixing the security issue. When Bandit provides feedback about the security issues, as shown in Figure 2, this feedback is sent to the LLM for detailed verbalization of the issues. Then, the verbalized feedback, along with the code, is sent back to the LLM to generate solutions. In each iteration, we test the fixed code with Bandit. If there is no feedback from
Bandit, which means the issue is resolved, we stop the iterations before reaching N. This study focuses on the Python language due to its popularity and its significance in evaluating LLMs' capabilities in generating and understanding code.

3.3 Preliminaries

We test the generated code using Bandit to provide feedback, as shown in Fig. 2. However, this feedback is not very informative for the LLM. To enhance its usefulness, we send the Bandit feedback along with the vulnerable code to the LLM for verbalization. The advantage of this approach is that the LLM offers more detailed explanations of the security issues, thereby providing more information to the LLM.

The verbalization approach begins by sending the Bandit feedback, denoted as BF, along with the vulnerable code, represented as C. Instructions are then given to the model to verbalize the feedback. Following this, the Large Language Model (LLM) generates new feedback that includes more detailed explanations and potentially offers solutions for addressing the security issues. The formula for verbalization is shown below.

Verbalization: Given vulnerable code C, Bandit feedback pBF, and model M, the objective for the model is to verbalize the feedback, as shown in Equation 1:

pVF ← M(C, pBF)    (1)

The drawback of verbalization is that it only provides explanations of the Bandit feedback, which do not always offer potential solutions to fix the issues. We propose a novel approach that prompts the LLM to generate potential solutions to these issues. Our key idea is to give the LLM the vulnerable code, denoted as C, along with the Bandit feedback, BF, and the verbalization feedback, denoted as VF. The model then proposes K distinct solutions to fix the security issues. Subsequently, each potential solution is sent back to the LLM, along with C, BF, and VF, for refinement. As shown in Algorithm 1, we iterate the process for each potential solution until the conditions are met, which occurs when the security issue is fixed (as verified by Bandit) or when the iteration reaches the maximum number of iterations, N. The key formula of our approach is presented in Equation 2.

FDSP: Given vulnerable code C, verbalization feedback pVF, and Bandit feedback pBF, the model M generates K possible solutions to refine the code:

FDSP ← M(C, pVF, pBF)    (2)

4 Experimental Settings

The goal of this paper is to evaluate how LLMs can address security issues in code, and to identify the limitations of current datasets and approaches while proposing new solutions. Two well-known datasets, LLMSecEval and SecurityEval, contain a limited number of code samples, insufficient for large-scale evaluation. A significant challenge is that once an LLM generates a fix, there is no unit test to verify the code's functionality, which raises the concern that while security issues may be addressed, the functionality of the code might be altered. To overcome this limitation, we introduce a new, large dataset comprising 470 natural language (NL) prompts collected from Stack Overflow, each accompanied by a unit test to ensure the correctness of the generated and refined code. Furthermore, current methods for fixing security issues through LLMs, such as direct prompts, self-debugging, and feedback from static code analysis, are inadequate for effective repair and improvement. We empirically evaluate these datasets using the current in-context learning approach and propose a novel method to enhance LLMs' ability to generate solutions and fix security issues. Our results demonstrate that our approach significantly improves performance in resolving these issues.

4.1 Benchmarks

LLMSecEval: A dataset containing natural language prompts for evaluating LLMs on generating secure source code (Tony et al., 2023). This dataset covers the majority of the Top 25 Common Weakness Enumeration (CWE) scenarios from 2021, tackling various security concerns with 150 prompts.

SecurityEval: This dataset, proposed by (Siddiq and Santos, 2022), is used to evaluate LLMs on their ability to generate secure Python programs. It comprises 121 natural language (NL) prompts, featuring a diverse range of vulnerability types, known as Common Weakness Enumerations (CWEs), covering 75 different categories. Each prompt includes the header of a Python function along with comments that describe the function's purpose.

PythonSecurityEval: We collect a real-world dataset for evaluating LLMs' ability to generate secure code and their capacity to fix security vulnerabilities. This dataset was manually curated from Stack Overflow. Our benchmark includes natural language (NL) prompts intended to generate Python functions, each paired with a corresponding unit test. Our dataset is four times larger than those used in LLMSecEval and SecurityEval.

As our benchmark requires generating Python programs that interact with various systems or services, unit testing can be somewhat challenging. Fortunately, the Python unittest library includes a submodule named unittest.mock, which offers a way to isolate parts of the system under test using Mock objects. For example, when our Python program interacts with an SQL database, the Mock library can create a virtual object that simulates the behavior of a real database. In our case, each natural language (NL) prompt in our dataset comes with its own unit test. The primary benefit of implementing unit tests is to ensure that the LLM can address security issues without modifying the program's intended functionality.
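To illustrate such a mock-based unit test, here is a minimal sketch of a test for the get_total_rows function from Code Snippet 1 (shown below); the module name solution and the expected row count are our own stand-ins, and the actual tests in PythonSecurityEval may be structured differently:

    import unittest
    from unittest.mock import MagicMock, patch

    from solution import get_total_rows  # hypothetical module holding the generated function

    class TestGetTotalRows(unittest.TestCase):
        @patch("solution.sqlite3")
        def test_returns_row_count(self, mock_sqlite3):
            # The mocked connection and cursor stand in for a real SQLite database.
            mock_cursor = MagicMock()
            mock_cursor.fetchone.return_value = (7,)
            mock_conn = MagicMock()
            mock_conn.cursor.return_value = mock_cursor
            mock_sqlite3.connect.return_value = mock_conn

            self.assertEqual(get_total_rows("test.db", "users"), 7)
            mock_conn.close.assert_called_once()

    if __name__ == "__main__":
        unittest.main()

Because the database is mocked, the test only checks the function's observable behavior, which is exactly what is needed to confirm that a security patch did not change the intended functionality.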
1   # Write a Python function to return the total number of rows in SQLite
2
3   import sqlite3
4   def get_total_rows(database_name, table_name):
5       conn = sqlite3.connect(database_name)
6       cursor = conn.cursor()
7       cursor.execute("SELECT COUNT(*) FROM {}".format(table_name))
8       result = cursor.fetchone()
9       conn.close()
10      return result[0]

Code Snippet 1: Example of a prompt and vulnerable code from our benchmark: this function connects to the database and executes a query.

4.2 Baselines

Direct Prompting: This approach involves sending the generated code back to the LLMs with the instruction: Does the provided function have a security issue? If yes, please fix the issue. If the LLMs detect any security issues in the code, they will fix the issue and generate secure code. This serves as a baseline to demonstrate the LLM's ability to detect and resolve security issues without additional feedback.

Self-Debugging: The concept of self-debugging involves LLMs generating and then debugging code, without human or external tool feedback. After code generation, the models explain their code. This explanation, paired with the code, is then analyzed by the models to identify and resolve any security vulnerabilities. The process of explaining the code could improve the models' ability to understand the code and their capacity to fix security flaws.

Bandit feedback: Bandit is a static code analysis tool designed to detect common security issues in Python code.3 In our baseline, we provide the LLM with direct feedback from Bandit, which includes the type of security issue and indicates the specific line of code where the issue is found. Bandit does not provide a solution to fix the issue; it simply highlights the problematic line and the type of the issue. Figure 2 shows an example of Bandit feedback for the code snippet in Figure 1.

Verbalization: This technique involves sending feedback from Bandit to the LLMs, which then verbalize the Bandit feedback with a detailed explanation in natural language. This expanded explanation provides deeper insights into the security problems and might suggest methods for resolving them. Verbalization resembles self-debugging but also incorporates feedback from a static code analysis tool.
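As a small illustration of the simplest of these baselines, Direct Prompting can be issued in a single round trip; the model callable below is a stand-in, and the instruction text is the one quoted above:

    def direct_prompting(code, model):
        # Ask the LLM to inspect and, if necessary, patch its own code in one step.
        prompt = (
            "Does the provided function have a security issue? "
            "If yes, please fix the issue.\n\n" + code
        )
        return model(prompt)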
Bandit Feedback
» Issue: [B608:hardcoded_sql_expressions] Possible SQL injection vector through string-based query construction.
Line 7: cursor.execute("SELECT COUNT(*) FROM {}".format(table_name))

Figure 2: The feedback from Bandit for Code Snippet 1.

3 Bandit: https://github.com/PyCQA/bandit

4.3 Metrics

The primary objective of this paper is to introduce a method for enhancing the model's ability to generate secure code that is free from security flaws and to refine any insecure code. We evaluate each piece of code using Bandit to identify common security issues.
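As a sketch of how this check can be automated, the snippet below writes a candidate function to a temporary file and scans it with the bandit command-line tool; we assume Bandit's JSON report exposes a results list, and an empty list is treated as "secure", mirroring the stop condition in Algorithm 1. This is our illustration, not necessarily the paper's evaluation harness:

    import json
    import subprocess
    import tempfile

    def bandit_is_secure(code: str) -> bool:
        # Write the candidate code to a temporary file and scan it with Bandit.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run(
            ["bandit", "-f", "json", "-q", path],  # JSON output, quiet mode
            capture_output=True,
            text=True,
        )
        report = json.loads(proc.stdout)
        return len(report.get("results", [])) == 0  # no findings -> treat as secure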
Table 1: Percentage of insecure code for each dataset and approach.
Figure 5: The figure illustrates the percentage of secure code in the PythonSecurityEval dataset, comparing five refinement approaches (Direct prompting, Self-Debugging, Bandit feedback, Verbalization, and FDSP) against the initially generated code for GPT-4, GPT-3.5, and CodeLlama.

[Figure: bar chart of the total number of code samples per CWE category (CWE-259, CWE-89, CWE-94, CWE-327, CWE-20, CWE-319, CWE-502, CWE-78, CWE-400, ...); y-axis: total number of codes.]