Performance Review of LLMs for Solving LeetCode Problems

1st Lun Wang*, Duke University, North Carolina, USA ([email protected])
2nd Chuanqi Shi, University of California San Diego, California, USA ([email protected])
3rd Shaoshuai Du, University of Amsterdam, Amsterdam, Netherlands ([email protected])
4th Yiyi Tao, Johns Hopkins University, Maryland, USA ([email protected])
5th Yixian Shen, University of Amsterdam, Amsterdam, Netherlands
6th Hang Zheng, University of California San Diego, California, USA
7th Xinyu Qiu, Northeastern University, Boston, USA
Abstract—This paper presents a comprehensive performance evaluation of Large Language Models (LLMs) in solving programming challenges from Leetcode, a widely used platform for algorithm practice and technical interviews. We began by crawling the Leetcode website to collect a diverse set of problems encompassing various difficulty levels and topics. Using this dataset, we generated solutions with multiple LLMs, including GPT-4 and GPT-3.5-turbo (ChatGPT-turbo). The generated solutions were systematically evaluated for correctness and efficiency. We employed the pass@k metric to assess the success rates within a given number of attempts and analyzed the runtime performance of the solutions. Our results highlight the strengths and limitations of current LLMs [10] in code generation and problem-solving tasks, providing insights into their potential applications and areas for improvement in automated programming assistance.

Index Terms—LLM, LLM performance evaluation, ChatGPT review.

I. INTRODUCTION

Large Language Models (LLMs) like ChatGPT (OpenAI, 2023) have revolutionized artificial intelligence, demonstrating remarkable capabilities in text and image generation. In software development, specialized code-focused LLMs—such as CodeGen (Nijkamp et al., 2022), StarCoder (Li et al., 2023), WizardCoder (Luo et al., 2023), CodeT5 (Wang et al., 2021), and InCoder (Fried et al., 2022)—assist developers by automating tasks like code generation, documentation, and unit testing. Additionally, LLMs have been integrated into Integrated Development Environments (IDEs) as code assistants [3], including GitHub Copilot, Amazon CodeWhisperer, and Tabnine. These tools aim to accelerate development by providing real-time code suggestions and automating routine coding tasks.

The integration of LLMs into software development offers significant benefits. Developers can save time, focus on higher-level design decisions, and potentially reduce time to market. LLMs help generate boilerplate code, suggest improvements, and assist with complex problem-solving, enhancing productivity and fostering innovation by leveraging the vast knowledge embedded within these models [4].

Despite their widespread adoption, research is increasingly focused on understanding the limitations and evaluating the performance of LLMs. Studies have highlighted security vulnerabilities in AI-generated code (Pearce et al., 2022; Sandoval et al., 2023; Perry et al., 2023) and the prevalence of bugs (Jesse, 2023), emphasizing the need for thorough code review and testing. Other research explores how developers interact with LLMs [12] and integrate them into their workflows (Vaithilingam et al., 2022; Barke et al., 2023), examining the dynamics between human creativity and AI assistance [13]. However, evaluating the runtime performance of LLM-generated code has received less attention. While correctness is crucial, the efficiency of code—how fast it runs and how optimally it uses resources—is a significant concern in software engineering [6]. Performance optimization is essential when resources are limited, scalability is needed, or energy consumption is a concern (Verdecchia et al., 2017; Acar et al., 2016). In areas like high-frequency trading, real-time data processing, or large-scale web applications, even minor execution time improvements can have substantial impacts [8].

To address this gap, our research evaluates the performance of code generated by LLMs on algorithmic challenges typical of programming contests and technical interviews. We conduct a comprehensive performance review using problems from Leetcode, a widely used platform offering a vast repository of algorithmic problems across various difficulty levels and topics [1]. Our key contributions are:
1. Performance Analysis of LLM-Generated Code: We analyze the performance of code generated by 18 LLMs on 204 Leetcode problems, investigating performance differences across models using a novel method for measuring and comparing runtime efficiency [11].
2. Comparison with Human-Written Code: We compare the performance of LLM-generated code with human-written code, providing insights into the current capabilities of LLMs in producing efficient code and highlighting areas where they may lag behind human expertise.
3. Evaluation of Leetcode as a Dataset: We assess the usability of Leetcode as a public repository of algorithmic problems for research purposes,
discussing its strengths and limitations to guide future research utilizing similar resources [14].

Our methodology involves generating solutions using multiple LLMs, including GPT-4 and GPT-3.5-turbo, and systematically assessing their correctness and efficiency. We utilize metrics such as the pass@k metric, which evaluates the probability of a model providing a correct solution within k attempts, and measure the runtime performance of the generated code [15]. By analyzing these metrics, we aim to understand the strengths and limitations of current LLMs in algorithmic problem-solving contexts. Our findings offer insights into how LLMs can assist developers in tackling complex programming challenges and identify areas where further advancements are needed to enhance their capabilities in generating efficient, high-performance code [17].

II. EXPERIMENT

This section outlines the experimental framework employed to evaluate the performance of Large Language Models (LLMs) in solving algorithmic problems from Leetcode. The experiment is structured into three primary phases: data collection, code generation, and solution evaluation.

A. Data Collection

To establish a comprehensive dataset for our evaluation, we crawled the Leetcode website and collected a total of 2,014 problems. These problems span various difficulty levels—Easy, Medium, and Hard—and encompass a wide range of topics including data structures, algorithms, mathematics, and system design. During data collection, we focused on extracting the essential components necessary for code generation:
• Problem Statements: The detailed descriptions of each problem, including the objective and any specific requirements.
• Function Signatures: The provided code frameworks or templates, specifying input and output formats.
• Code Comments: Any comments included in the code templates that provide additional guidance or constraints.
We standardized the problem data by removing any extraneous information such as solution discussions, hints, or previously submitted solutions. This preprocessing ensured that the input to the LLMs was consistent and contained only the information that a typical developer would have when attempting to solve the problem independently.
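For illustration, the sketch below shows one way the standardized problem data could be represented after preprocessing. The field names and the cleaning helper are simplified assumptions for this sketch, not the exact schema of our crawler.

# A minimal sketch of a standardized problem record used as LLM input.
# Field names and the cleaning heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ProblemRecord:
    slug: str            # e.g., "two-sum"
    difficulty: str      # "Easy", "Medium", or "Hard"
    statement: str       # detailed problem description
    code_framework: str  # provided function signature / class template
    code_comments: str   # guidance comments embedded in the template, if any

def strip_extraneous_sections(raw_text: str) -> str:
    """Drop hints, discussions, and prior solutions from crawled problem text."""
    drop_markers = ("Hint", "Discussion", "Related Topics", "Solution")
    kept_lines = []
    for line in raw_text.splitlines():
        if any(line.strip().startswith(marker) for marker in drop_markers):
            break  # everything after these markers is extraneous for prompting
        kept_lines.append(line)
    return "\n".join(kept_lines).strip()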
B. Code Generation

The code generation phase involved utilizing two categories of LLMs to generate solutions for the collected Leetcode problems: OpenAI models and the GitHub Copilot model. For each problem, we generated solutions using these models under varying levels of randomness and creativity, controlled by the temperature parameter in the models' settings. The temperature parameter influences the diversity of the output, with higher values producing more varied and creative responses. We utilized a Python framework to automate the code generation process [16]. This framework automatically sent requests to the OpenAI API, providing the standardized problem input (problem statement, code comments, and code framework) as prompts. The solutions returned by the LLM were then parsed and formatted into the required Leetcode submission format in Python code.
• Temperature Settings: We used five different temperature values: 0.2, 0.4, 0.6, 0.8, and 1.0.
• Solution Generation: At each temperature setting, we generated 10 distinct solutions per model for each problem.
• OpenAI Models: We interfaced with the OpenAI API, providing the standardized problem input (problem statement, code comments, and code framework) as prompts. We set the temperature parameter accordingly and generated multiple solutions by invoking the model repeatedly.
• GitHub Copilot: We integrated Copilot into a compatible code editor (e.g., Visual Studio Code) and input the problem's code framework. Copilot's suggestions were captured for each temperature setting by configuring its randomness settings if available or by inducing variability through prompt modifications.
By generating multiple solutions across different temperatures, we aimed to observe the impact of the temperature parameter on the correctness and efficiency of the generated code. This process also allowed us to assess the models' ability to produce diverse solutions and their propensity to generate optimal or suboptimal code under varying conditions [5].
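The sketch below illustrates the shape of such an automation loop, assuming the openai Python client (version 1 or later); the prompt template and the default model name are placeholders rather than the exact configuration of our framework.

# Illustrative generation loop over the five temperature settings.
# Assumes the `openai` package (v1+) and an OPENAI_API_KEY in the environment;
# the prompt wording and model name are placeholders.
from openai import OpenAI

client = OpenAI()
TEMPERATURES = [0.2, 0.4, 0.6, 0.8, 1.0]
SAMPLES_PER_TEMPERATURE = 10

def generate_solutions(statement: str, code_comments: str, code_framework: str,
                       model: str = "gpt-4") -> dict:
    """Return {temperature: [solution_code, ...]} for one problem."""
    prompt = (
        f"{statement}\n\n{code_comments}\n\n"
        f"Complete the following Python code framework:\n{code_framework}"
    )
    solutions = {}
    for temperature in TEMPERATURES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            n=SAMPLES_PER_TEMPERATURE,  # 10 distinct samples per setting
        )
        solutions[temperature] = [choice.message.content for choice in response.choices]
    return solutions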
C. Solution Evaluation

The evaluation phase involved assessing the correctness and performance of the generated solutions by submitting them to Leetcode's online judge system. The Leetcode platform provides an automated environment that compiles and executes submitted code against a predefined set of test cases. For each submitted solution, we collected the following metrics:
• Number of Unit Tests Passed: The total number of test cases successfully passed by the solution.
• Overall Status: Whether the solution met all the problem requirements.
• Runtime: The execution time of the solution, measured by Leetcode's evaluation system.
• Memory Usage: The amount of memory consumed during execution.
The evaluation process was conducted systematically:
1) Automated Submission: Solutions were programmatically submitted to Leetcode using their API or through automated scripting to ensure consistency and efficiency [2].
2) Data Recording: All evaluation results were recorded in a structured format for subsequent analysis. This included capturing the raw output from Leetcode and parsing relevant information.
The collected data enabled us to analyze several aspects of the models' performance:
• Success Rate (Pass@k Metric): The probability of a model generating a correct solution within k attempts, considering the multiple solutions generated per problem.
• Error Analysis: Identification of common errors or misconceptions exhibited by the models, such as off-by-one errors, incorrect loop conditions, or misuse of data structures.
• Runtime Performance: Assessment of the efficiency of the solutions, with a focus on execution time and resource utilization.
By evaluating both correctness and performance, we aimed to understand not only whether the models could solve the problems but also how efficiently they could do so. This dual focus is critical in algorithmic problem-solving contexts, where optimal solutions are often required to meet time and space constraints.
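As an illustration of the structured format used for data recording, each submission could be captured as a record like the one below; the class and field names are hypothetical and simply mirror the metrics listed above.

# Hypothetical per-submission record; the fields mirror the collected metrics,
# while the exact storage schema used in our pipeline may differ.
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class SubmissionResult:
    problem_slug: str
    model: str
    temperature: float
    tests_passed: int                    # number of unit tests passed
    tests_total: int
    accepted: bool                       # overall status: all requirements met
    runtime_ms: Optional[float]          # reported by Leetcode's judge
    memory_mb: Optional[float]
    runtime_percentile: Optional[float]  # "beats X%" ranking, 0-100

def append_result(result: SubmissionResult, path: str = "results.jsonl") -> None:
    """Append one evaluation record as a JSON line for later analysis."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(result)) + "\n")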
D. Data Analysis

In this section, we present the methods and metrics used to analyze the functional correctness and performance of the code generated by the Large Language Models (LLMs). Our analysis aims to assess not only whether the models can produce correct solutions but also how efficiently these solutions run compared to human-written code [9].

1) Functional Correctness: Functional correctness measures the extent to which the code generated by an LLM adheres to the specified problem requirements, effectively conforming to the “program contract” defined by the input prompt. To evaluate this aspect, we employed the pass@k metric, which calculates the probability that at least one of the k generated samples passes all the test cases for a given problem [7].

We computed the pass@k metrics for k = 1 and k = 10, utilizing the unbiased estimator proposed by Chen et al. (2021). This estimator accounts for the likelihood of obtaining a correct solution among multiple attempts and is defined as:

\[
\text{pass@}k = \mathbb{E}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
\]

where:
• n is the total number of generated samples,
• c is the number of correct samples (i.e., samples that pass all test cases),
• E denotes the expected value.
This formula provides an unbiased estimate of the pass@k metric by considering all possible combinations of correct and incorrect samples without replacement.
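For reference, the estimator can be implemented with the numerically stable product form popularized by Chen et al. (2021), as in the sketch below; the variable names are ours.

# Unbiased pass@k estimator (Chen et al., 2021), using the identity
# 1 - C(n-c, k) / C(n, k) = 1 - prod_{i = n-c+1}^{n} (1 - k / i).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated, c: samples passing all tests, k: attempt budget."""
    if n - c < k:  # any draw of k samples must contain a correct one
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example with n = 10 samples per problem and c = 3 correct samples:
print(pass_at_k(10, 3, 1))   # ~0.30
print(pass_at_k(10, 3, 10))  # 1.0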
Following the methodology suggested by Chen et al. (2021), we calculated the pass@k for each temperature setting when evaluating an LLM's functional correctness. The temperature parameter influences the randomness and diversity of the generated solutions. By evaluating across different temperatures, we aimed to identify the optimal setting for each model. The best pass@k value observed across all temperatures was then considered the final pass@k metric for that LLM.

2) Code Performance: To assess the performance of the code generated by the LLMs, we considered three key metrics:
1) Memory Usage: We recorded the memory consumption reported by Leetcode's evaluation system for each submitted solution. Memory usage is a critical factor in code performance, especially for problems with large input sizes or when operating under memory constraints.
2) Runtime Performance: We measured the execution time of the generated solutions using pytest-benchmark, a Python benchmarking tool. For each solution, we conducted multiple runs to obtain a reliable estimate of its runtime performance. The median runtime was computed to mitigate the impact of outliers and variability in execution times (a minimal usage sketch follows this list).
3) Leetcode Runtime Percentile Rank: Upon submission, Leetcode provides a percentile ranking that indicates how a solution's runtime compares to other users' submissions for the same problem. This rank is a value between 0 and 100, representing the percentage of submissions that the current solution outperforms. For example, a rank of 90 implies that the solution is faster than 90% of all other submitted solutions. This metric allowed us to benchmark the LLM-generated code against human-written code at a global scale.
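A minimal sketch of such a benchmark is shown below. It uses the pytest-benchmark fixture; the two_sum function and its input stand in for a generated solution and a representative test case, and are not taken from our actual harness.

# Illustrative pytest-benchmark harness for one generated solution.
# Run with: pytest --benchmark-only test_generated_solution.py
# The solution and test input below are placeholders.
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

def test_two_sum_runtime(benchmark):
    nums = list(range(10_000))
    # benchmark() repeats the call over many rounds; pytest-benchmark reports
    # min/median/mean per test, and we keep the median as the solution's runtime.
    result = benchmark(two_sum, nums, 19_997)
    assert result == [9998, 9999]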
III. RESULTS

TABLE I presents the pass@1 and pass@10 metrics for the evaluated models. Below is an analysis of the data.

A. Dataset Analysis

Our dataset analysis encompasses approximately 2,100 LeetCode problems, meticulously selected to provide a comprehensive evaluation of Large Language Models (LLMs) across a diverse range of algorithmic challenges. These problems are systematically categorized into three difficulty levels: Easy, Medium, and Hard, adhering to a distribution ratio of approximately 11:50:10, respectively. Furthermore, all solutions generated by the LLMs were implemented in Python, a language renowned for its readability and widespread use in coding competitions and technical interviews. Additionally, each problem was approached using LLMs configured with five different temperature settings—0.2, 0.4, 0.6, 0.8, and 1.0. The temperature parameter controls the creativity and variability of the generated solutions, allowing us to examine how different levels of randomness impact the correctness and efficiency of the code produced. All the experiment code and the dataset are published at: https://github.com/DHUer/LLMevaluationresults

B. LLM Solutions Compared with Humans

To facilitate a robust comparison between LLM-generated solutions and human-written code, we selected the o1-mini model tested on LeetCode for this analysis. The results of this comparison are depicted in Figure 3. Utilizing LeetCode's runtime percentile rankings—which assume that the majority of historical submissions originate from human programmers—we assessed the execution speed of the LLM-generated solutions relative to human-written counterparts. Our findings reveal that the solutions produced by the selected LLM achieve a mean runtime percentile rank of 63%, indicating that they are faster than 63% of all previous submissions.

Fig. 3. Distribution of the ranking for the runtime of the Leetcode solutions of o1-mini
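As a small illustration of how such a summary can be derived from the recorded submissions, the snippet below averages the per-solution percentile ranks; the file layout and key names follow the hypothetical record schema sketched in the Solution Evaluation subsection.

# Hypothetical aggregation of Leetcode "beats X%" ranks for accepted solutions.
# Assumes one JSON object per line with "accepted" and "runtime_percentile" keys.
import json
from statistics import mean, median

def summarize_percentile_ranks(path: str = "results.jsonl") -> dict:
    ranks = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("accepted") and record.get("runtime_percentile") is not None:
                ranks.append(record["runtime_percentile"])
    return {
        "solutions": len(ranks),
        "mean_rank": mean(ranks) if ranks else None,    # e.g., about 63 for o1-mini
        "median_rank": median(ranks) if ranks else None,
    }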
C. Performance Overview

Top Performers:
• Canonical Solutions is the highest-performing entry, with near-perfect scores (97.94 and 98.04). This suggests it is tailored or highly optimized for the specific tasks.
• GPT-4-omni, GPT-4, and GPT-4-turbo follow, but with significantly lower scores, indicating strong performance yet a noticeable gap compared to Canonical Solutions.

Mid-Tier Performers:
• Models such as Copilot, CodeLlama-13B-Instruct, and WizardCoder-Python-7B show moderate performance (scores in the range of ∼4–19). This reflects some utility but highlights significant room for improvement compared to the top-tier models.

Lower Performers:
• Models like SantaCoder, InCoder-6B, and CodeT5-Large-NTP-PY perform poorly, with scores often below 5. These results suggest limited capability in handling the evaluated tasks effectively.

Fig. 1. Leetcode memory analysis
Fig. 2. Leetcode runtime analysis
REFERENCES
[1] Ehsan Aghapour, Yixian Shen, Dolly Sapra, Andy Pimentel, and Anuj
Pathania. Piqi: Partially quantized dnn inference on hmpsocs. In
Proceedings of the 29th ACM/IEEE International Symposium on Low
Power Electronics and Design, pages 1–6, 2024.
[2] Tristan Coignion, Clément Quinton, and Romain Rouvoy. A perfor-
mance study of llm-generated code on leetcode. In Proceedings of
the 28th International Conference on Evaluation and Assessment in
Software Engineering, pages 79–89, 2024.
[3] Shaoshuai Du, Kuangrong Hao, Haichao Zhang, Xue-song Tang, and
Bing Wei. Patch elastic deformation: An effective data augmentation
method. In 2022 China Automation Congress (CAC), pages 2079–2084,
2022.
[4] Yiru Gong, Qimin Zhang, Huili Zheng, Zheyan Liu, and Shaohan Chen.
Graphical structural learning of rs-fmri data in heavy smokers. In
2024 4th International Conference on Computer Science and Blockchain
(CCSB), pages 434–438, 2024.
[5] Jiacheng Hu, Zhen Qi, Jianjun Wei, Jiajing Chen, Runyuan Bao,
and Xinyu Qiu. Few-shot learning with adaptive weight masking in conditional GANs, 2024.
TABLE I
PASS@1 AND PASS@10 METRICS FOR VARIOUS MODELS
Model Pass@1 (%) Pass@10 (%)
Canonical Solutions 97.94 98.04
GPT-4-omni 43.36 61.95
GPT-4 31.68 67.79
GPT-4-turbo 31.25 50.00
Copilot 8.09 19.12
CodeLlama-13B-Instruct 4.17 14.71
WizardCoder-Python-7B 4.17 12.25
CodeLlama-13B-Python 3.58 14.71
CodeLlama-7B-Instruct 3.24 12.75
StarCoder 2.70 10.29
CodeLlama-7B-Python 2.60 10.78
CodeLlama-7B 2.11 10.29
CodeGen2-7B-Instruct 2.16 10.78
CodeGen2-7B-Mono 1.28 7.35
CodeGen-6B-Mono 1.08 5.39
CodeGen-2B-Mono 1.13 6.37
Replit-Code-v1-3B 0.98 4.90
SantaCoder 0.69 4.90
InCoder-6B 0.59 3.92
InCoder-1B 0.10 0.98
CodeGen-350M-Mono 0.39 2.94
CodeT5-Large-NTP-PY 0.25 1.96
[6] Daoming Li, Qiang Chen, and Lun Wang. Phishing attacks: Detection
and prevention techniques. Journal of Industrial Engineering and
Applied Science, 2(4):48–53, 2024.
[7] Keqin Li, Jiajing Chen, Denzhi Yu, Tao Dajun, Xinyu Qiu, Lian Jieting,
Sun Baiwei, Zhang Shengyuan, Zhenyu Wan, Ran Ji, Bo Hong, and
Fanghao Ni. Deep reinforcement learning-based obstacle avoidance for
robot movement in warehouse environments, 2024.
[8] Zheyan Liu, Qimin Zhang, Huili Zheng, Shaohan Chen, and Yiru Gong.
A comparative study of machine learning approaches for diabetes risk
prediction: Insights from shap and feature importance. In 2024 5th In-
ternational Conference on Machine Learning and Computer Application
(ICMLCA), pages 35–38, 2024.
[9] Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, and Anuj Pathania.
Parameter-efficient fine-tuning via selective discrete cosine transform.
arXiv preprint arXiv:2410.09103, 2024.
[10] Yiyi Tao. Meta learning enabled adversarial defense. In 2023 IEEE
International Conference on Sensors, Electronics and Computer Engi-
neering (ICSECE), pages 1326–1330, 2023.
[11] Yiyi Tao. Meta learning enabled adversarial defense. In 2023 IEEE
International Conference on Sensors, Electronics and Computer Engi-
neering (ICSECE), pages 1326–1330. IEEE, 2023.
[12] Yiyi Tao, Yiling Jia, Nan Wang, and Hongning Wang. The fact:
Taming latent factor models for explainability with factorization trees.
In Proceedings of the 42nd international ACM SIGIR conference on
research and development in information retrieval, pages 295–304, 2019.
[13] Yiyi Tao, Zhuoyue Wang, Hang Zhang, and Lun Wang. Nevlp: Noise-
robust framework for efficient vision-language pre-training. arXiv
preprint arXiv:2409.09582, 2024.
[14] Chenxu Wang, Yixian Shen, Jia Jia, Yutong Lu, Zhiguang Chen, and
Bo Wang. Singlecaffe: an efficient framework for deep learning on a
single node. IEEE Access, 6:69660–69671, 2018.
[15] Lun Wang. Low-latency, high-throughput load balancing algorithms.
Journal of Computer Technology and Applied Mathematics, 1(2):1–9,
2024.
[16] Lun Wang, Wei Fang, and Yudi Du. Load balancing strategies in
heterogeneous environments. Journal of Computer Technology and
Applied Mathematics, 1(2):10–18, 2024.
[17] Lun Wang, Wentao Xiao, and Shan Ye. Dynamic multi-label learning
with multiple new labels. In Image and Graphics: 10th International
Conference, ICIG 2019, Beijing, China, August 23–25, 2019, Proceed-
ings, Part III 10, pages 421–431. Springer, 2019.