[ACL 2025 Findings] ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
Paper: https://arxiv.org/abs/2503.07010
Leaderboard: ProjectEval LeaderBoard
Contact: Kaiyuan Liu
- 2025/05/30 ProjectEval repository open-sourced.
- 2025/05/25 ProjectEval Leaderboard is online.
- 2025/05/16 ProjectEval is accepted to ACL 2025 Findings.
- Operating System: ProjectEval works on both Windows and Linux. We have not tested macOS yet, but since it runs fine on Linux, it should work without problems.
- Browser and Driver: ProjectEval needs a browser and its driver. We officially support three: Edge, Chrome, and Firefox. On Windows, put the driver.exe into the root directory of ProjectEval; on Linux, follow the browser vendor's official instructions to set up the driver service (see the sanity-check sketch after this list).
- Python Virtual Environment (venv): ProjectEval requires a virtual environment created with python -m venv .venv. If your venv path is NOT .venv, change VENV_PATH in config.ini to match your path.
- LLM: Make sure Ollama is running if you wish to use any of the models mentioned in our paper.
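The snippet below is an optional, minimal sanity check for the two external dependencies above. It assumes a Selenium-style WebDriver and Ollama's default endpoint at http://localhost:11434; it is a sketch for verifying your local setup, not part of ProjectEval itself.

```python
# Optional sanity check for the local setup (not part of ProjectEval itself).
# Assumptions: a Selenium-style WebDriver is installed, and Ollama is
# listening on its default endpoint http://localhost:11434.
import requests
from selenium import webdriver

# 1) Browser driver: Selenium should be able to launch the browser.
driver = webdriver.Chrome()  # or webdriver.Edge() / webdriver.Firefox()
driver.get("https://example.com")
print("Driver OK:", driver.title)
driver.quit()

# 2) Ollama: GET /api/tags lists the locally available models.
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print("Ollama OK, models:", [m["name"] for m in resp.json().get("models", [])])
```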
ProjectEval's standard evaluation only accepts JSON, but you can easily convert any text files into JSON using tools\file_transform.py.
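Purely as an illustration of the idea (the authoritative schema is the example in the experiments directory, and you should prefer tools\file_transform.py), bundling a generated project into a single JSON object can be as simple as mapping relative file paths to file contents; the layout and paths below are hypothetical.

```python
# Purely illustrative: one plausible way to bundle generated project files
# into a single JSON object. The real answer schema is defined by the
# example in the experiments directory -- prefer tools\file_transform.py.
import json
from pathlib import Path

def bundle_project(project_dir: str) -> dict:
    root = Path(project_dir)
    # Map each file's relative path to its text content (hypothetical layout).
    return {
        str(path.relative_to(root)): path.read_text(encoding="utf-8")
        for path in root.rglob("*")
        if path.is_file()
    }

if __name__ == "__main__":
    # "experiments/my_run/project_001" is a hypothetical path.
    print(json.dumps(bundle_project("experiments/my_run/project_001"), indent=2))
```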
If you trust your LLM not to harm your device, you can run the execution evaluation directly with:
python run_judge.py -r "[\"<your_folder_name_in_experiment>\"]"
If NOT, run it inside Docker by following these steps:
# Step 1
cd docker
# Step 2
sh build.sh # Linux
build.bat # Windows
# Step 3
sh compose.sh # Linux
compose.bat # Windows
No matter which way you choose, all the results will be saved in the experiments directory.
In the project root directory:
python run_indicators.py -r "[\"<your_folder_name_in_experiment>\"]"
The result will be in the experiments directory.
ProjectEval is an offline evaluation benchmark, and its evaluation phase is complicated and time-consuming, so the reasoning phase is separated from the evaluation phase. The reasoning phase only produces JSON or files.
Run a standard ProjectEval reasoning phase as follows:
- Open run_reasoning.py.
- Follow the instructions in the file and edit the parameters to your own settings.
- Run: python run_reasoning.py
WARNING: THIS PHASE WILL INCUR COSTS.
ProjectEval used GPT-4o to generate the data. To reproduce it:
python run_generation.py
This section covers common issues the authors have noticed; check the following before submitting an issue:
- The answer's path follows the example in the experiments directory.
- The answer's format follows the example in the experiments directory; if you convert the files into JSON yourself, we strongly recommend using the script file_transform.py in the tools directory.
- config.ini is set correctly.
- Docker runs properly.
- ProjectEval is a multi-level benchmark designed to evaluate LLMs and agents on project-level code generation through realistic user interactions. It addresses two gaps: the inability to automatically evaluate code from the users' perspective, and the lack of explainability in the results of LLM agents' code generation capabilities.
- ProjectEval integrates natural language, structured checklists, and code skeletons as three input levels to simulate diverse development scenarios and support explainable evaluation, and it also provides a standard Test Suite and Canonical Answer.
- Level 1 - Natural Language Prompt (NL Prompt): the agent receives one or several natural language sentences describing the goal of the project.
- Level 2 - Natural Language Checklist (NL Checklist): the agent receives a standard natural-language checklist describing the project through the abilities and functions it should have.
- Level 3 - Skeleton: the agent receives a skeleton of the standard answer, containing doc-strings and comments that describe the project (see the illustrative sketch below).
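To make the three levels concrete, here is a hypothetical example (invented for explanation, not taken from the benchmark data) of how the same mission could appear at each level:

```python
# Hypothetical illustration of the three input levels for one mission.
# All text and code below are invented for explanation and are NOT taken
# from the actual ProjectEval benchmark.

# Level 1 -- NL Prompt (one or a few sentences):
#   "Build a to-do list web app where users can add, complete, and delete tasks."

# Level 2 -- NL Checklist (abilities and functions the project should have):
#   1. The app provides a page listing all tasks.
#   2. A user can add a new task with a title.
#   3. A user can mark a task as completed.

# Level 3 -- Skeleton (doc-strings and comments describing the project):
def add_task(title: str) -> int:
    """Create a new task with the given title and return its id."""
    raise NotImplementedError  # to be filled in by the agent

def complete_task(task_id: int) -> None:
    """Mark the task with the given id as completed."""
    raise NotImplementedError  # to be filled in by the agent
```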
A mission's test suite contains two parts:
- Testcodes: each mission contains several automated evaluation Python functions, similar to HumanEval testcases.
- Parameter Description (PD): the PD is used for a special kind of parameter alignment; these parameters are required by the matching testcode to achieve the established test goal(s).
Every mission we constructed has a canonical solution. Besides the canonical code, we also build a standard answer to every PD that matches the canonical code, called the canonical parameter values.
The testcode is aligned with the checklist; the parameter description is aligned with the testcode and the canonical parameter values; and the canonical parameter values are aligned with the canonical code and are used to make the testcode pass (see the illustrative sketch below).
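A hypothetical sketch of how these pieces fit together; the names, routes, and values are invented for explanation, and the real formats follow ProjectEval's own test suites:

```python
# Hypothetical illustration of a testcode, its Parameter Description (PD),
# and the canonical parameter values. Everything here is invented for
# explanation; the real formats follow ProjectEval's own test suites.
import requests

# PD: questions the agent must answer about ITS OWN solution.
PD = {
    "task_list_url": "the route of the page that lists all tasks",
    "add_task_url": "the route used to create a new task",
}

# Canonical parameter values: the PD answers that match the canonical code.
CANONICAL_PV = {"task_list_url": "/tasks", "add_task_url": "/tasks/add"}

# Testcode: an automated check (similar to a HumanEval testcase) that takes
# the PV as input, so it can run against any agent's generated project.
def testcode_task_list_page(base_url: str, pv: dict) -> bool:
    resp = requests.get(base_url + pv["task_list_url"])
    return resp.status_code == 200
```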
The evaluation process begins by selecting a specific input level and presenting it to the agent. The agent generates solution code, which is then fed back into the same agent along with the parameter description; the agent answers the parameter description based on its own solution to produce parameter values (PV). The code is then converted into an executable project. Finally, the PV is substituted into the testcode, and the testcode is run by the ProjectEval evaluation machine to obtain the evaluation results.
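The flow above can be summarised in a short, runnable sketch; every class, function, and value below is a placeholder for illustration, not ProjectEval's actual API:

```python
# Runnable sketch of the evaluation flow described above. All names and the
# dummy objects are placeholders for illustration, not ProjectEval's API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Mission:
    inputs: Dict[int, str]                  # level -> prompt / checklist / skeleton text
    parameter_description: Dict[str, str]   # PD: questions about the solution
    testcodes: List[Callable]               # PV-parameterised checks

def evaluate_mission(agent, mission: Mission, level: int) -> List[bool]:
    # 1) Present the chosen input level to the agent; it generates solution code.
    solution_code = agent.generate(mission.inputs[level])
    # 2) The same agent answers the PD on its own solution -> parameter values (PV).
    pv = agent.answer(mission.parameter_description, solution_code)
    # 3) Materialise the project and run every testcode with the PV substituted in.
    project = build_project(solution_code)
    return [testcode(project, pv) for testcode in mission.testcodes]

def build_project(code: str):
    return code  # stand-in for writing files / starting the web application

if __name__ == "__main__":
    class DummyAgent:
        def generate(self, prompt): return "# generated project code"
        def answer(self, pd, code): return {key: "/tasks" for key in pd}

    mission = Mission(
        inputs={1: "Build a to-do list web app."},
        parameter_description={"task_list_url": "route of the task list page"},
        testcodes=[lambda project, pv: pv["task_list_url"].startswith("/")],
    )
    print(evaluate_mission(DummyAgent(), mission, level=1))  # -> [True]
```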
ProjectEval introduces automated evaluation tools and heterogeneous software verification, enabling fine-grained comparison of model outputs across semantically equivalent input formats. This provides deeper insight into a model's understanding of end-to-end software development.
@misc{liu2025projectevalbenchmarkprogrammingagents,
title={ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation},
author={Kaiyuan Liu and Youcheng Pan and Yang Xiang and Daojing He and Jing Li and Yexing Du and Tianrun Gao},
year={2025},
eprint={2503.07010},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2503.07010},
}
- There is a request made during the execution evaluation; we are working on removing it.
- Full explanation of the ProjectEval projects.
- Java Version Canonical Answer.
- Leaderboard Update.