This guide provides step-by-step instructions for evaluating model-generated code patches against the Turing SWE-bench benchmark.
Before you begin, ensure you have the following installed:
- Python (3.10 or newer)
- git
- Docker: Ensure the Docker daemon is running before you start the evaluation.
Linux Docker Setup: If you are on Linux, we highly recommend following the post-installation steps to manage Docker as a non-root user.
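The prerequisites above can be checked before starting with a short script. This is a minimal sketch, not part of the harness: `check_prerequisites` is a hypothetical helper, and it only confirms that the executables are on `PATH`, not that the Docker daemon is actually running.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def check_prerequisites(required_tools=("git", "docker")):
    """Return a dict mapping each prerequisite name to True or False."""
    results = {"python": sys.version_info[:2] >= MIN_PYTHON}
    for tool in required_tools:
        # shutil.which only confirms the executable is on PATH;
        # it does not confirm that the Docker daemon is running.
        results[tool] = shutil.which(tool) is not None
    return results
```

Calling `check_prerequisites()` and printing the result gives a quick overview of what is missing.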
First, download the delivery folder, change into the SWE-Bench folder, and set up a local Python virtual environment:
cd SWE-Bench
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

Evaluation requires two key dataset files in .jsonl format.
This file contains the SWE-Bench dataset. You have two options:
Option 1: Local File
- Filename: turing-swe-bench-dataset.jsonl
- Origin: This file should be obtained from the delivery Google Drive. It contains the core information for each task instance.
Option 2: Hugging Face Dataset
- Dataset Name: TuringEnterprises/SWE-Bench-plus-plus
- Access: Public dataset available on Hugging Face Hub
Simply use the dataset name directly in the --dataset_name argument (see examples below).
This is the file you create, containing the patches generated by your model. Each line must be a single JSON object with the following structure:
- instance_id (string): A unique identifier in the format repo_owner__repo_name-pull_request_number. This must match an instance_id from the testbed dataset.
- model_name_or_path (string): An identifier for your model (e.g., "gpt-4-turbo").
- model_patch (string): The full diff/patch content generated by the model.
For testing purposes, we've included a sample predictions.jsonl in the top-level directory.
Prediction File Format Example:
{"instance_id": "sympy__sympy-20590", "model_name_or_path": "gpt-4", "model_patch": "diff --git a/sympy/core/sympify.py b/sympy/core/sympify.py\nindex 6a73a83..fb90e1a 100644\n--- a/sympy/core/sympify.py\n+++ b/sympy/core/sympify.py\n@@ -508,7 +508,7 @@ def sympify(a, locals=None, convert_xor=True, strict=False, rational=False,\n converter[type(a)],\n (SympifyError,\n OverflowError,\n- ValueError)):\n+ ValueError, AttributeError)):\n return a\n"}
{"instance_id": "another__repo-12345", "model_name_or_path": "gpt-4", "model_patch": "..."}
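One way to produce a file in this format is sketched below. `write_predictions` is a hypothetical helper, not part of the harness; it assumes you already hold each patch as a string and relies on `json.dumps` to escape embedded newlines so that every record stays on one line.

```python
import json
from pathlib import Path


def write_predictions(records, path, model_name):
    """Write (instance_id, patch) pairs as one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in records:
            row = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            # json.dumps escapes newlines inside the patch, so each
            # record occupies exactly one physical line.
            fh.write(json.dumps(row) + "\n")
    return path
```

Serializing with `json.dumps` rather than building the line by hand avoids quoting mistakes in multi-line diffs.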
The main evaluation script is swebench.harness.run_evaluation. It should be executed from the root of the SWE-Bench repository.
To evaluate your own model, simply point --predictions_path to your custom predictions file.
Using a local dataset file:
python -m swebench.harness.run_evaluation \
--dataset_name <path/to/turing-swe-bench-dataset.jsonl> \
--predictions_path <path/to/your/predictions.jsonl> \
--namespace "" \
--run_id <run_id>

Using the Hugging Face dataset:
python -m swebench.harness.run_evaluation \
--dataset_name TuringEnterprises/SWE-Bench-plus-plus \
--predictions_path <path/to/your/predictions.jsonl> \
--namespace "" \
--run_id <run_id> \
--turing_eval

| Argument | Description |
|---|---|
| --dataset_name | Path to the testbed .jsonl file, or the name of the Hugging Face dataset. |
| --predictions_path | Path to your model-generated predictions .jsonl file. |
| --run_id | A unique name for your evaluation run (e.g., gpt-4-turbo-run-1). This name will be used for the output log directory. |
| --namespace | The Docker Hub namespace for the environment images. Defaults to swe-bench. |
| --max_workers | The number of parallel processes to use. Defaults to the number of CPU cores. |
| --cache_level | Level of caching for Docker images. Defaults to env (cache base and environment images). |
| --clean | Whether to clean up resources after evaluation. Defaults to true. |
| --instance_ids | Specific instances to evaluate (comma-separated). |
| --timeout | Maximum time (seconds) for evaluating each instance. |
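If you launch evaluations from a script, the arguments above can be assembled into a command list. The following is a sketch under stated assumptions: `build_eval_command` is a hypothetical wrapper, it covers only a subset of the documented flags, and it passes an empty namespace by default to mirror the example commands.

```python
import sys


def build_eval_command(dataset_name, predictions_path, run_id,
                       namespace="", max_workers=None,
                       cache_level=None, timeout=None):
    """Build the argv list for swebench.harness.run_evaluation."""
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", str(dataset_name),
        "--predictions_path", str(predictions_path),
        "--namespace", namespace,
        "--run_id", run_id,
    ]
    # Optional flags are appended only when a value is supplied,
    # so the harness defaults apply otherwise.
    if max_workers is not None:
        cmd += ["--max_workers", str(max_workers)]
    if cache_level is not None:
        cmd += ["--cache_level", cache_level]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the root of the SWE-Bench repository.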
For a complete list of arguments, run:
python -m swebench.harness.run_evaluation --help

The --cache_level parameter controls how Docker images are cached between runs:
| Level | Description | Storage Impact | Speed |
|---|---|---|---|
| none | No caching | Minimal (~120GB during run) | Slowest |
| base | Cache only base image | Minimal (~120GB during run) | Slow |
| env (default) | Cache base and environment images | Moderate (~100GB) | Moderate |
| instance | Cache all images | High (~2,000GB) | Fastest |
Most users should use the default env level, which provides a good balance between speed and storage usage.
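To see which cache levels are plausible on a given machine, you can compare free disk space with the approximate figures from the table. This is a rough sketch: the numbers are copied from the table as-is, `levels_that_fit` and `free_space_gb` are hypothetical helpers, and actual usage will vary with the instances you evaluate.

```python
import shutil

# Approximate storage figures in GB, copied from the cache-level table.
CACHE_LEVEL_STORAGE_GB = {
    "none": 120,
    "base": 120,
    "env": 100,
    "instance": 2000,
}


def free_space_gb(path="."):
    """Free space, in GB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def levels_that_fit(free_gb, margin_gb=0):
    """Return the cache levels whose approximate figure fits in free_gb."""
    return [
        level
        for level, needed in CACHE_LEVEL_STORAGE_GB.items()
        if needed + margin_gb <= free_gb
    ]
```

Note that `free_space_gb` should be pointed at the filesystem where Docker stores its images, which may differ from the current directory.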
In some cases, you may want to build the Docker environment images for the dataset without running the evaluation.
By default, run_evaluation cleans up images after execution. To persist them, use the prepare_images utility:
python -m swebench.harness.prepare_images \
--dataset_name ../Dataset/turing-swe-bench-dataset.jsonl \
--tag turing_prebuilt_v1

| Argument | Description |
|---|---|
| --dataset_name | Path to the testbed .jsonl dataset file. |
| --tag | A custom tag to assign to the built Docker images. Use this to differentiate between different builds (e.g., turing_prebuilt_v1). |
python -m swebench.harness.prepare_images \
--dataset_name ../Dataset/turing-swe-bench-dataset.jsonl \
--tag turing_prebuilt_v1

After this command, the built images will remain available locally (they will not be deleted automatically). You can then run evaluations which will automatically use the already built images.
All evaluation artifacts are stored in the logs/ directory, organized by the unique --run_id that you provide for each evaluation. The final, aggregated report is generated as a .json file at the root of the repository.
The logs/ directory contains two main sub-directories: one for the Docker image build process and one for the evaluation runs themselves.
logs/
├── build_images/
│   └── instances/
│       └── {docker_env_instance_id}/
│           ├── Dockerfile
│           ├── build_image.log
│           └── setup_repo.sh
└── run_evaluation/
    └── {run_id}/
        └── {instance_id}/
            ├── report.json
            ├── run_instance.log
            ├── test_output_after.log
            ├── patch.diff
            └── eval.sh
This directory contains the files related to building the specific Docker environment for a given task. You should inspect these files if an instance fails very early with a Docker-related error.
| File | Purpose & How to Use It |
|---|---|
| Dockerfile | This is the exact Dockerfile generated by the harness to create the testing environment. Review this file to see which base image was used and what dependencies were installed. |
| build_image.log | Contains the complete log from the docker build command. Look here first for environment setup failures, such as a failed apt-get install or a Docker daemon error. |
| setup_repo.sh | An auxiliary script that is copied into the Docker image. It handles cloning the repository and checking out the correct commit. |
This is the most important directory for debugging. For each instance in your run, a folder is created containing a detailed breakdown of the evaluation process.
| File | Purpose & How to Use It |
|---|---|
| run_instance.log | The master log for the instance. This is the first file you should check for any failure. It contains high-level logs of the entire process: applying the patch, running the tests, and reporting the results. |
| report.json | A machine-readable summary of the final outcome for this single instance, including whether the task was resolved and other key metrics. |
| test_output_after.log | The raw, unfiltered output from the test command (e.g., pytest, mvn test). If run_instance.log shows that the tests ran but failed, this file will contain the specific error messages, stack traces, and test failures. |
| patch.diff | The exact patch generated by your model that was applied to the code before running the tests. Use this to verify that the patch was parsed correctly from your predictions file. |
| eval.sh | The shell script generated from the test specification that is executed inside the Docker container. This file shows the precise command used to run the tests. |
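When a run covers many instances, a quick scan for missing files can show where to start debugging. The sketch below assumes the logs/run_evaluation/{run_id}/{instance_id}/ layout shown in the directory tree; `missing_artifacts` is a hypothetical helper, and a missing file is only a hint about how far an instance got, not a verdict.

```python
from pathlib import Path

# File names taken from the per-instance directory listing.
EXPECTED_FILES = (
    "report.json",
    "run_instance.log",
    "test_output_after.log",
    "patch.diff",
    "eval.sh",
)


def missing_artifacts(run_id, logs_root="logs"):
    """Map each instance directory to the expected files it lacks."""
    run_dir = Path(logs_root) / "run_evaluation" / run_id
    if not run_dir.is_dir():
        raise FileNotFoundError(f"no log directory for run: {run_dir}")
    missing = {}
    for inst_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        absent = [n for n in EXPECTED_FILES if not (inst_dir / n).exists()]
        if absent:
            # Only instances with at least one missing file are reported.
            missing[inst_dir.name] = absent
    return missing
```

For any instance reported here, run_instance.log (if present) remains the first file to read.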
- Total Instances: Total number of problems in the testbed dataset.
- Instances Submitted: Number of instances for which your file provided a prediction.
- Instances Completed: Number of instances that ran to completion without crashing or timing out.
- Instances Resolved: The number of instances where the model's patch successfully passed the test suite.
- Resolution Rate: The percentage of completed instances that were successfully resolved (calculated as Resolved ÷ Completed × 100%).
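As a worked illustration of the formula, the function below computes the rate from the two counts. The counts in the usage note are made up for the example, and returning 0.0 when nothing completed is a choice made for this sketch, not documented harness behavior.

```python
def resolution_rate(resolved, completed):
    """Resolved / Completed * 100, as a percentage."""
    if completed == 0:
        # Assumption for this sketch: report 0.0 rather than divide by zero.
        return 0.0
    return resolved / completed * 100.0
```

For example, with 40 completed instances of which 30 were resolved, `resolution_rate(30, 40)` gives 75.0. The denominator is the number of completed instances, not the total or submitted count.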
If you encounter issues during evaluation, follow these steps:
- Ensure Docker is Running: The most common issue is the Docker daemon not being active or accessible.
- Verify Prediction File: Double-check that your predictions file is a valid .jsonl file (one complete, valid JSON object per line). Online JSONL validators can help.
- Examine Logs: The most detailed error information can be found within the run-specific log files inside logs/run_evaluation/<your_run_id>/.
- Debug with a Single Worker: If the script is crashing, running with a single worker provides clearer, sequential logs that make it easier to pinpoint the error. Add --max_workers 1 to your run command.
- Manage Disk Space: Evaluation can consume significant disk space. Periodically run docker system prune to clear unused Docker images and containers, or use the --cache_level=base flag to minimize the storage used for Docker images.
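The prediction-file check can also be done locally instead of with an online validator. This is a minimal sketch: `validate_predictions` is a hypothetical helper, the required keys come from the prediction format described earlier, and treating blank lines as problems is an assumption rather than confirmed harness behavior.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def validate_predictions(path, known_ids=None):
    """Return a list of (line_number, message) problems found in the file."""
    problems = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                problems.append((lineno, "blank line"))
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                # Two objects on one line surface here as "Extra data".
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            for key in REQUIRED_KEYS:
                if not isinstance(row.get(key), str):
                    problems.append((lineno, f"missing or non-string field: {key}"))
            iid = row.get("instance_id")
            if known_ids is not None and isinstance(iid, str) and iid not in known_ids:
                problems.append((lineno, f"instance_id not in dataset: {iid}"))
    return problems
```

Passing a set of instance IDs read from the dataset file as `known_ids` additionally flags predictions that do not match any task.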