Evaluation Guide

This guide provides step-by-step instructions for evaluating model-generated code patches against the Turing SWE-bench benchmark.

0. Prerequisites

Before you begin, ensure you have the following installed:

Python (3.10 or newer)
git
Docker: Ensure the Docker daemon is running before you start the evaluation.

Linux Docker setup: If you are on Linux, we highly recommend following Docker's post-installation steps to manage Docker as a non-root user.
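The prerequisites above can be checked with a short script. This is a minimal sketch, not part of the harness: the helper name `check_prerequisites` is hypothetical, and it only confirms the interpreter version and that the `git` and `docker` executables are on the PATH. It does not confirm that the Docker daemon is running.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite to True or False.

    Hypothetical helper: it checks the running interpreter's version and
    looks up the `git` and `docker` executables on PATH, nothing more.
    """
    return {
        "python": sys.version_info >= min_python,
        "git": shutil.which("git") is not None,
        "docker": shutil.which("docker") is not None,
    }
```

Calling `check_prerequisites()` and printing any key whose value is `False` gives a quick list of what still needs to be installed.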

1. Environment Setup

First, download the delivery folder, change into the SWE-Bench folder, and set up a local Python virtual environment:

cd SWE-Bench
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

2. Data Preparation

Evaluation requires two inputs: the testbed dataset and a model predictions file in .jsonl format.

A. Testbed Dataset (`--dataset_name`)

This is the SWE-Bench testbed dataset. You can supply it in one of two ways:

Option 1: Local File

Filename: turing-swe-bench-dataset.jsonl
Origin: This file should be obtained from the delivery Google Drive. It contains the core information for each task instance.

Option 2: Hugging Face Dataset

Dataset Name: TuringEnterprises/SWE-Bench-plus-plus
Access: Public dataset available on Hugging Face Hub

Simply use the dataset name directly in the --dataset_name argument (see examples below).

B. Model Predictions (`--predictions_path`)

This is the file you create, containing the patches generated by your model. Each line must be a single JSON object with the following structure:

instance_id (string): A unique identifier in the format repo_owner__repo_name-pull_request_number. This must match an instance_id from the testbed dataset.
model_name_or_path (string): An identifier for your model (e.g., "gpt-4-turbo").
model_patch (string): The full diff/patch content generated by the model.
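To make the `instance_id` format concrete, here is a small sketch that splits an identifier into its parts. The function name `parse_instance_id` is hypothetical, and the parsing rule is an assumption based only on the format described above: the owner ends at the first `__`, and the pull request number is the digits after the last `-`.

```python
def parse_instance_id(instance_id):
    """Split 'repo_owner__repo_name-pull_request_number' into its parts.

    Assumptions: the owner never contains '__', and the pull request
    number is the run of digits after the last '-'.
    """
    owner, separator, rest = instance_id.partition("__")
    repo, dash, number = rest.rpartition("-")
    if not (separator and dash and owner and repo and number.isdigit()):
        raise ValueError(f"unexpected instance_id format: {instance_id!r}")
    return owner, repo, int(number)
```

Splitting on the last `-` rather than the first keeps repository names that themselves contain hyphens intact.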

For testing purposes, we've included a sample predictions.jsonl in the top-level directory.

Prediction File Format Example:

{"instance_id": "sympy__sympy-20590", "model_name_or_path": "gpt-4", "model_patch": "diff --git a/sympy/core/sympify.py b/sympy/core/sympify.py\nindex 6a73a83..fb90e1a 100644\n--- a/sympy/core/sympify.py\n+++ b/sympy/core/sympify.py\n@@ -508,7 +508,7 @@ def sympify(a, locals=None, convert_xor=True, strict=False, rational=False,\n         converter[type(a)],\n         (SympifyError,\n          OverflowError,\n-         ValueError)):\n+         ValueError, AttributeError)):\n     return a\n"}
{"instance_id": "another__repo-12345", "model_name_or_path": "gpt-4", "model_patch": "..."}
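A predictions file in this format could be produced with a few lines of Python. The sketch below is illustrative only: `write_predictions` is a hypothetical helper, not a function provided by the harness, and it assumes your patches are already available as Python dictionaries.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def write_predictions(predictions, path):
    """Write one JSON object per line, keeping only the three required keys."""
    with open(path, "w", encoding="utf-8") as handle:
        for prediction in predictions:
            missing = [key for key in REQUIRED_KEYS if key not in prediction]
            if missing:
                raise KeyError(f"prediction is missing keys: {missing}")
            record = {key: prediction[key] for key in REQUIRED_KEYS}
            # json.dumps escapes the newlines inside the patch, so each
            # record stays on a single line of the output file.
            handle.write(json.dumps(record) + "\n")
```

Letting `json.dumps` handle the escaping avoids the most common formatting mistake, a raw newline inside `model_patch` that splits one record across several lines.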

3a. Run the Evaluation

The main evaluation script is swebench.harness.run_evaluation. It should be executed from the root of the SWE-Bench repository.

Evaluating Your Model

To evaluate your own model, simply point --predictions_path to your custom predictions file.

Using a local dataset file:

python -m swebench.harness.run_evaluation \
    --dataset_name <path/to/turing-swe-bench-dataset.jsonl> \
    --predictions_path <path/to/your/predictions.jsonl> \
    --namespace "" \
    --run_id <run_id>

Using the Hugging Face dataset:

python -m swebench.harness.run_evaluation \
    --dataset_name TuringEnterprises/SWE-Bench-plus-plus \
    --predictions_path <path/to/your/predictions.jsonl> \
    --namespace "" \
    --run_id <run_id> \
    --turing_eval

Command Breakdown

| Argument | Description |
| --- | --- |
| `--dataset_name` | Path to the testbed `.jsonl` file, or the name of the Hugging Face dataset. |
| `--predictions_path` | Path to your model-generated predictions `.jsonl` file. |
| `--run_id` | A unique name for your evaluation run (e.g., `gpt-4-turbo-run-1`). This name will be used for the output log directory. |
| `--namespace` | The Docker Hub namespace for the environment images. Defaults to `swe-bench`. |
| `--max_workers` | The number of parallel processes to use. Defaults to the number of CPU cores. |
| `--cache_level` | Level of caching for Docker images. Defaults to `env` (cache base and environment images). |
| `--clean` | Whether to clean up resources after evaluation. Defaults to true. |
| `--instance_ids` | Specific instances to evaluate (comma-separated). |
| `--timeout` | Maximum time (seconds) for evaluating each instance. |

For a complete list of arguments, run:

python -m swebench.harness.run_evaluation --help
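When launching runs from a script, the command can be assembled as an argument list. This is a sketch under stated assumptions: `build_eval_command` is a hypothetical helper, it uses only flags that appear in this guide, and it mirrors the examples above by passing an empty `--namespace` by default.

```python
import sys


def build_eval_command(dataset_name, predictions_path, run_id,
                       namespace="", max_workers=None, timeout=None,
                       turing_eval=False):
    """Assemble the run_evaluation command as a list of arguments."""
    command = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", predictions_path,
        "--namespace", namespace,
        "--run_id", run_id,
    ]
    # Optional flags are appended only when the caller sets them, so the
    # harness defaults apply otherwise.
    if max_workers is not None:
        command += ["--max_workers", str(max_workers)]
    if timeout is not None:
        command += ["--timeout", str(timeout)]
    if turing_eval:
        command.append("--turing_eval")
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the root of the SWE-Bench repository. Using a list rather than a shell string avoids quoting problems with paths that contain spaces.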

Cache Levels

The --cache_level parameter controls how Docker images are cached between runs:

| Level | Description | Storage Impact | Speed |
| --- | --- | --- | --- |
| `none` | No caching | Minimal (~120GB during run) | Slowest |
| `base` | Cache only base image | Minimal (~120GB during run) | Slow |
| `env` (default) | Cache base and environment images | Moderate (~100GB) | Moderate |
| `instance` | Cache all images | High (~2,000GB) | Fastest |

Most users should use the default env level, which provides a good balance between speed and storage usage.
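Before choosing a cache level, it can help to compare free disk space against the approximate figures in the table above. The sketch below is only a rough guide: the helper names are hypothetical, the numbers are copied from the table, and the path you check should be the filesystem where Docker stores its images, which may not be the current directory.

```python
import shutil

# Approximate storage figures in GB, copied from the cache-level table.
APPROX_STORAGE_GB = {"none": 120, "base": 120, "env": 100, "instance": 2000}


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def enough_space_for(cache_level, free_gb):
    """Return True if free_gb covers the approximate figure for the level."""
    return free_gb >= APPROX_STORAGE_GB[cache_level]
```

For example, `enough_space_for("env", free_disk_gb("/path/to/docker/data"))` gives a quick yes or no, where the path is a placeholder for your own Docker data directory.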

3b. Building and Persisting Images Only

In some cases, you may want to build the Docker environment images for the dataset without running the evaluation.

By default, run_evaluation cleans up images after execution. To persist them, use the prepare_images utility:

python -m swebench.harness.prepare_images \
    --dataset_name ../Dataset/turing-swe-bench-dataset.jsonl \
    --tag turing_prebuilt_v1

Command Breakdown

| Argument | Description |
| --- | --- |
| `--dataset_name` | Path to the testbed `.jsonl` dataset file. |
| `--tag` | A custom tag to assign to the built Docker images. Use this to differentiate between different builds (e.g., `turing_prebuilt_v1`). |

Example

python -m swebench.harness.prepare_images \
    --dataset_name ../Dataset/turing-swe-bench-dataset.jsonl \
    --tag turing_prebuilt_v1

After this command completes, the built images remain available locally (they are not deleted automatically). Subsequent evaluation runs will automatically use the already-built images.

4. Understanding the Output Directory

All evaluation artifacts are stored under the logs/ directory. Per-run evaluation logs live in logs/run_evaluation/, in a folder named after the unique --run_id you provide for each evaluation. The final, aggregated report is generated as a .json file at the root of the repository.

Directory Structure

The logs/ directory contains two main sub-directories: one for the Docker image build process and one for the evaluation runs themselves.

logs/
├── build_images/
│   └── instances/
│       └── {docker_env_instance_id}/
│           ├── Dockerfile
│           ├── build_image.log
│           └── setup_repo.sh
└── run_evaluation/
    └── {run_id}/
        └── {instance_id}/
            ├── report.json
            ├── run_instance.log
            ├── test_output_after.log
            ├── patch.diff
            └── eval.sh

File Explanations

Environment Build Logs (`logs/build_images/...`)

This directory contains the files related to building the specific Docker environment for a given task. You should inspect these files if an instance fails very early with a Docker-related error.

| File | Purpose & How to Use It |
| --- | --- |
| `Dockerfile` | This is the exact Dockerfile generated by the harness to create the testing environment. Review this file to see which base image was used and what dependencies were installed. |
| `build_image.log` | Contains the complete log from the `docker build` command. Look here first for environment setup failures, such as a failed `apt-get install` or a Docker daemon error. |
| `setup_repo.sh` | An auxiliary script that is copied into the Docker image. It handles cloning the repository and checking out the correct commit. |

Evaluation Run Logs (`logs/run_evaluation/{run_id}/{instance_id}/`)

This is the most important directory for debugging. For each instance in your run, a folder is created containing a detailed breakdown of the evaluation process.

| File | Purpose & How to Use It |
| --- | --- |
| `run_instance.log` | The master log for the instance. This is the first file you should check for any failure. It contains high-level logs of the entire process: applying the patch, running the tests, and reporting the results. |
| `report.json` | A machine-readable summary of the final outcome for this single instance, including whether the task was resolved and other key metrics. |
| `test_output_after.log` | The raw, unfiltered output from the test command (e.g., `pytest`, `mvn test`). If `run_instance.log` shows that the tests ran but failed, this file will contain the specific error messages, stack traces, and test failures. |
| `patch.diff` | The exact patch generated by your model that was applied to the code before running the tests. Use this to verify that the patch was parsed correctly from your predictions file. |
| `eval.sh` | The shell script generated from the test specification that is executed inside the Docker container. This file shows the precise command used to run the tests. |
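For runs with many instances, a short script can walk the run directory and flag instances whose files are incomplete. This is a sketch, not a harness utility: `summarize_run` is a hypothetical name, the expected file list is taken from the directory structure shown above, and the contents of `report.json` are loaded as-is because this guide does not specify its schema.

```python
import json
from pathlib import Path

# File names taken from the directory structure described in this guide.
EXPECTED_FILES = ("report.json", "run_instance.log", "test_output_after.log",
                  "patch.diff", "eval.sh")


def summarize_run(run_id, logs_root="logs"):
    """Map each instance folder to its missing files and parsed report.json."""
    run_dir = Path(logs_root) / "run_evaluation" / run_id
    summary = {}
    for instance_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED_FILES
                   if not (instance_dir / name).is_file()]
        report_path = instance_dir / "report.json"
        report = None
        if report_path.is_file():
            # No schema is assumed: the parsed JSON is returned unchanged.
            report = json.loads(report_path.read_text(encoding="utf-8"))
        summary[instance_dir.name] = {"missing": missing, "report": report}
    return summary
```

An instance with a missing `report.json` most likely failed before the tests finished, so its `run_instance.log` is the place to look first.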

Key Metrics

Total Instances: Total number of problems in the testbed dataset.
Instances Submitted: Number of instances for which your file provided a prediction.
Instances Completed: Number of instances that ran to completion without crashing or timing out.
Instances Resolved: The number of instances where the model's patch successfully passed the test suite.
Resolution Rate: The percentage of completed instances that were successfully resolved (calculated as Resolved ÷ Completed × 100%).
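The resolution rate formula above is simple enough to restate as code. In this sketch the function name is hypothetical, and returning 0.0 when no instances completed is an assumption made to avoid a division by zero; the harness may report that case differently.

```python
def resolution_rate(resolved, completed):
    """Resolved / Completed * 100, as a percentage.

    Assumption: returns 0.0 when nothing completed, to avoid dividing by zero.
    """
    if completed == 0:
        return 0.0
    return resolved / completed * 100
```

Note that the denominator is the number of completed instances, not the number submitted, so instances that crash or time out do not lower the rate.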

5. Troubleshooting

If you encounter issues during evaluation, follow these steps:

General Troubleshooting Steps

Ensure Docker is Running: The most common issue is the Docker daemon not being active or accessible.
Verify Prediction File: Double-check that your predictions file is a valid .jsonl file (one complete, valid JSON object per line). Online JSONL validators can help.
Examine Logs: The most detailed error information can be found in the run-specific log files inside logs/run_evaluation/<your_run_id>/.
Debug with a Single Worker: If the script is crashing, running with a single worker provides clearer, sequential logs that make it easier to pinpoint the error. Add --max_workers 1 to your run command.
Manage Disk Space: Evaluation can consume significant disk space. Periodically run docker system prune to clear unused Docker images and containers, or use the --cache_level=base flag to minimize the storage used for Docker images.
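The predictions file check from the steps above can also be done locally instead of with an online validator. The sketch below is an assumption-laden helper, not part of the harness: `validate_predictions` is a hypothetical name, it checks only the three fields this guide documents, and it treats blank lines as problems even though the harness may tolerate them.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def validate_predictions(path):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                problems.append(f"line {line_number}: blank line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: not a JSON object")
                continue
            for key in REQUIRED_KEYS:
                if not isinstance(record.get(key), str):
                    problems.append(
                        f"line {line_number}: '{key}' missing or not a string")
    return problems
```

Running this before a long evaluation catches malformed lines early, with the line number of each problem.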
