TuringEnterprises/SWE-Bench-plus-plus

Evaluation Guide

This guide provides step-by-step instructions for evaluating model-generated code patches against the Turing SWE-bench benchmark.

0. Prerequisites

Before you begin, ensure you have the following installed:

  • Python (3.10 or newer)
  • git
  • Docker: ensure the Docker daemon is running before you start the evaluation.

Linux Docker Setup: If you are on Linux, we highly recommend following Docker's post-installation steps so that you can manage Docker as a non-root user.
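The prerequisites above can be checked programmatically before a long run. The following is a minimal sketch, not part of the harness: the function name check_prerequisites and its parameters are hypothetical, and it assumes that `docker info` exits with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess
import sys


def check_prerequisites(min_python=(3, 10), check_daemon=True):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # The guide asks for Python 3.10 or newer.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")

    # git and docker must be discoverable on PATH.
    for tool in ("git", "docker"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # Assumption: `docker info` fails when the daemon is not reachable.
    if check_daemon and shutil.which("docker") is not None:
        try:
            result = subprocess.run(
                ["docker", "info"], capture_output=True, timeout=30
            )
            if result.returncode != 0:
                problems.append("Docker daemon is not running or not accessible")
        except (OSError, subprocess.TimeoutExpired):
            problems.append("could not query the Docker daemon")

    return problems
```

Calling check_prerequisites() and printing each returned string gives a quick pre-flight report; an empty list suggests the environment is ready.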

1. Environment Setup

First, download the delivery folder, change into the SWE-Bench folder, and set up a local Python virtual environment:

cd SWE-Bench
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

2. Data Preparation

Evaluation requires two key dataset files in .jsonl format.

A. Testbed Dataset (--dataset_name)

This file contains the SWE-Bench dataset. You have two options:

Option 1: Local File

  • Filename: turing-swe-bench-dataset.jsonl
  • Origin: This file should be obtained from the delivery Google Drive. It contains the core information for each task instance.

Option 2: Hugging Face Dataset

  • Dataset Name: TuringEnterprises/SWE-Bench-plus-plus
  • Access: Public dataset available on Hugging Face Hub

Simply use the dataset name directly in the --dataset_name argument (see examples below).

B. Model Predictions (--predictions_path)

This is the file you create, containing the patches generated by your model. Each line must be a single JSON object with the following structure:

  • instance_id (string): A unique identifier in the format repo_owner__repo_name-pull_request_number. This must match an instance_id from the testbed dataset.
  • model_name_or_path (string): An identifier for your model (e.g., "gpt-4-turbo").
  • model_patch (string): The full diff/patch content generated by the model.

For testing purposes, we've included a sample predictions.jsonl in the top-level directory.

Prediction File Format Example:

{"instance_id": "sympy__sympy-20590", "model_name_or_path": "gpt-4", "model_patch": "diff --git a/sympy/core/sympify.py b/sympy/core/sympify.py\nindex 6a73a83..fb90e1a 100644\n--- a/sympy/core/sympify.py\n+++ b/sympy/core/sympify.py\n@@ -508,7 +508,7 @@ def sympify(a, locals=None, convert_xor=True, strict=False, rational=False,\n         converter[type(a)],\n         (SympifyError,\n          OverflowError,\n-         ValueError)):\n+         ValueError, AttributeError)):\n     return a\n"}
{"instance_id": "another__repo-12345", "model_name_or_path": "gpt-4", "model_patch": "..."}

3a. Run the Evaluation

The main evaluation script is swebench.harness.run_evaluation. It should be executed from the root of the SWE-Bench repository.

Evaluating Your Model

To evaluate your own model, simply point --predictions_path to your custom predictions file.

Using a local dataset file:

python -m swebench.harness.run_evaluation \
    --dataset_name <path/to/turing-swe-bench-dataset.jsonl> \
    --predictions_path <path/to/your/predictions.jsonl> \
    --namespace "" \
    --run_id <run_id>

Using the Hugging Face dataset:

python -m swebench.harness.run_evaluation \
    --dataset_name TuringEnterprises/SWE-Bench-plus-plus \
    --predictions_path <path/to/your/predictions.jsonl> \
    --namespace "" \
    --run_id <run_id> \
    --turing_eval

Command Breakdown

  • --dataset_name: Path to the testbed .jsonl file, or the name of the Hugging Face dataset.
  • --predictions_path: Path to your model-generated predictions .jsonl file.
  • --run_id: A unique name for your evaluation run (e.g., gpt-4-turbo-run-1). This name will be used for the output log directory.
  • --namespace: The Docker Hub namespace for the environment images. Defaults to swe-bench.
  • --max_workers: The number of parallel processes to use. Defaults to the number of CPU cores.
  • --cache_level: Level of caching for Docker images. Defaults to env (caches base and environment images).
  • --clean: Whether to clean up resources after evaluation. Defaults to true.
  • --instance_ids: Specific instances to evaluate (comma-separated).
  • --timeout: Maximum time (in seconds) for evaluating each instance.

For a complete list of arguments, run:

python -m swebench.harness.run_evaluation --help
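When evaluations are launched from a script, the invocation can be assembled as an argument list. The sketch below uses only the flags shown in this guide; the function name build_eval_command is hypothetical, and whether --turing_eval is needed for a given dataset source should be confirmed with --help.

```python
import sys


def build_eval_command(dataset_name, predictions_path, run_id,
                       namespace="", max_workers=None, turing_eval=False):
    """Assemble the run_evaluation command as a list suitable for subprocess.run."""
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_name,
        "--predictions_path", predictions_path,
        "--namespace", namespace,  # the guide's examples pass an empty string
        "--run_id", run_id,
    ]
    if max_workers is not None:
        cmd += ["--max_workers", str(max_workers)]
    if turing_eval:
        # Appears as a value-less flag in the Hugging Face example.
        cmd.append("--turing_eval")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the root of the SWE-Bench repository. Using a list rather than a shell string keeps the empty --namespace value intact without any quoting.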

Cache Levels

The --cache_level parameter controls how Docker images are cached between runs:

Level          Description                        Storage Impact               Speed
none           No caching                         Minimal (~120GB during run)  Slowest
base           Cache only base image              Minimal (~120GB during run)  Slow
env (default)  Cache base and environment images  Moderate (~100GB)            Moderate
instance       Cache all images                   High (~2,000GB)              Fastest

Most users should use the default env level, which provides a good balance between speed and storage usage.

3b. Building and Persisting Images Only

In some cases, you may want to build the Docker environment images for the dataset without running the evaluation.

By default, run_evaluation cleans up images after execution. To persist them, use the prepare_images utility:

python -m swebench.harness.prepare_images \
    --dataset_name ../Dataset/turing-swe-bench-dataset.jsonl \
    --tag turing_prebuilt_v1

Command Breakdown

  • --dataset_name: Path to the testbed .jsonl dataset file.
  • --tag: A custom tag to assign to the built Docker images. Use this to differentiate between different builds (e.g., turing_prebuilt_v1).

After this command, the built images remain available locally (they are not deleted automatically). You can then run evaluations, which will automatically use the already-built images.

4. Understanding the Output Directory

All evaluation artifacts are stored in the logs/ directory. Per-instance run logs live under logs/run_evaluation/, in a folder named after the unique --run_id you provide for each evaluation. The final, aggregated report is generated as a .json file at the root of the repository.

Directory Structure

The logs/ directory contains two main sub-directories: one for the Docker image build process and one for the evaluation runs themselves.

logs/
├── build_images/
│   └── instances/
│       └── {docker_env_instance_id}/
│           ├── Dockerfile
│           ├── build_image.log
│           └── setup_repo.sh
└── run_evaluation/
    └── {run_id}/
        └── {instance_id}/
            ├── report.json
            ├── run_instance.log
            ├── test_output_after.log
            ├── patch.diff
            └── eval.sh

File Explanations

Environment Build Logs (logs/build_images/...)

This directory contains the files related to building the specific Docker environment for a given task. You should inspect these files if an instance fails very early with a Docker-related error.

  • Dockerfile: The exact Dockerfile generated by the harness to create the testing environment. Review this file to see which base image was used and what dependencies were installed.
  • build_image.log: The complete log from the docker build command. Look here first for environment setup failures, such as a failed apt-get install or a Docker daemon error.
  • setup_repo.sh: An auxiliary script that is copied into the Docker image. It handles cloning the repository and checking out the correct commit.

Evaluation Run Logs (logs/run_evaluation/{run_id}/{instance_id}/)

This is the most important directory for debugging. For each instance in your run, a folder is created containing a detailed breakdown of the evaluation process.

  • run_instance.log: The master log for the instance. This is the first file you should check for any failure. It contains high-level logs of the entire process: applying the patch, running the tests, and reporting the results.
  • report.json: A machine-readable summary of the final outcome for this single instance, including whether the task was resolved and other key metrics.
  • test_output_after.log: The raw, unfiltered output from the test command (e.g., pytest, mvn test). If run_instance.log shows that the tests ran but failed, this file will contain the specific error messages, stack traces, and test failures.
  • patch.diff: The exact patch generated by your model that was applied to the code before running the tests. Use this to verify that the patch was parsed correctly from your predictions file.
  • eval.sh: The shell script generated from the test specification that is executed inside the Docker container. This file shows the precise command used to run the tests.
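To find which instances deserve a closer look, it can help to scan a run directory for instance folders that lack one of the five files listed above. This is a minimal sketch assuming the directory layout shown in this guide; the function name missing_artifacts is hypothetical, and the guide does not state which files are absent after which kind of failure, so treat the result only as a pointer to where to start reading.

```python
from pathlib import Path

# File names taken from the directory structure described in this guide.
EXPECTED_FILES = (
    "report.json",
    "run_instance.log",
    "test_output_after.log",
    "patch.diff",
    "eval.sh",
)


def missing_artifacts(run_dir):
    """Map each instance folder name to the expected files it does not contain."""
    result = {}
    for instance_dir in sorted(Path(run_dir).iterdir()):
        if not instance_dir.is_dir():
            continue
        missing = [
            name for name in EXPECTED_FILES
            if not (instance_dir / name).is_file()
        ]
        if missing:
            result[instance_dir.name] = missing
    return result
```

For a run named demo-run, missing_artifacts("logs/run_evaluation/demo-run") returns an empty dictionary when every instance folder is complete.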

Key Metrics

  • Total Instances: Total number of problems in the testbed dataset.
  • Instances Submitted: Number of instances for which your file provided a prediction.
  • Instances Completed: Number of instances that ran to completion without crashing or timing out.
  • Instances Resolved: The number of instances where the model's patch successfully passed the test suite.
  • Resolution Rate: The percentage of completed instances that were successfully resolved (calculated as Resolved ÷ Completed × 100%).
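The resolution rate formula above can be written as a small helper. This is an illustrative sketch only; the function name is hypothetical, and returning 0.0 when nothing completed is an assumption made here to avoid a division by zero, not documented harness behavior.

```python
def resolution_rate(resolved, completed):
    """Resolved / Completed x 100, as defined in the Key Metrics list."""
    if completed == 0:
        # Assumption: report 0.0 rather than raising when no instance completed.
        return 0.0
    return resolved / completed * 100.0
```

With hypothetical counts of 40 completed and 30 resolved instances, resolution_rate(30, 40) gives 75.0. Note that the denominator is the number of completed instances, not the number submitted or the dataset total.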

5. Troubleshooting

If you encounter issues during evaluation, follow these steps:

General Troubleshooting Steps

  1. Ensure Docker is Running: The most common issue is the Docker daemon not being active or accessible.
  2. Verify Prediction File: Double-check that your predictions file is a valid .jsonl file (one complete, valid JSON object per line). Online JSONL validators can help.
  3. Examine Logs: The most detailed error information can be found in the run-specific log files inside logs/run_evaluation/<your_run_id>/.
  4. Debug with a Single Worker: If the script is crashing, running with a single worker provides clearer, sequential logs that make it easier to pinpoint the error. Add --max_workers 1 to your run command.
  5. Manage Disk Space: Evaluation can consume significant disk space. Periodically run docker system prune to clear unused Docker images and containers, or use the --cache_level=base flag to minimize the storage used for Docker images.
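As a local alternative to an online validator for step 2, the predictions file can be checked with a few lines of Python. This is a minimal sketch; the function name validate_predictions is hypothetical, and it checks only that each line is a JSON object whose three required keys hold strings, not that the instance_id exists in the testbed dataset.

```python
import json

REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def validate_predictions(path):
    """Return a list of error messages; an empty list means the file looks valid."""
    errors = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            if not line.strip():
                errors.append(f"line {lineno}: empty line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                errors.append(f"line {lineno}: expected a JSON object")
                continue
            for key in REQUIRED_KEYS:
                if not isinstance(record.get(key), str):
                    errors.append(f"line {lineno}: missing or non-string '{key}'")
    return errors
```

Running validate_predictions("predictions.jsonl") before starting an evaluation catches malformed lines early, before any Docker images are built.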
