SE-Bench is a diagnostic environment designed to rigorously measure an agent's ability to internalize novel knowledge, which is a foundational capability for true self-evolution.
First, create and activate a dedicated Conda environment, then install the project's required dependencies. Make sure the following prerequisites are available:
- Conda (Anaconda/Miniconda) installed
- Docker (required for the evaluation sandbox)
```bash
# Create a Conda environment named "se-bench" with Python 3.12
conda create -n se-bench python==3.12 -y

# Activate the Conda environment
conda activate se-bench

# Navigate to the SE-Bench project root directory
cd SE-Bench

# Install all required dependencies
pip install -r requirements.txt
```

You can load the dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset

dataset = load_dataset("jintailin/SE-Bench", "train")
# Data is in dataset['train']
print(dataset)

dataset = load_dataset("jintailin/SE-Bench", "single_test")
# Data is in dataset['train']
print(dataset)

dataset = load_dataset("jintailin/SE-Bench", "multiple_test")
# Data is in dataset['train']
print(dataset)
```

Alternatively, you can run the provided load_datasets.py script to download and save the data to the local directory structure:
```bash
python load_datasets.py
```

This will generate the following file structure:
| Path | Description | Usage |
|---|---|---|
| `datasets/train/api_doc.jsonl` | API documentation for the zwc package | Training material |
| `datasets/train/train.jsonl` | Training questions | Training material |
| `datasets/test/single_test.jsonl` | Single-function problems | Evaluation |
| `datasets/test/multiple_test.jsonl` | Multi-function composition problems | Evaluation |
Protocol: Train your model or agent using only the information provided in datasets/train/, then evaluate on problems in datasets/test/ without access to documentation. This tests whether the model has truly internalized the API knowledge.
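Each of these files is standard JSONL (one JSON object per line). As a minimal sketch, you can iterate over the local training questions like this (the exact field names inside each record are not assumed here, so the snippet only inspects them):

```python
import json

# Read the training questions produced by load_datasets.py, one JSON object per line
with open("datasets/train/train.jsonl", "r", encoding="utf-8") as f:
    train_records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(train_records)} training records")
print(train_records[0].keys())  # inspect the available fields of a record
```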
Before running the rollout scripts, you need to deploy the model (e.g., Qwen3-8B) locally. We support deployment via vLLM or SGLang at localhost:8800.
Option 1: Deploy with vLLM
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --port 8800 \
  --host localhost
```

Option 2: Deploy with SGLang
```bash
pip install "sglang[all]"

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --port 8800 \
  --host localhost
```
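Both backends expose an OpenAI-compatible API on port 8800, so you can sanity-check the deployment before running the rollout scripts. Below is a minimal sketch using the official `openai` Python client; the served model name `Qwen/Qwen3-8B` follows the commands above and may differ if your backend registers the model under another name:

```python
from openai import OpenAI

# vLLM and SGLang both serve an OpenAI-compatible API under /v1;
# the api_key can be any non-empty string for a local deployment
client = OpenAI(base_url="http://localhost:8800/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    temperature=0.6,
    max_tokens=64,
)
print(response.choices[0].message.content)
```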
Run the query_only.py script to perform inference using only the query content:

```bash
cd src

python query_only.py \
  --num_workers 1 \
  --input_path ../datasets/test/single_test.jsonl \
  --output_path ../rollout_results/query_only.jsonl \
  --model_name Qwen3-8B \
  --host localhost \
  --ports 8800 \
  --sample_count 1 \
  --temperature 0.6 \
  --max_length 8192
```

Run the query_doc.py script to perform inference with API documentation (requires specifying the document path):
```bash
cd src

python query_doc.py \
  --num_workers 1 \
  --input_path ../datasets/test/single_test.jsonl \
  --output_path ../rollout_results/query_doc.jsonl \
  --doc_path ../datasets/train/api_doc.jsonl \
  --model_name Qwen3-8B \
  --host localhost \
  --ports 8800 \
  --sample_count 1 \
  --temperature 0.6 \
  --max_length 8192
```

| Parameter | Description |
|---|---|
| `--num_workers` | Number of parallel worker processes (adjust based on hardware) |
| `--input_path` | Path to the input test dataset (JSONL format) |
| `--output_path` | Path to save rollout (inference) results |
| `--doc_path` | Path to the API documentation (required only for query_doc.py) |
| `--model_name` | Name of the model to use (e.g., Qwen3-8B) |
| `--host` | Host address of locally deployed models |
| `--ports` | Port(s) of locally deployed models |
| `--base_url` | Base URL for OpenAI-compatible APIs |
| `--api_key` | API key for OpenAI-compatible models |
| `--sample_count` | Number of samples to generate per input |
| `--temperature` | Sampling temperature (higher values = more random outputs) |
| `--max_length` | Maximum length of text generated by the model |
The evaluation phase requires building a Docker sandbox for safe code execution, followed by filtering correct inference trajectories.
The sandbox provides a secure environment for code execution and is deployed at http://localhost:8111 by default:
```bash
# Build and start the Docker sandbox (Docker must be running)
bash sandbox_build_and_run.sh
```

Note: To change the sandbox port, modify the port configuration in src/evaluation/worker.py.
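To confirm the sandbox is reachable before launching the evaluation, a minimal check is to open a connection to the default port. This only verifies that something is listening on localhost:8111; it does not assume any particular sandbox API route:

```python
import socket

# Verify that the sandbox is listening on localhost:8111 (the default port)
with socket.create_connection(("localhost", 8111), timeout=5):
    print("Sandbox is reachable on localhost:8111")
```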
After the sandbox is successfully started, execute the evaluation script to filter correct results:
```bash
cd src

python filter_correct_trajectory.py \
  --input_path ../rollout_results/query_only.jsonl \
  --num_workers 64 \
  --output_path ../evaluation_results/correct_trajectories.jsonl  # Optional: path to save correct trajectories
```
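For a quick pass rate, you can compare the filtered output against the rollout input. This sketch assumes the script is run from the repository root and that the filtered file keeps exactly one JSON record per correct trajectory:

```python
def count_lines(path):
    # Count non-empty JSONL records in a file
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

total = count_lines("rollout_results/query_only.jsonl")
correct = count_lines("evaluation_results/correct_trajectories.jsonl")
print(f"Pass rate: {correct}/{total} = {correct / total:.2%}")
```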
If you are not using our generation scripts (query_only.py or query_doc.py) and wish to evaluate your own model outputs, you must format your rollout results as a JSONL file. Each line should be a dictionary containing the following keys:

| Key | Description |
|---|---|
| `query` | The original question from the dataset. |
| `response` | The model's generation, containing the reasoning process and the execution code wrapped in Python code blocks. |
| `test_cases` | The original test cases from the dataset. Format: `[{"input": ..., "output": ...}, ...]`. |
| `right_exe_result` | The original ground-truth executable result string from the dataset. |
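As a rough illustration, one line of such a file could be written as follows; the field values below are placeholders, not real dataset content:

```python
import json

# A hypothetical rollout record with the four required keys (values are placeholders)
record = {
    "query": "Implement a function using the zwc package that ...",
    "response": "Reasoning about the problem...\n```python\nprint('solution code here')\n```",
    "test_cases": [{"input": "example input", "output": "example output"}],
    "right_exe_result": "example output",
}

# Append the record as one JSON line to your own rollout results file
with open("rollout_results/my_model_rollouts.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```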
Once your data is formatted correctly, you can directly run the evaluation script above.
- Adjust `--num_workers` based on your hardware resources (avoid overloading the system)
- The sandbox must remain running during the entire evaluation process
- All output paths will be created automatically if they do not exist
To be updated soon.