This project implements a ReAct (Reasoning and Acting) agent for software engineering tasks using large language models. The codebase is adapted from mini-swe-agent.
- Install dependencies:
  uv pip install -r requirements.txt
- Set up environment variables: create a .env file in the project root and add your OpenAI API key:
  OPENAI_API_KEY=your_api_key_here
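To check that the key is visible before launching a long run, a small stdlib-only helper can read the `.env` file. This is a sketch under assumptions: `load_env_file` and `require_api_key` are hypothetical names, and the skeleton may already load `.env` in its own way, so treat this as a sanity check and not as how the project does it.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _sep, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path=".env"):
    """Return OPENAI_API_KEY from the process environment or the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a readable message instead of surfacing an authentication error midway through a run.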
We provide skeleton code for you to implement your ReAct agent. Refer to CODE.md for details on the code structure and what you need to implement.
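For orientation, the reason-act-observe cycle behind a ReAct agent can be sketched as below. This is a generic illustration and not the skeleton described in CODE.md: `run_react`, `scripted_model`, the tool registry, and the `ACTION`/`FINISH` reply format are all hypothetical, and the model is a scripted stand-in so the example runs without an API key.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]


def run_react(
    model: Callable[[List[str]], str],
    tools: Dict[str, Tool],
    task: str,
    max_steps: int = 100,
) -> Tuple[str, List[str]]:
    """Alternate model replies (thought + action) with tool observations."""
    history = [f"Task: {task}"]
    for _step in range(max_steps):
        reply = model(history)
        history.append(reply)
        # Assumed format: the last line is "ACTION <tool> <arg>" or "FINISH <answer>".
        lines = reply.strip().splitlines()
        last_line = lines[-1] if lines else ""
        kind, _sep, rest = last_line.partition(" ")
        if kind == "FINISH":
            return rest, history
        if kind == "ACTION":
            name, _sep, arg = rest.partition(" ")
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool: {name}"
        else:
            observation = "could not parse an action"
        history.append(f"Observation: {observation}")
    # Step budget exhausted without a FINISH line.
    return "", history


def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM call: act once, then finish with the observation."""
    if history[-1].startswith("Observation: "):
        return "Thought: I have the result.\nFINISH " + history[-1].split(": ", 1)[1]
    return "Thought: I should call the echo tool.\nACTION echo hello"


answer, trace = run_react(scripted_model, {"echo": str.upper}, "demo", max_steps=5)
```

The step cap plays the same role as the `--max-steps` flag: it bounds the loop when the model never emits a final answer.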
To run the ReAct agent on SWE-bench instances:
python run_agent.py --model gpt-5-mini --max-steps 100 --outputs results
The agent will process SWE-bench instances and save results to the results/ directory.
Note: We suggest testing the agent on a single instance first by setting instances = instances[:1] in run_agent.py.
After generating predictions, run the SWE-bench evaluation harness to evaluate the submissions:
python -m swebench.harness.run_evaluation \
    --dataset_name lynnliu030/swebench-eval-subset \
    --predictions_path ./results/preds.json \
    --max_workers 8 \
    --run_id my_evaluation_run
The evaluation will generate a results file with the following structure:
{
  "total_instances": 20,
  "submitted_instances": 20,
  "completed_instances": 19,
  "resolved_instances": 9,
  "unresolved_instances": 10,
  "empty_patch_instances": 1,
  "error_instances": 0,
  "completed_ids": ["astropy__astropy-7166", ...],
  "resolved_ids": ["astropy__astropy-7166", ...],
  "unresolved_ids": ["django__django-10973", ...],
  "schema_version": 2
}
After optimizing your agent, submit the following to the submission server:
- Must contain everything needed to build and run an end-to-end evaluation
- Do not commit secrets/keys
- Include setup instructions in README
- Ensure reproducible environment setup
Your report should:
- Report your accuracy number
- Describe the custom tools you created and explain the reason behind making them
- Share the lessons you learned
Also submit the evaluation result file, in the format shown above, containing your agent's performance metrics on the SWE-bench subset.
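The accuracy number for the report can be derived from the counts in the evaluation result file. The sketch below assumes accuracy means resolved instances divided by total instances; that definition is an assumption, so check it against the assignment's grading rules. `resolve_rate` and `load_report` are hypothetical helper names, and the example dictionary reuses the counts from the sample output above (ID lists omitted).

```python
import json


def load_report(path):
    """Read an evaluation result file; the path depends on your run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def resolve_rate(report):
    """Fraction of all instances that were resolved (assumed accuracy definition)."""
    total = report["total_instances"]
    if total == 0:
        return 0.0
    return report["resolved_instances"] / total


# Counts copied from the sample result file shown earlier.
example_report = {
    "total_instances": 20,
    "submitted_instances": 20,
    "completed_instances": 19,
    "resolved_instances": 9,
    "unresolved_instances": 10,
    "empty_patch_instances": 1,
    "error_instances": 0,
}

print(f"Resolved: {resolve_rate(example_report):.1%}")  # prints "Resolved: 45.0%"
```

Dividing by the total, not by completed instances, means empty patches and errors count against the score.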
Good luck optimizing your SWE agent! 🤖