SWE-bench for your codebase.
Turn your merged PRs into reproducible coding-agent benchmarks. Find out which AI coding agent actually works on your repo, your tests, your constraints — with structured, replayable, diffable run artifacts instead of opaque chat logs.
Today: mock-fix (oracle baseline), claude-code (native Anthropic CLI), and aider (multi-vendor: Opus 4.7, GPT-5.5, Sonnet 4.6, Gemini 3.1 Pro all verified). Codex CLI and Gemini CLI native adapters are next — see Roadmap.
Three Click bug-fix PRs × four current vendor flagships (driven by aider) + mock-fix oracle + native claude-code. Tasks were mined with repoagentbench infer and run with the harness's default settings. Regenerate with python scripts/make_leaderboard_chart.py.
| Model / Agent | #3299 | #3240 | #3364 | Pass rate |
|---|---|---|---|---|
| mock-fix (oracle, applies actual PR diff) | PASS | PASS | PASS | 3 / 3 |
| aider + anthropic/claude-opus-4-7 | FAIL | PASS | PASS | 2 / 3 |
| aider + gemini/gemini-3.1-pro-preview | FAIL | PASS | PASS | 2 / 3 |
| aider + anthropic/claude-sonnet-4-6 | FAIL | FAIL | PASS | 1 / 3 |
| aider + openai/gpt-5.5 | PASS | FAIL | FAIL | 1 / 3 |
| claude-code (native CLI) | PASS | — | — | 1 / 1 |
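No frontier model passed all three PRs, and the failures do not line up: Opus, Gemini, and Sonnet fail #3299, Sonnet and GPT-5.5 fail #3240, and GPT-5.5 also fails #3364.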
The bug: default_value == "" raises ValueError when default_value is an object with a strict __eq__. The whole PR is a one-line fix:
- elif default_value == "":
+ elif isinstance(default_value, str) and default_value == "":

What the four frontier flagships produced (full diffs in .runs/<run_id>/diff.patch):
✓ openai/gpt-5.5 -- canonical, identical to PR
- elif default_value == "":
+ elif isinstance(default_value, str) and default_value == "":
✓ claude-code (Anthropic, native CLI) -- canonical, identical to PR
- elif default_value == "":
+ elif isinstance(default_value, str) and default_value == "":
✗ aider + anthropic/claude-opus-4-7 -- a literal no-op
- elif default_value == "":
+ elif default_value == str(): # str() evaluates to ""
# so this is the same comparison
# (comment claims it "avoids TypeError")
✗ aider + anthropic/claude-sonnet-4-6 -- caught the wrong exception
- elif default_value == "":
+ try:
+ is_empty_string = default_value == ""
+ except TypeError: # but the test raises ValueError
+ is_empty_string = False # so the test still fails
✗ aider + gemini/gemini-3.1-pro-preview -- patched the wrong function
# Line 2408 in _value_is_missing(), nowhere near the bug:
- if (self.nargs != 1 or self.multiple) and value == ():
+ if (self.nargs != 1 or self.multiple) and (value == () or value == ""):
# The actual bug is at line 3113 in get_help_extra() — untouched.

Important
Same model family, different harness, different outcome. Both rows above used Anthropic models — aider + Opus 4.7 (FAIL, no-op fix) vs claude-code (PASS, canonical fix) on the identical PR. The agent harness is part of the system under test, not a passthrough. This is the kind of finding you cannot get from a model leaderboard.
- PR #3240 / PR #3364: the win/loss pattern flips. Opus and Gemini pass both, GPT-5.5 fails both, and Sonnet fails #3240 but passes #3364. The "winner" on one PR is the "loser" on the next.
These are real-codebase failure modes you cannot see on public benchmarks. They surface here because every run executes the project's real test suite against each agent's actual diff — captured in diff.patch, agent.log, and events.jsonl for inspection (see How it works).
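The #3299 failure modes above can be reproduced outside Click. The sketch below is illustrative only: `StrictEq` is a hypothetical stand-in for a default value whose `__eq__` rejects foreign types, and the four check functions are simplified versions of the comparisons shown in the diffs, not Click's actual code.

```python
class StrictEq:
    """Hypothetical default value whose __eq__ refuses to compare with other types."""

    def __eq__(self, other):
        if not isinstance(other, StrictEq):
            raise ValueError("StrictEq can only be compared with StrictEq")
        return True


def original(default_value):
    # Pre-PR check: calls StrictEq.__eq__, which raises ValueError.
    return default_value == ""


def canonical(default_value):
    # PR fix: isinstance() short-circuits, so __eq__ is never called.
    return isinstance(default_value, str) and default_value == ""


def noop_variant(default_value):
    # str() is "", so this is the same comparison as the original.
    return default_value == str()


def wrong_exception_variant(default_value):
    # Guards against TypeError, but __eq__ raises ValueError.
    try:
        return default_value == ""
    except TypeError:
        return False


def raises_value_error(check, value):
    """Return True if check(value) raises ValueError."""
    try:
        check(value)
    except ValueError:
        return True
    return False


for check in (original, canonical, noop_variant, wrong_exception_variant):
    print(check.__name__, "raises" if raises_value_error(check, StrictEq()) else "ok")
```

Only `canonical` avoids the exception, which matches the PASS / FAIL split in the #3299 column.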
Note
Total real-API spend on this leaderboard: ~$11. Reproduce with repoagentbench run-one --task tasks/click-pr-3299 --agent aider (set RAB_AIDER_MODEL and the appropriate vendor key first).
Public benchmarks tell you which agent wins on curated, generic tasks. They do not tell you which agent works on the codebase you actually maintain. And recent research argues those benchmarks are increasingly compromised:
- "Saving SWE-Bench" (arxiv:2510.08996, Jan 2026) — public benchmarks overestimate agent capability by 20–50%.
- "Does SWE-Bench-Verified Test Agent Ability or Model Memory?" (arxiv:2512.10218, Dec 2025) — frontier models perform 3× better on SWE-Bench-Verified than on benchmarks built from training-cutoff-fresh PRs, suggesting heavy training-data overlap.
RepoAgentBench dodges both problems. It is local-first: your code never leaves your machine. The differentiator: every merged PR can become a benchmark task. PR description → goal. PR tests → acceptance criteria. Diff (split into test and source halves) → broken starting state. Mine PRs that post-date the model's training cutoff and you have a contamination-free benchmark of your own.
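One way to apply the cutoff rule above is to filter candidate PRs by merge date before mining them. This is a minimal sketch, assuming you already have merge dates for your PRs (for example, exported from the gh CLI). The `MergedPR` type, the `fresh_prs` helper, the placeholder URLs, and the cutoff date are all hypothetical and not part of RepoAgentBench.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MergedPR:
    url: str
    merged_on: date


def fresh_prs(prs, model_cutoff):
    """Keep only PRs merged strictly after the model's training cutoff."""
    return [pr for pr in prs if pr.merged_on > model_cutoff]


# Placeholder data; substitute your own PR list and the cutoff of the model under test.
candidates = [
    MergedPR("https://example.com/org/repo/pull/101", date(2025, 3, 14)),
    MergedPR("https://example.com/org/repo/pull/202", date(2026, 2, 1)),
]
cutoff = date(2025, 9, 1)  # assumed cutoff, for illustration only

fresh = fresh_prs(candidates, cutoff)
for pr in fresh:
    print(pr.url)
```

Each surviving URL could then be passed to `repoagentbench infer --from-pr`.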
30-second smoke test (no API key, mock-fix oracle just applies a known diff):
pip install repoagentbench
git clone https://github.com/HumphreySun98/repoagentbench.git && cd repoagentbench
repoagentbench run-one --task examples/demo --agent mock-fix

Real agent on a real PR:
# Mine a benchmark task from any merged GitHub PR (gh CLI required)
repoagentbench infer --from-pr https://github.com/pallets/click/pull/3299 --out tasks/click-3299
# Option A — Aider (multi-vendor; recommended for the leaderboard)
conda create -n aider-rab python=3.11 -y && conda run -n aider-rab pip install aider-chat
export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY / GEMINI_API_KEY
RAB_AIDER_MODEL=anthropic/claude-opus-4-7 \
repoagentbench run-one --task tasks/click-3299 --agent aider
# Option B — Claude Code (native)
repoagentbench run-one --task tasks/click-3299 --agent claude-code
repoagentbench report # markdown leaderboard of every run

Aider model is set via RAB_AIDER_MODEL (any LiteLLM-compatible string). Aider binary discovery: RAB_AIDER_BIN env var > ~/miniforge/envs/aider-rab/bin/aider > $PATH. Claude Code is auto-discovered from VSCode/Cursor extension dirs.
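The Aider discovery order can be pictured as a three-step fallback. The sketch below is not the harness's code: `find_aider` is a hypothetical helper, and checking the conda path with `is_file()` is an assumption about how "found" is decided.

```python
import os
import shutil
from pathlib import Path


def find_aider(env=None, home=None, which=shutil.which):
    """Resolve the aider binary: env override, then the conda env, then $PATH."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # 1. An explicit override wins.
    override = env.get("RAB_AIDER_BIN")
    if override:
        return override

    # 2. The conda environment created in the quickstart.
    conda_bin = home / "miniforge" / "envs" / "aider-rab" / "bin" / "aider"
    if conda_bin.is_file():
        return str(conda_bin)

    # 3. Fall back to whatever is on $PATH (None if nothing is found).
    return which("aider")


print(find_aider())
```

The `env`, `home`, and `which` parameters exist only so the lookup order can be exercised without touching the real environment.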
Mining a PR into a task (infer):
- Pull the PR title, body, base SHA, and unified diff via gh.
- Clone the repo at the PR's base commit. .git is preserved so projects using setuptools_scm / hatch-vcs install cleanly.
- Split the PR diff into solution_tests.patch and solution_source.patch. Apply the test portion to the task folder so the starting state is "post-PR tests vs pre-PR source" — without this, pre_verify would trivially pass on PRs that add new tests.
- Auto-generate verify.sh for the detected test framework. For pytest projects, the generated script discovers and installs from requirements*.txt, [project.optional-dependencies], and PEP 735 [dependency-groups].
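The test/source split in the third step can be sketched as a pass over the unified diff that routes each per-file section by path. This is an approximation, not the harness's implementation: `is_test_path` is a hypothetical heuristic, and the header regex assumes paths without spaces.

```python
import re

# Matches the per-file header that git emits, e.g. "diff --git a/x.py b/x.py".
_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def is_test_path(path):
    """Hypothetical heuristic: anything under tests/ or named like a test module."""
    parts = path.split("/")
    name = parts[-1]
    return (
        any(p in ("tests", "test") for p in parts[:-1])
        or name.startswith("test_")
        or name.endswith("_test.py")
    )


def split_patch(diff_text):
    """Return (tests_patch, source_patch) from one unified git diff."""
    test_chunks, source_chunks = [], []
    current = None
    for line in diff_text.splitlines(keepends=True):
        match = _HEADER.match(line.rstrip("\n"))
        if match:
            # A new file section starts; pick the bucket from its "b/" path.
            current = test_chunks if is_test_path(match.group(2)) else source_chunks
        if current is not None:
            current.append(line)
    return "".join(test_chunks), "".join(source_chunks)


EXAMPLE = """\
diff --git a/src/pkg/core.py b/src/pkg/core.py
--- a/src/pkg/core.py
+++ b/src/pkg/core.py
@@ -1 +1 @@
-old
+new
diff --git a/tests/test_core.py b/tests/test_core.py
--- a/tests/test_core.py
+++ b/tests/test_core.py
@@ -1 +1,2 @@
 import pkg
+assert pkg
"""

tests_patch, source_patch = split_patch(EXAMPLE)
```

Applying only `tests_patch` to the base commit yields the broken starting state: new tests, old source.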
Running a task (run-one):
- Copy the task into a fresh .runs/<run_id>/workdir/.
- Bootstrap a per-task venv (.venv-rab/) inside the workdir. Pip-install the project under test there — never into your system Python.
- Run verify.sh once (pre_verify, must FAIL — establishes a broken baseline).
- Invoke the agent against the goal.
- Run verify.sh again (post_verify, must PASS — proves the fix).
- Emit a self-describing artifact bundle (see below).
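The fail-then-pass contract in the steps above can be written as a small state machine. The sketch below is a simplification under stated assumptions: `verify` and `agent` are plain callables standing in for verify.sh and the agent invocation, and the `"passed"` / `"failed"` values and stage names are hypothetical labels, not the harness's actual status.json vocabulary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunOutcome:
    status: str                   # "passed" or "failed" (hypothetical labels)
    failure_stage: Optional[str]  # step that broke the contract, if any


def run_lifecycle(verify: Callable[[], bool], agent: Callable[[], None]) -> RunOutcome:
    """pre_verify must fail, the agent runs, then post_verify must pass."""
    if verify():
        # Tests already pass, so there is no broken baseline to fix.
        return RunOutcome("failed", "pre_verify")
    try:
        agent()
    except Exception:
        return RunOutcome("failed", "agent")
    if not verify():
        return RunOutcome("failed", "post_verify")
    return RunOutcome("passed", None)


# Toy task: the "agent" flips a flag that the fake verify step checks.
state = {"fixed": False}
outcome = run_lifecycle(
    verify=lambda: state["fixed"],
    agent=lambda: state.update(fixed=True),
)
print(outcome)
```

Treating a passing pre_verify as a failure is what keeps a task that was never broken from counting as a win.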
Aggregating runs (report / replay / diff):
repoagentbench report # markdown leaderboard of every run
repoagentbench report --task click-pr-3299 # filter to one task
repoagentbench replay --run <id_or_prefix> # re-run same task+agent (variance check)
repoagentbench diff --run <a> --run <b> # side-by-side comparison

Run artifact layout
.runs/<ISO_ts>__<task>__<agent>__<short_id>/
manifest.json # run_id, task_id, agent, base_commit, started_at, harness_version
status.json # final outcome: status, failure_stage, summary, durations
verification.json # pre/post phases: command, passed, exit_code, duration_seconds
events.jsonl # streaming lifecycle events with ms timestamps
pre_verify.log # raw test output before the agent ran (must fail)
post_verify.log # raw test output after the agent ran (must pass)
agent.log # what the agent did (stdout + stderr)
diff.patch # what the agent actually changed
venv_bootstrap.log # per-task venv setup output
workdir/ # isolated copy of the task (with `.venv-rab/` inside)
Run ids are sortable and human-readable: 20260429T060601Z__click-pr-3299__mock-fix__daf638.
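Because the run id is a double-underscore-separated string, it can be taken apart without opening any artifact. The parser below is a sketch, assuming exactly four fields and that task and agent names never contain `__`; `RunId` and `parse_run_id` are hypothetical names, not part of the harness.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunId:
    started_at: datetime
    task: str
    agent: str
    short_id: str


def parse_run_id(run_id):
    """Split '<ISO_ts>__<task>__<agent>__<short_id>' into its fields."""
    parts = run_id.split("__")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields separated by '__', got {len(parts)}")
    ts, task, agent, short_id = parts
    # The timestamp is compact ISO 8601 in UTC, e.g. 20260429T060601Z.
    started = datetime.strptime(ts, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return RunId(started, task, agent, short_id)


parsed = parse_run_id("20260429T060601Z__click-pr-3299__mock-fix__daf638")
print(parsed.task, parsed.agent, parsed.started_at.isoformat())
```

Since the timestamp comes first and is fixed-width, a plain string sort of run directories is also a chronological sort.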
Warning
This is v0.1.0 / early alpha. Concrete things you should know before reading too much into the leaderboard:
- n=3 is small. The Click sweep above is enough to falsify "Model X is best" but not enough to rank models. The point of the project is to let you build your own benchmark on your own PRs — not to publish a definitive ranking from this README.
- Single project, single language so far. The verified leaderboard is all Python (Click). The verify.sh generator covers Go / Cargo / npm too, but those paths haven't been stress-tested end-to-end. If you try a non-Python repo and hit a snag, file an issue.
- Aider is one harness among many. Aider's repo-map / edit format / summarizer all influence outcomes. The native claude-code row in the leaderboard already shows this. Codex CLI and Gemini CLI native adapters are next (v0.2).
- PR selection bias. I picked PRs that compile, have a clean test diff, and don't gate on optional deps. About 30–50% of merged PRs in a typical Python repo will fail one of those checks today; better mining heuristics are roadmap work.
- No statistical confidence yet. Pass / fail is the metric; bootstrap CIs and run-to-run variance estimation are v0.3.
If any of these would change your interpretation of the leaderboard, please tell me — happy to adjust the README or the harness.
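To make the n=3 caveat concrete, here is a percentile-bootstrap sketch of the kind of interval planned for v0.3. It is not the harness's implementation: `bootstrap_pass_rate_ci` is a hypothetical helper, and the resample count, seed, and percentile method are arbitrary choices.

```python
import random


def bootstrap_pass_rate_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a pass rate over 0/1 outcomes."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the outcomes with replacement and record each resample's pass rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# A "2 / 3" row from the leaderboard: one FAIL, two PASS.
low, high = bootstrap_pass_rate_ci([0, 1, 1])
print(f"95% interval for 2/3: [{low:.2f}, {high:.2f}]")
```

With three outcomes, a resample can only have a pass rate of 0, 1/3, 2/3, or 1, and for a 2 / 3 row the 95% interval stretches from 0 to 1. That is why the sweep cannot rank models.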
| | SWE-bench | CodeScaleBench | RepoAgentBench |
|---|---|---|---|
| Tasks | 2,294 curated | 275 curated | mined from your PRs |
| Codebase | 12 OSS repos | enterprise OSS | your repo |
| Distribution | public dataset | public dataset | local-first |
| Training-data contamination | known issue [1] | known issue | avoidable (mine PRs after model cutoff) |
| Question answered | which model is strongest in general | which agent leverages context tools well | which agent works on this codebase |
- v0.1.0 — single-task runner, PR-to-task mining (test/source split, .git preserved, PEP 735 dep-groups), per-task venv isolation, structured run-dir (manifest.json / events.jsonl / verification.json), report / replay / diff subcommands, adapters for mock-fix / claude-code / aider (4 frontier models verified). Full version history.
- v0.2 — Codex CLI + Gemini CLI native adapters; parallel multi-agent eval; HTML report.
- v0.3 — bootstrap confidence intervals, pairwise statistical comparison, run-to-run variance estimation.
- v0.4 — real-repo demo suite (3+ OSS repos × historical PRs × all adapters); permission modes (readonly / workspace-write / bypass).
MIT
Haofei Sun — [email protected] · github.com/HumphreySun98
Reach out about agent-eval / devtools / infra roles, RepoAgentBench feedback, or contributions.
Footnotes
1. Bian et al., 2025 — "Does SWE-Bench-Verified Test Agent Ability or Model Memory?" — finds frontier models score 3× higher on SWE-Bench-Verified than on equivalent training-fresh tasks.
