AweAI-Team/BeyondSWE

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Paper | Hugging Face Datasets | Harbor-format Hugging Face Datasets | Scaffold | Evaluation Framework | Website | License

BeyondSWE evaluates code agents along two key dimensions — resolution scope and knowledge scope — moving beyond single-repo bug fixing into the real-world deep waters of software engineering.

We also introduce SearchSWE, a framework that integrates deep research capabilities with coding agents.

⚠️ NOTE

We provide a ready-to-use evaluation framework in AweAgent/recipes/beyond_swe and pre-built Docker environments with the dataset on Hugging Face.

Feel free to open an issue or reach out by email for any questions.

📰 News

  • [2026-03-22] 📝 Released the trajectories and analysis markdown for "Codex + GPT5.4 xhigh", and for OpenHands and SearchSWE with DeepSeek-V4-Pro
  • [2026-03-22] 📊 Updated leaderboard with latest frontier model results: GLM-5, Kimi-K2.5, MiniMax-M2.5, and Claude Code (Opus 4.6). Check out the BeyondSWE Leaderboard!
  • [2026-03-22] 🔧 Released Harbor-format dataset for evaluating coding agents (e.g., Claude Code) on BeyondSWE via the Harbor framework.
  • [2026-03-03] 🎉 Our paper BeyondSWE is now on arXiv.
  • [2026-03-01] 🎉 BeyondSWE benchmark and SearchSWE framework released!

✨ Highlights

  • 🗂️ 500 real-world instances across 246 GitHub repositories, spanning four distinct task settings
  • 📐 Two-dimensional evaluation — simultaneously expands both resolution scope (local → global) and knowledge scope (within-repo → cross-repo / domain / web)
  • 📊 18x more complex than SWE-bench Verified — 5.6 files and 209.9 lines per instance on average (vs. 1.3 files / 11.6 lines)
  • 🔍 SearchSWE framework — first standardized benchmark for evaluating deep research in coding, with rigorous anti-cheating mechanisms
  • 🔑 Key finding — frontier models plateau below 45% on BeyondSWE, despite achieving 80%+ on SWE-bench Verified
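
The "18x" figure in the highlights appears to follow from the lines-per-instance averages quoted next to it. A quick check of the arithmetic, using only the numbers stated above (the variable names are illustrative):

```python
# Per-instance averages quoted in the highlights above.
beyondswe = {"files": 5.6, "lines": 209.9}
swebench_verified = {"files": 1.3, "lines": 11.6}

lines_ratio = beyondswe["lines"] / swebench_verified["lines"]
files_ratio = beyondswe["files"] / swebench_verified["files"]

print(f"lines changed per instance: {lines_ratio:.1f}x")  # 18.1x
print(f"files touched per instance: {files_ratio:.1f}x")  # 4.3x
```

The multiplier therefore matches the ratio of changed lines; the ratio of touched files is smaller, at roughly 4.3x.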

📋 Benchmark Overview

BeyondSWE covers four task settings that span the full spectrum of real-world software engineering challenges:

| Task | Resolution Scope | Knowledge Scope | #Repos | #Instances | Description |
|---|---|---|---|---|---|
| 🔗 CrossRepo | Local Function | Cross-Repository | 67 | 200 | Fix issues that require consulting external repositories, Stack Overflow, and upstream libraries |
| 🧬 DomainFix | Local Function | Domain-Specific | 12 | 72 | Solve bugs in specialized scientific domains (quantum physics, bioinformatics, etc.) requiring expert knowledge |
| 🕊️ DepMigrate | Global Repository | Official Docs | 120 | 178 | Perform codebase-wide migration triggered by breaking dependency upgrades (e.g., NumPy 1.x → 2.0) |
| 📝 Doc2Repo | Global Repository | Human Spec | 50 | 50 | Build an entire functional repository from a natural language specification |

Comparison with Existing Benchmarks


🔍 SearchSWE Framework

SearchSWE augments code agents with deep research capabilities, enabling them to interleave web search and code reasoning — just like real developers do.

Key Components:

  • SearchTool — Query web search engines during task solving
  • BrowserTool — Retrieve and summarize webpage content given a URL and goal

Anti-Cheating Mechanisms:

  • Regex-based blocklist filters both search results and bash commands, blocking any access to the target repository (GitHub/GitLab pages, API endpoints, raw content sources, git operations)
  • Docker environments are sanitized by removing all commits after the target commit
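
To make the blocklist idea concrete, here is a minimal sketch of how a regex-based filter over search results and bash commands might look. It is not the AweAgent implementation: the function names, the `url` key on search results, the exact patterns, and the `example-org/example-lib` repository are all assumptions chosen for illustration.

```python
import re


def build_blocklist(owner: str, repo: str) -> list:
    """Compile patterns matching references to one target repository (hypothetical helper)."""
    slug = rf"{re.escape(owner)}/{re.escape(repo)}"
    # (?![\w-]) stops "example-lib" from also matching "example-lib-extras".
    end = r"(?![\w-])"
    return [
        # Repository pages on GitHub or GitLab.
        re.compile(rf"(github|gitlab)\.com/{slug}{end}", re.I),
        # Raw file content.
        re.compile(rf"raw\.githubusercontent\.com/{slug}{end}", re.I),
        # REST API endpoints.
        re.compile(rf"api\.github\.com/repos/{slug}{end}", re.I),
        # Git operations that could bring in newer history (assumed to be blocked outright).
        re.compile(r"\bgit\s+(fetch|pull)\b", re.I),
    ]


def is_blocked(text: str, patterns: list) -> bool:
    """Return True if a URL or shell command matches any blocklist pattern."""
    return any(p.search(text) for p in patterns)


def filter_search_results(results: list, patterns: list) -> list:
    """Drop search hits whose URL points at the target repository."""
    return [r for r in results if not is_blocked(r.get("url", ""), patterns)]


patterns = build_blocklist("example-org", "example-lib")
hits = [
    {"url": "https://github.com/example-org/example-lib/pull/14"},
    {"url": "https://stackoverflow.com/questions/1"},
]
print(filter_search_results(hits, patterns))
print(is_blocked("git fetch origin", patterns))
```

The same `is_blocked` check is applied to both inputs, so a search hit and a `curl` command pointing at the same URL are treated alike.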

For the SearchSWE agent implementation, see AweAgent.


📈 Results

Key Findings

1. The 45% Ceiling — Even frontier models (Gemini 3 Pro, GPT-5.2, DeepSeek-V3.2, etc.) fail to exceed 45% overall on BeyondSWE, compared to 80%+ on SWE-bench Verified.

2. No Single Winner — Different models lead on different tasks — Seed-Coder on CrossRepo (44.72%), DeepSeek-V3.2 on Doc2Repo (54.99%), Gemini 3 Pro on DepMigrate (41.81%) — revealing that the four tasks test fundamentally different capabilities.

3. Search Helps, but Integration Remains Open — 6 out of 9 models improve with SearchSWE, with Gemini 3 Pro gaining +7.5% on DomainFix. However, gains are inconsistent — search and coding have matured independently, but their effective fusion is still an unsolved challenge.

4. Quality over Quantity — Gemini 3 Pro searches only 0.8–1.1 times per instance yet achieves the best overall gain (+2.0%), while DeepSeek-V3.2 searches 4.2–5.4 times but shows a slight decline (-0.2%).


🚀 Quick Start

Data

The benchmark data is available on Hugging Face in two formats:

Standard Format — for evaluation with AweAgent/SearchSWE:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AweAI-Team/BeyondSWE",
    repo_type="dataset",
    local_dir="<your_path>/BeyondSWE",
)

Harbor Format — for evaluation with coding agents (e.g., Claude Code) via the Harbor framework. The dataset is available on 🤗 BeyondSWE-harbor:

from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="AweAI-Team/BeyondSWE-harbor",
    repo_type="dataset",
    local_dir="data",
)

Alternatively, clone the dataset with git, which we recommend because it avoids Hugging Face API rate limits:

# Make sure git-lfs is installed: https://git-lfs.com
git lfs install
git clone https://huggingface.co/datasets/AweAI-Team/BeyondSWE-harbor data

This will download two directories into data/:

  • beyondswe/ — 500 Harbor task directories (each containing task.toml, instruction.md, environment/, tests/, solution/)
  • doc2repo_test_suite/ — test suite ZIP files for Doc2Repo evaluation (already bundled inside each task's tests/test_suite.zip, included here for reference)
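
After downloading, it can be useful to confirm that every task directory has the layout listed above. The following is a small sketch that checks only for the entry names given in this README; it assumes the dataset sits at `data/beyondswe`, and the helper names are hypothetical.

```python
from pathlib import Path

# Entry names taken from the directory description above.
EXPECTED_FILES = ("task.toml", "instruction.md")
EXPECTED_DIRS = ("environment", "tests", "solution")


def missing_entries(task_dir: Path) -> list:
    """List the expected files and directories that are absent from one task."""
    missing = [n for n in EXPECTED_FILES if not (task_dir / n).is_file()]
    missing += [n + "/" for n in EXPECTED_DIRS if not (task_dir / n).is_dir()]
    return missing


def check_dataset(root: Path) -> dict:
    """Map each incomplete task directory name to its missing entries."""
    report = {}
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_entries(task_dir)
        if missing:
            report[task_dir.name] = missing
    return report


if __name__ == "__main__":
    root = Path("data/beyondswe")  # assumed download location
    if root.is_dir():
        tasks = [p for p in root.iterdir() if p.is_dir()]
        print(f"{len(tasks)} task directories found (500 expected)")
        for name, missing in check_dataset(root).items():
            print(f"{name}: missing {', '.join(missing)}")
```

An empty report together with a count of 500 suggests the download completed; the check does not validate file contents.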

Trajectories

We release agent trajectories and analyses for selected runs in the trajectories/ directory:

  • DeepSeek-V4-Pro/OpenHands/dpsk_v4_pro_max_nosearch.jsonl — DeepSeek-V4-Pro on the OpenHands scaffold
  • DeepSeek-V4-Pro/SearchSWE/dpsk_v4_pro_max_search.jsonl — DeepSeek-V4-Pro on SearchSWE
  • GPT-5.4-XHigh/Codex/codex_searchswe.jsonl — GPT-5.4-XHigh on Codex (SearchSWE), with companion analysis in web_cheating_analysis.md

All .jsonl files are tracked with Git LFS because each one exceeds GitHub's 100 MB per-file limit (total ~749 MB).

Option A — fresh clone (recommended). LFS files are pulled automatically:

# Install git-lfs once per machine: https://git-lfs.com
git lfs install
git clone https://github.com/AweAI-Team/BeyondSWE.git

Option B — already cloned without LFS. Run inside the existing checkout:

cd BeyondSWE
git lfs install
git lfs pull

Tip — skip trajectories on initial clone. If you do not need the trajectory files right away, defer the ~749 MB download:

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/AweAI-Team/BeyondSWE.git
# later, fetch only what you need:
git lfs pull --include="trajectories/GPT-5.4-XHigh/**"
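
When a clone is made without LFS, or with `GIT_LFS_SKIP_SMUDGE=1`, each `.jsonl` file is a small text pointer rather than the real trajectory. A short sketch for spotting files that still need `git lfs pull` follows; it relies on the standard Git LFS pointer header, and the function names and the `trajectories` path are assumptions.

```python
from pathlib import Path

# Every Git LFS pointer file begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is still an LFS pointer, not the real content."""
    with path.open("rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_trajectories(root: Path) -> list:
    """List .jsonl files under root that have not been fetched yet."""
    return sorted(p for p in root.rglob("*.jsonl") if is_lfs_pointer(p))


if __name__ == "__main__":
    root = Path("trajectories")  # assumed to be run from the repository root
    if root.is_dir():
        for path in unfetched_trajectories(root):
            print(f"not fetched yet: {path}")
```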

Evaluation with SearchSWE

Please refer to AweAgent for the full evaluation pipeline, including SearchSWE setup and running instructions.

Evaluation with Harbor (e.g., Claude Code)

We use the Harbor framework to evaluate coding agents such as Claude Code on BeyondSWE.

1. Install Harbor

uv tool install harbor
# or
pip install harbor

See the Harbor repository for more details.

2. Configure API credentials

To evaluate Claude Code, you will need an Anthropic API key or an OAuth token:

export ANTHROPIC_API_KEY=<YOUR-KEY>
# or, if using OAuth:
export CLAUDE_CODE_OAUTH_TOKEN=<YOUR-TOKEN>
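
Before starting a long run, a quick pre-flight check can confirm that one of the two variables above is set. This is only a convenience sketch; the helper name is hypothetical and it does not verify that the credential is valid.

```python
import os

# The two variables named in the step above.
CREDENTIAL_VARS = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN")


def configured_credentials(env=None) -> list:
    """Return the names of the credential variables that are set and non-empty."""
    env = os.environ if env is None else env
    return [name for name in CREDENTIAL_VARS if env.get(name)]


if __name__ == "__main__":
    found = configured_credentials()
    if found:
        print("credentials found:", ", ".join(found))
    else:
        print("set ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN before running")
```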

3. Run evaluation

harbor run --path data/beyondswe \
    --agent claude-code \
    --model anthropic/claude-opus-4-6 \
    --n-concurrent 1 \
    --ak max_turns=200 \
    --ak reasoning_effort=high \
    --ak "disallowed_tools='Bash(git log * --all*) Bash(git verify-pack *) Bash(git fsck *) Bash(git cat-file *) Bash(git fetch *) Bash(git pull *)'"

Key parameters:

  • --agent claude-code — use Claude Code as the coding agent
  • --model — LLM model to use (e.g., anthropic/claude-opus-4-6)
  • --n-concurrent — concurrency limit
  • --ak max_turns=200 — allow up to 200 agent iterations
  • --ak reasoning_effort=high — enable extended thinking
  • --ak disallowed_tools=... — restrict git history commands to prevent data leakage
  • -t <task_name> — run a specific instance (e.g., -t pylons_plaster_pastedeploy_pr14)
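
For scripted runs, the command above can be assembled as an argument list so that the quoting of `disallowed_tools` stays intact. The sketch below uses only the flags shown in this README; `build_harbor_command` is a hypothetical helper, not part of Harbor.

```python
import shlex

# Git history commands restricted in the command above.
DISALLOWED_GIT = [
    "Bash(git log * --all*)",
    "Bash(git verify-pack *)",
    "Bash(git fsck *)",
    "Bash(git cat-file *)",
    "Bash(git fetch *)",
    "Bash(git pull *)",
]


def build_harbor_command(dataset_path, model, task=None, n_concurrent=1, max_turns=200):
    """Build the `harbor run` argument list, optionally for a single task."""
    disallowed = " ".join(DISALLOWED_GIT)
    cmd = [
        "harbor", "run",
        "--path", dataset_path,
        "--agent", "claude-code",
        "--model", model,
        "--n-concurrent", str(n_concurrent),
        "--ak", f"max_turns={max_turns}",
        "--ak", "reasoning_effort=high",
        "--ak", f"disallowed_tools='{disallowed}'",
    ]
    if task is not None:
        cmd += ["-t", task]
    return cmd


cmd = build_harbor_command(
    "data/beyondswe",
    "anthropic/claude-opus-4-6",
    task="pylons_plaster_pastedeploy_pr14",
)
print(shlex.join(cmd))  # shell-safe rendering for logs or copy-paste
```

Passing the list to `subprocess.run(cmd, check=True)` avoids a second layer of shell quoting around the `disallowed_tools` value.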

To see all supported agents and other options, run:

harbor run --help

Results will be saved in the jobs/ directory. Each trial contains:

  • result.json — score, timing, token usage, and exception info
  • agent/trajectory.json — full agent trajectory (steps, tool calls, reasoning)
  • verifier/reward.txt — evaluation reward (1.0 = resolved, 0.0 = failed; Doc2Repo uses fractional scores)
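
Once a run finishes, the per-trial rewards can be aggregated with a few lines of Python. This sketch assumes that each `reward.txt` holds a single number and that the trial directory is the parent of `verifier/`; the exact nesting under `jobs/` is not described here, so the code searches recursively.

```python
from pathlib import Path


def collect_rewards(jobs_dir: Path) -> dict:
    """Map each trial directory (relative to jobs_dir) to its reward."""
    rewards = {}
    for reward_file in sorted(jobs_dir.rglob("reward.txt")):
        if reward_file.parent.name != "verifier":
            continue  # ignore unrelated files that share the name
        trial_dir = reward_file.parent.parent
        key = str(trial_dir.relative_to(jobs_dir))
        rewards[key] = float(reward_file.read_text().strip())
    return rewards


def summarize(rewards: dict) -> dict:
    """Count trials, fully resolved trials (reward 1.0), and the mean reward."""
    n = len(rewards)
    if n == 0:
        return {"trials": 0, "resolved": 0, "mean_reward": 0.0}
    resolved = sum(1 for r in rewards.values() if r == 1.0)
    return {"trials": n, "resolved": resolved, "mean_reward": sum(rewards.values()) / n}


if __name__ == "__main__":
    jobs = Path("jobs")
    if jobs.is_dir():
        print(summarize(collect_rewards(jobs)))
```

Because Doc2Repo uses fractional scores, the mean reward and the resolved count can differ; reporting both keeps the two notions separate.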

📝 Citation

If you find BeyondSWE useful in your research, please cite our paper:

@misc{beyondswe2026,
      title={BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing}, 
      author={Guoxin Chen and Fanzhe Meng and Jiale Zhao and Minghao Li and Daixuan Cheng and Huatong Song and Jie Chen and Yuzhi Lin and Hui Chen and Xin Zhao and Ruihua Song and Chang Liu and Cheng Chen and Kai Jia and Ji-Rong Wen},
      year={2026},
      eprint={2603.03194},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03194}, 
}

📄 License

This project is licensed under the CC BY 4.0 License — see the LICENSE file for details.


If you find this project useful, please consider giving it a ⭐ !
