# RepoAgentBench
SWE-bench for your codebase.
Turn your merged PRs into reproducible coding-agent benchmarks. Find out which AI coding agent actually works on your repo, your tests, your constraints — with structured, replayable, diffable run artifacts instead of opaque chat logs.
Today: `mock-fix` (oracle baseline), `claude-code`, and `aider` adapters. Codex CLI and Gemini CLI are next — see Roadmap.
## Sample leaderboard
A frontier-model sweep on Click PR #3299 (Fix speculative empty string check, merged 2026-04-08), driven by aider against four current flagship models plus a mock-fix oracle that applies the PR's actual diff:
| Agent | Model | Status | Duration | Cost | What it did |
|---|---|---|---|---|---|
| mock-fix | (oracle) | PASS | 8.8s | — | Applies the actual PR diff: `isinstance(default, str) and default == ""` |
| aider | anthropic/claude-sonnet-4-6 | FAIL | 610s | $0.72 | Wrote `try/except TypeError` — but the test class raises `ValueError`, so the exception still propagates |
| aider | anthropic/claude-opus-4-7 | FAIL | 36s | $0.58 | Replaced `default == ""` with `default == str()` — completely equivalent at runtime; the comment claimed it "avoids TypeError", but it's a no-op |
| aider | openai/gpt-5.5 | PASS | 272s | $1.58 | Wrote the canonical PR fix verbatim: `isinstance(default, str) and default == ""` |
| aider | gemini/gemini-3.1-pro-preview | FAIL | 939s | $0.03+ | Edited the wrong function entirely — patched a `value == ""` check inside `_value_is_missing()` (line 2408), but the bug is in `get_help_extra()` (line 3113); the test still calls into the unpatched code path |
Aggregate across the four frontier-flagship runs:
| Vendor flagship | Pass rate on Click PR #3299 |
|---|---|
| OpenAI GPT-5.5 | 1 / 1 |
| Anthropic Opus 4.7 | 0 / 1 |
| Anthropic Sonnet 4.6 | 0 / 1 |
| Google Gemini 3.1 Pro | 0 / 1 |
This is exactly the failure topology public benchmarks hide. The PR is a one-line bug fix, the failure mode is a Python edge case (a custom `__eq__` that raises `ValueError`), and three of the four current vendor flagships failed in three different ways: Sonnet 4.6 over-engineered the wrong exception class, Opus 4.7 produced a no-op fix that "looks right," and Gemini 3.1 Pro patched the wrong function entirely. Only GPT-5.5 reproduced the canonical PR fix. The harness surfaces this because every run executes the project's real test suite against each agent's actual diff — captured in `diff.patch`, `agent.log`, and `events.jsonl` for every run.
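The edge case is easy to see in isolation. The sketch below uses a hypothetical `Sentinel` class (not Click's actual code) whose `__eq__` raises, mirroring the behavior the test exercises:

```python
class Sentinel:
    """Hypothetical stand-in for a default value whose __eq__ raises,
    mirroring the edge case exercised by the Click test."""

    def __eq__(self, other):
        raise ValueError("cannot compare")


default = Sentinel()

# Pre-fix check: comparing directly invokes Sentinel.__eq__, so the
# ValueError escapes to the caller. This is the bug.
try:
    default == ""
except ValueError as exc:
    print(f"raised: {exc}")  # raised: cannot compare

# The PR's fix: isinstance() is False for non-strings, so `and`
# short-circuits and __eq__ is never called.
print(isinstance(default, str) and default == "")  # False
```

This is also why the Sonnet 4.6 attempt fails: catching `TypeError` does nothing when the comparison raises `ValueError`.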
## Why
Public benchmarks tell you which agent wins on curated, generic tasks. They do not tell you which agent works on the codebase you actually maintain. And recent research argues those benchmarks are increasingly compromised:
- "Saving SWE-Bench" (arXiv:2510.08996, Jan 2026) — public benchmarks overestimate agent capability by 20–50%.
- "Does SWE-Bench-Verified Test Agent Ability or Model Memory?" (arXiv:2512.10218, Dec 2025) — frontier models perform 3× better on SWE-Bench-Verified than on benchmarks built from training-cutoff-fresh PRs, suggesting heavy training-data overlap.
RepoAgentBench dodges both problems. It is local-first: your code never leaves your machine. The differentiator: every merged PR can become a benchmark task. PR description → goal. PR tests → acceptance criteria. Diff (split into test and source halves) → broken starting state. Mine PRs that post-date the model's training cutoff and you have a contamination-free benchmark of your own.
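Selecting contamination-free PRs reduces to a date filter on merge timestamps. A minimal sketch, assuming you have dumped metadata with `gh pr list --state merged --json number,title,mergedAt`; the cutoff date here is a placeholder you would replace with the actual training cutoff of the model under test:

```python
import json
from datetime import datetime, timezone

# Hypothetical training cutoff; substitute the cutoff of the model under test.
MODEL_CUTOFF = datetime(2026, 1, 1, tzinfo=timezone.utc)


def fresh_prs(pr_list_json: str) -> list[dict]:
    """Keep only PRs merged after the cutoff. Expects the JSON shape of
    `gh pr list --state merged --json number,title,mergedAt`."""
    merged = json.loads(pr_list_json)
    return [
        pr
        for pr in merged
        if datetime.fromisoformat(pr["mergedAt"].replace("Z", "+00:00")) > MODEL_CUTOFF
    ]


raw = json.dumps([
    {"number": 3299, "title": "Fix speculative empty string check",
     "mergedAt": "2026-04-08T12:00:00Z"},
    {"number": 2100, "title": "Old fix", "mergedAt": "2024-03-01T09:00:00Z"},
])
print([pr["number"] for pr in fresh_prs(raw)])  # [3299]
```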
## Status
v0.1.0 — early alpha. Single-task runner with per-task venv isolation, PR-to-task mining (test/source patch splitting), `verify.sh` auto-generation (`requirements*.txt`, `[project.optional-dependencies]`, PEP 735 `[dependency-groups]`), structured run artifacts (`manifest.json`, `events.jsonl`, `verification.json`), `report` / `replay` / `diff` subcommands, and adapters for `mock-fix` / `claude-code` / `aider`. Validated end-to-end on a real OSS PR (Click #3299) with the real Claude API. Codex CLI / Gemini CLI adapters, parallel multi-agent eval, and statistical reporting are next. See Roadmap.
## Quickstart
```shell
pip install -e .

# Smoke test (no API key, no CLI install required)
repoagentbench run-one --task examples/demo --agent mock-fix
```
For real agents:
```shell
# Aider (recommended path: a dedicated conda env so Aider's deps don't conflict with task deps)
conda create -n aider-rab python=3.11 -y
conda run -n aider-rab pip install aider-chat
export ANTHROPIC_API_KEY=sk-ant-...
repoagentbench run-one --task examples/demo --agent aider

# Claude Code
# Install the Claude Code CLI per https://docs.claude.com/en/docs/claude-code, then:
repoagentbench run-one --task examples/demo --agent claude-code
```
Aider defaults to `anthropic/claude-sonnet-4-6`. Override with `RAB_AIDER_MODEL=anthropic/claude-opus-4-7` (or any LiteLLM-compatible model). Override the binary path with `RAB_AIDER_BIN=/path/to/aider`.
## Mine a benchmark task from any merged GitHub PR
```shell
repoagentbench infer \
  --from-pr https://github.com/pallets/click/pull/3299 \
  --out tasks/click-pr-3299

repoagentbench run-one --task tasks/click-pr-3299 --agent mock-fix
repoagentbench run-one --task tasks/click-pr-3299 --agent aider
```
The `infer` command:

- pulls the PR's title, body, base SHA, and unified diff via `gh`
- clones the repo at the PR's base commit (preserving `.git`, so projects using `setuptools_scm` / `hatch-vcs` / similar still install)
- splits the PR diff into `solution_tests.patch` and `solution_source.patch`, then applies the test portion to the task folder so the starting state is "post-PR tests vs pre-PR source" — without that, `pre_verify` would trivially pass on PRs that add new tests
- writes `goal.md` (PR title + body), `solution.patch` (full diff for reference), and `task.json` (source metadata)
- auto-generates `verify.sh` based on the test framework it detects (pytest / `go test` / `cargo test` / `npm test`). For pytest projects, the generated script discovers and installs from `requirements*.txt`, `[project.optional-dependencies]` extras, and PEP 735 `[dependency-groups]`. When no framework can be detected, it writes a `TODO.md` explaining what to fill in.

Requires the `gh` CLI installed and authenticated.
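The test/source split amounts to routing each per-file section of the unified diff by the path in its `diff --git` header. A minimal illustration of the idea, with hypothetical path markers; the real splitter would also need to handle renames and repo-specific test layouts:

```python
def split_patch(unified_diff: str, test_markers=("tests/", "test_")) -> tuple[str, str]:
    """Route each file's hunks in a unified diff to a test patch or a
    source patch, based on the path in its `diff --git` header.
    Sketch only: rename handling and custom layouts are omitted."""
    test_parts: list[str] = []
    source_parts: list[str] = []
    current: list[str] = []
    is_test = False

    def flush():
        # Append the accumulated file section to the appropriate patch.
        if current:
            (test_parts if is_test else source_parts).append("".join(current))

    for line in unified_diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            flush()
            current = []
            is_test = any(marker in line for marker in test_markers)
        current.append(line)
    flush()
    return "".join(test_parts), "".join(source_parts)
```

Applying only the first half of the returned pair to a base-commit checkout yields the "post-PR tests vs pre-PR source" starting state.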
## Run artifacts
Each run-dir is a self-describing bundle:
```
.runs/<ISO_ts>__<task>__<agent>__<short_id>/
  manifest.json      # run_id, task_id, agent, base_commit, started_at, harness_version
  status.json        # final outcome: status, failure_stage, summary, durations
  verification.json  # pre/post phases: command, passed, exit_code, duration_seconds
  events.jsonl       # streaming lifecycle events with millisecond timestamps
                     # (run.started, verify.*, agent.*, diff.captured, run.finished)
  pre_verify.log     # raw test output before the agent ran (must fail)
  post_verify.log    # raw test output after the agent ran (must pass)
  agent.log          # what the agent did (stdout + stderr)
  diff.patch         # what the agent actually changed
  venv_bootstrap.log # per-task venv setup output
  workdir/           # isolated copy of the task (with `.venv-rab/` inside)
```
Run IDs are sortable and human-readable: `20260429T060601Z__click-pr-3299__mock-fix__daf638`. Each task runs inside its own `.venv-rab` venv, so installing the project under test does not pollute your system Python or break the harness itself.
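Because `events.jsonl` is plain JSON-lines, run bundles are trivially scriptable. A sketch that derives wall-clock duration from the lifecycle events; the `event` and `ts` field names are guesses at the schema, not confirmed by the harness docs:

```python
import json


def run_duration_ms(events_jsonl: str) -> float:
    """Wall-clock duration from run.started to run.finished. Assumes each
    line is a JSON object with an `event` name and a millisecond `ts`
    timestamp; both field names are assumptions about the schema."""
    events = [json.loads(line) for line in events_jsonl.splitlines() if line.strip()]
    start = next(e["ts"] for e in events if e["event"] == "run.started")
    end = next(e["ts"] for e in reversed(events) if e["event"] == "run.finished")
    return end - start


sample = "\n".join([
    '{"event": "run.started", "ts": 1000}',
    '{"event": "verify.pre", "ts": 1450}',
    '{"event": "run.finished", "ts": 9800}',
])
print(run_duration_ms(sample))  # 8800
```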
## Aggregate, replay, compare
```shell
repoagentbench report                            # markdown leaderboard of every run
repoagentbench report --task click-pr-3299       # filter to one task
repoagentbench report --output report.md         # write to file
repoagentbench replay --run <run_id_or_prefix>   # re-run the same task+agent (variance check)
repoagentbench diff --run <id_a> --run <id_b>    # side-by-side comparison
```
`report` groups runs by task and includes a per-agent aggregate (runs, passed, pass rate, average duration). `replay` reads `manifest.json` so the same task and agent are reused — useful for measuring run-to-run variance or re-validating after upgrading the harness. `diff` highlights the fields that changed between two runs.
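The per-agent aggregate can also be recomputed directly from the run bundles. A sketch against the run-artifact layout above (reading the `agent` field from `manifest.json` and the `status` field from `status.json`, with `PASS`/`FAIL` values as shown in the leaderboard); this is an illustration, not the built-in `report` command:

```python
import json
from collections import defaultdict
from pathlib import Path


def pass_rates(runs_root: str) -> dict[str, float]:
    """Per-agent pass rate across every run-dir under a `.runs/`-style root.
    Assumes each run-dir holds manifest.json and status.json as described
    in the run-artifact layout; a sketch, not the harness's own report."""
    tally: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # agent -> [passed, total]
    for manifest_path in sorted(Path(runs_root).glob("*/manifest.json")):
        agent = json.loads(manifest_path.read_text())["agent"]
        status = json.loads((manifest_path.parent / "status.json").read_text())["status"]
        tally[agent][0] += status == "PASS"
        tally[agent][1] += 1
    return {agent: passed / total for agent, (passed, total) in tally.items()}
```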
## How is this different from SWE-bench / CodeScaleBench?
| | SWE-bench | CodeScaleBench | RepoAgentBench |
|---|---|---|---|
| Tasks | 2,294 curated | 275 curated | mined from your PRs |
| Codebase | 12 OSS repos | enterprise OSS | your repo |
| Distribution | public dataset | public dataset | local-first |
| Training-data contamination | known issue [^1] | known issue | avoidable (mine PRs after model cutoff) |
| Question answered | which model is strongest in general | which agent leverages context tools well | which agent works on this codebase |
[^1]: Bian et al., 2025 — "Does SWE-Bench-Verified Test Agent Ability or Model Memory?" — finds frontier models score 3× higher on SWE-Bench-Verified than on equivalent training-fresh tasks.
## Roadmap
- v0.0.1 — single-task runner with `mock-fix` and `claude-code` adapters
- v0.0.2 — `repoagentbench infer --from-pr <url>` mines tasks from merged GitHub PRs
- v0.0.3 — `infer` auto-generates `verify.sh` for pytest / Go / Cargo / npm projects
- v0.0.4 — `infer` splits the PR diff into test/source patches so PR-mined tasks have a valid pre-fix starting state
- v0.0.5 — per-task venv isolation; `verify.sh` handles `requirements*.txt` and PEP 735 `[dependency-groups]`
- v0.0.6 — structured run-dir (`manifest.json`, `events.jsonl`, `verification.json`); sortable, human-readable `run_id`s
- v0.0.7 — `report` / `replay` / `diff` subcommands
- v0.1.0 — Aider adapter; first real Claude API leaderboard data
- v0.2 — Codex CLI adapter; parallel multi-agent eval; HTML report
- v0.3 — bootstrap confidence intervals, pairwise statistical comparison
- v0.4 — real-repo demo suite (3 OSS repos × historical PRs × all agents)
## License
MIT
## Author
Haofei Sun — humphreysun98@gmail.com · github.com/HumphreySun98
Reach out about agent-eval / devtools / infra roles, RepoAgentBench feedback, or contributions.