# agentarena

Benchmark AI coding agents on your own codebase. Race your AI agents: any agent, any task, your data.
```text
$ agentarena run

agentarena v0.1.0 — racing 3 agents on 3 tasks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Task 1/3: fix-type-error
  claude-code ····· PASS  15s  $0.08   4.2K tokens  2 calls
  aider ··········· PASS  23s  $0.14   8.7K tokens  5 calls
  codex ··········· PASS  31s  $0.21  12.1K tokens  8 calls
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
              Pass Rate    Avg Time  Avg Cost  Total Tokens
claude-code   3/3  100%    45s       $0.20     34.6K
aider         2/3   67%    69s       $0.35     65.2K
codex         2/3   67%    71s       $0.42     82.2K

Winner: claude-code (highest pass rate, lowest cost)
```
## Why agentarena?
Every company building AI is asking the same question: which agent actually works best?
Today that answer is opinions, blog posts, and vibes. Your manager asks for a POC — you spend two weeks manually testing three tools and write a Google Doc that says "I think Claude was better."
agentarena gives you hard numbers in 30 minutes:
| Who you are | What you get |
|---|---|
| Developer picking a tool | Run agents on YOUR codebase, see which passes more tests, costs less, runs faster |
| Team doing a POC | One command, one report — give your manager data, not opinions |
| Agent builder | Prove your agent beats competitors with reproducible benchmarks |
| Company evaluating vendors | Compare digital workers on your actual workload |
Inspired by the ActionEngine paper, which found 11.8× cost differences and 5.67× variance in token usage between agent architectures on identical tasks. agentarena makes these differences visible on your own data.
## Install

```shell
pip install agentarena
```
## Quick Start

1. Create a `bench.yaml` in your project:

```shell
agentarena init
```
2. Define your tasks and agents:

```yaml
project: my-app
timeout: 120

tasks:
  - name: fix-type-error
    prompt: "Fix the TypeScript type error in src/auth/login.ts"
    validate: "npx tsc --noEmit"
  - name: add-pagination
    prompt: "Add offset/limit pagination to GET /api/users endpoint"
    validate: "bun test test/api/users.test.ts"

agents:
  - name: claude-code
    command: "claude --print --max-turns 10 '{prompt}'"
    patterns:  # optional: extract metrics
      tokens_in: "input tokens:\\s*([\\d,]+)"
      tokens_out: "output tokens:\\s*([\\d,]+)"
      cost: "cost:\\s*\\$?([\\d.]+)"
  - name: aider
    command: "aider --message '{prompt}' --yes-always --no-git"
  - name: my-custom-agent  # any CLI tool works
    command: "my-tool run '{prompt}'"
```
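Each agent's `command` is a shell template in which `{prompt}` is replaced by the task prompt before the agent runs. A minimal sketch of how that substitution could work, using only the standard library (`build_command` is a hypothetical name for illustration, not agentarena's actual API):

```python
import shlex

def build_command(template: str, prompt: str) -> list[str]:
    """Expand the {prompt} placeholder, then split with shell-style
    quoting so a quoted prompt stays a single argv entry.
    Note: a prompt containing a single quote would need escaping first."""
    return shlex.split(template.replace("{prompt}", prompt))

argv = build_command(
    "claude --print --max-turns 10 '{prompt}'",
    "Fix the TypeScript type error in src/auth/login.ts",
)
# The quoted prompt survives as one argument at the end of argv.
```

Because the template itself carries the quotes around `{prompt}`, multi-word prompts are passed to the agent CLI as a single argument.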
3. Run the race:

```shell
agentarena run
```
## How It Works

For each task × agent combination, agentarena:
- Creates a clean sandbox (git worktree for code repos, temp directory for anything else)
- Runs the agent CLI with your prompt
- Runs your validation command (tests, typecheck, lint — anything with an exit code)
- Collects metrics: wall time, tokens, cost, LLM calls, pass/fail
- Cleans up the sandbox
Works with any project — git repos, plain directories, any language, any domain.
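The loop above can be sketched with the standard library's `subprocess`: run the agent command, then treat the validation command's exit code as pass/fail. The function and field names below are hypothetical illustrations, not agentarena's internals:

```python
import subprocess
import time

def run_task(agent_cmd: str, validate_cmd: str, cwd: str,
             timeout: int = 120) -> dict:
    """One task x agent run inside an already-prepared sandbox at `cwd`."""
    start = time.monotonic()
    # Run the agent CLI with the substituted prompt.
    agent = subprocess.run(agent_cmd, shell=True, cwd=cwd,
                           capture_output=True, text=True, timeout=timeout)
    # Validation is just another command: exit code 0 means the task passed.
    check = subprocess.run(validate_cmd, shell=True, cwd=cwd,
                           capture_output=True, text=True)
    return {
        "passed": check.returncode == 0,
        "wall_time_s": round(time.monotonic() - start, 1),
        "agent_output": agent.stdout + agent.stderr,
    }
```

Because the pass/fail signal is only an exit code, any test runner, type checker, or linter can serve as the validator.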
## CLI

```shell
agentarena run                        # Race all agents on all tasks
agentarena run --task fix-type-error  # Run a specific task
agentarena run --agent claude-code    # Run a specific agent
agentarena run --json                 # Export results as JSON
agentarena run --csv                  # Export results as CSV
agentarena run --md                   # Export results as Markdown
agentarena init                       # Create a starter bench.yaml
agentarena history                    # List past runs
```
## Metric Extraction
agentarena uses regex patterns defined in your config to extract metrics from agent output. No code changes needed for new agents:
```yaml
agents:
  - name: my-agent
    command: "my-agent '{prompt}'"
    patterns:
      tokens_in: "Input:\\s*(\\d+) tokens"  # regex with one capture group
      tokens_out: "Output:\\s*(\\d+) tokens"
      cost: "Total:\\s*\\$([\\d.]+)"
      llm_calls: "(\\d+) API calls"
```
No patterns? agentarena still measures wall time and pass/fail — works for any tool.
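A minimal sketch of what such pattern-based extraction could look like: each configured regex is searched against the agent's output, and the single capture group is parsed as a number (the `extract_metrics` helper is a hypothetical illustration, not agentarena's code):

```python
import re

def extract_metrics(output: str, patterns: dict[str, str]) -> dict[str, float]:
    """Apply each named regex to the agent output; keep the first match.
    Commas are stripped so values like '4,200' parse as numbers."""
    metrics = {}
    for name, pattern in patterns.items():
        m = re.search(pattern, output)
        if m:
            metrics[name] = float(m.group(1).replace(",", ""))
    return metrics

sample = "Input: 4200 tokens\nOutput: 812 tokens\nTotal: $0.08 (3 API calls)"
metrics = extract_metrics(sample, {
    "tokens_in": r"Input:\s*(\d+) tokens",
    "tokens_out": r"Output:\s*(\d+) tokens",
    "cost": r"Total:\s*\$([\d.]+)",
    "llm_calls": r"(\d+) API calls",
})
# metrics → {"tokens_in": 4200.0, "tokens_out": 812.0, "cost": 0.08, "llm_calls": 3.0}
```

A metric whose pattern does not match is simply absent from the result, which is why a config with no `patterns` still yields wall time and pass/fail.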
## Examples

See `examples/` for ready-to-use configs:

- `python-pytest.yaml`: Python with pytest, mypy, bandit
- `node-typescript.yaml`: TypeScript with tsc, jest, ESLint
- `react-nextjs.yaml`: Next.js with vitest
- `go.yaml`: Go with go test and the race detector
## Contributing

See `CONTRIBUTING.md` for development setup and guidelines.
## License