
Benchmark AI coding agents on your own codebase


agentarena

Race your AI agents. Any agent, any task, your data.

Python 3.11+ · MIT license

$ agentarena run

 agentarena v0.1.0 — racing 3 agents on 3 tasks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Task 1/3: fix-type-error
   claude-code ····· PASS   15s   $0.08   4.2K tokens   2 calls
   aider ··········· PASS   23s   $0.14   8.7K tokens   5 calls
   codex ··········· PASS   31s   $0.21  12.1K tokens   8 calls

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

              Pass Rate   Avg Time   Avg Cost   Total Tokens
 claude-code  3/3 100%      45s       $0.20       34.6K
 aider        2/3  67%      69s       $0.35       65.2K
 codex        2/3  67%      71s       $0.42       82.2K

 Winner: claude-code (highest pass rate, lowest cost)

Why agentarena?

Every company building AI is asking the same question: which agent actually works best?

Today that answer is opinions, blog posts, and vibes. Your manager asks for a POC — you spend two weeks manually testing three tools and write a Google Doc that says "I think Claude was better."

agentarena gives you hard numbers in 30 minutes:

 Who you are                   What you get
 Developer picking a tool      Run agents on YOUR codebase; see which passes more tests, costs less, runs faster
 Team doing a POC              One command, one report: give your manager data, not opinions
 Agent builder                 Prove your agent beats competitors with reproducible benchmarks
 Company evaluating vendors    Compare digital workers on your actual workload

Inspired by the ActionEngine paper, which found 11.8x differences in cost and 5.67x variance in token usage between agent architectures on identical tasks. agentarena makes these differences visible on your own data.

Install

pip install agentarena

Quick Start

1. Generate a starter bench.yaml in your project:

agentarena init

2. Define your tasks and agents:

project: my-app
timeout: 120

tasks:
  - name: fix-type-error
    prompt: "Fix the TypeScript type error in src/auth/login.ts"
    validate: "npx tsc --noEmit"

  - name: add-pagination
    prompt: "Add offset/limit pagination to GET /api/users endpoint"
    validate: "bun test test/api/users.test.ts"

agents:
  - name: claude-code
    command: "claude --print --max-turns 10 '{prompt}'"
    patterns:                                          # optional: extract metrics
      tokens_in: "input tokens:\\s*([\\d,]+)"
      tokens_out: "output tokens:\\s*([\\d,]+)"
      cost: "cost:\\s*\\$?([\\d.]+)"

  - name: aider
    command: "aider --message '{prompt}' --yes-always --no-git"

  - name: my-custom-agent                              # any CLI tool works
    command: "my-tool run '{prompt}'"

3. Run the race:

agentarena run

How It Works

For each task × agent combination:

  1. Creates a clean sandbox (git worktree for code repos, temp directory for anything else)
  2. Runs the agent CLI with your prompt
  3. Runs your validation command (tests, typecheck, lint — anything with an exit code)
  4. Collects metrics: wall time, tokens, cost, LLM calls, pass/fail
  5. Cleans up the sandbox

Works with any project — git repos, plain directories, any language, any domain.
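The per-combination loop above can be sketched in Python. This is a simplified illustration, not agentarena's actual implementation: it uses a plain temp-directory copy rather than a git worktree, and the function name `run_one` and its return fields are hypothetical:

```python
import shutil
import subprocess
import tempfile
import time

def run_one(agent_cmd: str, prompt: str, validate_cmd: str,
            project_dir: str, timeout: int = 120) -> dict:
    """Run one agent on one task in a throwaway sandbox."""
    # 1. Clean sandbox: copy the project into a temp directory
    sandbox = tempfile.mkdtemp(prefix="agentarena-")
    shutil.copytree(project_dir, sandbox, dirs_exist_ok=True)
    try:
        start = time.monotonic()
        # 2. Run the agent CLI with the task prompt substituted in
        subprocess.run(agent_cmd.replace("{prompt}", prompt),
                       shell=True, cwd=sandbox, timeout=timeout,
                       capture_output=True, text=True)
        # 3. Pass/fail is simply the validation command's exit code
        check = subprocess.run(validate_cmd, shell=True, cwd=sandbox,
                               capture_output=True, text=True)
        # 4. Collect metrics (token/cost extraction omitted here)
        return {"passed": check.returncode == 0,
                "wall_time": time.monotonic() - start}
    finally:
        # 5. Clean up the sandbox
        shutil.rmtree(sandbox, ignore_errors=True)
```

Because pass/fail is just an exit code, any validator works: a test runner, a typechecker, or a one-line shell script.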

CLI

agentarena run                          # Race all agents on all tasks
agentarena run --task fix-type-error    # Run specific task
agentarena run --agent claude-code      # Run specific agent
agentarena run --json                   # Export as JSON
agentarena run --csv                    # Export as CSV
agentarena run --md                     # Export as Markdown
agentarena init                         # Create starter bench.yaml
agentarena history                      # List past runs
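The `--json` export is handy for post-processing results in your own scripts. The record shape below is an assumption for illustration, not the tool's documented schema:

```python
import json

# Stand-in for `agentarena run --json > results.json` output;
# the keys here are assumptions, not a documented schema.
results = json.loads("""
[
  {"agent": "claude-code", "task": "fix-type-error", "passed": true,  "cost": 0.08},
  {"agent": "aider",       "task": "fix-type-error", "passed": true,  "cost": 0.14},
  {"agent": "codex",       "task": "fix-type-error", "passed": false, "cost": 0.21}
]
""")

# Pick the cheapest agent that passed the task
passing = [r for r in results if r["passed"]]
winner = min(passing, key=lambda r: r["cost"])
print(winner["agent"])  # claude-code
```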

Metric Extraction

agentarena uses regex patterns defined in your config to extract metrics from agent output. No code changes needed for new agents:

agents:
  - name: my-agent
    command: "my-agent '{prompt}'"
    patterns:
      tokens_in: "Input:\\s*(\\d+) tokens"        # regex with one capture group
      tokens_out: "Output:\\s*(\\d+) tokens"
      cost: "Total:\\s*\\$([\\d.]+)"
      llm_calls: "(\\d+) API calls"

No patterns? agentarena still measures wall time and pass/fail — works for any tool.
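Conceptually, the extraction step amounts to a few lines of Python. This sketch assumes the semantics implied above (search the agent's output, take the first capture group); `extract_metrics` is an illustrative helper, not agentarena's API:

```python
import re

def extract_metrics(output: str, patterns: dict[str, str]) -> dict[str, float]:
    """Apply each regex to the agent's output; keep the first capture group."""
    metrics = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, output)
        if match:
            # Strip thousands separators like "4,200" before converting
            metrics[name] = float(match.group(1).replace(",", ""))
    return metrics

output = "Input: 4200 tokens\nOutput: 1300 tokens\nTotal: $0.08\n3 API calls"
patterns = {
    "tokens_in": r"Input:\s*(\d+) tokens",
    "tokens_out": r"Output:\s*(\d+) tokens",
    "cost": r"Total:\s*\$([\d.]+)",
    "llm_calls": r"(\d+) API calls",
}
print(extract_metrics(output, patterns))
# {'tokens_in': 4200.0, 'tokens_out': 1300.0, 'cost': 0.08, 'llm_calls': 3.0}
```

A pattern that doesn't match is simply skipped, which is why missing patterns degrade gracefully to wall time and pass/fail.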

Examples

See examples/ for ready-to-use configs.

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT
