
Benchmark AI coding agents on your own codebase


agentarena

Race your AI agents. Any agent, any task, your data.

Python 3.11+ · MIT license

$ agentarena run

 agentarena v0.1.0 — racing 3 agents on 3 tasks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Task 1/3: fix-type-error
   claude-code ····· PASS   15s   $0.08   4.2K tokens   2 calls
   aider ··········· PASS   23s   $0.14   8.7K tokens   5 calls
   codex ··········· PASS   31s   $0.21  12.1K tokens   8 calls

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

              Pass Rate   Avg Time   Avg Cost   Total Tokens
 claude-code  3/3 100%      45s       $0.20       34.6K
 aider        2/3  67%      69s       $0.35       65.2K
 codex        2/3  67%      71s       $0.42       82.2K

 Winner: claude-code (highest pass rate, lowest cost)

Why agentarena?

Every company building AI is asking the same question: which agent actually works best?

Today that answer is opinions, blog posts, and vibes. Your manager asks for a POC — you spend two weeks manually testing three tools and write a Google Doc that says "I think Claude was better."

agentarena gives you hard numbers in 30 minutes:

 Who you are                   What you get
 Developer picking a tool      Run agents on YOUR codebase; see which passes more tests, costs less, runs faster
 Team doing a POC              One command, one report: give your manager data, not opinions
 Agent builder                 Prove your agent beats competitors with reproducible benchmarks
 Company evaluating vendors    Compare digital workers on your actual workload

Inspired by the ActionEngine paper, which found 11.8x differences in cost and 5.67x variance in token usage between agent architectures on identical tasks. agentarena makes these differences visible on your own data.

Install

pip install agentarena

Quick Start

1. Generate a starter bench.yaml in your project:

agentarena init

2. Define your tasks and agents:

project: my-app
timeout: 120

tasks:
  - name: fix-type-error
    prompt: "Fix the TypeScript type error in src/auth/login.ts"
    validate: "npx tsc --noEmit"

  - name: add-pagination
    prompt: "Add offset/limit pagination to GET /api/users endpoint"
    validate: "bun test test/api/users.test.ts"

agents:
  - name: claude-code
    command: "claude --print --max-turns 10 '{prompt}'"
    patterns:                                          # optional: extract metrics
      tokens_in: "input tokens:\\s*([\\d,]+)"
      tokens_out: "output tokens:\\s*([\\d,]+)"
      cost: "cost:\\s*\\$?([\\d.]+)"

  - name: aider
    command: "aider --message '{prompt}' --yes-always --no-git"

  - name: my-custom-agent                              # any CLI tool works
    command: "my-tool run '{prompt}'"

3. Run the race:

agentarena run

How It Works

For each task × agent combination:

  1. Creates a clean sandbox (git worktree for code repos, temp directory for anything else)
  2. Runs the agent CLI with your prompt
  3. Runs your validation command (tests, typecheck, lint — anything with an exit code)
  4. Collects metrics: wall time, tokens, cost, LLM calls, pass/fail
  5. Cleans up the sandbox

Works with any project — git repos, plain directories, any language, any domain.
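The per-combination loop above can be sketched in Python. This is a simplified illustration, not agentarena's actual implementation: it uses a plain temp-directory copy rather than a git worktree, and the function name `run_one` and its return fields are hypothetical:

```python
import shutil
import subprocess
import tempfile
import time

def run_one(agent_cmd: str, prompt: str, validate_cmd: str,
            project_dir: str, timeout: int = 120) -> dict:
    """Run one agent on one task in a throwaway sandbox."""
    # 1. Clean sandbox: copy the project into a temp directory
    sandbox = tempfile.mkdtemp(prefix="agentarena-")
    shutil.copytree(project_dir, sandbox, dirs_exist_ok=True)
    try:
        start = time.monotonic()
        # 2. Run the agent CLI with the task prompt substituted in
        subprocess.run(agent_cmd.replace("{prompt}", prompt),
                       shell=True, cwd=sandbox, timeout=timeout,
                       capture_output=True, text=True)
        # 3. Pass/fail is simply the validation command's exit code
        check = subprocess.run(validate_cmd, shell=True, cwd=sandbox,
                               capture_output=True, text=True)
        # 4. Collect metrics (token/cost extraction omitted here)
        return {"passed": check.returncode == 0,
                "wall_time": time.monotonic() - start}
    finally:
        # 5. Clean up the sandbox
        shutil.rmtree(sandbox, ignore_errors=True)
```

Because pass/fail is just an exit code, any validator works: a test runner, a typechecker, or a one-line shell script.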

CLI

agentarena run                          # Race all agents on all tasks
agentarena run --task fix-type-error    # Run specific task
agentarena run --agent claude-code      # Run specific agent
agentarena run --json                   # Export as JSON
agentarena run --csv                    # Export as CSV
agentarena run --md                     # Export as Markdown
agentarena init                         # Create starter bench.yaml
agentarena history                      # List past runs
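The `--json` export is handy for post-processing results in your own scripts. The record shape below is an assumption for illustration, not the tool's documented schema:

```python
import json

# Stand-in for `agentarena run --json > results.json` output;
# the keys here are assumptions, not a documented schema.
results = json.loads("""
[
  {"agent": "claude-code", "task": "fix-type-error", "passed": true,  "cost": 0.08},
  {"agent": "aider",       "task": "fix-type-error", "passed": true,  "cost": 0.14},
  {"agent": "codex",       "task": "fix-type-error", "passed": false, "cost": 0.21}
]
""")

# Pick the cheapest agent that passed the task
passing = [r for r in results if r["passed"]]
winner = min(passing, key=lambda r: r["cost"])
print(winner["agent"])  # claude-code
```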

Metric Extraction

agentarena uses regex patterns defined in your config to extract metrics from agent output. No code changes needed for new agents:

agents:
  - name: my-agent
    command: "my-agent '{prompt}'"
    patterns:
      tokens_in: "Input:\\s*(\\d+) tokens"        # regex with one capture group
      tokens_out: "Output:\\s*(\\d+) tokens"
      cost: "Total:\\s*\\$([\\d.]+)"
      llm_calls: "(\\d+) API calls"

No patterns? agentarena still measures wall time and pass/fail — works for any tool.
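Conceptually, the extraction step amounts to a few lines of Python. This sketch assumes the semantics implied above (search the agent's output, take the first capture group); `extract_metrics` is an illustrative helper, not agentarena's API:

```python
import re

def extract_metrics(output: str, patterns: dict[str, str]) -> dict[str, float]:
    """Apply each regex to the agent's output; keep the first capture group."""
    metrics = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, output)
        if match:
            # Strip thousands separators like "4,200" before converting
            metrics[name] = float(match.group(1).replace(",", ""))
    return metrics

output = "Input: 4200 tokens\nOutput: 1300 tokens\nTotal: $0.08\n3 API calls"
patterns = {
    "tokens_in": r"Input:\s*(\d+) tokens",
    "tokens_out": r"Output:\s*(\d+) tokens",
    "cost": r"Total:\s*\$([\d.]+)",
    "llm_calls": r"(\d+) API calls",
}
print(extract_metrics(output, patterns))
# {'tokens_in': 4200.0, 'tokens_out': 1300.0, 'cost': 0.08, 'llm_calls': 3.0}
```

A pattern that doesn't match is simply skipped, which is why missing patterns degrade gracefully to wall time and pass/fail.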

Examples

See examples/ for ready-to-use configs.

Contributing

See CONTRIBUTING.md for development setup and guidelines.

License

MIT
