agent-bench

Python 3.9+ | License: MIT | Tests: 96

Benchmark AI coding agents against each other on the same task.

Run the same coding task across Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, and more, then compare cost, speed, token usage, code quality, and test pass rate.

Why?

No existing tool lets regular devs compare AI coding agents head-to-head on their codebase. Terminal-Bench is academic. This is practical.

Install

pip install cli-agent-bench

Quick Start

# Create config
agent-bench init

# Edit .agent-bench.yaml with your task and agents

# Run all configured agents
agent-bench run

# Run specific agents
agent-bench run --agent claude-code,codex-cli

# Override the task
agent-bench run --task "Add error handling to all API calls"

# View results
agent-bench results

# JSON output
agent-bench results --json

# View history
agent-bench history

# Check which agents are installed
agent-bench agents

Example Output

╭─────────────────── Agent Benchmark Results ────────────────────╮
│ Task: "Add pagination to users endpoint"                       │
│ Run: 2026-04-07 00:15                                          │
├──────────┬──────────┬───────┬──────────┬────────┬──────────────┤
│ Agent    │ Time     │ Cost  │ Tokens   │ Tests  │ Quality      │
├──────────┼──────────┼───────┼──────────┼────────┼──────────────┤
│ Claude   │ 2m 14s   │ $0.42 │ 18.2K    │ 8/8 ✅ │ A (92/100)   │
│ Codex    │ 1m 47s   │ $0.31 │ 14.1K    │ 8/8 ✅ │ A- (88/100)  │
│ Gemini   │ 3m 02s   │ $0.18 │ 22.3K    │ 7/8 ⚠️ │ B+ (81/100)  │
│ Aider    │ 4m 15s   │ $0.55 │ 31.0K    │ 6/8 ❌ │ B (75/100)   │
╰──────────┴──────────┴───────┴──────────┴────────┴──────────────╯

Winner: Claude Code (A — best quality)
Fastest: Codex CLI (1m 47s)
Cheapest: Gemini CLI ($0.18)
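
The three summary lines follow directly from the table rows. A minimal sketch of that selection, assuming each row is held in a small record; the `Result` class and its field names are hypothetical and are not agent-bench's actual data model. The values are copied from the example table above.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record; field names are illustrative only.
    agent: str
    seconds: int      # wall-clock duration
    cost_usd: float
    quality: int      # score out of 100


# Values taken from the example table.
results = [
    Result("Claude Code", 2 * 60 + 14, 0.42, 92),
    Result("Codex CLI", 1 * 60 + 47, 0.31, 88),
    Result("Gemini CLI", 3 * 60 + 2, 0.18, 81),
    Result("Aider", 4 * 60 + 15, 0.55, 75),
]

winner = max(results, key=lambda r: r.quality)      # highest quality score
fastest = min(results, key=lambda r: r.seconds)     # shortest duration
cheapest = min(results, key=lambda r: r.cost_usd)   # lowest cost

print(f"Winner: {winner.agent} ({winner.quality}/100)")
print(f"Fastest: {fastest.agent} ({fastest.seconds // 60}m {fastest.seconds % 60:02d}s)")
print(f"Cheapest: {cheapest.agent} (${cheapest.cost_usd:.2f})")
```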

Configuration

.agent-bench.yaml:

agents:
  claude-code:
    command: claude
    args: ["--dangerously-skip-permissions"]
  codex-cli:
    command: codex
    args: ["--full-auto"]
  aider:
    command: aider
    args: ["--yes-always"]

default-task: "Refactor this file to use type hints throughout"

scoring:
  run-tests: true
  lint: true
  timeout: 300
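
Before a run it can help to sanity-check the shape of this file. The sketch below is an assumption about what a reasonable check might look like, not agent-bench's own validation. It works on the nested dict a YAML parser would produce for the example above, and `config_problems` is a hypothetical helper name.

```python
# The example configuration as the nested dict a YAML parser would produce.
config = {
    "agents": {
        "claude-code": {"command": "claude", "args": ["--dangerously-skip-permissions"]},
        "codex-cli": {"command": "codex", "args": ["--full-auto"]},
        "aider": {"command": "aider", "args": ["--yes-always"]},
    },
    "default-task": "Refactor this file to use type hints throughout",
    "scoring": {"run-tests": True, "lint": True, "timeout": 300},
}


def config_problems(cfg):
    """Return a list of readable problems; an empty list means the shape looks fine."""
    problems = []
    agents = cfg.get("agents") or {}
    if not agents:
        problems.append("no agents configured")
    for name, entry in agents.items():
        command = entry.get("command")
        if not isinstance(command, str) or not command:
            problems.append(f"{name}: 'command' must be a non-empty string")
        args = entry.get("args", [])
        if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
            problems.append(f"{name}: 'args' must be a list of strings")
    timeout = (cfg.get("scoring") or {}).get("timeout")
    if timeout is not None and (not isinstance(timeout, (int, float)) or timeout <= 0):
        problems.append("scoring.timeout must be a positive number")
    return problems


print(config_problems(config))  # [] for the example above
```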

Quality Score

Component               Weight
Test pass rate          40%
Lint clean              20%
Code diff sensibility   15%
Task completion         15%
Speed bonus             10%
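
The weights sum to 100%, so the total can be read as a weighted average. A minimal sketch, assuming each component is first normalised to a value between 0 and 1; how agent-bench measures the individual components is not described here, and the component values in the example are made up for illustration.

```python
# Weights from the table above, as fractions of the total score.
WEIGHTS = {
    "test_pass_rate": 0.40,
    "lint_clean": 0.20,
    "diff_sensibility": 0.15,
    "task_completion": 0.15,
    "speed_bonus": 0.10,
}


def quality_score(components):
    """Combine per-component scores (each in [0, 1]) into a 0-100 total."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = components.get(name, 0.0)  # a missing component counts as 0
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        total += weight * value
    return round(100 * total, 1)


# Illustrative inputs only: 7 of 8 tests passing, clean lint, and so on.
example = {
    "test_pass_rate": 7 / 8,
    "lint_clean": 1.0,
    "diff_sensibility": 0.8,
    "task_completion": 1.0,
    "speed_bonus": 0.5,
}
print(quality_score(example))  # 87.0
```

With these weights, a single failing test out of eight costs 5 points, which is half of what the whole speed bonus is worth.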

How It Works

  1. Copies your project to an isolated temp directory per agent
  2. Runs each agent as a subprocess with the task prompt
  3. Captures stdout/stderr, exit code, duration
  4. Parses token usage from output
  5. Runs tests and linter if configured
  6. Calculates cost from token usage + model pricing
  7. Scores quality across multiple dimensions
  8. Stores results in SQLite for history
  9. Cleans up temp directories
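
Steps 1 to 3 and step 9 can be sketched with the standard library alone. This is an illustration under stated assumptions, not agent-bench's actual code: `run_in_isolation` is a hypothetical name, the timeout is assumed to be in seconds, and a Python one-liner stands in for a real agent command.

```python
import shutil
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_in_isolation(project_dir, command, timeout=300):
    """Copy project_dir to a temp directory, run command there, then clean up.

    Returns a dict with stdout, stderr, exit code, and duration in seconds.
    """
    workdir = Path(tempfile.mkdtemp(prefix="agent-bench-"))
    try:
        sandbox = workdir / "project"
        shutil.copytree(project_dir, sandbox)          # step 1: isolated copy
        start = time.monotonic()
        try:
            proc = subprocess.run(                     # step 2: run as a subprocess
                command,
                cwd=sandbox,
                capture_output=True,                   # step 3: capture stdout/stderr
                text=True,
                timeout=timeout,
            )
            exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            exit_code, stdout, stderr = None, "", "timed out"
        duration = time.monotonic() - start
        return {
            "exit_code": exit_code,
            "stdout": stdout,
            "stderr": stderr,
            "duration": duration,
        }
    finally:
        shutil.rmtree(workdir, ignore_errors=True)     # step 9: clean up


# Stand-in for a real agent: a Python one-liner that lists the sandbox contents.
with tempfile.TemporaryDirectory() as project:
    Path(project, "app.py").write_text("print('hello')\n")
    result = run_in_isolation(
        project,
        [sys.executable, "-c", "import os; print(sorted(os.listdir('.')))"],
        timeout=30,
    )
print(result["exit_code"], result["stdout"].strip())
```

Because each agent works on its own copy, edits made by one agent cannot leak into another agent's run or into the original project.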

Supported Agents

Any CLI-based coding agent: Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, Hermes, OpenCode, and more. If it runs in a terminal, it works.
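
An agent can only be benchmarked if its command resolves on your PATH. The sketch below shows one way to check that with `shutil.which`; it is an assumption about how such a check could work, not a description of what `agent-bench agents` does internally. The command names come from the example configuration.

```python
import shutil

# Commands from the example configuration; other agents would be added the same way.
AGENT_COMMANDS = {
    "claude-code": "claude",
    "codex-cli": "codex",
    "aider": "aider",
}


def installed_agents(commands):
    """Map each agent name to True if its command resolves on PATH."""
    return {name: shutil.which(cmd) is not None for name, cmd in commands.items()}


for name, found in installed_agents(AGENT_COMMANDS).items():
    print(f"{name}: {'installed' if found else 'not found'}")
```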

License

MIT © Hiren Thakore

Download files

Source Distribution

cli_agent_bench-0.2.1.tar.gz (26.3 kB)

Built Distribution

cli_agent_bench-0.2.1-py3-none-any.whl (21.3 kB)
