agent-bench

Python 3.9+ | License: MIT | Tests: 96

Benchmark AI coding agents against each other on the same task.

Run the same coding task across Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, and other agents, then compare cost, speed, token usage, code quality, and test pass rate.

Why?

No existing tool lets working developers compare AI coding agents head-to-head on their own codebase. Terminal-Bench is aimed at academic evaluation; agent-bench is aimed at practical, day-to-day use.

Install

pip install cli-agent-bench

Quick Start

# Create config
agent-bench init

# Edit .agent-bench.yaml with your task and agents

# Run all configured agents
agent-bench run

# Run specific agents
agent-bench run --agent claude-code,codex-cli

# Override the task
agent-bench run --task "Add error handling to all API calls"

# View results
agent-bench results

# JSON output
agent-bench results --json

# View history
agent-bench history

# Check which agents are installed
agent-bench agents

Example Output

╭─────────────────── Agent Benchmark Results ────────────────────╮
│ Task: "Add pagination to users endpoint"                        │
│ Run: 2026-04-07 00:15                                          │
├──────────┬──────────┬───────┬──────────┬────────┬──────────────┤
│ Agent    │ Time     │ Cost  │ Tokens   │ Tests  │ Quality      │
├──────────┼──────────┼───────┼──────────┼────────┼──────────────┤
│ Claude   │ 2m 14s   │ $0.42 │ 18.2K    │ 8/8 ✅ │ A (92/100)   │
│ Codex    │ 1m 47s   │ $0.31 │ 14.1K    │ 8/8 ✅ │ A- (88/100)  │
│ Gemini   │ 3m 02s   │ $0.18 │ 22.3K    │ 7/8 ⚠️ │ B+ (81/100)  │
│ Aider    │ 4m 15s   │ $0.55 │ 31.0K    │ 6/8 ❌ │ B (75/100)   │
╰──────────┴──────────┴───────┴──────────┴────────┴──────────────╯

Winner: Claude Code (A — best quality)
Fastest: Codex CLI (1m 47s)
Cheapest: Gemini CLI ($0.18)
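
The three summary lines can be derived from the table rows with a max and two mins. The snippet below is a minimal sketch of that selection, not the tool's actual code: the numbers are copied from the example table, while the dict keys and the summarize function are hypothetical names chosen for illustration.

```python
# Rows copied from the example table above. The key names are
# hypothetical; the real results schema is not documented here.
RESULTS = [
    {"agent": "Claude Code", "seconds": 134, "cost": 0.42, "quality": 92},
    {"agent": "Codex CLI", "seconds": 107, "cost": 0.31, "quality": 88},
    {"agent": "Gemini CLI", "seconds": 182, "cost": 0.18, "quality": 81},
    {"agent": "Aider", "seconds": 255, "cost": 0.55, "quality": 75},
]


def summarize(results):
    """Pick the best-quality, fastest, and cheapest agent from a list of rows."""
    return {
        "winner": max(results, key=lambda r: r["quality"])["agent"],
        "fastest": min(results, key=lambda r: r["seconds"])["agent"],
        "cheapest": min(results, key=lambda r: r["cost"])["agent"],
    }


if __name__ == "__main__":
    for label, agent in summarize(RESULTS).items():
        print(f"{label}: {agent}")
```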

Configuration

.agent-bench.yaml:

agents:
  claude-code:
    command: claude
    args: ["--dangerously-skip-permissions"]
  codex-cli:
    command: codex
    args: ["--full-auto"]
  aider:
    command: aider
    args: ["--yes-always"]

default-task: "Refactor this file to use type hints throughout"

scoring:
  run-tests: true
  lint: true
  timeout: 300
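
Each entry under agents: names a command that has to be on your PATH. As a rough illustration of what a pre-flight check could look like, the sketch below mirrors the agents section as a plain Python dict (to avoid a YAML dependency) and resolves each command with shutil.which. The find_installed helper is a hypothetical name, and this is not necessarily how agent-bench agents works internally.

```python
import shutil

# Mirrors the `agents:` section of .agent-bench.yaml shown above,
# written as a plain dict so the sketch needs no YAML parser.
AGENTS = {
    "claude-code": {"command": "claude", "args": ["--dangerously-skip-permissions"]},
    "codex-cli": {"command": "codex", "args": ["--full-auto"]},
    "aider": {"command": "aider", "args": ["--yes-always"]},
}


def find_installed(agents):
    """Map each agent name to the resolved path of its command, or None if missing."""
    return {name: shutil.which(spec["command"]) for name, spec in agents.items()}


if __name__ == "__main__":
    for name, path in find_installed(AGENTS).items():
        print(f"{name}: {path or 'not found'}")
```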

Quality Score

Component              Weight
Test pass rate         40%
Lint clean             20%
Code diff sensibility  15%
Task completion        15%
Speed bonus            10%
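
The weights above sum to 100%, so the score can be read as a weighted average. The following is a minimal sketch under the assumption that each component is first normalized to a value between 0 and 1; how agent-bench actually measures each component is not specified here, and the names WEIGHTS and quality_score are illustrative only.

```python
# Weights taken from the table above, expressed as fractions of 1.
WEIGHTS = {
    "test_pass_rate": 0.40,
    "lint_clean": 0.20,
    "diff_sensibility": 0.15,
    "task_completion": 0.15,
    "speed_bonus": 0.10,
}


def quality_score(components):
    """Combine per-component scores (assumed to lie in [0, 1]) into a 0-100 score."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    # Clamp each component to [0, 1] before weighting.
    total = sum(
        weight * min(max(components[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    return round(100 * total, 1)


if __name__ == "__main__":
    # Hypothetical run: 6 of 8 tests pass, everything else is perfect.
    print(quality_score({
        "test_pass_rate": 6 / 8,
        "lint_clean": 1.0,
        "diff_sensibility": 1.0,
        "task_completion": 1.0,
        "speed_bonus": 1.0,
    }))
```

Because test pass rate carries 40% of the weight, losing a quarter of the tests costs ten points in this sketch even when every other component is perfect.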

How It Works

  1. Copies your project to an isolated temp directory per agent
  2. Runs each agent as a subprocess with the task prompt
  3. Captures stdout/stderr, exit code, duration
  4. Parses token usage from output
  5. Runs tests and linter if configured
  6. Calculates cost from token usage + model pricing
  7. Scores quality across multiple dimensions
  8. Stores results in SQLite for history
  9. Cleans up temp directories
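
Steps 1 to 3 and step 9 can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not agent-bench's implementation: run_agent is a hypothetical helper, and passing the task prompt as the final command-line argument is an assumption that may not hold for every agent.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def run_agent(project_dir, command, args, task, timeout=300):
    """Run one agent against a throwaway copy of project_dir and return basic metrics."""
    workdir = Path(tempfile.mkdtemp(prefix="agent-bench-"))
    sandbox = workdir / "project"
    try:
        # Step 1: copy the project so the agent never touches the original.
        shutil.copytree(project_dir, sandbox)
        start = time.monotonic()
        try:
            # Step 2: run the agent as a subprocess inside the copy.
            # Assumption: the task prompt is passed as the last argument.
            proc = subprocess.run(
                [command, *args, task],
                cwd=sandbox,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
            timed_out = False
        except subprocess.TimeoutExpired:
            exit_code, stdout, stderr = None, "", ""
            timed_out = True
        # Step 3: record output, exit code, and wall-clock duration.
        return {
            "exit_code": exit_code,
            "stdout": stdout,
            "stderr": stderr,
            "timed_out": timed_out,
            "duration_s": time.monotonic() - start,
        }
    finally:
        # Step 9: remove the temporary copy whether or not the run succeeded.
        shutil.rmtree(workdir, ignore_errors=True)
```

Tests, linting, token parsing, and cost calculation (steps 4 to 7) would have to run inside the sandbox before the cleanup in the finally block; they are left out to keep the sketch short.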

Supported Agents

Any CLI-based coding agent: Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, Hermes, OpenCode, and more. If an agent can be launched as a terminal command, it can be added under agents: in the config.

License

MIT © Hiren Thakore
