Benchmark AI coding agents against each other on the same task

agent-bench

[ Tests: 96 ]

Run the same coding task across Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, and more, then compare cost, speed, token usage, code quality, and test pass rate.

Why?

No existing tool lets everyday developers compare AI coding agents head-to-head on their own codebase. Terminal-Bench is built for academic evaluation. This is built for practical, day-to-day comparisons.

Install

pip install cli-agent-bench

Quick Start

# Create config
agent-bench init

# Edit .agent-bench.yaml with your task and agents

# Run all configured agents
agent-bench run

# Run specific agents
agent-bench run --agent claude-code,codex-cli

# Override the task
agent-bench run --task "Add error handling to all API calls"

# View results
agent-bench results

# JSON output
agent-bench results --json

# View history
agent-bench history

# Check which agents are installed
agent-bench agents
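
As a rough illustration of what an "is this agent installed?" check can look like, the sketch below does a plain PATH lookup. It is not taken from agent-bench itself: the `AGENT_COMMANDS` mapping and the `find_installed` helper are hypothetical names, and the executables simply mirror the `command` fields from the sample configuration.

```python
import shutil

# Hypothetical mapping of agent names to executables, mirroring the
# `command` fields in the sample .agent-bench.yaml (an assumption, not
# the tool's actual registry).
AGENT_COMMANDS = {
    "claude-code": "claude",
    "codex-cli": "codex",
    "aider": "aider",
}


def find_installed(commands: dict[str, str]) -> dict[str, bool]:
    """Report, per agent, whether its executable can be found on PATH."""
    return {
        name: shutil.which(cmd) is not None
        for name, cmd in commands.items()
    }


if __name__ == "__main__":
    for name, present in find_installed(AGENT_COMMANDS).items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

A PATH lookup only shows that a binary exists; it says nothing about whether the agent is logged in or has an API key configured.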

Example Output

╭─────────────────── Agent Benchmark Results ────────────────────╮
│ Task: "Add pagination to users endpoint"                       │
│ Run: 2026-04-07 00:15                                          │
├──────────┬──────────┬───────┬──────────┬────────┬──────────────┤
│ Agent    │ Time     │ Cost  │ Tokens   │ Tests  │ Quality      │
├──────────┼──────────┼───────┼──────────┼────────┼──────────────┤
│ Claude   │ 2m 14s   │ $0.42 │ 18.2K    │ 8/8 ✅ │ A (92/100)   │
│ Codex    │ 1m 47s   │ $0.31 │ 14.1K    │ 8/8 ✅ │ A- (88/100)  │
│ Gemini   │ 3m 02s   │ $0.18 │ 22.3K    │ 7/8 ⚠️ │ B+ (81/100)  │
│ Aider    │ 4m 15s   │ $0.55 │ 31.0K    │ 6/8 ❌ │ B (75/100)   │
╰──────────┴──────────┴───────┴──────────┴────────┴──────────────╯

Winner: Claude Code (A — best quality)
Fastest: Codex CLI (1m 47s)
Cheapest: Gemini CLI ($0.18)

Configuration

.agent-bench.yaml:

agents:
  claude-code:
    command: claude
    args: ["--dangerously-skip-permissions"]
  codex-cli:
    command: codex
    args: ["--full-auto"]
  aider:
    command: aider
    args: ["--yes-always"]

default-task: "Refactor this file to use type hints throughout"

scoring:
  run-tests: true
  lint: true
  timeout: 300

Quality Score

Component	Weight
Test pass rate	40%
Lint clean	20%
Code diff sensibility	15%
Task completion	15%
Speed bonus	10%
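
To make the weighting concrete, here is a minimal sketch of a weighted sum using the percentages from the table. Only the weights come from the table; the component keys, the assumption that each component is normalised to the range 0 to 1, and the `quality_score` function are illustrative and may not match how agent-bench computes its score.

```python
# Weights from the Quality Score table, expressed as fractions of 1.0.
WEIGHTS = {
    "test_pass_rate": 0.40,
    "lint_clean": 0.20,
    "diff_sensibility": 0.15,
    "task_completion": 0.15,
    "speed_bonus": 0.10,
}


def quality_score(components: dict[str, float]) -> float:
    """Combine component scores (each assumed to be in [0, 1]) into 0-100.

    Components that are not supplied count as 0.
    """
    for name, value in components.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown component: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    total = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(100 * total, 1)


if __name__ == "__main__":
    # Hypothetical run: 7 of 8 tests pass, lint is clean, the rest partial.
    example = {
        "test_pass_rate": 7 / 8,
        "lint_clean": 1.0,
        "diff_sensibility": 0.8,
        "task_completion": 0.8,
        "speed_bonus": 0.5,
    }
    print(quality_score(example))  # 84.0
```

Because test pass rate carries 40% of the weight, an agent that fails every test cannot score above 60 in this scheme, no matter how clean or fast its output is.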

How It Works

1. Copies your project to an isolated temp directory per agent
2. Runs each agent as a subprocess with the task prompt
3. Captures stdout/stderr, exit code, duration
4. Parses token usage from output
5. Runs tests and linter if configured
6. Calculates cost from token usage + model pricing
7. Scores quality across multiple dimensions
8. Stores results in SQLite for history
9. Cleans up temp directories
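
The copy, run, capture, and clean-up steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not agent-bench's actual code: `RunResult` and `run_agent` are hypothetical names, the caller is assumed to have already built the full command line (how the task prompt is passed to each agent is not specified here), and token parsing, testing, scoring, and storage are left out.

```python
import shutil
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class RunResult:
    agent: str
    exit_code: Optional[int]  # None when the run timed out
    duration_s: float
    stdout: str
    stderr: str
    timed_out: bool


def run_agent(agent: str, argv: list[str], project: Path,
              timeout: float = 300) -> RunResult:
    """Run `argv` inside a throwaway copy of `project` and capture the outcome."""
    workdir = Path(tempfile.mkdtemp(prefix=f"agent-bench-{agent}-"))
    try:
        sandbox = workdir / "project"
        shutil.copytree(project, sandbox)  # the original tree is never touched
        start = time.monotonic()
        try:
            proc = subprocess.run(argv, cwd=sandbox, capture_output=True,
                                  text=True, timeout=timeout)
            exit_code: Optional[int] = proc.returncode
            out, err, timed_out = proc.stdout, proc.stderr, False
        except subprocess.TimeoutExpired:
            # Partial output is dropped here to keep the sketch simple.
            exit_code, out, err, timed_out = None, "", "", True
        duration = time.monotonic() - start
        # A real tool would run tests and the linter in `sandbox` at this
        # point, before the directory is removed below.
        return RunResult(agent, exit_code, duration, out, err, timed_out)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


if __name__ == "__main__":
    # Stand-in "agent": a Python one-liner instead of a real coding agent.
    fake_agent = [sys.executable, "-c", "print('hello from the sandbox')"]
    result = run_agent("fake", fake_agent, Path.cwd())
    print(result.exit_code, f"{result.duration_s:.2f}s", result.stdout.strip())
```

Giving every agent its own copy means one agent's edits cannot leak into another agent's run, and `time.monotonic()` keeps the duration measurement unaffected by system clock adjustments.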

Supported Agents

Any CLI-based coding agent: Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, Hermes, OpenCode, and more. If it runs in a terminal, it works.

License

MIT © Hiren Thakore

Release history

0.2.1 (Apr 7, 2026)
0.2.0 (Apr 7, 2026, this version)

Download files

Source Distribution

cli_agent_bench-0.2.0.tar.gz (26.3 kB)

Uploaded Apr 7, 2026 Source

Built Distribution

cli_agent_bench-0.2.0-py3-none-any.whl (21.3 kB)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file cli_agent_bench-0.2.0.tar.gz.

File metadata

Download URL: cli_agent_bench-0.2.0.tar.gz
Upload date: Apr 7, 2026
Size: 26.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.28 {"installer":{"name":"uv","version":"0.9.28","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for cli_agent_bench-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`94b8219f68631329a03051d3cfca95d37b26ef4c37e13794b8afa5b3c9f7defc`
MD5	`97a98ecaaf1a93799b52c7aa8e5beca1`
BLAKE2b-256	`a2d50923525aea341c3e8a41100c1de0b8212e7f906ab49f5c89a7067a72bc11`

File details

Details for the file cli_agent_bench-0.2.0-py3-none-any.whl.

File metadata

Download URL: cli_agent_bench-0.2.0-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 21.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.28 {"installer":{"name":"uv","version":"0.9.28","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"macOS","version":null,"id":null,"libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for cli_agent_bench-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d8c0ef37f8eafdbc9a50adfba4d164be9fde3518bef70f1bd56079784ff650af`
MD5	`5d0cb1281304a10f3019d4be9bc2b994`
BLAKE2b-256	`8bb81da86f306477b1fc669914ef80fb624f510c8a129337e4f39fc93dc04260`

cli-agent-bench 0.2.0
