Benchmark AI coding agents against each other on the same task
agent-bench
Run the same coding task across Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, and more, then compare cost, speed, token usage, code quality, and test pass rate.
Why?
No existing tool lets regular devs compare AI coding agents head-to-head on their codebase. Terminal-Bench is academic. This is practical.
Install
pip install cli-agent-bench
Quick Start
# Create config
agent-bench init
# Edit .agent-bench.yaml with your task and agents
# Run all configured agents
agent-bench run
# Run specific agents
agent-bench run --agent claude-code,codex-cli
# Override the task
agent-bench run --task "Add error handling to all API calls"
# View results
agent-bench results
# JSON output
agent-bench results --json
# View history
agent-bench history
# Check which agents are installed
agent-bench agents
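For scripting, the JSON output can be read from Python. The following is a minimal sketch, assuming `agent-bench results --json` prints a single JSON document to stdout. The helper name `load_results` is made up for illustration, and because the shape of the JSON is not documented here, the code does not rely on any field names.

```python
import json
import subprocess


def load_results(cmd=("agent-bench", "results", "--json")):
    """Run a command and parse its stdout as JSON.

    Assumption: the command prints exactly one JSON document to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    proc = subprocess.run(list(cmd), capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Calling `load_results()` with no arguments requires `agent-bench` to be on your `PATH`. Inspect the returned object before depending on specific keys.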
Example Output
╭──────────────── Agent Benchmark Results ────────────────╮
│ Task: "Add pagination to users endpoint"                │
│ Run: 2026-04-07 00:15                                   │
├────────┬────────┬───────┬────────┬────────┬─────────────┤
│ Agent  │ Time   │ Cost  │ Tokens │ Tests  │ Quality     │
├────────┼────────┼───────┼────────┼────────┼─────────────┤
│ Claude │ 2m 14s │ $0.42 │ 18.2K  │ 8/8 ✅ │ A (92/100)  │
│ Codex  │ 1m 47s │ $0.31 │ 14.1K  │ 8/8 ✅ │ A- (88/100) │
│ Gemini │ 3m 02s │ $0.18 │ 22.3K  │ 7/8 ⚠️ │ B+ (81/100) │
│ Aider  │ 4m 15s │ $0.55 │ 31.0K  │ 6/8 ❌ │ B (75/100)  │
╰────────┴────────┴───────┴────────┴────────┴─────────────╯
Winner: Claude Code (A — best quality)
Fastest: Codex CLI (1m 47s)
Cheapest: Gemini CLI ($0.18)
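The three summary lines can be reproduced from the table rows. The sketch below assumes the winner is the row with the highest quality score, the fastest is the one with the shortest time, and the cheapest is the one with the lowest cost. The `Row` type and its field names are hypothetical and are not taken from agent-bench itself.

```python
from dataclasses import dataclass


@dataclass
class Row:
    agent: str
    seconds: int      # wall-clock time in seconds
    cost_usd: float   # cost in US dollars
    quality: int      # quality score out of 100


# Values copied from the example table above.
ROWS = [
    Row("Claude", 2 * 60 + 14, 0.42, 92),
    Row("Codex", 1 * 60 + 47, 0.31, 88),
    Row("Gemini", 3 * 60 + 2, 0.18, 81),
    Row("Aider", 4 * 60 + 15, 0.55, 75),
]


def summarize(rows):
    """Pick the best agent per criterion (assumed selection rules)."""
    return {
        "winner": max(rows, key=lambda r: r.quality).agent,
        "fastest": min(rows, key=lambda r: r.seconds).agent,
        "cheapest": min(rows, key=lambda r: r.cost_usd).agent,
    }
```

With the example rows, `summarize(ROWS)` picks Claude, Codex, and Gemini respectively, matching the summary lines.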
Configuration
.agent-bench.yaml:
agents:
  claude-code:
    command: claude
    args: ["--dangerously-skip-permissions"]
  codex-cli:
    command: codex
    args: ["--full-auto"]
  aider:
    command: aider
    args: ["--yes-always"]
default-task: "Refactor this file to use type hints throughout"
scoring:
  run-tests: true
  lint: true
timeout: 300
Quality Score
| Component | Weight |
|---|---|
| Test pass rate | 40% |
| Lint clean | 20% |
| Code diff sensibility | 15% |
| Task completion | 15% |
| Speed bonus | 10% |
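To make the weights concrete, here is a small sketch of a weighted score. It assumes each component is first normalized to a value between 0 and 1 and that the final score is the weighted sum scaled to 100. How agent-bench actually normalizes each component is not stated above, and the dictionary keys are illustrative names only.

```python
# Weights from the Quality Score table; they sum to 1.0.
WEIGHTS = {
    "test_pass_rate": 0.40,
    "lint_clean": 0.20,
    "diff_sensibility": 0.15,
    "task_completion": 0.15,
    "speed_bonus": 0.10,
}


def quality_score(components):
    """Weighted sum of components (each in [0, 1]) scaled to 0-100.

    Components that are not supplied count as 0.
    """
    for name, value in components.items():
        if name not in WEIGHTS:
            raise KeyError(f"unknown component: {name}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    return round(100 * total, 1)
```

For example, 7 of 8 tests passing (0.875), a clean lint (1.0), 0.8 for diff sensibility, 1.0 for task completion, and 0.5 for the speed bonus gives 35 + 20 + 12 + 15 + 5 = 87.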
How It Works
- Copies your project to an isolated temp directory per agent
- Runs each agent as a subprocess with the task prompt
- Captures stdout/stderr, exit code, duration
- Parses token usage from output
- Runs tests and linter if configured
- Calculates cost from token usage + model pricing
- Scores quality across multiple dimensions
- Stores results in SQLite for history
- Cleans up temp directories
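The core of the steps above (isolated copy, subprocess run, output capture, history storage, cleanup) can be sketched with the standard library. This is an illustration under stated assumptions, not agent-bench's actual code: the function names, the result dictionary, and the SQLite schema are all hypothetical, and token parsing, tests, linting, and cost calculation are left out.

```python
import shutil
import sqlite3
import subprocess
import tempfile
import time
from pathlib import Path


def run_in_isolation(project_dir, argv, timeout=300):
    """Copy the project to a temp dir, run argv there, then clean up."""
    workdir = Path(tempfile.mkdtemp(prefix="agent-bench-"))
    try:
        sandbox = workdir / "project"
        shutil.copytree(project_dir, sandbox)
        start = time.monotonic()
        try:
            proc = subprocess.run(
                argv, cwd=sandbox, capture_output=True, text=True, timeout=timeout
            )
            exit_code, out, err = proc.returncode, proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run before this is raised.
            exit_code, out, err = None, "", "timed out"
        duration = time.monotonic() - start
        return {"exit_code": exit_code, "stdout": out, "stderr": err,
                "duration_s": duration}
    finally:
        # Always remove the temp copy, even if the run failed.
        shutil.rmtree(workdir, ignore_errors=True)


def store_result(db, agent, result):
    """Append one run to a SQLite history table (hypothetical schema)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(agent TEXT, exit_code INTEGER, duration_s REAL)"
    )
    db.execute(
        "INSERT INTO runs VALUES (?, ?, ?)",
        (agent, result["exit_code"], result["duration_s"]),
    )
    db.commit()
```

Copying into a fresh directory per agent keeps one agent's edits from leaking into another's run, and the `finally` block ensures the copy is removed whether or not the agent succeeds.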
Supported Agents
Any CLI-based coding agent: Claude Code, Codex CLI, Gemini CLI, Aider, OpenClaw, Hermes, OpenCode, and more. If it runs in a terminal, it works.
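A check for which agent CLIs are available can be approximated by looking up each command on `PATH`. The sketch below assumes that is roughly what `agent-bench agents` reports; the mapping reuses the command names from the configuration example, and the function name is illustrative.

```python
import shutil

# Executable names taken from the configuration example above.
AGENT_COMMANDS = {
    "claude-code": "claude",
    "codex-cli": "codex",
    "aider": "aider",
}


def installed_agents(commands=AGENT_COMMANDS):
    """Map each agent name to its resolved executable path, or None."""
    return {name: shutil.which(cmd) for name, cmd in commands.items()}
```

An agent whose value is `None` is not on `PATH` and would need to be installed before it can be benchmarked.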
License
MIT © Hiren Thakore