A CLI arena for AI coding agents. Throw one bug at Claude Code, Codex, and aider; let them race, auto-score the results, and pick the winner.

CodeJoust

Pit AI coding agents against the same bug. Score them on tests, diff size, cost, and time. Merge the winner.

Chinese version → README_CN.md


Same model, different harness. Independent testing found Claude Sonnet scored 77% through Claude Code but 93% through Cursor on the same benchmark: a 16-point gap that comes from tooling, not model quality. Which "AI coding assistant" is right for your task is not a model question; it's a task-level empirical question.

CodeJoust answers it. One CLI command fires the same task at Claude Code, aider, and (soon) Codex, Cursor CLI, and Gemini CLI in parallel, each in its own git worktree, then auto-grades the results and hands you the winning patch.

Why not just open three terminals?

That's what most people do. It's also why most people never actually benchmark their tools: running three agents, waiting, eyeballing three diffs, and manually tallying tokens is work, so they pick one and stick with it.

CodeJoust takes all of that down to one command:

codejoust run "fix the off-by-one in Scheduler.next_fire" \
  --agents claude-code,aider --test "pytest tests/test_scheduler.py"

You get:

  • A side-by-side terminal table ranked by test pass-rate → cost → diff size → latency
  • A single-file HTML report with each agent's full diff
  • One .patch file per agent — apply the winner with git apply
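
For example, applying the winning patch from a run (path template per the artifacts table in Quickstart):

git apply .codejoust/runs/<timestamp>/claude-code.patch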

Install

pip install codejoust

You'll also need whichever agent CLIs you want to race. Install as many or as few as you like:

# Claude Code
npm install -g @anthropic-ai/claude-code

# aider
pip install aider-chat

Set the usual API keys in your environment (ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.) — CodeJoust just shells out to each CLI, so whatever auth setup you already use keeps working.
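
For example, in your shell profile (key values elided):

export ANTHROPIC_API_KEY="sk-ant-..."   # read by the claude CLI
export OPENAI_API_KEY="sk-..."          # read by aider when using OpenAI models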

Quickstart

cd ~/code/my-project

codejoust run "add a --dry-run flag to the deploy command"

You'll see:

─────────── CodeJoust — 2 agents ────────────
task:   add a --dry-run flag to the deploy command
repo:   /Users/you/code/my-project
agents: claude-code, aider

                  CodeJoust — add a --dry-run flag...
  #  agent         status    diff        tests    cost      time
★ 1  claude-code   success   +38/-2      8/8      $0.028    71.3s
  2  aider         success   +21/-1      7/8      $0.019    55.7s

winner: claude-code
  — review and merge via: cat .codejoust/runs/.../claude-code.patch | git apply

report: /Users/you/code/my-project/.codejoust/runs/20260424-222310/report.html

All artifacts live in .codejoust/runs/<timestamp>/:

file                 what
------------------   -----------------------------------------------
report.html          single-file HTML, side-by-side diffs, shareable
session.json         structured run data (for scripting and CI)
<agent>.patch        the agent's changes; apply with git apply
logs/<agent>/*.log   raw stdout/stderr from each CLI
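
Since session.json exists for scripting, a minimal consumption sketch might look like this in Python (the schema and field names here are assumptions, not documented):

import json
import pathlib

# Pick the most recent run directory (run directories are timestamped,
# so lexicographic order matches chronological order).
run_dir = sorted(pathlib.Path(".codejoust/runs").iterdir())[-1]
session = json.loads((run_dir / "session.json").read_text())

# "results", "agent", "tests", and "cost_usd" are illustrative keys only.
for result in session.get("results", []):
    print(result.get("agent"), result.get("tests"), result.get("cost_usd"))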

How it scores

Four signals, in order:

  1. Test pass ratio (tests_passed / tests_total). Auto-detects pytest or npm test; override with --test.
  2. Cost (USD). Lower is better. Pulled from each CLI's own usage output.
  3. Diff size (added + removed lines). Smaller is better — more conservative changes are usually safer.
  4. Wall time. Tie-breaker.

The first signal where one agent strictly beats another decides the winner. This is intentional: if Claude Code passes 8/8 tests and aider passes 7/8, we don't care that aider was cheaper — test correctness dominates.
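
A minimal Python sketch of that lexicographic comparison (field names are illustrative, not CodeJoust's actual internals):

from dataclasses import dataclass

@dataclass
class AgentResult:
    name: str
    pass_ratio: float  # tests_passed / tests_total, higher is better
    cost_usd: float    # lower is better
    diff_size: int     # added + removed lines, lower is better
    wall_time: float   # seconds, lower is better (tie-breaker)

def sort_key(r: AgentResult):
    # Tuple comparison is lexicographic: the first differing signal decides.
    return (-r.pass_ratio, r.cost_usd, r.diff_size, r.wall_time)

results = [
    AgentResult("claude-code", 8 / 8, 0.028, 40, 71.3),
    AgentResult("aider", 7 / 8, 0.019, 22, 55.7),
]
print(min(results, key=sort_key).name)  # claude-code: pass ratio dominates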

LLM-as-judge scoring (subjective code quality) is planned for v0.2 behind --judge; for now the scoring is fully objective so it's reproducible and cheap.

Agents

codejoust agents
#   claude-code    cli: claude
#   aider          cli: aider

MVP ships with Claude Code and aider because their CLIs have the most stable headless modes and per-run usage reporting. The following are on the Phase 2 roadmap (in order):

  • OpenAI Codex CLI (codex exec)
  • Gemini CLI (gemini -p)
  • Cursor CLI (cursor-agent) — flagged experimental until the CLI stabilises
  • OpenHands (openhands --headless)

Each adapter is ~50 lines. PRs welcome.

CLI reference

codejoust run TASK [OPTIONS]

  -a, --agents TEXT      comma-separated list. default: claude-code,aider
  -r, --repo DIR         repo root. default: current directory
      --timeout INT      per-agent timeout in seconds. default: 600
      --test CMD         test command. default: auto-detect pytest / npm test
      --model NAME       optional model override passed to every agent
      --keep-worktrees   don't clean up worktrees afterwards
      --html / --no-html write report.html. default: on
      --open             open the report in your browser when done
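
For example, combining several of these options (the task string and test path are made up for illustration):

codejoust run "fix flaky retry logic in http_client.py" \
  --agents claude-code,aider --timeout 300 \
  --test "pytest tests/test_http_client.py" --open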

vs. other tools

Project                     What it does                                                       What it doesn't
CodeJoust (this)            CLI, parallel agents, auto-score (tests+cost+diff), HTML report    LLM-as-judge is Phase 2
Claude Squad (7k★)          tmux + worktree session manager                                    no scoring, no diff comparison; manual
parallel-code (544★)        parallel runs + diff viewer                                        no auto-score, no test integration
Cursor /best-of-n           best-of-N inside Cursor                                            closed source, IDE-only, paid, no external CLIs
GitHub Agent HQ             multi-agent on GitHub Cloud                                        closed source, paid, cloud only
Terminal-Bench / SWE-bench  fixed benchmark evaluation                                         can't throw your own issue at it
CodeClash                   LLM tournaments on fixed arenas (BattleSnake, Poker)               not for your codebase

The empty cell CodeJoust fills: open-source, CLI-first, real agents (not just models), auto-scored, on your own repo.

FAQ

Does it need internet / a credit card? Only what the underlying agent CLIs need. CodeJoust itself is offline.

What about cost blowup? Each agent runs once per task. Default --timeout 600 caps wall time. Tokens and USD are reported per run; you'll see exactly what each task cost.

Can I run more than two agents at once? Yes: --agents claude-code,aider,codex. Each runs in its own worktree, fully isolated. Watch your API rate limits.
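
Under the hood each agent gets its own checkout; conceptually it amounts to something like this (illustrative paths, not CodeJoust's actual layout):

git worktree add .codejoust/worktrees/claude-code HEAD
git worktree add .codejoust/worktrees/aider HEAD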

Does it work on Windows? Tested on macOS and Linux. Windows should work inside WSL; native Windows has not been tested and is unlikely to work cleanly because the agent CLIs themselves are Unix-first.

What if I don't have tests? CodeJoust still ranks by cost, diff size, and wall time. But the really useful ranking signal is test pass-rate — if you can write a single failing test for your bug first, you'll get much better picks.
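
For example, a single failing test pinned to the off-by-one bug from the earlier example (module and API are hypothetical):

# tests/test_scheduler.py
from scheduler import Scheduler  # hypothetical module under test

def test_next_fire_not_off_by_one():
    s = Scheduler(interval=60)
    # Fails while the bug exists, giving CodeJoust a clean pass/fail signal.
    assert s.next_fire(now=0) == 60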

How do I stop the project from churning through my API quota? Start with --timeout 120 for small tasks. Each agent is independently rate-limited by its own API key. CodeJoust makes no network calls itself.

Roadmap

  • v0.1 (now): Claude Code + aider, objective scoring, HTML report.
  • v0.2: Codex CLI + Gemini CLI adapters, --judge for LLM-as-judge scoring on tied runs, YAML config for reusable agent profiles.
  • v0.3: Cursor CLI + OpenHands adapters, batch mode (run a list of issues, aggregate winners), Markdown export for PR descriptions.
  • later: server mode for team/CI use, public arena leaderboard.

Kill criteria: if claude-squad or parallel-code ship built-in auto-scoring, CodeJoust repositions as the lightweight standalone scorer and deprecates its orchestration layer.

Contributing

Adapters are the main contribution surface. See src/codejoust/adapters.py — each adapter is a subclass of AgentAdapter with build_command() and parse_usage(). Open a PR with your agent of choice.
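
A sketch of what a new adapter might look like (method signatures inferred from the names above, not from the actual source):

from codejoust.adapters import AgentAdapter

class GeminiAdapter(AgentAdapter):
    name = "gemini"
    cli = "gemini"

    def build_command(self, task: str) -> list[str]:
        # Gemini CLI takes its prompt via -p, per the roadmap section.
        return [self.cli, "-p", task]

    def parse_usage(self, output: str) -> dict:
        # Every CLI reports usage differently; parse cost/tokens here.
        # None values keep the run scoreable on the remaining signals.
        return {"cost_usd": None, "tokens": None}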

License

MIT. See LICENSE.
