Skip to main content

Measure how well and how cheaply a coding-agent harness answers questions about your codebase. Drives real agent CLIs (Claude Code, Codex, Cursor), judges answers against ground truth, and reports accuracy + token cost per suite.

Project description

Quetzal

the feathered serpent · asks · judges · reports

Measure how well — and how cheaply — a coding-agent harness answers questions about your codebase.

Quetzal points a real coding-agent CLI (Claude Code, Codex, Cursor, opencode) at a repository, asks it questions you've written, and judges each answer against a ground-truth answer. It reports accuracy, token usage, and cost per suite — so you can see whether your docs make an agent faster and cheaper, compare models/harnesses, or catch when a change makes part of the codebase harder to navigate.

It drives the actual harness — its system prompt, tools, and planning loop — not a raw-API reimplementation, because the harness is the thing worth measuring.

1. RUN     quetzal run      answer questions with an agent harness   → tokens + cost per question
2. SCORE   quetzal score    judge answers vs ground truth            → correct? + 1–5 score
3. REPORT  quetzal report   aggregate per suite + overall            → accuracy %, avg tokens, cost

Install

As a standalone tool on your PATH (recommended) — no venv to manage:

uv tool install .          # or: pipx install .   (from a clone)
quetzal --version

Or for development, editable in a venv:

git clone <your-fork-url> quetzal && cd quetzal
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python 3.11+. To answer questions you need at least one agent CLI installed and authenticated — by default Claude Code (claude). Check what Quetzal can see:

quetzal agents      # ✓ / ✗ per harness: claude-code, codex, cursor, opencode

Quick start (against Quetzal's own repo)

The shipped quetzal.toml points at this repository, with one self-referential quetzal suite, so it runs out of the box:

quetzal run --all --agent claude-code     # answer every suite's questions
quetzal score <session-id> --judge claude-code
quetzal report <session-id>

run prints the <session-id> (e.g. quetzal-20260629-101500) when it finishes. Or chain all three with the wrapper:

./run-benchmark.sh --all                  # AGENT=codex ./run-benchmark.sh quetzal

Smoke a single suite without spending much: quetzal run --suite quetzal --limit 2.

Point it at your own codebase

In a hurry? Hand your coding agent the kick-off prompt and it will install Quetzal, run init, write a first suite, and run the benchmark for you.

Run init from inside the repo you want to benchmark — it scaffolds everything:

cd /path/to/your/repo
quetzal init                       # asks which agent harness to wire the keep-docs-fresh hook for
quetzal init --agent codex         # or pick non-interactively (claude-code | codex | cursor | opencode)
quetzal init --git-hook            # also install a harness-agnostic git pre-commit hook
quetzal init --no-hooks            # config only, skip the hook

init scaffolds quetzal.toml, suites/, results/, installs the keep-docs-fresh hook native to your chosen harness (see below), and prints which agent CLIs it found. It's idempotent (existing files are left as-is unless --force). It then leaves you with a quetzal.toml to fill in — map each code area you care about to a suite:

target_repo = "/path/to/your/repo"   # the codebase under test
suites_dir  = "suites"               # one <suite>.json per suite
results_dir = "results"

[suites]
# suite name -> code root(s) relative to target_repo (the agent's starting hint)
auth     = ["services/auth"]
billing  = ["services/billing", "libs/money"]

Then write questions. Use the UI (below) or drop a suites/<name>.json file — a list of:

{
  "id": "auth_token_refresh",
  "service": "auth",
  "question": "How are refresh tokens rotated?",
  "ground_truth": "Derived from the code: on each refresh the old token is revoked and ...",
  "difficulty": "medium",
  "tags": ["tokens"],
  "reviewed": false
}

Ground truth should be derived from the code, not guessed, so the benchmark can detect a doc that's wrong or incomplete. Cases start reviewed: false; flip to true once a human verifies the answer. Every value in quetzal.toml is overridable by env var (QUETZAL_TARGET_REPO, QUETZAL_SUITES_DIR, QUETZAL_RESULTS_DIR, QUETZAL_CONFIG) for CI and ad-hoc runs.

Agents (answerer) and the judge

--agent selects the harness; --model is passed through to it (default: the CLI's own default).

Agent CLI Read-only enforcement Token + cost telemetry
claude-code (default) claude -p --output-format json --allowedTools Read Grep Glob LS full (usage + total_cost_usd)
codex codex exec --json --sandbox read-only sandbox read-only best-effort (parsed from events)
cursor cursor-agent -p --output-format json best-effort
opencode opencode run accuracy + latency only (no token telemetry)

The judge defaults to claude-code too (quetzal score --judge claude-code) — it shells out to claude -p for a structured verdict, so no API keys are required anywhere in the pipeline. Pin the judge model with --judge-model.

Answerers always run read-only (enforced per CLI, never by trusting the model). Quetzal never passes a skip-permissions flag.

Management UI

A local, build-free web console to manage question suites and view score history:

quetzal ui          # → http://127.0.0.1:8765
  • Questions tab — per-suite add / edit / delete, set difficulty and tags, flip reviewed inline. Edits write straight to the JSON suite files.
  • Score history tab — every past run as a card (overall accuracy + avg tokens + model), a per-suite accuracy/token trend chart across runs, and a click-through per-suite breakdown.

Local-only, no auth — don't expose the port publicly.

Keeping module docs fresh

Good module docs are what Quetzal's benchmark rewards — they make a coding-agent harness answer questions about your code faster and cheaper. To keep them from rotting as the code grows, quetzal init installs a keep-docs-fresh hook using each harness's own native mechanism. When the agent finishes a turn it nudges on two signals:

  • Missing docs — a new package manifest (pyproject.toml, package.json, go.mod, …) landed in a directory with no README → write documentation for that module.
  • Bloated docs — a README you're editing has grown past a budget that scales with its module's size (base + per-100-LOC × module_LOC) → condense it: cut redundancy, move deep detail out, keep purpose / API / key files. A 3000-line package earns a long README; a 50-line helper does not.

quetzal init asks how detailed READMEs should be — concise / balanced / thorough — and writes the matching budget into [docs_check]. Either way it only looks at files in the current working set, and you can always say the change isn't warranted.

Harness Native integration Installed to Behavior
claude-code Stop hook .claude/settings.json + .claude/hooks/ blocks the turn; fires once (stop_hook_active guard)
codex Stop hook .codex/hooks.json + .codex/hooks/ blocks (exit 2); run Codex /hooks to trust it first
cursor stop hook .cursor/hooks.json + .cursor/hooks/ auto-submits a follow-up; loop_limit caps re-fire
opencode plugin .opencode/plugin/ notifies on session.idle (plugins can't block a finished session)

For claude-code it also drops a document-module skill (.claude/skills/document-module/) — the documented way to write a module README + docstrings derived from the code. quetzal init --git-hook adds a harness-agnostic git pre-commit warning on top of any of these.

All of them call one command you can also run by hand:

quetzal docs-check                      # claude-code blocking JSON (the default)
quetzal docs-check --format json        # {"nudge": bool, "dirs": [...], "reason": ...}

It's deliberately high-precision, low-noise and only ever inspects files in the working set. Tune it in quetzal.toml under [docs_check]: manifests = [...] sets what counts as a "new module", and readme_base_lines / readme_lines_per_100_loc set the size-relative condense budget (both 0 disables the bloat nudge).

Output

results/<session-id>/:

  • config.json — run metadata (agent, model, judge, suites)
  • <suite>/<case-id>.json — question, answer, token usage, judge verdict
  • report.json — aggregated per-suite + overall stats

How it's organized

Area What it does
quetzal/agents/ AgentClient adapters that shell out to coding-agent CLIs (lazy registry)
quetzal/judge/ Judge prompt + the Claude Code judge that grades against ground truth
quetzal/core/ Run loop + JSON session storage
quetzal/datasets/ JSON-backed question store (shared by runner + UI)
quetzal/ui/ Build-free local web console
quetzal/{cli,score,report,main}.py The pipeline entry points
quetzal/init_cmd.py quetzal init — scaffold config, suites/results dirs, the keep-docs-fresh hook + skill
quetzal/docs_check.py quetzal docs-check — the new-module-without-docs nudge behind the hook
quetzal.toml Target repo, suites dir, results dir, suite → code-roots map

Adding a new answerer = a new AgentClient in quetzal/agents/ plus one line in its registry. Keep answers read-only; read telemetry from the harness's own report.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quetzal_eval-0.1.1.tar.gz (222.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

quetzal_eval-0.1.1-py3-none-any.whl (54.4 kB view details)

Uploaded Python 3

File details

Details for the file quetzal_eval-0.1.1.tar.gz.

File metadata

  • Download URL: quetzal_eval-0.1.1.tar.gz
  • Upload date:
  • Size: 222.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quetzal_eval-0.1.1.tar.gz
Algorithm Hash digest
SHA256 01c29c41e2584fe91d95be9a56e54528b282620372fae8ff5792980521531ac6
MD5 a1fa7b6ebfa40a5c508435af4816c8df
BLAKE2b-256 eb7c13087897ddab786a6106ac523ff9aa9f2a4323c5b6393de3f4e0258e2294

See more details on using hashes here.

Provenance

The following attestation bundles were made for quetzal_eval-0.1.1.tar.gz:

Publisher: release.yml on YoavAlro/quetzal

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file quetzal_eval-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: quetzal_eval-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 54.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quetzal_eval-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 73d0bb8b06c14b7120a75d48ea125fdd2a386e5841a07f09e018acc70a1e04fb
MD5 f7247b48960229d20152d93eab875d2f
BLAKE2b-256 63cc029cae9c777d41db9da7044ecda5adebd0d09f6fd7281d723d9b074beb74

See more details on using hashes here.

Provenance

The following attestation bundles were made for quetzal_eval-0.1.1-py3-none-any.whl:

Publisher: release.yml on YoavAlro/quetzal

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page