
Measure how well and how cheaply a coding-agent harness answers questions about your codebase. Drives real agent CLIs (Claude Code, Codex, Cursor, opencode), judges answers against ground truth, and reports accuracy and token cost per suite.


Quetzal

the feathered serpent · asks · judges · reports

Measure how well — and how cheaply — a coding-agent harness answers questions about your codebase.

Quetzal points a real coding-agent CLI (Claude Code, Codex, Cursor, opencode) at a repository, asks it questions you've written, and judges each answer against a ground-truth answer. It reports accuracy, token usage, and cost per suite — so you can see whether your docs make an agent faster and cheaper, compare models/harnesses, or catch when a change makes part of the codebase harder to navigate.

It drives the actual harness — its system prompt, tools, and planning loop — not a raw-API reimplementation, because the harness is the thing worth measuring.

1. RUN     quetzal run      answer questions with an agent harness   → tokens + cost per question
2. SCORE   quetzal score    judge answers vs ground truth            → correct? + 1–5 score
3. REPORT  quetzal report   aggregate per suite + overall            → accuracy %, avg tokens, cost
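
The REPORT step amounts to a per-suite aggregation. A minimal sketch in Python, assuming each judged case is a dict with the hypothetical keys `suite`, `correct`, `tokens`, and `cost_usd` (Quetzal's real result schema may differ, and the sample numbers are made up):

```python
from collections import defaultdict


def aggregate(cases):
    """Group judged cases by suite; compute accuracy %, average tokens, total cost.

    The keys "suite", "correct", "tokens", and "cost_usd" are illustrative
    assumptions, not Quetzal's documented schema.
    """
    by_suite = defaultdict(list)
    for case in cases:
        by_suite[case["suite"]].append(case)

    report = {}
    for suite, rows in by_suite.items():
        n = len(rows)
        report[suite] = {
            "cases": n,
            "accuracy_pct": 100.0 * sum(1 for r in rows if r["correct"]) / n,
            "avg_tokens": sum(r["tokens"] for r in rows) / n,
            "cost_usd": sum(r["cost_usd"] for r in rows),
        }
    return report


# Made-up sample cases, for illustration only.
cases = [
    {"suite": "auth", "correct": True, "tokens": 1200, "cost_usd": 0.02},
    {"suite": "auth", "correct": False, "tokens": 1800, "cost_usd": 0.03},
    {"suite": "billing", "correct": True, "tokens": 900, "cost_usd": 0.01},
]
report = aggregate(cases)
print(report["auth"])
```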

Install

As a standalone tool on your PATH (recommended) — no venv to manage:

uv tool install quetzal-eval      # or: pipx install quetzal-eval
quetzal --version

The distribution is quetzal-eval on PyPI; the command it installs is quetzal. Or for development, editable from a clone:

git clone https://github.com/YoavAlro/quetzal && cd quetzal
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python 3.11+. To answer questions you need at least one agent CLI installed and authenticated — by default Claude Code (claude). Check what Quetzal can see:

quetzal agents      # ✓ / ✗ per harness: claude-code, codex, cursor, opencode

Quick start (against Quetzal's own repo)

The shipped quetzal.toml points at this repository, with one self-referential quetzal suite, so it runs out of the box:

quetzal run --all --agent claude-code     # answer → judge → report, in one command

By default run answers every question, judges the answers, and prints the report — the whole pipeline. Stop earlier with --no-score (just answer) or --no-report, and pin the judge with --judge / --judge-model. The individual steps are still there if you want them separately:

quetzal run --all --no-score              # just answer (prints the <session-id>)
quetzal score <session-id>                # judge later
quetzal report <session-id>               # re-print the summary

Smoke-test a single suite without spending much: quetzal run --suite quetzal --limit 2.

Point it at your own codebase

In a hurry? Hand your coding agent the kick-off prompt — it installs Quetzal, runs init, explores your code to generate an eval suite (questions + code-derived ground truth), and runs a first benchmark for you.

Run init from inside the repo you want to benchmark — it scaffolds everything:

cd /path/to/your/repo
quetzal init                       # asks which agent harness to wire the keep-docs-fresh hook for
quetzal init --agent codex         # or pick non-interactively (claude-code | codex | cursor | opencode)
quetzal init --git-hook            # also install a harness-agnostic git pre-commit hook
quetzal init --no-hooks            # config only, skip the hook

init scaffolds quetzal.toml, suites/, and .quetzal/results/, installs the keep-docs-fresh hook native to your chosen harness (see below), and prints which agent CLIs it found. It's idempotent: existing files are left as-is unless you pass --force. Next, fill in quetzal.toml, mapping each code area you care about to a suite:

target_repo = "/path/to/your/repo"   # the codebase under test
suites_dir  = "suites"               # one <suite>.json per suite (curated, committed)
results_dir = ".quetzal/results"     # benchmark sessions (generated, git-ignored)

[suites]
# suite name -> code root(s) relative to target_repo (the agent's starting hint)
auth     = ["services/auth"]
billing  = ["services/billing", "libs/money"]

Then write questions. Use the UI (below) or drop in a suites/<name>.json file containing a list of entries like this one:

{
  "id": "auth_token_refresh",
  "service": "auth",
  "question": "How are refresh tokens rotated?",
  "ground_truth": "Derived from the code: on each refresh the old token is revoked and ...",
  "difficulty": "medium",
  "tags": ["tokens"]
}
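
A suite file can be sanity-checked before a run. The sketch below is one way to do it in Python; treating every key from the example entry as required, and rejecting duplicate ids, are assumptions of this sketch rather than rules Quetzal is known to enforce.

```python
import json

# Keys taken from the example entry; treating all of them as required
# is an assumption made for this sketch.
REQUIRED_KEYS = {"id", "service", "question", "ground_truth", "difficulty", "tags"}


def validate_suite(text):
    """Return a list of human-readable problems found in a suite file's JSON text."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        return ["suite file must contain a JSON list"]

    problems = []
    seen_ids = set()
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        case_id = entry.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"entry {i}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return problems


good = json.dumps([{
    "id": "auth_token_refresh",
    "service": "auth",
    "question": "How are refresh tokens rotated?",
    "ground_truth": "Derived from the code: ...",
    "difficulty": "medium",
    "tags": ["tokens"],
}])
bad = json.dumps([{
    "id": "x", "service": "auth", "question": "?",
    "difficulty": "easy", "tags": [],
}])
print(validate_suite(good))
print(validate_suite(bad))
```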

Ground truth should be derived from the code, not guessed, so the benchmark can detect a doc that's wrong or incomplete. Every value in quetzal.toml is overridable by env var (QUETZAL_TARGET_REPO, QUETZAL_SUITES_DIR, QUETZAL_RESULTS_DIR, QUETZAL_CONFIG) for CI and ad-hoc runs.

Agents (answerer) and the judge

--agent selects the harness; --model is passed through to it (default: the CLI's own default).

Agent | CLI | Read-only enforcement | Token + cost telemetry
--- | --- | --- | ---
claude-code (default) | claude -p --output-format json | --allowedTools Read Grep Glob LS | full (usage + total_cost_usd)
codex | codex exec --json --sandbox read-only | sandbox read-only | best-effort (parsed from events)
cursor | cursor-agent -p --output-format json | | best-effort
opencode | opencode run | | accuracy + latency only (no token telemetry)
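
For the claude-code row, "full" telemetry means the harness's JSON result carries both a usage object and total_cost_usd. A hedged sketch of reading them in Python: the `input_tokens` / `output_tokens` keys inside `usage` and the sample payload are assumptions for illustration, and `extract_telemetry` is a hypothetical helper, not part of Quetzal.

```python
import json


def extract_telemetry(raw):
    """Pull token and cost figures out of a Claude Code JSON result string.

    Missing fields fall back to 0 (tokens) or None (cost) instead of raising,
    since other harnesses only offer best-effort telemetry.
    """
    payload = json.loads(raw)
    usage = payload.get("usage") or {}
    return {
        "input_tokens": usage.get("input_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "cost_usd": payload.get("total_cost_usd"),
    }


# Made-up payload in the assumed shape, for illustration only.
sample = json.dumps({
    "result": "Refresh tokens are rotated on every use.",
    "total_cost_usd": 0.0123,
    "usage": {"input_tokens": 1500, "output_tokens": 320},
})
telemetry = extract_telemetry(sample)
print(telemetry)
```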

The judge defaults to claude-code too (quetzal score --judge claude-code) — it shells out to claude -p for a structured verdict, so no API keys are required anywhere in the pipeline. Pin the judge model with --judge-model.

Answerers always run read-only (enforced per CLI, never by trusting the model). Quetzal never passes a skip-permissions flag.

Management UI

A local, build-free web console to manage question suites and view score history:

quetzal ui          # → http://127.0.0.1:8765
  • Questions tab — per-suite add / edit / delete, set difficulty and tags; each suite shows its latest benchmark score in the sidebar. Edits write straight to the JSON suite files.
  • Score history tab — every past run as a card and in an all-runs table (accuracy, tokens, cost, agent, judge), a per-suite accuracy/token trend chart, and a click-through breakdown.

Local-only, no auth — don't expose the port publicly. Styled in the Quetzal brand theme (navy + teal→green); the palette, type, and component tokens are documented in docs/design.md.

Keeping module docs fresh

Good module docs are what Quetzal's benchmark rewards — they make a coding-agent harness answer questions about your code faster and cheaper. To keep them from rotting as the code grows, quetzal init installs a keep-docs-fresh hook using each harness's own native mechanism. When the agent finishes a turn it nudges on two signals:

  • Missing docs — a new package manifest (pyproject.toml, package.json, go.mod, …) landed in a directory with no README → write documentation for that module.
  • Bloated docs — a README you're editing has grown past a budget that scales with its module's size (base + per-100-LOC × module_LOC) → condense it: cut redundancy, move deep detail out, keep purpose / API / key files. A 3000-line package earns a long README; a 50-line helper does not.
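
The missing-docs signal can be sketched as a small set computation. This is an illustration under assumptions, not the code behind the hook: the manifest set is only the subset named above, README detection by filename prefix is a guess, and `dirs_missing_docs` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Only the manifests named in the text; the real list is configurable.
MANIFESTS = {"pyproject.toml", "package.json", "go.mod"}


def dirs_missing_docs(added_files, existing_files):
    """Return directories that gained a manifest but contain no README."""
    readme_dirs = {
        str(PurePosixPath(path).parent)
        for path in existing_files
        if PurePosixPath(path).name.lower().startswith("readme")
    }
    flagged = set()
    for path in added_files:
        p = PurePosixPath(path)
        if p.name in MANIFESTS and str(p.parent) not in readme_dirs:
            flagged.add(str(p.parent))
    return sorted(flagged)


added = ["services/auth/pyproject.toml", "libs/money/package.json"]
existing = added + ["libs/money/README.md"]
print(dirs_missing_docs(added, existing))  # ['services/auth']
```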

quetzal init asks how detailed READMEs should be (concise, balanced, or thorough) and writes the matching budget into [docs_check]. Whichever level you choose, the check only looks at files in the current working set, and you can always say the change isn't warranted.

Harness | Native integration | Installed to | Behavior
--- | --- | --- | ---
claude-code | Stop hook | .claude/settings.json + .claude/hooks/ | blocks the turn; fires once (stop_hook_active guard)
codex | Stop hook | .codex/hooks.json + .codex/hooks/ | blocks (exit 2); run Codex /hooks to trust it first
cursor | stop hook | .cursor/hooks.json + .cursor/hooks/ | auto-submits a follow-up; loop_limit caps re-fire
opencode | plugin | .opencode/plugin/ | notifies on session.idle (plugins can't block a finished session)

For claude-code it also drops a document-module skill (.claude/skills/document-module/) — the documented way to write a module README + docstrings derived from the code. quetzal init --git-hook adds a harness-agnostic git pre-commit warning on top of any of these.

All of them call one command you can also run by hand:

quetzal docs-check                      # claude-code blocking JSON (the default)
quetzal docs-check --format json        # {"nudge": bool, "dirs": [...], "reason": ...}

It's deliberately high-precision and low-noise, and it only ever inspects files in the working set. Tune it in quetzal.toml under [docs_check]: manifests = [...] sets what counts as a "new module", and readme_base_lines / readme_lines_per_100_loc set the size-relative condense budget (setting both to 0 disables the bloat nudge).
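
As a worked example of the size-relative budget, the sketch below reads the formula as readme_base_lines + readme_lines_per_100_loc × (module LOC / 100). That reading, the helper names, and the sample values 40 and 5 are assumptions; check the values `quetzal init` actually wrote into your [docs_check].

```python
def readme_budget(module_loc, base_lines, lines_per_100_loc):
    """Line budget for a README, scaled by the size of its module."""
    return base_lines + lines_per_100_loc * module_loc / 100


def is_bloated(readme_lines, module_loc, base_lines, lines_per_100_loc):
    """True when a README exceeds its budget; both settings at 0 disable the check."""
    if base_lines == 0 and lines_per_100_loc == 0:
        return False
    return readme_lines > readme_budget(module_loc, base_lines, lines_per_100_loc)


# Hypothetical settings: 40 base lines plus 5 lines per 100 LOC.
print(readme_budget(3000, 40, 5))    # 190.0
print(readme_budget(50, 40, 5))      # 42.5
print(is_bloated(120, 3000, 40, 5))  # False
print(is_bloated(120, 50, 40, 5))    # True
```

With these numbers a 120-line README is within budget for a 3000-line package but over budget for a 50-line helper, which matches the intent described earlier.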

Output

.quetzal/results/<session-id>/:

  • config.json — run metadata (agent, model, judge, suites)
  • <suite>/<case-id>.json — question, answer, token usage, judge verdict
  • report.json — aggregated per-suite + overall stats
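
Because a session is plain JSON on disk, it is easy to inspect from a script. A minimal sketch that lists case ids per suite, relying only on the directory layout above (the `list_cases` helper and the throwaway demo session are illustrative, and nothing is assumed about the fields inside each file):

```python
import json
import tempfile
from pathlib import Path


def list_cases(session_dir):
    """Map each suite directory in a session to the sorted case ids stored in it."""
    session = Path(session_dir)
    cases = {}
    for suite_dir in sorted(p for p in session.iterdir() if p.is_dir()):
        cases[suite_dir.name] = sorted(f.stem for f in suite_dir.glob("*.json"))
    return cases


# Build a throwaway session with the documented layout to demonstrate.
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "demo-session"
    (session / "auth").mkdir(parents=True)
    (session / "config.json").write_text(json.dumps({"agent": "claude-code"}))
    (session / "report.json").write_text("{}")
    (session / "auth" / "auth_token_refresh.json").write_text("{}")
    found = list_cases(session)

print(found)  # {'auth': ['auth_token_refresh']}
```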

How it's organized

Area | What it does
--- | ---
quetzal/agents/ | AgentClient adapters that shell out to coding-agent CLIs (lazy registry)
quetzal/judge/ | Judge prompt + the Claude Code judge that grades against ground truth
quetzal/core/ | Run loop + JSON session storage
quetzal/datasets/ | JSON-backed question store (shared by runner + UI)
quetzal/ui/ | Build-free local web console
quetzal/{cli,score,report,main}.py | The pipeline entry points
quetzal/init_cmd.py | quetzal init: scaffold config, suites/results dirs, the keep-docs-fresh hook + skill
quetzal/docs_check.py | quetzal docs-check: the new-module-without-docs nudge behind the hook
quetzal.toml | Target repo, suites dir, results dir, suite → code-roots map

Adding a new answerer = a new AgentClient in quetzal/agents/ plus one line in its registry. Keep answers read-only; read telemetry from the harness's own report.
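
For readers unfamiliar with the pattern, a "lazy registry" typically maps a name to an import path and imports the module only when that name is requested. The sketch below shows the general idea and is not Quetzal's registry: the `REGISTRY` contents, the "module:attribute" format, and `resolve` are all hypothetical, and a standard-library class stands in for an AgentClient so the example runs anywhere.

```python
from importlib import import_module

# Hypothetical registry: name -> "module:attribute". In a real adapter
# registry the values would point at AgentClient classes; a stdlib class
# is used here as a placeholder.
REGISTRY = {
    "demo-json": "json:JSONDecoder",
}


def resolve(name):
    """Import and return the registered class only when it is asked for."""
    try:
        target = REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown agent {name!r}; known: {sorted(REGISTRY)}") from None
    module_name, _, attr = target.partition(":")
    return getattr(import_module(module_name), attr)


print(resolve("demo-json").__name__)  # JSONDecoder
```

Deferring the import means a missing or broken adapter only fails when someone selects it, so one unavailable harness does not break the others.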

License

MIT — see LICENSE.
