Skip to main content

Measure how well and how cheaply a coding-agent harness answers questions about your codebase. Drives real agent CLIs (Claude Code, Codex, Cursor), judges answers against ground truth, and reports accuracy + token cost per suite.

Project description

Quetzal

the feathered serpent · asks · judges · reports

Measure how well — and how cheaply — a coding-agent harness answers questions about your codebase.

Quetzal points a real coding-agent CLI (Claude Code, Codex, Cursor, opencode) at a repository, asks it questions you've written, and judges each answer against a ground-truth answer. It reports accuracy, token usage, and cost per suite — so you can see whether your docs make an agent faster and cheaper, compare models/harnesses, or catch when a change makes part of the codebase harder to navigate.

It drives the actual harness — its system prompt, tools, and planning loop — not a raw-API reimplementation, because the harness is the thing worth measuring.

1. RUN     quetzal run      answer questions with an agent harness   → tokens + cost per question
2. SCORE   quetzal score    judge answers vs ground truth            → correct? + 1–5 score
3. REPORT  quetzal report   aggregate per suite + overall            → accuracy %, avg tokens, cost

Install

As a standalone tool on your PATH (recommended) — no venv to manage:

uv tool install quetzal-eval      # or: pipx install quetzal-eval
quetzal --version

The distribution is quetzal-eval on PyPI; the command it installs is quetzal. Or for development, editable from a clone:

git clone https://github.com/YoavAlro/quetzal && cd quetzal
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Requires Python 3.11+. To answer questions you need at least one agent CLI installed and authenticated — by default Claude Code (claude). Check what Quetzal can see:

quetzal agents      # ✓ / ✗ per harness: claude-code, codex, cursor, opencode

Quick start (against Quetzal's own repo)

The shipped quetzal.toml points at this repository, with one self-referential quetzal suite, so it runs out of the box:

quetzal run --all --agent claude-code     # answer → judge → report, in one command

By default run answers every question, judges the answers, and prints the report — the whole pipeline. Stop earlier with --no-score (just answer) or --no-report, and pin the judge with --judge / --judge-model. The individual steps are still there if you want them separately:

quetzal run --all --no-score              # just answer (prints the <session-id>)
quetzal score <session-id>                # judge later
quetzal report <session-id>               # re-print the summary

Smoke a single suite without spending much: quetzal run --suite quetzal --limit 2.

Point it at your own codebase

In a hurry? Hand your coding agent the kick-off prompt — it installs Quetzal, runs init, explores your code to generate an eval suite (questions + code-derived ground truth), and runs a first benchmark for you.

Run init from inside the repo you want to benchmark — it scaffolds everything:

cd /path/to/your/repo
quetzal init                       # asks which agent harness to wire the keep-docs-fresh hook for
quetzal init --agent codex         # or pick non-interactively (claude-code | codex | cursor | opencode)
quetzal init --git-hook            # also install a harness-agnostic git pre-commit hook
quetzal init --no-hooks            # config only, skip the hook

init scaffolds quetzal.toml, suites/, .quetzal/results/, installs the keep-docs-fresh hook native to your chosen harness (see below), and prints which agent CLIs it found. It's idempotent (existing files are left as-is unless --force). It then leaves you with a quetzal.toml to fill in — map each code area you care about to a suite:

target_repo = "/path/to/your/repo"   # the codebase under test
suites_dir  = "suites"               # one <suite>.json per suite (curated, committed)
results_dir = ".quetzal/results"     # benchmark sessions (generated, git-ignored)

[suites]
# suite name -> code root(s) relative to target_repo (the agent's starting hint)
auth     = ["services/auth"]
billing  = ["services/billing", "libs/money"]

Then write questions. Use the UI (below) or drop a suites/<name>.json file — a list of:

{
  "id": "auth_token_refresh",
  "service": "auth",
  "question": "How are refresh tokens rotated?",
  "ground_truth": "Derived from the code: on each refresh the old token is revoked and ...",
  "difficulty": "medium",
  "tags": ["tokens"],
  "reviewed": false
}

Ground truth should be derived from the code, not guessed, so the benchmark can detect a doc that's wrong or incomplete. Cases start reviewed: false; flip to true once a human verifies the answer. Every value in quetzal.toml is overridable by env var (QUETZAL_TARGET_REPO, QUETZAL_SUITES_DIR, QUETZAL_RESULTS_DIR, QUETZAL_CONFIG) for CI and ad-hoc runs.

Agents (answerer) and the judge

--agent selects the harness; --model is passed through to it (default: the CLI's own default).

Agent CLI Read-only enforcement Token + cost telemetry
claude-code (default) claude -p --output-format json --allowedTools Read Grep Glob LS full (usage + total_cost_usd)
codex codex exec --json --sandbox read-only sandbox read-only best-effort (parsed from events)
cursor cursor-agent -p --output-format json best-effort
opencode opencode run accuracy + latency only (no token telemetry)

The judge defaults to claude-code too (quetzal score --judge claude-code) — it shells out to claude -p for a structured verdict, so no API keys are required anywhere in the pipeline. Pin the judge model with --judge-model.

Answerers always run read-only (enforced per CLI, never by trusting the model). Quetzal never passes a skip-permissions flag.

Management UI

A local, build-free web console to manage question suites and view score history:

quetzal ui          # → http://127.0.0.1:8765
  • Questions tab — per-suite add / edit / delete, set difficulty and tags, flip reviewed inline. Edits write straight to the JSON suite files.
  • Score history tab — every past run as a card (overall accuracy + avg tokens + model), a per-suite accuracy/token trend chart across runs, and a click-through per-suite breakdown.

Local-only, no auth — don't expose the port publicly. Styled in the Quetzal brand theme (navy + teal→green); the palette, type, and component tokens are documented in docs/design.md.

Keeping module docs fresh

Good module docs are what Quetzal's benchmark rewards — they make a coding-agent harness answer questions about your code faster and cheaper. To keep them from rotting as the code grows, quetzal init installs a keep-docs-fresh hook using each harness's own native mechanism. When the agent finishes a turn it nudges on two signals:

  • Missing docs — a new package manifest (pyproject.toml, package.json, go.mod, …) landed in a directory with no README → write documentation for that module.
  • Bloated docs — a README you're editing has grown past a budget that scales with its module's size (base + per-100-LOC × module_LOC) → condense it: cut redundancy, move deep detail out, keep purpose / API / key files. A 3000-line package earns a long README; a 50-line helper does not.

quetzal init asks how detailed READMEs should be — concise / balanced / thorough — and writes the matching budget into [docs_check]. Either way it only looks at files in the current working set, and you can always say the change isn't warranted.

Harness Native integration Installed to Behavior
claude-code Stop hook .claude/settings.json + .claude/hooks/ blocks the turn; fires once (stop_hook_active guard)
codex Stop hook .codex/hooks.json + .codex/hooks/ blocks (exit 2); run Codex /hooks to trust it first
cursor stop hook .cursor/hooks.json + .cursor/hooks/ auto-submits a follow-up; loop_limit caps re-fire
opencode plugin .opencode/plugin/ notifies on session.idle (plugins can't block a finished session)

For claude-code it also drops a document-module skill (.claude/skills/document-module/) — the documented way to write a module README + docstrings derived from the code. quetzal init --git-hook adds a harness-agnostic git pre-commit warning on top of any of these.

All of them call one command you can also run by hand:

quetzal docs-check                      # claude-code blocking JSON (the default)
quetzal docs-check --format json        # {"nudge": bool, "dirs": [...], "reason": ...}

It's deliberately high-precision, low-noise and only ever inspects files in the working set. Tune it in quetzal.toml under [docs_check]: manifests = [...] sets what counts as a "new module", and readme_base_lines / readme_lines_per_100_loc set the size-relative condense budget (both 0 disables the bloat nudge).

Output

.quetzal/results/<session-id>/:

  • config.json — run metadata (agent, model, judge, suites)
  • <suite>/<case-id>.json — question, answer, token usage, judge verdict
  • report.json — aggregated per-suite + overall stats

How it's organized

Area What it does
quetzal/agents/ AgentClient adapters that shell out to coding-agent CLIs (lazy registry)
quetzal/judge/ Judge prompt + the Claude Code judge that grades against ground truth
quetzal/core/ Run loop + JSON session storage
quetzal/datasets/ JSON-backed question store (shared by runner + UI)
quetzal/ui/ Build-free local web console
quetzal/{cli,score,report,main}.py The pipeline entry points
quetzal/init_cmd.py quetzal init — scaffold config, suites/results dirs, the keep-docs-fresh hook + skill
quetzal/docs_check.py quetzal docs-check — the new-module-without-docs nudge behind the hook
quetzal.toml Target repo, suites dir, results dir, suite → code-roots map

Adding a new answerer = a new AgentClient in quetzal/agents/ plus one line in its registry. Keep answers read-only; read telemetry from the harness's own report.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

quetzal_eval-0.2.2.tar.gz (266.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

quetzal_eval-0.2.2-py3-none-any.whl (94.0 kB view details)

Uploaded Python 3

File details

Details for the file quetzal_eval-0.2.2.tar.gz.

File metadata

  • Download URL: quetzal_eval-0.2.2.tar.gz
  • Upload date:
  • Size: 266.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quetzal_eval-0.2.2.tar.gz
Algorithm Hash digest
SHA256 9296b43a2d5161291e743c3e20e1af0d096b61733e934926865c8ef6bbdc54c7
MD5 c1e8ea3829cfd6a48a8de36af2167370
BLAKE2b-256 a9e3a7c69262cf2c57ad4647ffdb919ce0d0d2aa2122ba4bf1461737da5b763c

See more details on using hashes here.

File details

Details for the file quetzal_eval-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: quetzal_eval-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 94.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for quetzal_eval-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 121eb17e98cf0a152210872317b15500e82c933401ce4d6fa5086e8bb6fc22d3
MD5 55dcdfaf281735c08f794e5a207abcbb
BLAKE2b-256 a0e676f70ce8c4f6fcc98b3238fce75d98039880e8010dd390cab50cd9c492ef

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page