quetzal-eval
Last released
Measure how well and how cheaply a coding-agent harness answers questions about your codebase. Drives real agent CLIs (Claude Code, Codex, Cursor), judges answers against ground truth, and reports accuracy + token cost per suite.