
Config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora


proofbench

A config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora. You plant tasks you already have the answers to, run an agent skill across them, and grade what it changed against the known answer with an LLM judge. Then you let the harness rewrite the skill from its own misses and keep the rewrite only if the score holds.

It started as the harness behind crg-debug, a graph-driven debugging skill. The methodology generalizes to any skill whose output you can compare to a known answer.

Why it exists

A skill is only as trustworthy as the proof that it works. Anyone can write a prompt that sounds like a methodology. The honest way to know is to measure it against ground truth, on the model weak enough to embarrass it. Two ideas carry the whole design, both learned the hard way (see METHODOLOGY.md):

  1. Fail loud or do not measure. Every empty or non-numeric grade is a hard stop. An eval that fails open does not just miss data; it manufactures false confidence.
  2. The weak model is the signal, not the noise. A self-improving loop learns only from misses. Frontier models on easy tasks miss nothing, so they teach nothing. The weak leg is the curriculum.
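The first principle can be sketched as a grade parser that raises instead of defaulting. This is an illustration, not proofbench's own code: the names `GradeError` and `parse_grade` are hypothetical, and the [0, 1] score range is an assumption.

```python
class GradeError(RuntimeError):
    """Raised when a judge reply cannot be read as a score (hypothetical name)."""


def parse_grade(raw: str) -> float:
    """Return the judge's score, or raise. Never substitute a default value."""
    text = raw.strip()
    if not text:
        raise GradeError("judge returned an empty grade")
    try:
        score = float(text)
    except ValueError as exc:
        raise GradeError(f"judge returned a non-numeric grade: {text!r}") from exc
    # Assumption: scores live in [0, 1]. This comparison also rejects NaN,
    # because every comparison against NaN is False.
    if not 0.0 <= score <= 1.0:
        raise GradeError(f"grade out of range: {score}")
    return score
```

The point of the design is that a failed grade stops the run rather than quietly becoming a 0 or being dropped from the average.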

Install

uv tool install proofbench      # or: uvx proofbench

Requires the claude CLI on PATH for the default runner and for the judge.

Quickstart (demo)

git clone https://github.com/CodeBlackwell/proofbench && cd proofbench
bash examples/sample-corpus/build.sh        # builds two toy repos with buggy + fixed branches
uvx --from . proofbench run --demo          # eval an agent over them, graded vs the answers

How it works

One YAML config declares everything domain-specific; the engine is generic. The mode is implied by what the config contains:

  • a corpus + runner + judge gives you a bench (run)
  • adding a subject + synth unlocks the self-improving loop (optimize)

subject: ~/.claude/skills/crg-debug/SKILL.md   # optional; omit for pure-eval mode
runner: claude                                  # default adapter; any executable works
models: [opus, sonnet, haiku]                   # the driver sweep; the weak leg is the signal
judge_model: opus                               # held constant; never let a model grade itself
objective: macro_recall                         # the metric the keep/revert gate reads
judge: prompts/judge.md
synth: prompts/synth.md                          # present => `optimize` is available
corpus:
  - name: primes
    path: examples/sample-corpus/repos/primes
    invoke: "Find and fix the bug in this repository."
    default_branch: buggy
    answer_branch: fixed

proofbench run      --config proofbench.yaml    # eval + scoreboard
proofbench optimize --config proofbench.yaml    # baseline -> synth -> re-run -> keep|revert
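
How the mode falls out of the config can be sketched as below. This is an assumption about the behavior described above, not the engine's actual loader: `available_commands` is a hypothetical helper that takes the already-parsed YAML as a plain dict, and it treats all three bench keys as required even though a real loader might default `runner`.

```python
def available_commands(config: dict) -> list[str]:
    """Infer which commands a parsed config supports (hypothetical helper)."""
    # A bench needs something to run on, something to run, and something to grade with.
    required = ("corpus", "runner", "judge")
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise ValueError(f"config is missing: {', '.join(missing)}")

    commands = ["run"]
    # The self-improving loop needs a subject to rewrite and a synth prompt to rewrite it with.
    if config.get("subject") and config.get("synth"):
        commands.append("optimize")
    return commands
```

With the example config above this would return both commands; dropping `subject` or `synth` would leave only `run`.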

The keep/revert gate

optimize runs a baseline on the weak leg, asks an LLM to rewrite the subject from the graded misses, re-runs, and keeps the rewrite only if the objective has not regressed. The decision is the harness comparing two numbers, never the model's own claim of success.
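
A minimal sketch of one pass through the gate, assuming the evaluation and the rewrite are supplied as callables. `optimize_once`, `evaluate`, and `synthesize` are hypothetical names, and reading "did not regress" as "greater than or equal to the baseline" is an assumption.

```python
from typing import Callable

# evaluate(subject) -> (objective score, graded misses)
Evaluate = Callable[[str], tuple[float, list[str]]]
# synthesize(subject, misses) -> rewritten subject
Synthesize = Callable[[str, list[str]], str]


def optimize_once(subject: str, evaluate: Evaluate, synthesize: Synthesize) -> tuple[str, str]:
    """Run baseline -> synth -> re-run and return (surviving subject, decision)."""
    baseline, misses = evaluate(subject)
    candidate = synthesize(subject, misses)
    score, _ = evaluate(candidate)
    # The gate compares two measured numbers; nothing the model says about
    # its own rewrite enters the decision. A tie keeps the rewrite.
    if score >= baseline:
        return candidate, "keep"
    return subject, "revert"
```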

Adapters

  • Runner (runner:): claude is built in. Any other value is an executable invoked as <runner> <invoke> <model>, so you can drive aider, codex, or a custom agent.
  • Capture: the git-diff default resets a repo to its broken branch and snapshots what the agent changed, excluding dependency and cache trees so they never pollute or balloon the judge prompt.
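
The two adapters can be sketched as argument lists. The runner form follows the `<runner> <invoke> <model>` convention stated above for non-built-in runners; the capture command is an assumption that uses git's `:(exclude)` pathspec to leave out dependency and cache trees, and is not necessarily how proofbench builds its diff. Both function names and the excluded directory names are hypothetical.

```python
def runner_argv(runner: str, invoke: str, model: str) -> list[str]:
    """Command line for a custom runner: <runner> <invoke> <model>."""
    return [runner, invoke, model]


def diff_argv(excluded: list[str]) -> list[str]:
    """A git diff of the working tree that skips the given paths (assumed approach)."""
    # ":(exclude)path" is git pathspec magic; "." keeps everything else in scope.
    return ["git", "diff", "--", ".", *[f":(exclude){path}" for path in excluded]]
```

Either list can be handed to `subprocess.run(argv, cwd=repo_path, capture_output=True, text=True)`. Passing a list rather than a shell string keeps the invoke prompt as a single argument even though it contains spaces.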

Bring your own corpus

The bundled corpus is two MIT-licensed toy repos for the demo. Point corpus: at your own repos, each with a broken default branch and a fixed answer branch. See CORPUS.md for how to curate tasks that actually discriminate and do not leak their answers to the judge.

License

MIT.
