Config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora

proofbench

A config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora. You plant tasks you already have the answers to, run an agent skill across them, and grade what it changed against the known answer with an LLM judge. Then you let the harness rewrite the skill from its own misses and keep the rewrite only if the score holds.

It started as the harness behind crg-debug, a graph-driven debugging skill. The methodology generalizes to any skill whose output you can compare to a known answer.

Why it exists

A skill is only as trustworthy as the proof that it works. Anyone can write a prompt that sounds like a methodology. The honest way to know is to measure it against ground truth, on the model weak enough to embarrass it. Two ideas carry the whole design, both learned the hard way (see METHODOLOGY.md):

  1. Fail loud or do not measure. Every empty or non-numeric grade is a hard stop. An eval that fails open does not just miss data, it manufactures false confidence.
  2. The weak model is the signal, not the noise. A self-improving loop learns only from misses. Frontier models on easy tasks miss nothing, so they teach nothing. The weak leg is the curriculum.
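The first rule can be sketched in a few lines of Python. This is an illustration of the idea, not proofbench's own parser: the name `parse_grade`, the `GradeError` type, and the assumed 0 to 1 grade range are all hypothetical.

```python
import math


class GradeError(ValueError):
    """Raised when a judge reply cannot be read as a numeric grade."""


def parse_grade(raw: str, lo: float = 0.0, hi: float = 1.0) -> float:
    """Return the grade as a float, or raise. Never substitute a default."""
    text = raw.strip()
    if not text:
        raise GradeError("judge returned an empty grade")
    try:
        value = float(text)
    except ValueError as exc:
        raise GradeError(f"judge returned a non-numeric grade: {text!r}") from exc
    # float() accepts "nan" and "inf"; neither is a usable grade.
    # The [lo, hi] range is an assumption made for this sketch.
    if math.isnan(value) or not lo <= value <= hi:
        raise GradeError(f"grade {value!r} is outside [{lo}, {hi}]")
    return value
```

The point of the design is the absence of a fallback: there is no branch that turns a bad reply into 0.0 or skips the task, so a broken judge stops the run instead of quietly shaping the score.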

Install

uv tool install proofbench      # or: uvx proofbench

Requires the claude CLI on PATH for the default runner and for the judge.

Quickstart (demo)

git clone https://github.com/CodeBlackwell/proofbench && cd proofbench
bash examples/sample-corpus/build.sh        # builds two toy repos with buggy + fixed branches
uvx --from . proofbench run --demo          # eval an agent over them, graded vs the answers

How it works

One YAML config declares everything domain-specific; the engine is generic. The mode is implied by what the config contains:

  • a corpus + runner + judge gives you a bench (run)
  • adding a subject + synth unlocks the self-improving loop (optimize)

subject: ~/.claude/skills/crg-debug/SKILL.md   # optional; omit for pure-eval mode
runner: claude                                  # default adapter; any executable works
models: [opus, sonnet, haiku]                   # the driver sweep; the weak leg is the signal
judge_model: opus                               # held constant; never let a model grade itself
objective: macro_recall                         # the metric the keep/revert gate reads
judge: prompts/judge.md
synth: prompts/synth.md                          # present => `optimize` is available
corpus:
  - name: primes
    path: examples/sample-corpus/repos/primes
    invoke: "Find and fix the bug in this repository."
    default_branch: buggy
    answer_branch: fixed

proofbench run      --config proofbench.yaml    # eval + scoreboard
proofbench optimize --config proofbench.yaml    # baseline -> synth -> re-run -> keep|revert
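
The config names macro_recall as the objective but this page does not define it. The sketch below assumes the conventional reading of a macro average: compute recall per task, then take the unweighted mean across tasks. The `(hits, expected)` input shape and the task name `task_b` are hypothetical, chosen only for illustration.

```python
def macro_recall(per_task: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-task recall; each value is (hits, expected)."""
    if not per_task:
        raise ValueError("no graded tasks; refusing to report a score")
    recalls = []
    for name, (hits, expected) in per_task.items():
        if expected <= 0:
            raise ValueError(f"task {name!r} has no expected items")
        recalls.append(hits / expected)
    return sum(recalls) / len(recalls)


# One task fully solved, one task with 1 of 4 expected items found.
score = macro_recall({"primes": (1, 1), "task_b": (1, 4)})
print(score)  # 0.625
```

Under this reading every task counts equally, so a large task cannot drown out a small one. A pooled (micro) recall over the same numbers would be 2/5 = 0.4.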

The keep/revert gate

optimize runs a baseline on the weak leg, asks an LLM to rewrite the subject from the graded misses, re-runs, and keeps the rewrite only if the objective did not regress. The decision is the harness comparing two numbers, never the model's own claim of success.
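
A minimal sketch of that gate, assuming the subject is a single file and that an `evaluate` callable returns the objective for whatever the file currently contains. The function name and signature are hypothetical, and the synth step is left out: the candidate rewrite is simply passed in.

```python
from pathlib import Path
from typing import Callable


def keep_or_revert(subject: Path, candidate: str,
                   evaluate: Callable[[], float]) -> str:
    """Score the subject, swap in the candidate, re-score, then decide."""
    original = subject.read_text()
    baseline = evaluate()
    subject.write_text(candidate)
    rewritten = evaluate()
    # "Did not regress" is read here as: at least as good as the baseline.
    if rewritten >= baseline:
        return "keep"
    # Restore the original text so a losing rewrite leaves no trace.
    subject.write_text(original)
    return "revert"
```

Nothing the rewriting model says about its own output reaches the decision. The only inputs are two measured scores.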

Adapters

  • Runner (runner:): claude is built in. Any other value is an executable invoked as <runner> <invoke> <model>, so you can drive aider, codex, or a custom agent.
  • Capture: the git-diff default resets a repo to its default branch and snapshots what the agent changed, excluding dependency and cache trees so they never pollute or balloon the judge prompt.
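
As a sketch of the runner contract, a custom runner can be a small script that receives the invoke prompt and the model name as its two arguments and hands them to your agent. Everything about the agent here is a placeholder: `my-agent` and its `--model` and `--prompt` flags are hypothetical, not a real CLI. It also assumes the harness starts the runner with the task repo as the working directory, which this page does not state.

```python
import subprocess


def build_command(argv: list[str]) -> list[str]:
    """Map `<runner> <invoke> <model>` onto a hypothetical agent CLI."""
    if len(argv) != 3:
        raise ValueError("usage: <runner> <invoke> <model>")
    _, invoke, model = argv
    # `my-agent` and both flags are placeholders; substitute your own tool.
    return ["my-agent", "--model", model, "--prompt", invoke]


def main(argv: list[str]) -> int:
    """Run the agent and return its exit code so failures stay visible."""
    return subprocess.run(build_command(argv)).returncode


# As an executable script, finish the file with:
#     import sys; sys.exit(main(sys.argv))
```

Passing the exit code through keeps a crashed agent from looking like a run that changed nothing.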

Bring your own corpus

The bundled corpus is two MIT-licensed toy repos for the demo. Point corpus: at your own repos (each with a broken default branch and a fixed solution branch). See CORPUS.md for how to curate tasks that actually discriminate and do not leak their answers to the judge.
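
Before pointing the harness at a new repo, it can help to confirm that both branches named in the corpus entry exist. The helpers below are a hypothetical pre-flight check, not part of proofbench. They assume git is on PATH and that an entry is a dict with the `default_branch` and `answer_branch` keys from the config above.

```python
import subprocess
from pathlib import Path


def local_branches(repo: Path) -> set[str]:
    """List the short names of a repository's local branches via git."""
    out = subprocess.run(
        ["git", "-C", str(repo), "branch", "--format=%(refname:short)"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def missing_branches(entry: dict, branches: set[str]) -> list[str]:
    """Return the branches a corpus entry names that the repo lacks."""
    wanted = [entry["default_branch"], entry["answer_branch"]]
    return [name for name in wanted if name not in branches]


entry = {"name": "primes", "default_branch": "buggy", "answer_branch": "fixed"}
print(missing_branches(entry, {"buggy", "main"}))  # ['fixed']
```

In practice you would call `missing_branches(entry, local_branches(Path(entry["path"])))` and stop on any non-empty result, in keeping with the fail-loud rule.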

License

MIT.

Download files

Source Distribution

proofbench-0.2.1.tar.gz (49.9 kB)

Built Distribution

proofbench-0.2.1-py3-none-any.whl (12.8 kB)
