Config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora
proofbench
A config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora. You plant tasks you already have the answers to, run an agent skill across them, and grade what it changed against the known answer with an LLM judge. Then you let the harness rewrite the skill from its own misses and keep the rewrite only if the score holds.
It started as the harness behind crg-debug, a graph-driven debugging skill. The methodology generalizes to any skill whose output you can compare to a known answer.
Why it exists
A skill is only as trustworthy as the proof that it works. Anyone can write a prompt that sounds like a methodology. The honest way to know is to measure it against ground truth, on the model weak enough to embarrass it. Two ideas carry the whole design, both learned the hard way (see METHODOLOGY.md):
- Fail loud or do not measure. Every empty or non-numeric grade is a hard stop. An eval that fails open does not just miss data, it manufactures false confidence.
- The weak model is the signal, not the noise. A self-improving loop learns only from misses. Frontier models on easy tasks miss nothing, so they teach nothing. The weak leg is the curriculum.
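The first rule can be sketched in a few lines. This is an illustration only, not proofbench's actual code: the names GradeError and parse_grade are hypothetical, and rejecting NaN and infinity is an added assumption beyond the "empty or non-numeric" cases named above.

```python
import math


class GradeError(RuntimeError):
    """Raised when a judge reply cannot be read as a numeric grade (hypothetical name)."""


def parse_grade(raw: str) -> float:
    """Return the judge's grade as a float, or raise. Never substitute a default."""
    text = raw.strip()
    if not text:
        raise GradeError("judge returned an empty grade")
    try:
        value = float(text)
    except ValueError as exc:
        raise GradeError(f"judge returned a non-numeric grade: {text!r}") from exc
    # Assumption: NaN and infinity parse as floats but are treated as non-numeric here.
    if not math.isfinite(value):
        raise GradeError(f"judge returned a non-finite grade: {text!r}")
    return value
```

The point of raising instead of returning 0.0 is that a silent zero looks like a real miss and a silent skip looks like a real pass; either one corrupts the scoreboard without anyone noticing.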
Install
uv tool install proofbench # or: uvx proofbench
Requires the claude CLI on PATH for the default runner and for the judge.
Quickstart (demo)
git clone https://github.com/CodeBlackwell/proofbench && cd proofbench
bash examples/sample-corpus/build.sh # builds two toy repos with buggy + fixed branches
uvx --from . proofbench run --demo # eval an agent over them, graded vs the answers
How it works
One YAML config declares everything domain-specific; the engine is generic. The mode is implied by what the config contains:
- a corpus + runner + judge gives you a bench (run)
- adding a subject + synth unlocks the self-improving loop (optimize)
subject: ~/.claude/skills/crg-debug/SKILL.md # optional; omit for pure-eval mode
runner: claude # default adapter; any executable works
models: [opus, sonnet, haiku] # the driver sweep; the weak leg is the signal
judge_model: opus # held constant; never let a model grade itself
objective: macro_recall # the metric the keep/revert gate reads
judge: prompts/judge.md
synth: prompts/synth.md # present => `optimize` is available
corpus:
- name: primes
path: examples/sample-corpus/repos/primes
invoke: "Find and fix the bug in this repository."
default_branch: buggy
answer_branch: fixed
proofbench run --config proofbench.yaml # eval + scoreboard
proofbench optimize --config proofbench.yaml # baseline -> synth -> re-run -> keep|revert
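The "mode is implied by the config" rule might look roughly like the following. This is a sketch under assumptions, not the engine's real logic: available_commands is a hypothetical helper, the config is assumed to be already parsed into a dict, and an omitted runner is assumed to fall back to the built-in claude adapter.

```python
def available_commands(config: dict) -> list[str]:
    """Infer which commands a parsed config unlocks (hypothetical helper)."""
    # Assumption: runner may be omitted because claude is the default adapter,
    # so only corpus and judge are checked here.
    missing = [key for key in ("corpus", "judge") if not config.get(key)]
    if missing:
        raise ValueError(f"config is missing required keys: {', '.join(missing)}")

    commands = ["run"]
    # The self-improving loop needs both something to rewrite (subject)
    # and a prompt that does the rewriting (synth).
    if config.get("subject") and config.get("synth"):
        commands.append("optimize")
    return commands
```

A config with only corpus and judge yields ["run"]; adding both subject and synth yields ["run", "optimize"].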
The keep/revert gate
optimize runs a baseline on the weak leg, asks an LLM to rewrite the subject from the graded
misses, re-runs, and keeps the rewrite only if the objective did not regress. The decision is
the harness comparing two numbers, never the model's own claim of success.
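As a rough illustration of the gate, assuming macro_recall means the unweighted mean of per-task recall (the usual sense of macro averaging; the harness's own definition may differ). Both function names below are hypothetical.

```python
from statistics import fmean


def macro_recall(per_task_recall: dict[str, float]) -> float:
    """Unweighted mean of per-task recall (assumed definition)."""
    if not per_task_recall:
        # Fail loud: an empty scoreboard is not a score of zero.
        raise ValueError("no graded tasks; refusing to report a score")
    return fmean(per_task_recall.values())


def gate(baseline: float, candidate: float) -> str:
    """Keep the rewrite only if the objective did not regress."""
    return "keep" if candidate >= baseline else "revert"


baseline = macro_recall({"primes": 0.5, "other-task": 1.0})   # task names are illustrative
candidate = macro_recall({"primes": 1.0, "other-task": 1.0})
decision = gate(baseline, candidate)
```

Note that a tie counts as "keep" in this sketch, matching "did not regress"; nothing the rewriting model says about its own work enters the comparison.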
Adapters
- Runner (runner:): claude is built in. Any other value is an executable invoked as <runner> <invoke> <model>, so you can drive aider, codex, or a custom agent.
- Capture: the git-diff default resets a repo to its default branch and snapshots what the agent changed, excluding dependency and cache trees so they never pollute or balloon the judge prompt.
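A custom runner only has to accept the two positional arguments described above. The skeleton below is a sketch, not a shipped adapter: my_runner, parse_runner_args, and main are hypothetical names, and it assumes the harness starts the executable inside the task repository, which the text above does not state.

```python
import sys


def parse_runner_args(argv: list[str]) -> tuple[str, str]:
    """Split argv into (invoke, model), following <runner> <invoke> <model>."""
    if len(argv) != 3:
        raise SystemExit("usage: my_runner <invoke> <model>")
    _, invoke, model = argv
    return invoke, model


def main(argv: list[str]) -> int:
    invoke, model = parse_runner_args(argv)
    # A real runner would hand `invoke` to its agent here, using `model`,
    # and let the agent edit files in the working tree.
    print(f"[my_runner] model={model} prompt={invoke!r}", file=sys.stderr)
    return 0

# In an executable script, call main(sys.argv) under the usual __main__ guard.
```

Because the default capture works from a git diff, the runner does not need to report what it changed; it only needs to leave its edits in the working tree.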
Bring your own corpus
The bundled corpus is two MIT toy repos for the demo. Point corpus: at your own repos
(each with a broken default branch and a fixed solution branch). See CORPUS.md for
how to curate tasks that actually discriminate and do not leak their answers to the judge.
License
MIT.