Config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora

Maintainers

CodeBlackwell

Project description

proofbench

A config-driven eval harness that grades and self-improves headless agent skills against ground-truth corpora. You plant tasks you already have the answers to, run an agent skill across them, and grade what it changed against the known answer with an LLM judge. Then you let the harness rewrite the skill from its own misses and keep the rewrite only if the score holds.

It started as the harness behind crg-debug, a graph-driven debugging skill. The methodology generalizes to any skill whose output you can compare to a known answer.

Why it exists

A skill is only as trustworthy as the proof that it works. Anyone can write a prompt that sounds like a methodology. The honest way to know is to measure it against ground truth, on the model weak enough to embarrass it. Two ideas carry the whole design, both learned the hard way (see METHODOLOGY.md):

Fail loud or do not measure. Every empty or non-numeric grade is a hard stop. An eval that fails open does not just miss data; it manufactures false confidence.
The weak model is the signal, not the noise. A self-improving loop learns only from misses. Frontier models on easy tasks miss nothing, so they teach nothing. The weak leg is the curriculum.
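
The first idea can be sketched in a few lines of Python. This is a minimal illustration, not proofbench's actual code: `parse_grade` and `GradeError` are hypothetical names, and the assumption that a judge reply is a single number in [0, 1] is made only for the example.

```python
import math


class GradeError(RuntimeError):
    """Raised when a judge reply cannot be read as a grade (hypothetical name)."""


def parse_grade(raw: str) -> float:
    """Return the grade in a judge reply, or raise; never fall back to a default.

    Assumes, for this sketch only, that the reply is one number in [0, 1].
    """
    text = raw.strip()
    if not text:
        raise GradeError("empty grade from judge")
    try:
        value = float(text)
    except ValueError as exc:
        raise GradeError(f"non-numeric grade: {text!r}") from exc
    # float() accepts "nan" and "inf", so reject those explicitly as well.
    if math.isnan(value) or not 0.0 <= value <= 1.0:
        raise GradeError(f"grade out of range: {text!r}")
    return value
```

The point of the design is that there is no `except: return 0.0` branch: a grade that cannot be read stops the run instead of quietly entering the average.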

Install

uv tool install proofbench      # or: uvx proofbench

Requires the claude CLI on PATH for the default runner and for the judge.

Quickstart (demo)

git clone https://github.com/CodeBlackwell/proofbench && cd proofbench
bash examples/sample-corpus/build.sh        # builds two toy repos with buggy + fixed branches
uvx --from . proofbench run --demo          # eval an agent over them, graded vs the answers

How it works

One YAML config declares everything domain-specific; the engine is generic. The mode is implied by what the config contains:

a corpus + runner + judge gives you a bench (run)
adding a subject + synth unlocks the self-improving loop (optimize)

subject: ~/.claude/skills/crg-debug/SKILL.md   # optional; omit for pure-eval mode
runner: claude                                  # default adapter; any executable works
models: [opus, sonnet, haiku]                   # the driver sweep; the weak leg is the signal
judge_model: opus                               # held constant; never let a model grade itself
objective: macro_recall                         # the metric the keep/revert gate reads
judge: prompts/judge.md
synth: prompts/synth.md                          # present => `optimize` is available
corpus:
  - name: primes
    path: examples/sample-corpus/repos/primes
    invoke: "Find and fix the bug in this repository."
    default_branch: buggy
    answer_branch: fixed

proofbench run      --config proofbench.yaml    # eval + scoreboard
proofbench optimize --config proofbench.yaml    # baseline -> synth -> re-run -> keep|revert
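
How a mode could be implied by the config's contents is sketched below. This is an assumption-laden illustration rather than the engine's real logic: `available_modes` is a hypothetical helper, and it treats `runner` as required even though the real tool may apply a default.

```python
def available_modes(config: dict) -> list[str]:
    """Infer which commands a config supports from the keys it contains."""
    modes = []
    # A bench needs tasks, something to run them, and something to grade them.
    if all(config.get(key) for key in ("corpus", "runner", "judge")):
        modes.append("run")
    # The loop additionally needs a subject to rewrite and a synth prompt.
    if "run" in modes and config.get("subject") and config.get("synth"):
        modes.append("optimize")
    return modes
```

With the example config above loaded into a dict, this sketch would report both `run` and `optimize`; dropping `subject` or `synth` would leave only `run`.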

The keep/revert gate

optimize runs a baseline on the weak leg, asks an LLM to rewrite the subject from the graded misses, re-runs, and keeps the rewrite only if the objective did not regress. The decision is the harness comparing two numbers, never the model's own claim of success.
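
The gate can be pictured as the following sketch. It is not the harness's implementation: `keep_or_revert` and the `score` callable are hypothetical, and reading "did not regress" as "greater than or equal to the baseline" (so ties are kept) is an assumption.

```python
from pathlib import Path
from typing import Callable


def keep_or_revert(subject: Path, rewrite: str, score: Callable[[], float]) -> str:
    """Apply a rewrite to the subject file and keep it only if the score holds.

    `score` is a hypothetical callable that runs the eval and returns the objective.
    """
    original = subject.read_text()
    baseline = score()              # objective before the rewrite
    subject.write_text(rewrite)
    candidate = score()             # objective after the rewrite
    if candidate >= baseline:       # assumed reading of "did not regress"
        return "keep"
    subject.write_text(original)    # restore the previous subject
    return "revert"
```

Nothing in the decision consults the rewriting model: the only inputs are the two measured numbers.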

Adapters

Runner (runner:): claude is built in. Any other value is an executable invoked as <runner> <invoke> <model>, so you can drive aider, codex, or a custom agent.
Capture: the git-diff default resets a repo to its broken branch and snapshots what the agent changed, excluding dependency and cache trees so they never pollute or balloon the judge prompt.
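
Both adapters can be approximated in a short sketch. Everything here is illustrative: the function names and the `EXCLUDED_TREES` list are hypothetical, and the exact git commands proofbench issues are not documented in this README. Only the `<runner> <invoke> <model>` argument order comes from the text above.

```python
import subprocess

# Hypothetical exclusion list; the real set of ignored trees is not documented here.
EXCLUDED_TREES = ("node_modules", ".venv")


def runner_argv(runner: str, invoke: str, model: str) -> list[str]:
    """Build the command line for a custom runner: <runner> <invoke> <model>."""
    return [runner, invoke, model]


def diff_argv(excluded: tuple[str, ...] = EXCLUDED_TREES) -> list[str]:
    """Build a `git diff` command that skips the named top-level trees."""
    return ["git", "diff", "--", ".", *[f":(exclude){tree}" for tree in excluded]]


def capture(repo: str, broken_branch: str, agent_cmd: list[str]) -> str:
    """Reset the repo to its broken branch, run the agent, return what it changed."""
    subprocess.run(["git", "checkout", "--force", broken_branch], cwd=repo, check=True)
    subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)
    subprocess.run(agent_cmd, cwd=repo, check=True)
    # Note: plain `git diff` reports changes to tracked files only.
    done = subprocess.run(diff_argv(), cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout
```

For example, `capture("examples/sample-corpus/repos/primes", "buggy", runner_argv("aider", "Find and fix the bug in this repository.", "haiku"))` would return the diff text that gets handed to the judge, under the assumptions stated above.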

Bring your own corpus

The bundled corpus is two MIT toy repos for the demo. Point corpus: at your own repos (each with a broken default branch and a fixed answer branch). See CORPUS.md for how to curate tasks that actually discriminate and do not leak their answers to the judge.
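
Before pointing the harness at your own repos, a quick sanity check of each corpus entry can catch obvious mistakes. The sketch below is hypothetical: `check_corpus_entry` is not part of proofbench, and treating all five keys from the example config as required is an assumption.

```python
# Keys taken from the example config; whether each is mandatory is assumed.
REQUIRED_KEYS = ("name", "path", "invoke", "default_branch", "answer_branch")


def check_corpus_entry(entry: dict) -> list[str]:
    """Return a list of problems with one corpus entry; empty means it looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not entry.get(key)]
    # The broken branch and the answer branch must be different branches.
    if not problems and entry["default_branch"] == entry["answer_branch"]:
        problems.append("default_branch and answer_branch must differ")
    return problems
```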

License

MIT.

Release history

0.2.1 (Jun 29, 2026)
0.2.0 (Jun 29, 2026), this version
0.1.0 (Jun 29, 2026)

Download files

Source Distribution

proofbench-0.2.0.tar.gz (47.0 kB), uploaded Jun 29, 2026, Source

Built Distribution

proofbench-0.2.0-py3-none-any.whl (12.8 kB), uploaded Jun 29, 2026, Python 3

File details

Details for the file proofbench-0.2.0.tar.gz.

File metadata

Download URL: proofbench-0.2.0.tar.gz
Upload date: Jun 29, 2026
Size: 47.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for proofbench-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`4bd9cd67ec18c87543fb5305c796d3fa185c984c9de7aa7aa420a7161f0952ed`
MD5	`1c80de82934b3e44903dcc7d54cd0af8`
BLAKE2b-256	`16caf1e6623f96b3b03d475679967f862b99ad12a482b2221effd7421ae4e27f`

Provenance

The following attestation bundles were made for proofbench-0.2.0.tar.gz:

Publisher: release.yml on CodeBlackwell/proofbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: proofbench-0.2.0.tar.gz
- Subject digest: 4bd9cd67ec18c87543fb5305c796d3fa185c984c9de7aa7aa420a7161f0952ed
- Sigstore transparency entry: 2005255724
- Sigstore integration time: Jun 29, 2026
Source repository:
- Permalink: CodeBlackwell/proofbench@7f2d89e28b76c0d99678b96e3a9cad4ee077e2e0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/CodeBlackwell
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7f2d89e28b76c0d99678b96e3a9cad4ee077e2e0
- Trigger Event: push

File details

Details for the file proofbench-0.2.0-py3-none-any.whl.

File metadata

Download URL: proofbench-0.2.0-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 12.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for proofbench-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`69413738e972969352fda59fa6eaffd5bcbcf40f7a46c46419b1cba9876dff80`
MD5	`1352846c5384055744847190125415f9`
BLAKE2b-256	`0494c4f0104da97be7d3ecb80d49f0c74e154219993702e8ad020b0d3f7f1e7e`

Provenance

The following attestation bundles were made for proofbench-0.2.0-py3-none-any.whl:

Publisher: release.yml on CodeBlackwell/proofbench

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: proofbench-0.2.0-py3-none-any.whl
- Subject digest: 69413738e972969352fda59fa6eaffd5bcbcf40f7a46c46419b1cba9876dff80
- Sigstore transparency entry: 2005255824
- Sigstore integration time: Jun 29, 2026
Source repository:
- Permalink: CodeBlackwell/proofbench@7f2d89e28b76c0d99678b96e3a9cad4ee077e2e0
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/CodeBlackwell
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@7f2d89e28b76c0d99678b96e3a9cad4ee077e2e0
- Trigger Event: push
