An independent significance referee for LLM & agent evals — is your improvement real, or noise?
holdout
An independent significance referee for LLM & agent evals. Is your improvement real — or noise, multiple-comparisons inflation, or a model that quietly memorized your test set?
Most eval "wins" don't survive a paired significance test. holdout runs the three checks your
eval dashboard skips, in your code or in CI:
- Is it signal? A paired test (exact McNemar for pass/fail, paired permutation for graded scores) with a real confidence interval — not a naked delta.
- Or did you just try a lot of things? The significance bar rises with the number of variants you tried, because the best of 37 noisy attempts is expected to look like a win by chance alone.
- What would change the verdict? Power analysis: how many tasks you'd actually need.
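To make the first check concrete, here is a minimal sketch of a two-sided exact McNemar test. It uses only the discordant pairs: tasks the candidate fixed versus tasks it broke. The function name `mcnemar_exact` is hypothetical and is not the holdout API. The counts are the ones shown in the CLI example later in this README.

```python
from math import comb


def mcnemar_exact(fixed: int, broke: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    Under the null hypothesis each discordant task is equally likely to be
    a fix or a break, so the smaller count follows Binomial(n, 0.5).
    """
    n = fixed + broke
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(fixed, broke)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the one-sided tail, capped at 1


p = mcnemar_exact(fixed=15, broke=7)
print(f"p = {p:.3f}")  # p = 0.134
```

Tasks that both systems pass or both fail carry no information about which system is better, which is why they do not enter the calculation.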
The stats are open source (this repo). The hosted service (holdout.dev) adds the parts code can't promise: independence, a write-once holdout you can't re-tune against, a contamination scan, and a verifiable badge.
Install
```
pip install holdout-evals  # the import name is still `import holdout`
```
Quickstart — Python
```python
from holdout import compare

# Per-task scores for the SAME tasks, in the same order (0/1 for pass-fail, or floats).
res = compare(baseline_scores, candidate_scores, variants_tried=37)
print(res.report())
print(res.significant)      # False in this example; gate on this
print(res.p_value, res.ci)  # the honest numbers
```
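For graded (float) scores, this README names a paired permutation test. The sketch below is one common form of it, a Monte Carlo sign-flip test on the per-task differences. It is an illustration, not holdout's implementation: the function name `paired_permutation_p`, its defaults, and the sample scores are all assumptions.

```python
import random


def paired_permutation_p(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation p-value for the mean difference."""
    if len(baseline) != len(candidate):
        raise ValueError("scores must cover the same tasks in the same order")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, swapping the two systems on any task is equally
        # likely, which flips the sign of that task's difference.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Illustrative scores only; not taken from any real eval.
baseline_scores = [0.2, 0.5, 0.7, 0.4, 0.9, 0.3]
candidate_scores = [0.3, 0.5, 0.6, 0.6, 0.9, 0.4]
print(paired_permutation_p(baseline_scores, candidate_scores))
```

Fixing the seed makes the p-value reproducible across CI runs, at the cost of a small Monte Carlo error that shrinks as `n_resamples` grows.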
Quickstart — CLI (drop it in CI)
```
python examples/make_example.py   # writes a +4-point "win" that is actually noise
holdout check examples/v2.jsonl --baseline examples/v1.jsonl --variants 37
```

```
Holdout - significance check [FAIL]
baseline 73.0% -> candidate 77.0% (n = 200 tasks)
effect +4.0 pts 95% CI [-0.5, +8.5]
test mcnemar_exact p = 0.134
variants tried 37 -> adjusted p = 1.000 (any-false-win risk 85%)
paired counts +15 fixed / -7 broke (net +8)
VERDICT: WITHIN NOISE - not statistically significant.
-> Don't ship on this alone; the gain is indistinguishable from sampling noise.
You'd need ~967 tasks for an effect this size to be detectable.
```
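The "variants tried" line in the output above can be reproduced with two lines of arithmetic. The sketch assumes 37 independent tests at a 0.05 threshold for the risk figure, and a Bonferroni-style correction for the adjusted p-value. Both are assumptions: this README does not state which procedure the tool applies, and the function names are hypothetical.

```python
def any_false_win_risk(variants: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `variants` pure-noise tests clears alpha."""
    return 1.0 - (1.0 - alpha) ** variants


def bonferroni_adjust(p_value: float, variants: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * variants)


print(f"{any_false_win_risk(37):.0%}")         # 85%
print(f"{bonferroni_adjust(0.134, 37):.3f}")   # 1.000
```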
holdout check exits non-zero when the improvement isn't a significant gain — so it blocks a
"ship the noise" merge. As a GitHub Action:
```yaml
- run: holdout check evals/candidate.jsonl --baseline evals/baseline.jsonl --variants ${{ env.N_VARIANTS }}
```
Input is JSONL with one { "task_id": ..., "score": ... } object per line. The score field may also be named correct, pass, or reward; booleans and 0/1 become 0.0/1.0. Supply one file per system, joined on task_id, or a single --paired file with baseline and candidate columns.
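As a rough illustration of the two-file input, the sketch below parses JSONL text, coerces booleans and 0/1 to floats, and pairs scores by task_id. It is not the library's loader. The helper names, the key precedence in `SCORE_KEYS`, and the choice to drop tasks missing from either file are all assumptions.

```python
import json

# Accepted score field names; the precedence order here is an assumption.
SCORE_KEYS = ("score", "correct", "pass", "reward")


def parse_jsonl(text):
    """Map task_id -> float score from JSONL text."""
    scores = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        key = next(k for k in SCORE_KEYS if k in row)
        scores[row["task_id"]] = float(row[key])  # True/False become 1.0/0.0
    return scores


def join_on_task_id(baseline, candidate):
    """Return two score lists aligned on the task ids present in both inputs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return [baseline[t] for t in shared], [candidate[t] for t in shared]


baseline_text = '{"task_id": "t1", "score": 1}\n{"task_id": "t2", "correct": false}\n'
candidate_text = '{"task_id": "t2", "pass": true}\n{"task_id": "t1", "reward": 0.5}\n'
b, c = join_on_task_id(parse_jsonl(baseline_text), parse_jsonl(candidate_text))
print(b, c)  # [1.0, 0.0] [0.5, 1.0]
```

Joining on task_id rather than on line order matters because a paired test is only valid when each pair of scores comes from the same task.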
How many tasks do I need?
```
holdout power --baseline-acc 0.75 --effect 0.03 --variants 37
```
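For intuition about what such a calculation involves, here is a textbook normal-approximation sample-size formula for comparing two proportions, with the significance threshold divided by the number of variants. This is a sketch under stated assumptions, not the formula `holdout power` uses, so its numbers may differ from the tool's. It treats the two systems as independent samples, which is typically conservative for a paired design where per-task outcomes are positively correlated. The function name and the 80% power default are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def tasks_needed(baseline_acc, effect, variants=1, alpha=0.05, power=0.8):
    """Approximate tasks needed to detect `effect` over `baseline_acc`.

    Two-sided test, Bonferroni-adjusted alpha, independent-samples
    variance (an assumption; a paired analysis usually needs fewer tasks).
    """
    p1, p2 = baseline_acc, baseline_acc + effect
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / variants) / 2)
    z_power = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


print(tasks_needed(0.75, 0.03))               # one variant
print(tasks_needed(0.75, 0.03, variants=37))  # stricter bar, more tasks
```

The required sample size grows with the inverse square of the effect, so halving the effect you want to detect roughly quadruples the number of tasks.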
Why not just compute it yourself?
You can — that's why the math is free. The point of the hosted service is the four things a local script can't credibly promise: an independent verdict (we didn't build the agent), a write-once holdout scored exactly once per config (no quiet re-tuning), a variants bar that spans your whole team's submissions, and a verifiable badge.
Reading
The methodology follows the published literature on eval rigor — paired tests (Dietterich 1998), multiple-comparisons control (Benjamini–Hochberg 1995), benchmark contamination (Zhang et al. 2024, GSM1k), and power for evals (Miller 2024, Adding Error Bars to Evals).
MIT licensed. Contributions and corrections welcome — that's the whole point.
File details
Details for the file holdout_evals-0.1.0.tar.gz.
File metadata
- Download URL: holdout_evals-0.1.0.tar.gz
- Upload date:
- Size: 14.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e45854c09fa1bed156261efe8827a3e2db5388c5ec3f0699e7afea9f6d8f3642 |
| MD5 | 5cb9e7ab59b1761c8230dabb511ffc94 |
| BLAKE2b-256 | 1e8f5c48ae194c36c1f7d350d03244b04fdbca9f4a970960b083bccd49429ace |
File details
Details for the file holdout_evals-0.1.0-py3-none-any.whl.
File metadata
- Download URL: holdout_evals-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2daa8bb5eefe751cf5a550cd03470e5e77db86cfcd7611718dbac006f1d250e0 |
| MD5 | 3fc9b4ce3aa5b34bda7dd590d16ec78a |
| BLAKE2b-256 | eacac78aeecbb8181a933203ffca9f065c6656ab4e492d5594f33d6c1c41ffab |