Turn LLM eval runs into auditable evidence packs with defensible statistics (optional SR 26-2 / EU AI Act mappings).
evidentry
Turn LLM eval runs into auditable evidence packs with defensible statistics.
Most eval results are reported as a bare pass rate, regenerated ad hoc, and pasted into a doc. That's fine for iteration; it's not fine the moment someone — a reviewer, a customer's risk team, an auditor, or you in six months — asks "how do you know, and can you show your work?"
Eval frameworks measure your model. evidentry packages the measurement as evidence:
evals in ──► evidentry ──► versioned evidence pack out
├─ report.md (validation report, with stats and gap table)
├─ results.json (every item, every run)
└─ manifest.json (SHA-256 of config, datasets, artifacts)
- Define a model card and acceptance thresholds in YAML.
- Run evals against Anthropic / OpenAI-compatible APIs, or ingest pre-computed outputs from your own harness as a one-line-per-item JSONL (`provider: external`). evidentry is an evidence layer, not another eval framework. (There are no format adapters for DeepEval/Inspect/promptfoo yet: you export the raw outputs and evidentry re-scores them; it cannot consume those tools' own scores.)
- Emit a versioned evidence pack: pass rates with Wilson 95% confidence intervals, interval-aware verdicts, sample-size certificates for unsettled verdicts, exact run-over-run drift tests with multiplicity control, and an optional requirement-coverage table mapped to governance frameworks, including an explicit list of what is not evidenced.
- Verify any pack later: `evidentry verify` recomputes every hash, catching accidental modification or corruption. (Integrity, not provenance: packs are unsigned, so this does not stop a determined forger; see roadmap.)
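As a rough illustration of what hash-based verification involves, the sketch below recomputes SHA-256 digests and compares them with a manifest. It is not evidentry's implementation: the manifest layout (`{"files": {path: digest}}`) and the names `sha256_of` and `verify_pack` are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large result files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pack(pack_dir: Path) -> list[str]:
    """Return the relative paths that are missing or whose hash differs from the manifest."""
    # Hypothetical manifest layout: {"files": {"report.md": "<hex digest>", ...}}
    manifest = json.loads((pack_dir / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for rel_path, expected in manifest["files"].items():
        target = pack_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty return value means every listed file still matches its recorded digest. Because anyone who can edit a file can also edit an unsigned manifest, a check like this detects accidents and corruption only, which is the integrity-versus-provenance distinction drawn above.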
The statistics are the point
A suite is marked PASS only when the lower Wilson confidence bound clears its threshold. A point estimate that clears it on a small sample gets PASS (point) — the report tells you when your evidence is too thin to be settled, which is exactly what a reviewer needs to know. Unsettled verdicts come with a sample-size certificate: how many more items it would take to settle them, by exact binomial power at the observed rate.
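The interval-aware verdict can be sketched in a few lines. This is a minimal illustration using the standard Wilson score interval, not evidentry's code: the function names, the `FAIL` label, and the choice of `>=` for "clears the threshold" are assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def verdict(passes: int, n: int, threshold: float) -> str:
    """Classify a suite by comparing the lower bound, then the point estimate."""
    lower, _ = wilson_interval(passes, n)
    if lower >= threshold:          # assumption: "clears" means >=
        return "PASS"
    if passes / n >= threshold:     # point estimate clears, lower bound does not
        return "PASS (point)"
    return "FAIL"                   # assumed label for the remaining case
```

For example, 8 passes out of 8 against a 0.90 threshold has a Wilson lower bound of about 0.68, so `verdict(8, 8, 0.90)` returns `"PASS (point)"`, while `verdict(200, 200, 0.90)` returns `"PASS"` because the lower bound is about 0.98.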
Run-over-run changes get Fisher's exact test instead of an eyeball comparison — exact at any sample size, including the 4-to-8-item suites real configs have, where the textbook z-test flags spurious drift. When several suites are monitored at once, p-values are Holm-adjusted so the family-wise false-drift rate stays at α. And a drift row is only computed when the runs are actually comparable (same dataset bytes, metric, and runs-per-item) — otherwise it is flagged NOT COMPARABLE rather than dressed up as a p-value.
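To make the drift test concrete, here is a standard-library sketch of a two-sided Fisher's exact test on pass counts from two runs, followed by the Holm step-down adjustment. The function names are hypothetical, and the comparability checks (dataset bytes, metric, runs-per-item) are assumed to have happened before these functions are called.

```python
from math import comb


def fisher_exact_two_sided(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided Fisher's exact p-value for pass counts from two runs."""
    total, total_pass = n_a + n_b, pass_a + pass_b
    denom = comb(total, n_a)

    def prob(k: int) -> float:
        # Hypergeometric probability that run A holds k of the total passes.
        return comb(total_pass, k) * comb(total - total_pass, n_a - k) / denom

    lo = max(0, n_a - (total - total_pass))
    hi = min(n_a, total_pass)
    cutoff = prob(pass_a) * (1 + 1e-9)  # small tolerance for floating-point ties
    probs = [prob(k) for k in range(lo, hi + 1)]
    # Sum every table that is no more probable than the observed one.
    return min(1.0, sum(p for p in probs if p <= cutoff))


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted
```

With a baseline of 8/8 and a new run of 4/8, `fisher_exact_two_sided(8, 8, 4, 8)` gives 1/13 (about 0.077), which is not significant at α = 0.05 despite the large-looking drop. That is the kind of small-suite case the paragraph above has in mind.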
Items that share a source (several questions about the same document, scenarios from the same template) are not independent evidence. Give them a `cluster` field and the interval and verdict use a cluster-adjusted effective sample size (one-way cluster-robust variance → design effect, with a t critical value carrying G−1 degrees of freedom because the variance is estimated from G clusters), so correlation widens your intervals instead of silently flattering them. Known limit, in the open: with very few clusters and high intra-cluster correlation, even the adjusted interval under-covers somewhat. The fix is more clusters, not more items per cluster.

`runs: N` repeats each item; an item passes only if every run passes, so output instability shows up as failure instead of luck.
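The cluster adjustment can be pictured with a small sketch of the design-effect step. It assumes one common form of the one-way cluster-robust variance (with a G/(G−1) correction) and floors the design effect at 1; both choices, and the function name, are assumptions rather than evidentry's documented behaviour. The t critical value is left out to keep the example within the standard library.

```python
from collections import defaultdict


def design_effect(items: list[tuple[str, bool]]) -> tuple[float, float]:
    """Return (design effect, effective sample size) for (cluster, passed) items."""
    n = len(items)
    if n == 0:
        raise ValueError("no items")
    p_hat = sum(passed for _, passed in items) / n

    # Sum the residuals (y - p_hat) within each cluster.
    residual_sums: dict[str, float] = defaultdict(float)
    for cluster, passed in items:
        residual_sums[cluster] += float(passed) - p_hat

    g = len(residual_sums)
    if g < 2:
        raise ValueError("need at least two clusters to estimate between-cluster variance")

    # Cluster-robust variance of the pass rate versus the naive binomial variance.
    robust_var = (g / (g - 1)) * sum(r * r for r in residual_sums.values()) / (n * n)
    naive_var = p_hat * (1 - p_hat) / n
    if naive_var == 0.0:
        return 1.0, float(n)  # all pass or all fail: no variance to inflate

    deff = max(1.0, robust_var / naive_var)  # assumption: never narrow the interval
    return deff, n / deff
```

With four clusters of two perfectly correlated items each (three clusters passing, one failing), the design effect is 8/3 and the effective sample size drops from 8 to 3, which is why adding clusters helps more than adding items per cluster.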
What the statistics honestly mean, and don't: the intervals quantify sampling uncertainty on your dataset. They do not certify field performance on a different input distribution, and `evidentry verify` proves integrity (the bytes haven't changed since the pack was built), not provenance (packs are not yet signed; see roadmap).
Quickstart
pip install evidentry
evidentry init my-model # scaffold config + sample dataset
cd my-model
evidentry run # works out of the box with the mock provider
Or run the worked example — a fictional bank validating a credit-memo summarizer (no API key needed; the mock provider makes it fully deterministic):
cd examples/credit_memo_summarizer
evidentry run -c evidentry.yaml
evidentry verify evidence/credit-memo-summarizer-v1.2.0-*
The committed sample output is in examples/credit_memo_summarizer/evidence/ — including a failing numeric-extraction suite and a use-limit violation, because an evidence tool you only see passing is a demo, not evidence.
What a pack asserts
| Question a reviewer asks | Artifact |
|---|---|
| What is this system, who owns it, why this risk tier? | model card + tier rationale |
| How was it tested, against what thresholds? | outcomes analysis with 95% CIs |
| Is the sample large enough to settle the verdict? | PASS vs PASS (point) distinction + sample-size certificate |
| Has it changed since last validation? | Fisher's exact test vs. baseline pack, Holm-adjusted, with dataset-parity checks |
| Are these the same results that were produced then? | manifest pins SHA-256 of config + datasets + results |
| What isn't covered? | requirement gap table (NOT EVIDENCED rows) |
Metrics
`exact_match`, `contains` (all substrings), `regex`, `numeric` (tolerance-based, for extracted figures), `refusal` (use-limit controls: "the summarizer must decline to give investment advice" is a control that needs evidence like everything else). These cover deterministic, checkable properties; graded qualities (faithfulness, tone) need LLM-as-judge metrics, which are deliberately not in v0.1. Judge reliability is its own evidence problem, and we'd rather ship it with disagreement statistics than pretend a single judge is ground truth.
Two metrics deserve their caveats in the open. `numeric` refuses to guess: if an output contains more than one number, the item fails with an explanation unless you set an explicit extraction rule (`first` / `last` / `any`), because a silent wrong guess feeding a confidence interval is exactly the failure an evidence tool exists to prevent. `refusal` is a transparent lexical heuristic (the patterns are ~10 lines of `metrics.py`); it distinguishes "I can't make credit decisions" from "I can't believe this stock", but it is not a semantic classifier, so audit the item-level details when a use-limit verdict matters.
Framework mappings — read this before using them
Packs can include requirement-coverage tables for SR 26-2 (US interagency model risk guidance, April 2026) and EU AI Act Annex IV. Three facts you should know, from the primary sources:
- SR 26-2 explicitly excludes generative and agentic AI from its scope (footnote 3 of the guidance) and is non-binding ("non-compliance with this guidance will not result in supervisory criticism"). Its principles apply directly to traditional statistical models and non-generative AI. Using its structure for an LLM system — as the worked example does — is an analogy: organizing evidence around principles a bank already applies elsewhere, ahead of the GenAI-specific guidance the agencies have signaled. The mapping file says this in its header, with quotes.
- EU AI Act high-risk documentation obligations are not yet in force (expected Dec 2027 / Aug 2028 for most systems, post-Omnibus).
- The mappings are interpretations that structure evidence. They are not the guidance text, not legal advice, and not a substitute for independent validation. Requirements that need human judgment (conceptual soundness review, effective challenge, governance) are deliberately surfaced as gaps rather than papered over.
Scope, honestly
v0.1 covers outcomes-analysis-style evidence for text-in/text-out systems with checkable expected outputs. Not yet covered: LLM-as-judge metrics, fairness testing, multi-turn agent traces, tool-call audit, pack signing.
Roadmap
- Pack signing + trusted timestamps, so integrity holds against tampering, not just accidents
- LLM-as-judge metrics with judge-disagreement reporting (a judge that always agrees with itself is not the same as a judge that's right)
- Format adapters for DeepEval / Inspect / promptfoo output files, so `external` ingestion doesn't require hand-exported JSONL
- Agent-trace evidence (multi-turn, tool calls)
- CI integration: fail the build when a high-tier model's evidence pack regresses
- More mappings (NIST AI RMF, ISO/IEC 42001)
Authors
Built by Alejandro Lizardi and John Dryden as part of Periapsis, working on statistically honest evaluation evidence for AI systems.
MIT licensed. Issues and war stories from your own eval reviews are very welcome.