Turn LLM eval runs into auditable evidence packs with defensible statistics (optional SR 26-2 / EU AI Act mappings).

evidentry

License: MIT · Python 3.10+

Turn LLM eval runs into auditable evidence packs with defensible statistics.

Most eval results are reported as a bare pass rate, regenerated ad hoc, and pasted into a doc. That's fine for iteration; it's not fine the moment someone — a reviewer, a customer's risk team, an auditor, or you in six months — asks "how do you know, and can you show your work?"

Eval frameworks measure your model. evidentry packages the measurement as evidence:

evals in  ──►  evidentry  ──►  versioned evidence pack out
                               ├─ report.md      (validation report, with stats and gap table)
                               ├─ results.json   (every item, every run)
                               └─ manifest.json  (SHA-256 of config, datasets, artifacts)
  • Define a model card and acceptance thresholds in YAML.
  • Run evals against Anthropic / OpenAI-compatible APIs, or ingest pre-computed outputs from your own harness as JSONL with one item per line (provider: external). evidentry is an evidence layer, not another eval framework. (There are no format adapters for DeepEval/Inspect/promptfoo yet: you export the raw outputs and evidentry re-scores them; it cannot consume those tools' own scores.)
  • Emit a versioned evidence pack: pass rates with Wilson 95% confidence intervals, interval-aware verdicts, sample-size certificates for unsettled verdicts, exact run-over-run drift tests with multiplicity control, and an optional requirement-coverage table mapped to governance frameworks — including an explicit list of what is not evidenced.
  • Verify any pack later: evidentry verify recomputes every hash, catching accidental modification or corruption. (Integrity, not provenance: packs are unsigned, so this does not stop a determined forger — see roadmap.)
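The hash check in the last bullet can be sketched in a few lines. This is an illustration only, not evidentry's implementation: the function names are hypothetical, and the manifest layout (a "files" object mapping relative paths to hex digests) is an assumption, since the real manifest.json schema may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pack_dir: Path) -> list[str]:
    """Return relative paths whose current hash differs from the manifest.

    ASSUMPTION: manifest.json looks like
    {"files": {"results.json": "<hex digest>", ...}}.
    """
    manifest = json.loads((pack_dir / "manifest.json").read_text(encoding="utf-8"))
    mismatches = []
    for rel_path, expected in sorted(manifest["files"].items()):
        target = pack_dir / rel_path
        # A missing file counts as a mismatch, same as changed bytes.
        if not target.is_file() or sha256_file(target) != expected:
            mismatches.append(rel_path)
    return mismatches
```

An empty list means every listed file still matches. As the bullet says, this detects changed bytes only; anyone who can rewrite the manifest can rewrite the hashes too.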

The statistics are the point

A suite is marked PASS only when the lower Wilson confidence bound clears its threshold. When only the point estimate clears it, as happens on small samples, the suite gets PASS (point): the report tells you when your evidence is too thin to be settled, which is exactly what a reviewer needs to know. Unsettled verdicts come with a sample-size certificate: how many more items it would take to settle them, by exact binomial power at the observed rate.

Run-over-run changes get Fisher's exact test instead of an eyeball comparison — exact at any sample size, including the 4-to-8-item suites real configs have, where the textbook z-test flags spurious drift. When several suites are monitored at once, p-values are Holm-adjusted so the family-wise false-drift rate stays at α. And a drift row is only computed when the runs are actually comparable (same dataset bytes, metric, and runs-per-item) — otherwise it is flagged NOT COMPARABLE rather than dressed up as a p-value.
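A small worked example shows why the exact test matters at these sample sizes. The sketch below uses only the standard library; the function names are hypothetical and this is not evidentry's code. It computes a two-sided Fisher p-value (summing every table no more likely than the observed one), a pooled two-proportion z-test for comparison, and Holm-adjusted p-values.

```python
import math


def fisher_exact_two_sided(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for pass counts in two runs."""
    total_pass = pass_a + pass_b
    lo = max(0, total_pass - n_b)
    hi = min(n_a, total_pass)

    def weight(k: int) -> int:
        # Integer numerator of the hypergeometric pmf, so comparisons are exact.
        return math.comb(n_a, k) * math.comb(n_b, total_pass - k)

    observed = weight(pass_a)
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / math.comb(n_a + n_b, total_pass)


def pooled_z_test_two_sided(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Textbook large-sample z-test, shown only for comparison."""
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(pass_a / n_a - pass_b / n_b) / se
    return math.erfc(z / math.sqrt(2))


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted
```

With a baseline of 8/8 passes and a current run of 4/8, the z-test reports p of roughly 0.021 and would flag drift at α = 0.05, while Fisher's exact test gives p = 1/13, about 0.077, and does not.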

Items that share a source (several questions about the same document, scenarios from the same template) are not independent evidence. Give them a cluster field and the interval and verdict use a cluster-adjusted effective sample size (one-way cluster-robust variance → design effect, with a t critical value carrying G−1 degrees of freedom because the variance is estimated from G clusters), so correlation widens your intervals instead of silently flattering them. Known limit, in the open: with very few clusters and high intra-cluster correlation even the adjusted interval under-covers somewhat. The fix is more clusters, not more items per cluster.

runs: N repeats each item; an item passes only if every run passes, so output instability shows up as failure instead of luck.
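To make the cluster adjustment concrete, here is a minimal sketch of a design effect computed from a one-way cluster-robust variance. It is one common estimator, not necessarily the one evidentry uses. The function name, the floor of the design effect at 1, and the fallback when the variance is zero are all assumptions. The t critical value with G−1 degrees of freedom is omitted because the standard library has no t quantile (scipy.stats.t.ppf would supply it).

```python
from collections import defaultdict


def cluster_effective_n(passed: list[bool], clusters: list[str]) -> tuple[float, float]:
    """Return (design_effect, effective_n) for clustered pass/fail items."""
    n = len(passed)
    if n == 0 or n != len(clusters):
        raise ValueError("need equally long, non-empty inputs")
    p_hat = sum(passed) / n

    # Sum of residuals (pass minus overall rate) within each cluster.
    resid_by_cluster: dict[str, float] = defaultdict(float)
    for ok, cluster in zip(passed, clusters):
        resid_by_cluster[cluster] += ok - p_hat

    g = len(resid_by_cluster)
    naive_var = p_hat * (1 - p_hat) / n  # variance if items were independent
    if g < 2 or naive_var == 0:
        # ASSUMPTION: with one cluster or no observed variation, fall back to
        # no adjustment rather than dividing by zero.
        return 1.0, float(n)

    # Cluster-robust variance of the mean with a G/(G-1) small-sample factor.
    robust_var = (g / (g - 1)) * sum(r * r for r in resid_by_cluster.values()) / (n * n)
    deff = max(1.0, robust_var / naive_var)  # ASSUMPTION: never narrow the interval
    return deff, n / deff
```

For eight items in four two-item clusters where one whole cluster fails, the design effect is 8/3 and the effective sample size drops from 8 to 3, which is what widens the interval.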

What the statistics honestly mean — and don't: the intervals quantify sampling uncertainty on your dataset. They do not certify field performance on a different input distribution, and verify proves integrity (the bytes haven't changed since the pack was built), not provenance (packs are not yet signed — see roadmap).

Quickstart

pip install evidentry
evidentry init my-model   # scaffold config + sample dataset
cd my-model
evidentry run             # works out of the box with the mock provider

Or run the worked example — a fictional bank validating a credit-memo summarizer (no API key needed; the mock provider makes it fully deterministic):

cd examples/credit_memo_summarizer
evidentry run -c evidentry.yaml
evidentry verify evidence/credit-memo-summarizer-v1.2.0-*

The committed sample output is in examples/credit_memo_summarizer/evidence/ — including a failing numeric-extraction suite and a use-limit violation, because an evidence tool you only see passing is a demo, not evidence.

What a pack asserts

| Question a reviewer asks | Artifact |
| --- | --- |
| What is this system, who owns it, why this risk tier? | model card + tier rationale |
| How was it tested, against what thresholds? | outcomes analysis with 95% CIs |
| Is the sample large enough to settle the verdict? | PASS vs PASS (point) distinction + sample-size certificate |
| Has it changed since last validation? | Fisher's exact test vs. baseline pack, Holm-adjusted, with dataset-parity checks |
| Are these the same results that were produced then? | manifest pins SHA-256 of config + datasets + results |
| What isn't covered? | requirement gap table (NOT EVIDENCED rows) |

Metrics

exact_match, contains (all substrings), regex, numeric (tolerance-based, for extracted figures), refusal (use-limit controls: "the summarizer must decline to give investment advice" is a control that needs evidence like everything else). These cover deterministic, checkable properties; graded qualities (faithfulness, tone) need LLM-as-judge metrics, which are deliberately not in v0.1: judge reliability is its own evidence problem, and we'd rather ship it with disagreement statistics than pretend a single judge is ground truth.
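The deterministic metrics are simple enough to sketch. The functions below illustrate the idea and are not evidentry's metrics.py: the names, signatures, and whitespace handling are assumptions. The last helper applies the rule that an item passes only if every repeated run passes.

```python
import re


def exact_match(output: str, expected: str) -> bool:
    # ASSUMPTION: surrounding whitespace is ignored.
    return output.strip() == expected.strip()


def contains_all(output: str, substrings: list[str]) -> bool:
    """Pass only if every required substring is present."""
    return all(s in output for s in substrings)


def regex_match(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None


def item_passes(run_results: list[bool]) -> bool:
    """An item passes only if every repeated run passed."""
    if not run_results:
        raise ValueError("at least one run is required")
    return all(run_results)
```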

Two metrics deserve their caveats in the open. numeric refuses to guess: if an output contains more than one number, the item fails with an explanation unless you set an explicit extraction rule (first / last / any) — a silent wrong guess feeding a confidence interval is exactly the failure an evidence tool exists to prevent. refusal is a transparent lexical heuristic (the patterns are ~10 lines of metrics.py); it distinguishes "I can't make credit decisions" from "I can't believe this stock", but it is not a semantic classifier — audit the item-level details when a use-limit verdict matters.

Framework mappings — read this before using them

Packs can include requirement-coverage tables for SR 26-2 (US interagency model risk guidance, April 2026) and EU AI Act Annex IV. Three facts you should know, from the primary sources:

  • SR 26-2 explicitly excludes generative and agentic AI from its scope (footnote 3 of the guidance) and is non-binding ("non-compliance with this guidance will not result in supervisory criticism"). Its principles apply directly to traditional statistical models and non-generative AI. Using its structure for an LLM system — as the worked example does — is an analogy: organizing evidence around principles a bank already applies elsewhere, ahead of the GenAI-specific guidance the agencies have signaled. The mapping file says this in its header, with quotes.
  • EU AI Act high-risk documentation obligations are not yet in force (expected Dec 2027 / Aug 2028 for most systems, post-Omnibus).
  • The mappings are interpretations that structure evidence. They are not the guidance text, not legal advice, and not a substitute for independent validation. Requirements that need human judgment (conceptual soundness review, effective challenge, governance) are deliberately surfaced as gaps rather than papered over.

Scope, honestly

v0.1 covers outcomes-analysis-style evidence for text-in/text-out systems with checkable expected outputs. Not yet covered: LLM-as-judge metrics, fairness testing, multi-turn agent traces, tool-call audit, pack signing.

Roadmap

  • Pack signing + trusted timestamps, so integrity holds against tampering, not just accidents
  • LLM-as-judge metrics with judge-disagreement reporting (a judge that always agrees with itself is not the same as a judge that's right)
  • Format adapters for DeepEval / Inspect / promptfoo output files, so external ingestion doesn't require hand-exported JSONL
  • Agent-trace evidence (multi-turn, tool calls)
  • CI integration: fail the build when a high-tier model's evidence pack regresses
  • More mappings (NIST AI RMF, ISO/IEC 42001)

Authors

Built by Alejandro Lizardi and John Dryden as part of Periapsis, working on statistically honest evaluation evidence for AI systems.

MIT licensed. Issues and war stories from your own eval reviews are very welcome.
