Turn LLM eval runs into auditable evidence packs with defensible statistics (optional SR 26-2 / EU AI Act mappings).

Maintainers

Alejlizardi05

Project description

evidentry

Python 3.10+

Turn LLM eval runs into auditable evidence packs with defensible statistics.

Most eval results are reported as a bare pass rate, regenerated ad hoc, and pasted into a doc. That's fine for iteration; it's not fine the moment someone — a reviewer, a customer's risk team, an auditor, or you in six months — asks "how do you know, and can you show your work?"

Eval frameworks measure your model. evidentry packages the measurement as evidence:

evals in  ──►  evidentry  ──►  versioned evidence pack out
                               ├─ report.md      (validation report, with stats and gap table)
                               ├─ results.json   (every item, every run)
                               └─ manifest.json  (SHA-256 of config, datasets, artifacts)

- Define a model card and acceptance thresholds in YAML.
- Run evals against Anthropic / OpenAI-compatible APIs, or ingest pre-computed outputs from your own harness as a one-line-per-item JSONL (provider: external). evidentry is an evidence layer, not another eval framework. (There are no format adapters for DeepEval/Inspect/promptfoo yet: you export the raw outputs and evidentry re-scores them; it cannot consume those tools' own scores.)
- Emit a versioned evidence pack: pass rates with Wilson 95% confidence intervals, interval-aware verdicts, sample-size certificates for unsettled verdicts, exact run-over-run drift tests with multiplicity control, and an optional requirement-coverage table mapped to governance frameworks, including an explicit list of what is not evidenced.
- Verify any pack later: evidentry verify recomputes every hash, catching accidental modification or corruption. (Integrity, not provenance: packs are unsigned, so this does not stop a determined forger; see roadmap.)
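
The verify step amounts to recomputing digests and comparing them with what the manifest recorded. The sketch below illustrates that idea only. It assumes a hypothetical manifest layout (a JSON object whose `files` key maps relative paths to hex SHA-256 digests), and `sha256_of` and `verify_pack` are illustrative names; evidentry's actual manifest schema and internals may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large artifacts stay cheap."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pack(pack_dir: Path) -> list[str]:
    """Return the relative paths whose bytes no longer match the manifest.

    Assumes a hypothetical manifest.json shaped like
    {"files": {"results.json": "<hex sha256>", ...}}.
    """
    manifest = json.loads((pack_dir / "manifest.json").read_text())
    mismatched = []
    for rel_path, expected in manifest["files"].items():
        target = pack_dir / rel_path
        # A missing file counts as a mismatch, same as changed bytes.
        if not target.is_file() or sha256_of(target) != expected:
            mismatched.append(rel_path)
    return mismatched
```

An empty list means every recorded file still hashes to its recorded digest. As the text says, this detects accidental change; anyone who can rewrite the files can also rewrite an unsigned manifest.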

The statistics are the point

A suite is marked PASS only when the lower Wilson confidence bound clears its threshold. When only the point estimate clears it, which is typical of small samples, the suite gets PASS (point): the report tells you when your evidence is too thin to be settled, which is exactly what a reviewer needs to know. Unsettled verdicts come with a sample-size certificate: how many more items it would take to settle them, by exact binomial power at the observed rate.

Run-over-run changes get Fisher's exact test instead of an eyeball comparison — exact at any sample size, including the 4-to-8-item suites real configs have, where the textbook z-test flags spurious drift. When several suites are monitored at once, p-values are Holm-adjusted so the family-wise false-drift rate stays at α. And a drift row is only computed when the runs are actually comparable (same dataset bytes, metric, and runs-per-item) — otherwise it is flagged NOT COMPARABLE rather than dressed up as a p-value.
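
The drift test can be sketched with the standard library alone. The code below is an illustration, not evidentry's implementation: `fisher_exact_two_sided` uses the common "sum every table no more probable than the observed one" definition of a two-sided p-value, `holm_adjust` is the usual step-down Holm procedure, and both function names are hypothetical. The comparability check on dataset bytes, metric, and runs-per-item is not shown.

```python
import math


def fisher_exact_two_sided(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided Fisher exact p-value for baseline (a) vs current (b) pass counts."""
    total_pass = pass_a + pass_b
    n = n_a + n_b

    def prob(x: int) -> float:
        # Hypergeometric probability of x passes in run a, with margins fixed.
        return math.comb(n_a, x) * math.comb(n_b, total_pass - x) / math.comb(n, total_pass)

    lo = max(0, total_pass - n_b)
    hi = min(n_a, total_pass)
    observed = prob(pass_a)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Small relative slack so ties with the observed table survive rounding.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted
```

For an 8-item suite that went from 8/8 to 4/8, the exact p-value is 140/1820, about 0.077. Three raw p-values of 0.01, 0.04 and 0.03 become 0.03, 0.06 and 0.06 after Holm adjustment.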

Items that share a source (several questions about the same document, scenarios from the same template) are not independent evidence. Give them a cluster field and the interval and verdict use a cluster-adjusted effective sample size (one-way cluster-robust variance → design effect, with a t critical value carrying G−1 degrees of freedom because the variance is estimated from G clusters), so correlation widens your intervals instead of silently flattering them. Known limit, in the open: with very few clusters and high intra-cluster correlation even the adjusted interval under-covers somewhat; the fix is more clusters, not more items per cluster.

runs: N repeats each item; an item passes only if every run passes, so output instability shows up as failure instead of luck.
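
The two rules above can be sketched as follows. This is one plausible reading, not evidentry's code: the variance is the usual linearized cluster-robust estimate for a pooled proportion, the design effect is floored at 1 so clustering never narrows an interval (an assumption), and the function names are hypothetical. The t critical value with G−1 degrees of freedom is left out because the standard library has no t quantile; `scipy.stats.t.ppf` is one way to obtain it.

```python
from collections import defaultdict


def item_passes(run_results: list[bool]) -> bool:
    """With runs: N, an item counts as a pass only if every run passed."""
    return all(run_results)


def effective_sample_size(passed: list[bool], clusters: list[str]) -> float:
    """Cluster-adjusted effective sample size n / deff."""
    n = len(passed)
    p_hat = sum(passed) / n
    sizes: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for ok, cluster in zip(passed, clusters):
        sizes[cluster] += 1
        passes[cluster] += ok  # True adds 1, False adds 0
    g = len(sizes)
    srs_var = p_hat * (1 - p_hat) / n  # variance if items were independent
    if g < 2 or srs_var == 0.0:
        return float(n)  # not estimable; falling back to n is an assumption
    # One-way cluster-robust variance: squared cluster-level residuals.
    resid_sq = sum((passes[c] - p_hat * sizes[c]) ** 2 for c in sizes)
    cluster_var = (g / (g - 1)) * resid_sq / n ** 2
    deff = max(1.0, cluster_var / srs_var)  # floor at 1 (assumed, conservative)
    return n / deff
```

Twelve items in four clusters of three, where each cluster passes or fails as a block, give a design effect of 4 and an effective sample size of 3, so the interval is far wider than 12 independent items would suggest.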

What the statistics honestly mean — and don't: the intervals quantify sampling uncertainty on your dataset. They do not certify field performance on a different input distribution, and verify proves integrity (the bytes haven't changed since the pack was built), not provenance (packs are not yet signed — see roadmap).

Quickstart

pip install evidentry
evidentry init my-model   # scaffold config + sample dataset
cd my-model
evidentry run             # works out of the box with the mock provider

Or run the worked example — a fictional bank validating a credit-memo summarizer (no API key needed; the mock provider makes it fully deterministic):

cd examples/credit_memo_summarizer
evidentry run -c evidentry.yaml
evidentry verify evidence/credit-memo-summarizer-v1.2.0-*

The committed sample output is in examples/credit_memo_summarizer/evidence/ — including a failing numeric-extraction suite and a use-limit violation, because an evidence tool you only see passing is a demo, not evidence.

What a pack asserts

Question a reviewer asks	Artifact
What is this system, who owns it, why this risk tier?	model card + tier rationale
How was it tested, against what thresholds?	outcomes analysis with 95% CIs
Is the sample large enough to settle the verdict?	PASS vs PASS (point) distinction + sample-size certificate
Has it changed since last validation?	Fisher's exact test vs. baseline pack, Holm-adjusted, with dataset-parity checks
Are these the same results that were produced then?	manifest pins SHA-256 of config + datasets + results
What isn't covered?	requirement gap table (`NOT EVIDENCED` rows)

Metrics

exact_match, contains (all substrings), regex, numeric (tolerance-based, for extracted figures), refusal (use-limit controls: "the summarizer must decline to give investment advice" is a control that needs evidence like everything else). These cover deterministic, checkable properties; graded qualities (faithfulness, tone) need LLM-as-judge metrics, which are deliberately not in v0.1: judge reliability is its own evidence problem, and we'd rather ship it with disagreement statistics than pretend a single judge is ground truth.

Two metrics deserve their caveats in the open. numeric refuses to guess: if an output contains more than one number, the item fails with an explanation unless you set an explicit extraction rule (first / last / any) — a silent wrong guess feeding a confidence interval is exactly the failure an evidence tool exists to prevent. refusal is a transparent lexical heuristic (the patterns are ~10 lines of metrics.py); it distinguishes "I can't make credit decisions" from "I can't believe this stock", but it is not a semantic classifier — audit the item-level details when a use-limit verdict matters.

Framework mappings — read this before using them

Packs can include requirement-coverage tables for SR 26-2 (US interagency model risk guidance, April 2026) and EU AI Act Annex IV. Three facts you should know, from the primary sources:

- SR 26-2 explicitly excludes generative and agentic AI from its scope (footnote 3 of the guidance) and is non-binding ("non-compliance with this guidance will not result in supervisory criticism"). Its principles apply directly to traditional statistical models and non-generative AI. Using its structure for an LLM system, as the worked example does, is an analogy: organizing evidence around principles a bank already applies elsewhere, ahead of the GenAI-specific guidance the agencies have signaled. The mapping file says this in its header, with quotes.
- EU AI Act high-risk documentation obligations are not yet in force (expected Dec 2027 / Aug 2028 for most systems, post-Omnibus).
- The mappings are interpretations that structure evidence. They are not the guidance text, not legal advice, and not a substitute for independent validation. Requirements that need human judgment (conceptual soundness review, effective challenge, governance) are deliberately surfaced as gaps rather than papered over.

Scope, honestly

v0.1 covers outcomes-analysis-style evidence for text-in/text-out systems with checkable expected outputs. Not yet covered: LLM-as-judge metrics, fairness testing, multi-turn agent traces, tool-call audit, pack signing.

Roadmap

- Pack signing + trusted timestamps, so integrity holds against tampering, not just accidents
- LLM-as-judge metrics with judge-disagreement reporting (a judge that always agrees with itself is not the same as a judge that's right)
- Format adapters for DeepEval / Inspect / promptfoo output files, so external ingestion doesn't require hand-exported JSONL
- Agent-trace evidence (multi-turn, tool calls)
- CI integration: fail the build when a high-tier model's evidence pack regresses
- More mappings (NIST AI RMF, ISO/IEC 42001)

Authors

Built by Alejandro Lizardi and John Dryden as part of Periapsis, working on statistically honest evaluation evidence for AI systems.

MIT licensed. Issues and war stories from your own eval reviews are very welcome.

Release history

This version

0.2.0

Jun 12, 2026

Download files

Source Distribution

evidentry-0.2.0.tar.gz (40.6 kB)

Uploaded Jun 12, 2026 Source

Built Distribution

evidentry-0.2.0-py3-none-any.whl (34.2 kB)

Uploaded Jun 12, 2026 Python 3

File details

Details for the file evidentry-0.2.0.tar.gz.

File metadata

Download URL: evidentry-0.2.0.tar.gz
Upload date: Jun 12, 2026
Size: 40.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evidentry-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`bb4f10fefcba542347492ada9845184fe9dc97ac70ed3176f65159f9ffa3148e`
MD5	`62b6bcd78ca8421e09b39039f2458d4a`
BLAKE2b-256	`d13a28ab3e9f4a7f8331ef74cc9d1ddb9a919f0caa955dce3da2ae3cc985fa9a`

Provenance

The following attestation bundles were made for evidentry-0.2.0.tar.gz:

Publisher: release.yml on alejlizardi/evidentry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evidentry-0.2.0.tar.gz
- Subject digest: bb4f10fefcba542347492ada9845184fe9dc97ac70ed3176f65159f9ffa3148e
- Sigstore transparency entry: 1795467973
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: alejlizardi/evidentry@c80691e98101ed0fcf75ae6e19cbd12b6eec6cef
- Branch / Tag: refs/heads/main
- Owner: https://github.com/alejlizardi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c80691e98101ed0fcf75ae6e19cbd12b6eec6cef
- Trigger Event: workflow_dispatch

File details

Details for the file evidentry-0.2.0-py3-none-any.whl.

File metadata

Download URL: evidentry-0.2.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 34.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evidentry-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`54bd2d638a01c62d9f3a49796eac4e5a97aad9214f331591b9b375973000026a`
MD5	`4168f4499b47cdb979b6ca515e4164d0`
BLAKE2b-256	`b7f2714de380f304105b1b04a07e926b1c6240290fe318ea288ab3369ee9d812`

Provenance

The following attestation bundles were made for evidentry-0.2.0-py3-none-any.whl:

Publisher: release.yml on alejlizardi/evidentry

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evidentry-0.2.0-py3-none-any.whl
- Subject digest: 54bd2d638a01c62d9f3a49796eac4e5a97aad9214f331591b9b375973000026a
- Sigstore transparency entry: 1795468035
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: alejlizardi/evidentry@c80691e98101ed0fcf75ae6e19cbd12b6eec6cef
- Branch / Tag: refs/heads/main
- Owner: https://github.com/alejlizardi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@c80691e98101ed0fcf75ae6e19cbd12b6eec6cef
- Trigger Event: workflow_dispatch
