Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.

These details have not been verified by PyPI

Project description

eval-toolkit

A methodology-aware evaluation harness for binary classification: metrics, bootstrap CIs, calibration, leakage detection, splitting, threshold selection, dataset loading, reproducibility manifests, and a slice-aware orchestrator that ties them together. Pure numpy/scipy/sklearn core; pandas/matplotlib/hypothesis are optional extras; PyTorch / HuggingFace / datasets are consumer-side (never required).

Library-grade by design — every public function is type-annotated, every math kernel is documented with LaTeX + literature references, statistical validity (bootstrap CIs, MDE estimates, paired-difference tests) is built in, and the JSON outputs (results.json / results_full.json / manifest.json) ship with versioned JSON Schemas so downstream parsers can gate on format changes.
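
As an illustration of gating on format changes, a downstream parser might refuse to read an output whose schema version it does not recognize. The sketch below is a minimal example under stated assumptions: `schema_version` is listed among the manifest fields in this README, but the value format (an integer, or a string whose leading component is the major version) and the helper name `check_schema_version` are hypothetical and not part of eval-toolkit's API.

```python
import json

SUPPORTED_MAJOR = 1  # assumption: this parser was written against the v1 schemas


def check_schema_version(payload: dict, supported_major: int = SUPPORTED_MAJOR) -> int:
    """Return the major schema version, or raise if it is missing or unsupported.

    Hypothetical helper: assumes `schema_version` is an int or a string such as
    "1" or "1.2" whose leading component is the major version.
    """
    raw = payload.get("schema_version")
    if raw is None:
        raise ValueError("payload has no schema_version field")
    major = int(str(raw).split(".")[0])
    if major != supported_major:
        raise ValueError(f"unsupported schema major version {major}")
    return major


# Round-trip a toy manifest through JSON, then gate on its version.
manifest_text = json.dumps({"schema_version": "1.0", "run_id": "demo"})
print(check_schema_version(json.loads(manifest_text)))
```

Failing loudly on an unknown major version is usually preferable to parsing a payload whose field semantics may have changed.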

Three-tier architecture

┌─ Tier 3 ─ Reproducibility scaffolding ─────────────────┐
│  manifest.json + seeds + git_sha + data_hashes +       │
│  gpu_info + leakage_report (NeurIPS-aligned)           │
├─ Tier 2 ─ Protocol-based orchestration ────────────────┤
│  Scorer / SliceAwareScorer / LeakageCheck / Splitter   │
│  ThresholdSelector / DatasetLoader / SimilarityStrategy│
│  Versioned (opt-in: per-object versions in manifest)   │
├─ Tier 1 ─ Functional core ─────────────────────────────┤
│  pr_auc / roc_auc / ECE variants / Brier / bootstrap_ci│
│  paired_bootstrap_diff / cv_clt_ci / mde_from_ci       │
│  reliability_curve / fit_temperature / fit_isotonic    │
└────────────────────────────────────────────────────────┘

Pick the tier your task needs. Ad-hoc analysis: just call the functional core. Full eval pipelines: implement the Protocols. Every run: capture the manifest.

Documentation

Getting started — end-to-end walkthrough for new users: install, define a Scorer, build slices, run evaluate(), persist results, add a claim, render a plot.
Methodology curriculum — 16 chapters on splits, metrics, calibration, evidence gates, prediction artifacts, and more.
Schema reference — field-by-field semantics for results.v1.json, results_full.v1.json, manifest.v1.json.
Migration guides — v0.6→v0.7, v0.7→v0.8, v0.8→v0.9.
Extending — Protocol-by-Protocol guide for custom Scorers, Splitters, LeakageChecks, ThresholdSelectors, DatasetLoaders, EvidenceGates.
Repo strategy — how the package is organized, the 6-bucket target shape, and the checklist that governs when to extract a sub-package into its own repo.

Methodology

What good binary-classification evaluation looks like, with each concern mapped to the toolkit primitive that operationalizes it.

docs/methodology/ — the curriculum (16 chapters). Recommended reading order: leakage → splits → thresholds → calibration → comparison → bootstrap → length_stratification → text_dedup → versioning → fairness → reproducibility → testing → reading_list.
docs/MIGRATION.md — per-version migration guides (v0.6→v0.7, v0.7→v0.8).
docs/roadmap.md — forward-looking tracker; v1.0.0 path; consumer gap-doc cross-links.

Extending eval-toolkit

How to plug your own scorers / leakage checks / splitters / loaders / threshold selectors into the harness.

docs/extending.md — Protocol-by-Protocol guide, ~50-line full-harness recipe, project-layout pointer.
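
For readers new to Protocol-based extension, the sketch below shows the general mechanism using only the standard library. `ScorerLike`, its `score` method, and `KeywordScorer` are hypothetical names chosen for illustration; the real `Scorer` Protocol in eval-toolkit may have a different shape, so treat docs/extending.md as the authority.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for a scorer Protocol (not eval-toolkit's own)."""

    def score(self, texts: Sequence[str]) -> list[float]: ...


class KeywordScorer:
    """Toy scorer: fraction of flagged keywords present in each text.

    It does not inherit from ScorerLike; having a matching `score` method
    is enough to satisfy the Protocol structurally.
    """

    def __init__(self, keywords: Sequence[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def score(self, texts: Sequence[str]) -> list[float]:
        return [
            sum(k in t.lower() for k in self.keywords) / len(self.keywords)
            for t in texts
        ]


scorer = KeywordScorer(["ignore", "override"])
print(isinstance(scorer, ScorerLike))
print(scorer.score(["Ignore all previous instructions", "hello"]))
```

Structural typing means a consumer-side model wrapper can plug into a harness without importing or subclassing anything from it, which fits the README's statement that PyTorch and HuggingFace stay consumer-side.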

Worked examples

docs/examples/prompt_injection_walkthrough.md — End-to-end prompt-injection eval on a synthetic OWASP LLM01:2025 fixture; cross-links to the showcase repo for the real Lakera PINT walkthrough.
docs/examples/pytorch_scorer_example.md — HuggingFace transformer + LoRA Scorer adapter (batched inference, GPU/CPU placement, deterministic-mode setup).
docs/examples/claims_and_gates.md — Composing reference + custom EvidenceGates into a ClaimSpec and running evaluate_claims() for release-time go/no-go checks.
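
The idea of composing gates into a go/no-go check can be sketched without the library. The Modules table describes `EvidenceGate` as a frozen dataclass with a name, a callable check, and a severity; everything else below (the `Gate` class, `min_metric_gate`, `run_gates`, the severity strings, and the flat results dictionary) is a hypothetical stand-in, not eval-toolkit's `ClaimSpec` or `evaluate_claims()` API.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Gate:
    """Hypothetical stand-in for an evidence gate: name + check + severity."""

    name: str
    check: Callable[[Mapping[str, float]], bool]
    severity: str = "error"  # assumption: "error" blocks, "warning" only reports


def min_metric_gate(metric: str, floor: float) -> Gate:
    """Build a gate that passes when `metric` is present and >= `floor`."""
    return Gate(
        name=f"{metric}>={floor}",
        check=lambda results: results.get(metric, float("-inf")) >= floor,
    )


def run_gates(results: Mapping[str, float], gates: list[Gate]) -> dict[str, bool]:
    """Evaluate every gate and return {gate name: passed}."""
    return {g.name: g.check(results) for g in gates}


results = {"pr_auc": 0.91, "ece": 0.04}
gates = [
    min_metric_gate("pr_auc", 0.85),
    Gate("ece<=0.05", lambda r: r["ece"] <= 0.05, severity="warning"),
]
report = run_gates(results, gates)
# Only gates with severity "error" can block the release decision.
go = all(ok for g, ok in zip(gates, report.values()) if g.severity == "error")
print(report, go)
```

Keeping each gate a small named predicate makes the final report readable: a failed release check points at a specific gate rather than at an opaque boolean.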

Install

uv venv
uv pip install -e .[dev]

For consumers who only need the math kernels (no plotting, no pandas):

pip install eval-toolkit                        # core only: numpy/scipy/sklearn
pip install "eval-toolkit[plotting]"            # adds matplotlib + pillow
pip install "eval-toolkit[dataframe]"           # adds pandas
pip install "eval-toolkit[all]"                 # everything

Quick examples

Metrics

import numpy as np
from eval_toolkit import pr_auc, roc_auc, expected_calibration_error

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
# Clip to [0, 1]: ECE is only meaningful for scores that are valid probabilities.
s = np.clip(y + rng.normal(0, 0.3, size=200), 0, 1)

print(f"PR-AUC: {pr_auc(y, s):.3f}")
print(f"ROC-AUC: {roc_auc(y, s):.3f}")
print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
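
To make the ECE number above less opaque, here is a numpy-only sketch of one common variant: equal-width bins over the positive-class probability, with each bin's gap between observed positive rate and mean predicted score weighted by the bin's share of samples. This is the textbook definition, not necessarily how `expected_calibration_error` is implemented; `ece_equal_width` is a hypothetical name.

```python
import numpy as np


def ece_equal_width(y_true, y_prob, n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin weight) * |mean label - mean score|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each score to one of n_bins equal-width bins on [0, 1];
    # a score of exactly 1.0 is placed in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return float(ece)


print(ece_equal_width([1, 1, 0, 0], [1.0, 1.0, 0.0, 0.0]))  # confident and right
print(ece_equal_width([0, 0], [0.9, 0.9]))  # confident and wrong
```

Because the estimate depends on the binning scheme and bin count, ECE values are only comparable when computed with the same settings.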

Bootstrap confidence intervals

# Continues the Metrics example above: reuses np, rng, y, and s.
from eval_toolkit import bootstrap_ci, paired_bootstrap_diff, pr_auc

ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, seed=42)
print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")

# Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, seed=42)
print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
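
The key property of a paired bootstrap is that both scorers are evaluated on the same resampled rows, so the interval reflects the difference rather than two independent noise sources. The sketch below shows that idea with a plain percentile interval; the Modules table says the library offers BCa intervals, which this sketch does not implement. `paired_percentile_diff` and `accuracy_at_half` are hypothetical names, and accuracy is used as the metric only because it stays defined on any resample.

```python
import numpy as np


def paired_percentile_diff(y, s_a, s_b, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for metric(y, s_b) - metric(y, s_a)."""
    y, s_a, s_b = (np.asarray(a) for a in (y, s_a, s_b))
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # one set of rows, shared by both scorers
        deltas[i] = metric(y[idx], s_b[idx]) - metric(y[idx], s_a[idx])
    low, high = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    delta = metric(y, s_b) - metric(y, s_a)
    return float(delta), float(low), float(high)


def accuracy_at_half(y, s):
    """Toy metric for the demo: accuracy at a 0.5 threshold."""
    return float(np.mean((s >= 0.5) == (y == 1)))


rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
s_good = np.clip(y + rng.normal(0, 0.3, size=200), 0, 1)
s_base = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
delta, low, high = paired_percentile_diff(y, s_base, s_good, accuracy_at_half)
print(f"delta accuracy: {delta:.3f}  95% CI: [{low:.3f}, {high:.3f}]")
print("CI excludes zero:", not (low <= 0.0 <= high))
```

For ranking metrics such as PR-AUC, a resample that happens to contain a single class needs explicit handling, which this sketch leaves out.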

Temperature scaling (Guo et al. 2017)

import numpy as np
from eval_toolkit import fit_temperature

rng = np.random.default_rng(42)
logits = rng.normal(size=(500, 2))
labels = (logits[:, 1] > logits[:, 0]).astype(int)
result = fit_temperature(logits, labels)
print(f"Optimal T: {result['temperature']:.3f}")
print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
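
For intuition about what temperature scaling optimizes, the following numpy-only sketch divides the logits by a scalar T and picks the T that minimizes mean negative log-likelihood. It uses a simple log-spaced grid search, which is an assumption made for brevity; `fit_temperature` may use a different optimizer and returns a richer result. The function names here are hypothetical.

```python
import numpy as np


def nll_at_temperature(logits, labels, temperature):
    """Mean negative log-likelihood of softmax(logits / temperature)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # stabilise the log-sum-exp
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def fit_temperature_grid(logits, labels, grid=None):
    """Pick the temperature on a log-spaced grid that minimises NLL."""
    if grid is None:
        grid = np.logspace(-2, 2, 401)  # T from 0.01 to 100
    losses = [nll_at_temperature(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Synthetic overconfident model: labels are drawn from the true logits,
# but the model reports logits inflated by a factor of 3.
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(5000, 2))
p1 = 1.0 / (1.0 + np.exp(-(true_logits[:, 1] - true_logits[:, 0])))
labels = (rng.random(5000) < p1).astype(int)
overconfident = 3.0 * true_logits

t_hat = fit_temperature_grid(overconfident, labels)
nll_pre = nll_at_temperature(overconfident, labels, 1.0)
nll_post = nll_at_temperature(overconfident, labels, t_hat)
print(f"fitted T: {t_hat:.2f}  NLL: {nll_pre:.3f} -> {nll_post:.3f}")
```

In this setup the fitted temperature should land near the inflation factor. Labels sampled from the true probabilities matter here: if labels are a deterministic function of the logits, the likelihood keeps improving as T shrinks and no finite optimum exists.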

Reproducibility manifest (NeurIPS-aligned)

import tempfile
from pathlib import Path
from eval_toolkit import build_manifest, write_manifest

with tempfile.TemporaryDirectory() as run_dir:
    # Optional arguments, omitted in this minimal call:
    #   data_files: {name: path}; eval_toolkit hashes the files for you.
    #   versioned: any object with a `version` attribute (e.g. a scorer or
    #   leakage check) is recorded as name → version in the manifest.
    manifest = build_manifest(
        run_id="quickstart-demo",
        config={"threshold_criterion": "max_f1", "seed": 42},
        seeds={"global": 42, "bootstrap": 42},
    )
    write_manifest(manifest, Path(run_dir))
    # → run_dir/manifest.json: schema_version, git_sha, dirty_flag, code_versions,
    #   env (python+platform), seeds, data_hashes, versioned_objects, gpu_info
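
To show what a `data_hashes` entry could look like, the sketch below streams each data file through SHA-256 and builds a name-to-digest mapping using only the standard library. The choice of SHA-256, the chunk size, and the helper names `sha256_file` and `hash_data_files` are assumptions for illustration; the manifest's actual hashing scheme is defined by the library and its schema reference.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large datasets never load fully into memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_data_files(data_files: dict[str, Path]) -> dict[str, str]:
    """Map each logical dataset name to the digest of its file."""
    return {name: sha256_file(p) for name, p in sorted(data_files.items())}


with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.csv"
    train.write_bytes(b"text,label\nhello,0\n")
    hashes = hash_data_files({"train": train})
    print(hashes["train"][:12])
```

Recording content hashes rather than paths lets a later reader confirm that a rerun used byte-identical data even if the files have moved.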

Modules

Module	Purpose
`eval_toolkit.metrics`	PR-AUC, ROC-AUC, ECE variants, Brier decomposition, prior-shift projection
`eval_toolkit.thresholds`	`ThresholdSelector` Protocol + 6 reference impls (max-F1, target-recall/precision/FPR, Youden-J, cost-sensitive)
`eval_toolkit.operating_points`	Fit thresholds on mixed-class slices and apply them to mixed or single-class target slices with provenance
`eval_toolkit.bootstrap`	BCa + paired bootstrap, MDE estimates, two-level operating-point bootstrap, K-fold CLT-corrected CI
`eval_toolkit.calibration`	Reliability curves, Bayes-optimal thresholds, isotonic/Platt/temperature scaling
`eval_toolkit.harness`	`Scorer` Protocol + `evaluate(...)` + `evaluate_folded(...)` slice-aware orchestrators
`eval_toolkit.leakage`	`LeakageCheck` Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); `Versioned` opt-in Protocol
`eval_toolkit.splits`	`Splitter` Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series)
`eval_toolkit.loaders`	`DatasetLoader` Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible `describe()`
`eval_toolkit.manifest`	`RunManifest` (NeurIPS-aligned) + source-role / guardrail metadata + `build_manifest` / `write_manifest`
`eval_toolkit.claims`	`EvidenceGate` class (frozen dataclass: name + callable check + severity), reference gate factories (`required_metric_gate`, `minimum_slice_size_gate`, `metric_threshold_gate`, etc.), `evaluate_claims()`, and `ClaimReport` for claim-mode vs exploratory-mode checks. See `docs/extending.md` for writing custom gates and `docs/examples/claims_and_gates.md` for a worked end-to-end example.
`eval_toolkit.text_dedup`	`SimilarityStrategy` Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); `near_dedup` / `cross_dedup` orchestrators
`eval_toolkit.plotting`	PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs
`eval_toolkit.provenance`	File hashing, run-directory layout, figure metadata sidecar
`eval_toolkit/schemas/`	Bundled JSON Schemas (`results.v1.json`, `results_full.v1.json`, `manifest.v1.json`) — load via `importlib.resources.files("eval_toolkit") / "schemas"` (not an importable Python module)
`eval_toolkit.paths`	Repo-relative path normalization
`eval_toolkit.seeds`	`set_global_seeds` (random + numpy + optional torch)
`eval_toolkit.config`	`frozen_config` decorator + `from_yaml` loader
`eval_toolkit.docs`	Anchor-based markdown rendering with formatter registry
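
As an illustration of the max-F1 criterion named in the `eval_toolkit.thresholds` row, here is a brute-force numpy sketch that tries every distinct score as a cut point. `max_f1_threshold` is a hypothetical function, not the library's `ThresholdSelector` implementation, and the quadratic loop is chosen for clarity over speed.

```python
import numpy as np


def max_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximising F1 when predicting scores >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best_t, best_f1 = float("inf"), 0.0
    for t in np.unique(scores):  # every distinct score is a candidate cut
        pred = scores >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


y = [0, 0, 1, 0, 1, 1]
s = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
threshold, f1 = max_f1_threshold(y, s)
print(f"threshold={threshold}  F1={f1:.3f}")
```

A threshold chosen this way is fitted to the data it was selected on, so it should be picked on a validation slice and then applied unchanged to the evaluation slice.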

Fast iteration loop

For development, skip slow tests with:

make fast              # or: nox -s fast
# under the hood: uv run pytest -m "not slow" -q

CI runs the full suite (including slow) on every push. The slow marker is applied to tests exceeding ~2s (mostly Hypothesis property tests with large max_examples and a few bootstrap tests with n_resamples >= 200). make fast keeps the developer iteration loop under ~30 seconds.

Downstream contract testing (v4 sibling-smoke)

A separate CI workflow (.github/workflows/v4-smoke.yml) checks out the downstream consumer prompt-injection-v4 at main, installs it with this branch's eval-toolkit as an editable sibling dep (via v4's [tool.uv.sources]), and runs v4's fast -m smoke suite. This catches contract regressions at PR time rather than in v4's own CI post-merge.

The workflow requires a HF_TOKEN repo secret (gated HuggingFace datasets used by v4's smoke fixtures). Set it at: https://github.com/brandon-behring/eval-toolkit/settings/secrets/actions

The workflow runs with continue-on-error: true during a 2-3 week trial period; it will be promoted to a required gate once the false-positive rate (from independent v4 main breakage or HF rate limits) is characterized.

Standards

See STYLE.md for the full reconciled coding standards (formatting, naming, errors, docstrings, tests, packaging).

Versioning

Semver from v0.1.0. See CHANGELOG.md.

License

MIT — see LICENSE.

Project details

Release history

Version	Release date
0.43.0	May 19, 2026
0.42.0	May 19, 2026
0.41.0	May 19, 2026
0.40.0	May 19, 2026
0.39.0	May 18, 2026
0.38.0	May 18, 2026
0.37.0	May 18, 2026
0.36.0	May 18, 2026
0.35.0	May 18, 2026
0.34.0	May 17, 2026
0.33.1	May 17, 2026
0.33.0 (this version)	May 17, 2026
0.32.0	May 17, 2026
0.31.0	May 16, 2026
0.30.1	May 16, 2026
0.30.0	May 15, 2026
0.29.0	May 15, 2026
0.28.1	May 15, 2026
0.28.0	May 15, 2026
0.27.2	May 15, 2026
0.27.1	May 15, 2026

Download files

Source Distribution

eval_toolkit-0.33.0.tar.gz (635.0 kB)

Uploaded May 17, 2026 Source

Built Distribution

eval_toolkit-0.33.0-py3-none-any.whl (171.4 kB)

Uploaded May 17, 2026 Python 3

File details

Details for the file eval_toolkit-0.33.0.tar.gz.

File metadata

Download URL: eval_toolkit-0.33.0.tar.gz
Upload date: May 17, 2026
Size: 635.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for eval_toolkit-0.33.0.tar.gz
Algorithm	Hash digest
SHA256	`c2cc8a2c2073a8cc437e1a8612a4047b8f909532f6b92209d5497b12e343dbcf`
MD5	`7e1c4b8bb8b22a3c69afd0655bde356c`
BLAKE2b-256	`9ce54767bd60b4a460f27b302e4b1abe1edec5cc434ee67c08c328126170d38c`

Provenance

The following attestation bundles were made for eval_toolkit-0.33.0.tar.gz:

Publisher: publish.yml on brandon-behring/eval-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: eval_toolkit-0.33.0.tar.gz
- Subject digest: c2cc8a2c2073a8cc437e1a8612a4047b8f909532f6b92209d5497b12e343dbcf
- Sigstore transparency entry: 1562773783
- Sigstore integration time: May 17, 2026
Source repository:
- Permalink: brandon-behring/eval-toolkit@9e375a896fb3dca14ad55b21d208f1169b33b2ad
- Branch / Tag: refs/tags/v0.33.0
- Owner: https://github.com/brandon-behring
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9e375a896fb3dca14ad55b21d208f1169b33b2ad
- Trigger Event: push

File details

Details for the file eval_toolkit-0.33.0-py3-none-any.whl.

File metadata

Download URL: eval_toolkit-0.33.0-py3-none-any.whl
Upload date: May 17, 2026
Size: 171.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for eval_toolkit-0.33.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d5ab6eb72cd27cd5f93f2550b61f2dc13ab896636ef2d1e16dbb3e0a375e1c0e`
MD5	`e0435a8d597d18664fc448a6f6de717a`
BLAKE2b-256	`2bfd44dc275291ea0323cc3c80a0caaa2477860f3f12dd2f8e01776916a784d7`

Provenance

The following attestation bundles were made for eval_toolkit-0.33.0-py3-none-any.whl:

Publisher: publish.yml on brandon-behring/eval-toolkit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: eval_toolkit-0.33.0-py3-none-any.whl
- Subject digest: d5ab6eb72cd27cd5f93f2550b61f2dc13ab896636ef2d1e16dbb3e0a375e1c0e
- Sigstore transparency entry: 1562773787
- Sigstore integration time: May 17, 2026
Source repository:
- Permalink: brandon-behring/eval-toolkit@9e375a896fb3dca14ad55b21d208f1169b33b2ad
- Branch / Tag: refs/tags/v0.33.0
- Owner: https://github.com/brandon-behring
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9e375a896fb3dca14ad55b21d208f1169b33b2ad
- Trigger Event: push
