Reusable evaluation contracts for binary classification: metrics, bootstrap CIs, calibration, artifacts, and evidence gates.

eval-toolkit

Python ≥3.13 · License: MIT

A methodology-aware evaluation harness for binary classification: metrics, bootstrap CIs, calibration, leakage detection, splitting, threshold selection, dataset loading, reproducibility manifests, and a slice-aware orchestrator that ties them together. Pure numpy/scipy/sklearn core; pandas/matplotlib/hypothesis are optional extras; PyTorch / HuggingFace / datasets are consumer-side (never required).

Library-grade by design — every public function is type-annotated, every math kernel is documented with LaTeX + literature references, statistical validity (bootstrap CIs, MDE estimates, paired-difference tests) is built in, and the JSON outputs (results.json / results_full.json / manifest.json) ship with versioned JSON Schemas so downstream parsers can gate on format changes.
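
As a rough illustration of gating on a format change, a downstream parser might check a version field before reading anything else. The sketch below assumes the payload carries a top-level `schema_version` string (the manifest example lists a field of that name). The value format, the `SUPPORTED_MAJOR` constant, and the `load_results` helper are assumptions made for this example, not part of the toolkit.

```python
import json

SUPPORTED_MAJOR = 1  # hypothetical: the major schema version this parser understands


def load_results(text: str) -> dict:
    """Parse a results payload, refusing unknown major schema versions."""
    payload = json.loads(text)
    version = str(payload.get("schema_version", ""))
    # Accept spellings such as "1", "1.0", or "v1" (assumed formats).
    major = version.lstrip("v").split(".")[0]
    if not major.isdigit() or int(major) != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return payload


ok = load_results('{"schema_version": "1.0", "metrics": {"pr_auc": 0.91}}')
print(ok["metrics"]["pr_auc"])  # 0.91
```

Failing fast on an unknown major version keeps a parser from silently misreading fields whose meaning changed between schema releases.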

Three-tier architecture

┌─ Tier 3 ─ Reproducibility scaffolding ─────────────────┐
│  manifest.json + seeds + git_sha + data_hashes +       │
│  gpu_info + leakage_report (NeurIPS-aligned)           │
├─ Tier 2 ─ Protocol-based orchestration ────────────────┤
│  Scorer / SliceAwareScorer / LeakageCheck / Splitter   │
│  ThresholdSelector / DatasetLoader / MetricSpec        │
│  MetaLearner / Probe / TextTransform /                 │
│  SimilarityStrategy (10 strict)                        │
│  Versioned (opt-in: per-object versions in manifest)   │
├─ Tier 1 ─ Functional core ─────────────────────────────┤
│  pr_auc / roc_auc / ECE variants / Brier / bootstrap_ci│
│  paired_bootstrap_diff / cv_clt_ci / mde_from_ci       │
│  reliability_curve / fit_temperature / fit_isotonic    │
└────────────────────────────────────────────────────────┘

Pick the tier your task needs. Ad-hoc analysis: just call the functional core. Full eval pipelines: implement the Protocols. Every run: capture the manifest.
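
To make "implement the Protocols" concrete, here is a minimal sketch of structural typing with `typing.Protocol`. The `ScorerLike` name and its `score` method are hypothetical stand-ins chosen for illustration; the toolkit's actual Protocol signatures are documented in the Extending guide and may differ.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical shape: anything with a matching `score` method qualifies."""

    def score(self, texts: Sequence[str]) -> list[float]: ...


class KeywordScorer:
    """Toy scorer: fraction of flagged keywords present in each text."""

    def __init__(self, keywords: Sequence[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def score(self, texts: Sequence[str]) -> list[float]:
        return [
            sum(k in t.lower() for k in self.keywords) / len(self.keywords)
            for t in texts
        ]


scorer = KeywordScorer(["refund", "urgent"])
assert isinstance(scorer, ScorerLike)  # structural check, no inheritance needed
scores = scorer.score(["Urgent: refund needed", "hello"])
print(scores)  # [1.0, 0.0]
```

Because the check is structural, an existing model wrapper can satisfy such a Protocol without importing or subclassing anything from the harness.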

Documentation

  • Getting started — end-to-end walkthrough for new users: install, define a Scorer, build slices, run evaluate(), persist results, add a claim, render a plot.
  • Methodology curriculum — 16 chapters on splits, metrics, calibration, evidence gates, prediction artifacts, and more.
  • Schema reference — field-by-field semantics for results.v1.json, results_full.v1.json, manifest.v1.json.
  • Migration guides — per-version migration hub (v0.7 onward).
  • Extending — Protocol-by-Protocol guide for custom Scorers, Splitters, LeakageChecks, ThresholdSelectors, DatasetLoaders, EvidenceGates.
  • Repo strategy — how the package is organized, the flat-module layout per ADR 0001, and the v2.0 trigger criteria for any future subpackage split.

Methodology

What good binary-classification evaluation looks like, with each concern mapped to the toolkit primitive that operationalizes it.

Extending eval-toolkit

How to plug your own scorers / leakage checks / splitters / loaders / threshold selectors into the harness.

Worked examples

  • docs/source/examples/ — Sphinx / MyST-NB executable notebooks covering: the evaluation harness, metrics + bootstrap, calibration, claims-and-gates, leakage detection, cross-corpus contamination scanning, character-injection adversarial sweeps, callable-embedder dedup, and the activation-delta probe.
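
For a feel of what the simplest form of leakage detection involves, the following standard-library sketch flags test rows whose normalized text also appears in the training split. It is an illustration of exact-match cross-split checking under an assumed normalization (NFKC, case folding, whitespace collapsing); the function names are hypothetical and this is not the toolkit's LeakageCheck implementation.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC, case-fold, and collapse whitespace before hashing."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text, used as a compact identity key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def cross_split_duplicates(train: list[str], test: list[str]) -> list[int]:
    """Return indices of test rows whose normalized text also appears in train."""
    seen = {fingerprint(t) for t in train}
    return [i for i, t in enumerate(test) if fingerprint(t) in seen]


train = ["Reset my password", "Where is my order?"]
test = ["reset  my PASSWORD", "Cancel my subscription"]
print(cross_split_duplicates(train, test))  # [0]
```

Near-duplicate and encoding-obfuscated checks need fuzzier matching than a hash lookup, which is where similarity strategies come in.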

Install

# Development install from a repository checkout
uv venv
uv pip install -e ".[dev]"

For consumers, the core install provides the math kernels only (no plotting, no pandas); optional extras add the rest:

pip install eval-toolkit                        # core only: numpy/scipy/sklearn
pip install "eval-toolkit[plotting]"            # adds matplotlib + pillow
pip install "eval-toolkit[dataframe]"           # adds pandas
pip install "eval-toolkit[all]"                 # everything

Quick examples

Metrics

import numpy as np
from eval_toolkit.metrics import pr_auc, roc_auc, expected_calibration_error

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
# Clip to [0, 1]: ECE treats scores as probabilities, so they must lie in that range.
s = np.clip(y + rng.normal(0, 0.3, size=200), 0, 1)

print(f"PR-AUC: {pr_auc(y, s):.3f}")
print(f"ROC-AUC: {roc_auc(y, s):.3f}")
print(f"ECE (10 bins): {expected_calibration_error(y, s, n_bins=10):.3f}")
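
For readers who want to see what a binned ECE computes, here is a dependency-free sketch of the equal-width variant for binary labels: within each bin it compares the observed positive rate with the mean predicted probability, then weights by bin size. The function name `ece_equal_width` is hypothetical, and the toolkit's own kernel may differ in binning or edge handling.

```python
def ece_equal_width(labels: list[int], probs: list[float], n_bins: int = 10) -> float:
    """Weighted mean of |positive rate - mean probability| over equal-width bins."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and equal length")
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum(label), sum(prob)
    for y, p in zip(labels, probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b][0] += 1
        bins[b][1] += y
        bins[b][2] += p
    n = len(labels)
    return sum(
        (count / n) * abs(pos / count - conf / count)
        for count, pos, conf in bins
        if count
    )


labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.6, 0.9]
ece = ece_equal_width(labels, probs, n_bins=2)
print(round(ece, 3))  # 0.25
```

In this toy case each bin is off by 0.25 (positive rate 0 against mean score 0.25, and 1 against 0.75), so the weighted average is 0.25.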

Bootstrap confidence intervals

import numpy as np
from eval_toolkit import bootstrap_ci, paired_bootstrap_diff
from eval_toolkit.metrics import pr_auc

# Reuses y, s, and rng from the Metrics example above.

ci = bootstrap_ci(y, s, pr_auc, n_resamples=1000, rng=42)
print(f"PR-AUC: {ci.point_estimate:.3f}  95% CI: [{ci.ci_low:.3f}, {ci.ci_high:.3f}]")

# Paired bootstrap on the lift between two scorers (s_baseline must be in [0, 1] too).
s_baseline = np.clip(rng.normal(0.5, 0.3, size=200), 0, 1)
diff = paired_bootstrap_diff(y, s_baseline, s, pr_auc, n_resamples=1000, rng=42)
print(f"Δ PR-AUC: {diff.delta:.3f}  overlaps zero: {diff.overlaps_zero}")
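
The key idea behind a paired bootstrap is that both scorers are evaluated on the same resampled rows, so shared noise cancels in the difference. The standard-library sketch below illustrates this with a plain percentile interval and a toy accuracy metric. Note that the toolkit's bootstrap module lists BCa, which this sketch does not implement, and every name here is hypothetical.

```python
import random


def accuracy(y: list[int], s: list[float]) -> float:
    """Toy metric: accuracy at a 0.5 threshold."""
    return sum((p >= 0.5) == bool(t) for t, p in zip(y, s)) / len(y)


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile on a sorted copy (q in [0, 1])."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def paired_bootstrap(y, s_a, s_b, metric, n_resamples=1000, seed=0):
    """Percentile 95% CI for metric(s_b) - metric(s_a) on shared resamples."""
    rng = random.Random(seed)
    n = len(y)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both scorers
        y_r = [y[i] for i in idx]
        deltas.append(
            metric(y_r, [s_b[i] for i in idx]) - metric(y_r, [s_a[i] for i in idx])
        )
    return percentile(deltas, 0.025), percentile(deltas, 0.975)


rng = random.Random(42)
y = [rng.randrange(2) for _ in range(200)]
good = [min(max(t + rng.gauss(0, 0.3), 0.0), 1.0) for t in y]  # informative scores
coin = [rng.random() for _ in y]  # uninformative baseline
low, high = paired_bootstrap(y, coin, good, accuracy)
print(f"delta accuracy 95% CI: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero is the kind of evidence a lift claim would rest on; one that straddles zero suggests the sample cannot separate the two scorers.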

Temperature scaling (Guo et al. 2017)

import numpy as np
from eval_toolkit import fit_temperature

rng = np.random.default_rng(42)
logits = rng.normal(size=(500, 2))
labels = (logits[:, 1] > logits[:, 0]).astype(int)
result = fit_temperature(logits, labels)
print(f"Optimal T: {result['temperature']:.3f}")
print(f"NLL: {result['nll_pre']:.3f} -> {result['nll_post']:.3f}")
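
To show what fitting a temperature means, here is a dependency-free sketch for the binary case with a single logit per example (the difference between the two class logits). It minimizes the mean negative log-likelihood of `sigmoid(z / T)` using golden-section search over `beta = 1 / T`, which works because the NLL is convex in `beta`. The function names, search bounds, and synthetic data are assumptions of this sketch; `fit_temperature` in the toolkit may use a different optimizer and input layout.

```python
import math
import random


def nll(beta: float, logits: list[float], labels: list[int]) -> float:
    """Mean binary negative log-likelihood of sigmoid(beta * z), beta = 1 / T."""
    total = 0.0
    for z, y in zip(logits, labels):
        margin = beta * z if y == 1 else -beta * z
        # log(1 + exp(-margin)), computed stably for large |margin|
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(logits)


def fit_temperature_1d(logits, labels, lo=1e-3, hi=100.0, iters=80) -> float:
    """Golden-section search over beta = 1 / T; returns the fitted T."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if nll(c, logits, labels) < nll(d, logits, labels):
            b = d
        else:
            a = c
    return 2.0 / (a + b)  # T = 1 / beta at the bracket midpoint


# Synthetic data: labels follow sigmoid(z), but the "model" reports 3 * z.
rng = random.Random(42)
true_logits = [rng.gauss(0.0, 2.0) for _ in range(2000)]
labels = [int(rng.random() < 1 / (1 + math.exp(-z))) for z in true_logits]
overconfident = [3.0 * z for z in true_logits]

T = fit_temperature_1d(overconfident, labels)
nll_pre = nll(1.0, overconfident, labels)
nll_post = nll(1.0 / T, overconfident, labels)
print(f"fitted T: {T:.2f}")
print(f"NLL: {nll_pre:.3f} -> {nll_post:.3f}")
```

Since the logits were inflated by a factor of 3, the fitted temperature should land near 3, and the NLL after scaling should be lower than before.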

Reproducibility manifest (NeurIPS-aligned)

import tempfile
from pathlib import Path
from eval_toolkit import make_manifest, write_manifest

with tempfile.TemporaryDirectory() as run_dir:
    # Not passed in this minimal call:
    #   data_files: {name: path}; eval_toolkit hashes the files for you.
    #   versioned: any object with a `version` attribute (e.g. a scorer or
    #   leakage check) is captured by name → version in the manifest.
    manifest = make_manifest(
        run_id="quickstart-demo",
        config={"threshold_criterion": "max_f1", "seed": 42},
        seeds={"global": 42, "bootstrap": 42},
    )
    write_manifest(manifest, Path(run_dir))
    # → run_dir/manifest.json: schema_version, git_sha, dirty_flag, code_versions,
    #   env (python+platform), seeds, data_hashes, versioned_objects, gpu_info

Modules

| Module | Purpose |
|---|---|
| eval_toolkit.metrics | PR-AUC, ROC-AUC, ECE variants, Brier decomposition, prior-shift projection |
| eval_toolkit.thresholds | ThresholdSelector Protocol + 6 reference impls (max-F1, target-recall/precision/FPR, Youden-J, cost-sensitive) |
| eval_toolkit.operating_points | Fit thresholds on mixed-class slices and apply them to mixed or single-class target slices with provenance |
| eval_toolkit.bootstrap | BCa + paired bootstrap, MDE estimates, two-level operating-point bootstrap, K-fold CLT-corrected CI |
| eval_toolkit.calibration | Reliability curves, Bayes-optimal thresholds, isotonic/Platt/temperature scaling |
| eval_toolkit.harness | Scorer Protocol + evaluate(...) + evaluate_folded(...) slice-aware orchestrators |
| eval_toolkit.leakage | LeakageCheck Protocol + 7 reference impls (exact / near / encoding-obfuscated / cross-split / label-conflict / group / temporal); Versioned opt-in Protocol |
| eval_toolkit.splits | Splitter Protocol + 5 reference impls (holdout / stratified / group / source-disjoint / time-series) |
| eval_toolkit.loaders | DatasetLoader Protocol + 4 reference impls (DataFrame / SingleSlice / ParquetGlob / HF datasets) with Croissant-compatible describe() |
| eval_toolkit.manifest | RunManifest (NeurIPS-aligned) + source-role / guardrail metadata + make_manifest / write_manifest |
| eval_toolkit.claims | EvidenceGate class (frozen dataclass: name + callable check + severity), reference gate factories (required_metric_gate, minimum_slice_size_gate, metric_threshold_gate, etc.), evaluate_claims(), and ClaimReport for claim-mode vs exploratory-mode checks. See docs/source/extending.md for writing custom gates and docs/source/examples/claims_and_gates.md for a worked end-to-end example. |
| eval_toolkit.text_dedup | SimilarityStrategy Protocol + 5 strategies (TF-IDF / hash / embedding / Jaccard / MinHash-LSH); near_dedup / cross_dedup orchestrators |
| eval_toolkit.plotting | PR curves, reliability diagrams, confusion matrices, score histograms, lift CIs |
| eval_toolkit.provenance | File hashing, run-directory layout, figure metadata sidecar |
| eval_toolkit/schemas/ | Bundled JSON Schemas (results.v1.json, results_full.v1.json, manifest.v1.json); load via importlib.resources.files("eval_toolkit") / "schemas" (not an importable Python module) |
| eval_toolkit.paths | Repo-relative path normalization |
| eval_toolkit.seeds | set_global_seeds (random + numpy + optional torch) |
| eval_toolkit.config | frozen_config decorator + from_yaml loader |
| eval_toolkit.docs | Anchor-based markdown rendering with formatter registry |
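
As an illustration of what a threshold selector does, the sketch below scans the observed scores as candidate cut-offs and keeps the one that maximizes a criterion, shown here for F1 and Youden's J (TPR minus FPR). The helper names are hypothetical and the code uses only the standard library; the reference implementations in eval_toolkit.thresholds may choose candidates and break ties differently.

```python
def confusion(y: list[int], s: list[float], thr: float) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) when predicting positive for score >= thr."""
    tp = sum(1 for t, p in zip(y, s) if p >= thr and t == 1)
    fp = sum(1 for t, p in zip(y, s) if p >= thr and t == 0)
    fn = sum(1 for t, p in zip(y, s) if p < thr and t == 1)
    tn = len(y) - tp - fp - fn
    return tp, fp, fn, tn


def f1_at(y, s, thr):
    tp, fp, fn, _ = confusion(y, s, thr)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def youden_j_at(y, s, thr):
    tp, fp, fn, tn = confusion(y, s, thr)
    return tp / (tp + fn) - fp / (fp + tn)  # TPR - FPR; needs both classes present


def select_threshold(y, s, criterion):
    """Scan observed scores as candidate thresholds; ties keep the lowest."""
    return max(sorted(set(s)), key=lambda thr: criterion(y, s, thr))


y = [0, 0, 1, 0, 1, 1]
s = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
best_f1 = select_threshold(y, s, f1_at)
best_j = select_threshold(y, s, youden_j_at)
print(best_f1, best_j)  # 0.35 0.35
```

A threshold chosen this way is itself a fitted parameter, so it should be selected on one slice and applied to another rather than tuned on the data it is evaluated on.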

Fast iteration loop

For development, skip slow tests with:

make fast              # or: nox -s fast
# under the hood: uv run pytest -m "not slow" -q

CI runs the full suite (including slow) on every push. The slow marker is applied to tests exceeding ~2s (mostly Hypothesis property tests with large max_examples and a few bootstrap tests with n_resamples >= 200). make fast keeps the developer iteration loop under ~30 seconds.

Standards

See STYLE.md for the full reconciled coding standards (formatting, naming, errors, docstrings, tests, packaging).

Versioning

Semantic versioning (SemVer) since v0.1.0. See CHANGELOG.md.

License

MIT — see LICENSE.

