statbelt
statbelt is a strict experimental harness for reproducible, statistically aware
model evaluation in Python.
Status: Alpha (APIs may evolve).
Supported Python: 3.11+.
Installation
Install from PyPI:
pip install statbelt
For local development:
uv sync --all-groups
Quick Start
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from statbelt import ExperimentalHarness
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
report = (
    ExperimentalHarness()
    .data(X, y)
    .task("binary_classification")
    .compare(
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=21)),
    )
    .metrics("accuracy", "roc_auc", "log_loss")
    .design(cv=5, cv_repeats=2, random_state=42)
    .inference(alpha=0.05, bootstrap_resamples=2000)
    .compare_inference(method="paired_bootstrap", alternative="two-sided")
    .multiplicity(method="holm", family="global")
    .practical_significance(accuracy=0.005, roc_auc=0.002, log_loss=0.01)
    .baseline("logreg")
    .guardrails(min_improvement={"accuracy": 0.002}, confidence=0.95)
    .fasten("statbelt.lock.json")
    .evaluate()
)
print(report.summary())
Sample output:
Task: binary_classification
CV folds: 5
CV repeats: 2
Bootstrap resamples: 2000
Confidence interval: 95%
Model: logreg
accuracy: 0.9737 (CI 0.9596, 0.9877)
roc_auc: 0.9953 (CI 0.9902, 0.9990)
log_loss: 0.0764 (CI 0.0515, 0.1061)
Model: rf
accuracy: 0.9561 (CI 0.9509, 0.9613)
roc_auc: 0.9896 (CI 0.9832, 0.9951)
log_loss: 0.1769 (CI 0.1061, 0.3037)
When pairwise inference and guardrails are configured, summary() also includes
pairwise comparison lines and an overall guardrail pass/fail section.
Core Features
- ExperimentalHarness builder-style API for binary classification comparisons.
- Deterministic repeated stratified k-fold evaluation with shared folds across models.
- Bootstrap confidence intervals over fold-level metrics.
- Pairwise model inference with paired bootstrap/permutation p-values.
- Multiple-comparison correction (holm, bonferroni, fdr_bh).
- Practical-significance thresholds and baseline guardrail checks.
- Machine-readable exports via EvaluationReport.to_json() and .to_dataframe().
- Lock artifact output (statbelt.lock.json) with config and split indices.
- Strict staged workflow: configure -> fasten() -> evaluate().
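To make "bootstrap confidence intervals over fold-level metrics" concrete, here is a minimal standard-library sketch of a percentile bootstrap over per-fold scores. The function name bootstrap_ci, the example scores, and the choice of the percentile method are illustrative assumptions; statbelt's internal resampling scheme is not documented here and may differ.

```python
import random
import statistics


def bootstrap_ci(fold_scores, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of fold-level scores (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(fold_scores)
    # Resample the fold scores with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(fold_scores, k=n))
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(fold_scores), means[lo_idx], means[hi_idx]


# Made-up accuracy scores for 5 folds x 2 repeats (not real statbelt output).
scores = [0.97, 0.96, 0.98, 0.95, 0.97, 0.96, 0.99, 0.97, 0.96, 0.98]
mean, lower, upper = bootstrap_ci(scores)
print(f"accuracy: {mean:.4f} (CI {lower:.4f}, {upper:.4f})")
```

Because only 10 fold scores are resampled, the interval reflects variation across folds rather than across individual test samples.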
Inference Configuration
Use pairwise inference to compare models directly:
.compare_inference(method="paired_bootstrap", alternative="two-sided")
Supported values:
- method: paired_bootstrap, permutation
- alternative: two-sided, greater, less
How to choose alternative:
- two-sided: use when you only care whether A and B differ.
- greater: use when your question is "is A better than B?"
- less: use when your question is "is A worse than B?"
How to choose method:
- paired_bootstrap: default practical choice for CI + p-value style comparison.
- permutation: exact paired-randomization style null test over fold deltas.
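As a rough illustration of a paired-randomization test over fold deltas, the sketch below enumerates every sign flip of the per-fold differences and counts how often the flipped mean is at least as extreme as the observed one. The name sign_flip_p_value and the example deltas are assumptions for illustration; whether statbelt enumerates all flips or samples them is not stated in this README.

```python
from itertools import product
from statistics import fmean


def sign_flip_p_value(deltas, alternative="two-sided"):
    """Exact sign-flip p-value for paired fold deltas (hypothetical helper)."""
    observed = fmean(deltas)
    tol = 1e-12  # guards against floating-point ties
    hits = 0
    total = 0
    # Under the null, each fold delta is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(deltas)):
        stat = fmean([s * d for s, d in zip(signs, deltas)])
        if alternative == "greater":
            extreme = stat >= observed - tol
        elif alternative == "less":
            extreme = stat <= observed + tol
        elif alternative == "two-sided":
            extreme = abs(stat) >= abs(observed) - tol
        else:
            raise ValueError(f"unknown alternative: {alternative!r}")
        hits += extreme
        total += 1
    return hits / total


# Made-up per-fold deltas (model A score minus model B score).
deltas = [0.01, 0.02, 0.015, 0.01]
print(sign_flip_p_value(deltas, "greater"))
print(sign_flip_p_value(deltas, "two-sided"))
```

Full enumeration costs 2^n evaluations, so it is only practical for small fold counts such as 5 folds x 2 repeats.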
Interpretation details:
- Pairwise rows are always model_a vs model_b; A/B come from compare(...) order.
- delta in the report is raw metric-space model_a - model_b.
- One-sided p-values are metric-direction normalized, so greater/less keep the same meaning across mixed metrics (for example, both accuracy and log_loss).
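One way to picture the direction normalization described above: the reported delta stays in raw metric space, while the one-sided test uses a sign-adjusted copy in which positive always means "A is better". The HIGHER_IS_BETTER table and both helper names are assumptions for illustration, not statbelt internals.

```python
# Assumed orientation table: True means larger values are better.
HIGHER_IS_BETTER = {"accuracy": True, "roc_auc": True, "log_loss": False}


def raw_delta(score_a, score_b):
    """Raw metric-space difference, as reported: model_a - model_b."""
    return score_a - score_b


def oriented_delta(metric, score_a, score_b):
    """Sign-adjusted difference: positive means model A is better on this metric."""
    delta = raw_delta(score_a, score_b)
    return delta if HIGHER_IS_BETTER[metric] else -delta


# A has higher accuracy and lower log_loss, so it is better on both metrics.
print(raw_delta(0.97, 0.95), oriented_delta("accuracy", 0.97, 0.95))
print(raw_delta(0.08, 0.18), oriented_delta("log_loss", 0.08, 0.18))
```

With this convention, alternative="greater" asks the same question for accuracy and for log_loss, even though their raw deltas have opposite signs.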
Quick example:
report = (
    ExperimentalHarness()
    .data(X, y)
    .task("binary_classification")
    .compare(("candidate", candidate_model), ("baseline", baseline_model))
    .metrics("accuracy", "log_loss")
    .compare_inference(method="paired_bootstrap", alternative="greater")
    .fasten()
    .evaluate()
)
# Here, p-values answer: "is candidate better than baseline?"
Control multiple testing with:
.multiplicity(method="holm", family="global")
Supported values:
- method: holm, bonferroni, fdr_bh
- family: global, per_metric
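For readers unfamiliar with Holm's step-down correction, here is a small standard-library sketch. It also shows one plausible reading of the family option: global pools every p-value into one family, while per_metric corrects each metric's p-values separately. That reading, the helper name holm_adjust, and the p-values are assumptions, not confirmed statbelt behavior.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up raw p-values grouped by metric.
raw = {"accuracy": [0.01, 0.04], "log_loss": [0.03]}

# Assumed "global" family: one correction across all metrics.
flat = [p for ps in raw.values() for p in ps]
global_adjusted = holm_adjust(flat)

# Assumed "per_metric" family: a separate correction within each metric.
per_metric_adjusted = {metric: holm_adjust(ps) for metric, ps in raw.items()}

print(global_adjusted)
print(per_metric_adjusted)
```

The global family is the more conservative of the two because every p-value competes against the full set of comparisons.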
Practical Significance and Guardrails
Practical thresholds:
.practical_significance(accuracy=0.005, log_loss=0.01)
Guardrails against a baseline:
.baseline("logreg")
.guardrails(min_improvement={"accuracy": 0.002}, confidence=0.95)
Rules:
- Threshold values must be finite and non-negative.
- Guardrails require baseline(...).
- Guardrail metrics must also be included in .metrics(...).
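The three rules above could be checked along these lines. This is a hypothetical validator written for illustration; the function name, argument names, and error messages are assumptions and do not reflect statbelt's actual implementation.

```python
import math


def validate_guardrails(min_improvement, metrics, baseline):
    """Raise ValueError if a guardrail config breaks one of the documented rules."""
    if baseline is None:
        raise ValueError("guardrails require a baseline model")
    for metric, threshold in min_improvement.items():
        if metric not in metrics:
            raise ValueError(f"guardrail metric {metric!r} is not in metrics")
        # Rejects NaN, infinities, and negative thresholds.
        if not math.isfinite(threshold) or threshold < 0:
            raise ValueError(f"threshold for {metric!r} must be finite and non-negative")


# A valid configuration passes silently.
validate_guardrails({"accuracy": 0.002}, metrics=("accuracy", "log_loss"), baseline="logreg")

# An invalid one fails before any model is trained.
try:
    validate_guardrails({"roc_auc": 0.002}, metrics=("accuracy",), baseline="logreg")
except ValueError as exc:
    print(f"rejected: {exc}")
```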
Report and Export API
evaluate() returns an EvaluationReport with:
- models: per-model metric intervals
- pairwise: pairwise deltas, CIs, raw/adjusted p-values, practical-significance flags
- guardrails: per-check pass/fail and aggregate overall_pass
- splits and split_metadata: deterministic split definitions
Export helpers:
report.to_json("report.json")
report.to_dataframe(kind="models")
report.to_dataframe(kind="pairwise")
Lockfile Schema
fasten() writes schema version 2 lockfiles, including:
- design: cv, cv_repeats, random_state
- inference config: alpha, bootstrap_resamples, pairwise_inference
- multiplicity config
- practical-significance and guardrail config
- split indices with repeat/fold metadata
Supported Task and Metrics
Supported task:
binary_classification
Supported metrics:
- accuracy
- precision
- recall
- f1
- roc_auc
- log_loss
Validation is fail-fast. For example:
- log_loss requires predict_proba.
- roc_auc requires predict_proba or decision_function.
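A fail-fast capability check of this kind can be sketched with plain attribute lookups. The function check_metric_support and the stand-in model classes are hypothetical; they only illustrate the two requirements listed above.

```python
def check_metric_support(model, metric):
    """Raise ValueError early if the model cannot produce the scores a metric needs."""
    has_proba = hasattr(model, "predict_proba")
    has_margin = hasattr(model, "decision_function")
    if metric == "log_loss" and not has_proba:
        raise ValueError("log_loss requires predict_proba")
    if metric == "roc_auc" and not (has_proba or has_margin):
        raise ValueError("roc_auc requires predict_proba or decision_function")


class MarginModel:
    """Stand-in for an estimator exposing only decision_function."""

    def decision_function(self, X):
        return [0.0 for _ in X]


class LabelOnlyModel:
    """Stand-in for an estimator exposing only predict."""

    def predict(self, X):
        return [0 for _ in X]


check_metric_support(MarginModel(), "roc_auc")  # fine: margins are enough for ROC AUC

for model, metric in [(MarginModel(), "log_loss"), (LabelOnlyModel(), "roc_auc")]:
    try:
        check_metric_support(model, metric)
    except ValueError as exc:
        print(f"{type(model).__name__}: {exc}")
```

Running such checks before any fold is fitted avoids discovering an unsupported metric only after a long evaluation.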
Development
uv sync --all-groups
uv run ruff check .
uv run pytest
For release operations (tagging, TestPyPI gate, PyPI publish), see RELEASING.md.
Current Limits
- Binary classification only.
License
This project is licensed under the GNU Affero General Public License, version 3
or later (AGPL-3.0-or-later). See LICENSE.