Quantify uncertainty around classification performance metrics
Classifier Uncertainty
About
This package implements methods from Tötsch N and Hoffmann D. 2021 to quantify the uncertainty around classification performance metrics. Classifiers are often tested on relatively small data sets, so the resulting performance metrics are uncertain. Even for large test sets, performance is often reported as a percentage with three decimals, and competing classifiers are ranked as if that precision were real. Reducing metric uncertainty below 0.001% would require tens of billions of data points.
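To get a feel for the "tens of billions" figure, here is a rough back-of-the-envelope sketch. It is not the Bayesian method from the paper: it assumes a plain normal approximation for a proportion and reads "0.001%" as a 95% interval with a full width of 1e-5. The helper name `required_sample_size` is made up for this illustration.

```python
from statistics import NormalDist


def required_sample_size(p: float, interval_width: float, level: float = 0.95) -> float:
    """Approximate n so that a normal-approximation interval around a
    proportion p has the requested full width (assumption: binomial data)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    # width = 2 * z * sqrt(p * (1 - p) / n)  =>  solve for n
    return (2 * z / interval_width) ** 2 * p * (1 - p)


# 0.001 percentage points expressed as a proportion is 1e-5
for p in (0.5, 0.9, 0.99):
    print(f"p={p}: n ≈ {required_sample_size(p, 1e-5):.2e}")
```

The required sample size shrinks as the metric approaches 0 or 1, but under these assumptions it stays in the billions for the values shown.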
The original authors' Python implementation is available at niklastoe/classifier_metric_uncertainty. This package was built independently and extends that work with:
- Score-based input: accepts raw (y_true, y_score) pairs and sweeps thresholds; the original takes confusion matrix counts only
- ROC and PR curves with uncertainty bands, including AUC posterior distributions
- Economic value analysis: Value Score (Wilks 2001) and mean expense posteriors
- Custom metrics: evaluate any f(tp, fn, tn, fp) over the posterior CM samples
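The idea of evaluating a custom metric over posterior confusion matrix samples can be sketched without the package. The snippet below is an illustration, not the package's internals: it assumes independent flat Beta(1, 1) priors on prevalence, TPR and TNR, and the names `sample_confusion_matrices` and `f1` are hypothetical.

```python
import random


def sample_confusion_matrices(tp, fn, tn, fp, n_samples=10_000, seed=0):
    """Draw posterior confusion matrices as proportions (tp, fn, tn, fp).

    Assumption: flat Beta(1, 1) priors on prevalence, TPR and TNR.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        prevalence = rng.betavariate(tp + fn + 1, tn + fp + 1)
        tpr = rng.betavariate(tp + 1, fn + 1)
        tnr = rng.betavariate(tn + 1, fp + 1)
        samples.append((
            prevalence * tpr,               # true positives
            prevalence * (1 - tpr),         # false negatives
            (1 - prevalence) * tnr,         # true negatives
            (1 - prevalence) * (1 - tnr),   # false positives
        ))
    return samples


def f1(tp, fn, tn, fp):
    """Example custom metric with the f(tp, fn, tn, fp) signature."""
    return 2 * tp / (2 * tp + fp + fn)


cm_samples = sample_confusion_matrices(tp=26, fn=0, tn=6, fp=2)
f1_samples = sorted(f1(*cm) for cm in cm_samples)

# Equal-tailed 95% interval read off the sorted samples
lower = f1_samples[int(0.025 * len(f1_samples))]
upper = f1_samples[int(0.975 * len(f1_samples)) - 1]
print(f"F1 95% interval: [{lower:.3f}, {upper:.3f}]")
```

Because each metric value is computed from one joint draw, several metrics evaluated on the same samples stay paired, which is what makes joint statements such as "precision and recall both above 0.8" possible.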
Installation
pip install classifier-uncertainty
Quick start
from classifier_uncertainty import BinaryClassifier
# From ground-truth labels and classifier scores
bc = BinaryClassifier(y_true, y_score)
# Or from published confusion matrix counts (e.g. from a paper)
bc = BinaryClassifier.from_cm(tp=26, fn=0, tn=6, fp=2)
# fix the binarization threshold
t = bc.at_threshold(0.5)
What questions can this answer?
How well is a classifier likely to perform on a new, similar dataset?
t.tpr().point_estimate, t.tpr().credible_interval()
How will performance change if prevalence changes?
t.precision().point_estimate # at observed prevalence
t.at_prevalence(0.05).precision().point_estimate # projected to production
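Why precision changes with prevalence can be shown with point estimates alone. The sketch below uses the standard identity for precision in terms of TPR, TNR and prevalence, with the confusion matrix counts from the quick start; the function `precision_at_prevalence` is a hypothetical helper, not part of the package API.

```python
def precision_at_prevalence(tpr: float, tnr: float, prevalence: float) -> float:
    """Precision implied by fixed TPR and TNR at a given prevalence."""
    true_pos = tpr * prevalence
    false_pos = (1 - tnr) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


tp, fn, tn, fp = 26, 0, 6, 2
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
observed_prevalence = (tp + fn) / (tp + fn + tn + fp)

# At the observed prevalence this reproduces tp / (tp + fp)
print(precision_at_prevalence(tpr, tnr, observed_prevalence))
# At 5% prevalence the same classifier yields far more false alarms per hit
print(precision_at_prevalence(tpr, tnr, 0.05))
```

TPR and TNR are treated as properties of the classifier that carry over to the new population; only the class balance is changed.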
How likely is classifier A better than classifier B on a given metric?
(bc_a.at_threshold().tpr().samples > bc_b.at_threshold().tpr().samples).mean()
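The comparison above boils down to counting how often one posterior draw exceeds another. A minimal standalone sketch, assuming flat Beta(1, 1) priors, two classifiers evaluated on independent test sets, and made-up counts (45 of 50 versus 40 of 50 positives detected):

```python
import random


def tpr_posterior_samples(tp, fn, n_samples=20_000, seed=0):
    """Posterior TPR draws under a flat prior: Beta(tp + 1, fn + 1)."""
    rng = random.Random(seed)
    return [rng.betavariate(tp + 1, fn + 1) for _ in range(n_samples)]


# Hypothetical counts for two classifiers
samples_a = tpr_posterior_samples(tp=45, fn=5, seed=1)
samples_b = tpr_posterior_samples(tp=40, fn=10, seed=2)

# Fraction of paired draws in which A has the higher TPR
prob_a_better = sum(a > b for a, b in zip(samples_a, samples_b)) / len(samples_a)
print(f"P(TPR_A > TPR_B) ≈ {prob_a_better:.3f}")
```

If both classifiers were scored on the same test set, their errors are correlated and this independent-draw comparison is only an approximation.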
How likely is this model more cost-effective than business-as-usual?
(t_model.mean_expense(C, L).samples < t_bau.mean_expense(C, L).samples).mean()
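For readers unfamiliar with the cost-loss setting, the sketch below shows the textbook formulation on point estimates. It assumes that acting on a positive prediction costs C and a missed event costs L; whether the package's `mean_expense(C, L)` uses exactly this definition is not stated here, so treat the functions below as hypothetical stand-ins.

```python
def mean_expense(tp, fn, tn, fp, cost, loss):
    """Expected expense per case; inputs are confusion matrix proportions.

    Assumption: every predicted positive incurs `cost`, every miss incurs `loss`.
    """
    return cost * (tp + fp) + loss * fn


def value_score(tp, fn, tn, fp, cost, loss):
    """Value relative to the best no-model policy (0) and a perfect model (1)."""
    prevalence = tp + fn
    me_model = mean_expense(tp, fn, tn, fp, cost, loss)
    me_baseline = min(cost, prevalence * loss)  # always act vs. never act
    me_perfect = prevalence * cost              # act only when needed
    return (me_baseline - me_model) / (me_baseline - me_perfect)


counts = (26, 0, 6, 2)  # tp, fn, tn, fp from the quick start
total = sum(counts)
proportions = [c / total for c in counts]

vs = value_score(*proportions, cost=1.0, loss=5.0)
print(f"Value Score: {vs:.2f}")
```

Applying the same function to every posterior confusion matrix sample, instead of to the point estimate, would give a posterior over the expense rather than a single number.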
Does this classifier meet my minimum recall requirement?
(t.tpr().samples > 0.8).mean()
Do precision and recall meet requirements simultaneously?
((t.tpr().samples > 0.8) & (t.precision().samples > 0.8)).mean()
Is this classifier better than random guessing?
(t.bookmaker_informedness().samples > 0).mean()
Should I trust this published result?
BinaryClassifier.from_cm(tp=26, fn=0, tn=6, fp=2).at_threshold().tpr().credible_interval()
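For this particular result (26 of 26 positives detected) the interval can be checked by hand. The sketch assumes a flat Beta(1, 1) prior and an equal-tailed 95% interval; the package may use a different prior or interval type, so its numbers need not match exactly.

```python
def beta_a1_quantile(q: float, a: float) -> float:
    """Quantile of Beta(a, 1), whose CDF is x ** a."""
    return q ** (1 / a)


tp, fn = 26, 0
# Flat prior -> posterior Beta(tp + 1, fn + 1) = Beta(27, 1)
a = tp + 1
lower = beta_a1_quantile(0.025, a)
upper = beta_a1_quantile(0.975, a)
print(f"TPR 95% interval: [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions the lower bound is about 0.87, so a reported recall of 100% on 26 positives is compatible with a true recall well below that.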
For Developers
Setup
uv sync # install package + dev dependencies into .venv
Development workflow
All changes should be made on a branch and merged via pull request — do not commit directly to main.
git checkout -b feat/my-feature # or fix/, docs/, refactor/, etc.
# ... make changes ...
make format # auto-fix formatting and lint violations
make check # lint, type-check, and verify docstring coverage
make test # run tests with coverage (90% minimum)
make docs-serve # preview docs locally at http://127.0.0.1:8000
git push -u origin feat/my-feature
# open a pull request on GitHub
CI runs make check and make test automatically on every push and pull request. A PR cannot be merged if CI fails.
What triggers what
| Action | CI checks | Docs deployed | Package published |
|---|---|---|---|
| Push to any branch / open PR | ✓ | | |
| Merge to main | ✓ | ✓ | |
| Push a v* tag | | | ✓ |
Docs-only change (e.g. fix a typo in docs/ or a docstring): open a PR and merge to main — docs redeploy automatically, no tag needed.
Code-only change (e.g. bug fix): merge to main, then tag when ready to publish (see below). Docs will also redeploy on merge, reflecting any updated docstrings.
Publishing a new package version
- Bump the version in pyproject.toml:
make patch # 0.1.0 → 0.1.1 (bug fixes)
make minor # 0.1.0 → 0.2.0 (new features)
make major # 0.1.0 → 1.0.0 (breaking changes)
- Commit, tag, and push:
git add pyproject.toml
git commit -m "chore: bump version to v0.x.x"
git tag v0.x.x
git push && git push --tags
Pushing the tag triggers the publish workflow, which runs the test suite and publishes the package to PyPI. Check that the release appeared on PyPI.