osr-metrics
Open-Set Recognition (OSR) and OOD-detection metrics for machine-learning research.
A small, framework-agnostic Python library that bundles the metrics needed for credible OSR / OOD-detection publications, with consistent score-direction conventions and first-principles-verified formulas.
What's inside
| Group | Metrics |
|---|---|
| OOD detection | auroc, fpr_at_tpr, fpr_at_95tpr, aupr_in, aupr_out |
| Open-Set Recognition | compute_aoscr (canonical Dhamija/Vaze), oscr_curve, compute_nf_rejection_at_tpr |
| Multi-label classification | macro_auprc, macro_auprc_id_labels, macro_f1_with_thresholds, per_label_auprc, f1_per_label |
| Four-class OSR partitioning | build_fourclass_masks, compute_fourclass_metrics, partition_ood_by_purity |
| Calibration | expected_calibration_error, brier_score |
| Statistical comparison | delong_test (O(n log n) rank-based), bootstrap_ci (with optional stratification) |
All functions take plain numpy arrays and return scalars or simple
dictionaries — no PyTorch, TensorFlow, or framework lock-in.
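If your pipeline happens to be PyTorch, converting cached scores is a one-liner; a minimal sketch (torch is used here only to produce the example tensor, it is not a dependency of osr-metrics):

```python
import torch  # example only; osr_metrics itself has no torch dependency

logits = torch.randn(1000)              # per-sample scores from any model
scores = logits.detach().cpu().numpy()  # plain numpy array, ready for osr_metrics
```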
Scope
This library targets the semantic-shift setting (OSR / near-OOD / far-OOD): novel class labels appear at test time. Covariate shift (domain generalization), regression OOD, and continual / open-world learning are out of scope.
Capability matrix — which function for which setting?
Read across to find your setting; functions marked ✅ apply directly. ⚠ = applies with a small adapter (see footnote). ❌ = not applicable.
| Function | Multi-class (single-label) | Multi-label | Pure OOD detection | OSR (classify+reject) | Calibration | Statistical test |
|---|---|---|---|---|---|---|
| auroc | ✅ | ✅ | ✅ | — | — | — |
| fpr_at_tpr / fpr_at_95tpr | ✅ | ✅ | ✅ | — | — | — |
| aupr_in / aupr_out | ✅ | ✅ | ✅ | — | — | — |
| compute_aoscr / oscr_curve | ✅ | ⚠ ¹ | — | ✅ | — | — |
| compute_nf_rejection_at_tpr | ❌ | ✅ | — | ✅ ² | — | — |
| partition_ood_by_purity | ❌ | ✅ | — | ✅ ² | — | — |
| build_fourclass_masks / compute_fourclass_metrics | ❌ | ✅ | — | ✅ ² | — | — |
| macro_auprc / macro_auprc_id_labels | ❌ ³ | ✅ | — | — | — | — |
| per_label_auprc / f1_per_label | ❌ ³ | ✅ | — | — | — | — |
| macro_f1_with_thresholds | ❌ ³ | ✅ | — | — | — | — |
| expected_calibration_error | ⚠ ⁴ | ✅ | — | — | ✅ | — |
| brier_score | ⚠ ⁴ | ✅ | — | — | ✅ | — |
| delong_test | ✅ | ✅ | ✅ | ✅ | — | ✅ |
| bootstrap_ci | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
¹ Multi-label OSCR/AOSCR: pass an exact-match indicator (1 if all labels are predicted correctly, else 0) as class_predictions, with true_classes=ones(N). See the compute_aoscr docstring and the sketch below.
² Clinical / multi-label OSR helpers. They depend on a per-sample "No Finding" (all-zero label vector) indicator that has no analogue in multi-class single-label settings.
³ For multi-class single-label closed-set classification, use sklearn.metrics.accuracy_score and sklearn.metrics.f1_score(..., average='macro') directly. A native multi-class wrapper is on the roadmap.
⁴ Multi-class softmax calibration (the Guo 2017 form) is not yet implemented; the current functions flatten across (sample, label) pairs. For multi-class softmax, use sklearn.calibration.calibration_curve or torchmetrics.CalibrationError until the multi-class overload lands.
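A minimal sketch of the footnote-¹ adapter; all arrays are randomly generated for illustration, and the argument order follows the Quick start example below:

```python
import numpy as np
from osr_metrics import compute_aoscr

rng = np.random.default_rng(0)
N, L = 1000, 14
scores = rng.standard_normal(N)           # higher = more OOD
ood_labels = rng.integers(0, 2, N)        # 1 = OOD, 0 = ID
pred_labels = rng.integers(0, 2, (N, L))  # multi-label predictions
true_labels = rng.integers(0, 2, (N, L))  # multi-label ground truth

# Exact-match indicator: 1 iff every label of the sample is predicted correctly.
exact_match = (pred_labels == true_labels).all(axis=1).astype(int)
print("AOSCR:", compute_aoscr(scores, ood_labels, exact_match, np.ones(N, dtype=int)))
```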
Score-direction convention
For every OOD/novelty metric in this library, higher score = more OOD.
ID-positive metrics (aupr_in) handle the sign flip internally so you don't
have to.
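If your detector emits ID-confidence instead (e.g., maximum softmax probability, where higher means more in-distribution), negate it before passing. A minimal sketch with placeholder data:

```python
import numpy as np
from osr_metrics import auroc

probs = np.random.uniform(0, 1, (1000, 5))  # per-class confidences (illustrative)
labels = np.random.randint(0, 2, 1000)      # 1 = OOD, 0 = ID
msp = probs.max(axis=1)                     # max softmax probability: higher = more ID
ood_score = -msp                            # flip sign so higher = more OOD
print("AUROC:", auroc(ood_score, labels))
```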
Install
```bash
pip install osr-metrics
```
Requires Python 3.10+, numpy, scikit-learn, scipy.
Development install
```bash
git clone https://github.com/hxtruong6/osr-metrics.git
cd osr-metrics
pip install -e ".[dev]"
```
Quick start
```python
import numpy as np
from osr_metrics import auroc, fpr_at_95tpr, compute_aoscr, expected_calibration_error

# OOD detection
scores = np.random.randn(1000)          # higher = more OOD
labels = np.random.randint(0, 2, 1000)  # 1 = OOD, 0 = ID
print("AUROC:", auroc(scores, labels))
print("FPR@95TPR:", fpr_at_95tpr(scores, labels))

# Open-Set Classification Rate (joint classify+reject)
cls_pred = np.random.randint(0, 5, 1000)
cls_true = np.random.randint(0, 5, 1000)
print("AOSCR:", compute_aoscr(scores, labels, cls_pred, cls_true))

# Calibration
probs = np.random.uniform(0, 1, (1000, 14))
multi_labels = (np.random.uniform(0, 1, (1000, 14)) < probs).astype(int)
print("ECE:", expected_calibration_error(probs, multi_labels))
```
Statistical comparison
```python
from osr_metrics import delong_test, bootstrap_ci, auroc

# scores_method_a / scores_method_b: per-sample OOD scores from two methods
# evaluated on the same samples; labels as above (1 = OOD, 0 = ID).

# Pairwise AUROC comparison (DeLong 1988)
z, p = delong_test(scores_method_a, scores_method_b, labels)
print(f"DeLong z={z:.3f}, p={p:.4f}")

# Bootstrap CI (use stratify=True for imbalanced data)
lo, mean, hi = bootstrap_ci(scores, labels, auroc, n_bootstrap=1000, stratify=True)
print(f"AUROC = {mean:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```
Four-class OSR partitioning
For multi-label problems with held-out labels (chest X-ray OSR style):
```python
from osr_metrics import build_fourclass_masks, compute_fourclass_metrics

# label_vecs: (N, L) binary multi-label matrix aligned with label_names
label_names = ["A", "B", "C", "D"]
held_out = ["C", "D"]
metrics = compute_fourclass_metrics(scores, label_vecs, label_names, held_out)
# Returns: auroc_full, fpr95_full, auroc_pure, auroc_mixed,
#          auroc_mixed_vs_id_disease, auroc_nf_vs_pure,
#          auroc_disease_only, counts...
```
Partitions images into four mutually exclusive classes (a hand-rolled version of the same partition follows the list):

- id_disease: only known labels
- no_finding: all-zero label vector
- pure_ood: only held-out labels
- mixed_ood: both known + held-out labels
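A minimal numpy sketch of that partition, assuming label_vecs is an (N, L) binary matrix and held_idx holds the held-out label columns (illustrative names, not the library's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
label_vecs = rng.integers(0, 2, (1000, 4))  # (N, L) binary multi-label matrix
held_idx = [2, 3]                           # columns of held-out labels ("C", "D")

known = np.delete(label_vecs, held_idx, axis=1).any(axis=1)  # any known label present
held = label_vecs[:, held_idx].any(axis=1)                   # any held-out label present
no_finding = ~label_vecs.any(axis=1)                         # all-zero label vector
id_disease = known & ~held
pure_ood = held & ~known
mixed_ood = held & known
```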
Five AUROC pairings answer different questions:

| Key | Negatives | Positives | What it asks |
|---|---|---|---|
| auroc_pure | ID-disease + NF | Pure OOD | Upper-bound separability |
| auroc_mixed | ID-disease + NF | Mixed OOD | Mixed-OOD detection difficulty |
| auroc_mixed_vs_id_disease | ID-disease only | Mixed OOD | Near-OOD sensitivity (NF removed) |
| auroc_nf_vs_pure | NF only | Pure OOD | Diagnostic floor: healthy-vs-anything |
| auroc_full | ID-disease + NF | Pure + Mixed OOD | Full-population measurement |
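For example, the auroc_pure pairing can be cross-checked by hand from the masks in the partition sketch above, using the library's own auroc (an illustrative check, not the library's internals):

```python
from osr_metrics import auroc

scores = rng.standard_normal(1000)  # higher = more OOD; continues the sketch above
neg = id_disease | no_finding       # negatives: ID-disease + NF
pos = pure_ood                      # positives: pure OOD
sel = neg | pos                     # restrict to this pairing's population
print("auroc_pure (manual):", auroc(scores[sel], pos[sel].astype(int)))
```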
Why another metrics library?
Most OOD/OSR libraries (PyTorch-OOD, OpenOOD) couple metrics with detection
methods, datasets, and a heavy framework. osr-metrics is just the metrics —
useful when you want to compute AOSCR or DeLong on cached scores from any
pipeline, regardless of how those scores were produced.
Documentation
- docs/USAGE.md: a "which metric should I use?" decision tree.
- docs/EXAMPLES.md: end-to-end runnable examples, including the full publication metric panel, DeLong comparison, and seed aggregation.
- CHANGELOG.md: version history.
- CITATION.cff: citation metadata.
Testing
```bash
pytest tests/ -v
```
Each metric is verified against a first-principles brute-force reference; the test suite covers numerical equivalence, edge cases (empty class, single-value scores), and known properties (DeLong z=0 on identical inputs, ECE=0.9 on overconfident-wrong, etc.).
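For illustration, a brute-force AUROC reference of the kind described: the fraction of (OOD, ID) pairs ranked correctly, with ties counted as half. This is the standard pairwise definition, not the library's actual test code:

```python
import numpy as np

def auroc_bruteforce(scores: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of (OOD, ID) pairs where the OOD sample scores higher; ties count 0.5."""
    pos = scores[labels == 1]                 # OOD scores (higher = more OOD)
    neg = scores[labels == 0]                 # ID scores
    gt = (pos[:, None] > neg[None, :]).sum()  # correctly ordered pairs
    eq = (pos[:, None] == neg[None, :]).sum() # tied pairs
    return (gt + 0.5 * eq) / (len(pos) * len(neg))
```

The library's fast auroc is tested for numerical equivalence against references of exactly this shape.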
License
MIT.