Open-Set Recognition (OSR) and OOD-detection metrics for ML research

osr-metrics

Plain-numpy metrics for Open-Set Recognition and OOD-detection research — no PyTorch, no datasets, just the math.

Why osr-metrics?

Most OSR / OOD libraries (PyTorch-OOD, OpenOOD) couple metrics with detection methods, datasets, and a heavy framework. osr-metrics is just the metrics — useful when you have cached scores from any pipeline and want to compute AOSCR, FPR@95TPR, or DeLong on them, regardless of how those scores were produced.

  • Framework-agnostic: numpy arrays in, scalars or dicts out. No PyTorch / TensorFlow / dataset dependencies.
  • Verified formulas — every metric checked against a first-principles brute-force reference.
  • Consistent conventions — for every OOD/novelty score, higher = more OOD. ID-positive metrics (aupr_in) handle the sign flip internally.
  • Statistical rigor — DeLong (O(n log n) rank-based) and stratified bootstrap CIs are first-class, not afterthoughts.
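
To make the "brute-force reference" idea concrete, here is a minimal sketch of what such references could look like for AUROC and FPR@TPR under the higher = more OOD convention. The function names `auroc_pairwise` and `fpr_at_tpr_bruteforce` are hypothetical and are not part of osr-metrics; the library's own tie handling and interpolation may differ.

```python
def auroc_pairwise(scores, labels):
    """AUROC by pair counting: P(OOD score > ID score), ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # OOD = positive
    neg = [s for s, y in zip(scores, labels) if y == 0]  # ID = negative
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_tpr_bruteforce(scores, labels, tpr=0.95):
    """Smallest FPR over thresholds t (flag OOD when score >= t) with TPR >= tpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 1.0
    for t in sorted(set(scores)):
        tpr_t = sum(p >= t for p in pos) / len(pos)
        if tpr_t >= tpr:
            best = min(best, sum(n >= t for n in neg) / len(neg))
    return best


scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.8, 0.9, 0.7]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = OOD

auroc_value = auroc_pairwise(scores, labels)
fpr95_value = fpr_at_tpr_bruteforce(scores, labels, tpr=0.95)
print(auroc_value)  # 15 of 16 (OOD, ID) pairs are ordered correctly
print(fpr95_value)  # one of four ID samples scores above the threshold
```

Both functions are quadratic in the number of samples, which is acceptable for a reference check on small arrays but not for production use.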

What's inside

Group | Metrics
OOD detection | auroc, fpr_at_tpr, fpr_at_95tpr, aupr_in, aupr_out
Open-Set Recognition | compute_aoscr (canonical Dhamija/Vaze), compute_aoscr_multiclass, oscr_curve, compute_nf_rejection_at_tpr
Multi-class (single-label) classification | top1_accuracy, macro_f1_multiclass, balanced_accuracy
Multi-label classification | macro_auprc, macro_auprc_id_labels, macro_f1_with_thresholds, per_label_auprc, f1_per_label
Four-class OSR partitioning | build_fourclass_masks, compute_fourclass_metrics, partition_ood_by_purity
Calibration | expected_calibration_error, expected_calibration_error_multiclass, brier_score, brier_score_multiclass
Statistical comparison | delong_test (O(n log n) rank-based), bootstrap_ci (with optional stratification)
Selective prediction | rc_curve, aurc, eaurc, selective_risk_at_coverage, selective_accuracy_at_coverage, warn_if_inverted_aurc
Utilities | as_ood_scores (score-direction adapter), warn_if_inverted_scores, compute_panel (one-call publication panel)

All functions take plain numpy arrays and return scalars or simple dictionaries — no PyTorch, TensorFlow, or framework lock-in.

Scope

This library targets the semantic-shift setting (OSR / near-OOD / far-OOD): novel class labels appear at test time. Covariate shift (domain generalization), regression OOD, and continual / open-world learning are out of scope.

Capability matrix — which function for which setting?

Read across to find your setting; functions marked ✅ apply directly. ⚠ = applies with a small adapter (see footnote). ❌ = not applicable.

Function | Multi-class (single-label) | Multi-label | Pure OOD detection | OSR (classify+reject) | Calibration | Statistical test
auroc
fpr_at_tpr / fpr_at_95tpr
aupr_in / aupr_out
compute_aoscr / oscr_curve ⚠ ¹
compute_aoscr_multiclass
compute_nf_rejection_at_tpr ✅ ²
partition_ood_by_purity ✅ ²
build_fourclass_masks / compute_fourclass_metrics ✅ ²
top1_accuracy / macro_f1_multiclass / balanced_accuracy
macro_auprc / macro_auprc_id_labels
per_label_auprc / f1_per_label
macro_f1_with_thresholds
expected_calibration_error
expected_calibration_error_multiclass
brier_score
brier_score_multiclass
delong_test
bootstrap_ci
rc_curve / aurc / eaurc
selective_risk_at_coverage / selective_accuracy_at_coverage
warn_if_inverted_aurc
as_ood_scores / warn_if_inverted_scores
compute_panel

¹ Multi-label OSCR/AOSCR: pass an exact-match indicator (1 if all labels predicted correctly, else 0) as class_predictions with true_classes=ones(N). See compute_aoscr docstring. (For multi-class, use compute_aoscr_multiclass instead — it accepts logits or class-IDs directly.)
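
The exact-match adapter described in footnote ¹ can be sketched as follows. The helper name `exact_match_indicator` is hypothetical; only the argument names `class_predictions` and `true_classes` come from the footnote, and the sketch stops short of calling compute_aoscr, whose full signature is documented in its docstring.

```python
def exact_match_indicator(pred_label_vecs, true_label_vecs):
    """Return 1 for each sample whose whole predicted label vector matches, else 0."""
    return [int(list(p) == list(t)) for p, t in zip(pred_label_vecs, true_label_vecs)]


# Three samples, three labels each (multi-hot vectors).
pred = [[1, 0, 1], [0, 1, 0], [0, 0, 0]]
true = [[1, 0, 1], [0, 1, 1], [0, 0, 0]]

class_predictions = exact_match_indicator(pred, true)
true_classes = [1] * len(true)  # the ones(N) vector from the footnote

# A sample now counts as "correctly classified" exactly when its
# indicator equals 1, i.e. when every label was predicted correctly.
print(class_predictions)
print(true_classes)
```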

² Clinical / multi-label OSR helpers — depend on a per-sample "No Finding" (all-zero label vector) indicator that has no analogue in multi-class single-label settings.

Score-direction convention

For every OOD/novelty metric in this library, higher score = more OOD. ID-positive metrics (aupr_in) handle the sign flip internally so you don't have to.
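
A small sketch of why the direction matters, using a hypothetical pairwise AUROC helper (`auroc_pairwise` is not a library function). A confidence score such as max-softmax is higher for ID samples, so it has to be negated before it is treated as an OOD score. The last line illustrates the kind of flip an ID-positive metric needs, by swapping both labels and score sign; it is an illustration and not the library's internal code.

```python
def auroc_pairwise(scores, labels):
    """AUROC by pair counting: P(positive score > negative score), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


confidence = [0.99, 0.95, 0.90, 0.60, 0.70, 0.40, 0.30]  # higher = more ID
is_ood = [0, 0, 0, 0, 1, 1, 1]

# Wrong direction: confidence passed as if it were an OOD score.
inverted = auroc_pairwise(confidence, is_ood)

# Right direction: negate so that higher = more OOD.
ood_score = [-c for c in confidence]
adapted = auroc_pairwise(ood_score, is_ood)

# ID-positive view: flip the labels and the score sign together.
id_positive = auroc_pairwise([-s for s in ood_score], [1 - y for y in is_ood])

print(f"inverted={inverted:.3f} adapted={adapted:.3f} id_positive={id_positive:.3f}")
```

With no tied scores, the inverted value is one minus the adapted value, which is the usual symptom of a direction mistake.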

Install

pip install osr-metrics

Requires Python 3.10–3.14, numpy, scikit-learn, scipy. CI runs on Python 3.10, 3.11, 3.12, 3.13, and 3.14.

Development install

git clone https://github.com/hxtruong6/osr-metrics.git
cd osr-metrics
pip install -e .[dev]

Quick start

import numpy as np
from osr_metrics import (
    auroc, fpr_at_95tpr,
    compute_aoscr_multiclass,
    expected_calibration_error_multiclass,
)

rng = np.random.default_rng(0)

# OOD detection: 800 ID points, 200 OOD points with shifted score distribution.
id_scores  = rng.normal(0.0, 1.0, size=800)   # ID:  N(0, 1)
ood_scores = rng.normal(2.0, 1.0, size=200)   # OOD: N(2, 1) — higher = more OOD
scores = np.concatenate([id_scores, ood_scores])
labels = np.concatenate([np.zeros(800), np.ones(200)])  # 1 = OOD
print(f"AUROC:     {auroc(scores, labels):.3f}")        # ~0.92
print(f"FPR@95TPR: {fpr_at_95tpr(scores, labels):.3f}") # ~0.36

# Open-Set Classification Rate: joint classify + reject, 80% closed-set accuracy.
n, k = 1000, 5
true_cls = rng.integers(0, k, size=n)
correct  = rng.random(n) < 0.80
pred_cls = np.where(correct, true_cls, (true_cls + 1) % k)
print(f"AOSCR:     {compute_aoscr_multiclass(scores, labels, pred_cls, true_cls):.3f}")

# Multi-class softmax calibration (Guo 2017 form).
probs = rng.dirichlet(np.ones(k) * 0.5, size=n)
probs[np.arange(n), true_cls] += 1.0          # bias toward the correct class
probs /= probs.sum(axis=1, keepdims=True)
print(f"ECE:       {expected_calibration_error_multiclass(probs, true_cls):.3f}")
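
For intuition about what the AOSCR line above measures, here is a brute-force sketch of the OSCR curve in the Dhamija et al. form: at each threshold, the correct classification rate (CCR) is the fraction of ID samples that are both accepted and correctly classified, and the FPR is the fraction of OOD samples that are accepted. The function name `aoscr_bruteforce`, the acceptance rule `score <= threshold`, and the trapezoidal integration are assumptions of this sketch; compute_aoscr_multiclass may treat ties and interpolation differently.

```python
def aoscr_bruteforce(ood_scores, is_ood, pred_cls, true_cls):
    """Area under the CCR-vs-FPR curve, accepting a sample when score <= threshold."""
    known = [(s, p == t)
             for s, y, p, t in zip(ood_scores, is_ood, pred_cls, true_cls) if y == 0]
    unknown = [s for s, y in zip(ood_scores, is_ood) if y == 1]

    points = [(0.0, 0.0)]  # nothing accepted yet
    for thr in sorted(set(ood_scores)):
        ccr = sum(ok for s, ok in known if s <= thr) / len(known)
        fpr = sum(s <= thr for s in unknown) / len(unknown)
        points.append((fpr, ccr))

    # Trapezoidal rule over consecutive (FPR, CCR) points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Four ID samples (one misclassified) followed by four OOD samples.
ood_scores = [0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.7, 0.8]
is_ood = [0, 0, 0, 0, 1, 1, 1, 1]
pred_cls = [0, 1, 2, 1, 0, 0, 0, 0]
true_cls = [0, 1, 0, 1, -1, -1, -1, -1]  # class IDs of OOD samples are ignored

aoscr_value = aoscr_bruteforce(ood_scores, is_ood, pred_cls, true_cls)
print(aoscr_value)
```

The curve ends at FPR = 1 with CCR equal to the closed-set accuracy, so AOSCR is bounded above by that accuracy.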

One-call publication panel

When you have all the inputs and just want the table:

from osr_metrics import compute_panel

# Multi-class
out = compute_panel(scores, ood_labels, probs=softmax_NK, y=y_N)

# Multi-label
out = compute_panel(
    scores, ood_labels,
    preds=preds_NK, probs=probs_NK,
    label_vecs=labels_NK, label_names=names, held_out_labels=held_out,
    setting="multilabel",
)

The panel infers your setting from input shapes and computes every metric whose required inputs are present.

Selective prediction (risk–coverage)

When you have per-sample losses (e.g. 0/1 for misclassification) and a confidence-style ranking score, report AURC and selective risk at a chosen coverage:

from osr_metrics import aurc, eaurc, selective_risk_at_coverage

# Convention: ood_score is "higher = more OOD" (i.e., reject first).
# If you have a confidence score, pass `-confidence`.
print(f"AURC:  {aurc(ood_score, loss):.4f}")
print(f"E-AURC: {eaurc(ood_score, loss):.4f}")
print(f"Risk@95% coverage: {selective_risk_at_coverage(ood_score, loss, 0.95):.4f}")

compute_panel(..., loss=loss) adds these to the publication panel. See docs/USAGE.md and docs/PITFALLS.md for score-direction guidance.
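
The quantities above can be sketched from first principles. This version takes AURC as the mean selective risk over all coverages k/n and E-AURC as AURC minus the AURC of an oracle ranking, following the usual definitions from the selective-prediction literature. The `_bruteforce` names are hypothetical, and the library may integrate the curve differently.

```python
def rc_curve_bruteforce(ood_score, loss):
    """Selective risk at coverage k/n, keeping the k samples with the lowest OOD score."""
    order = sorted(range(len(loss)), key=lambda i: ood_score[i])  # keep-first order
    coverages, risks, running = [], [], 0.0
    for k, i in enumerate(order, start=1):
        running += loss[i]
        coverages.append(k / len(loss))
        risks.append(running / k)
    return coverages, risks


def aurc_bruteforce(ood_score, loss):
    """Mean selective risk across all coverages."""
    _, risks = rc_curve_bruteforce(ood_score, loss)
    return sum(risks) / len(risks)


def eaurc_bruteforce(ood_score, loss):
    """Excess AURC over an oracle that rejects the highest-loss samples first."""
    oracle = aurc_bruteforce(loss, loss)  # ranking by the loss itself
    return aurc_bruteforce(ood_score, loss) - oracle


ood_score = [0.1, 0.2, 0.3, 0.4]  # higher = reject first
loss = [0, 1, 0, 1]               # 0/1 misclassification loss

aurc_value = aurc_bruteforce(ood_score, loss)
eaurc_value = eaurc_bruteforce(ood_score, loss)
print(f"AURC={aurc_value:.4f}  E-AURC={eaurc_value:.4f}")
```

A negative or near-zero E-AURC on real data is not expected; a large AURC combined with a good classifier often points to an inverted score direction.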

Statistical comparison

from osr_metrics import delong_test, bootstrap_ci, auroc

# Pairwise AUROC comparison (DeLong 1988)
z, p = delong_test(scores_method_a, scores_method_b, labels)
print(f"DeLong z={z:.3f}, p={p:.4f}")

# Bootstrap CI (use stratify=True for imbalanced data)
lo, mean, hi = bootstrap_ci(scores, labels, auroc, n_bootstrap=1000, stratify=True)
print(f"AUROC = {mean:.4f}  95% CI = [{lo:.4f}, {hi:.4f}]")
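
To show what stratification changes, here is a minimal percentile-bootstrap sketch that resamples ID and OOD samples separately, so every replicate keeps the original class counts and never ends up with a single class. The function `stratified_bootstrap_ci`, its `alpha` and `seed` parameters, and the percentile method are assumptions of this sketch; only the (lo, mean, hi) return order mirrors the snippet above.

```python
import random


def auroc_pairwise(scores, labels):
    """AUROC by pair counting, ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(scores, labels, metric, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile CI with resampling done within each class."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for _ in range(n_bootstrap):
        # Sample with replacement inside each class, then pool the indices.
        idx = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        stats.append(metric([scores[i] for i in idx], [labels[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_bootstrap - 1))]
    hi = stats[int((1 - alpha / 2) * (n_bootstrap - 1))]
    return lo, sum(stats) / n_bootstrap, hi


# Imbalanced toy data: 80 ID samples around 0, 20 OOD samples around 2.
data_rng = random.Random(1)
scores = [data_rng.gauss(0.0, 1.0) for _ in range(80)]
scores += [data_rng.gauss(2.0, 1.0) for _ in range(20)]
labels = [0] * 80 + [1] * 20

lo, mean, hi = stratified_bootstrap_ci(scores, labels, auroc_pairwise, n_bootstrap=200)
print(f"AUROC = {mean:.4f}  95% CI = [{lo:.4f}, {hi:.4f}]")
```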

Four-class OSR partitioning

For multi-label problems with held-out labels (chest X-ray OSR style):

from osr_metrics import build_fourclass_masks, compute_fourclass_metrics

label_names = ["A", "B", "C", "D"]
held_out = ["C", "D"]
metrics = compute_fourclass_metrics(scores, label_vecs, label_names, held_out)
# Returns: auroc_full, fpr95_full, auroc_pure, auroc_mixed,
#          auroc_mixed_vs_id_disease, auroc_nf_vs_pure,
#          auroc_disease_only, counts...

Each image falls into exactly one of four mutually exclusive classes:

  • id_disease — only known labels
  • no_finding — all-zero label vector
  • pure_ood — only held-out labels
  • mixed_ood — both known + held-out labels
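
The four definitions above translate directly into a per-sample rule. The function `partition_fourclass` below is a hypothetical sketch written from those definitions; it is not the implementation of build_fourclass_masks.

```python
def partition_fourclass(label_vec, label_names, held_out):
    """Assign one multi-hot label vector to one of the four classes."""
    held = set(held_out)
    has_known = any(v and name not in held for v, name in zip(label_vec, label_names))
    has_held = any(v and name in held for v, name in zip(label_vec, label_names))
    if has_known and has_held:
        return "mixed_ood"   # both known and held-out labels
    if has_held:
        return "pure_ood"    # only held-out labels
    if has_known:
        return "id_disease"  # only known labels
    return "no_finding"      # all-zero label vector


label_names = ["A", "B", "C", "D"]
held_out = ["C", "D"]
label_vecs = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
]

groups = [partition_fourclass(v, label_names, held_out) for v in label_vecs]
print(groups)
```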

Five AUROC pairings answer different questions:

Key | Negatives | Positives | What it asks
auroc_pure | ID-disease + NF | Pure OOD | Upper-bound separability
auroc_mixed | ID-disease + NF | Mixed OOD | Mixed-OOD detection difficulty
auroc_mixed_vs_id_disease | ID-disease only | Mixed OOD | Near-OOD sensitivity (NF removed)
auroc_nf_vs_pure | NF only | Pure OOD | Diagnostic floor: healthy-vs-anything
auroc_full | ID-disease + NF | Pure + Mixed OOD | Full population measurement
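
Each pairing is an ordinary AUROC computed on a subset of the samples. The sketch below encodes the table as a dictionary of (negatives, positives) group sets and evaluates every pairing with a hypothetical pairwise AUROC; the names `PAIRINGS` and `pairing_aurocs` are illustrative and not library API.

```python
PAIRINGS = {
    "auroc_pure": ({"id_disease", "no_finding"}, {"pure_ood"}),
    "auroc_mixed": ({"id_disease", "no_finding"}, {"mixed_ood"}),
    "auroc_mixed_vs_id_disease": ({"id_disease"}, {"mixed_ood"}),
    "auroc_nf_vs_pure": ({"no_finding"}, {"pure_ood"}),
    "auroc_full": ({"id_disease", "no_finding"}, {"pure_ood", "mixed_ood"}),
}


def pairing_aurocs(scores, groups):
    """Pairwise AUROC (higher score = more OOD) for each pairing in the table."""
    out = {}
    for key, (neg_groups, pos_groups) in PAIRINGS.items():
        neg = [s for s, g in zip(scores, groups) if g in neg_groups]
        pos = [s for s, g in zip(scores, groups) if g in pos_groups]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        out[key] = wins / (len(pos) * len(neg))
    return out


scores = [0.2, 0.5, 0.1, 0.3, 0.9, 0.8, 0.4, 0.6]
groups = ["id_disease", "id_disease", "no_finding", "no_finding",
          "pure_ood", "pure_ood", "mixed_ood", "mixed_ood"]

results = pairing_aurocs(scores, groups)
for key, value in results.items():
    print(f"{key}: {value:.4f}")
```

In this toy data the mixed-OOD pairings score lower than the pure-OOD ones, which is the pattern the table's "difficulty" and "sensitivity" rows are meant to expose.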

Documentation

  • docs/CONCEPTS.md — glossary: ID/OOD, OSR, semantic vs covariate shift, near vs far OOD, multi-class vs multi-label.
  • docs/USAGE.md — "which metric should I use?" decision tree.
  • docs/PITFALLS.md — the eight most common mistakes, with bad-vs-good code side by side.
  • docs/EXAMPLES.md — end-to-end runnable examples including the full publication metric panel, DeLong comparison, and seed aggregation.
  • REFERENCES.md — bibliographic source for every metric.
  • CHANGELOG.md — version history.
  • CITATION.cff — citation metadata.

Testing

pytest tests/ -v

Each metric is verified against a first-principles brute-force reference; the test suite covers numerical equivalence, edge cases (empty class, single-value scores), and known properties (DeLong z=0 on identical inputs, ECE=0.9 on overconfident-wrong, etc.).
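
As an illustration of such a known-property check, the sketch below implements a simple equal-width-bin ECE and confirms that predictions made with confidence 0.9 that are all wrong give ECE = 0.9. The function `ece_bruteforce`, its bin-boundary convention, and this particular construction of the overconfident-wrong case are assumptions; the library's tests may set the case up differently.

```python
def ece_bruteforce(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin weight) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Overconfident and wrong: confidence 0.9 on every sample, none correct.
overconfident = ece_bruteforce([0.9] * 5, [0] * 5)

# Fully confident and always right: no calibration gap.
calibrated = ece_bruteforce([1.0] * 5, [1] * 5)

print(f"overconfident-wrong ECE = {overconfident:.3f}")
print(f"confident-correct ECE   = {calibrated:.3f}")
```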

Citation

If osr-metrics is useful in your research, please cite it:

@software{osr_metrics,
  author  = {Hoang Xuan Truong},
  title   = {osr-metrics: Open-Set Recognition and OOD-Detection Metrics for ML Research},
  url     = {https://github.com/hxtruong6/osr-metrics},
  year    = {2026}
}

Machine-readable metadata is in CITATION.cff. When citing a specific version, append version = {X.Y.Z} and reference the matching GitHub Release.

License

MIT.

Download files

Source Distribution

osr_metrics-0.3.1.tar.gz (51.4 kB)

Built Distribution

osr_metrics-0.3.1-py3-none-any.whl (33.3 kB)
