
osr-metrics


Open-Set Recognition (OSR) and OOD-detection metrics for machine-learning research.

A small, framework-agnostic Python library that bundles the metrics needed for credible OSR / OOD-detection publications, with consistent score-direction conventions and first-principles-verified formulas.

What's inside

  • OOD detection: auroc, fpr_at_tpr, fpr_at_95tpr, aupr_in, aupr_out
  • Open-Set Recognition: compute_aoscr (canonical Dhamija/Vaze), oscr_curve, compute_nf_rejection_at_tpr
  • Multi-label classification: macro_auprc, macro_auprc_id_labels, macro_f1_with_thresholds, per_label_auprc, f1_per_label
  • Four-class OSR partitioning: build_fourclass_masks, compute_fourclass_metrics, partition_ood_by_purity
  • Calibration: expected_calibration_error, brier_score
  • Statistical comparison: delong_test (O(n log n) rank-based), bootstrap_ci (with optional stratification)

All functions take plain numpy arrays and return scalars or simple dictionaries — no PyTorch, TensorFlow, or framework lock-in.

Score-direction convention

For every OOD/novelty metric in this library, higher score = more OOD. ID-positive metrics (aupr_in) handle the sign flip internally so you don't have to.
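
If your detector outputs ID-confidence instead (e.g., max softmax probability, where higher means more in-distribution), negate it before calling any metric. A minimal sketch:

import numpy as np

msp = np.random.uniform(0, 1, 1000)  # max softmax probability: higher = more ID
ood_scores = -msp                    # flip sign so higher = more OOD, as the library expects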

Install

pip install osr-metrics

Requires Python 3.10+, numpy, scikit-learn, scipy.

Development install

git clone https://github.com/hxtruong6/osr-metrics.git
cd osr-metrics
pip install -e .[dev]

Quick start

import numpy as np
from osr_metrics import auroc, fpr_at_95tpr, compute_aoscr, expected_calibration_error

# OOD detection
scores = np.random.randn(1000)          # higher = more OOD
labels = np.random.randint(0, 2, 1000)  # 1 = OOD, 0 = ID
print("AUROC:", auroc(scores, labels))
print("FPR@95TPR:", fpr_at_95tpr(scores, labels))

# Open-Set Classification Rate (joint classify+reject)
cls_pred = np.random.randint(0, 5, 1000)
cls_true = np.random.randint(0, 5, 1000)
print("AOSCR:", compute_aoscr(scores, labels, cls_pred, cls_true))

# Calibration
probs = np.random.uniform(0, 1, (1000, 14))
multi_labels = (np.random.uniform(0, 1, (1000, 14)) < probs).astype(int)
print("ECE:", expected_calibration_error(probs, multi_labels))

Statistical comparison

from osr_metrics import delong_test, bootstrap_ci, auroc

# Pairwise AUROC comparison (DeLong 1988). Toy scores for two methods
# evaluated on the same labels from the quick-start block:
scores_method_a = np.random.randn(1000)
scores_method_b = scores_method_a + 0.3 * np.random.randn(1000)
z, p = delong_test(scores_method_a, scores_method_b, labels)
print(f"DeLong z={z:.3f}, p={p:.4f}")

# Bootstrap CI (use stratify=True for imbalanced data)
lo, mean, hi = bootstrap_ci(scores, labels, auroc, n_bootstrap=1000, stratify=True)
print(f"AUROC = {mean:.4f}  95% CI = [{lo:.4f}, {hi:.4f}]")

Four-class OSR partitioning

For multi-label problems with held-out labels (chest X-ray OSR style):

from osr_metrics import build_fourclass_masks, compute_fourclass_metrics

label_names = ["A", "B", "C", "D"]
held_out = ["C", "D"]
label_vecs = np.random.randint(0, 2, (1000, 4))  # (N, num_labels) binary label matrix
metrics = compute_fourclass_metrics(scores, label_vecs, label_names, held_out)
# Returns: auroc_full, fpr95_full, auroc_pure, auroc_mixed,
#          auroc_mixed_vs_id_disease, auroc_nf_vs_pure,
#          auroc_disease_only, counts...

compute_fourclass_metrics partitions images into four mutually exclusive classes (a plain-numpy sketch of the mask logic follows the list):

  • id_disease — only known labels
  • no_finding — all-zero label vector
  • pure_ood — only held-out labels
  • mixed_ood — both known + held-out labels
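
A sketch of the partition logic, mirroring what build_fourclass_masks computes (its actual signature and return type may differ), using the variables from the snippet above:

held = np.isin(np.asarray(label_names), held_out)  # boolean mask over label columns
has_known = label_vecs[:, ~held].any(axis=1)
has_held = label_vecs[:, held].any(axis=1)

id_disease = has_known & ~has_held   # only known labels
no_finding = ~has_known & ~has_held  # all-zero label vector
pure_ood = ~has_known & has_held     # only held-out labels
mixed_ood = has_known & has_held     # both known and held-out labels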

Five AUROC pairings answer different questions:

  Key                        Negatives         Positives          What it asks
  auroc_pure                 ID-disease + NF   Pure OOD           Upper-bound separability
  auroc_mixed                ID-disease + NF   Mixed OOD          Mixed-OOD detection difficulty
  auroc_mixed_vs_id_disease  ID-disease only   Mixed OOD          Near-OOD sensitivity (NF removed)
  auroc_nf_vs_pure           NF only           Pure OOD           Diagnostic floor: healthy vs. anything
  auroc_full                 ID-disease + NF   Pure + Mixed OOD   Full-population measurement
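
Each pairing is just AUROC restricted to the relevant subset. Using the sketch masks above, auroc_nf_vs_pure would look like:

sel = no_finding | pure_ood                 # keep only NF and pure-OOD samples
pair_labels = pure_ood[sel].astype(int)     # 1 = pure OOD, 0 = no finding
print("auroc_nf_vs_pure:", auroc(scores[sel], pair_labels))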

Why another metrics library?

Most OOD/OSR libraries (PyTorch-OOD, OpenOOD) couple metrics with detection methods, datasets, and a heavy framework. osr-metrics is just the metrics — useful when you want to compute AOSCR or DeLong on cached scores from any pipeline, regardless of how those scores were produced.

Documentation

  • docs/USAGE.md — "which metric should I use?" decision tree.
  • docs/EXAMPLES.md — end-to-end runnable examples including the full publication metric panel, DeLong comparison, and seed aggregation.
  • CHANGELOG.md — version history.
  • CITATION.cff — citation metadata.

Testing

pytest tests/ -v

Each metric is verified against a first-principles brute-force reference; the test suite covers numerical equivalence, edge cases (empty class, single-value scores), and known properties (DeLong z=0 on identical inputs, ECE=0.9 on overconfident-wrong, etc.).
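
As an illustration of that style (a hypothetical check, not copied from the suite), AUROC can be verified against its O(n²) probabilistic definition, P(s_ood > s_id) plus half the tie probability:

import numpy as np
from osr_metrics import auroc

def auroc_bruteforce(scores, labels):
    """O(n^2) reference: probability a random OOD sample outscores
    a random ID sample, counting ties as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

scores = np.random.randn(500)
labels = np.random.randint(0, 2, 500)
assert np.isclose(auroc(scores, labels), auroc_bruteforce(scores, labels))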

License

MIT.



Download files

Download the file for your platform.

Source Distribution

osr_metrics-0.1.2.tar.gz (27.7 kB)


Built Distribution


osr_metrics-0.1.2-py3-none-any.whl (18.7 kB)


File details

Details for the file osr_metrics-0.1.2.tar.gz.

File metadata

  • Download URL: osr_metrics-0.1.2.tar.gz
  • Upload date:
  • Size: 27.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for osr_metrics-0.1.2.tar.gz
Algorithm Hash digest
SHA256 4c0d731b35e3e82b6837f474aaa942af3189c094e9ae7d906e9729b01688ca43
MD5 cb642e87addbf079a23d5801a9537fd7
BLAKE2b-256 d314e880b7993c053bb6c1e0d6180e6e506f6020b07e23e58f7d9058b4494b22


Provenance

The following attestation bundles were made for osr_metrics-0.1.2.tar.gz:

Publisher: release.yml on hxtruong6/osr-metrics

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file osr_metrics-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: osr_metrics-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 18.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for osr_metrics-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d260a21ed07a8000f1c8aee14b7d44b50892c51e9927d581c33bb6d416936e7c
MD5 9413b932f9ed8fc59083f8250eabd461
BLAKE2b-256 fb667de0bd4e0c5ba348eb2450837e2df3647a620f0a060a15e12301031b2af0


Provenance

The following attestation bundles were made for osr_metrics-0.1.2-py3-none-any.whl:

Publisher: release.yml on hxtruong6/osr-metrics

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
