Utilities for computing optimal classification cutoffs for binary and multiclass classification
Optimal Classification Cutoffs
Transform your ML model performance with optimal decision thresholds.
Most classifiers output probabilities, but decisions need thresholds. The default τ = 0.5 is rarely optimal for real objectives like F1, precision/recall, or business costs. This library finds the exact optimal threshold using O(n log n) algorithms in about 3 lines of code, which can yield large metric improvements (about 36% on F1 in the example below).
The Problem: Default 0.5 Thresholds Are Wrong
# ❌ WRONG: Default 0.5 threshold (what everyone does)
y_pred = (model.predict_proba(X)[:, 1] >= 0.5).astype(int)
# F1 Score: 0.654
# ✅ RIGHT: Optimal threshold (3 lines of code)
from optimal_cutoffs import optimize_thresholds
result = optimize_thresholds(y_true, y_scores, metric="f1")
y_pred = result.predict(y_scores_test)
# F1 Score: 0.891 (+36% improvement!)
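Why this matters: A 0.5 threshold is cost-optimal only when false positives and false negatives cost the same, and it is rarely the best choice for metrics like F1 on imbalanced data. Real problems have imbalanced data (fraud: 1%, disease: 5%) and asymmetric costs (a missed fraud costs $1000, a false alarm costs $1). In such settings the optimal threshold is often in the 0.05-0.30 range, not 0.50.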
Installation
pip install optimal-classification-cutoffs
Optional Performance Boost:
# For 10-100× speedups with Numba JIT compilation
pip install optimal-classification-cutoffs[performance]
# For Jupyter examples and visualizations
pip install optimal-classification-cutoffs[examples]
Python Support: The package works on Python 3.12 and later, including Python 3.14. Numba acceleration is optional; when Numba is unavailable, the library automatically falls back to pure Python.
Quick Start
Binary Classification: 40%+ F1 Improvement
from optimal_cutoffs import optimize_thresholds
# Your existing model's probabilities on a labeled held-out set (labels in y_true)
y_scores = model.predict_proba(X_val)[:, 1]
# Find optimal threshold (exact solution, O(n log n))
result = optimize_thresholds(y_true, y_scores, metric="f1")
print(f"Optimal threshold: {result.threshold:.3f}") # e.g., 0.127 not 0.5!
print(f"Expected F1: {result.scores[0]:.3f}")
# Make optimal predictions
y_pred = result.predict(y_scores_new)
Multiclass Classification: Per-Class Thresholds
import numpy as np
from optimal_cutoffs import optimize_thresholds
# Multiclass probabilities (n_samples, n_classes)
y_scores = model.predict_proba(X_test)
# Automatically detects multiclass, optimizes per-class thresholds
result = optimize_thresholds(y_true, y_scores, metric="f1")
print(f"Per-class thresholds: {result.thresholds}")
print(f"Task detected: {result.task.value}") # "multiclass"
print(f"Method used: {result.method}") # "coord_ascent"
# Predictions use optimal thresholds
y_pred = result.predict(y_scores_new)
Cost-Sensitive Decisions: No Thresholds Needed
from optimal_cutoffs import optimize_decisions
# Cost matrix: rows=true class, cols=predicted class
# False negatives cost 10x more than false positives
cost_matrix = [[0, 1], [10, 0]]
result = optimize_decisions(y_probs, cost_matrix)
y_pred = result.predict(y_probs_new) # Bayes-optimal decisions
API Overview
Clean, minimal API designed around user jobs-to-be-done:
Core Functions (The Only Two You Need)
from optimal_cutoffs import optimize_thresholds, optimize_decisions
# For threshold-based optimization (F1, precision, recall, etc.)
result = optimize_thresholds(y_true, y_scores, metric="f1")
# For cost matrix optimization (no thresholds)
result = optimize_decisions(y_probs, cost_matrix)
Progressive Disclosure: Power When You Need It
from optimal_cutoffs import metrics, bayes, cv, algorithms
# Custom metrics
custom_f2 = lambda tp, tn, fp, fn: (5*tp) / (5*tp + 4*fn + fp)
metrics.register("f2", custom_f2)
# Cross-validation with threshold tuning
thresholds = cv.cross_validate(model, X, y, metric="f1")
# Advanced algorithms
result = algorithms.multiclass.coordinate_ascent(y_true, y_scores)
Auto-Selection with Explanations
Everything is explainable. The library tells you what it detected and why:
result = optimize_thresholds(y_true, y_scores) # All defaults
print(f"Task: {result.task.value}") # "binary" (auto-detected)
print(f"Method: {result.method}") # "sort_scan" (O(n log n))
print(f"Notes: {result.notes}") # ["Detected binary task...", "Selected sort_scan for O(n log n) optimization..."]
Why This Works: Mathematical Foundations
Piecewise Structure
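Confusion-matrix metrics (F1, precision, recall) are piecewise-constant in the threshold τ: they change only when τ crosses one of the n scores, so there are at most n + 1 distinct values to check. Sorting the scores once and scanning them with running counts enables exact optimization in O(n log n) time.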
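The sort-and-scan idea can be illustrated with a minimal pure-Python sketch. This is not the library's implementation; the function name `best_f1_threshold` is hypothetical, and the sketch assumes binary 0/1 labels and the decision rule `score >= threshold`.

```python
def best_f1_threshold(y_true, y_scores):
    """Return (threshold, f1) maximizing F1 for the rule `score >= threshold`."""
    # Sort indices once by descending score: O(n log n).
    order = sorted(range(len(y_scores)), key=lambda i: y_scores[i], reverse=True)
    total_pos = sum(y_true)
    tp = fp = 0
    # Start from "predict nothing positive" (threshold above every score).
    best_f1, best_tau = 0.0, float("inf")
    for rank, i in enumerate(order):
        # Lowering the threshold past this score turns one more sample positive.
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # A threshold cannot split tied scores, so evaluate only at the end of a tie run.
        is_last = rank + 1 == len(order)
        if not is_last and y_scores[order[rank + 1]] == y_scores[i]:
            continue
        fn = total_pos - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_f1, best_tau = f1, y_scores[i]
    return best_tau, best_f1


y_true = [0, 0, 1, 0, 1, 1]
y_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8]
tau, f1 = best_f1_threshold(y_true, y_scores)
print(tau, round(f1, 3))  # 0.35 0.857
```

On this toy data a 0.5 cutoff gives F1 = 0.8, while the scan finds 0.35 with F1 = 6/7. The scan itself is linear; the sort dominates the cost.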
Bayes Decision Theory
Under calibrated probabilities, the cost-optimal binary threshold has a closed form:
τ* = cost_fp / (cost_fp + cost_fn)
This threshold is independent of class priors and depends only on the cost ratio (assuming correct decisions cost nothing).
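The closed form follows from comparing expected costs: predicting positive costs (1 - p) * cost_fp, predicting negative costs p * cost_fn, and the two are equal at p = τ*. A small sketch of that comparison, with hypothetical helper names `bayes_threshold` and `expected_costs` and zero cost assumed for correct decisions:

```python
def bayes_threshold(cost_fp, cost_fn):
    """Closed-form threshold; assumes correct decisions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def expected_costs(p, cost_fp, cost_fn):
    """Expected cost of predicting negative vs. positive when P(y=1) = p."""
    cost_if_negative = p * cost_fn        # we pay cost_fn if the sample is positive
    cost_if_positive = (1 - p) * cost_fp  # we pay cost_fp if the sample is negative
    return cost_if_negative, cost_if_positive


# False negatives cost 10x more than false positives.
tau = bayes_threshold(cost_fp=1, cost_fn=10)
print(round(tau, 4))  # 0.0909

# Above the threshold, predicting positive is cheaper; below it, predicting negative is.
for p in (0.05, 0.20):
    neg, pos = expected_costs(p, cost_fp=1, cost_fn=10)
    print(p, "positive" if pos < neg else "negative")
```

With a 10:1 cost ratio, a sample with only a 20% probability of being positive should already be flagged.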
Multiclass Extensions
- One-vs-Rest: Independent per-class thresholds (macro averaging)
- Coordinate Ascent: Coupled thresholds for single-label consistency
- General Costs: Skip thresholds, apply Bayes rule on probability vectors
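For the general-cost case, the Bayes rule amounts to picking the class with the lowest expected cost for each probability vector. The sketch below illustrates that rule in plain Python; it is not the library's `optimize_decisions`, the name `bayes_decisions` is hypothetical, and it assumes the same cost-matrix convention as above (rows = true class, columns = predicted class).

```python
def bayes_decisions(probs, cost_matrix):
    """For each row of class probabilities, pick the class with the lowest expected cost.

    cost_matrix[i][j] is the cost of predicting class j when the true class is i.
    """
    n_classes = len(cost_matrix)
    decisions = []
    for p in probs:
        # Expected cost of predicting class j, averaged over the true class i.
        expected = [
            sum(p[i] * cost_matrix[i][j] for i in range(n_classes))
            for j in range(n_classes)
        ]
        decisions.append(min(range(n_classes), key=expected.__getitem__))
    return decisions


cost_matrix = [[0, 1], [10, 0]]  # false negatives cost 10x more
probs = [[0.95, 0.05], [0.80, 0.20], [0.30, 0.70]]
print(bayes_decisions(probs, cost_matrix))  # [0, 1, 1]
```

The second sample is labeled positive at only 20% probability because a miss is ten times as expensive as a false alarm. With a 0/1 cost matrix the rule reduces to the usual argmax.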
Performance
- O(n log n) exact optimization for piecewise metrics
- O(1) closed-form solutions for cost-sensitive objectives
- Optional Numba acceleration (10-100× speedups) with automatic pure Python fallback
- Compatible with Python 3.12 and later, including Python 3.14
- 640+ tests ensuring correctness
Typical speedups: 10-100× faster than grid search, with exact solutions. Performance optimizations are optional - core functionality works everywhere.
Complete Example: Real Impact
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from optimal_cutoffs import optimize_thresholds
# Realistic imbalanced dataset (like fraud detection)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
# Train any classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_scores = model.predict_proba(X_test)[:, 1]
# ❌ Default threshold
y_pred_default = (y_scores >= 0.5).astype(int)
f1_default = f1_score(y_test, y_pred_default)
print(f"Default F1: {f1_default:.3f}") # ~0.65
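# ✅ Optimal threshold
# Note: for brevity the threshold is tuned and scored on the same test split,
# so the reported gain is optimistic; tune on a separate validation split in practice.
result = optimize_thresholds(y_test, y_scores, metric="f1")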
y_pred_optimal = result.predict(y_scores)
f1_optimal = f1_score(y_test, y_pred_optimal)
print(f"Optimal F1: {f1_optimal:.3f}") # ~0.89
improvement = (f1_optimal - f1_default) / f1_default * 100
print(f"Improvement: {improvement:+.1f}%") # ~+37%
When to Use This
Perfect for:
- Imbalanced classification (fraud, medical, spam)
- Cost-sensitive decisions (business impact)
- Performance-critical applications (exact solutions)
- Research requiring theoretical optimality
Not needed or not suitable for:
- Perfectly balanced classes with symmetric costs
- Problems requiring probabilistic outputs rather than hard decisions
- Uncalibrated models (calibrate first)
Advanced Usage
Cross-Validation with Thresholds
from optimal_cutoffs import cv
# Thresholds are hyperparameters - validate them!
scores = cv.cross_validate(
    model, X, y,
    metric="f1",
    cv=5,
    return_thresholds=True
)
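To show why a threshold should be validated like any other hyperparameter, here is a rough pure-Python sketch of the idea: pick the threshold on the fitting folds, then score it on the held-out fold. It works on precomputed scores rather than refitting a model, uses a brute-force search for brevity, and is not the library's `cv.cross_validate`; the names `f1_at` and `cv_thresholds` are hypothetical.

```python
def f1_at(y_true, y_scores, tau):
    """F1 of the rule `score >= tau` against 0/1 labels."""
    tp = sum(1 for y, s in zip(y_true, y_scores) if s >= tau and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_scores) if s >= tau and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_scores) if s < tau and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cv_thresholds(y_true, y_scores, k=3):
    """Return one (threshold, held-out F1) pair per fold."""
    n = len(y_true)
    results = []
    for fold in range(k):
        # Simple interleaved folds; a real setup would shuffle and stratify.
        held_out = [i for i in range(n) if i % k == fold]
        fit = [i for i in range(n) if i % k != fold]
        fit_y = [y_true[i] for i in fit]
        fit_s = [y_scores[i] for i in fit]
        # Brute force over observed scores: simple, but quadratic in fold size.
        tau = max(sorted(set(fit_s)), key=lambda t: f1_at(fit_y, fit_s, t))
        out_y = [y_true[i] for i in held_out]
        out_s = [y_scores[i] for i in held_out]
        results.append((tau, f1_at(out_y, out_s, tau)))
    return results


y_true = [0, 0, 1, 0, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.15, 0.7, 0.3]
for tau, held_out_f1 in cv_thresholds(y_true, y_scores):
    print(f"threshold={tau:.2f}  held-out F1={held_out_f1:.3f}")
```

The spread of thresholds and held-out scores across folds gives a sense of how stable the chosen cutoff is.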
Custom Metrics
from optimal_cutoffs import metrics
# Register custom Fβ score
def f_beta(tp, tn, fp, fn, beta=2.0):
    return (1 + beta**2) * tp / ((1 + beta**2) * tp + beta**2 * fn + fp)
metrics.register("f2", lambda tp, tn, fp, fn: f_beta(tp, tn, fp, fn, 2.0))
# Use like any built-in metric
result = optimize_thresholds(y_true, y_scores, metric="f2")
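A custom metric can be sanity-checked on hand-computed confusion counts before it is registered. The standalone sketch below does not touch the library; it also adds a zero-denominator guard, which is an assumption about how one might want the metric to behave when there are no positives and no positive predictions.

```python
def f_beta(tp, tn, fp, fn, beta=2.0):
    """F-beta from confusion counts; returns 0.0 when the denominator is zero."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# beta=2: 5*8 / (5*8 + 4*2 + 4) = 40 / 52
print(round(f_beta(tp=8, tn=80, fp=4, fn=2), 4))  # 0.7692

# beta=1 reduces to the usual F1 = 2*tp / (2*tp + fp + fn) = 16 / 22
print(round(f_beta(tp=8, tn=80, fp=4, fn=2, beta=1.0), 4))  # 0.7273

# Degenerate case: no positives and no positive predictions.
print(f_beta(tp=0, tn=10, fp=0, fn=0))  # 0.0
```

Note that `tn` is accepted but unused, matching the four-argument signature used for registered metrics in the examples above.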
Multiple Metrics
# Optimize different metrics
f1_result = optimize_thresholds(y_true, y_scores, metric="f1")
precision_result = optimize_thresholds(y_true, y_scores, metric="precision")
print(f"F1 optimal τ: {f1_result.threshold:.3f}")
print(f"Precision optimal τ: {precision_result.threshold:.3f}")
References
- Lipton et al. (2014) Optimal Thresholding of Classifiers to Maximize F1
- Elkan (2001) The Foundations of Cost-Sensitive Learning
- Dinkelbach (1967) Nonlinear Fractional Programming
- Platt (1999) Probabilistic Outputs for Support Vector Machines