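Penalized Brier Score (PBS) and Penalized Logarithmic Loss (PLL) are superior evaluation metrics for probabilistic classifiers. They fix a flaw in Brier Score (MSE) and Log Loss (Cross-Entropy), both of which can score a wrong prediction better than a correct one. PBS and PLL are strictly proper and consistent, and are better suited to model selection, early stopping, and checkpointing.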

Superior Scoring Rules: Enhanced Calibrated Metrics for Probabilistic Evaluation

GitHub, arXiv Preprint

superior-scoring-rules is a Python library that provides strictly proper, confidence-aware evaluation metrics for probabilistic multi-class classification. Unlike traditional metrics such as Brier Score or Log Loss, these scoring rules add a penalty to mispredictions, so a correct prediction always scores better than an incorrect one.


Why Accuracy, F1, Brier Score, and Log-Loss Fall Short in Probabilistic Classification

In many high-stakes applications, confidence calibration is critical. Traditional accuracy-based metrics (Accuracy, F1) ignore prediction confidence. Consider:

  • Cancer Diagnosis: Differentiating 51% vs. 99% confidence in malignancy
  • ICU Triage: Overconfident mispredictions risk patient safety
  • Autonomous Vehicles: Handling uncertainties about obstacles
  • Financial Risk Modeling: Pricing and investment decisions
  • Security Threat Detection: High-confidence false negatives

Accuracy or F1 score alone cannot capture this nuance.
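
As a small illustration of the point above, the sketch below computes accuracy for two hypothetical sets of predictions that differ only in confidence. The `accuracy` helper and the sample probabilities are made up for this example and are not part of the library.

```python
def accuracy(y_true, y_prob):
    """Fraction of samples whose highest-probability class is the true class."""
    hits = 0
    for label, probs in zip(y_true, y_prob):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        hits += int(predicted == label)
    return hits / len(y_true)

# Hypothetical binary task: class 1 = malignant, and both cases are malignant.
labels = [1, 1]
hesitant = [[0.49, 0.51], [0.49, 0.51]]    # barely favors the true class
confident = [[0.01, 0.99], [0.01, 0.99]]   # strongly favors the true class

acc_hesitant = accuracy(labels, hesitant)
acc_confident = accuracy(labels, confident)
print(acc_hesitant, acc_confident)  # the two values are identical
```

Accuracy only looks at the top-ranked class, so the 51% and 99% predictions are indistinguishable to it.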

Limitations of Brier Score & Log Loss

Brier Score (Mean Squared Error, MSE, Quadratic Score) and Log Loss (Cross-Entropy, Negative Log-Likelihood, NLL, Logarithmic Score) are strictly proper scoring rules, rewarding calibration. However, they can still favor incorrect predictions over correct ones. Example:

Vector  True Label (Y)  Predicted Probabilities (P)  Brier Score  Log Loss  State
A       [0, 1, 0]       [0.33, 0.34, 0.33]           0.6534       0.4685    Correct
B       [0, 1, 0]       [0.51, 0.49, 0.00]           0.5202       0.3098    Incorrect

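For both metrics, lower is better. Both the Brier Score and Log Loss give B a lower (better) score than A, even though B ranks the wrong class first. This contradicts the principle of rewarding correct predictions.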
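
The table values can be reproduced with a few lines of plain Python. This is a sketch rather than the library's code: the helper names are hypothetical, the Brier Score is taken as the squared error summed over classes, and the Log Loss uses a base-10 logarithm, which is what matches the numbers in the table.

```python
import math

def brier_score(y_true, y_prob):
    """Squared error summed over classes for a single one-hot label."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_prob))

def log_loss_base10(y_true, y_prob):
    """Negative base-10 log of the probability assigned to the true class."""
    true_index = y_true.index(1)
    return -math.log10(y_prob[true_index])

y = [0, 1, 0]
a = [0.33, 0.34, 0.33]   # correct: class 1 has the highest probability
b = [0.51, 0.49, 0.00]   # incorrect: class 0 has the highest probability

for name, p in (("A", a), ("B", b)):
    print(name, round(brier_score(y, p), 4), round(log_loss_base10(y, p), 4))
# A 0.6534 0.4685
# B 0.5202 0.3098
```

B scores lower on both metrics because it places more probability on the true class (0.49 vs. 0.34), regardless of which class it ranks first.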

Our Solution: PBS & PLL

To ensure correct predictions always receive better scores, we introduce a penalty term for misclassifications:

  • Penalized Brier Score (PBS)

  • Penalized Logarithmic Loss (PLL)

These metrics are both strictly proper and superior (never favor wrong over right).
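
A minimal sketch of the idea, in plain Python, is shown below. It is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and the penalty is assumed to equal the score of a uniform prediction, that is (c - 1) / c for the Brier Score and log(c) for the Log Loss with c classes. The exact constants and batch reduction used by `pbs` and `pll` may differ; see the paper for the definitions.

```python
import math

def is_correct(y_true, y_prob):
    """True when the highest-probability class is the labeled class."""
    predicted = max(range(len(y_prob)), key=y_prob.__getitem__)
    return predicted == y_true.index(1)

def penalized_brier(y_true, y_prob):
    """Brier score plus an assumed penalty of (c - 1) / c for a misprediction."""
    c = len(y_true)
    brier = sum((y - p) ** 2 for y, p in zip(y_true, y_prob))
    penalty = 0.0 if is_correct(y_true, y_prob) else (c - 1) / c
    return brier + penalty

def penalized_log_loss(y_true, y_prob):
    """Log loss (natural log) plus an assumed penalty of log(c) for a misprediction."""
    c = len(y_true)
    loss = -math.log(y_prob[y_true.index(1)])
    penalty = 0.0 if is_correct(y_true, y_prob) else math.log(c)
    return loss + penalty

y = [0, 1, 0]
a = [0.33, 0.34, 0.33]   # correct prediction
b = [0.51, 0.49, 0.00]   # incorrect prediction

print(penalized_brier(y, a) < penalized_brier(y, b))        # A now scores better
print(penalized_log_loss(y, a) < penalized_log_loss(y, b))  # A now scores better
```

With the penalty in place, the ordering of vectors A and B from the earlier table is reversed: the correct prediction A receives the lower score under both metrics.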

Quick Start

Installation from PyPI

pip install superior-scoring-rules

Install from Source (Development)

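Clone the repository, then install it in editable mode from the repository root (this assumes the packaging files sit at the root):

git clone https://github.com/Ruhallah93/superior-scoring-rules.git
cd superior-scoring-rules
pip install -e .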

Basic Usage

import tensorflow as tf
from superior_scoring_rules import pbs, pll

# Sample data (batch_size=3, num_classes=4)
y_true = tf.constant([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])
y_pred = tf.constant([[0.9, 0.05, 0.05, 0], 
                     [0.1, 0.8, 0.05, 0.05],
                     [0.1, 0.1, 0.1, 0.7]])

print("PBS:", pbs(y_true, y_pred).numpy())
print("PLL:", pll(y_true, y_pred).numpy())

Early Stopping & Checkpointing

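Monitor val_pbs or val_pll instead of val_loss. Recent Keras versions do not populate self.validation_data, so the callback below receives the validation set explicitly:

class PBSCallback(tf.keras.callbacks.Callback):
    """Adds val_pbs and val_pll to the logs at the end of every epoch."""

    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        if logs is None:
            logs = {}
        y_prob = self.model.predict(self.x_val, verbose=0)
        # reduce_mean leaves a scalar unchanged and averages per-sample scores
        logs['val_pbs'] = float(tf.reduce_mean(pbs(self.y_val, y_prob)))
        logs['val_pll'] = float(tf.reduce_mean(pll(self.y_val, y_prob)))

# PBSCallback must come first so val_pbs exists when the other callbacks read it
model.fit(..., callbacks=[PBSCallback(x_val, y_val),
    tf.keras.callbacks.EarlyStopping(monitor='val_pbs', patience=5, mode='min'),
    tf.keras.callbacks.ModelCheckpoint('best.h5', monitor='val_pbs', save_best_only=True, mode='min')
])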

Paper & Citation

@article{ahmadian2025superior,
  title={Superior scoring rules for probabilistic evaluation of single-label multi-class classification tasks},
  author={Ahmadian, Rouhollah and Ghatee, Mehdi and Wahlstr{\"o}m, Johan},
  journal={International Journal of Approximate Reasoning},
  pages={109421},
  year={2025},
  publisher={Elsevier}
}

Related Topics

  • Probabilistic classification evaluation
  • Strictly proper scoring rules in machine learning
  • Calibrated metrics for deep learning
  • TensorFlow / Keras custom evaluation metrics
  • AI safety and confidence in model predictions
  • Penalized loss functions for classification
