PBS and PLL are superior evaluation metrics for probabilistic classifiers, fixing flaws in Brier Score (MSE) and Log Loss (Cross-Entropy). Strictly proper, consistent, and better for model selection, early stopping, and checkpointing.

Superior Scoring Rules: Better Metrics for Probabilistic Evaluation

GitHub (https://github.com/Ruhallah93/superior-scoring-rules), arXiv Preprint

Problem with Traditional Metrics

Accuracy-based metrics (Accuracy, F1) treat all correct predictions equally, ignoring confidence. In high-stakes domains, confidence calibration is critical:

  • Cancer Diagnosis: 51% vs. 99% confidence in malignancy should not be treated the same.

  • ICU Triage & Mortality: Overconfident mispredictions risk patient safety.

  • Autonomous Vehicles: Decisions depend on uncertainty about obstacles.

  • Financial Risk Modeling: Pricing and investment hinge on calibrated probabilities.

  • Security Threat Detection: High-confidence false negatives undermine defenses.

Thus, Accuracy or F1 Score alone is insufficient: they ignore the confidence of predictions.
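
To make this concrete, here is a minimal NumPy sketch (not part of the package). The two "models" and their probabilities are made up for illustration, and the helper names `accuracy` and `brier` are hypothetical. Both models pick the same class on every sample, so accuracy cannot tell them apart, while the Brier score can.

```python
import numpy as np

# One-hot labels for four samples of a two-class problem (illustrative data).
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)

# Two hypothetical models: identical argmax on every sample, very different confidence.
hesitant = np.array([[0.51, 0.49], [0.49, 0.51], [0.51, 0.49], [0.49, 0.51]])
confident = np.array([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]])

def accuracy(y, p):
    """Fraction of samples whose most probable class is the true class."""
    return float(np.mean(np.argmax(p, axis=1) == np.argmax(y, axis=1)))

def brier(y, p):
    """Mean over samples of the squared distance between label and prediction."""
    return float(np.mean(np.sum((y - p) ** 2, axis=1)))

acc_hesitant, acc_confident = accuracy(y_true, hesitant), accuracy(y_true, confident)
brier_hesitant, brier_confident = brier(y_true, hesitant), brier(y_true, confident)

print("accuracy:", acc_hesitant, acc_confident)    # identical for both models
print("brier:   ", brier_hesitant, brier_confident)  # lower is better
```

Accuracy is the same for both models, whereas the Brier score of the hesitant model is far larger than that of the confident one.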

Limitations of MSE & Cross-Entropy

Mean Squared Error (Brier Score) and Cross-Entropy (Log Loss) are strictly proper scoring rules, rewarding calibration. However, they can still favor incorrect predictions over correct ones. Example:

Vector  True Label (Y)  Predicted Probabilities (P)  Brier Score  Log Loss  State
A       [0, 1, 0]       [0.33, 0.34, 0.33]           0.6534       0.4685    Correct
B       [0, 1, 0]       [0.51, 0.49, 0.00]           0.5202       0.3098    Incorrect

Both MSE and Log Loss give B a lower (better) score than A, even though A ranks the true class first and B does not. This contradicts the principle of rewarding correct predictions.
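
The figures in this example can be reproduced with a few lines of NumPy. This is a sketch, not the package's code; the helper names `brier_score` and `log_loss10` are hypothetical. The Log Loss values shown correspond to a base-10 logarithm, so the sketch uses `np.log10`.

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])          # true label (one-hot)
p_a = np.array([0.33, 0.34, 0.33])     # vector A: argmax is the true class
p_b = np.array([0.51, 0.49, 0.00])     # vector B: argmax is a wrong class

def brier_score(y, p):
    """Sum of squared differences between the one-hot label and the prediction."""
    return float(np.sum((y - p) ** 2))

def log_loss10(y, p):
    """Negative base-10 log of the probability assigned to the true class."""
    return float(-np.log10(p[np.argmax(y)]))

brier_a, brier_b = brier_score(y, p_a), brier_score(y, p_b)
ll_a, ll_b = log_loss10(y, p_a), log_loss10(y, p_b)

print(f"A: Brier={brier_a:.4f}  LogLoss={ll_a:.4f}")
print(f"B: Brier={brier_b:.4f}  LogLoss={ll_b:.4f}")
```

With a natural logarithm the Log Loss values change, but B still scores lower than A because B assigns more probability to the true class.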

Our Solution: PBS & PLL

To ensure correct predictions always receive better scores, we introduce a penalty term for misclassifications:

  • Penalized Brier Score (PBS)

  • Penalized Logarithmic Loss (PLL)

These metrics are both strictly proper and superior (never favor wrong over right).
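
The penalty idea can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the package's implementation: the function names `penalized_brier` and `penalized_log_loss` are hypothetical, and the penalty sizes, (c - 1)/c for the Brier score and log(c) for the log loss with c classes, are assumed here because they are large enough to work. A prediction whose argmax is the true class has a Brier score of at most (c - 1)/c and a natural-log loss of at most log(c), so adding those amounts to every misclassified sample pushes it above any correct one. Consult the paper or the package source for the exact definitions.

```python
import numpy as np

def _is_wrong(y, p):
    """True where the most probable class differs from the true class.
    Ties are resolved by np.argmax, which picks the first maximum."""
    return np.argmax(p, axis=-1) != np.argmax(y, axis=-1)

def penalized_brier(y, p):
    """Per-sample Brier score plus an assumed penalty of (c - 1) / c when wrong."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    c = y.shape[-1]
    brier = np.sum((y - p) ** 2, axis=-1)
    return brier + _is_wrong(y, p) * (c - 1) / c

def penalized_log_loss(y, p, eps=1e-12):
    """Per-sample natural-log loss plus an assumed penalty of log(c) when wrong."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    c = y.shape[-1]
    p_true = np.sum(y * p, axis=-1)               # probability of the true class
    loss = -np.log(np.clip(p_true, eps, 1.0))     # clip to avoid log(0)
    return loss + _is_wrong(y, p) * np.log(c)

# Vectors A (correct) and B (incorrect) from the example above.
y = np.array([[0, 1, 0], [0, 1, 0]])
p = np.array([[0.33, 0.34, 0.33], [0.51, 0.49, 0.00]])

pbs_scores = penalized_brier(y, p)
pll_scores = penalized_log_loss(y, p)
print("penalized Brier:", pbs_scores)
print("penalized log loss:", pll_scores)
```

Under these assumed penalties, A now receives the lower (better) score on both metrics, reversing the ordering produced by the unpenalized scores.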

Quick Start

Installation from PyPI

pip install superior-scoring-rules

Install from Source (Development)

Clone the repository:

git clone https://github.com/Ruhallah93/superior-scoring-rules.git

Basic Usage

import tensorflow as tf
from superior_scoring_rules import pbs, pll

# Sample data (batch_size=3, num_classes=4)
# One-hot labels as float32 so they share a dtype with the predictions
y_true = tf.constant([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.05, 0.05, 0.0],
                      [0.1, 0.8, 0.05, 0.05],
                      [0.1, 0.1, 0.1, 0.7]])

print("PBS:", pbs(y_true, y_pred).numpy())
print("PLL:", pll(y_true, y_pred).numpy())

Early Stopping & Checkpointing

Use PBS/PLL instead of val_loss:

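class PBSCallback(tf.keras.callbacks.Callback):
    """Adds val_pbs (and val_pll) to the epoch logs so later callbacks can monitor them."""

    def __init__(self, x_val, y_val):
        super().__init__()
        # TF 2.x callbacks no longer populate self.validation_data, so the
        # validation arrays (x_val, y_val: your own held-out data) are passed in.
        self.x_val, self.y_val = x_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        logs = logs if logs is not None else {}
        y_prob = self.model.predict(self.x_val, verbose=0)  # predict once, reuse below
        # reduce_mean + float() gives a plain Python number whether the metric
        # returns a scalar or per-sample values
        logs['val_pbs'] = float(tf.reduce_mean(pbs(self.y_val, y_prob)))
        # or
        logs['val_pll'] = float(tf.reduce_mean(pll(self.y_val, y_prob)))

# PBSCallback must be listed first so val_pbs exists when the other callbacks run.
model.fit(..., callbacks=[PBSCallback(x_val, y_val),
    tf.keras.callbacks.EarlyStopping(monitor='val_pbs', patience=5, mode='min'),
    tf.keras.callbacks.ModelCheckpoint('best.h5', monitor='val_pbs', mode='min', save_best_only=True)
])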

Paper & Citation

@article{ahmadian2025superior,
  title={Superior scoring rules for probabilistic evaluation of single-label multi-class classification tasks},
  author={Ahmadian, Rouhollah and Ghatee, Mehdi and Wahlstr{\"o}m, Johan},
  journal={International Journal of Approximate Reasoning},
  pages={109421},
  year={2025},
  publisher={Elsevier}
}
