A Unified Python Library for Standardized Metric Evaluation in Machine Learning
AllMetrics: A Unified Python Library for Standardized Metric Evaluation in Machine Learning
Paper Title: AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning.
Paper: Coming soon.
PyPI: https://pypi.org/project/allmetrics/
GitHub: https://github.com/MohammadRSalmanpour/AllMetrics
Python Version: 3.11+
AllMetrics is a comprehensive Python library designed to standardize performance metric evaluation across diverse machine learning tasks. It provides a unified API for computing metrics while ensuring robust data validation, consistent implementations, and standardized reporting formats.
Many existing libraries compute evaluation metrics differently, leading to inconsistent results across research papers, tools, and frameworks. AllMetrics addresses these issues by providing consistent implementations, validation mechanisms, and unified outputs across multiple machine learning domains.
AllMetrics supports evaluation for:
- Regression
- Classification
- Clustering
- Segmentation
- Image-to-Image Translation
Table of Contents
- Motivation
- Key Features
- Installation
- Quick Start
- Task Examples
- Supported Metrics
  - Regression Metrics
  - Classification Metrics
  - Clustering Metrics
  - Segmentation Metrics
  - Image-to-Image Translation Metrics
- Library Design Principles
- Output Format
- Data Validation
- API Structure
- Troubleshooting
- Version History
- Maintenance
- Citation
- License
- Contact
Motivation
Evaluation metrics play a central role in machine learning research and practice. They are used to compare models, report experimental results, and guide model selection. However, despite their importance, metric implementations across existing libraries are often inconsistent. Differences in mathematical definitions, preprocessing assumptions, aggregation strategies, and reporting formats can lead to substantially different results, even when the same metric name is used.
These inconsistencies arise mainly from two sources:
1. Implementation Differences (ID): variations in how metrics are mathematically defined or computed across tools.
2. Reporting Differences (RD): variations in how results are aggregated or summarized (e.g., micro, macro, weighted, class-wise reporting).
As a consequence, identical models evaluated on identical datasets may produce different results depending on the library, framework, or configuration used. This makes cross-study comparisons difficult and may undermine reproducibility in machine learning research.
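To make an Implementation Difference concrete, the sketch below computes "macro F1" in two ways that both appear in practice: as the mean of per-class F1 scores, and as the F1 of macro-averaged precision and recall. It is plain Python written for illustration, not AllMetrics code; the function names and the toy labels are assumptions.

```python
def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, using 0.0 for 0/0 cases."""
    labels = sorted(set(y_true) | set(y_pred))
    stats = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[c] = (precision, recall, f1)
    return stats


def macro_f1_mean_of_f1(y_true, y_pred):
    """Definition A: average the per-class F1 scores."""
    stats = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in stats.values()) / len(stats)


def macro_f1_from_macro_pr(y_true, y_pred):
    """Definition B: F1 of the macro-averaged precision and recall."""
    stats = per_class_prf(y_true, y_pred)
    p = sum(s[0] for s in stats.values()) / len(stats)
    r = sum(s[1] for s in stats.values()) / len(stats)
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy labels (hypothetical data).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]

f1_a = macro_f1_mean_of_f1(y_true, y_pred)
f1_b = macro_f1_from_macro_pr(y_true, y_pred)
print(f"mean of per-class F1:        {f1_a:.4f}")
print(f"F1 of macro precision/recall: {f1_b:.4f}")
```

Both numbers could be reported as "macro F1" for the same model and data, which is the kind of ambiguity a standardized implementation is meant to remove.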
AllMetrics was developed to address this challenge. It provides a unified and transparent framework for computing evaluation metrics across a wide range of machine learning tasks. The library standardizes metric implementations, explicitly exposes evaluation assumptions, and integrates robust data validation mechanisms to ensure reliable and reproducible results.
By unifying metric evaluation for classification, regression, clustering, segmentation, and image-to-image analysis, AllMetrics enables consistent benchmarking and facilitates reproducible experimentation across diverse ML workflows.
Key Features
Unified Metric Evaluation
AllMetrics provides standardized implementations of evaluation metrics across multiple machine learning tasks, including classification, regression, clustering, segmentation, and image-to-image translation.
Explicit Control of Evaluation Assumptions
The library explicitly distinguishes between Implementation Differences (ID) and Reporting Differences (RD), enabling transparent and reproducible metric evaluation.
Robust Data Validation
Automatic validation checks help detect common data issues before metrics are computed, including:
- shape mismatches
- invalid value ranges
- class imbalance
- missing or empty labels
- outliers and abnormal distributions
Task-Agnostic API
A unified API design allows users to evaluate models across different ML tasks using consistent function interfaces and parameter conventions.
Extensible Architecture
Users can extend the library by adding custom metrics or integrating new validation rules while preserving standardized reporting and evaluation workflows.
Broad Metric Coverage
The library includes more than 50 evaluation metrics spanning multiple ML domains.
Support for Advanced Applications
AllMetrics supports specialized scenarios such as:
- multi-class and multi-label classification
- 2D and 3D medical image segmentation
- clustering quality evaluation
- image-to-image translation assessment (SSIM, PSNR)
Installation
AllMetrics can be installed directly from PyPI using pip:
`pip install allmetrics`
After installation, the library can be imported in Python:
`import allmetrics`
The package is designed to integrate easily with common scientific computing and machine learning libraries such as NumPy, PyTorch, and standard Python data pipelines.
Quick Start
The following example demonstrates how to compute a simple classification metric using AllMetrics.
from allmetrics.classification import accuracy_score
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
acc = accuracy_score(y_true, y_pred)
print("Accuracy:", acc)
AllMetrics provides additional configuration options that allow users to control validation behavior and evaluation assumptions.
Example with validation options:
from allmetrics.classification import accuracy_score
acc = accuracy_score(
    y_true,
    y_pred,
    normalize=True,
    check_outliers=False,
    check_distribution=False,
    check_correlation=False,
    check_missing_large=False,
    check_class_balance=False
)
Users can also explore available metrics programmatically:
import allmetrics
allmetrics.classification.list_of_metrics()
To retrieve detailed information about a specific metric:
allmetrics.classification.get_metric_details("accuracy_score")
This discovery mechanism allows users to easily inspect available metrics, understand parameter configurations, and select appropriate evaluation measures for their experiments.
Task Examples
The following examples demonstrate how AllMetrics can be used across different machine learning tasks. The library provides a consistent API for computing evaluation metrics in classification, regression, segmentation, clustering, and image-to-image analysis.
Classification Example
from allmetrics.classification import f1_score
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
f1 = f1_score(
    y_true,
    y_pred,
    average="macro",  # options: micro, macro, weighted, none
    zero_division=0,
    check_class_balance=True
)
print("F1 Score:", f1)
Key point:
All classification metrics follow a unified API. The average parameter explicitly controls Reporting Differences (RD) such as micro, macro, or weighted aggregation.
Regression Example
from allmetrics.regression import mean_absolute_error
y_true = [3.2, 2.8, 4.1, 5.0]
y_pred = [2.9, 3.0, 3.8, 5.3]
mae = mean_absolute_error(
    y_true,
    y_pred,
    check_outliers=True,
    check_distribution=True
)
print("MAE:", mae)
Key point:
Built-in validation can automatically detect issues such as abnormal distributions, outliers, or invalid numeric values before computing the metric.
Segmentation Example (2D/3D)
import numpy as np
from allmetrics.segmentation import dice_score
# Example 2D masks
y_true = np.array([[1, 1, 0], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
dice = dice_score(
    y_true,
    y_pred,
    mode="binary",  # multi-class also supported
    ignore_background=True,
    check_empty_masks=True
)
print("Dice Score:", dice)
Key point:
AllMetrics supports 2D and 3D segmentation evaluation, including metrics such as Dice, IoU, Hausdorff Distance, and ASSD, which are widely used in medical imaging.
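For readers who want to see what the overlap metrics measure, here is a minimal from-scratch sketch of binary Dice and IoU on the same 2D masks. It uses the textbook definitions (Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B|) and is not the library's implementation; the function name `dice_and_iou` and the choice to return `None` for two empty masks are assumptions made for this example.

```python
def dice_and_iou(mask_true, mask_pred, foreground=1):
    """Binary Dice and IoU for two 2D masks given as nested lists."""
    flat_true = [v for row in mask_true for v in row]
    flat_pred = [v for row in mask_pred for v in row]
    if len(flat_true) != len(flat_pred):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(
        1 for t, p in zip(flat_true, flat_pred)
        if t == foreground and p == foreground
    )
    size_true = sum(1 for t in flat_true if t == foreground)
    size_pred = sum(1 for p in flat_pred if p == foreground)

    if size_true + size_pred == 0:
        # Both masks are empty: the ratio is 0/0 and conventions differ,
        # so this sketch returns None instead of picking a value.
        return None, None

    dice = 2 * intersection / (size_true + size_pred)
    iou = intersection / (size_true + size_pred - intersection)
    return dice, iou


y_true = [[1, 1, 0], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]

dice, iou = dice_and_iou(y_true, y_pred)
print(f"Dice: {dice:.4f}, IoU: {iou:.4f}")  # Dice: 0.6667, IoU: 0.5000
```

The two masks share 2 foreground pixels out of 3 each, so Dice is 4/6 and IoU is 2/4.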
Supported Metrics
AllMetrics includes 50+ standardized evaluation metrics covering multiple machine learning tasks. The implementations follow consistent definitions and transparent evaluation assumptions.
Regression
- mean_absolute_error
- mean_squared_error
- mean_bias_deviation
- r_squared
- r_squared(adjusted)
- mean_absolute_percentage_error
- symmetric_mean_absolute_percentage_error
- huber_loss
- relative_squared_error
- mean_squared_log_error
- log_cosh_loss
- explained_variance
- median_absolute_error
- max_error
- mean_tweedie_deviance
- mean_pinball_loss
Classification
- accuracy_score
- precision_score
- recall_score
- balanced_accuracy
- matthews_correlation_coefficient
- cohens_kappa
- f1_score
- confusion_matrix
- fbeta_score
- jaccard_score
- log_loss
- hamming_loss
- top_k_accuracy
Clustering
- adjusted_rand_index
- normalized_mutual_info_score
- silhouette_score
- calinski_harabasz_index
- homogeneity_score
- completeness_score
- davies_bouldin_index
- mutual_information
- v_measure_score
- rand_score
- adjusted_mutual_info_score
- fowlkes_mallows_score
Segmentation (2D/3D)
- dice_score
- iou_score
- sensitivity
- specificity
- precision
- hausdorff_distance
Image-to-Image Translation
- ssim
- psnr
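PSNR is a good example of a metric whose value depends on an assumption that is easy to leave implicit: the dynamic range of the images. The sketch below uses the standard definition, PSNR = 10 · log10(data_range² / MSE), in plain Python. It is an illustration only, not the library's `psnr`; the name `psnr_sketch`, the explicit `data_range` argument, and the pixel values are assumptions.

```python
import math


def psnr_sketch(reference, test, data_range):
    """PSNR in dB for two equally sized 2D images given as nested lists."""
    flat_ref = [v for row in reference for v in row]
    flat_test = [v for row in test for v in row]
    if len(flat_ref) != len(flat_test):
        raise ValueError("images must contain the same number of pixels")

    mse = sum((r - t) ** 2 for r, t in zip(flat_ref, flat_test)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(data_range ** 2 / mse)


# 8-bit style images (hypothetical pixel values).
reference = [[0, 128], [255, 64]]
test = [[2, 126], [250, 66]]

# The same images rescaled to the [0, 1] range.
reference_01 = [[v / 255 for v in row] for row in reference]
test_01 = [[v / 255 for v in row] for row in test]

psnr_8bit = psnr_sketch(reference, test, data_range=255)
psnr_unit = psnr_sketch(reference_01, test_01, data_range=1.0)
psnr_wrong = psnr_sketch(reference_01, test_01, data_range=255)

print(f"8-bit images, data_range=255:  {psnr_8bit:.2f} dB")
print(f"[0, 1] images, data_range=1:   {psnr_unit:.2f} dB")
print(f"[0, 1] images, data_range=255: {psnr_wrong:.2f} dB (mismatched range)")
```

With a matching `data_range` the two representations give the same PSNR; with a mismatched one the score is inflated by about 48 dB, which is why the range should be stated alongside any reported value.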
Library Design Principles
AllMetrics is designed around a set of principles aimed at ensuring reproducibility, transparency, and consistency in machine learning metric evaluation.
- Standardized Metric Implementations: All metrics are implemented using consistent mathematical definitions and verified implementations. This minimizes Implementation Differences (ID) that often arise across different libraries.
- Explicit ID/RD Control: A core design concept of AllMetrics is the explicit distinction between:
  - Implementation Differences (ID)
  - Reporting Differences (RD)
  Examples include:
  - averaging strategies in classification
  - handling of missing classes
  - background handling in segmentation
  - surface distance computation methods
  Making these assumptions explicit improves reproducibility and transparency in reported results.
- Layered Architecture: The library follows a layered architecture that separates different responsibilities:
  - Preprocessing Layer: Handles input validation, shape checking, and normalization rules.
  - Metrics Core Layer: Provides standardized implementations of evaluation metrics.
  - ID/RD Control Layer: Manages configuration related to evaluation assumptions.
  - Reporting Layer: Generates interpretable and structured evaluation results.
  - Extensions Layer: Allows users to add custom metrics or extend the library.
- Task-Agnostic API: Metrics across different tasks follow similar function signatures and parameter conventions, making the library easy to learn and use.
- Robust Validation by Default: To prevent misleading results, AllMetrics performs automatic validation checks before computing metrics. These checks may include:
  - missing classes
  - abnormal correlations
  - outliers
  - empty segmentation masks
  - invalid value ranges
  Users can customize or disable these checks depending on their workflow.
- Extensible and Research-Friendly: The library is designed to support research workflows. Users can easily extend AllMetrics by implementing new metrics while leveraging the existing validation and reporting infrastructure.
This extensibility makes AllMetrics suitable for both applied machine learning projects and academic research.
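This README does not document the extension mechanism itself, so the following is only a sketch of one common pattern for it: a decorator that registers a custom metric and wraps it with shared input validation. Every name here (`register_metric`, `validate_inputs`, `METRIC_REGISTRY`) is hypothetical and not part of the AllMetrics API.

```python
import functools
import math

# Hypothetical registry mapping metric names to their metadata.
METRIC_REGISTRY = {}


def validate_inputs(y_true, y_pred):
    """Shared checks applied before any registered metric runs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")


def register_metric(name, description=""):
    """Register a metric under `name` and wrap it with validation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(y_true, y_pred, **options):
            validate_inputs(y_true, y_pred)
            return func(y_true, y_pred, **options)

        METRIC_REGISTRY[name] = {"function": wrapper, "description": description}
        return wrapper
    return decorator


@register_metric("root_mean_squared_error", description="Square root of the MSE.")
def root_mean_squared_error(y_true, y_pred):
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


print(sorted(METRIC_REGISTRY))
print(root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

The point of the pattern is that a new metric only has to implement its formula; the validation and bookkeeping stay in one place.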
Output Format
AllMetrics is designed to produce clear, interpretable, and reproducible outputs. Metric functions typically return a numerical value, but they can also provide structured summaries when detailed reporting is enabled.
Standard Output
Most metrics return a single numeric value:
from allmetrics.classification import accuracy_score
acc = accuracy_score(y_true, y_pred)
print(acc)
Output example:
0.84
Multi-Class Reporting
For metrics that involve multiple classes, users can control the aggregation strategy using the average parameter.
Example:
from allmetrics.classification import precision_score
precision = precision_score(
    y_true,
    y_pred,
    average="macro"
)
print(precision)
Available aggregation modes:
- micro: global aggregation over all samples
- macro: unweighted mean across classes
- weighted: class-weighted mean
- none: class-wise results
Example class-wise output (average="none"):
[0.82, 0.76, 0.91]
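The aggregation modes can give noticeably different numbers for the same predictions. The sketch below computes precision under each mode using the standard definitions, in plain Python. It is an illustration rather than the library's code; `precision_by_average` and the toy labels are assumptions, and undefined per-class precision (no predictions for a class) is set to 0.0 here.

```python
def precision_by_average(y_true, y_pred, average="macro"):
    """Multi-class precision under micro, macro, weighted, or none averaging."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    support = {c: 0 for c in labels}  # number of true samples per class

    for t, p in zip(y_true, y_pred):
        support[t] += 1
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1

    per_class = [
        tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels
    ]

    if average == "none":
        return per_class
    if average == "macro":
        return sum(per_class) / len(labels)
    if average == "weighted":
        total = sum(support.values())
        return sum(p * support[c] for p, c in zip(per_class, labels)) / total
    if average == "micro":
        return sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    raise ValueError(f"unknown average: {average!r}")


# Imbalanced toy labels (hypothetical data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

for mode in ("none", "macro", "weighted", "micro"):
    print(mode, precision_by_average(y_true, y_pred, average=mode))
```

On this data the macro, weighted, and micro results are roughly 0.72, 0.85, and 0.80, so a reported "precision" is only comparable across studies when the aggregation mode is stated.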
Data Validation
Reliable evaluation requires reliable inputs. AllMetrics includes an integrated data validation layer that automatically checks input data before computing metrics.
These checks help detect common issues that can silently distort evaluation results.
Shape Consistency
Ensures that predicted and ground-truth arrays have compatible shapes.
Example problems detected:
- mismatched lengths
- incompatible tensor shapes
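As a rough picture of what a shape-consistency check involves, the sketch below derives the shape of a nested list and compares the two inputs before any metric is computed. It is a simplified stand-in, not the library's validation layer; `shape_of` and `check_same_shape` are hypothetical names, and the helper assumes rectangular (non-ragged) inputs.

```python
def shape_of(data):
    """Shape of a rectangular nested list, e.g. [[1, 0], [0, 1]] -> (2, 2).

    Only the first element at each depth is inspected, so ragged inputs
    are not detected by this simplified helper.
    """
    shape = []
    while isinstance(data, (list, tuple)):
        shape.append(len(data))
        if len(data) == 0:
            break
        data = data[0]
    return tuple(shape)


def check_same_shape(y_true, y_pred):
    """Raise ValueError when the two inputs do not have the same shape."""
    shape_true, shape_pred = shape_of(y_true), shape_of(y_pred)
    if shape_true != shape_pred:
        raise ValueError(
            "y_true and y_pred must have the same shape, "
            f"got {shape_true} and {shape_pred}"
        )


check_same_shape([1, 0, 1], [1, 1, 0])            # passes silently
check_same_shape([[1, 0], [0, 1]], [[1, 1], [0, 0]])  # passes silently

try:
    check_same_shape([1, 0, 1, 1], [1, 0, 1])
except ValueError as err:
    print(err)
```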
Value Range Checks
Validates that predictions and labels fall within acceptable ranges.
Examples:
- classification labels outside expected class indices
- segmentation masks containing invalid values
- regression predictions with NaN or infinite values
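The checks above amount to scanning the inputs for values that no metric can meaningfully use. A minimal sketch in plain Python follows; it is not the library's code, and the helper names and the `num_classes` argument are assumptions for illustration.

```python
import math


def find_non_finite(values):
    """Indices of NaN or infinite entries in a sequence of numbers."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def find_invalid_labels(labels, num_classes):
    """Indices of labels that are not integers in [0, num_classes)."""
    return [
        i for i, c in enumerate(labels)
        if not (isinstance(c, int) and 0 <= c < num_classes)
    ]


predictions = [1.0, float("nan"), 2.5, float("inf")]
labels = [0, 2, 3, -1, 1]

print(find_non_finite(predictions))                 # [1, 3]
print(find_invalid_labels(labels, num_classes=3))   # [2, 3]
```

Returning the offending indices, rather than only a pass/fail flag, makes it easier to trace a problem back to the rows that caused it.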
Class Presence & Balance
For classification tasks, AllMetrics can check:
- missing classes in predictions
- severe class imbalance
- degenerate predictions (predicting a single class only)
Example option:
`check_class_balance=True`
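The following sketch shows one way such a class-presence and balance check could be expressed. It is only an illustration of the three conditions listed above: the function name, the returned dictionary keys, and the imbalance threshold of 5.0 are assumptions, not AllMetrics behavior.

```python
from collections import Counter


def class_balance_report(y_true, y_pred, imbalance_threshold=5.0):
    """Summarize class presence and balance for a classification task."""
    if not y_true:
        raise ValueError("y_true must not be empty")

    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)

    # Ratio between the most and least frequent ground-truth classes.
    ratio = max(true_counts.values()) / min(true_counts.values())

    return {
        "missing_in_predictions": sorted(set(true_counts) - set(pred_counts)),
        "imbalance_ratio": ratio,
        "severely_imbalanced": ratio >= imbalance_threshold,
        "single_class_prediction": len(pred_counts) == 1,
    }


# 18 negatives, 2 positives, and a model that always predicts class 0.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20

report = class_balance_report(y_true, y_pred)
print(report)
```

In this example accuracy would be 0.90 even though the model never predicts the minority class, which is the situation these checks are meant to surface.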
Outlier Detection
For regression tasks, optional outlier checks can identify abnormal values that may distort evaluation.
Example option:
`check_outliers=True`
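The README does not say which outlier rule the library applies, so the sketch below uses a common stand-in, Tukey's interquartile-range (IQR) rule, to show the idea. The function name, the factor `k=1.5`, and the sample values are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Requires at least two data points (a statistics.quantiles constraint).
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Regression targets with one implausible value (hypothetical data).
y_true = [3.2, 2.8, 4.1, 5.0, 3.6, 3.9, 42.0]

print(iqr_outliers(y_true))  # [6]
```

A single extreme value like this can dominate squared-error metrics, so flagging it before computing the metric is often more useful than discovering it afterwards.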
Segmentation-Specific Checks
For image segmentation tasks, additional checks are available:
- empty masks
- background-only predictions
- mismatched mask dimensions
Example:
`check_empty_masks=True`
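Empty masks matter because overlap metrics such as Dice divide by the total foreground size, which is zero when both masks are empty. The sketch below shows a simple way to detect the first two conditions for 2D or 3D masks stored as nested lists. It is not the library's check; the function names and the `background=0` convention are assumptions.

```python
def flatten(data):
    """Yield the scalar entries of an arbitrarily nested list."""
    if isinstance(data, (list, tuple)):
        for item in data:
            yield from flatten(item)
    else:
        yield data


def describe_masks(y_true, y_pred, background=0):
    """Report whether either mask contains no foreground voxels."""
    true_foreground = sum(1 for v in flatten(y_true) if v != background)
    pred_foreground = sum(1 for v in flatten(y_pred) if v != background)
    return {
        "empty_ground_truth": true_foreground == 0,
        "background_only_prediction": pred_foreground == 0,
    }


# 2D case: the ground truth has no foreground at all.
status_2d = describe_masks([[0, 0], [0, 0]], [[0, 1], [0, 0]])

# 3D case: the prediction is background only.
status_3d = describe_masks(
    [[[0, 1], [0, 0]], [[1, 1], [0, 0]]],
    [[[0, 0], [0, 0]], [[0, 0], [0, 0]]],
)

print(status_2d)
print(status_3d)
```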
Configurable Validation
All validation checks are fully configurable and can be enabled or disabled depending on the application.
Example:
accuracy_score(
    y_true,
    y_pred,
    check_outliers=False,
    check_distribution=False,
    check_class_balance=False
)
API Structure
AllMetrics follows a task-oriented modular API design, where metrics are organized by machine learning task. This structure keeps the library intuitive and easy to navigate.
Main Modules
allmetrics
├── classification
├── regression
├── clustering
├── segmentation
└── image_translation
Each module contains task-specific evaluation metrics.
Example imports:
from allmetrics.classification import accuracy_score
from allmetrics.regression import mean_squared_error
from allmetrics.clustering import rand_score
from allmetrics.segmentation import dice_score
from allmetrics.imagetoimage import psnr
Metric Discovery Utilities
AllMetrics provides built-in utilities to explore available metrics.
List metrics within a module:
import allmetrics
allmetrics.classification.list_of_metrics()
Get details about a specific metric:
allmetrics.classification.get_metric_details("f1_score")
These utilities help users quickly discover supported metrics and understand their parameters.
Consistent Function Signatures
Most metric functions follow a consistent structure:
metric_function(
    y_true,
    y_pred,
    **options
)
Where:
- y_true: ground truth values
- y_pred: predicted values
- options: configuration parameters controlling validation, averaging, and evaluation behavior
This consistent API allows users to switch between metrics without changing their workflow.
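One practical consequence of a shared `(y_true, y_pred, **options)` signature is that several metrics can be evaluated in a single loop. The sketch below demonstrates this with two small stand-in metric functions written for the example; they are not imported from AllMetrics, and `evaluate_all` and the `check_lengths` option are hypothetical.

```python
def _check_lengths(y_true, y_pred, options):
    """Shared validation controlled by a hypothetical `check_lengths` option."""
    if options.get("check_lengths", True) and len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")


def mean_absolute_error_sketch(y_true, y_pred, **options):
    _check_lengths(y_true, y_pred, options)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def max_error_sketch(y_true, y_pred, **options):
    _check_lengths(y_true, y_pred, options)
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def evaluate_all(metrics, y_true, y_pred, **options):
    """Call every metric with the same inputs and the same options."""
    return {name: fn(y_true, y_pred, **options) for name, fn in metrics.items()}


METRICS = {
    "mae": mean_absolute_error_sketch,
    "max_error": max_error_sketch,
}

y_true = [3.2, 2.8, 4.1, 5.0]
y_pred = [2.9, 3.0, 3.8, 5.3]

results = evaluate_all(METRICS, y_true, y_pred, check_lengths=True)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Adding a metric to the report then means adding one entry to the dictionary, with no change to the evaluation loop.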
Troubleshooting
This section addresses common issues users may encounter when using AllMetrics.
Shape Mismatch Errors
Problem
`ValueError: y_true and y_pred must have the same shape`
Solution
Ensure that both arrays contain the same number of samples and compatible dimensions.
Example:
`len(y_true) == len(y_pred)`
Invalid Label Values
Problem
Labels contain values outside the expected range.
Solution
Verify that classification labels correspond to valid class indices and do not contain unexpected values.
Empty Segmentation Masks
Problem
Segmentation metrics fail when masks contain no foreground pixels.
Solution
Enable the built-in validation checks:
`check_empty_masks=True`
or ensure masks contain valid foreground regions.
NaN or Infinite Values
Problem
Metrics return NaN due to invalid numeric values.
Solution
Check the dataset for:
- NaN values
- infinite values
- invalid predictions
Unexpected Metric Results
If results differ from those produced by another library, possible reasons include:
- different aggregation strategies
- different implementation assumptions
- different handling of edge cases
AllMetrics makes these assumptions explicit through configuration parameters.
Version History
v1.0.10 (2025-03-05)
Initial public release of AllMetrics. Key features:
- unified evaluation framework for machine learning metrics
- support for classification, regression, clustering, segmentation, and image-to-image evaluation
- implementation of 50+ standardized metrics
- explicit control of Implementation Differences (ID) and Reporting Differences (RD)
- integrated data validation layer
- modular task-based API
- support for 2D and 3D segmentation evaluation
Future versions will expand metric coverage, improve reporting capabilities, and introduce additional validation tools for advanced machine learning workflows.
Maintenance
For technical support and maintenance inquiries, please contact:
Dr. Mohammad R. Salmanpour (Team Lead)
msalman@bccrc.ca | m.salmanpoor66@gmail.com | m.salmanpour@ubc.ca
Morteza Alizadeh (Assistant Team Lead)
Authors
- Morteza Alizadeh (Backend Development, Code Refactoring, Debugging, Library Management)
- Mehrdad Oveisi (Evaluator, Software Engineer, AI Expert, and Advisor)
- Sonya Falahati (Testing and Data Preparation)
- Ghazal Mousavi (Backend Development, Testing, and Data Preparation)
- Mohsen Alambardar Meybodi (Advisor and Evaluator)
- Somayeh Sadat Mehrnia (Coordinator and Evaluator)
- Ilker Hacihaliloglu (Medical Imaging Expert and Advisor)
- Arman Rahmim (Fund Provider, Medical Imaging Expert, Evaluator, and Advisor)
- Mohammad R. Salmanpour (Team Lead, Conceptualization, Supervisor, Fund Provider, AI and Medical Imaging Expert, and Evaluator)
Citation
@misc{abcdefgh,
  title={AllMetrics: A Unified Python Library for Standardized Metric Evaluation and Robust Data Validation in Machine Learning},
  author={Morteza Alizadeh and Mehrdad Oveisi and Sonya Falahati and Ghazal Mousavi and Mohsen Alambardar Meybodi and Somayeh Sadat Mehrnia and Ilker Hacihaliloglu and Arman Rahmim and Mohammad R. Salmanpour},
  year={2025},
  eprint={2511.15963},
  archivePrefix={arXiv},
  primaryClass={physics.med-ph},
  url={https://arxiv.org/abs/2505.15931},
}
License
This open-source software is released under the MIT License, which grants permission to use, modify, and distribute it for any purpose, including research or commercial use, without requiring modified versions to be shared as open source. See the LICENSE file for details.
Support
- Issues: GitHub Issues
- Documentation: This README and the included guides
- Examples: See examples/basic_usage.py
Acknowledgment
This study was supported by:
- Virtual Collaboration (VirCollab) Group, Vancouver, BC, Canada
- Technological Virtual Collaboration Corporation (TECVICO Corp.), Vancouver, BC, Canada
- Quantitative Radiomolecular Imaging and Therapy (Qurit) Lab, University of British Columbia, Vancouver, BC, Canada
- BC Cancer Research Institute, Department of Basic and Translational Research, Vancouver, BC, Canada
Contact
AllMetrics is available free of charge. For access, questions, or feedback:
Morteza Alizadeh (Backend Developer)
AlizadehMorteza2020@gmail.com
Dr. Mohammad R. Salmanpour (Team Lead)
msalman@bccrc.ca | m.salmanpoor66@gmail.com | m.salmanpour@ubc.ca