ML benchmark ranking tools — IQM, Bayesian pairwise, and Dolan-Moré performance profiles
evaluma
A small Python package for comparing machine learning models across benchmark suites. Given a CSV of per-model, per-dataset scores, evaluma can compute three complementary views of the results:
- IQM ranking — interquartile mean with bootstrapped confidence intervals, following Agarwal et al. (2021)
- Bayesian pairwise comparison — posterior probabilities that model A beats model B (or is practically equivalent), via baycomp
- Dolan-Moré performance profiles — cumulative distribution of performance ratios and area-under-profile scores
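To make the first view concrete, here is a minimal sketch of an interquartile mean with a percentile bootstrap interval. It is not evaluma's implementation: the function names `iqm` and `bootstrap_ci` are hypothetical, and the plain resampling used here is simpler than the stratified bootstrap described by Agarwal et al. (2021).

```python
import numpy as np


def iqm(scores) -> float:
    """Interquartile mean: the mean of the middle 50% of the sorted scores."""
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * x.size)  # number of values trimmed from each tail
    return float(x[cut : x.size - cut].mean())


def bootstrap_ci(scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the IQM (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(scores, dtype=float).ravel()
    # Resample the scores with replacement and recompute the IQM each time.
    stats = np.array(
        [iqm(rng.choice(x, size=x.size, replace=True)) for _ in range(n_boot)]
    )
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)


# One outlier (5.0) barely moves the IQM, which is approximately 0.75 here.
scores = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 5.0]
point = iqm(scores)
lo, hi = bootstrap_ci(scores)
```

Trimming the top and bottom quarter is what makes the statistic robust to a single very good or very bad dataset, while still using more of the data than the median does.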
Documentation
Full documentation, including tutorials and API reference, is available at evaluma.readthedocs.io.
Installation
pip install evaluma
For a development install from source:
git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"
Quick start
Python API
import evaluma
bench = evaluma.load_df(
"results.csv",
model="model",
dataset="dataset",
metric="metric",
score="score",
)
# IQM ranking with 95% bootstrap CI
iqm = bench.iqm_ranking()
print(iqm.table)
fig = iqm.plot()
fig.savefig("iqm.png")
# Bayesian pairwise probabilities
bayes = bench.bayesian_comparison()
print(bayes.table)
# Dolan-Moré performance profiles
profiles = bench.performance_profiles()
fig = profiles.plot()
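For readers unfamiliar with performance profiles, the following sketch shows the underlying computation from Dolan and Moré (2002) on a small cost matrix. It assumes scores have already been converted to positive costs where lower is better (for example error percentages); how evaluma performs that conversion is not shown here, and the function names are hypothetical.

```python
import numpy as np


def performance_ratios(costs) -> np.ndarray:
    """costs has shape (n_datasets, n_models); lower is better, all values > 0."""
    costs = np.asarray(costs, dtype=float)
    best = costs.min(axis=1, keepdims=True)  # best cost on each dataset
    return costs / best


def profile(ratios: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """rho[m, k]: fraction of datasets where model m is within factor taus[k] of the best."""
    return (ratios[:, :, None] <= taus[None, None, :]).mean(axis=0)


def area_under_profile(rho: np.ndarray, taus: np.ndarray) -> np.ndarray:
    """Trapezoidal area under each model's profile over the tau grid."""
    widths = np.diff(taus)
    return ((rho[:, 1:] + rho[:, :-1]) / 2.0 * widths).sum(axis=1)


# Illustrative error percentages for two models (columns) on four datasets (rows).
costs = np.array([[10.0, 20.0], [30.0, 15.0], [25.0, 25.0], [8.0, 12.0]])
taus = np.array([1.0, 1.5, 2.0])

ratios = performance_ratios(costs)
rho = profile(ratios, taus)
aup = area_under_profile(rho, taus)
```

A profile that rises quickly means the model is rarely far from the best result on any dataset; the area under the curve condenses that into a single number per model.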
CLI
# Run all three analyses and write six output files
evaluma report results.csv \
--model model --dataset dataset --metric metric --score score \
--output results/
# Individual subcommands
evaluma rank results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma compare results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma profiles results.csv --model model --dataset dataset --metric metric --score score --output results/
Each subcommand writes a .csv table and a .png figure to the --output directory.
Column mapping
If your CSV uses different column names, pass them explicitly:
evaluma report results.csv \
--model experiment --dataset task --metric measure --score value \
--output results/
Or put them in a YAML config file:
# config.yaml
model: experiment
dataset: task
metric: measure
score: value
evaluma report results.csv --config config.yaml --output results/
Lower-is-better metrics
bench = evaluma.load_df(
"results.csv",
model="model", dataset="dataset", metric="metric", score="score",
metric_direction={"rmse": "min"},
)
evaluma report results.csv ... --metric-direction rmse:min
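One common way to handle lower-is-better metrics is to flip their sign so that every score can be treated as higher-is-better downstream. The sketch below illustrates that idea only; it is an assumption, not a description of what evaluma does internally. The helper `orient_scores` is hypothetical, and treating unlisted metrics as "max" is likewise assumed.

```python
def orient_scores(rows, metric_direction):
    """Return copies of rows with scores negated for metrics marked "min"."""
    oriented = []
    for row in rows:
        # Assumption: metrics missing from the mapping are higher-is-better.
        direction = metric_direction.get(row["metric"], "max")
        sign = -1.0 if direction == "min" else 1.0
        oriented.append({**row, "score": sign * float(row["score"])})
    return oriented


rows = [
    {"model": "ModelA", "dataset": "dataset1", "metric": "rmse", "score": 0.5},
    {"model": "ModelA", "dataset": "dataset1", "metric": "acc", "score": 0.9},
]
oriented = orient_scores(rows, {"rmse": "min"})
```

Sign flipping preserves rankings but not ratios, so ratio-based views such as performance profiles need a different conversion.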
Filtering models or datasets
bench_ab = bench.select_models(["ModelA", "ModelB"])
bench_core = bench.select_datasets(["dataset1", "dataset2", "dataset3"])
Input format
evaluma expects a long-format CSV with one row per (model, dataset, metric) combination:
model,dataset,metric,score
ModelA,dataset1,acc,0.91
ModelA,dataset2,acc,0.87
ModelB,dataset1,acc,0.84
...
Multiple seeds are supported: pass --seed seed_col, and evaluma averages the scores across seeds before analysis.
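The averaging over seeds described above amounts to a group-by on (model, dataset, metric). A minimal standard-library sketch follows; the column name `seed`, the sample rows, and the helper `mean_over_seeds` are illustrative assumptions rather than part of evaluma.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Illustrative data: two seeds per (model, dataset, metric) combination.
RAW = """model,dataset,metric,seed,score
ModelA,dataset1,acc,0,0.90
ModelA,dataset1,acc,1,0.92
ModelB,dataset1,acc,0,0.84
ModelB,dataset1,acc,1,0.86
"""


def mean_over_seeds(text):
    """Average scores over all rows sharing the same (model, dataset, metric)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["model"], row["dataset"], row["metric"])
        groups[key].append(float(row["score"]))  # the seed column is dropped
    return {key: fmean(values) for key, values in groups.items()}


aggregated = mean_over_seeds(RAW)
```

With pandas, the same step would typically be a `groupby` on the three key columns followed by `mean()`.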
Contributing
git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"
# Run tests
pytest --cov=evaluma --cov-report=term-missing
# Lint and format
ruff check .
ruff format .
# Type checking
ty check
Bug reports and pull requests are welcome on GitHub.
License
Apache License 2.0. See LICENSE for the full text.
Citation
If you use evaluma in your research, please cite:
@software{lehmann2026evaluma,
author = {Lehmann, Nils},
title = {evaluma: ML Benchmark Ranking Tools},
year = {2026},
url = {https://github.com/nilsleh/evaluma},
version = {0.1.0},
}
Please also cite the works behind the underlying methods and frameworks:
@inproceedings{agarwal2021deep,
title = {Deep Reinforcement Learning at the Edge of the Statistical Precipice},
author = {Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
and Courville, Aaron and Bellemare, Marc G.},
booktitle = {Advances in Neural Information Processing Systems},
year = {2021},
}
@article{benavoli2017time,
title = {Time for a Change: a Tutorial for Comparing Multiple Classifiers
Through Bayesian Analysis},
author = {Benavoli, Alessio and Corani, Giorgio and Dem{\v{s}}ar, Janez
and Zaffalon, Marco},
journal = {Journal of Machine Learning Research},
volume = {18},
number = {77},
pages = {1--36},
year = {2017},
}
@article{dolan2002benchmarking,
title = {Benchmarking Optimization Software with Performance Profiles},
author = {Dolan, Elizabeth D. and Mor{\'e}, Jorge J.},
journal = {Mathematical Programming},
volume = {91},
pages = {201--213},
year = {2002},
}