ML benchmark ranking tools — IQM, Bayesian pairwise, and Dolan-Moré performance profiles

evaluma

A small Python package for comparing machine learning models across benchmark suites. Given a CSV of per-model, per-dataset scores, evaluma can compute three complementary views of the results:

  • IQM ranking — interquartile mean with bootstrapped confidence intervals, following Agarwal et al. (2021)
  • Bayesian pairwise comparison — posterior probabilities that model A beats model B (or is practically equivalent), via baycomp
  • Dolan-Moré performance profiles — cumulative distribution of performance ratios and area-under-profile scores
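To make the first of these concrete, here is a minimal standard-library sketch of an interquartile mean with a percentile bootstrap interval. It is an illustration only, not evaluma's implementation: the helper names `iqm` and `bootstrap_ci` and the example scores are made up, and Agarwal et al. (2021) use a stratified bootstrap over runs and tasks rather than the plain resampling shown here.

```python
import random
import statistics


def iqm(values):
    """Interquartile mean: average of what remains after trimming 25% from each end."""
    ordered = sorted(values)
    k = int(0.25 * len(ordered))  # values dropped per tail, rounded down
    return statistics.fmean(ordered[k:len(ordered) - k])


def bootstrap_ci(values, stat=iqm, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Made-up per-dataset scores for a single model.
scores = [0.91, 0.87, 0.62, 0.95, 0.78, 0.83, 0.99, 0.40]
point = iqm(scores)
low, high = bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Trimming both tails makes the estimate less sensitive to a single very good or very bad dataset than the plain mean, while still using more of the data than the median.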

Documentation

Full documentation, including tutorials and API reference, is available at evaluma.readthedocs.io.

Installation

pip install evaluma

For a development install from source:

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

Quick start

Python API

import evaluma

bench = evaluma.load_df(
    "results.csv",
    model="model",
    dataset="dataset",
    metric="metric",
    score="score",
)

# IQM ranking with 95% bootstrap CI
iqm = bench.iqm_ranking()
print(iqm.table)
fig = iqm.plot()
fig.savefig("iqm.png")

# Bayesian pairwise probabilities
bayes = bench.bayesian_comparison()
print(bayes.table)

# Dolan-Moré performance profiles
profiles = bench.performance_profiles()
fig = profiles.plot()
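To give a feel for what the pairwise probabilities mean, the sketch below implements a simplified Bayesian sign test with a region of practical equivalence (ROPE), in the spirit of Benavoli et al. (2017). It is not the code path evaluma or baycomp uses: the function name `bayesian_sign_test`, the ROPE width, the prior strength, and the scores are all assumptions chosen for illustration.

```python
import random


def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior_strength=1.0,
                       n_samples=20000, seed=0):
    """Return (P(A better), P(practically equivalent), P(B better))."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per dataset")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n_a = sum(d > rope for d in diffs)       # datasets where A wins by more than the ROPE
    n_b = sum(d < -rope for d in diffs)      # datasets where B wins by more than the ROPE
    n_rope = len(diffs) - n_a - n_b          # differences too small to matter
    eps = 1e-4                               # keeps every Dirichlet parameter positive
    # Assumption: the prior pseudo-count is placed on the ROPE region.
    alphas = [n_a + eps, n_rope + eps + prior_strength, n_b + eps]
    rng = random.Random(seed)
    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Independent gamma draws are an unnormalised Dirichlet sample;
        # normalising would not change which component is largest.
        draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
        wins[draws.index(max(draws))] += 1
    return tuple(w / n_samples for w in wins)


# Made-up accuracies on ten datasets.
model_a = [0.91, 0.87, 0.80, 0.76, 0.93, 0.88, 0.69, 0.95, 0.82, 0.90]
model_b = [0.84, 0.85, 0.71, 0.76, 0.90, 0.80, 0.74, 0.91, 0.79, 0.83]
p_a, p_rope, p_b = bayesian_sign_test(model_a, model_b)
print(f"P(A better)={p_a:.3f}  P(equivalent)={p_rope:.3f}  P(B better)={p_b:.3f}")
```

The three numbers sum to one, so a large "equivalent" share signals that neither model is practically better rather than forcing a winner.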

CLI

# Run all three analyses and write six output files
evaluma report results.csv \
    --model model --dataset dataset --metric metric --score score \
    --output results/

# Individual subcommands
evaluma rank    results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma compare results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma profiles results.csv --model model --dataset dataset --metric metric --score score --output results/

Each subcommand writes a .csv table and a .png figure to --output.

Column mapping

If your CSV uses different column names, pass them explicitly:

evaluma report results.csv \
    --model experiment --dataset task --metric measure --score value \
    --output results/

Or put them in a YAML config file:

# config.yaml
model: experiment
dataset: task
metric: measure
score: value

evaluma report results.csv --config config.yaml --output results/

Lower-is-better metrics

bench = evaluma.load_df(
    "results.csv",
    model="model", dataset="dataset", metric="metric", score="score",
    metric_direction={"rmse": "min"},
)

evaluma report results.csv ... --metric-direction rmse:min
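Direction matters in particular for performance profiles, which Dolan and Moré define on costs where smaller is better. The following is a minimal sketch of that original definition on made-up RMSE values. The helper names `performance_ratios` and `profile` are hypothetical, and the sketch does not show how evaluma converts higher-is-better scores before building profiles.

```python
def performance_ratios(costs):
    """Map model -> per-dataset ratio of its cost to the best cost on that dataset.

    `costs` maps a model name to a list of positive costs, one per dataset,
    where lower is better.
    """
    n_datasets = len(next(iter(costs.values())))
    best = [min(c[p] for c in costs.values()) for p in range(n_datasets)]
    return {m: [c[p] / best[p] for p in range(n_datasets)] for m, c in costs.items()}


def profile(ratios, tau):
    """Fraction of datasets on which the model is within a factor tau of the best."""
    return sum(r <= tau for r in ratios) / len(ratios)


# Made-up RMSE values on four datasets.
rmse = {
    "ModelA": [1.0, 2.0, 4.0, 3.0],
    "ModelB": [2.0, 2.0, 2.0, 6.0],
}
ratios = performance_ratios(rmse)
for model, r in ratios.items():
    print(model, [profile(r, tau) for tau in (1.0, 1.5, 2.0)])
```

The value at tau = 1 is the share of datasets where a model is the best (ties included), and the curve rises toward 1 as the tolerance factor grows.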

Filtering models or datasets

bench_ab = bench.select_models(["ModelA", "ModelB"])
bench_core = bench.select_datasets(["dataset1", "dataset2", "dataset3"])

Input format

evaluma expects a long-format CSV with one row per (model, dataset, metric) combination:

model,dataset,metric,score
ModelA,dataset1,acc,0.91
ModelA,dataset2,acc,0.87
ModelB,dataset1,acc,0.84
...

Multiple seeds are supported: pass --seed seed_col, where seed_col is the name of your seed column, and evaluma averages the scores over seeds before analysis.
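As an illustration of what averaging over seeds amounts to, the sketch below groups a small made-up CSV by (model, dataset, metric) and takes the mean over seeds using only the standard library. The column name `seed` and all values are assumptions for the example, and this is not evaluma's internal code.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Made-up long-format results with a hypothetical "seed" column.
raw = """model,dataset,metric,seed,score
ModelA,dataset1,acc,0,0.90
ModelA,dataset1,acc,1,0.92
ModelB,dataset1,acc,0,0.84
ModelB,dataset1,acc,1,0.86
"""

# Collect every seed's score under its (model, dataset, metric) key.
groups = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    key = (row["model"], row["dataset"], row["metric"])
    groups[key].append(float(row["score"]))

# One mean score per key, which is the shape the analyses consume.
aggregated = {key: fmean(vals) for key, vals in groups.items()}
for key, mean_score in aggregated.items():
    print(*key, round(mean_score, 4))
```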

Contributing

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

# Run tests
pytest --cov=evaluma --cov-report=term-missing

# Lint and format
ruff check .
ruff format .

# Type checking
ty check

Bug reports and pull requests are welcome on GitHub.

License

Apache License 2.0. See LICENSE for the full text.

Citation

If you use evaluma in your research, please cite:

@software{lehmann2026evaluma,
  author  = {Lehmann, Nils},
  title   = {evaluma: ML Benchmark Ranking Tools},
  year    = {2026},
  url     = {https://github.com/nilsleh/evaluma},
  version = {0.1.0},
}

Please also cite the underlying methods and frameworks:

@inproceedings{agarwal2021deep,
  title     = {Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author    = {Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
               and Courville, Aaron and Bellemare, Marc G.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2021},
}

@article{benavoli2017time,
  title   = {Time for a Change: a Tutorial for Comparing Multiple Classifiers
             Through Bayesian Analysis},
  author  = {Benavoli, Alessio and Corani, Giorgio and Dem{\v{s}}ar, Janez
             and Zaffalon, Marco},
  journal = {Journal of Machine Learning Research},
  volume  = {18},
  number  = {77},
  pages   = {1--36},
  year    = {2017},
}

@article{dolan2002benchmarking,
  title   = {Benchmarking Optimization Software with Performance Profiles},
  author  = {Dolan, Elizabeth D. and Mor{\'e}, Jorge J.},
  journal = {Mathematical Programming},
  volume  = {91},
  pages   = {201--213},
  year    = {2002},
}
