ML benchmark ranking tools — IQM, Bayesian pairwise, and Dolan-Moré performance profiles

Maintainers

nilsleh

evaluma

A small Python package for comparing machine learning models across benchmark suites. Given a CSV of per-model, per-dataset scores, evaluma can compute three complementary views of the results:

IQM ranking — interquartile mean with bootstrapped confidence intervals, following Agarwal et al. (2021)
Bayesian pairwise comparison — posterior probabilities that model A beats model B (or is practically equivalent), via baycomp
Dolan-Moré performance profiles — cumulative distribution of performance ratios and area-under-profile scores
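
As a rough illustration of the first view, the interquartile mean drops the lowest and highest 25% of scores and averages the rest, and a bootstrap interval comes from recomputing it on resampled data. The sketch below is standard-library Python, not evaluma's implementation: the names `iqm` and `bootstrap_ci` are hypothetical, the scores are made up, and it uses a plain percentile bootstrap over a flat list, which may differ from the resampling scheme evaluma applies.

```python
import random
import statistics


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    k = int(len(ordered) * 0.25)  # number of values trimmed from each end
    return statistics.fmean(ordered[k:len(ordered) - k])


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the IQM (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = sorted(iqm(rng.choices(scores, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Illustrative scores for one model across eight datasets (made-up values).
scores = [0.91, 0.87, 0.84, 0.79, 0.95, 0.62, 0.88, 0.90]
point = iqm(scores)
low, high = bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because the outlying 0.62 falls in the trimmed quarter, it does not pull the point estimate down the way it would pull a plain mean.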

Documentation

Full documentation, including tutorials and API reference, is available at evaluma.readthedocs.io.

Installation

pip install evaluma

For a development install from source:

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

Quick start

Python API

import evaluma

bench = evaluma.load_df(
    "results.csv",
    model="model",
    dataset="dataset",
    metric="metric",
    score="score",
)

# IQM ranking with 95% bootstrap CI
iqm = bench.iqm_ranking()
print(iqm.table)
fig = iqm.plot()
fig.savefig("iqm.png")

# Bayesian pairwise probabilities
bayes = bench.bayesian_comparison()
print(bayes.table)

# Dolan-Moré performance profiles
profiles = bench.performance_profiles()
fig = profiles.plot()
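
To give a feel for what the Bayesian comparison reports, here is a minimal sketch of the Bayesian sign test described by Benavoli et al., written with the standard library only. evaluma delegates this step to baycomp, so the code below is not its implementation and may not be the test it runs; the function name `bayesian_sign_test`, the 0.01 region of practical equivalence (`rope`), and the scores are assumptions made for illustration.

```python
import random


def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior=1.0,
                       n_samples=20000, seed=0):
    """Return (P(A better), P(equivalent), P(B better)) across datasets."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n_a = sum(d > rope for d in diffs)    # datasets where A is clearly ahead
    n_b = sum(d < -rope for d in diffs)   # datasets where B is clearly ahead
    n_rope = len(diffs) - n_a - n_b       # differences inside the rope
    # Dirichlet posterior; the prior pseudo-count is placed on the rope region.
    alpha = [n_a + 1e-4, n_rope + prior + 1e-4, n_b + 1e-4]
    wins = [0, 0, 0]
    for _ in range(n_samples):
        # Independent gamma draws are an unnormalised Dirichlet sample;
        # normalising would not change which component is largest.
        draw = [rng.gammavariate(a, 1.0) for a in alpha]
        wins[draw.index(max(draw))] += 1
    return tuple(w / n_samples for w in wins)


# Made-up per-dataset accuracies for two models on six datasets.
model_a = [0.91, 0.87, 0.900, 0.78, 0.85, 0.93]
model_b = [0.84, 0.82, 0.895, 0.70, 0.80, 0.88]
p_a, p_rope, p_b = bayesian_sign_test(model_a, model_b)
print(f"P(A better)={p_a:.3f}  P(equivalent)={p_rope:.3f}  P(B better)={p_b:.3f}")
```

The three numbers sum to one, which is what lets the result be read as "A is probably better", "practically equivalent", or "B is probably better".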

CLI

# Run all three analyses and write six output files
evaluma report results.csv \
    --model model --dataset dataset --metric metric --score score \
    --output results/

# Individual subcommands
evaluma rank    results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma compare results.csv --model model --dataset dataset --metric metric --score score --output results/
evaluma profiles results.csv --model model --dataset dataset --metric metric --score score --output results/

Each subcommand writes a .csv table and a .png figure to --output.

Column mapping

If your CSV uses different column names, pass them explicitly:

evaluma report results.csv \
    --model experiment --dataset task --metric measure --score value \
    --output results/

Or put them in a YAML config file:

# config.yaml
model: experiment
dataset: task
metric: measure
score: value

evaluma report results.csv --config config.yaml --output results/

Lower-is-better metrics

bench = evaluma.load_df(
    "results.csv",
    model="model", dataset="dataset", metric="metric", score="score",
    metric_direction={"rmse": "min"},
)

evaluma report results.csv ... --metric-direction rmse:min
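
Metric direction matters in particular for performance profiles: Dolan and Moré define the performance ratio for costs, where lower is better, as a model's value divided by the best value on that problem. For higher-is-better scores, one possible adaptation inverts the ratio. The sketch below shows both cases under the assumption that all scores are strictly positive; the function names and data are hypothetical, and evaluma's own handling of direction may differ.

```python
def performance_ratios(scores, direction="max"):
    """scores: {model: [score per dataset]}, same dataset order for every model."""
    models = list(scores)
    n_datasets = len(scores[models[0]])
    ratios = {m: [] for m in models}
    for j in range(n_datasets):
        column = [scores[m][j] for m in models]
        best = max(column) if direction == "max" else min(column)
        for m in models:
            s = scores[m][j]
            # The ratio is 1.0 for the best model on a dataset and grows
            # as a model falls further behind (assumes positive scores).
            ratios[m].append(best / s if direction == "max" else s / best)
    return ratios


def profile(ratios, tau):
    """Fraction of datasets on which each model is within a factor tau of the best."""
    return {m: sum(r <= tau for r in rs) / len(rs) for m, rs in ratios.items()}


# Made-up RMSE values (lower is better) for two models on three datasets.
rmse = {"ModelA": [0.5, 0.25, 1.0], "ModelB": [0.75, 0.5, 0.5]}
ratios = performance_ratios(rmse, direction="min")
for tau in (1.0, 1.5, 2.0):
    print(tau, profile(ratios, tau))
```

Evaluating `profile` on a grid of `tau` values traces the cumulative curve for each model; the value at `tau = 1.0` is the fraction of datasets on which that model is the best.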

Filtering models or datasets

bench_ab = bench.select_models(["ModelA", "ModelB"])
bench_core = bench.select_datasets(["dataset1", "dataset2", "dataset3"])

Input format

evaluma expects a long-format CSV with one row per (model, dataset, metric) combination:

model,dataset,metric,score
ModelA,dataset1,acc,0.91
ModelA,dataset2,acc,0.87
ModelB,dataset1,acc,0.84
...

Multiple seeds are supported: pass --seed seed_col, where seed_col is the column that identifies the seed, and evaluma averages the scores over seeds before analysis.
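
The effect of averaging over seeds can be sketched with the standard library. This only illustrates the aggregation step, not evaluma's code; the column name `seed` and the scores are assumptions.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Long-format results with an extra seed column (made-up values).
raw = """model,dataset,metric,seed,score
ModelA,dataset1,acc,0,0.90
ModelA,dataset1,acc,1,0.92
ModelB,dataset1,acc,0,0.84
ModelB,dataset1,acc,1,0.86
"""

# Collect every seed's score under its (model, dataset, metric) key.
groups = defaultdict(list)
for row in csv.DictReader(io.StringIO(raw)):
    key = (row["model"], row["dataset"], row["metric"])
    groups[key].append(float(row["score"]))

# One mean score per (model, dataset, metric), as in the seedless input format.
aggregated = {key: fmean(values) for key, values in groups.items()}
for (model, dataset, metric), score in aggregated.items():
    print(model, dataset, metric, round(score, 4))
```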

Contributing

git clone https://github.com/nilsleh/evaluma
cd evaluma
pip install -e ".[dev]"

# Run tests
pytest --cov=evaluma --cov-report=term-missing

# Lint and format
ruff check .
ruff format .

# Type checking
ty check

Bug reports and pull requests are welcome on GitHub.

License

Apache License 2.0. See LICENSE for the full text.

Citation

If you use evaluma in your research, please cite:

@software{lehmann2026evaluma,
  author  = {Lehmann, Nils},
  title   = {evaluma: ML Benchmark Ranking Tools},
  year    = {2026},
  url     = {https://github.com/nilsleh/evaluma},
  version = {0.1.0},
}

Please also cite the underlying methods and frameworks:

@inproceedings{agarwal2021deep,
  title     = {Deep Reinforcement Learning at the Edge of the Statistical Precipice},
  author    = {Agarwal, Rishabh and Schwarzer, Max and Castro, Pablo Samuel
               and Courville, Aaron and Bellemare, Marc G.},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2021},
}

@article{benavoli2017time,
  title   = {Time for a Change: a Tutorial for Comparing Multiple Classifiers
             Through Bayesian Analysis},
  author  = {Benavoli, Alessio and Corani, Giorgio and Dem{\v{s}}ar, Janez
             and Zaffalon, Marco},
  journal = {Journal of Machine Learning Research},
  volume  = {18},
  number  = {77},
  pages   = {1--36},
  year    = {2017},
}

@article{dolan2002benchmarking,
  title   = {Benchmarking Optimization Software with Performance Profiles},
  author  = {Dolan, Elizabeth D. and Mor{\'e}, Jorge J.},
  journal = {Mathematical Programming},
  volume  = {91},
  pages   = {201--213},
  year    = {2002},
}

Release history

0.2.0 (Jun 12, 2026), this version
0.1.0 (May 7, 2026)

Download files

Source Distribution

evaluma-0.2.0.tar.gz (584.6 kB), uploaded Jun 12, 2026, Source

Built Distribution

evaluma-0.2.0-py3-none-any.whl (44.5 kB), uploaded Jun 12, 2026, Python 3

File details

Details for the file evaluma-0.2.0.tar.gz.

File metadata

Download URL: evaluma-0.2.0.tar.gz
Upload date: Jun 12, 2026
Size: 584.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evaluma-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`67b8e4884273a6b9f7d56bbc2e91b9d009f66c24ff5a1b9812c92e4e588eefc2`
MD5	`ebd6598fc93beabbeee51931d3d64165`
BLAKE2b-256	`d5fb0b36acd8a6a1934d4dcdf255e419d9dbae8fce70c49ad887b96b15fb6550`

Provenance

The following attestation bundles were made for evaluma-0.2.0.tar.gz:

Publisher: ci.yml on nilsleh/evaluma

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evaluma-0.2.0.tar.gz
- Subject digest: 67b8e4884273a6b9f7d56bbc2e91b9d009f66c24ff5a1b9812c92e4e588eefc2
- Sigstore transparency entry: 1802588655
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: nilsleh/evaluma@db640d12eeaf73378995595a4f0cecd0b95e3043
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/nilsleh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@db640d12eeaf73378995595a4f0cecd0b95e3043
- Trigger Event: push

File details

Details for the file evaluma-0.2.0-py3-none-any.whl.

File metadata

Download URL: evaluma-0.2.0-py3-none-any.whl
Upload date: Jun 12, 2026
Size: 44.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for evaluma-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f702c1f2282173fefcb2cc70386d2c0b1e9f0ac1f7d52eacd3a41902cf19ed58`
MD5	`ede5f81a5cbd022d385b61e3f06b129c`
BLAKE2b-256	`808cef079b13b29fcccdf3a9c418b46f113ec1eefdeef7c1adaacfb09caf55fe`

Provenance

The following attestation bundles were made for evaluma-0.2.0-py3-none-any.whl:

Publisher: ci.yml on nilsleh/evaluma

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: evaluma-0.2.0-py3-none-any.whl
- Subject digest: f702c1f2282173fefcb2cc70386d2c0b1e9f0ac1f7d52eacd3a41902cf19ed58
- Sigstore transparency entry: 1802588965
- Sigstore integration time: Jun 12, 2026
Source repository:
- Permalink: nilsleh/evaluma@db640d12eeaf73378995595a4f0cecd0b95e3043
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/nilsleh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@db640d12eeaf73378995595a4f0cecd0b95e3043
- Trigger Event: push
