Bayesian evaluation and ranking toolkit

Scorio

scorio implements the Bayes@N framework introduced in the paper "Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation".


Installation

# Install from PyPI
pip install scorio

# Install latest from GitHub
pip install "git+https://github.com/mohsenhariri/scorio.git"

# Install a specific tag
pip install "git+https://github.com/mohsenhariri/scorio.git@v0.2.0"

# Install from local repository
pip install -e .

Requires Python 3.10+, NumPy, SciPy.
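
To confirm which version ended up in the active environment, a small standard-library check can help. This is a sketch only: `installed_version` is a hypothetical helper written for this example, not part of scorio.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed scorio version, or None if scorio is not installed.
print(installed_version("scorio"))
```

A `None` result usually means pip installed the package into a different interpreter than the one running the check.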

Data and shape conventions

  • Categories: encode outcomes per trial as integers in {0, ..., C}.
  • Weights: choose rubric weights w of length C+1 (e.g., [0, 1] for binary outcomes).
  • Shapes: R is M x N (M questions, N trials per question); R0, if provided, is M x D. Both must have the same number of rows M and use the same category set.
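
The conventions above can be checked before calling the library. The sketch below uses NumPy only; the label names and the helpers `encode_outcomes` and `check_shapes` are hypothetical, introduced for illustration, and scorio may perform its own validation.

```python
import numpy as np

# Hypothetical label-to-category mapping; the label names are illustrative only.
LABELS = {"incorrect": 0, "partial": 1, "correct": 2}


def encode_outcomes(labels):
    """Map an M x N nested list of string labels to an integer array."""
    return np.array([[LABELS[x] for x in row] for row in labels], dtype=int)


def check_shapes(R, w, R0=None):
    """Check R, w, and R0 against the conventions above; return (M, N, C)."""
    R = np.asarray(R)
    w = np.asarray(w, dtype=float)
    C = w.size - 1  # w has one weight per category 0..C
    if R.ndim != 2 or not np.issubdtype(R.dtype, np.integer):
        raise ValueError("R must be a 2-D integer array (M x N)")
    if R.min() < 0 or R.max() > C:
        raise ValueError("entries of R must lie in {0, ..., C}")
    if R0 is not None:
        R0 = np.asarray(R0)
        if R0.ndim != 2 or R0.shape[0] != R.shape[0]:
            raise ValueError("R0 must be M x D with the same M as R")
        if R0.min() < 0 or R0.max() > C:
            raise ValueError("entries of R0 must lie in {0, ..., C}")
    return R.shape[0], R.shape[1], C


raw = [
    ["incorrect", "partial", "correct", "correct", "partial"],
    ["partial", "partial", "incorrect", "correct", "correct"],
]
R = encode_outcomes(raw)
w = np.array([0.0, 0.5, 1.0])
M, N, C = check_shapes(R, w, R0=np.array([[0, 2], [1, 2]]))
print(M, N, C)  # 2 5 2
```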

APIs

  • scorio.eval.bayes(R, w, R0=None) -> (mu: float, sigma: float)
    • R: M x N int array with entries in {0, ..., C}
    • w: length C+1 float array of rubric weights
    • R0 (optional): M x D int array of prior outcomes (same category set as R)
    • Returns the posterior mean mu of the rubric-weighted performance and its uncertainty sigma (the posterior standard deviation).
  • scorio.eval.avg(R) -> float
    • Returns the naive mean of elements in R. For binary accuracy, encode incorrect=0 and correct=1.
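
As a point of reference for the naive mean, here is the same quantity computed with plain NumPy. This does not call scorio; it only illustrates what "mean of the elements of R" works out to, and how it differs from a rubric-weighted mean when R has more than two categories.

```python
import numpy as np

# Binary outcomes: the mean of the 0/1 entries is the accuracy.
R_binary = np.array([[1, 0, 1, 1],
                     [0, 0, 1, 1]])
binary_mean = float(np.mean(R_binary))
print(binary_mean)  # 0.625

# Three categories: the mean of the raw category indices is not a score in [0, 1].
R_multi = np.array([[0, 1, 2, 2, 1],
                    [1, 1, 0, 2, 2]])
w = np.array([0.0, 0.5, 1.0])
raw_mean = float(np.mean(R_multi))        # mean of category indices
weighted_mean = float(np.mean(w[R_multi]))  # mean of rubric weights
print(raw_mean, weighted_mean)  # 1.2 0.6
```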

How to use

import numpy as np
# Note: this import shadows Python's built-in eval() within this module
from scorio import eval

# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = np.array([[0, 1, 2, 2, 1], [1, 1, 0, 2, 2]])

# Rubric weights w: length C+1
# Here: 0=incorrect(0.0), 1=partial(0.5), 2=correct(1.0)
w = np.array([0.0, 0.5, 1.0])

# Optional prior outcomes R0: shape (M, D)
R0 = np.array([[0, 2], [1, 2]])

# Bayesian evaluation with prior
mu, sigma = eval.bayes(R, w, R0)
print(f"mu = {mu:.6f}, sigma = {sigma:.6f}")
# Expected: mu ~ 0.575, sigma ~ 0.084275

# Bayesian evaluation without prior
mu2, sigma2 = eval.bayes(R, w)
print(f"mu = {mu2:.6f}, sigma = {sigma2:.6f}")
# Expected: mu ~ 0.5625, sigma ~ 0.091998

# Simple average of the raw entries of R (equals accuracy only for 0/1 outcomes)
average = eval.avg(R)
print(f"Average: {average:.6f}")
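
The expected values in the example above can be reproduced by hand with a short NumPy sketch. It assumes a uniform Dirichlet prior (one pseudo-count per category) on each question's category probabilities, with the R0 outcomes added as extra counts. The function `bayes_sketch` is hypothetical and written for this illustration; it is not the library's implementation, though under these assumptions it matches the numbers quoted above.

```python
import numpy as np


def bayes_sketch(R, w, R0=None):
    """Closed-form posterior mean and std, assuming a uniform Dirichlet prior."""
    R = np.asarray(R, dtype=int)
    w = np.asarray(w, dtype=float)
    M = R.shape[0]
    K = w.size  # number of categories, C + 1

    # Per-question category counts from the observed trials.
    counts = np.stack([np.bincount(row, minlength=K) for row in R]).astype(float)
    if R0 is not None:
        # Prior outcomes are added as extra counts.
        R0 = np.asarray(R0, dtype=int)
        counts += np.stack([np.bincount(row, minlength=K) for row in R0])

    nu = counts + 1.0           # Dirichlet parameters: counts plus one pseudo-count
    T = nu.sum(axis=1)          # total concentration per question
    p = nu / T[:, None]         # posterior mean category probabilities

    mean_q = p @ w                               # per-question posterior mean score
    var_q = (p @ w**2 - mean_q**2) / (T + 1.0)   # per-question posterior variance

    mu = mean_q.mean()                 # average over the M questions
    sigma = np.sqrt(var_q.sum()) / M   # questions treated as independent
    return float(mu), float(sigma)


R = np.array([[0, 1, 2, 2, 1], [1, 1, 0, 2, 2]])
w = np.array([0.0, 0.5, 1.0])
R0 = np.array([[0, 2], [1, 2]])

mu, sigma = bayes_sketch(R, w, R0)
print(f"mu = {mu:.6f}, sigma = {sigma:.6f}")    # mu = 0.575000, sigma = 0.084275

mu2, sigma2 = bayes_sketch(R, w)
print(f"mu = {mu2:.6f}, sigma = {sigma2:.6f}")  # mu = 0.562500, sigma = 0.091998
```

In this example the prior outcomes shift mu slightly and shrink sigma, since every question gains two extra observations.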

Citing

If you use scorio in your research, please cite:

@inproceedings{hariri2026don,
  title={Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},
  author={Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2510.04265}
}
@article{hariri2026ranking,
  title={Ranking Reasoning LLMs under Test-Time Scaling},
  author={Hariri, Mohsen and Hinczewski, Michael and Ma, Jing and Chaudhary, Vipin},
  journal={arXiv preprint arXiv:2510.04265},
  year={2026},
  url={https://arxiv.org/abs/2510.04265}
}

License

MIT License. See the LICENSE file for details.
