Tools for measuring sensitivity and diversity of multi-task benchmarks.

Project description

BenchBench is a Python package for evaluating multi-task benchmarks. It measures a benchmark's diversity and its sensitivity to irrelevant variations, such as injected label noise or the addition of irrelevant candidate models. Viewing multi-task benchmarks through a social choice lens, the package exposes the fundamental trade-off between diversity and stability in both cardinal and ordinal benchmarks.
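To make "sensitivity to irrelevant candidate models" concrete, here is a minimal sketch of an ordinal aggregation whose verdict on two models changes when a third model is added. It is not BenchBench's sensitivity measure. The helper `mean_rank_order` and all scores are hypothetical and chosen only for illustration; the aggregation rule is plain mean rank across tasks.

```python
from statistics import mean


def mean_rank_order(scores):
    """Order models by mean per-task rank (1 = best); lower mean rank wins.

    `scores` maps a model name to its list of per-task scores (higher is better).
    """
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    ranks = {m: [] for m in models}
    for t in range(n_tasks):
        # Rank the models on task t, best score first.
        ordered = sorted(models, key=lambda m: scores[m][t], reverse=True)
        for position, m in enumerate(ordered, start=1):
            ranks[m].append(position)
    return sorted(models, key=lambda m: mean(ranks[m]))


# Hypothetical scores on five tasks.
two_models = {
    "A": [0.9, 0.9, 0.9, 0.1, 0.1],
    "B": [0.8, 0.8, 0.8, 0.9, 0.9],
}
# Same scores for A and B, plus a third model C that never ranks first.
three_models = dict(two_models, C=[0.1, 0.1, 0.1, 0.5, 0.5])

print(mean_rank_order(two_models))    # A is ranked ahead of B
print(mean_rank_order(three_models))  # B is now ranked ahead of A
```

Neither A's nor B's scores change between the two calls. Only the presence of C does, yet the relative order of A and B flips. This is the kind of instability the description refers to.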

For more information, including the motivations behind the measures and our empirical findings, please see our paper.

Quick Start

To install the package, simply run:

pip install benchbench

Example Usage

To evaluate a cardinal benchmark, you can use the following code:

from benchbench.data import load_cardinal_benchmark
from benchbench.measures.cardinal import get_diversity, get_sensitivity

# Load the GLUE results together with the list of task columns.
data, cols = load_cardinal_benchmark('GLUE')
# Compute the two measures on the selected task columns.
diversity = get_diversity(data, cols)
sensitivity = get_sensitivity(data, cols)

To evaluate an ordinal benchmark, you can use the following code:

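from benchbench.data import load_ordinal_benchmark
from benchbench.measures.ordinal import get_diversity, get_sensitivity

# Load the HELM accuracy results together with the list of task columns.
data, cols = load_ordinal_benchmark('HELM-accuracy')
# Compute the two measures on the selected task columns.
diversity = get_diversity(data, cols)
sensitivity = get_sensitivity(data, cols)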

To use your own benchmark, provide a pandas DataFrame and a list of the columns that hold the task results. See the documentation for details.
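As a rough sketch of what a custom benchmark could look like, the snippet below builds a small DataFrame and the matching list of task columns. The layout is an assumption, not something this page specifies: one row per model, one numeric column per task, and a `model` identifier column that is left out of `cols`. All names and scores are hypothetical, and `evaluate` is an illustrative wrapper around the calls from the cardinal example above. Check the documentation for the exact format expected.

```python
import pandas as pd

# Hypothetical scores: one row per model, one column per task (assumed layout).
data = pd.DataFrame(
    {
        "model": ["model-a", "model-b", "model-c"],
        "task_1": [0.81, 0.77, 0.69],
        "task_2": [0.64, 0.70, 0.58],
        "task_3": [0.92, 0.88, 0.90],
    }
)

# Task columns: every column except the model identifier.
cols = [c for c in data.columns if c != "model"]


def evaluate(data, cols):
    """Illustrative wrapper: run the cardinal measures on a custom benchmark."""
    # Imported lazily so the DataFrame above can be built without benchbench.
    from benchbench.measures.cardinal import get_diversity, get_sensitivity

    return get_diversity(data, cols), get_sensitivity(data, cols)
```

With benchbench installed, `evaluate(data, cols)` mirrors the cardinal example. For rank-based data, import from `benchbench.measures.ordinal` instead.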

Reproduce the Paper

To reproduce our results, open cardinal.ipynb, ordinal.ipynb, and banner.ipynb in Google Colab; each notebook runs with one click.
