
Benchmark and regression gate tooling for Refua workflows.

Project description

refua-bench

refua-bench is a standalone benchmark and regression-gating project for Refua model workflows. It benchmarks current and future models through adapter interfaces and enforces statistically grounded regression gates.

What It Provides

  • Benchmark suite schema (YAML/JSON) for tasks, metrics, tolerances, and case sets.
  • Adapter system for model execution:
    • golden: uses expected outputs (sanity checks)
    • file: reads predictions from a JSON artifact
    • command: calls any executable that reads JSON stdin and returns JSON stdout
    • custom adapters via module.path:AdapterClass
  • Run artifacts in JSON and Markdown.
  • Automatic run provenance capture (git/runtime/model/dependencies).
  • Statistical regression gating (minimum practical effect + bootstrap confidence intervals).
  • Baseline registry with named baselines and safe promotion flow.
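The `module.path:AdapterClass` hook above suggests a pluggable adapter interface. As a hedged illustration only, a custom adapter might look like the sketch below; the method name, signature, and config handling are assumptions, so check the refua-bench adapter base class for the actual contract.

```python
# Hypothetical custom adapter sketch. The module.path:AdapterClass hook is
# from the docs above, but the method name and signature here are assumed.

class EchoAdapter:
    """Toy adapter that predicts a constant value for every case."""

    def __init__(self, config=None):
        # Adapter config as loaded from --adapter-config (assumed to be a dict).
        self.config = config or {}

    def predict(self, task_id, case_id, prediction_key, case_input):
        # Return a prediction payload keyed by the task's prediction_key.
        return {prediction_key: self.config.get("constant", 0.0)}
```

Such a class would be referenced on the command line as, e.g., `--adapter my_pkg.adapters:EchoAdapter`.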

Install

cd refua-bench
poetry install

Build

poetry build

CLI

poetry run refua-bench --help

1. Run a benchmark

poetry run refua-bench run \
  --suite benchmarks/sample_suite.yaml \
  --adapter file \
  --adapter-config benchmarks/sample_file_adapter_config.yaml \
  --model-name boltz2-affinity \
  --model-version 2026-02-12 \
  --output artifacts/candidate_run.json \
  --markdown artifacts/candidate_run.md

By default, each run artifact records provenance under the run.provenance key.
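Since run artifacts are plain JSON, the provenance block can be inspected with the standard library. The sketch below assumes the artifact's top-level key is "provenance" per the note above; the nested section names (git, runtime, and so on) are illustrative assumptions.

```python
import json

def show_provenance(run_path):
    """Print and return the provenance block recorded in a run artifact.
    Assumes a top-level "provenance" key; nested layout may vary."""
    with open(run_path) as fh:
        run = json.load(fh)
    provenance = run.get("provenance", {})
    for section, details in provenance.items():
        print(f"{section}: {details}")
    return provenance
```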

2. Compare candidate vs baseline

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare.json \
  --markdown artifacts/compare.md

3. Statistical gating

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare_stats.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 2000 \
  --confidence-level 0.95 \
  --bootstrap-seed 7 \
  --fail-on-uncertain

Interpretation:

  • --min-effect-size: treats differences smaller than this threshold as practically insignificant, so tiny fluctuations do not fail the gate.
  • --bootstrap-resamples: enables bootstrap confidence-interval robustness checks on per-case differences.
  • --fail-on-uncertain: optional strict mode that fails the gate when a task's bootstrap result is inconclusive.
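To make the gating logic concrete, here is a minimal sketch of the combined minimum-effect plus percentile-bootstrap idea. It is not refua-bench's implementation: it assumes higher scores are better, uses a simple percentile interval over per-case differences, and omits per-metric direction handling.

```python
import random

def bootstrap_gate(baseline, candidate, min_effect=0.02,
                   resamples=2000, confidence=0.95, seed=7):
    """Sketch of statistical gating: a change only counts as a regression
    if the mean degradation exceeds min_effect AND the bootstrap
    confidence interval for the difference lies entirely below zero.
    Assumes higher = better; a real tool handles metric direction."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    mean_diff = sum(diffs) / len(diffs)

    # Percentile bootstrap over per-case differences.
    means = []
    for _ in range(resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo_ci = means[int((1 - confidence) / 2 * resamples)]
    hi_ci = means[int((1 + confidence) / 2 * resamples) - 1]

    regression = mean_diff < -min_effect and hi_ci < 0
    uncertain = lo_ci < 0 < hi_ci  # CI straddles zero: inconclusive
    return {"mean_diff": mean_diff, "ci": (lo_ci, hi_ci),
            "regression": regression, "uncertain": uncertain}
```

Under this sketch, --fail-on-uncertain would fail the gate when `uncertain` is true rather than only when `regression` is true.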

4. Run + compare in one command (gate)

poetry run refua-bench gate \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --adapter file \
  --adapter-config benchmarks/sample_file_adapter_config.yaml \
  --candidate-output artifacts/candidate_run.json \
  --output artifacts/gate_report.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 1000

5. Baseline registry and promotion

Promote an initial baseline:

poetry run refua-bench baseline promote \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable \
  --candidate benchmarks/sample_baseline_run.json

Compare against named baseline:

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --registry artifacts/baseline_registry.json \
  --baseline-name stable \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare_named.json

Promote a new candidate safely (fails if regression is detected):

poetry run refua-bench baseline promote \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable \
  --candidate artifacts/candidate_run.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 2000

List/resolve baselines:

poetry run refua-bench baseline list --registry artifacts/baseline_registry.json
poetry run refua-bench baseline resolve \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable

6. Scaffold a new suite

poetry run refua-bench init --directory benchmarks/new_suite --name refua-next

Suite Schema

name: refua-core-smoke
version: 1.0.0
description: smoke checks
tasks:
  - id: affinity_mae
    metric: mae
    prediction_key: affinity
    expected_key: affinity  # optional, defaults to prediction_key
    regression_tolerance: 0.05
    weight: 2.0
    positive_label: 1       # used by f1/enrichment_factor/bedroc
    enrichment_fraction: 0.01  # used by enrichment_factor/ef
    bedroc_alpha: 20.0  # used by bedroc
    cases:
      - id: case_1
        input: {target: KRAS, ligand: MRTX1133}
        expected: {affinity: -9.3}

Supported metrics:

  • mae
  • rmse
  • accuracy
  • exact_match
  • f1 (binary)
  • enrichment_factor / ef (binary labels + ranking scores)
  • bedroc (binary labels + ranking scores with early-recognition emphasis)
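For reference, two of these metrics can be sketched from their standard definitions. These are the textbook formulas, not necessarily refua-bench's exact implementations (tie-breaking and edge-case handling may differ).

```python
def mae(expected, predicted):
    """Mean absolute error over paired values."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)

def enrichment_factor(labels, scores, fraction=0.01, positive_label=1):
    """Standard enrichment factor: the hit rate among the top-scoring
    fraction of cases divided by the overall hit rate."""
    n = len(labels)
    top_n = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    top_hits = sum(1 for _, lab in ranked[:top_n] if lab == positive_label)
    total_hits = sum(1 for lab in labels if lab == positive_label)
    if total_hits == 0:
        return 0.0
    return (top_hits / top_n) / (total_hits / n)
```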

Prediction File Format (file adapter)

{
  "affinity_mae": {
    "case_1": {"affinity": -9.1}
  }
}
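Prediction artifacts in this layout (task id, then case id, then prediction payload) are easy to generate programmatically. A small helper, assuming nothing beyond the format shown above:

```python
import json

def write_predictions(path, predictions):
    """Write a predictions artifact in the file-adapter layout:
    {task_id: {case_id: {prediction_key: value}}}."""
    with open(path, "w") as fh:
        json.dump(predictions, fh, indent=2)

preds = {"affinity_mae": {"case_1": {"affinity": -9.1}}}
```

The resulting file is what `--adapter file` consumes via its adapter config.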

Command Adapter Contract

Input (stdin):

{
  "task_id": "affinity_mae",
  "prediction_key": "affinity",
  "case_id": "case_1",
  "input": {"target": "KRAS", "ligand": "MRTX1133"}
}

Output (stdout):

{"affinity": -9.2}
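A command adapter is any executable honoring this stdin/stdout contract. The sketch below returns a constant as a placeholder where a real adapter would run the model on request["input"]:

```python
"""Minimal command-adapter executable: one JSON request on stdin,
one JSON prediction on stdout. The constant value is a placeholder."""
import json
import sys

def predict(request):
    # A real adapter would run the model on request["input"] here.
    return {request["prediction_key"]: -9.2}

def main(stdin=sys.stdin, stdout=sys.stdout):
    request = json.load(stdin)
    json.dump(predict(request), stdout)

# In a real adapter script, run main() when executed:
#   if __name__ == "__main__":
#       main()
```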

Tests

poetry run pytest
