
Benchmark and regression gate tooling for Refua workflows.

Project description

refua-bench

refua-bench is a standalone benchmark and regression-gating project for Refua model workflows. It benchmarks current and future models through adapter interfaces and enforces statistically grounded regression gates.

What It Provides

  • Benchmark suite schema (YAML/JSON) for tasks, metrics, tolerances, and case sets.
  • Adapter system for model execution:
    • golden: uses expected outputs (sanity checks)
    • file: reads predictions from a JSON artifact
    • command: calls any executable that reads JSON stdin and returns JSON stdout
    • custom adapters via module.path:AdapterClass
  • Run artifacts in JSON and Markdown.
  • Automatic run provenance capture (git/runtime/model/dependencies).
  • Statistical regression gating (minimum practical effect + bootstrap confidence intervals).
  • Baseline registry with named baselines and safe promotion flow.
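The `module.path:AdapterClass` hook above suggests a pluggable adapter interface. As a hedged illustration only, a custom adapter might look like the sketch below; the method name, signature, and config handling are assumptions, so check the refua-bench adapter base class for the actual contract.

```python
# Hypothetical custom adapter sketch. The module.path:AdapterClass hook is
# from the docs above, but the method name and signature here are assumed.

class EchoAdapter:
    """Toy adapter that predicts a constant value for every case."""

    def __init__(self, config=None):
        # Adapter config as loaded from --adapter-config (assumed to be a dict).
        self.config = config or {}

    def predict(self, task_id, case_id, prediction_key, case_input):
        # Return a prediction payload keyed by the task's prediction_key.
        return {prediction_key: self.config.get("constant", 0.0)}
```

Such a class would be referenced on the command line as, e.g., `--adapter my_pkg.adapters:EchoAdapter`.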

Install

cd refua-bench
poetry install

Build

poetry build

CLI

poetry run refua-bench --help

1. Run a benchmark

poetry run refua-bench run \
  --suite benchmarks/sample_suite.yaml \
  --adapter file \
  --adapter-config benchmarks/sample_file_adapter_config.yaml \
  --model-name boltz2-affinity \
  --model-version 2026-02-12 \
  --output artifacts/candidate_run.json \
  --markdown artifacts/candidate_run.md

By default, each run artifact records provenance under the run.provenance key.
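Since run artifacts are plain JSON, the provenance block can be inspected with the standard library. The sketch below assumes the artifact's top-level key is "provenance" per the note above; the nested section names (git, runtime, and so on) are illustrative assumptions.

```python
import json

def show_provenance(run_path):
    """Print and return the provenance block recorded in a run artifact.
    Assumes a top-level "provenance" key; nested layout may vary."""
    with open(run_path) as fh:
        run = json.load(fh)
    provenance = run.get("provenance", {})
    for section, details in provenance.items():
        print(f"{section}: {details}")
    return provenance
```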

2. Compare candidate vs baseline

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare.json \
  --markdown artifacts/compare.md

3. Statistical gating

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare_stats.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 2000 \
  --confidence-level 0.95 \
  --bootstrap-seed 7 \
  --fail-on-uncertain

Interpretation:

  • --min-effect-size: treats differences smaller than this threshold as practically insignificant, so tiny fluctuations do not fail the gate.
  • --bootstrap-resamples: enables bootstrap confidence-interval robustness checks on per-case differences.
  • --fail-on-uncertain: optional strict mode that fails the gate when a task's bootstrap result is inconclusive.
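To make the gating logic concrete, here is a minimal sketch of the combined minimum-effect plus percentile-bootstrap idea. It is not refua-bench's implementation: it assumes higher scores are better, uses a simple percentile interval over per-case differences, and omits per-metric direction handling.

```python
import random

def bootstrap_gate(baseline, candidate, min_effect=0.02,
                   resamples=2000, confidence=0.95, seed=7):
    """Sketch of statistical gating: a change only counts as a regression
    if the mean degradation exceeds min_effect AND the bootstrap
    confidence interval for the difference lies entirely below zero.
    Assumes higher = better; a real tool handles metric direction."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    mean_diff = sum(diffs) / len(diffs)

    # Percentile bootstrap over per-case differences.
    means = []
    for _ in range(resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo_ci = means[int((1 - confidence) / 2 * resamples)]
    hi_ci = means[int((1 + confidence) / 2 * resamples) - 1]

    regression = mean_diff < -min_effect and hi_ci < 0
    uncertain = lo_ci < 0 < hi_ci  # CI straddles zero: inconclusive
    return {"mean_diff": mean_diff, "ci": (lo_ci, hi_ci),
            "regression": regression, "uncertain": uncertain}
```

Under this sketch, --fail-on-uncertain would fail the gate when `uncertain` is true rather than only when `regression` is true.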

4. Run + compare in one command (gate)

poetry run refua-bench gate \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --adapter file \
  --adapter-config benchmarks/sample_file_adapter_config.yaml \
  --candidate-output artifacts/candidate_run.json \
  --output artifacts/gate_report.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 1000

5. Baseline registry and promotion

Promote an initial baseline:

poetry run refua-bench baseline promote \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable \
  --candidate benchmarks/sample_baseline_run.json

Compare against named baseline:

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --registry artifacts/baseline_registry.json \
  --baseline-name stable \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare_named.json

Promote a new candidate safely (fails if regression is detected):

poetry run refua-bench baseline promote \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable \
  --candidate artifacts/candidate_run.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 2000

List/resolve baselines:

poetry run refua-bench baseline list --registry artifacts/baseline_registry.json
poetry run refua-bench baseline resolve \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable

6. Scaffold a new suite

poetry run refua-bench init --directory benchmarks/new_suite --name refua-next

Suite Schema

name: refua-core-smoke
version: 1.0.0
description: smoke checks
tasks:
  - id: affinity_mae
    metric: mae
    prediction_key: affinity
    expected_key: affinity  # optional, defaults to prediction_key
    regression_tolerance: 0.05
    weight: 2.0
    positive_label: 1       # used by f1/enrichment_factor/bedroc
    enrichment_fraction: 0.01  # used by enrichment_factor/ef
    bedroc_alpha: 20.0  # used by bedroc
    cases:
      - id: case_1
        input: {target: KRAS, ligand: MRTX1133}
        expected: {affinity: -9.3}

Supported metrics:

  • mae
  • rmse
  • accuracy
  • exact_match
  • f1 (binary)
  • enrichment_factor / ef (binary labels + ranking scores)
  • bedroc (binary labels + ranking scores with early-recognition emphasis)
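For reference, two of these metrics can be sketched from their standard definitions. These are the textbook formulas, not necessarily refua-bench's exact implementations (tie-breaking and edge-case handling may differ).

```python
def mae(expected, predicted):
    """Mean absolute error over paired values."""
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)

def enrichment_factor(labels, scores, fraction=0.01, positive_label=1):
    """Standard enrichment factor: the hit rate among the top-scoring
    fraction of cases divided by the overall hit rate."""
    n = len(labels)
    top_n = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    top_hits = sum(1 for _, lab in ranked[:top_n] if lab == positive_label)
    total_hits = sum(1 for lab in labels if lab == positive_label)
    if total_hits == 0:
        return 0.0
    return (top_hits / top_n) / (total_hits / n)
```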

Prediction File Format (file adapter)

{
  "affinity_mae": {
    "case_1": {"affinity": -9.1}
  }
}
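Prediction artifacts in this layout (task id, then case id, then prediction payload) are easy to generate programmatically. A small helper, assuming nothing beyond the format shown above:

```python
import json

def write_predictions(path, predictions):
    """Write a predictions artifact in the file-adapter layout:
    {task_id: {case_id: {prediction_key: value}}}."""
    with open(path, "w") as fh:
        json.dump(predictions, fh, indent=2)

preds = {"affinity_mae": {"case_1": {"affinity": -9.1}}}
```

The resulting file is what `--adapter file` consumes via its adapter config.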

Command Adapter Contract

Input (stdin):

{
  "task_id": "affinity_mae",
  "prediction_key": "affinity",
  "case_id": "case_1",
  "input": {"target": "KRAS", "ligand": "MRTX1133"}
}

Output (stdout):

{"affinity": -9.2}
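A command adapter is any executable honoring this stdin/stdout contract. The sketch below returns a constant as a placeholder where a real adapter would run the model on request["input"]:

```python
"""Minimal command-adapter executable: one JSON request on stdin,
one JSON prediction on stdout. The constant value is a placeholder."""
import json
import sys

def predict(request):
    # A real adapter would run the model on request["input"] here.
    return {request["prediction_key"]: -9.2}

def main(stdin=sys.stdin, stdout=sys.stdout):
    request = json.load(stdin)
    json.dump(predict(request), stdout)

# In a real adapter script, run main() when executed:
#   if __name__ == "__main__":
#       main()
```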

Tests

poetry run pytest
