Benchmark and regression gate tooling for Refua workflows.

refua-bench

refua-bench is a standalone benchmark and regression-gating project for Refua model workflows. It benchmarks current and future models through adapter interfaces and enforces statistically grounded regression gates.

What It Provides

  • Benchmark suite schema (YAML/JSON) for tasks, metrics, tolerances, and case sets.
  • Adapter system for model execution:
    • golden: replays the expected outputs (sanity checks)
    • file: reads predictions from a JSON artifact
    • command: calls any executable that reads a JSON request on stdin and writes a JSON prediction to stdout
    • custom adapters via module.path:AdapterClass
  • Run artifacts in JSON and Markdown.
  • Automatic run provenance capture (git/runtime/model/dependencies).
  • Statistical regression gating (minimum practical effect size plus bootstrap confidence intervals).
  • Baseline registry with named baselines and a safe promotion flow.

Install

cd refua-bench
poetry install

Build

poetry build

CLI

poetry run refua-bench --help

1. Run a benchmark

poetry run refua-bench run \
  --suite benchmarks/sample_suite.yaml \
  --adapter file \
  --adapter-config benchmarks/sample_file_adapter_config.yaml \
  --model-name boltz2-affinity \
  --model-version 2026-02-12 \
  --output artifacts/candidate_run.json \
  --markdown artifacts/candidate_run.md

By default, each run stores provenance in run.provenance.
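The exact provenance field names are not documented on this page; a hypothetical shape, grouped by the categories listed earlier (git/runtime/model/dependencies), might look like:

{
  "git": {"commit": "…", "dirty": false},
  "runtime": {"python": "3.12.1", "platform": "linux"},
  "model": {"name": "boltz2-affinity", "version": "2026-02-12"},
  "dependencies": {"refua-bench": "0.7.1"}
}

Inspect run.provenance in an actual run artifact for the authoritative layout.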

2. Compare candidate vs baseline

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare.json \
  --markdown artifacts/compare.md

3. Statistical gating

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare_stats.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 2000 \
  --confidence-level 0.95 \
  --bootstrap-seed 7 \
  --fail-on-uncertain

Interpretation:

  • --min-effect-size: ignores changes too small to matter in practice.
  • --bootstrap-resamples: number of resamples used to build the confidence intervals for robustness checks.
  • --fail-on-uncertain: optional strict mode that fails tasks whose bootstrap result is inconclusive.
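The interplay of these three flags can be sketched as follows. This is an illustrative reconstruction, not the project's actual implementation: the function names (bootstrap_ci, gate_decision) are hypothetical, and deltas are assumed to be oriented so that positive means "the candidate got worse".

```python
import random
from statistics import mean

def bootstrap_ci(deltas, resamples=2000, confidence=0.95, seed=7):
    """Percentile bootstrap confidence interval for the mean per-case delta."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(resamples))
    lo_idx = int((1 - confidence) / 2 * resamples)
    return means[lo_idx], means[resamples - 1 - lo_idx]

def gate_decision(deltas, min_effect_size=0.02, fail_on_uncertain=False):
    """Return 'pass', 'fail', or 'uncertain' for a candidate-vs-baseline delta.

    Positive deltas mean the candidate regressed on that case.
    """
    effect = mean(deltas)
    if abs(effect) < min_effect_size:
        return "pass"  # change too small to matter practically
    lo, hi = bootstrap_ci(deltas)
    if lo > 0:
        return "fail"  # CI excludes zero on the regression side
    if hi < 0:
        return "pass"  # robust improvement
    return "fail" if fail_on_uncertain else "uncertain"
```

The practical-effect check runs first so that noise-level changes never reach the bootstrap; --fail-on-uncertain only matters when the interval straddles zero.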

4. Run + compare in one command (gate)

poetry run refua-bench gate \
  --suite benchmarks/sample_suite.yaml \
  --baseline benchmarks/sample_baseline_run.json \
  --adapter file \
  --adapter-config benchmarks/sample_file_adapter_config.yaml \
  --candidate-output artifacts/candidate_run.json \
  --output artifacts/gate_report.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 1000
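A gate like this fits naturally into a merge check. The sketch below assumes refua-bench gate exits nonzero when the gate fails (verify against the CLI's actual exit-code behavior); the job name and artifact names are illustrative.

```yaml
# GitHub Actions job (illustrative): block merges on benchmark regressions
benchmark-gate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: "3.12"
    - run: pipx install poetry && poetry install
    - name: Run regression gate
      run: |
        poetry run refua-bench gate \
          --suite benchmarks/sample_suite.yaml \
          --baseline benchmarks/sample_baseline_run.json \
          --adapter file \
          --adapter-config benchmarks/sample_file_adapter_config.yaml \
          --candidate-output artifacts/candidate_run.json \
          --output artifacts/gate_report.json \
          --min-effect-size 0.02 \
          --bootstrap-resamples 1000
    - uses: actions/upload-artifact@v4
      if: always()
      with:
        name: gate-report
        path: artifacts/
```

Uploading artifacts/ even on failure keeps the JSON gate report available for debugging a blocked merge.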

5. Baseline registry and promotion

Promote an initial baseline:

poetry run refua-bench baseline promote \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable \
  --candidate benchmarks/sample_baseline_run.json

Compare against named baseline:

poetry run refua-bench compare \
  --suite benchmarks/sample_suite.yaml \
  --registry artifacts/baseline_registry.json \
  --baseline-name stable \
  --candidate artifacts/candidate_run.json \
  --output artifacts/compare_named.json

Promote a new candidate safely (fails if regression is detected):

poetry run refua-bench baseline promote \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable \
  --candidate artifacts/candidate_run.json \
  --min-effect-size 0.02 \
  --bootstrap-resamples 2000

List/resolve baselines:

poetry run refua-bench baseline list --registry artifacts/baseline_registry.json
poetry run refua-bench baseline resolve \
  --registry artifacts/baseline_registry.json \
  --suite benchmarks/sample_suite.yaml \
  --baseline-name stable

6. Scaffold a new suite

poetry run refua-bench init --directory benchmarks/new_suite --name refua-next

Suite Schema

name: refua-core-smoke
version: 1.0.0
description: smoke checks
tasks:
  - id: affinity_mae
    metric: mae
    prediction_key: affinity
    expected_key: affinity  # optional, defaults to prediction_key
    regression_tolerance: 0.05
    weight: 2.0
    positive_label: 1       # used by f1/enrichment_factor/bedroc
    enrichment_fraction: 0.01  # used by enrichment_factor/ef
    bedroc_alpha: 20.0  # used by bedroc
    cases:
      - id: case_1
        input: {target: KRAS, ligand: MRTX1133}
        expected: {affinity: -9.3}

Supported metrics:

  • mae
  • rmse
  • accuracy
  • exact_match
  • f1 (binary)
  • enrichment_factor / ef (binary labels + ranking scores)
  • bedroc (binary labels + ranking scores with early-recognition emphasis)
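Illustrative reference implementations of three of these metrics (the package's own numerics may differ, e.g. in tie handling for the ranking metric):

```python
import math

def mae(preds, expected):
    """Mean absolute error."""
    return sum(abs(p - e) for p, e in zip(preds, expected)) / len(preds)

def rmse(preds, expected):
    """Root mean squared error."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(preds, expected)) / len(preds))

def enrichment_factor(scores, labels, fraction=0.01, positive_label=1):
    """EF: hit rate in the top `fraction` of the ranking vs. the overall hit rate.

    Higher scores rank earlier; ties are broken by input order.
    """
    n = len(scores)
    n_top = max(1, int(round(n * fraction)))
    order = sorted(range(n), key=lambda i: -scores[i])
    hits_top = sum(labels[i] == positive_label for i in order[:n_top])
    hits_all = sum(label == positive_label for label in labels)
    return (hits_top / n_top) / (hits_all / n)
```

An EF of 1.0 means the top fraction is no better than random; values above 1.0 indicate enrichment of positives near the top of the ranking.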

Prediction File Format (file adapter)

{
  "affinity_mae": {
    "case_1": {"affinity": -9.1}
  }
}
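An artifact in this format can be produced from any model loop; a minimal sketch of the serialization (the helper name is illustrative):

```python
import json

def to_file_adapter_artifact(predictions_by_task) -> str:
    """Serialize {task_id: {case_id: prediction_dict}} to the file-adapter JSON format."""
    return json.dumps(predictions_by_task, indent=2, sort_keys=True)

# Reproduces the sample artifact above:
artifact = to_file_adapter_artifact({"affinity_mae": {"case_1": {"affinity": -9.1}}})
```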

Command Adapter Contract

Input (stdin):

{
  "task_id": "affinity_mae",
  "prediction_key": "affinity",
  "case_id": "case_1",
  "input": {"target": "KRAS", "ligand": "MRTX1133"}
}

Output (stdout):

{"affinity": -9.2}
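A complete command adapter can be a small script honoring this contract. In the sketch below, the predictor is a canned lookup standing in for a real model call:

```python
#!/usr/bin/env python3
"""Toy command adapter: one request JSON on stdin, one prediction JSON on stdout."""
import json
import sys

def predict(request: dict) -> dict:
    # Placeholder model: a real adapter would invoke a model with
    # request["input"] here instead of this canned lookup.
    canned = {("KRAS", "MRTX1133"): -9.2}
    inp = request["input"]
    value = canned.get((inp["target"], inp["ligand"]), 0.0)
    return {request["prediction_key"]: value}

def main() -> None:
    request = json.load(sys.stdin)
    json.dump(predict(request), sys.stdout)

# Call main() when running this file as the adapter executable.
```

Because the contract is just JSON over stdin/stdout, the same pattern works in any language, not only Python.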

Tests

poetry run pytest

