Benchmark and regression gate tooling for Refua workflows.
refua-bench
refua-bench is a standalone benchmark and regression-gating project for Refua model workflows.
It benchmarks current and future models via adapter interfaces and enforces safe regression gates.
What It Provides
- Benchmark suite schema (yaml/json) for tasks, metrics, tolerances, and case sets.
- Adapter system for model execution:
  - golden: uses expected outputs (sanity checks)
  - file: reads predictions from a JSON artifact
  - command: calls any executable that reads JSON stdin and returns JSON stdout
  - custom adapters via module.path:AdapterClass
- Run artifacts in JSON + markdown.
- Automatic run provenance capture (git/runtime/model/dependencies).
- Statistical regression gating (minimum practical effect + bootstrap confidence intervals).
- Baseline registry with named baselines and safe promotion flow.
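The `module.path:AdapterClass` form for custom adapters follows a common Python convention for naming a class by import path. As a rough sketch of how such a string is usually resolved (this is not taken from the refua-bench source, and `load_adapter_class` is a hypothetical helper name), one might write:

```python
import importlib


def load_adapter_class(spec: str):
    """Resolve a 'module.path:ClassName' string to the class object.

    Hypothetical helper: refua-bench may resolve adapter specs differently.
    """
    module_path, sep, class_name = spec.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class, because the adapter base
# class and its required methods are not described in this README.
cls = load_adapter_class("collections:OrderedDict")
print(cls.__name__)
```

The methods a custom adapter class must implement are not documented here, so the example deliberately stops at import resolution.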
Install
cd refua-bench
poetry install
Build
poetry build
CLI
poetry run refua-bench --help
1. Run a benchmark
poetry run refua-bench run \
--suite benchmarks/sample_suite.yaml \
--adapter file \
--adapter-config benchmarks/sample_file_adapter_config.yaml \
--model-name boltz2-affinity \
--model-version 2026-02-12 \
--output artifacts/candidate_run.json \
--markdown artifacts/candidate_run.md
By default, each run stores provenance in run.provenance.
2. Compare candidate vs baseline
poetry run refua-bench compare \
--suite benchmarks/sample_suite.yaml \
--baseline benchmarks/sample_baseline_run.json \
--candidate artifacts/candidate_run.json \
--output artifacts/compare.json \
--markdown artifacts/compare.md
3. Statistical gating
poetry run refua-bench compare \
--suite benchmarks/sample_suite.yaml \
--baseline benchmarks/sample_baseline_run.json \
--candidate artifacts/candidate_run.json \
--output artifacts/compare_stats.json \
--min-effect-size 0.02 \
--bootstrap-resamples 2000 \
--confidence-level 0.95 \
--bootstrap-seed 7 \
--fail-on-uncertain
Interpretation:
- --min-effect-size: ignores changes too small to matter practically.
- --bootstrap-resamples: enables CI-based robustness checks.
- --fail-on-uncertain: optional strict mode for inconclusive bootstrap tasks.
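To make the interplay of these three options concrete, here is a minimal sketch of a paired percentile bootstrap on per-case errors. It is an illustration under assumptions, not refua-bench's implementation: the function name `bootstrap_gate`, the "lower is better" direction, the percentile method, and the three verdict labels are all choices made for this example.

```python
import random
from statistics import mean


def bootstrap_gate(baseline_errors, candidate_errors, min_effect=0.02,
                   resamples=2000, confidence=0.95, seed=7):
    """Paired bootstrap on per-case errors (assumes lower is better).

    Returns (verdict, (ci_low, ci_high)) for the change in mean error,
    candidate minus baseline. The verdict rule is an assumption.
    """
    if len(baseline_errors) != len(candidate_errors):
        raise ValueError("runs must cover the same cases")
    n = len(baseline_errors)
    rng = random.Random(seed)  # fixed seed keeps the gate reproducible
    deltas = []
    for _ in range(resamples):
        # Resample case indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(mean(candidate_errors[i] - baseline_errors[i] for i in idx))
    deltas.sort()
    alpha = 1.0 - confidence
    ci_low = deltas[int((alpha / 2) * (resamples - 1))]
    ci_high = deltas[int((1 - alpha / 2) * (resamples - 1))]
    if ci_low > min_effect:
        verdict = "regression"   # whole interval is worse than the threshold
    elif ci_high <= min_effect:
        verdict = "pass"         # whole interval is within the threshold
    else:
        verdict = "uncertain"    # interval straddles the threshold
    return verdict, (ci_low, ci_high)
```

Under this reading, a strict mode such as `--fail-on-uncertain` would treat the "uncertain" verdict as a failure instead of letting it pass.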
4. Run + compare in one command (gate)
poetry run refua-bench gate \
--suite benchmarks/sample_suite.yaml \
--baseline benchmarks/sample_baseline_run.json \
--adapter file \
--adapter-config benchmarks/sample_file_adapter_config.yaml \
--candidate-output artifacts/candidate_run.json \
--output artifacts/gate_report.json \
--min-effect-size 0.02 \
--bootstrap-resamples 1000
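When the gate is driven from a Python-based CI script, the same invocation can be assembled programmatically. The sketch below uses only the flags shown in the command above; the helper names `gate_command` and `run_gate` are hypothetical, and the assumption that a non-zero exit code signals a failed gate is not confirmed by this README.

```python
import subprocess


def gate_command(suite, baseline, adapter, adapter_config, candidate_output,
                 output, min_effect_size=0.02, bootstrap_resamples=1000):
    """Build the argv list for `refua-bench gate` from the documented flags."""
    return [
        "poetry", "run", "refua-bench", "gate",
        "--suite", suite,
        "--baseline", baseline,
        "--adapter", adapter,
        "--adapter-config", adapter_config,
        "--candidate-output", candidate_output,
        "--output", output,
        "--min-effect-size", str(min_effect_size),
        "--bootstrap-resamples", str(bootstrap_resamples),
    ]


def run_gate(**kwargs) -> int:
    """Run the gate and return its exit code.

    Assumption: a non-zero code means the gate failed or errored.
    """
    return subprocess.run(gate_command(**kwargs), check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.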
5. Baseline registry and promotion
Promote an initial baseline:
poetry run refua-bench baseline promote \
--registry artifacts/baseline_registry.json \
--suite benchmarks/sample_suite.yaml \
--baseline-name stable \
--candidate benchmarks/sample_baseline_run.json
Compare against named baseline:
poetry run refua-bench compare \
--suite benchmarks/sample_suite.yaml \
--registry artifacts/baseline_registry.json \
--baseline-name stable \
--candidate artifacts/candidate_run.json \
--output artifacts/compare_named.json
Promote a new candidate safely (fails if regression is detected):
poetry run refua-bench baseline promote \
--registry artifacts/baseline_registry.json \
--suite benchmarks/sample_suite.yaml \
--baseline-name stable \
--candidate artifacts/candidate_run.json \
--min-effect-size 0.02 \
--bootstrap-resamples 2000
List/resolve baselines:
poetry run refua-bench baseline list --registry artifacts/baseline_registry.json
poetry run refua-bench baseline resolve \
--registry artifacts/baseline_registry.json \
--suite benchmarks/sample_suite.yaml \
--baseline-name stable
6. Scaffold a new suite
poetry run refua-bench init --directory benchmarks/new_suite --name refua-next
Suite Schema
name: refua-core-smoke
version: 1.0.0
description: smoke checks
tasks:
  - id: affinity_mae
    metric: mae
    prediction_key: affinity
    expected_key: affinity  # optional, defaults to prediction_key
    regression_tolerance: 0.05
    weight: 2.0
    positive_label: 1  # used by f1/enrichment_factor/bedroc
    enrichment_fraction: 0.01  # used by enrichment_factor/ef
    bedroc_alpha: 20.0  # used by bedroc
    cases:
      - id: case_1
        input: {target: KRAS, ligand: MRTX1133}
        expected: {affinity: -9.3}
Supported metrics:
- mae
- rmse
- accuracy
- exact_match
- f1 (binary)
- enrichment_factor / ef (binary labels + ranking scores)
- bedroc (binary labels + ranking scores with early-recognition emphasis)
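For readers unfamiliar with these metrics, the following sketch shows textbook definitions of three of them. It is not the refua-bench implementation: the function names are chosen for illustration, and the enrichment factor here assumes that a higher score means a better rank and that the top slice holds at least one item.

```python
import math


def mae(pred, expected):
    """Mean absolute error."""
    return sum(abs(p - e) for p, e in zip(pred, expected)) / len(pred)


def rmse(pred, expected):
    """Root mean squared error."""
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(pred, expected)) / len(pred))


def enrichment_factor(scores, labels, fraction=0.01, positive_label=1):
    """Hit rate in the top `fraction` by score, divided by the overall hit rate."""
    n = len(scores)
    top_n = max(1, math.ceil(fraction * n))  # assumption: at least one item
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(1 for _, label in ranked[:top_n] if label == positive_label)
    hits_all = sum(1 for label in labels if label == positive_label)
    if hits_all == 0:
        return 0.0  # no positives at all, so enrichment is undefined; report 0
    return (hits_top / top_n) / (hits_all / n)
```

With the sample case from the suite schema (expected affinity -9.3) and a prediction of -9.1, `mae([-9.1], [-9.3])` comes to roughly 0.2.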
Prediction File Format (file adapter)
{
  "affinity_mae": {
    "case_1": {"affinity": -9.1}
  }
}
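The file is keyed first by task id, then by case id, with the prediction key inside. A small sketch of producing that layout from flat records might look like the following; `build_prediction_file` is a hypothetical helper, not part of refua-bench.

```python
import json


def build_prediction_file(records):
    """Nest flat (task_id, case_id, key, value) records as task -> case -> key."""
    nested = {}
    for task_id, case_id, key, value in records:
        nested.setdefault(task_id, {}).setdefault(case_id, {})[key] = value
    return nested


records = [("affinity_mae", "case_1", "affinity", -9.1)]
print(json.dumps(build_prediction_file(records), indent=2))
```

Writing the result with `json.dump` to the path named in the adapter config would then give the file adapter something to read.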
Command Adapter Contract
Input (stdin):
{
  "task_id": "affinity_mae",
  "prediction_key": "affinity",
  "case_id": "case_1",
  "input": {"target": "KRAS", "ligand": "MRTX1133"}
}
Output (stdout):
{"affinity": -9.2}
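A command adapter target can be any executable that honors this contract. The sketch below shows the request handling in Python, assuming one request per invocation as in the example above; `predict_affinity` is a hypothetical placeholder for a real model call and simply returns a constant.

```python
import json


def predict_affinity(target: str, ligand: str) -> float:
    """Hypothetical stand-in for a real model call."""
    return -9.2


def handle_request(raw: str) -> str:
    """Turn one JSON request (stdin payload) into one JSON response (stdout)."""
    request = json.loads(raw)
    inputs = request["input"]
    value = predict_affinity(inputs["target"], inputs["ligand"])
    # Report the value under the key the suite asked for.
    return json.dumps({request["prediction_key"]: value})


# As an executable, the script body would be:
#     import sys
#     sys.stdout.write(handle_request(sys.stdin.read()))
```

Keeping the logic in `handle_request` makes the adapter easy to unit test without spawning a process.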
Tests
poetry run pytest