Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks

Project description

BenchmarkReliability - BRF Python Package

Target

Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.

Method

The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:

from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

# X, y, groups: feature matrix, targets, and per-sample group labels supplied by the caller
analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) -> (S, E) -> class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")

Package Structure

brf/
|-- __init__.py
|-- analyzer.py          <- BRFAnalyzer main class
|-- metrics/
|   |-- baseline_gap.py  <- B
|   |-- instability.py   <- I
|   |-- null_test.py     <- N (permutation test)
|   |-- metadata.py      <- M
|-- phase/
|   |-- embedding.py     <- S = N - I, E = B + M
|   |-- classifier.py    <- Reliable / Fragile / Void
|   |-- visualization.py <- phase diagram, clustering plot
|-- report/
|   |-- json_export.py
|   |-- latex_export.py
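
The embedding noted next to embedding.py (S = N - I, E = B + M) can be sketched in a few lines. This is an illustration only: the names BRFVector and embed_brf are hypothetical and are not taken from the package, and the meaning attached to each field is inferred from the module names above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BRFVector:
    """Hypothetical container for the four audit dimensions."""
    B: float  # baseline gap (metrics/baseline_gap.py)
    I: float  # instability (metrics/instability.py)
    N: float  # null-test result (metrics/null_test.py)
    M: float  # metadata score (metrics/metadata.py)


def embed_brf(v: BRFVector) -> Tuple[float, float]:
    """Map (B, I, N, M) to phase coordinates with S = N - I and E = B + M."""
    S = v.N - v.I
    E = v.B + v.M
    return S, E


if __name__ == "__main__":
    print(embed_brf(BRFVector(B=0.5, I=0.25, N=0.75, M=0.25)))
```

Keeping the embedding as a pure function of the four metrics means it can be unit-tested without fitting any model.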

Steps

Phase 1: Package skeleton (1-2 weeks)

Initialize Python project with pyproject.toml
Implement BRFAnalyzer main class with fit/predict interface
Port compute_b, compute_i, compute_n, compute_m from BehaviorAudit
Write unit tests for each metric
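
As a rough illustration of the kind of logic a permutation-based null test (N) involves, the sketch below shuffles the labels and counts how often the shuffled score reaches the observed one. It is not the BehaviorAudit implementation: mean_gap and permutation_p_value are hypothetical names, the score is a deliberately simple class-mean difference rather than a model score, and it uses only the standard library so it runs without the package's dependencies.

```python
import random
from statistics import mean


def mean_gap(x, y):
    """Absolute difference between the class means of a single feature."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    return abs(mean(pos) - mean(neg))


def permutation_p_value(x, y, score=mean_gap, n_permutations=200, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = score(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if score(x, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    x = [float(i) for i in range(12)]
    y = [0] * 6 + [1] * 6
    print(permutation_p_value(x, y))
```

A unit test for such a metric can check the two extremes: a feature that separates the classes cleanly should give a small p-value, and a constant feature should give exactly 1.0.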

Phase 2: Phase embedding + classification (1 week)

Implement compute_phase(S, E) and classify_dataset(S, E)
Build phase diagram visualization (matplotlib)
Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
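
One possible shape for classify_dataset(S, E) is a simple threshold rule over the two phase coordinates. The thresholds and the mapping from regions to the Reliable / Fragile / Void labels below are assumptions made for illustration; the actual decision rule is not stated in this description.

```python
def classify_dataset(S, E, s_threshold=0.0, e_threshold=0.0):
    """Assign a phase label from (S, E).

    Assumed rule (placeholder thresholds, not the package's):
      both coordinates at or above threshold -> "Reliable"
      exactly one at or above threshold      -> "Fragile"
      neither                                -> "Void"
    """
    s_ok = S >= s_threshold
    e_ok = E >= e_threshold
    if s_ok and e_ok:
        return "Reliable"
    if s_ok or e_ok:
        return "Fragile"
    return "Void"


if __name__ == "__main__":
    for point in [(0.5, 0.75), (-1.0, 0.5), (-1.0, -1.0)]:
        print(point, classify_dataset(*point))
```

Exposing the thresholds as keyword arguments keeps the rule easy to recalibrate once it is checked against the 7 BehaviorAudit datasets.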

Phase 3: Documentation + distribution (1-2 weeks)

Write README with quick-start tutorial and API docs
Publish to TestPyPI -> PyPI
Set up ReadTheDocs for auto-generated documentation
Add GitHub Actions CI (test on Python 3.9-3.12)

Phase 4: HuggingFace Hub integration (optional, 1 week)

Add HF dataset loading wrapper
Allow brf.fit(dataset_id="OULAD") shorthand

Dependencies

numpy>=1.21
scikit-learn>=1.0
matplotlib>=3.5
No deep learning dependencies required

Relationship to Sister Repos

BehaviorAudit/: source of the audit logic; this package refactors and generalizes it
LLMScoringAudit/: first applied use case (MM-TBA x multiple LLMs)
BenchmarkPhase/: large-scale application (30 datasets BRF leaderboard)
llm-annotation/: cited for complementary MLLM pseudo-label reliability findings

Target Journal

Journal of Open Source Software (JOSS) - tool paper, lightweight submission
Followed by application papers in Computers & Education (C&E) / British Journal of Educational Technology (BJET)

Timeline

Phase 1-2: up to 3 weeks
Phase 3: up to 2 weeks
Phase 4: optional
JOSS submission: after Phase 3

Release history

0.2.1: Jul 1, 2026
0.2.0: Jul 1, 2026
0.1.9: Jul 1, 2026
0.1.8: Jul 1, 2026
0.1.7: Jul 1, 2026
0.1.6: Jul 1, 2026
0.1.5: Jun 30, 2026
0.1.4: Jun 30, 2026
0.1.3: Jun 30, 2026 (this version)
0.1.2: Jun 30, 2026
0.1.1: Jun 30, 2026
0.1.0: Jun 30, 2026

Download files

Source Distribution

benchmark_reliability-0.1.3.tar.gz (10.6 kB)

Uploaded Jun 30, 2026 Source

Built Distribution

benchmark_reliability-0.1.3-py3-none-any.whl (9.8 kB)

Uploaded Jun 30, 2026 Python 3

File details

Details for the file benchmark_reliability-0.1.3.tar.gz.

File metadata

Download URL: benchmark_reliability-0.1.3.tar.gz
Upload date: Jun 30, 2026
Size: 10.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for benchmark_reliability-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`ba8395f702e9f92ddb60dc7922d35fa425c0524a670f8722cf768d72a3bd0736`
MD5	`44f5bb1ca6384319aa27d105034c91de`
BLAKE2b-256	`b796f42ecc640b30735dd5453da9dca4d5318065993ed4fdee2b0cb43158558c`
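
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: sha256_of and matches are hypothetical helper names, and the file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("benchmark_reliability-0.1.3.tar.gz", "<SHA256 value from the table>") should return True for an intact download.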

File details

Details for the file benchmark_reliability-0.1.3-py3-none-any.whl.

File metadata

Download URL: benchmark_reliability-0.1.3-py3-none-any.whl
Upload date: Jun 30, 2026
Size: 9.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for benchmark_reliability-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`923cbb7c763265282c61f9bbca4384dc8ca5abcfc11e73ca075b097eb5c658e3`
MD5	`7c36821268464721c5badbd17bddf75a`
BLAKE2b-256	`8b76d926190c3362c90b45820834c86590d5e4f05e653bc81ab4607a09eabddc`

benchmark-reliability 0.1.3
