Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks

BenchmarkReliability - BRF Python Package

Target

Provide a standardized, pip-installable Python package that computes Benchmark Reliability Framework (BRF) scores for any predictive dataset, so that researchers can run the four-dimension audit protocol (B, I, N, M) with a single API call.

Method

The package wraps the core logic from the BehaviorAudit project in a scikit-learn-style API:

from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

# X: feature matrix, y: target vector, groups: per-row group labels (all supplied by the caller)
analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) -> (S, E) -> class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")
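
The quick-start above assumes that X, y, and groups already exist. As a minimal sketch, the inputs could look like the synthetic arrays below. The helper names make_toy_inputs and run_audit are hypothetical and not part of the package, and reading groups as one ID per row (for example a learner or session ID) is an assumption. Only the BRFAnalyzer call itself is taken from the example above.

```python
import numpy as np


def make_toy_inputs(n_groups=20, rows_per_group=10, n_features=5, seed=0):
    """Build a synthetic (X, y, groups) triple for trying out the API."""
    rng = np.random.default_rng(seed)
    n_rows = n_groups * rows_per_group
    X = rng.normal(size=(n_rows, n_features))
    # Binary target driven by the first feature plus noise.
    y = (X[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)
    # One group label per row (assumption: groups mark rows that belong together).
    groups = np.repeat(np.arange(n_groups), rows_per_group)
    return X, y, groups


def run_audit(X, y, groups):
    """Run the audit as in the quick-start; requires the brf package."""
    # Imported lazily so that make_toy_inputs works without brf installed.
    from brf import BRFAnalyzer

    analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
    return analyzer.brf_vector


X, y, groups = make_toy_inputs()
print(X.shape, y.shape, groups.shape)
```

With the package installed, run_audit(X, y, groups) would return the same brf_vector that the quick-start prints.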

Package Structure

brf/
|-- __init__.py
|-- analyzer.py          <- BRFAnalyzer main class
|-- metrics/
|   |-- baseline_gap.py  <- B
|   |-- instability.py   <- I
|   |-- null_test.py     <- N (permutation test)
|   |-- metadata.py      <- M
|-- phase/
|   |-- embedding.py     <- S = N - I, E = B + M
|   |-- classifier.py    <- Reliable / Fragile / Void
|   |-- visualization.py <- phase diagram, clustering plot
|-- report/
|   |-- json_export.py
|   |-- latex_export.py

Steps

Phase 1: Package skeleton (1-2 weeks)

  • Initialize Python project with pyproject.toml
  • Implement BRFAnalyzer main class with fit/predict interface
  • Port compute_b, compute_i, compute_n, compute_m from BehaviorAudit
  • Write unit tests for each metric
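
To make the metric ports more concrete, here is a rough sketch of how B, I, and N could be estimated with scikit-learn. The definitions are assumptions for illustration, not the formulas used in BehaviorAudit: B is taken as the accuracy gap over a majority-class baseline, I as the standard deviation of accuracy across repeated group-aware splits, and N as one minus the permutation-test p-value. The function name sketch_metrics and the choice of LogisticRegression are hypothetical. M is left out because the text above does not define it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GroupShuffleSplit,
    cross_val_score,
    permutation_test_score,
)


def sketch_metrics(X, y, groups, n_splits=30, n_permutations=200, seed=0):
    """Illustrative stand-ins for B, I and N (assumed definitions)."""
    # Repeated group-aware splits; the fixed seed gives both models the same splits.
    cv = GroupShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    baseline = DummyClassifier(strategy="most_frequent")

    model_scores = cross_val_score(model, X, y, groups=groups, cv=cv)
    base_scores = cross_val_score(baseline, X, y, groups=groups, cv=cv)

    # B (assumed): mean accuracy gap between the model and a trivial baseline.
    b = float(model_scores.mean() - base_scores.mean())
    # I (assumed): spread of model accuracy across the repeated splits.
    i = float(model_scores.std())
    # N (assumed): 1 - p-value of a label-permutation test.
    _, _, p_value = permutation_test_score(
        model, X, y, groups=groups, cv=cv,
        n_permutations=n_permutations, random_state=seed,
    )
    n = float(1.0 - p_value)
    return {"B": b, "I": i, "N": n}


# Small synthetic demo; the split and permutation counts are kept low for speed.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
groups_demo = np.repeat(np.arange(20), 10)
print(sketch_metrics(X_demo, y_demo, groups_demo, n_splits=5, n_permutations=20))
```

A unit test for each metric could follow the same pattern: build data with a known signal, then check that B is positive, I is non-negative, and N lies in [0, 1].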

Phase 2: Phase embedding + classification (1 week)

  • Implement compute_phase(S, E) and classify_dataset(S, E)
  • Build phase diagram visualization (matplotlib)
  • Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
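
The embedding step can be sketched directly from the package layout, which gives S = N - I and E = B + M. The classification rule below is only a placeholder: the thresholds s_min and e_min, and the order in which Void and Fragile are checked, are hypothetical and not taken from the package. The helper name embed_brf is also an assumption.

```python
from typing import NamedTuple


class PhasePoint(NamedTuple):
    S: float
    E: float


def embed_brf(b, i, n, m):
    """Map a (B, I, N, M) vector to the (S, E) plane: S = N - I, E = B + M."""
    return PhasePoint(S=n - i, E=b + m)


def classify_dataset(S, E, s_min=0.5, e_min=0.1):
    """Assign a phase label. HYPOTHETICAL rule and thresholds, for illustration only."""
    if E < e_min:
        return "Void"
    if S < s_min:
        return "Fragile"
    return "Reliable"


point = embed_brf(b=0.3, i=0.1, n=0.9, m=0.2)
print(point, classify_dataset(point.S, point.E))
```

Keeping the embedding separate from the labeling rule means the thresholds can be tuned or replaced without touching the metric code.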

Phase 3: Documentation + distribution (1-2 weeks)

  • Write README with quick-start tutorial and API docs
  • Publish to TestPyPI -> PyPI
  • Set up ReadTheDocs for auto-generated documentation
  • Add GitHub Actions CI (test on Python 3.9-3.12)

Phase 4: HuggingFace Hub integration (optional, 1 week)

  • Add HF dataset loading wrapper
  • Allow brf.fit(dataset_id="OULAD") shorthand
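
One possible shape for the loading wrapper is sketched below. It assumes the optional Hugging Face datasets library and a dataset whose rows are flat records of numeric features plus one target column. The function names rows_to_xy and load_hub_xy and the parameters target_column and feature_columns are hypothetical. Whether an ID such as "OULAD" resolves on the Hub, and which columns it exposes, depends on the actual dataset.

```python
import numpy as np


def rows_to_xy(rows, target_column, feature_columns=None):
    """Convert an iterable of dict-like rows into a feature matrix and a target vector."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to convert")
    if feature_columns is None:
        # Default: every column except the target, in the order of the first row.
        feature_columns = [c for c in rows[0] if c != target_column]
    X = np.array([[row[c] for c in feature_columns] for row in rows], dtype=float)
    y = np.array([row[target_column] for row in rows])
    return X, y


def load_hub_xy(dataset_id, target_column, split="train", feature_columns=None):
    """Load a Hub dataset split and convert it to (X, y). Needs `datasets` and network access."""
    # Imported lazily so the package keeps `datasets` as an optional dependency.
    from datasets import load_dataset

    dataset = load_dataset(dataset_id, split=split)
    return rows_to_xy(dataset, target_column, feature_columns)


toy_rows = [
    {"clicks": 3, "days_active": 10, "passed": 1},
    {"clicks": 0, "days_active": 2, "passed": 0},
]
X_toy, y_toy = rows_to_xy(toy_rows, target_column="passed")
print(X_toy.shape, y_toy.tolist())
```

A dataset_id shorthand on the analyzer could then call a loader like this internally before running the usual fit.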

Dependencies

  • numpy>=1.21
  • scikit-learn>=1.0
  • matplotlib>=3.5
  • No deep learning dependencies required

Relationship to Sister Repos

  • BehaviorAudit/: source of the audit logic; this package refactors and generalizes it
  • LLMScoringAudit/: first applied use case (MM-TBA x multiple LLMs)
  • BenchmarkPhase/: large-scale application (30 datasets BRF leaderboard)
  • llm-annotation/: cited for complementary MLLM pseudo-label reliability findings

Target Journal

  • Journal of Open Source Software (JOSS) - tool paper, lightweight submission
  • Followed by application papers in Computers & Education (C&E) / British Journal of Educational Technology (BJET)

Timeline

  • Phase 1-2: 3 weeks
  • Phase 3: 2 weeks
  • Phase 4: optional
  • JOSS submission: after Phase 3
