Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks

Project description

BenchmarkReliability - BRF Python Package

Target

Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.

Method

The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:

from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) → (S, E) → class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")
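The quick-start above takes X, y, and groups as given. As a minimal sketch of how those inputs might be prepared, the snippet below derives integer group codes from an ID column with NumPy. The toy table and the name student_ids are hypothetical, and treating groups as "rows that must stay in the same split" is an assumption based on the scikit-learn convention, not something the package documents here.

```python
import numpy as np

# Hypothetical raw data: two observations per student (illustration only).
student_ids = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])
X = np.array([
    [0.2, 1.0],
    [0.4, 0.0],
    [0.9, 1.0],
    [0.8, 1.0],
    [0.1, 0.0],
    [0.3, 0.0],
])
y = np.array([0, 0, 1, 1, 0, 0])

# Map each distinct ID to an integer code; rows sharing an ID share a group.
unique_ids, groups = np.unique(student_ids, return_inverse=True)

# All three arrays must describe the same rows.
assert X.shape[0] == y.shape[0] == groups.shape[0]
print(groups)  # [0 0 1 1 2 2]

# These arrays could then be passed as in the quick-start:
# analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
```
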

Package Structure

brf/
├── __init__.py
├── analyzer.py          ← BRFAnalyzer main class
├── metrics/
│   ├── baseline_gap.py  ← B
│   ├── instability.py   ← I
│   ├── null_test.py     ← N (permutation test)
│   └── metadata.py      ← M
├── phase/
│   ├── embedding.py     ← S = N - I, E = B + M
│   ├── classifier.py    ← Reliable / Fragile / Void
│   └── visualization.py ← phase diagram, clustering plot
└── report/
    ├── json_export.py
    └── latex_export.py
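The embedding noted beside embedding.py (S = N - I, E = B + M) is simple enough to sketch directly. The class below is a hypothetical illustration, not the package's actual data structure: the name BRFVector and the property layout are assumptions, and the input numbers are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BRFVector:
    """Hypothetical container for the four BRF dimensions."""

    B: float  # baseline gap
    I: float  # instability
    N: float  # null-test score
    M: float  # metadata score

    @property
    def S(self) -> float:
        # First phase coordinate, as stated in the package structure: S = N - I
        return self.N - self.I

    @property
    def E(self) -> float:
        # Second phase coordinate, as stated in the package structure: E = B + M
        return self.B + self.M


# Arbitrary example values, chosen to be exactly representable as floats.
v = BRFVector(B=0.25, I=0.125, N=0.75, M=0.5)
print(v.S, v.E)  # 0.625 0.75
```
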

Steps

Phase 1: Package skeleton (1-2 weeks)

  • Initialize Python project with pyproject.toml
  • Implement BRFAnalyzer main class with fit/predict interface
  • Port compute_b, compute_i, compute_n, compute_m from BehaviorAudit
  • Write unit tests for each metric
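As a rough sketch of what three of the ported metrics might compute, the snippet below uses scikit-learn on synthetic data. The definitions are assumptions inferred from the module names only: B as the mean score gap over a majority-class baseline, I as the spread of scores across grouped splits, and N from a label-permutation test. The actual compute_b, compute_i, and compute_n in BehaviorAudit may differ, and M is omitted because its definition is not given here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GroupShuffleSplit,
    cross_val_score,
    permutation_test_score,
)

# Synthetic stand-in data; 30 groups of 10 rows each.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
groups = np.arange(300) % 30

# Grouped random splits: rows from one group never straddle train and test.
cv = GroupShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000)

model_scores = cross_val_score(model, X, y, groups=groups, cv=cv)
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, groups=groups, cv=cv
)

# Assumed reading of B: mean accuracy gain over the majority-class baseline.
baseline_gap = float(model_scores.mean() - dummy_scores.mean())

# Assumed reading of I: standard deviation of accuracy across splits.
instability = float(model_scores.std(ddof=1))

# Assumed reading of N: permutation test of the score against shuffled labels.
_, perm_scores, p_value = permutation_test_score(
    model, X, y, groups=groups, cv=cv, n_permutations=50, random_state=0
)

print(f"B={baseline_gap:.3f}  I={instability:.3f}  p={p_value:.3f}")
```

Assertions on ranges and shapes (for example, that the p-value lies in (0, 1] and that one permutation score is returned per permutation) are a natural starting point for the per-metric unit tests.
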

Phase 2: Phase embedding + classification (1 week)

  • Implement compute_phase(S, E) and classify_dataset(S, E)
  • Build phase diagram visualization (matplotlib)
  • Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
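The verification step in Phase 2 could be automated as a tolerance-based regression check. The sketch below is one possible shape for such a check: the function name matches_reference, the dictionary layout, and the tolerance are all hypothetical, and the numbers are placeholders for illustration only, not results from any paper.

```python
import math


def matches_reference(computed, reference, tol=1e-6):
    """Return True if every reference value is reproduced within `tol`.

    Raises KeyError if `computed` lacks a key present in `reference`.
    """
    missing = set(reference) - set(computed)
    if missing:
        raise KeyError(f"missing values for: {sorted(missing)}")
    return all(
        math.isclose(computed[key], reference[key], rel_tol=0.0, abs_tol=tol)
        for key in reference
    )


# Placeholder numbers (NOT published results).
reference = {"S": 0.5, "E": 0.25}

print(matches_reference({"S": 0.5 + 1e-9, "E": 0.25}, reference))  # True
print(matches_reference({"S": 0.6, "E": 0.25}, reference))         # False
```

An absolute tolerance is used here because phase coordinates can be close to zero, where a relative tolerance becomes overly strict.
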

Phase 3: Documentation + distribution (1-2 weeks)

  • Write README with quick-start tutorial and API docs
  • Publish to TestPyPI → PyPI
  • Set up ReadTheDocs for auto-generated documentation
  • Add GitHub Actions CI (test on Python 3.9–3.12)

Phase 4: HuggingFace Hub integration (optional, 1 week)

  • Add HF dataset loading wrapper
  • Allow brf.fit(dataset_id="OULAD") shorthand

Dependencies

  • numpy>=1.21
  • scikit-learn>=1.0
  • matplotlib>=3.5
  • No deep learning dependencies required

Relationship to Sister Repos

  • BehaviorAudit/: source of the audit logic; this package refactors and generalizes it
  • LLMScoringAudit/: first applied use case (MM-TBA × multiple LLMs)
  • BenchmarkPhase/: large-scale application (30 datasets BRF leaderboard)
  • llm-annotation/: cited for complementary MLLM pseudo-label reliability findings

Target Journal

  • Journal of Open Source Software (JOSS) - tool paper, lightweight submission
  • Followed by application papers in Computers & Education (C&E) / British Journal of Educational Technology (BJET)

Timeline

  • Phase 1–2: 3 weeks
  • Phase 3: 2 weeks
  • Phase 4: optional
  • JOSS submission: after Phase 3

Download files

Source Distribution

benchmark_reliability-0.1.2.tar.gz (10.7 kB view details)

Uploaded Source

Built Distribution

benchmark_reliability-0.1.2-py3-none-any.whl (9.8 kB view details)

Uploaded Python 3

File details

Details for the file benchmark_reliability-0.1.2.tar.gz.

File metadata

  • Download URL: benchmark_reliability-0.1.2.tar.gz
  • Upload date:
  • Size: 10.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for benchmark_reliability-0.1.2.tar.gz
Algorithm Hash digest
SHA256 a12d889aca596e58370e17ef58cb0c6c49bd038f09e39431109eb7dc056e94c5
MD5 70aa97fc052a2be1020a5b56af0e9801
BLAKE2b-256 881e1349d0e1205a6257c4fba571dc6c0271a475f3743fcb59a88f21b9044e1f

File details

Details for the file benchmark_reliability-0.1.2-py3-none-any.whl.

File hashes

Hashes for benchmark_reliability-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d0cca49e04ac74097c0d5976c4915e93290788e5c4447a8e12bcbd6728715309
MD5 4556250cdbc4133979fcae9d6fa49f1a
BLAKE2b-256 c9b811f43b25895b77e9664f10f6a4bbe13e1f649055892c72f8539f75e4879c
