Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks

Project description

BenchmarkReliability: BRF Python Package

Target

Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.

Method

The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:

from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) → (S, E) → class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")
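
The export step can be pictured with a short standard-library sketch. It is not the package's export_json: the function name export_json_sketch, the flat record layout, and every value below are assumptions made for illustration, with field names taken from the comment in the example above.

```python
import json
import os
import tempfile


def export_json_sketch(record, path):
    """Write a flat dict of audit results to `path` as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, indent=2, sort_keys=True)


# Placeholder values only; they are not results from any real dataset.
record = {
    "B": 0.25, "I": 0.125, "N": 0.5, "M": 0.25,
    "S": 0.375, "E": 0.5,
    "class": "Reliable",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.json")
    export_json_sketch(record, path)
    # Read the file back to confirm the record survives a round trip.
    with open(path, encoding="utf-8") as fh:
        round_trip = json.load(fh)

print(round_trip == record)
```

Sorting the keys keeps the file stable across runs, which makes exported results easier to diff.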

Package Structure

brf/
├── __init__.py
├── analyzer.py          # BRFAnalyzer main class
├── metrics/
│   ├── baseline_gap.py  # B
│   ├── instability.py   # I
│   ├── null_test.py     # N (permutation test)
│   └── metadata.py      # M
├── phase/
│   ├── embedding.py     # S = N - I, E = B + M
│   ├── classifier.py    # Reliable / Fragile / Void
│   └── visualization.py # phase diagram, clustering plot
└── report/
    ├── json_export.py
    └── latex_export.py
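
The embedding noted next to embedding.py (S = N - I, E = B + M) is plain arithmetic on the four scores. A minimal sketch follows, assuming all four scores are ordinary floats on comparable scales. The BRFVector container and the embed_phase name are hypothetical and do not come from the package.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BRFVector:
    """Hypothetical container for the four audit scores (not the package's own type)."""
    B: float  # baseline gap
    I: float  # instability
    N: float  # null-test (permutation) score
    M: float  # metadata score


def embed_phase(v: BRFVector) -> Tuple[float, float]:
    """Return (S, E) using the formulas from the layout: S = N - I, E = B + M."""
    S = v.N - v.I
    E = v.B + v.M
    return S, E


# Placeholder scores, chosen only to make the arithmetic easy to follow.
v = BRFVector(B=0.25, I=0.125, N=0.5, M=0.25)
S, E = embed_phase(v)
print(S, E)  # 0.375 0.5
```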

Steps

Phase 1: Package skeleton (1-2 weeks)

  • Initialize Python project with pyproject.toml
  • Implement BRFAnalyzer main class with fit/predict interface
  • Port compute_b, compute_i, compute_n, compute_m from BehaviorAudit
  • Write unit tests for each metric
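
The N metric is described only as a permutation test, so the following is a generic sketch of that technique, not a port of compute_n. It assumes a toy score (the gap between class means of one feature) in place of a model's cross-validated score. The names mean_gap and permutation_p_value are hypothetical, and how the package turns the result into N is not specified here.

```python
import random
from statistics import mean


def mean_gap(x, y):
    """Toy score: absolute difference between the class-1 and class-0 means of x."""
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    return abs(mean(ones) - mean(zeros))


def permutation_p_value(x, y, score_fn, n_permutations=200, seed=0):
    """Estimate how often shuffled labels score at least as well as the real ones."""
    rng = random.Random(seed)
    observed = score_fn(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if score_fn(x, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return observed, (hits + 1) / (n_permutations + 1)


# Synthetic data in which the feature separates the two classes perfectly.
x = [float(i) for i in range(20)]
y = [0] * 10 + [1] * 10
observed, p_value = permutation_p_value(x, y, mean_gap)
print(observed, p_value)
```

With 200 permutations the smallest reachable p-value is 1/201, so n_permutations sets the resolution of the test.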

Phase 2: Phase embedding + classification (1 week)

  • Implement compute_phase(S, E) and classify_dataset(S, E)
  • Build phase diagram visualization (matplotlib)
  • Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
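
The verification step in the last bullet amounts to a tolerance-based comparison between computed and published scores. A minimal sketch follows, assuming scores are stored as name-to-float dicts. The find_mismatches helper, the tolerance, and all numbers are placeholders, not values from BehaviorAudit or any paper.

```python
import math


def find_mismatches(computed, reference, abs_tol=1e-3):
    """Return the names of scores that differ from the reference by more than abs_tol."""
    mismatches = []
    for name, expected in reference.items():
        actual = computed.get(name)
        # A missing score counts as a mismatch.
        if actual is None or not math.isclose(
            actual, expected, rel_tol=0.0, abs_tol=abs_tol
        ):
            mismatches.append(name)
    return mismatches


# Placeholder numbers for illustration only.
reference = {"B": 0.25, "I": 0.125, "N": 0.5, "M": 0.25}
computed = {"B": 0.2504, "I": 0.125, "N": 0.52, "M": 0.25}

print(find_mismatches(computed, reference))  # ['N']
```

An absolute tolerance is used because scores near zero would make a relative tolerance too strict.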

Phase 3: Documentation + distribution (1-2 weeks)

  • Write README with quick-start tutorial and API docs
  • Publish to TestPyPI → PyPI
  • Set up ReadTheDocs for auto-generated documentation
  • Add GitHub Actions CI (test on Python 3.9-3.12)

Phase 4: HuggingFace Hub integration (optional, 1 week)

  • Add HF dataset loading wrapper
  • Allow brf.fit(dataset_id="OULAD") shorthand

Dependencies

  • numpy>=1.21
  • scikit-learn>=1.0
  • matplotlib>=3.5
  • No deep learning dependencies required

Relationship to Sister Repos

  • BehaviorAudit/: source of the audit logic; this package refactors and generalizes it
  • LLMScoringAudit/: first applied use case (MM-TBA × multiple LLMs)
  • BenchmarkPhase/: large-scale application (30 datasets BRF leaderboard)
  • llm-annotation/: cited for complementary MLLM pseudo-label reliability findings

Target Journal

  • Journal of Open Source Software (JOSS): tool paper, lightweight submission
  • Followed by application papers in C&E / BJET

Timeline

  • Phases 1-2: 3 weeks
  • Phase 3: 2 weeks
  • Phase 4: optional
  • JOSS submission: after Phase 3
