Benchmark Reliability Framework (BRF): dataset-level reliability auditing for predictive benchmarks

Project description

BenchmarkReliability: BRF Python Package

Target

Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.

Method

The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:

from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) → (S, E) → class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")

Package Structure

brf/
├── __init__.py
├── analyzer.py          ← BRFAnalyzer main class
├── metrics/
│   ├── baseline_gap.py  ← B
│   ├── instability.py   ← I
│   ├── null_test.py     ← N (permutation test)
│   └── metadata.py      ← M
├── phase/
│   ├── embedding.py     ← S = N - I, E = B + M
│   ├── classifier.py    ← Reliable / Fragile / Void
│   └── visualization.py ← phase diagram, clustering plot
└── report/
    ├── json_export.py
    └── latex_export.py
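
The embedding noted next to embedding.py reduces the four metrics to two coordinates. A minimal sketch of that arithmetic, assuming the metrics are plain floats; the names `BRFVector` and `embed` are illustrative and are not taken from the package:

```python
from typing import NamedTuple, Tuple


class BRFVector(NamedTuple):
    """Hypothetical container for the four BRF metrics."""
    B: float  # baseline gap
    I: float  # instability
    N: float  # null-test (permutation) result
    M: float  # metadata score


def embed(v: BRFVector) -> Tuple[float, float]:
    """Map (B, I, N, M) to (S, E) using S = N - I and E = B + M."""
    S = v.N - v.I
    E = v.B + v.M
    return S, E


# Made-up values, chosen only to show the arithmetic.
print(embed(BRFVector(B=0.25, I=0.25, N=0.75, M=0.5)))  # (0.5, 0.75)
```

The mapping from (S, E) to Reliable / Fragile / Void is left out here because the class boundaries are not stated in this description.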

Steps

Phase 1: Package skeleton (1-2 weeks)

  • Initialize Python project with pyproject.toml
  • Implement BRFAnalyzer main class with fit/predict interface
  • Port compute_b, compute_i, compute_n, compute_m from BehaviorAudit
  • Write unit tests for each metric
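
Of the four metrics to be ported, N is described in this document as a permutation test. The sketch below shows the general technique using only the standard library: shuffle the labels, re-score, and report the fraction of shuffles that score at least as well as the real labels. It is not the BehaviorAudit code; `permutation_p_value` and `score_fn` are hypothetical names, and how BRF converts the result into N is not specified here.

```python
import random
from typing import Callable, List, Sequence, Tuple


def permutation_p_value(
    score_fn: Callable[[Sequence[int]], float],
    y: Sequence[int],
    n_permutations: int = 200,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed score, p-value) for a label-permutation test.

    score_fn takes a label sequence and returns a score (higher is better).
    """
    rng = random.Random(seed)
    observed = score_fn(y)
    shuffled: List[int] = list(y)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, p_value


# Toy example: fixed predictions that match the true labels exactly.
predictions = [0] * 10 + [1] * 10
labels = list(predictions)


def accuracy(candidate_labels: Sequence[int]) -> float:
    """Fraction of positions where the fixed predictions match the labels."""
    matches = sum(p == t for p, t in zip(predictions, candidate_labels))
    return matches / len(predictions)


observed, p_value = permutation_p_value(accuracy, labels)
```

In a real audit `score_fn` would refit and cross-validate a model on the shuffled labels rather than compare against fixed predictions. scikit-learn, already a listed dependency, offers `sklearn.model_selection.permutation_test_score` for that model-based version.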

Phase 2: Phase embedding + classification (1 week)

  • Implement compute_phase(S, E) and classify_dataset(S, E)
  • Build phase diagram visualization (matplotlib)
  • Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
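
The verification step in this phase amounts to a regression test against published values. One way to sketch it, assuming results are stored as (B, I, N, M) tuples keyed by dataset name; the helper name, the tolerance, and the numbers below are placeholders, not results from the paper:

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float, float, float]  # (B, I, N, M)


def find_mismatches(
    computed: Dict[str, Vector],
    reference: Dict[str, Vector],
    abs_tol: float = 1e-3,
) -> List[str]:
    """List datasets whose computed vector is missing or differs from the reference."""
    mismatches = []
    for name, expected in reference.items():
        actual = computed.get(name)
        if actual is None or not all(
            math.isclose(a, e, abs_tol=abs_tol) for a, e in zip(actual, expected)
        ):
            mismatches.append(name)
    return mismatches


# Placeholder values for illustration only.
reference = {
    "dataset_a": (0.10, 0.20, 0.30, 0.40),
    "dataset_b": (0.50, 0.60, 0.70, 0.80),
}
computed = {
    "dataset_a": (0.1004, 0.20, 0.30, 0.40),  # within tolerance
    "dataset_b": (0.50, 0.65, 0.70, 0.80),    # I differs by 0.05
}
print(find_mismatches(computed, reference))  # ['dataset_b']
```

Because the permutation and split steps are randomized, a check like this only makes sense with fixed seeds or with a tolerance wide enough to absorb run-to-run variation.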

Phase 3: Documentation + distribution (1-2 weeks)

  • Write README with quick-start tutorial and API docs
  • Publish to TestPyPI → PyPI
  • Set up ReadTheDocs for auto-generated documentation
  • Add GitHub Actions CI (test on Python 3.9–3.12)

Phase 4: HuggingFace Hub integration (optional, 1 week)

  • Add HF dataset loading wrapper
  • Allow brf.fit(dataset_id="OULAD") shorthand

Dependencies

  • numpy>=1.21
  • scikit-learn>=1.0
  • matplotlib>=3.5
  • No deep learning dependencies required

Relationship to Sister Repos

  • BehaviorAudit/: source of the audit logic; this package refactors and generalizes it
  • LLMScoringAudit/: first applied use case (MM-TBA × multiple LLMs)
  • BenchmarkPhase/: large-scale application (30 datasets BRF leaderboard)
  • llm-annotation/: cited for complementary MLLM pseudo-label reliability findings

Target Journal

  • Journal of Open Source Software (JOSS): tool paper, lightweight submission
  • Followed by application papers in C&E / BJET

Timeline

  • Phase 1–2: 3 weeks
  • Phase 3: 2 weeks
  • Phase 4: optional
  • JOSS submission: after Phase 3

Download files

Source Distribution

benchmark_reliability-0.1.0.tar.gz (10.7 kB)

Uploaded Source

Built Distribution

benchmark_reliability-0.1.0-py3-none-any.whl (9.9 kB)

Uploaded Python 3

File details

Details for the file benchmark_reliability-0.1.0.tar.gz.

File metadata

  • Download URL: benchmark_reliability-0.1.0.tar.gz
  • Upload date:
  • Size: 10.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.8.10

File hashes

Hashes for benchmark_reliability-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ffa4845844e4da564637870e18f958b2dc39cde0b43b8bb582f199ef68915289
MD5 ec669855c4da676e2d958b09fa6c5219
BLAKE2b-256 5329df6ef5f400a2c673b49cbb664e9dfd894109d6985d904e018e0f484834aa

File details

Details for the file benchmark_reliability-0.1.0-py3-none-any.whl.

File hashes

Hashes for benchmark_reliability-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7d0bb496a63f8fba6f42a3e86adc94d5417641c032b7f083eb92b213e47b1a66
MD5 6e493e4637fd1652c187854d4a3ce34c
BLAKE2b-256 9b2cd011f09c375e7a64d6e5366a8f010d0bf0e00bf9d81eaed36e0c3b29f3bb
