Benchmark Reliability Framework (BRF): dataset-level reliability auditing for predictive benchmarks
Project description
BenchmarkReliability: BRF Python Package
Target
Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.
Method
The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:
from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json
# X, y, groups: features, targets, and per-sample group labels (array-likes, as in scikit-learn)
analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)  # (B, I, N, M) -> (S, E) -> class
# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)
# Export
export_json(analyzer.brf_vector, "results.json")
Package Structure
brf/
├── __init__.py
├── analyzer.py            # BRFAnalyzer main class
├── metrics/
│   ├── baseline_gap.py    # B
│   ├── instability.py     # I
│   ├── null_test.py       # N (permutation test)
│   └── metadata.py        # M
├── phase/
│   ├── embedding.py       # S = N - I, E = B + M
│   ├── classifier.py      # Reliable / Fragile / Void
│   └── visualization.py   # phase diagram, clustering plot
└── report/
    ├── json_export.py
    └── latex_export.py
Steps
Phase 1: Package skeleton (1-2 weeks)
- Initialize Python project with `pyproject.toml`
- Implement `BRFAnalyzer` main class with fit/predict interface
- Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
- Write unit tests for each metric
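The package structure describes N as a permutation test, so a generic version of that technique may help when porting and unit-testing `compute_n`. The sketch below uses only the standard library; `permutation_p_value` and `agreement` are hypothetical names, and how BRF converts the test result into the score N is not stated here, so this should be read as an illustration of the general idea rather than the package's implementation. The default of 200 permutations simply mirrors the usage example.

```python
import random


def permutation_p_value(score_fn, x, y, n_permutations=200, seed=0):
    """Generic permutation test: how often does a label shuffle score as well?

    Hypothetical helper, not part of the brf API.
    """
    rng = random.Random(seed)
    observed = score_fn(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if score_fn(x, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations,
    # so the p-value can never be exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


def agreement(x, y):
    """Toy score: fraction of positions where x and y agree."""
    return sum(a == b for a, b in zip(x, y)) / len(y)


# Illustrative data only: y is a perfect copy of x, so shuffles should score lower.
x = [0, 1] * 20
y = list(x)
observed, p = permutation_p_value(agreement, x, y)
print(observed, p)
```

A small p-value means the observed score is unlikely under shuffled labels; a p-value near 1 suggests the benchmark's signal is indistinguishable from noise.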
Phase 2: Phase embedding + classification (1 week)
- Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
- Build phase diagram visualization (matplotlib)
- Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results
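Phase 2 builds on the embedding listed next to `phase/embedding.py` (S = N - I, E = B + M). A minimal sketch of that step, based only on those two formulas, might look as follows. `BRFVector` and `embed` are hypothetical names rather than the package's API, the input numbers are placeholders, and the thresholds that separate Reliable, Fragile, and Void are not given here, so classification is deliberately left out.

```python
from typing import NamedTuple, Tuple


class BRFVector(NamedTuple):
    """Hypothetical container for the four audit scores."""
    B: float  # baseline gap
    I: float  # instability
    N: float  # null (permutation) test
    M: float  # metadata


def embed(v: BRFVector) -> Tuple[float, float]:
    """Map (B, I, N, M) to the phase coordinates (S, E)."""
    S = v.N - v.I  # from the package tree: S = N - I
    E = v.B + v.M  # from the package tree: E = B + M
    return S, E


# Placeholder scores, not results from any dataset.
S, E = embed(BRFVector(B=0.2, I=0.1, N=0.5, M=0.3))
print(S, E)
```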
Phase 3: Documentation + distribution (1-2 weeks)
- Write README with quick-start tutorial and API docs
- Publish to TestPyPI, then PyPI
- Set up ReadTheDocs for auto-generated documentation
- Add GitHub Actions CI (test on Python 3.9-3.12)
Phase 4: HuggingFace Hub integration (optional, 1 week)
- Add HF dataset loading wrapper
- Allow `brf.fit(dataset_id="OULAD")` shorthand
Dependencies
- numpy>=1.21
- scikit-learn>=1.0
- matplotlib>=3.5
- No deep learning dependencies required
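To confirm that an environment satisfies these minimums before running an audit, a short standard-library check along the following lines could be used. The helper names are hypothetical, and the version comparison is a deliberate simplification: it looks only at the leading numeric components and ignores pre-release suffixes, which a full PEP 440 parser would handle.

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUMS = {"numpy": "1.21", "scikit-learn": "1.0", "matplotlib": "3.5"}


def release_tuple(version):
    """Leading numeric components only, e.g. '1.26.4rc1' -> (1, 26, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed, minimum):
    """Simplified comparison of two dotted version strings."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Return a status string per distribution: 'ok', 'missing', or 'too old (...)'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else f"too old ({installed})"
    return report


print(check_environment())
```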
Relationship to Sister Repos
- BehaviorAudit/: source of the audit logic; this package refactors and generalizes it
- LLMScoringAudit/: first applied use case (MM-TBA × multiple LLMs)
- BenchmarkPhase/: large-scale application (30-dataset BRF leaderboard)
- llm-annotation/: cited for complementary MLLM pseudo-label reliability findings
Target Journal
- Journal of Open Source Software (JOSS): tool paper, lightweight submission
- Followed by application papers in C&E / BJET
Timeline
- Phase 1-2: 3 weeks
- Phase 3: 2 weeks
- Phase 4: optional
- JOSS submission: after Phase 3
Download files
File details
Details for the file benchmark_reliability-0.1.0.tar.gz.
File metadata
- Download URL: benchmark_reliability-0.1.0.tar.gz
- Upload date:
- Size: 10.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ffa4845844e4da564637870e18f958b2dc39cde0b43b8bb582f199ef68915289 |
| MD5 | ec669855c4da676e2d958b09fa6c5219 |
| BLAKE2b-256 | 5329df6ef5f400a2c673b49cbb664e9dfd894109d6985d904e018e0f484834aa |
File details
Details for the file benchmark_reliability-0.1.0-py3-none-any.whl.
File metadata
- Download URL: benchmark_reliability-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d0bb496a63f8fba6f42a3e86adc94d5417641c032b7f083eb92b213e47b1a66 |
| MD5 | 6e493e4637fd1652c187854d4a3ce34c |
| BLAKE2b-256 | 9b2cd011f09c375e7a64d6e5366a8f010d0bf0e00bf9d81eaed36e0c3b29f3bb |