Marginal Baseline Eval (MBE): A framework for rigorously auditing representation metrics in deep neural networks.
The Marginal Baseline Eval (MBE)
Welcome to the Marginal Baseline Eval (MBE) repository!
This repository provides the formal implementation of the MBE protocol, a strict four-stage validation methodology for auditing representation metrics in deep neural networks.
It was originally built during a large case study that produced a negative result for the Gradient Effective Rank (FIM_norm) metric as a generalization predictor.
Why Do We Need MBE?
The AI safety and interpretability communities frequently propose internal structural metrics (e.g., representation geometry, effective rank, gradient coherence) to predict generalization or track model health.
However, many of these metrics are loss proxies in disguise. Early validation loss is already a strong predictor of final test accuracy, so any metric that correlates with the magnitude of the loss will also correlate with generalization. A metric that predicts generalization only through the loss provides no independent structural insight.
The MBE protocol catches these false positive metrics using a rigorous partial-correlation baseline control.
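The loss-proxy failure mode described above can be reproduced on synthetic data. The following is a minimal sketch, not the package's implementation: it uses plain NumPy, and the names `partial_corr`, `val_loss`, `metric`, and `test_acc`, as well as the noise levels, are illustrative assumptions. The metric is constructed as validation loss plus noise, so its raw correlation with the target is strong but should largely disappear once the baseline is regressed out.

```python
import numpy as np


def partial_corr(x, y, z):
    """Pearson correlation between x and y after linearly regressing z out of both."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    # Design matrix with an intercept column and the control variable z.
    design = np.column_stack([np.ones_like(z), z])
    # Residuals of x and y after removing the least-squares fit on z.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


# Synthetic runs (assumed setup): one value per trained model.
rng = np.random.default_rng(0)
n = 200
val_loss = rng.normal(size=n)                     # baseline: early validation loss
metric = val_loss + 0.3 * rng.normal(size=n)      # a pure loss proxy plus noise
test_acc = -val_loss + 0.3 * rng.normal(size=n)   # target driven only by the loss

raw_r = float(np.corrcoef(metric, test_acc)[0, 1])
ctrl_r = partial_corr(metric, test_acc, val_loss)

print(f"raw correlation:     {raw_r:+.3f}")
print(f"partial correlation: {ctrl_r:+.3f}")
```

A metric with genuine structural signal would keep a clearly nonzero partial correlation after the baseline is controlled for; a loss proxy like the one simulated here would not.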
Installation
You can install the framework directly from PyPI:
pip install mbe-eval
Or, if you want to run the PyTorch demos, clone the repository:
git clone https://github.com/AparajeetS/metric-audit-paper-code.git
cd metric-audit-paper-code
pip install -r requirements.txt
The MBE API
MBE is a fully importable Python framework powered by pandas and pingouin. You can integrate it directly into your own model evaluation pipelines.
from mbe_eval import MBEEvaluator
evaluator = MBEEvaluator(metric_name="My Cool Metric", baseline_name="Epoch 20 Val Loss")
# Pass your experimental measurements as numpy arrays: metric, baseline, and target values
report = evaluator.evaluate(metric_vals, baseline_vals, target_vals)
This prints a diagnostics table to the console (rendered with rich) and writes a high-resolution seaborn report to the mbe_reports/ directory.
Real PyTorch Demos
We provide two end-to-end PyTorch scripts in the examples/ directory that actually train neural networks and run the evaluation live.
1. The Acid Test (Stage 1): shows how a metric can successfully track capacity and noise, giving false assurance.
python examples/01_run_acid_test.py
2. The Heterogeneous Grid (Stage 4): the key demo. It trains 20 models with randomized hyperparameters, computes the Gradient Effective Rank for each, and runs the final MBE partial-correlation control to show that the metric is a disguised loss proxy.
python examples/02_run_heterogeneous_grid.py
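This README does not spell out how FIM_norm is computed; the actual extraction code lives in the repository (mbe_eval/utils.py and metric_audit/). As a rough illustration of what an effective-rank style quantity measures, the sketch below uses one common definition: the exponential of the Shannon entropy of the normalized singular values. It is an assumption that this resembles the repository's metric, and `effective_rank` and `grads` are hypothetical names introduced only for this example.

```python
import numpy as np


def effective_rank(matrix):
    """exp(Shannon entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0.0  # convention chosen here for an all-zero matrix
    p = s / total
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))


# Hypothetical stand-in for a (samples x parameters) matrix of per-sample gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))

rank_full = effective_rank(np.eye(4))                                  # equal singular values
rank_one = effective_rank(np.outer(np.ones(5), np.arange(1.0, 4.0)))   # a rank-1 matrix
rank_grads = effective_rank(grads)

print(rank_full, rank_one, rank_grads)
```

Under this definition the value ranges from 1 (all variance in a single direction) up to the number of nonzero singular values (variance spread evenly).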
Repository Structure
metric-audit-paper-code/
├── mbe_eval/ # The core MBE evaluation API
│ ├── __init__.py
│ ├── core.py # MBEEvaluator class
│ ├── utils.py # PyTorch FIM_norm extraction
│ └── sample_eval.py # Basic synthetic simulation
├── examples/ # Real end-to-end PyTorch demos
│ ├── 01_run_acid_test.py
│ └── 02_run_heterogeneous_grid.py
├── experiments/ # All 12 original paper experiment scripts
├── metric_audit/ # Core FIM_norm computation library
├── docs/
│ └── RESULTS.md # Raw numerical results for the paper
├── PAPER.md # Full technical writeup
├── requirements.txt
├── LICENSE
└── README.md
Citation
If you use the Marginal Baseline Eval in your own representation evaluation, please cite the accompanying manuscript:
Shadangi, A. (2026). Does It Beat the Baseline? A Comprehensive Negative Result
on Gradient Effective Rank as a Generalization Predictor. arXiv preprint.
File details
Details for the file mbe_eval-0.1.0.tar.gz.
File metadata
- Download URL: mbe_eval-0.1.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 996131c5bc493dda4b33c34330c2e55c369ca56083c134e524d9b1efa158ea11 |
| MD5 | d3e2203f8af7308ce186ed5dd6154ce4 |
| BLAKE2b-256 | 0b502b8349d3eb6b179233435d9359b439dadb204a08b20504f198e4330e0c7b |
File details
Details for the file mbe_eval-0.1.0-py3-none-any.whl.
File metadata
- Download URL: mbe_eval-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 80d914984ddcd312e36fb4d11995ab0dd295f4b54728368fc928a2b0f242b23e |
| MD5 | 69557b6e6625bf8ee841da33b8f004b1 |
| BLAKE2b-256 | 976ffbc9e98209fae9f2bd9fffe956f91e0cdf749ea91e5b3653285d893f266f |