Marginal Baseline Eval (MBE): A framework for rigorously auditing representation metrics in deep neural networks.

The Marginal Baseline Eval (MBE)

License: MIT

Welcome to the Marginal Baseline Eval (MBE) repository!

This repository provides an implementation of the MBE protocol, a 4-stage validation methodology for auditing representation metrics in deep neural networks.

It was originally built during a case study whose results reject the Gradient Effective Rank (FIM_norm) metric as a generalization predictor.

Why Do We Need MBE?

The AI safety and interpretability communities frequently propose internal structural metrics (e.g., representation geometry, effective rank, gradient coherence) to predict generalization or track model health.

However, many of these metrics are loss proxies in disguise. Because early validation loss already predicts final test accuracy, any metric that correlates with the magnitude of the loss will also correlate with generalization, whether or not it captures any structure. Such a metric provides no structural insight beyond what the loss already gives.

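The MBE protocol catches these false positives with a partial-correlation control: it asks how well the metric predicts the target once the baseline (for example, early validation loss) has been accounted for.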
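
To make the idea concrete, here is a minimal NumPy sketch of a partial-correlation control on synthetic data. It is not the MBE implementation: the helper `partial_corr`, the variable names, and the simulated numbers are all assumptions made for illustration. The metric is built as a noisy function of validation loss, so it looks predictive until the loss is regressed out.

```python
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation of x and y after linearly regressing z out of both.

    Hypothetical helper for illustration; not part of the mbe_eval API.
    """
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])  # intercept + covariate
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Synthetic stand-ins for one value per trained model (assumed numbers).
rng = np.random.default_rng(0)
n_runs = 200
val_loss = rng.uniform(0.2, 2.0, size=n_runs)                    # baseline
metric = 3.0 * val_loss + rng.normal(0.0, 0.1, size=n_runs)      # pure loss proxy
test_acc = 0.95 - 0.3 * val_loss + rng.normal(0.0, 0.02, size=n_runs)

raw_r = float(np.corrcoef(metric, test_acc)[0, 1])
partial_r = partial_corr(metric, test_acc, val_loss)
print(f"raw r = {raw_r:.3f}, partial r = {partial_r:.3f}")
```

In this toy setup the raw correlation is strong while the partial correlation is close to zero, which is the signature of a metric that carries no information beyond the baseline.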

Installation

You can install the framework directly from PyPI:

pip install mbe-eval

Or, if you want to run the PyTorch demos, clone the repository:

git clone https://github.com/AparajeetS/metric-audit-paper-code.git
cd metric-audit-paper-code
pip install -r requirements.txt

The MBE API

MBE is an importable Python package built on pandas and pingouin. You can integrate it directly into your own model evaluation pipelines.

from mbe_eval import MBEEvaluator

# metric_vals, baseline_vals, target_vals: your experimental results as NumPy arrays
evaluator = MBEEvaluator(metric_name="My Cool Metric", baseline_name="Epoch 20 Val Loss")
report = evaluator.evaluate(metric_vals, baseline_vals, target_vals)

This prints a diagnostics table to the console (rendered with rich) and writes a high-resolution graphical report (drawn with seaborn) to the mbe_reports/ directory.
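
The snippet above leaves the three input arrays undefined. As a rough sketch, assuming each trained model contributes one metric value, one baseline value, and one target value, they could be assembled like this. The record keys, the placeholder numbers, and the helper `to_aligned_arrays` are hypothetical and not part of mbe_eval; whether the evaluator filters non-finite rows itself is not stated here, so the sketch drops them up front.

```python
import numpy as np

# Hypothetical run records with placeholder numbers; the keys are illustrative
# and not part of the mbe_eval API.
runs = [
    {"metric": 0.42, "val_loss": 1.10, "test_acc": 0.71},
    {"metric": 0.35, "val_loss": 0.85, "test_acc": 0.78},
    {"metric": float("nan"), "val_loss": 2.30, "test_acc": 0.10},  # diverged run
    {"metric": 0.30, "val_loss": 0.70, "test_acc": 0.82},
]

def to_aligned_arrays(records, keys=("metric", "val_loss", "test_acc")):
    """Return one 1-D array per key, keeping only runs where every value is finite."""
    table = np.array([[r[k] for k in keys] for r in records], dtype=float)
    keep = np.isfinite(table).all(axis=1)  # drop rows containing NaN or inf
    return tuple(table[keep, i] for i in range(len(keys)))

metric_vals, baseline_vals, target_vals = to_aligned_arrays(runs)
print(metric_vals.shape, baseline_vals.shape, target_vals.shape)
```

The resulting arrays share the same length and ordering, so index i refers to the same model in all three.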

Real PyTorch Demos

We provide two end-to-end PyTorch scripts in the examples/ directory that train neural networks and run the evaluation on the results.

1. The Acid Test (Stage 1): shows how a metric can successfully track capacity and noise, which gives false assurance that it is useful.

python examples/01_run_acid_test.py

2. The Heterogeneous Grid (Stage 4): the central demo. Trains 20 models with randomized hyperparameters, computes the Gradient Effective Rank, and runs the final MBE partial-correlation control to show that the metric is a disguised loss proxy.

python examples/02_run_heterogeneous_grid.py
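
For readers who want to build a comparable grid for their own metric, the sampling step might look roughly like the sketch below. The hyperparameter names, their ranges, and the helper `sample_config` are assumptions chosen for illustration; the actual search space used by the demo script is not documented here.

```python
import random

def sample_config(rng):
    """Draw one hypothetical training configuration (names and ranges are assumed)."""
    return {
        "lr": 10 ** rng.uniform(-4, -1),            # log-uniform learning rate
        "weight_decay": 10 ** rng.uniform(-6, -2),  # log-uniform weight decay
        "width": rng.choice([32, 64, 128, 256]),    # hidden layer width
        "depth": rng.randint(1, 4),                 # number of hidden layers
        "batch_size": rng.choice([32, 64, 128]),
    }

rng = random.Random(0)  # fixed seed so the grid is reproducible
configs = [sample_config(rng) for _ in range(20)]

for i, cfg in enumerate(configs[:3]):
    print(i, cfg)
```

Randomizing the configurations spreads the models across a range of validation losses and accuracies, which gives a partial-correlation control enough variation to work with.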

Repository Structure

metric-audit-paper-code/
├── mbe_eval/               # The core MBE evaluation API
│   ├── __init__.py
│   ├── core.py             # MBEEvaluator class
│   ├── utils.py            # PyTorch FIM_norm extraction
│   └── sample_eval.py      # Basic synthetic simulation
├── examples/               # Real end-to-end PyTorch demos
│   ├── 01_run_acid_test.py
│   └── 02_run_heterogeneous_grid.py
├── experiments/            # All 12 original paper experiment scripts
├── metric_audit/           # Core FIM_norm computation library
├── docs/
│   └── RESULTS.md          # Raw numerical results for the paper
├── PAPER.md                # Full technical writeup
├── requirements.txt
├── LICENSE
└── README.md

Citation

If you use the Marginal Baseline Eval in your own representation evaluation, please cite the accompanying manuscript:

Shadangi, A. (2026). Does It Beat the Baseline? A Comprehensive Negative Result 
on Gradient Effective Rank as a Generalization Predictor. arXiv preprint.
