Calibrax: Unified benchmarking framework for the JAX scientific ML ecosystem
Validated against: scikit-learn and SciPy references for representative regression, classification, distance, and divergence metrics.
Early Development — API is unstable and subject to breaking changes. Pin to specific commits if stability is required.
Calibrax (Calibrate + JAX) is a unified benchmarking and metrics framework for the JAX scientific ML ecosystem. It extracts and consolidates shared benchmarking, profiling, statistical analysis, and evaluation functionality from Datarax, Artifex, and Opifex.
Features
Metrics (111 registered Tier 0 metrics, 17 domains, 4-tier architecture)
Calibrax provides a 4-tier metric system covering the full spectrum of ML evaluation. The current registry contains 111 Tier 0 pure-function metrics; Tier 1-3 APIs, optional plugins, and metric-learning losses are part of the package architecture but are not all registered metric entries today.
| Tier | Name | Pattern | Examples |
|---|---|---|---|
| 0 | Pure Functions | `fn(predictions, targets) -> scalar` | MSE, cosine distance, BLEU |
| 1 | Frozen Backbone | `update() -> compute() -> reset()` | FID, BERTScore, Inception Score |
| 2 | Learned | `nnx.Module` with trainable weights | LPIPS |
| 3 | Metric Learning | Differentiable embedding loss | Contrastive, Triplet, ArcFace |
Functional domains: regression, classification, calibration, segmentation, distance, divergence, information, ranking, statistical, clustering, fairness, image, text, audio, geometric, graph, manifold
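A Tier 0 metric is nothing more than a pure function of arrays, which is what lets the registry and JAX transformations compose cleanly. As a minimal illustration of the contract (a hand-rolled sketch, not Calibrax source):

```python
import jax
import jax.numpy as jnp

def my_mse(predictions: jax.Array, targets: jax.Array) -> jax.Array:
    """Tier 0 contract: fn(predictions, targets) -> scalar, stateless and JIT-safe."""
    return jnp.mean((predictions - targets) ** 2)

# Purity means the usual JAX transformations apply directly:
fast_mse = jax.jit(my_mse)
grad_wrt_predictions = jax.grad(my_mse)
```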
Key capabilities:
- MetricRegistry with axiom-based discovery for registered Tier 0 metrics (`list_true_metrics()`, `list_by_invariance("rotation")`)
- Geometric distance hierarchy — Euclidean, Riemannian (SPD, Grassmann, Stiefel), pseudo-Riemannian (ultrahyperbolic), Finsler (Randers)
- Graph metrics — spectral distance, resistance distance, Floyd-Warshall shortest paths
- Reference checks — representative Tier 0 metrics are tested against scikit-learn and SciPy references with `1e-6` tolerance; see Peer Comparison
- Composition — `MetricCollection`, `WeightedMetric`, `MetricSuite`, `ThresholdMetric`
- Wrappers — `BootstrapMetric` (confidence intervals), `ClasswiseWrapper`, `MetricTracker`, `MinMaxTracker`
- Metric learning losses — contrastive, triplet margin, NTXent, ArcFace, CosFace, ProxyNCA, ProxyAnchor, with hard/semi-hard negative mining (see the sketch after this list)
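For reference, the Tier 3 losses follow standard metric-learning formulations. A minimal pure-JAX sketch of the triplet margin loss (illustrative only, not Calibrax's implementation):

```python
import jax.numpy as jnp

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss over embedding batches of shape (batch, dim):
    pull anchor-positive pairs together and push anchor-negative pairs
    apart by at least `margin`."""
    d_pos = jnp.linalg.norm(anchor - positive, axis=-1)
    d_neg = jnp.linalg.norm(anchor - negative, axis=-1)
    return jnp.mean(jnp.maximum(d_pos - d_neg + margin, 0.0))
```

Because the loss is differentiable end to end, it can sit directly in a training step via `jax.grad`, which is what distinguishes Tier 3 from the evaluation-only tiers.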
Benchmarking & Profiling
- Timing — Warm-up aware timing with JIT compilation separation (illustrated after this list)
- Resource monitoring — CPU, memory, GPU memory/clock/power tracking
- Energy & carbon — Energy measurement with carbon footprint estimation
- FLOPS & roofline — XLA-level FLOP counting, roofline performance analysis
- Compilation — XLA compilation profiling and tracing
- Complexity — Algorithmic complexity analysis
- Hardware — Automatic hardware detection and capability reporting
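Why warm-up awareness matters: the first call to a `jax.jit`-compiled function includes tracing and XLA compilation, and JAX dispatch is asynchronous, so naive wall-clock timing is doubly misleading. A framework-free illustration of the idea (not Calibrax's profiler):

```python
import time

import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((512, 512))

# First call: tracing + XLA compilation + execution.
t0 = time.perf_counter()
step(x).block_until_ready()  # block_until_ready() forces async dispatch to finish
first_call = time.perf_counter() - t0

# Steady state: the compiled executable only.
t0 = time.perf_counter()
for _ in range(10):
    step(x).block_until_ready()
steady_state = (time.perf_counter() - t0) / 10

print(f"first call (incl. compile): {first_call:.4f}s, steady state: {steady_state:.6f}s")
```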
Analysis & Infrastructure
- Statistical analysis — Bootstrap confidence intervals (sketched after this list), hypothesis testing, effect sizes, outlier detection
- Regression detection — Direction-aware detection with configurable severity levels
- Comparison & ranking — Cross-configuration comparison, Pareto front analysis, aggregate scoring
- Validation — Convergence analysis and accuracy assessment
- Storage — JSON-per-run file backend with baseline management
- Exporters — W&B and MLflow integration, publication-ready LaTeX/HTML/CSV tables and matplotlib plots
- CI integration — Regression gate with git bisect automation
- Monitoring — Production alerting with configurable thresholds
- CLI — `calibrax ingest|export|check|baseline|trend|summary|profile`
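The bootstrap confidence intervals mentioned above rest on a simple resampling idea, shown here as a minimal NumPy sketch (the package's statistical analyzer is the real entry point; this is not its implementation):

```python
import numpy as np

def bootstrap_ci(samples, stat=np.mean, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, then take the
    empirical (alpha/2, 1 - alpha/2) quantiles of the statistic."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    idx = rng.integers(0, len(samples), size=(n_resamples, len(samples)))
    stats = stat(samples[idx], axis=1)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

latencies_ms = [12.1, 11.8, 12.5, 13.0, 11.9, 12.2, 12.7]  # example data
lo, hi = bootstrap_ci(latencies_ms)
print(f"95% CI for mean latency: [{lo:.2f}, {hi:.2f}] ms")
```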
Quick Start
```python
import jax.numpy as jnp

from calibrax.metrics import MetricRegistry, calculate_all
from calibrax.metrics.functional.regression import mse, mae, r_squared

predictions = jnp.array([1.1, 2.3, 2.8, 4.2, 4.7])
targets = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Individual metrics
print(f"MSE: {mse(predictions, targets):.4f}")
print(f"R²: {r_squared(predictions, targets):.4f}")

# Batch computation of all registered metrics
results = calculate_all(predictions, targets, metrics=["mse", "mae", "rmse", "r_squared"])

# Registry discovery
registry = MetricRegistry()
true_metrics = registry.list_true_metrics()
rotation_inv = registry.list_by_invariance("rotation")
```
Installation
```bash
# Basic installation
uv pip install calibrax

# With statistical analysis (scipy)
uv pip install "calibrax[stats]"

# With GPU monitoring
uv pip install "calibrax[gpu]"

# With image quality plugins (FID, Inception Score)
uv pip install "calibrax[image]"

# With text quality plugins (BERTScore)
uv pip install "calibrax[text]"

# With publication export (matplotlib)
uv pip install "calibrax[publication]"
```
Development Setup
The recommended way to set up a development environment is with the included setup.sh script. It auto-detects your platform (Linux CUDA, macOS Intel, Apple Silicon), creates a virtual environment, installs all dependencies, and generates an activation script.
```bash
git clone https://github.com/avitai/calibrax.git
cd calibrax

# Standard setup with automatic GPU detection
./setup.sh

# Activate the environment
source ./activate.sh
```
setup.sh Options
| Flag | Description |
|---|---|
| `--cpu-only` | Force CPU-only setup, skip GPU/Metal detection |
| `--metal` | Enable Metal acceleration on Apple Silicon Macs |
| `--deep-clean` | Clear JAX cache, pip cache, pytest cache, and other artifacts |
| `--force` | Force reinstallation even if environment exists |
| `--verbose`, `-v` | Show detailed output during setup |
```bash
# Examples
./setup.sh --cpu-only         # CPU-only development
./setup.sh --metal            # Apple Silicon with Metal
./setup.sh --force --verbose  # Force reinstall with full output
./setup.sh --deep-clean       # Clean everything and start fresh
```
Manual Setup
If you prefer to set up manually:
```bash
git clone https://github.com/avitai/calibrax.git
cd calibrax
uv venv
uv pip install -e ".[dev,test,stats]"
uv run pre-commit install
```
Architecture
```text
src/calibrax/
├── core/            Data models, protocols, adapters, result container, registry
├── profiling/       Timing, resources, GPU, energy, FLOPS, roofline, compilation,
│                    complexity, hardware, tracing, carbon
├── statistics/      Statistical analyzer, significance testing
├── analysis/        Regression, comparison, ranking, scaling, Pareto, changepoint
├── validation/      Convergence, accuracy, validation framework
├── monitoring/      Alerts, production monitoring
├── storage/         JSON store, baselines
├── exporters/       W&B, MLflow, publication-ready output
├── metrics/
│   ├── functional/     111 Tier 0 pure functions across 17 domains
│   ├── stateful/       Tier 1-2 base classes (FrozenBackboneMetric, LearnedMetric)
│   ├── learning/       Tier 3 metric learning losses and miners
│   ├── plugins/        Optional-dependency metrics (FID, BERTScore, LPIPS)
│   ├── composition.py  MetricCollection, WeightedMetric, MetricSuite, ThresholdMetric
│   ├── wrappers.py     BootstrapMetric, ClasswiseWrapper, MetricTracker, MinMaxTracker
│   └── _registry.py    MetricRegistry singleton with axiom-based discovery
├── ci/              CI regression gate, bisection engine
└── cli/             Command-line interface
```
Examples
Runnable examples are in examples/metrics/, available as both Python scripts and Jupyter notebooks:
| Example | Level | Topics |
|---|---|---|
| 01_quickstart.py | Beginner | Individual metrics, calculate_all, registry queries |
| 02_regression_deep_dive.py | Beginner | Same-shape regression metrics, outlier sensitivity |
| 03_classification.py | Intermediate | Classification, calibration, segmentation |
| 04_distances.py | Intermediate | Euclidean, hyperbolic, divergences, information theory |
| 05_composition.py | Intermediate | Collections, weighted metrics, quality gates, tracking |
| 06_image_quality.py | Intermediate | PSNR, SSIM, MS-SSIM, BLEU, ROUGE |
| 07_metric_learning.py | Advanced | Contrastive, triplet, NTXent, ArcFace, mining |
| 08_manifold_graph.py | Advanced | SPD, Grassmann, spectral distance, Floyd-Warshall |
Development
```bash
# Activate the local environment first
source activate.sh

# Run tests
uv run pytest tests/ -v --cov=calibrax --cov-report=term-missing

# Lint & format
uv run ruff check src/ tests/ --fix
uv run ruff format src/ tests/

# Type check
uv run pyright src/

# All quality checks
uv run pre-commit run --all-files

# Build documentation
uv run mkdocs build --strict --clean

# Convert examples to Jupyter notebooks
uv run python scripts/jupytext_converter.py batch-py-to-nb examples/metrics/
```
License
MIT