Cross-framework block-scaled tensor primitive (FP8 / FP4 / MXFP8 / NVFP4 / INT4)
A cross-framework block-scaled tensor primitive for low-precision compute (FP8 / FP4 / MXFP8 / NVFP4 / INT4).
📚 Documentation · 📦 PyPI · 💬 Discussions · 🐛 Issues
```python
import numpy as np
import breccia

# Quantize a tensor to FP8 with per-block-K scaling (DeepSeek-v3 style).
x = np.random.randn(8, 256).astype(np.float32)
st = breccia.cast(x, breccia.Float8BlockScaling(block_k=128))

# Scaled matmul: data stays in FP8, scales fold into the FP32 accumulator.
A = breccia.cast(np.random.randn(16, 256).astype(np.float32),
                 breccia.Float8CurrentScaling())
W = breccia.cast(np.random.randn(256, 64).astype(np.float32),
                 breccia.Float8BlockScaling(block_k=128))
y = breccia.matmul(A, W)
```
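To make "scales fold into the FP32 accumulator" concrete, here is a minimal NumPy-only sketch of the idea. It is not breccia's kernel: the helper names (`quantize_per_tensor`, `quantize_block_k`, `scaled_matmul_reference`) are hypothetical, the payload is simulated on an integer grid because NumPy has no FP8 dtype, and the weight uses one scale per K-block and output column as a simplifying assumption.

```python
import numpy as np

QMAX = 127.0  # stand-in integer grid; a real FP8 E4M3 payload would use a max of 448


def quantize_per_tensor(a):
    """One scale for the whole tensor (current-scaling style)."""
    scale = np.abs(a).max() / QMAX
    q = np.round(a / scale).astype(np.float32)
    return q, np.float32(scale)


def quantize_block_k(w, block_k):
    """One scale per (K-block, output column) of a (K, N) weight."""
    k, n = w.shape
    blocks = w.reshape(k // block_k, block_k, n)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / QMAX  # shape (K/b, 1, N)
    q = np.round(blocks / scale).astype(np.float32)
    return q.reshape(k, n), scale[:, 0, :].astype(np.float32)  # scales: (K/b, N)


def scaled_matmul_reference(qa, sa, qw, sw, block_k):
    """Accumulate per-K-block partial products, applying scales in FP32."""
    m, k = qa.shape
    n = qw.shape[1]
    acc = np.zeros((m, n), dtype=np.float32)
    for i, start in enumerate(range(0, k, block_k)):
        partial = qa[:, start:start + block_k] @ qw[start:start + block_k, :]
        acc += partial * (sa * sw[i])  # scales touch the accumulator, not the payload
    return acc


rng = np.random.default_rng(0)
a = rng.standard_normal((16, 256)).astype(np.float32)
w = rng.standard_normal((256, 64)).astype(np.float32)

qa, sa = quantize_per_tensor(a)
qw, sw = quantize_block_k(w, block_k=128)
y_ref = scaled_matmul_reference(qa, sa, qw, sw, block_k=128)
rel_err = np.linalg.norm(y_ref - a @ w) / np.linalg.norm(a @ w)
```

The payload is never dequantized as a whole tensor: each K-block's partial product is rescaled once as it is added to the accumulator, which is why a per-block scale along K costs one multiply per block rather than one per element.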
Why
Every framework today reinvents block-scaled low-precision in incompatible ways:
- NVIDIA TransformerEngine ships four parallel recipe classes (`DelayedScaling`, `Float8CurrentScaling`, `Float8BlockScaling`, `MXFP8BlockScaling`), all NVIDIA-only.
- PyTorch torchao rolls its own `AffineQuantizedTensor`, which is PyTorch-only.
- DeepSeek-v3 has a private FP8 format. FP8-Flow-MoE (Nov 2025) has another. COAT has another for optimizer-state compression.
- Megatron, JAX, TorchTitan each re-derive scale-aware all-gather.
- AMD MI355, Trainium2, and TPU v6 each use scale semantics that are incompatible with the others.
No vendor can be the neutral substrate (NVIDIA can't ship for AMD, AMD can't ship for TPU). The cross-vendor gap is widening through 2026–2027 with FP4. breccia is the "safetensors of low-precision" — one neutral primitive that round-trips with each of them.
Sister library to scree:
- scree handles variable-length data (loose fragments).
- breccia handles low-precision data bound by its scale (fragments + cement).
What you get
A single core type — ScaledTensor(data, scale, recipe, layout) — plus
six recipes, four layouts, five bridges, and reference + Triton kernels.
Six recipes covering 95% of today's fragmentation:
| Recipe | Format | Block size | Used by |
|---|---|---|---|
| `DelayedScaling` | FP8 E4M3 / E5M2 | per-tensor | TE main recipe |
| `Float8CurrentScaling` | FP8 E4M3 / E5M2 | per-tensor | TE / torchao |
| `Float8BlockScaling` | FP8 E4M3 / E5M2 | 128 along K | DeepSeek-v3 |
| `MXFP8BlockScaling` | FP8 + E8M0 scale | 32 along K | OCP MX standard |
| `NVFP4BlockScaling` | FP4 E2M1 + FP8 scale | 16 along K | NVIDIA Blackwell |
| `INT4Scaling` | INT4 ± fp16 scale | configurable | GPTQ / AWQ family |
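The MXFP8 row pairs FP8 elements with an E8M0 scale, an exponent-only byte that can represent only powers of two. The sketch below shows one way such a shared scale can be derived per block of 32, following the shared-exponent rule as described in the OCP MX specification (floor of log2 of the block maximum, minus the element format's largest exponent, which is 8 for E4M3). The function name `e8m0_block_scales` is hypothetical, element rounding is omitted, and this is not presented as breccia's implementation.

```python
import numpy as np

E4M3_EMAX = 8      # exponent of the largest E4M3 normal (448 = 1.75 * 2**8)
E8M0_BIAS = 127    # E8M0 stores a biased exponent; value = 2**(byte - 127)


def e8m0_block_scales(x, block=32):
    """Return (biased exponent bytes, power-of-two scales) per block along the last axis."""
    rows, k = x.shape
    amax = np.abs(x.reshape(rows, k // block, block)).max(axis=-1)
    amax = np.maximum(amax, np.finfo(np.float32).tiny)  # avoid log2(0) for all-zero blocks
    shared_exp = np.floor(np.log2(amax)).astype(np.int32) - E4M3_EMAX
    shared_exp = np.clip(shared_exp, -E8M0_BIAS, E8M0_BIAS)  # byte 255 is left unused here
    encoded = (shared_exp + E8M0_BIAS).astype(np.uint8)
    scales = np.exp2(shared_exp.astype(np.float32))
    return encoded, scales


x = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float32)
encoded, scales = e8m0_block_scales(x)
scaled = x.reshape(4, 2, 32) / scales[..., None]  # values that would then be cast to E4M3
```

With this rule the largest magnitude in each block lands in [256, 512) after scaling, so a value above 448 would need to saturate when cast to E4M3. Recipes with a wider scale format (FP32 or FP8 scales) avoid the power-of-two restriction at the cost of a more expensive rescale.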
Five bridges for zero-copy interop:
| Bridge | Direction | Dep |
|---|---|---|
| `from_transformer_engine` / `to_transformer_engine` | TE Float8Tensor ↔ ScaledTensor | transformer-engine |
| `from_torchao` / `to_torchao` | AffineQuantizedTensor ↔ ScaledTensor | torchao |
| `save_safetensors` / `load_safetensors` | safetensors file ↔ dict of ScaledTensor | safetensors |
| `to_dlpack` / `from_dlpack` | zero-copy across NumPy / PyTorch / MLX / JAX | built-in |
| `from_deepseek_v3` / `to_deepseek_v3` | DeepSeek-v3 buffers ↔ ScaledTensor | none |
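The zero-copy claim for the DLPack bridge rests on the standard `__dlpack__` protocol. The sketch below demonstrates that mechanism with NumPy alone (`np.from_dlpack`, available since NumPy 1.22). It does not call breccia's `to_dlpack` / `from_dlpack`, whose signatures are not shown here, and the `uint8` payload is only a stand-in for raw low-precision bytes.

```python
import numpy as np

payload = np.arange(8, dtype=np.uint8)   # stand-in for raw low-precision bytes
scale = np.ones(2, dtype=np.float32)     # stand-in for the per-block scales

# Any object exposing __dlpack__ / __dlpack_device__ can be consumed this way.
payload_view = np.from_dlpack(payload)
scale_view = np.from_dlpack(scale)

# Both views alias the original buffers: nothing was copied.
shares = np.shares_memory(payload, payload_view) and np.shares_memory(scale, scale_view)
```

A scaled tensor has two buffers, so a bridge of this kind presumably has to hand over payload and scale together and carry the recipe alongside them, since DLPack itself only describes raw arrays.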
Memory footprint relative to FP32, computed at v0.0.1 for a (1024, 1024) weight:
| Format | Size | vs FP32 |
|---|---|---|
| FP32 | 4.19 MB | 1.00× |
| FP16 | 2.10 MB | 0.50× |
| FP8 (Float8CurrentScaling) | 1.05 MB | 0.25× |
| FP8 (Float8BlockScaling, b=128) | 1.08 MB | 0.26× |
| MXFP8 (block 32, E8M0 scale) | 1.08 MB | 0.26× |
| NVFP4 (block 16, E4M3 scale) | 1.11 MB | 0.27× |
| INT4 (group 128, fp16 scale) | 1.06 MB | 0.25× |
Reproduce: python benchmarks/bench_memory.py
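The table can be reproduced by hand as payload bytes plus scale bytes. The sketch below is a back-of-the-envelope check, not the benchmark script. Its storage assumptions are inferred from the numbers rather than stated by the project: one byte per element for every low-precision payload (including the 4-bit formats, i.e. unpacked), FP32 scales for the per-tensor and block-128 FP8 recipes, one-byte scales for MXFP8 and NVFP4, and two-byte scales for INT4.

```python
N_ELEMENTS = 1024 * 1024
FP32_BYTES = N_ELEMENTS * 4


def total_bytes(payload_bytes_per_element, block, scale_bytes):
    """Payload plus one scale per `block` elements (block=None means one scale per tensor)."""
    n_scales = 1 if block is None else N_ELEMENTS // block
    return int(N_ELEMENTS * payload_bytes_per_element) + n_scales * scale_bytes


# (payload bytes/element, block size, bytes per scale): all assumed, see the lead-in.
layouts = {
    "FP8 current": (1, None, 4),
    "FP8 block 128": (1, 128, 4),
    "MXFP8 block 32": (1, 32, 1),
    "NVFP4 block 16": (1, 16, 1),   # 4-bit payload assumed stored unpacked
    "INT4 group 128": (1, 128, 2),  # 4-bit payload assumed stored unpacked
}

# name -> (size in MB with 1 MB = 1e6 bytes, ratio vs FP32)
report = {
    name: (round(total_bytes(*spec) / 1e6, 2), round(total_bytes(*spec) / FP32_BYTES, 2))
    for name, spec in layouts.items()
}

# For comparison: NVFP4 with two 4-bit values packed per byte.
packed_nvfp4_bytes = total_bytes(0.5, 16, 1)
```

Under these assumptions the sketch matches every low-precision row of the table. It also suggests why the 4-bit rows sit near 0.25× rather than 0.125×: with packed storage the NVFP4 case would come to about 0.59 MB (0.14×).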
Accuracy (cosine similarity vs FP32 on Gaussian inputs, mean over 8 seeds):
| Recipe | Cos sim |
|---|---|
| Float8CurrentScaling (E4M3) | 0.9997 |
| Float8BlockScaling(block_k=128) | 0.9997 |
| MXFP8BlockScaling | 0.9974 |
| NVFP4BlockScaling | 0.9955 |
| INT4Scaling(group_size=128) | 0.9932 |
Reproduce: python benchmarks/bench_accuracy.py
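For readers who want to see what such a measurement involves, here is a minimal NumPy sketch of the metric applied to a simulated symmetric INT4 round trip with groups of 128. It is not the benchmark script. The helper names are hypothetical, the tensor shape is arbitrary, and it assumes the similarity is taken between the original tensor and its quantize-then-dequantize round trip, which the table does not spell out.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two tensors, flattened and computed in float64."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def fake_quant_int4(x, group_size=128):
    """Symmetric absmax INT4 round trip per group along the last axis."""
    rows, k = x.shape
    groups = x.reshape(rows, k // group_size, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # signed 4-bit grid: -7..7
    q = np.clip(np.round(groups / scale), -7, 7)
    return (q * scale).reshape(rows, k).astype(np.float32)


scores = []
for seed in range(8):  # mean over 8 seeds, mirroring the table's protocol
    x = np.random.default_rng(seed).standard_normal((64, 256)).astype(np.float32)
    scores.append(cosine_similarity(x, fake_quant_int4(x)))
mean_cos = float(np.mean(scores))
```

Cosine similarity is insensitive to a uniform rescaling of the output, so it reports how well the direction of the tensor is preserved, not its magnitude. A max-abs or relative-error check alongside it covers that case.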
Status
v0.1.3, beta. The API is production-ready, and every v0.1 component below is complete (✅).
| Component | Status |
|---|---|
| `ScaledTensor` type + invariants | ✅ |
| 6 ScalingRecipes (incl. asymmetric INT4 with zero-point) | ✅ |
| 4 Layouts | ✅ |
| `cast` / `dequantize` / `matmul` / `requantize` | ✅ |
| Bridges: TE / torchao / HF / DLPack / DeepSeek-v3 | ✅ |
| NumPy + PyTorch + MLX + JAX backends | ✅ |
| Native PyTorch FP8 acceleration (`torch.float8_e4m3fn` end-to-end) | ✅ |
| Straight-through estimator (`cast_ste`, `cast_ste_clipped`) | ✅ |
| Triton FP8 scaled matmul (per-tensor) | ✅ H100 validated — 0.8 ms warm, 6× faster than torch._scaled_mm |
| Triton block-scaled FP8 matmul (DeepSeek pattern) | ✅ H100 validated (cos sim 0.9813 vs FP32) |
| Triton AOT path (autotune=False default, fast first call) | ✅ |
| TransformerEngine bridge (forward + reverse, bit-exact) | ✅ H100 validated (max abs diff = 0 in both directions) |
250+ tests passing. CI on Python 3.10 / 3.11 / 3.12 (Ubuntu) + 3.11 (macOS).
Install
```bash
pip install breccia                           # NumPy backend
pip install "breccia[torch]"                  # + PyTorch
pip install "breccia[mlx]"                    # + MLX (Apple Silicon)
pip install "breccia[bridges]"                # + safetensors for HF bridge
pip install "breccia[torch,mlx,bridges,dev]"  # full dev setup
```
Examples
- `examples/01_quickstart.py`: cast + matmul
- `examples/02_recipe_portable_train.py`: train MXFP8, ship NVFP4 (same model code)
- `examples/03_checkpoint_with_scale.py`: save/load safetensors with scale metadata
- `examples/04_te_migration.py`: bridge from TransformerEngine
Documentation
- Getting started — install + first program
- Concepts — mental model: data + scale + recipe + layout
- Recipes — when to use each of the 6 recipes
- Formats — bit-level FP8 / FP4 / INT4 / E8M0 layouts
- API reference — every public function and class
- Bridges & migration — TE / torchao / HF / DLPack / DeepSeek
- Kernels — reference and Triton scaled-matmul design
- Numerics — accuracy / range trade-offs
- Architecture — internals, design decisions
- Benchmarks — methodology + reproduction
- FAQ
The name
A breccia is a sedimentary rock made of broken angular fragments held together by a cementing matrix. Low-precision data fragments + the scale matrix that gives them meaning — same structure.
It's the natural geological successor to scree: loose fragments (scree) become breccia when cemented together.
Contributing
PRs welcome. See CONTRIBUTING.md for the workflow. Open a GitHub Discussion for anything beyond a small fix.
License
Apache-2.0