breccia

A cross-framework block-scaled tensor primitive for low-precision compute (FP8 / FP4 / MXFP8 / NVFP4 / INT4).


import numpy as np
import breccia

# Quantize a tensor to FP8 with per-block-K scaling (DeepSeek-v3 style).
x = np.random.randn(8, 256).astype(np.float32)
st = breccia.cast(x, breccia.Float8BlockScaling(block_k=128))

# Scaled matmul: data stays in FP8, scales fold into the FP32 accumulator.
A = breccia.cast(np.random.randn(16, 256).astype(np.float32),
                 breccia.Float8CurrentScaling())
W = breccia.cast(np.random.randn(256, 64).astype(np.float32),
                 breccia.Float8BlockScaling(block_k=128))
y = breccia.matmul(A, W)
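
To make "scales fold into the FP32 accumulator" concrete, here is a NumPy-only sketch of a block-scaled matmul. It does not call breccia and is not its kernel: `fake_e4m3`, `quantize_rows`, and `block_scaled_matmul` are hypothetical helpers, and the FP8 rounding is a crude stand-in that keeps four significant bits and ignores subnormals.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value of FP8 E4M3 (the "fn" variant)


def fake_e4m3(x):
    """Rough FP8 E4M3 stand-in: clip to +-448 and keep 4 significant bits."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mantissa, exponent = np.frexp(x)           # x == mantissa * 2**exponent
    mantissa = np.round(mantissa * 16) / 16    # 1 implicit + 3 stored mantissa bits
    return np.ldexp(mantissa, exponent).astype(np.float32)


def quantize_rows(x, block_k):
    """Scale each run of `block_k` elements along the last axis by its own amax."""
    rows, k = x.shape
    blocks = x.reshape(rows, k // block_k, block_k)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0).astype(np.float32)
    return fake_e4m3(blocks / scale), scale[..., 0]


def block_scaled_matmul(a, w, block_k=128):
    """Multiply low-precision blocks, apply both scales to each partial sum in FP32."""
    qa, sa = quantize_rows(a, block_k)      # (M, nb, block_k), (M, nb)
    qw, sw = quantize_rows(w.T, block_k)    # (N, nb, block_k), (N, nb)
    acc = np.zeros((a.shape[0], w.shape[1]), dtype=np.float32)
    for j in range(qa.shape[1]):
        partial = qa[:, j, :] @ qw[:, j, :].T
        acc += partial * sa[:, j, None] * sw[None, :, j]
    return acc


rng = np.random.default_rng(0)
a = rng.standard_normal((16, 256)).astype(np.float32)
w = rng.standard_normal((256, 64)).astype(np.float32)
y = block_scaled_matmul(a, w)
y_ref = a @ w
cos_sim = float((y * y_ref).sum() / (np.linalg.norm(y) * np.linalg.norm(y_ref)))
print(f"cosine similarity vs FP32: {cos_sim:.4f}")
```

The payload is never dequantized as a whole; only each K-block's partial product is rescaled before it is added to the accumulator.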

Why

Every framework today reinvents block-scaled low-precision compute in its own incompatible way:

NVIDIA TransformerEngine ships four parallel recipe classes (DelayedScaling, Float8CurrentScaling, Float8BlockScaling, MXFP8BlockScaling) — NVIDIA-only.
PyTorch torchao rolls its own AffineQuantizedTensor — PyTorch-only.
DeepSeek-v3 has a private FP8 format. FP8-Flow-MoE (Nov 2025) has another. COAT has another for optimizer-state compression.
Megatron, JAX, and TorchTitan each re-derive scale-aware all-gather.
AMD MI355, Trainium2, and TPU v6 each define scale semantics that are incompatible with the others.

No vendor can be the neutral substrate (NVIDIA can't ship for AMD, AMD can't ship for TPU). The cross-vendor gap is widening through 2026–2027 with FP4. breccia is the "safetensors of low-precision" — one neutral primitive that round-trips with each of them.

Sister library to scree:

scree handles variable-length data (loose fragments).
breccia handles low-precision data bound by its scale (fragments + cement).

What you get

A single core type — ScaledTensor(data, scale, recipe, layout) — plus six recipes, four layouts, five bridges, and reference + Triton kernels.
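
As a mental model for the `(data, scale, recipe, layout)` bundle, the sketch below defines a toy container with one invariant: the scale array must hold exactly one entry per K-block of the payload. `ToyScaledTensor` and its `block_k` field are hypothetical and only illustrate the idea; the real `ScaledTensor` may store and validate these differently.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ToyScaledTensor:
    data: np.ndarray   # low-precision payload (int8 codes as a stand-in)
    scale: np.ndarray  # one scale per block of `block_k` elements along the last axis
    recipe: str        # label only in this sketch
    layout: str        # label only in this sketch
    block_k: int       # assumption: block size kept as a plain field

    def __post_init__(self):
        rows, k = self.data.shape
        if k % self.block_k != 0:
            raise ValueError("K must be a multiple of block_k")
        expected = (rows, k // self.block_k)
        if self.scale.shape != expected:
            raise ValueError(f"scale shape {self.scale.shape} != {expected}")

    def dequantize(self):
        # Broadcast each block's scale across its block_k elements.
        per_element = np.repeat(self.scale, self.block_k, axis=1)
        return self.data.astype(np.float32) * per_element


codes = np.arange(8, dtype=np.int8).reshape(2, 4)
scales = np.array([[0.5, 2.0], [1.0, 0.25]], dtype=np.float32)
st = ToyScaledTensor(codes, scales, recipe="toy-block", layout="row-major", block_k=2)
restored = st.dequantize()
print(restored)
```

Keeping the payload and its scale in one validated object is what prevents a scale from being dropped or mismatched when the tensor crosses a framework boundary.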

Six recipes covering 95% of today's fragmentation:

Recipe	Format	Block size	Used by
`DelayedScaling`	FP8 E4M3 / E5M2	per-tensor	TE main recipe
`Float8CurrentScaling`	FP8 E4M3 / E5M2	per-tensor	TE / torchao
`Float8BlockScaling`	FP8 E4M3 / E5M2	128 along K	DeepSeek-v3
`MXFP8BlockScaling`	FP8 + E8M0 scale	32 along K	OCP MX standard
`NVFP4BlockScaling`	FP4 E2M1 + FP8 scale	16 along K	NVIDIA Blackwell
`INT4Scaling`	INT4 + fp16 scale	configurable	GPTQ / AWQ family
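
The E8M0 scale in the MXFP8 row is an 8-bit, exponent-only number: a stored byte `c` encodes the power of two `2**(c - 127)`, with 255 reserved for NaN. The sketch below derives one such scale per block of 32 elements. The helper names are hypothetical, and rounding the exponent up (so that no element exceeds the E4M3 range after scaling) is an assumption made here for illustration, not necessarily the rule breccia applies.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
E8M0_BIAS = 127    # stored byte c encodes 2**(c - 127)


def e8m0_scale_codes(x, block=32):
    """One E8M0 byte per block of `block` elements along the last axis."""
    rows, cols = x.shape
    amax = np.abs(x.reshape(rows, cols // block, block)).max(axis=-1)
    safe = np.where(amax > 0, amax, 1.0)               # avoid log2(0) for all-zero blocks
    exponent = np.ceil(np.log2(safe / E4M3_MAX))       # smallest power of two that fits
    exponent = np.clip(exponent, -E8M0_BIAS, E8M0_BIAS).astype(np.int32)
    return (exponent + E8M0_BIAS).astype(np.uint8)


def decode_e8m0(codes):
    """Turn stored bytes back into FP32 power-of-two scales."""
    return np.ldexp(np.float32(1.0), codes.astype(np.int32) - E8M0_BIAS)


x = np.random.default_rng(0).standard_normal((4, 64))
codes = e8m0_scale_codes(x)
scales = decode_e8m0(codes)
block_amax = np.abs(x.reshape(4, 2, 32)).max(axis=-1)
print(codes)
print(block_amax / scales)   # each block's largest magnitude after scaling
```

Because the scale is a pure power of two, applying it only shifts exponents and never changes mantissa bits.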

Five bridges for interop (zero-copy where the target format allows it):

Bridge	Direction	Dep
`from_transformer_engine` / `to_transformer_engine`	TE Float8Tensor ↔ ScaledTensor	`transformer-engine`
`from_torchao` / `to_torchao`	AffineQuantizedTensor ↔ ScaledTensor	`torchao`
`save_safetensors` / `load_safetensors`	safetensors file ↔ dict of ScaledTensor	`safetensors`
`to_dlpack` / `from_dlpack`	zero-copy across NumPy / PyTorch / MLX / JAX	built-in
`from_deepseek_v3` / `to_deepseek_v3`	DeepSeek-v3 buffers ↔ ScaledTensor	none
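
For the DLPack row, "zero-copy" means the consumer receives a view of the producer's buffer instead of a duplicate. The sketch below shows that property with NumPy alone, assuming a recent NumPy that provides `np.from_dlpack`; it does not use breccia's own `to_dlpack` / `from_dlpack`, whose signatures are not shown here.

```python
import numpy as np

# Raw low-precision bytes, viewed as uint8 for the purpose of the exchange.
payload = np.arange(6, dtype=np.uint8).reshape(2, 3)

# Consumer side: import through the DLPack protocol.
view = np.from_dlpack(payload)

print(np.shares_memory(payload, view))  # both arrays refer to the same buffer

# A write on the producer side is visible through the imported view.
payload[0, 0] = 255
print(view[0, 0])
```

The same protocol is what lets a buffer move between NumPy, PyTorch, MLX, and JAX without a copy, as long as both sides can address the device that holds it.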

Memory footprint relative to FP32 for a (1024, 1024) weight, computed at v0.0.1:

Format	Bytes	vs FP32
FP32	4.19 MB	1.00×
FP16	2.10 MB	0.50×
FP8 (Float8CurrentScaling)	1.05 MB	0.25×
FP8 (Float8BlockScaling, b=128)	1.08 MB	0.26×
MXFP8 (block 32, E8M0 scale)	1.08 MB	0.26×
NVFP4 (block 16, E4M3 scale)	1.11 MB	0.27×
INT4 (group 128, fp16 scale)	1.06 MB	0.25×

Reproduce: python benchmarks/bench_memory.py
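
The figures in the table can be reproduced by hand as payload bytes plus scale bytes. The arithmetic below is a sketch, not `bench_memory.py`. It matches the table only under two assumptions inferred from the numbers rather than stated in the text: FP8 per-tensor and block scales are stored as FP32, and the 4-bit formats are stored unpacked at one byte per element. A packed NVFP4 row is included for comparison.

```python
ROWS, COLS = 1024, 1024
N = ROWS * COLS


def footprint(bytes_per_element, block, bytes_per_scale):
    """Payload bytes plus one scale per `block` elements (None = one per tensor)."""
    n_scales = 1 if block is None else N // block
    return N * bytes_per_element + n_scales * bytes_per_scale


formats = {
    "FP32": footprint(4, None, 0),
    "FP16": footprint(2, None, 0),
    "FP8 per-tensor, FP32 scale": footprint(1, None, 4),
    "FP8 block 128, FP32 scale": footprint(1, 128, 4),
    "MXFP8 block 32, E8M0 scale": footprint(1, 32, 1),
    "NVFP4 block 16, unpacked": footprint(1, 16, 1),
    "INT4 group 128, fp16 scale, unpacked": footprint(1, 128, 2),
    "NVFP4 block 16, packed (2 values per byte)": footprint(0.5, 16, 1),
}

fp32_bytes = formats["FP32"]
for name, size in formats.items():
    print(f"{name:45s} {size / 1e6:5.2f} MB  {size / fp32_bytes:.2f}x")
```

If the 4-bit payloads were packed two values per byte, the NVFP4 row would come out near 0.59 MB, so the 4-bit rows in the table are best read as unpacked storage.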

Accuracy (cosine similarity vs FP32 on Gaussian inputs, mean over 8 seeds):

Recipe	Cos sim
Float8CurrentScaling (E4M3)	0.9997
Float8BlockScaling(block_k=128)	0.9997
MXFP8BlockScaling	0.9974
NVFP4BlockScaling	0.9955
INT4Scaling(group_size=128)	0.9932

Reproduce: python benchmarks/bench_accuracy.py
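
The metric itself is simple to sketch: quantize, dequantize, and compare with the FP32 original by cosine similarity, averaged over 8 seeds. The example below uses a symmetric INT4 quantizer with groups of 128 written in plain NumPy. The tensor shape, the quantizer details, and the helper names are assumptions, so the result should land in the same region as the INT4 row rather than match it digit for digit.

```python
import numpy as np


def int4_fake_quant(x, group_size=128):
    """Symmetric INT4: one scale per group, codes in [-8, 7], then dequantize."""
    rows, cols = x.shape
    groups = x.reshape(rows, cols // group_size, group_size)
    amax = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / 7.0, 1.0)
    codes = np.clip(np.round(groups / scale), -8, 7)
    return (codes * scale).reshape(rows, cols)


def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


scores = []
for seed in range(8):
    x = np.random.default_rng(seed).standard_normal((64, 1024)).astype(np.float32)
    scores.append(cosine_similarity(x, int4_fake_quant(x)))

mean_cos = float(np.mean(scores))
print(f"mean cosine similarity over 8 seeds: {mean_cos:.4f}")
```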

Status

v0.1.2, beta — production-ready API.

Component	Status
`ScaledTensor` type + invariants	✅
6 ScalingRecipes (incl. asymmetric INT4 with zero-point)	✅
4 Layouts	✅
`cast` / `dequantize` / `matmul` / `requantize`	✅
Bridges: TE / torchao / HF / DLPack / DeepSeek-v3	✅
NumPy + PyTorch + MLX + JAX backends	✅
Native PyTorch FP8 acceleration (`torch.float8_e4m3fn` end-to-end)	✅
Straight-through estimator (`cast_ste`, `cast_ste_clipped`)	✅
Triton FP8 scaled matmul (per-tensor)	✅ H100 validated — 0.8 ms warm, 6× faster than `torch._scaled_mm`
Triton block-scaled FP8 matmul (DeepSeek pattern)	✅ H100 validated (cos sim 0.9813 vs FP32)
Triton AOT path (autotune=False default, fast first call)	✅
TransformerEngine bridge (forward direction, bit-exact)	✅ H100 validated (max abs diff = 0)
TransformerEngine bridge (reverse direction across TE 2.x churn)	🟡 forward is bit-exact; reverse needs per-version constructor pin

250+ tests passing. CI on Python 3.10 / 3.11 / 3.12 (Ubuntu) + 3.11 (macOS).

Install

pip install breccia                     # NumPy backend
pip install "breccia[torch]"            # + PyTorch
pip install "breccia[mlx]"              # + MLX (Apple Silicon)
pip install "breccia[bridges]"          # + safetensors for HF bridge
pip install "breccia[torch,mlx,bridges,dev]"  # full dev setup
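
After installing, a quick way to see which optional dependencies are importable is to probe for them without importing them. This sketch uses only the standard library. The mapping from extras to import names (`torch`, `mlx`, `safetensors`) is an assumption based on the comments above, and `available_backends` is a hypothetical helper, not part of breccia.

```python
import importlib.util

# Assumed mapping: import name -> the extra that is expected to provide it.
OPTIONAL_BACKENDS = {
    "torch": "breccia[torch]",
    "mlx": "breccia[mlx]",
    "safetensors": "breccia[bridges]",
}


def available_backends():
    """Report which optional packages can be found, without importing them."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in OPTIONAL_BACKENDS
    }


status = available_backends()
for name, found in status.items():
    hint = "" if found else f'  (try: pip install "{OPTIONAL_BACKENDS[name]}")'
    print(f"{name:12s} {'found' if found else 'missing'}{hint}")
```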

Examples

examples/01_quickstart.py — cast + matmul
examples/02_recipe_portable_train.py — train MXFP8, ship NVFP4 (same model code)
examples/03_checkpoint_with_scale.py — save/load safetensors with scale metadata
examples/04_te_migration.py — bridge from TransformerEngine

Documentation

Getting started — install + first program
Concepts — mental model: data + scale + recipe + layout
Recipes — when to use each of the 6 recipes
Formats — bit-level FP8 / FP4 / INT4 / E8M0 layouts
API reference — every public function and class
Bridges & migration — TE / torchao / HF / DLPack / DeepSeek
Kernels — reference and Triton scaled-matmul design
Numerics — accuracy / range trade-offs
Architecture — internals, design decisions
Benchmarks — methodology + reproduction
FAQ

The name

A breccia is a sedimentary rock made of broken angular fragments held together by a cementing matrix. Low-precision data fragments + the scale matrix that gives them meaning — same structure.

It's the natural geological successor to scree: loose fragments (scree) become breccia when cemented together.

Contributing

PRs welcome. See CONTRIBUTING.md for the workflow. Open a GitHub Discussion for anything beyond a small fix.

License

Apache-2.0

Project details

Release history

0.1.3: May 24, 2026
0.1.2: May 24, 2026 (this version)
0.1.1: May 24, 2026
0.1.0: May 24, 2026

Download files


Source Distribution

breccia-0.1.2.tar.gz (51.0 kB view details)

Uploaded May 24, 2026 Source

Built Distribution


breccia-0.1.2-py3-none-any.whl (40.2 kB view details)

Uploaded May 24, 2026 Python 3

File details

Details for the file breccia-0.1.2.tar.gz.

File metadata

Download URL: breccia-0.1.2.tar.gz
Upload date: May 24, 2026
Size: 51.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for breccia-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`0e67bdc6e7651fc1bb562a022598752faf47f1adaa6fc53d67352a61725c3852`
MD5	`808061c2e5b4906ab5d5de894a7e210b`
BLAKE2b-256	`17b9e55f7ffb75a85b1626f5298dc5863dbca4b02d543336b7cefd2680068f8c`


File details

Details for the file breccia-0.1.2-py3-none-any.whl.

File metadata

Download URL: breccia-0.1.2-py3-none-any.whl
Upload date: May 24, 2026
Size: 40.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.15

File hashes

Hashes for breccia-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1c09a923142d4e1a08b8dc90ba248dd0858623574a9da266870d11d530688d2b`
MD5	`60642036cc336bd99fd769dba38b5047`
BLAKE2b-256	`afdfd3688c88a60c556998af663281f15c4da48bfec9a86703e78954d7d640bd`
