Cross-framework block-scaled tensor primitive (FP8 / FP4 / MXFP8 / NVFP4 / INT4)

A cross-framework block-scaled tensor primitive for low-precision compute (FP8 / FP4 / MXFP8 / NVFP4 / INT4).

import numpy as np
import breccia

# Quantize a tensor to FP8 with per-block-K scaling (DeepSeek-v3 style).
x = np.random.randn(8, 256).astype(np.float32)
st = breccia.cast(x, breccia.Float8BlockScaling(block_k=128))

# Scaled matmul: data stays in FP8, scales fold into the FP32 accumulator.
A = breccia.cast(np.random.randn(16, 256).astype(np.float32),
                 breccia.Float8CurrentScaling())
W = breccia.cast(np.random.randn(256, 64).astype(np.float32),
                 breccia.Float8BlockScaling(block_k=128))
y = breccia.matmul(A, W)
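
To make "scales fold into the FP32 accumulator" concrete, here is a minimal NumPy sketch of the arithmetic. It does not call breccia and is not its kernel: the scaled values stay in float32 instead of real FP8, and the choice of one scale per (K-block, output column) for W is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 256)).astype(np.float32)
W = rng.standard_normal((256, 64)).astype(np.float32)

FP8_MAX = np.float32(448.0)  # largest finite E4M3 value
block_k = 128
n_blocks = W.shape[0] // block_k

# Per-tensor scale for A: a single scalar.
s_a = np.float32(np.abs(A).max() / FP8_MAX)
A_q = A / s_a  # would be stored as FP8; kept in float32 in this sketch

# Block scale for W: one scale per (K-block, output column). Assumed granularity.
W_blocks = W.reshape(n_blocks, block_k, W.shape[1])
s_w = (np.abs(W_blocks).max(axis=1) / FP8_MAX).astype(np.float32)  # (n_blocks, 64)
W_q = W_blocks / s_w[:, None, :]                                   # (n_blocks, 128, 64)

# Multiply the scaled payloads block by block, then apply the scales
# to each partial product while accumulating in FP32.
y = np.zeros((A.shape[0], W.shape[1]), dtype=np.float32)
for b in range(n_blocks):
    partial = A_q[:, b * block_k:(b + 1) * block_k] @ W_q[b]
    y += partial * (s_a * s_w[b])  # s_w[b] broadcasts across rows

print("max abs error vs A @ W:", float(np.abs(y - A @ W).max()))
```

Because no FP8 rounding is applied here, the result matches `A @ W` up to float32 rounding, which shows that applying scales after each partial product is algebraically exact. In a real FP8 path the only additional error comes from rounding the payloads.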

Why

Every framework today reinvents block-scaled low-precision in incompatible ways:

  • NVIDIA TransformerEngine ships four parallel recipe classes (DelayedScaling, Float8CurrentScaling, Float8BlockScaling, MXFP8BlockScaling) — NVIDIA-only.
  • PyTorch torchao rolls its own AffineQuantizedTensor — PyTorch-only.
  • DeepSeek-v3 has a private FP8 format. FP8-Flow-MoE (Nov 2025) has another. COAT has another for optimizer-state compression.
  • Megatron, JAX, and TorchTitan each re-derive scale-aware all-gather.
  • AMD MI355, Trainium2, and TPU v6 each define scale semantics that are incompatible with the others.

No vendor can be the neutral substrate (NVIDIA can't ship for AMD, AMD can't ship for TPU), and the cross-vendor gap is expected to widen through 2026–2027 as FP4 arrives. breccia aims to be the "safetensors of low-precision": one neutral primitive that round-trips with each of them.

Sister library to scree:

  • scree handles variable-length data (loose fragments).
  • breccia handles low-precision data bound by its scale (fragments + cement).

What you get

A single core type — ScaledTensor(data, scale, recipe, layout) — plus six recipes, four layouts, five bridges, and reference + Triton kernels.
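
As a rough mental model of "data bound to its scale", the sketch below uses a hypothetical `ScaledTensorSketch` dataclass. It is not breccia's `ScaledTensor`: the field types, the `block_k` stand-in for a recipe, the string layout, and the invariant checked in `__post_init__` are all assumptions made for illustration.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ScaledTensorSketch:
    """Hypothetical stand-in for illustration; not breccia's ScaledTensor."""

    data: np.ndarray            # low-precision payload, shape (M, K)
    scale: np.ndarray           # one scale per K-block, shape (M, K // block_k)
    block_k: int                # stands in for the recipe
    layout: str = "row_major"   # stands in for the layout

    def __post_init__(self):
        m, k = self.data.shape
        if k % self.block_k != 0:
            raise ValueError("K must be a multiple of block_k")
        if self.scale.shape != (m, k // self.block_k):
            raise ValueError("scale shape does not match the block grid")

    def dequantize(self) -> np.ndarray:
        # Expand each scale across its block, then multiply elementwise.
        expanded = np.repeat(self.scale, self.block_k, axis=1)
        return self.data.astype(np.float32) * expanded


data = np.ones((2, 8), dtype=np.int8)
scale = np.array([[0.5, 2.0], [1.0, 4.0]], dtype=np.float32)
st = ScaledTensorSketch(data, scale, block_k=4)
print(st.dequantize())
```

The point of the invariant is that a payload without a matching scale grid is meaningless, so the pair is validated at construction time and never travels separately.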

Six recipes covering 95% of today's fragmentation:

| Recipe | Format | Block size | Used by |
| --- | --- | --- | --- |
| DelayedScaling | FP8 E4M3 / E5M2 | per-tensor | TE main recipe |
| Float8CurrentScaling | FP8 E4M3 / E5M2 | per-tensor | TE / torchao |
| Float8BlockScaling | FP8 E4M3 / E5M2 | 128 along K | DeepSeek-v3 |
| MXFP8BlockScaling | FP8 + E8M0 scale | 32 along K | OCP MX standard |
| NVFP4BlockScaling | FP4 E2M1 + FP8 scale | 16 along K | NVIDIA Blackwell |
| INT4Scaling | INT4 ± fp16 scale | configurable | GPTQ / AWQ family |
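
For intuition about the "128 along K" row, here is a minimal NumPy sketch of block-scaled FP8 quantization. It is an emulation under stated assumptions and not breccia's implementation: `round_to_e4m3`, `quantize_block_k`, and `dequantize_block_k` are hypothetical helpers, E4M3 rounding is emulated in float64 (NumPy has no native FP8 dtype), and scales are taken per row and per K-block as amax / 448.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(x):
    """Emulate E4M3 rounding: 3 mantissa bits, saturating at +/-448."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(x)        # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)     # below 2**-6 the format is subnormal (fixed spacing)
    step = np.exp2(e - 4.0)   # spacing of representable values in that binade
    return np.round(x / step) * step


def quantize_block_k(x, block_k=128):
    """Return (payload, scale) with one scale per row and K-block."""
    m, k = x.shape
    blocks = x.reshape(m, k // block_k, block_k)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / E4M3_MAX, 1.0)  # avoid divide-by-zero
    q = round_to_e4m3(blocks / scale)
    return q.reshape(m, k), scale[..., 0]


def dequantize_block_k(q, scale, block_k=128):
    return q * np.repeat(scale, block_k, axis=1)


x = np.random.default_rng(0).standard_normal((8, 256))
q, scale = quantize_block_k(x, block_k=128)
x_hat = dequantize_block_k(q, scale, block_k=128)
print("scale shape:", scale.shape)
print("max abs payload:", float(np.abs(q).max()))
```

Each block is stretched so that its largest magnitude lands on 448, which is why an outlier in one block does not cost precision in the others.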

Five bridges for zero-copy interop:

| Bridge | Direction | Dep |
| --- | --- | --- |
| from_transformer_engine / to_transformer_engine | TE Float8Tensor ↔ ScaledTensor | transformer-engine |
| from_torchao / to_torchao | AffineQuantizedTensor ↔ ScaledTensor | torchao |
| save_safetensors / load_safetensors | safetensors file ↔ dict of ScaledTensor | safetensors |
| to_dlpack / from_dlpack | zero-copy across NumPy / PyTorch / MLX / JAX | built-in |
| from_deepseek_v3 / to_deepseek_v3 | DeepSeek-v3 buffers ↔ ScaledTensor | none |
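
As a sketch of what "zero-copy" means for the DLPack row, the snippet below uses only NumPy's own `np.from_dlpack` (NumPy 1.22 or later). It does not call breccia's `to_dlpack` / `from_dlpack`, whose signatures are not shown in this README; the assumption is simply that they follow the same protocol.

```python
import numpy as np

# A raw low-precision payload as bytes, standing in for FP8 data.
payload = np.arange(8, dtype=np.uint8)

# Import it through the DLPack protocol. No bytes are copied.
view = np.from_dlpack(payload)
shares = np.shares_memory(payload, view)

# A write through the original buffer is visible through the imported view.
payload[0] = 255
print("shares memory:", shares)
print("first element seen through the view:", int(view[0]))
```

The same exchange works between frameworks that implement `__dlpack__`, which is what lets the payload and the scale move together without a copy.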

Memory footprint relative to FP32 for a (1024, 1024) weight, computed at v0.0.1:

| Format | Bytes | vs FP32 |
| --- | --- | --- |
| FP32 | 4.19 MB | 1.00× |
| FP16 | 2.10 MB | 0.50× |
| FP8 (Float8CurrentScaling) | 1.05 MB | 0.25× |
| FP8 (Float8BlockScaling, b=128) | 1.08 MB | 0.26× |
| MXFP8 (block 32, E8M0 scale) | 1.08 MB | 0.26× |
| NVFP4 (block 16, E4M3 scale) | 1.11 MB | 0.27× |
| INT4 (group 128, fp16 scale) | 1.06 MB | 0.25× |

Reproduce: python benchmarks/bench_memory.py
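
The figures can be sanity-checked with plain arithmetic. The sketch below is not `bench_memory.py`; the scale widths are inferred from the table, not confirmed by the source. In particular, the NVFP4 and INT4 rows only match if each 4-bit value occupies a full byte and the block-128 FP8 scale is 4 bytes wide. Packing two 4-bit values per byte would give roughly 0.59 MB and 0.54 MB instead.

```python
def footprint_bytes(n_elems, bits_per_elem, block, bytes_per_scale):
    """Payload bytes plus scale bytes; block=None means one per-tensor scale."""
    n_scales = 1 if block is None else n_elems // block
    return n_elems * bits_per_elem // 8 + n_scales * bytes_per_scale


N = 1024 * 1024
fp32_bytes = N * 4

rows = {
    "FP8 per-tensor (fp32 scale)": footprint_bytes(N, 8, None, 4),
    "FP8 block 128 (fp32 scale)": footprint_bytes(N, 8, 128, 4),
    "MXFP8 block 32 (1-byte scale)": footprint_bytes(N, 8, 32, 1),
    "NVFP4 block 16, one byte per value": footprint_bytes(N, 8, 16, 1),
    "NVFP4 block 16, packed": footprint_bytes(N, 4, 16, 1),
    "INT4 group 128, one byte per value": footprint_bytes(N, 8, 128, 2),
    "INT4 group 128, packed": footprint_bytes(N, 4, 128, 2),
}

for name, nbytes in rows.items():
    print(f"{name:36s} {nbytes / 1e6:5.2f} MB  {nbytes / fp32_bytes:.2f}x")
```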

Accuracy (cosine similarity vs FP32 on Gaussian inputs, mean over 8 seeds):

| Recipe | Cos sim |
| --- | --- |
| Float8CurrentScaling (E4M3) | 0.9997 |
| Float8BlockScaling(block_k=128) | 0.9997 |
| MXFP8BlockScaling | 0.9974 |
| NVFP4BlockScaling | 0.9955 |
| INT4Scaling(group_size=128) | 0.9932 |

Reproduce: python benchmarks/bench_accuracy.py
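
For readers who want to see how such a number comes about, here is a minimal sketch of the metric on a symmetric INT4 group quantizer. It is not `bench_accuracy.py`: the tensor shape, the symmetric amax / 7 scale, and the helper names `cosine_similarity` and `int4_group_roundtrip` are assumptions, so the result should land in the same ballpark as the INT4 row and not reproduce it exactly.

```python
import numpy as np


def cosine_similarity(a, b):
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def int4_group_roundtrip(x, group_size=128):
    """Quantize to signed INT4 with one scale per group, then dequantize."""
    m, k = x.shape
    groups = x.reshape(m, k // group_size, group_size)
    amax = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / 7.0, 1.0)      # map amax to the code 7
    q = np.clip(np.round(groups / scale), -8, 7)     # signed 4-bit range
    return (q * scale).reshape(m, k)


sims = []
for seed in range(8):
    x = np.random.default_rng(seed).standard_normal((64, 1024)).astype(np.float32)
    sims.append(cosine_similarity(x, int4_group_roundtrip(x)))

mean_sim = float(np.mean(sims))
print(f"mean cosine similarity over 8 seeds: {mean_sim:.4f}")
```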

Status

v0.1.2, beta — production-ready API.

| Component | Status |
| --- | --- |
| ScaledTensor type + invariants | |
| 6 ScalingRecipes (incl. asymmetric INT4 with zero-point) | |
| 4 Layouts | |
| cast / dequantize / matmul / requantize | |
| Bridges: TE / torchao / HF / DLPack / DeepSeek-v3 | |
| NumPy + PyTorch + MLX + JAX backends | |
| Native PyTorch FP8 acceleration (torch.float8_e4m3fn end-to-end) | |
| Straight-through estimator (cast_ste, cast_ste_clipped) | |
| Triton FP8 scaled matmul (per-tensor) | ✅ H100 validated; 0.8 ms warm, 6× faster than torch._scaled_mm |
| Triton block-scaled FP8 matmul (DeepSeek pattern) | ✅ H100 validated (cos sim 0.9813 vs FP32) |
| Triton AOT path (autotune=False default, fast first call) | |
| TransformerEngine bridge (forward direction, bit-exact) | ✅ H100 validated (max abs diff = 0) |
| TransformerEngine bridge (reverse direction across TE 2.x churn) | 🟡 forward is bit-exact; reverse needs per-version constructor pin |

250+ tests passing. CI on Python 3.10 / 3.11 / 3.12 (Ubuntu) + 3.11 (macOS).

Install

pip install breccia                     # NumPy backend
pip install "breccia[torch]"            # + PyTorch
pip install "breccia[mlx]"              # + MLX (Apple Silicon)
pip install "breccia[bridges]"          # + safetensors for HF bridge
pip install "breccia[torch,mlx,bridges,dev]"  # full dev setup

The name

A breccia is a sedimentary rock made of broken angular fragments held together by a cementing matrix. Low-precision data fragments + the scale matrix that gives them meaning — same structure.

It's the natural geological successor to scree: loose fragments (scree) become breccia when cemented together.

Contributing

PRs welcome. See CONTRIBUTING.md for the workflow. Open a GitHub Discussion for anything beyond a small fix.

License

Apache-2.0
