# Qwantize

Optimal quantization methods for block-scaled FP4 formats (NVFP4, MXFP4).
## Formats

- **NVFP4** — FP4 E2M1 with FP8 E4M3 scales (block sizes 16, 32)
- **MXFP4** — FP4 E2M1 with UE8M0 (power-of-2) scales (block sizes 16, 32)
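For context, FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit) can represent only eight magnitudes. A quick sketch enumerating the positive grid (not part of the package API; the exponent bias of 1 is the standard E2M1 convention):

```python
def e2m1_values():
    """Enumerate the non-negative FP4 E2M1 grid: 2 exponent bits, 1 mantissa bit."""
    vals = []
    for e in range(4):          # 2 exponent bits
        for m in range(2):      # 1 mantissa bit
            if e == 0:          # exponent 0 encodes subnormals: 0.m * 2^0
                vals.append(m * 0.5)
            else:               # normals: 1.m * 2^(e - 1), bias = 1
                vals.append((1 + m * 0.5) * 2 ** (e - 1))
    return vals

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The largest representable magnitude, 6.0, is the `Q_MAX` that the scale-selection heuristics below divide by.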
## Methods

Each format supports multiple scale-selection strategies:

| Method | Description |
|---|---|
| Naive | Standard heuristic: `s = snap(amax / Q_MAX)` |
| SSE-Optimal | Bounded search minimizing the sum of squared quantization errors |
| Hessian-Optimal | Bounded search minimizing the Hessian-weighted error `r^T H r`, using activations |

All methods have both pure-PyTorch (reference) and Triton (GPU-accelerated) implementations.
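As a toy illustration of the difference between the naive and SSE-optimal strategies (a standalone sketch, independent of the package's actual implementation; the candidate grid and search range are assumptions):

```python
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
Q_MAX = 6.0

def quantize_block(block, s):
    """Round each value to the nearest signed E2M1 grid point at scale s."""
    out = []
    for w in block:
        q = min(E2M1_GRID, key=lambda g: abs(abs(w) / s - g))
        out.append((q if w >= 0 else -q) * s)
    return out

def sse(block, s):
    return sum((w - wq) ** 2 for w, wq in zip(block, quantize_block(block, s)))

block = [0.11, -0.42, 0.95, 0.30, -0.07, 0.58, -0.81, 0.24]

# Naive: pick the scale so that amax maps exactly to Q_MAX
s_naive = max(abs(w) for w in block) / Q_MAX

# SSE-optimal: bounded search over scale candidates around the naive scale
candidates = [s_naive * (0.75 + i / 200) for i in range(101)]  # 0.75x .. 1.25x
s_opt = min(candidates, key=lambda s: sse(block, s))

# The search includes the naive scale, so it can only do as well or better
assert sse(block, s_opt) <= sse(block, s_naive)
```

The real implementations additionally snap candidate scales to the format's scale grid (E4M3 for NVFP4, powers of two for MXFP4) and run the search per block on the GPU.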
## Install

```shell
pip install qwantize
```

Requires PyTorch (>= 2.0) and Triton (>= 3.0).
## Usage

```python
from qwantize import nvfp4_naive, nvfp4_optimal, nvfp4_dequantize, compute_metrics

# Reshape W to (..., block_size), where block_size is 16 or 32
W_blocked = W.reshape(M, K // 32, 32)

# Quantize: returns (scales, quants)
scales, quants = nvfp4_optimal(W_blocked, dim=-1)

# Dequantize
W_dq = nvfp4_dequantize(scales, quants, dim=-1)

# Or get the dequantized output directly
scales, quants, W_dq = nvfp4_optimal(W_blocked, dim=-1, return_dequant=True)

# Compute metrics: ||Q(W) - W|| / ||W|| and ||X W_q^T - X W^T|| / ||X W^T||
metrics = compute_metrics(W, W_dq.reshape(M, K), X)
```
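The two relative-error metrics in the comment above can be written out directly. A NumPy sketch for reference (not the package's `compute_metrics`; the perturbed `W_dq` stands in for a dequantized weight):

```python
import numpy as np

def relative_errors(W, W_dq, X):
    """Weight error ||Q(W) - W|| / ||W|| and output error ||X W_q^T - X W^T|| / ||X W^T||."""
    weight_err = np.linalg.norm(W_dq - W) / np.linalg.norm(W)
    Y, Y_q = X @ W.T, X @ W_dq.T
    output_err = np.linalg.norm(Y_q - Y) / np.linalg.norm(Y)
    return weight_err, output_err

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))
X = rng.standard_normal((4, 32))
W_dq = W + 0.01 * rng.standard_normal(W.shape)  # stand-in for a dequantized weight
w_err, o_err = relative_errors(W, W_dq, X)
```

The output error weights each row of `W` by how strongly the activations actually excite it, which is why the Hessian-aware method can win on output error while losing on plain weight error (as in the benchmarks below).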
Triton-accelerated versions:

```python
from qwantize import nvfp4_optimal_triton, nvfp4_optimal_hessian_triton

scales, quants, W_dq = nvfp4_optimal_triton(W_blocked, dim=-1, return_dequant=True)

# Hessian-aware (requires activations X)
scales, quants, W_dq = nvfp4_optimal_hessian_triton(W_blocked, dim=-1, return_dequant=True, X=X)
```
## Benchmarks

Benchmarked on the `down_proj` weight of the first decoder layer of Qwen3-4B, with activations from WikiText-2 (`max_seq_len=512`, `num_samples=2048`):

```shell
python bench/full_bench.py
```
### NVFP4 (block size 16)
| Method | Weight Error | Output Error | Triton Speedup |
|---|---|---|---|
| Naive | 10.05% | 6.89% | 1.7x |
| SSE-Optimal | 8.74% | 6.04% | 7.0x |
| H-Optimal | 9.35% | 5.31% | 1.8x |
### MXFP4 (block size 16)
| Method | Weight Error | Output Error | Triton Speedup |
|---|---|---|---|
| Naive | 11.77% | 8.48% | 1.7x |
| SSE-Optimal | 11.02% | 7.67% | 33x |
| H-Optimal | 11.10% | 7.62% | — |
## Documentation

Full documentation: [qwantize.readthedocs.io](https://qwantize.readthedocs.io)

To build the docs locally:

```shell
pip install -r docs/requirements.txt
cd docs && make html
```
## Contact

- Author: Ayoub Ghriss, research@ayghri.me
## File details

### qwantize-0.1.1.tar.gz (source distribution)

- Size: 15.1 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.13.11 Linux/6.12.74-gentoo-x86_64

| Algorithm | Hash digest |
|---|---|
| SHA256 | `ef1de56020bc042e9c1a4fd25df046870584f3d5ab20e3e71fdfdc7ff035f712` |
| MD5 | `729ec6673a7873478188c71f38716ea8` |
| BLAKE2b-256 | `b6cbad91163ed7bbfe313c994779b246a2ae7dbfc66ccd63c7016bdfb7fadf6c` |
### qwantize-0.1.1-py3-none-any.whl (built distribution, Python 3)

- Size: 21.7 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.13.11 Linux/6.12.74-gentoo-x86_64

| Algorithm | Hash digest |
|---|---|
| SHA256 | `1e9676a5a1322b1fbdc70990598a65d242ce27050c16324cf82c2389adac27c6` |
| MD5 | `36c848e556f40aa2a4f7c7100ad430b0` |
| BLAKE2b-256 | `7bf9a9af2ca40f3260250ff0c16cc7015d5ef770a6ec0ba14e6d2fd04225fd19` |