Optimal block-scaled FP4 quants (NVFP4, MXFP4)

Qwantize

Optimal quantization methods for block-scaled formats.

Formats

  • NVFP4 — FP4 E2M1 with FP8 E4M3 scales (block sizes 16, 32)
  • MXFP4 — FP4 E2M1 with UE8M0 (power-of-2) scales (block sizes 16, 32)
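Both formats store each element as a 4-bit E2M1 code and share one scale per block, so a dequantized value is scale * code. The sketch below illustrates that layout in plain Python. It is not qwantize's implementation: the helper names are hypothetical, and ties are resolved toward the smaller magnitude for simplicity, which may differ from hardware rounding.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_e2m1(v):
    """Round v to the nearest signed E2M1 value, clamping at +/-6.

    Simplification: ties go to the smaller magnitude.
    """
    mag = min(E2M1_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)

def quantize_block(block, scale):
    """Hypothetical helper: divide by the shared scale, then round each element."""
    return [round_to_e2m1(x / scale) for x in block]

def dequantize_block(codes, scale):
    """Hypothetical helper: multiply each E2M1 value by the shared scale."""
    return [scale * q for q in codes]

codes = quantize_block([0.9, -2.6, 12.0], scale=2.0)
restored = dequantize_block(codes, scale=2.0)  # [1.0, -3.0, 12.0]
```

The only difference between the two formats in this picture is which values the scale may take: an FP8 E4M3 number for NVFP4, a power of two for MXFP4.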

Methods

Each format supports multiple scale selection strategies:

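Method            Description
Naive             Standard heuristic: s = snap(amax / Q_MAX)
SSE-Optimal       Bounded search minimizing the sum of squared quantization errors
Hessian-Optimal   Bounded search minimizing the Hessian-weighted error r^T H r, using activations

Here amax is the largest absolute value in the block, Q_MAX is the largest FP4 E2M1 magnitude (6), snap maps the result to a value representable in the scale format, and r is the quantization residual.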
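To make the difference between the naive and SSE-optimal strategies concrete, here is a minimal plain-Python sketch for power-of-two (MXFP4-style) scales. It rests on assumptions that are not taken from qwantize: snap rounds the exponent up so that no element clips, and the bounded search tries a few exponents on either side of the naive one (the radius parameter is hypothetical). The library's actual search bounds and snapping rule may differ.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
Q_MAX = 6.0  # largest E2M1 magnitude

def round_to_e2m1(v):
    """Nearest signed E2M1 value, clamping at +/-Q_MAX."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(v) - g))
    return math.copysign(mag, v)

def sse(block, scale):
    """Sum of squared errors after quantizing and dequantizing with `scale`."""
    return sum((x - scale * round_to_e2m1(x / scale)) ** 2 for x in block)

def naive_exponent(block):
    """Assumed snap: smallest power-of-two exponent that avoids clipping.

    Requires a block with at least one nonzero element.
    """
    amax = max(abs(x) for x in block)
    return math.ceil(math.log2(amax / Q_MAX))

def naive_scale(block):
    return 2.0 ** naive_exponent(block)

def sse_optimal_scale(block, radius=2):
    """Try exponents within `radius` of the naive one and keep the lowest SSE."""
    e0 = naive_exponent(block)
    candidates = [2.0 ** e for e in range(e0 - radius, e0 + radius + 1)]
    return min(candidates, key=lambda s: sse(block, s))

block = [0.1, -0.2, 0.3, 6.5]
s_naive = naive_scale(block)      # 2.0
s_opt = sse_optimal_scale(block)  # 1.0
```

In this example the smaller scale clips the outlier 6.5 to 6 but resolves 0.3 more finely, so its total squared error is lower. Because the naive scale is always among the candidates, the searched scale can never do worse on SSE.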

All methods have both pure-PyTorch (reference) and Triton (GPU-accelerated) implementations.

Install

pip install qwantize

Requires PyTorch (>=2.0) and Triton (>=3.0).

Usage

from qwantize import nvfp4_naive, nvfp4_optimal, nvfp4_dequantize, compute_metrics

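# W is an (M, K) weight matrix; split K into blocks so the last dim is the
# block size (16 or 32). This example uses 32, so K must be divisible by 32.
M, K = W.shape
W_blocked = W.reshape(M, K // 32, 32)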

# Quantize: returns (scales, quants)
scales, quants = nvfp4_optimal(W_blocked, dim=-1)

# Dequantize
W_dq = nvfp4_dequantize(scales, quants, dim=-1)

# Or get dequantized output directly
scales, quants, W_dq = nvfp4_optimal(W_blocked, dim=-1, return_dequant=True)

# Compute metrics, where X holds activations with K features per row:
# weight error ||Q(W)-W||/||W|| and output error ||XW_q^T - XW^T||/||XW^T||
metrics = compute_metrics(W, W_dq.reshape(M, K), X)
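The two ratios in the comment above can be written out directly. The following is a plain-Python sketch of those formulas, assuming the norms are Frobenius norms and that X has shape (N, K). It is not the body of compute_metrics, and the function names are hypothetical.

```python
import math

def frob(A):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in A for v in row))

def sub(A, B):
    """Elementwise A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul_t(X, W):
    """X @ W^T for X of shape (N, K) and W of shape (M, K)."""
    return [[sum(x * w for x, w in zip(xr, wr)) for wr in W] for xr in X]

def weight_error(W, W_q):
    """||W_q - W|| / ||W||"""
    return frob(sub(W_q, W)) / frob(W)

def output_error(W, W_q, X):
    """||X W_q^T - X W^T|| / ||X W^T||"""
    Y, Y_q = matmul_t(X, W), matmul_t(X, W_q)
    return frob(sub(Y_q, Y)) / frob(Y)

W = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 0.5]]
X = [[2.0, 0.0]]  # activations that never excite the perturbed weight
w_err = weight_error(W, W_q)
o_err = output_error(W, W_q, X)
```

Here the weight error is nonzero while the output error is zero, because the only perturbed weight multiplies an activation that is always zero. This is why the two metrics can rank methods differently.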

Triton-accelerated versions:

from qwantize import nvfp4_optimal_triton, nvfp4_optimal_hessian_triton

scales, quants, W_dq = nvfp4_optimal_triton(W_blocked, dim=-1, return_dequant=True)

# Hessian-aware (requires activations X)
scales, quants, W_dq = nvfp4_optimal_hessian_triton(W_blocked, dim=-1, return_dequant=True, X=X)
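One way to read the Hessian-weighted error r^T H r is as the squared output error caused by a weight residual r. The sketch below checks this identity in plain Python under the assumption that H = X^T X, which is the usual choice for layer-wise quantization but is not confirmed here as what qwantize computes. All function names are hypothetical.

```python
def gram(X):
    """H = X^T X for X of shape (N, K); the result has shape (K, K)."""
    K = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(K)]
            for i in range(K)]

def quad_form(r, H):
    """r^T H r for a residual vector r of length K."""
    n = len(r)
    return sum(r[i] * H[i][j] * r[j] for i in range(n) for j in range(n))

def output_sq_error(r, X):
    """||X r||^2: squared change in the outputs caused by residual r."""
    return sum(sum(x * ri for x, ri in zip(row, r)) ** 2 for row in X)

X = [[1.0, 2.0], [3.0, 4.0]]
r = [0.5, -1.0]
lhs = quad_form(r, gram(X))  # 8.5
rhs = output_sq_error(r, X)  # 8.5
```

Under this assumption, minimizing r^T H r per weight row targets the output error directly, whereas the SSE objective weights every element of r equally.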

Benchmarks

Benchmarked on the down_proj weight of the first decoder layer from Qwen3-4B, with activations from WikiText-2 (max_seq_len=512, num_samples=2048).

python bench/full_bench.py

NVFP4 (block size 16)

Method        Weight Error   Output Error   Triton Speedup
Naive         10.05%         6.89%          1.7x
SSE-Optimal   8.74%          6.04%          7.0x
H-Optimal     9.35%          5.31%          1.8x

MXFP4 (block size 16)

Method        Weight Error   Output Error   Triton Speedup
Naive         11.77%         8.48%          1.7x
SSE-Optimal   11.02%         7.67%          33x
H-Optimal     11.10%         7.62%

Documentation

Full documentation: qwantize.readthedocs.io

Build locally:

pip install -r docs/requirements.txt
cd docs && make html
