PolarQuant: Hadamard-rotated Lloyd-Max quantization for LLM compression. Weights + KV cache + CLI.

PolarEngine for vLLM

Custom quantization plugin for vLLM using PolarQuant -- optimal Gaussian quantization via Walsh-Hadamard rotation + Lloyd-Max centroids.

arXiv preprint: arXiv:2603.7424577

Recommended path: For best quality-per-VRAM, use PolarQuant Q5 + torchao INT4 (43.1 tok/s, 6.5 GB VRAM, PPL 6.56). PolarEngine's custom Triton kernel is available for environments where torchao is not an option.

Results (Qwen3.5-9B, RTX PRO 6000 Blackwell)

Method	tok/s	VRAM	PPL (WikiText-2)	Notes
FP16 baseline	45.7	17.9 GB	6.37	Reference
PolarQuant Q5 + torchao INT4	43.1	6.5 GB	6.56	Recommended
torchao INT4 (absmax)	43.3	6.3 GB	6.68
BnB NF4	34.6	7.7 GB	~6.7
PolarEngine v4 (Triton)	34.2	7.9 GB	6.89	Custom kernel
PolarQuant Q5 dequant FP16	45.9	18.1 GB	6.39	Near-lossless
PolarQuant MLX Q4	19.7	4.8 GB	6.90	Mac mini M4 16 GB

PolarQuant Ablation (Q5, Qwen3.5-9B)

Configuration	PPL	Delta vs FP16
Absmax Q5 (baseline)	6.9030	+0.53
+ Hadamard rotation	6.4010	+0.03
+ Lloyd-Max centroids	6.9139	+0.54
+ Both (PolarQuant Q5)	6.3909	+0.02

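Hadamard rotation alone accounts for 98% of the perplexity improvement over absmax Q5. Lloyd-Max centroids on their own do not help (6.9139 vs 6.9030) and add only a small further gain once rotation is applied. The Walsh-Hadamard transform makes weight distributions approximately Gaussian, enabling near-optimal uniform quantization.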
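The 98% figure can be reproduced from the ablation table. The short calculation below uses only the perplexities listed above; the variable names are illustrative.

```python
# Perplexities copied from the ablation table (Q5, Qwen3.5-9B).
ppl = {
    "absmax": 6.9030,
    "hadamard": 6.4010,
    "lloyd_max": 6.9139,
    "both": 6.3909,
}

total_gain = ppl["absmax"] - ppl["both"]         # full PolarQuant Q5 improvement
hadamard_gain = ppl["absmax"] - ppl["hadamard"]  # rotation alone
lloyd_gain = ppl["absmax"] - ppl["lloyd_max"]    # centroids alone (negative: slightly worse)

hadamard_share = hadamard_gain / total_gain
print(f"total gain:      {total_gain:.4f} PPL")
print(f"Hadamard alone:  {hadamard_gain:.4f} PPL ({hadamard_share:.0%} of total)")
print(f"Lloyd-Max alone: {lloyd_gain:+.4f} PPL")
```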

How It Works

PolarQuant quantization:

Normalize weight blocks by L2 norm
Rotate via Walsh-Hadamard Transform (makes weights approximately Gaussian -- 98% of quality gain)
Quantize using Lloyd-Max optimal centroids for N(0,1)
Store codes (int8/nibble-packed) + per-block norms (fp16)
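The four steps above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the function names, the block size of 128, and the rescaling by sqrt(n) (so that rotated entries have roughly unit variance before the N(0,1) centroid lookup) are assumptions made for the example.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return [x / math.sqrt(n) for x in v]

def lloyd_max_gaussian(levels, iters=500):
    """Lloyd-Max centroids for N(0,1) via alternating boundary/centroid updates."""
    pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    cdf = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    c = [-2 + 4 * (i + 0.5) / levels for i in range(levels)]  # initial guess
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        edges = [-math.inf] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [math.inf]
        # Each centroid becomes the mean of N(0,1) restricted to its cell.
        c = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)) for lo, hi in zip(edges, edges[1:])]
    return c

def quantize_block(block, centroids):
    n = len(block)
    norm = math.sqrt(sum(x * x for x in block)) or 1.0  # step 1: L2 norm
    rotated = fwht([x / norm for x in block])           # step 2: rotate
    scaled = [x * math.sqrt(n) for x in rotated]        # assumed rescale to ~N(0,1)
    codes = [min(range(len(centroids)), key=lambda k: abs(centroids[k] - s))
             for s in scaled]                           # step 3: nearest centroid
    return codes, norm                                  # step 4: codes + per-block norm

def dequantize_block(codes, norm, centroids):
    n = len(codes)
    rotated = [centroids[k] / math.sqrt(n) for k in codes]
    return [norm * x for x in fwht(rotated)]  # the orthonormal FWHT is its own inverse

rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(128)]
centroids = lloyd_max_gaussian(2 ** 5)  # Q5 -> 32 levels
codes, norm = quantize_block(block, centroids)
restored = dequantize_block(codes, norm, centroids)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored))
                    / sum(a * a for a in block))
print(f"relative L2 error at Q5: {rel_err:.3f}")
```

A real implementation would store the norm in fp16 and pack the codes; here they stay as Python values to keep the steps visible.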

Inference keeps weights quantized in GPU VRAM:

Triton kernel does centroid lookup + GEMV in one operation
FWHT applied to input (not weights) -- 25x faster via matmul
FWHT cached across Q/K/V projections (69x total speedup)
INT4 nibble packing for Q3/Q4 layers (36% VRAM savings)
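Applying the transform to the input works because the orthonormal Walsh-Hadamard matrix H is symmetric, so (H q) . x = q . (H x): the activation is rotated once and the result can be shared by every output row, and by the Q/K/V projections that read the same input. The sketch below checks this identity on a toy row and shows one possible nibble layout (low nibble first). The centroid table, the layout, and all names are placeholders for illustration, not the values or format used by the kernel.

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return [x / math.sqrt(n) for x in v]

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first."""
    assert len(codes) % 2 == 0 and all(0 <= c < 16 for c in codes)
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))

def unpack_nibbles(packed):
    out = []
    for byte in packed:
        out += [byte & 0x0F, byte >> 4]
    return out

rng = random.Random(0)
n = 8
centroids = [(k - 7.5) / 4 for k in range(16)]  # placeholder table, NOT Lloyd-Max values
codes = [rng.randrange(16) for _ in range(n)]
norm = 0.37
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

packed = pack_nibbles(codes)                      # n/2 bytes instead of n
q = [centroids[k] for k in unpack_nibbles(packed)]

# Weight side: dequantize the row (rotate back), then take the dot product.
w = [norm * v for v in fwht(q)]
y_weight_side = sum(a * b for a, b in zip(w, x))

# Input side: rotate x once, then dot with the centroid lookups directly.
hx = fwht(x)  # reusable for every row, and for Q/K/V alike
y_input_side = norm * sum(a * b for a, b in zip(q, hx))

print(abs(y_weight_side - y_input_side) < 1e-9)
```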

Installation

pip install polarengine-vllm

Or from source:

git clone https://github.com/caiovicentino/polarengine-vllm
cd polarengine-vllm
pip install -e .

Optional CUDA kernels (for CUDA graph support):

pip install -e ".[cuda]"

Quick Start

Option A: PolarQuant Q5 + torchao (Recommended)

from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, Int4WeightOnlyConfig
import torch

# Load PolarQuant Q5 model (auto-dequantizes to FP16)
model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Qwen3.5-9B-PolarQuant-Q5",
    dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-PolarQuant-Q5")

# Apply torchao INT4 for fast inference (43 tok/s, 6.5 GB VRAM)
quantize_(model, Int4WeightOnlyConfig(group_size=128))

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Option B: PolarEngine Triton Kernel

1. Quantize a model

python -m polarengine_vllm.quantize \
    --model Qwen/Qwen3.5-9B \
    --output ./Qwen3.5-9B-PolarEngine/

2. Serve with vLLM

vllm serve ./Qwen3.5-9B-PolarEngine/ --quantization polarengine
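
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the standard library. It assumes the default address (localhost, port 8000) and that the served model name defaults to the path passed to `vllm serve`; adjust both if your setup differs. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "./Qwen3.5-9B-PolarEngine/"    # assumption: served name equals the path given to `vllm serve`

def build_completion_request(prompt, max_tokens=100):
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Example, with the server from step 2 running:
# print(complete("Explain quantum computing:"))
```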

3. Use from Python

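from vllm import LLM

# Load the quantized checkpoint through the PolarEngine quantization plugin
llm = LLM("./Qwen3.5-9B-PolarEngine/", quantization="polarengine")
outputs = llm.generate("Explain quantum computing:")
print(outputs[0].outputs[0].text)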

Mixed-Bit Assignment

Layer Type	Bits	Rationale
gate/up proj (MLP)	Q3	Tolerant to quantization
down proj (MLP)	Q4	Moderate sensitivity
Q/K/V proj (Attention)	Q5	Higher precision for attention
O proj (Attention)	Q6	Output projection needs quality
Embeddings	Q5	Large, benefits from compression
LM Head	Q6	Critical for token prediction
Norms, biases, router	FP16	Too small to quantize
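
The table can be read as a lookup from parameter name to bit width. The sketch below is one way to express it, assuming Hugging Face style module names (`gate_proj`, `q_proj`, `embed_tokens`, and so on); the rule list and the `bits_for` helper are hypothetical and not part of the package's API.

```python
# (substring, bits) rules mirroring the table above, checked in order.
BIT_RULES = [
    ("gate_proj", 3), ("up_proj", 3),           # MLP gate/up: Q3
    ("down_proj", 4),                           # MLP down: Q4
    ("q_proj", 5), ("k_proj", 5), ("v_proj", 5),  # attention Q/K/V: Q5
    ("o_proj", 6),                              # attention output: Q6
    ("embed_tokens", 5),                        # embeddings: Q5
    ("lm_head", 6),                             # LM head: Q6
]

def bits_for(param_name):
    """Return the bit width for a parameter, or None to keep it in FP16."""
    # Norms, biases, and router weights stay in FP16.
    if param_name.endswith(".bias") or "norm" in param_name or "router" in param_name:
        return None
    for pattern, bits in BIT_RULES:
        if pattern in param_name:
            return bits
    return None  # anything unrecognised is left unquantized

for name in ("model.layers.0.mlp.gate_proj.weight",
             "model.layers.0.self_attn.o_proj.weight",
             "model.layers.0.input_layernorm.weight"):
    print(name, "->", bits_for(name))
```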

Architecture

Input x -> Pad -> FWHT(x) via matmul -> Triton GEMV Kernel -> Output
                  ^                        ^
          H128 (cached, 64KB)    codes + norms + centroids
                                 (quantized, in VRAM)
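
The data flow in the diagram can be mimicked in plain Python. The sketch below is a slow reference stand-in for the Triton kernel, not the kernel itself. It assumes a block size of 128, fp32 storage for H128 (128 x 128 x 4 bytes = 64 KB, matching the figure in the diagram), and a `codes[row][block]` / `norms[row][block]` layout; the two-entry centroid table is a placeholder.

```python
import math

BLOCK = 128

def hadamard(n):
    """Dense orthonormal Walsh-Hadamard matrix (Sylvester ordering), n a power of two."""
    s = 1 / math.sqrt(n)
    return [[s if bin(i & j).count("1") % 2 == 0 else -s for j in range(n)]
            for i in range(n)]

H128 = hadamard(BLOCK)          # built once, then reused for every input
H128_BYTES = BLOCK * BLOCK * 4  # 65536 bytes = 64 KB, assuming fp32

def pad(x, block=BLOCK):
    """Zero-pad x to a multiple of the block size."""
    return list(x) + [0.0] * (-len(x) % block)

def rotate_input(x):
    """FWHT(x) via matmul: multiply each 128-wide chunk of x by H128."""
    x = pad(x)
    out = []
    for start in range(0, len(x), BLOCK):
        chunk = x[start:start + BLOCK]
        out.extend(sum(h * v for h, v in zip(row, chunk)) for row in H128)
    return out

def gemv(codes, norms, centroids, hx):
    """Centroid lookup + dot product per row, on the already-rotated input."""
    y = []
    for row_codes, row_norms in zip(codes, norms):
        acc = 0.0
        for b, (blk, norm) in enumerate(zip(row_codes, row_norms)):
            seg = hx[b * BLOCK:(b + 1) * BLOCK]
            acc += norm * sum(centroids[k] * v for k, v in zip(blk, seg))
        y.append(acc)
    return y

x = [float(i + 1) for i in range(200)]  # 200 inputs -> padded to 256
hx = rotate_input(x)
centroids = [-1.0, 1.0]                 # placeholder table for the demo
codes = [[[1] * BLOCK, [0] * BLOCK]]    # one output row, two blocks
norms = [[0.5, 0.25]]
y = gemv(codes, norms, centroids, hx)
print(len(hx), H128_BYTES // 1024, y)
```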

Published Models

Model	Link	Notes
Qwen3.5-9B PolarQuant Q5	HuggingFace	Recommended, 9.1 GB
Qwen3.5-9B PolarQuant MLX 4-bit	HuggingFace	Apple Silicon
Qwen3.5-9B PolarEngine v4	HuggingFace	Triton kernel

See the main EOQ repository for additional models and full documentation.

Citation

@article{vicentino2026polarquant,
    title={PolarQuant: Near-Lossless LLM Quantization via Walsh-Hadamard Rotation
           and Entropy-Optimal Coding},
    author={Vicentino, Caio},
    journal={arXiv preprint arXiv:2603.7424577},
    year={2026}
}

License

Apache 2.0

Release history

Version	Released
0.6.0	Apr 6, 2026
0.5.0	Apr 3, 2026 (this version)

Download files

Source Distribution

polarquant-0.5.0.tar.gz (109.5 kB)

Uploaded Apr 3, 2026 Source

Built Distribution

polarquant-0.5.0-py3-none-any.whl (116.8 kB)

Uploaded Apr 3, 2026 Python 3

File details

Details for the file polarquant-0.5.0.tar.gz.

File metadata

Download URL: polarquant-0.5.0.tar.gz
Upload date: Apr 3, 2026
Size: 109.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for polarquant-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`3e711da20aff4245f764bc14b3d53c18d499b1880166e1f03d37b1a698079b11`
MD5	`db980328eef25b8167e5c85d7fe5ab22`
BLAKE2b-256	`7be259851d56bddc0a83846ef7d5fbdc00a39ae486af849512b9d8176787b339`

File details

Details for the file polarquant-0.5.0-py3-none-any.whl.

File metadata

Download URL: polarquant-0.5.0-py3-none-any.whl
Upload date: Apr 3, 2026
Size: 116.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for polarquant-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`115e6b2f79bdc62be0fbe0439815695efe114d5b88902f41ce4c79a99d213bdc`
MD5	`97a5956e78a387a5511ceb4b348b1c9f`
BLAKE2b-256	`de4e7d1e72e64d12e45ca10bf7ad19ef263686a64d4a7eb0c2f55f2dec399cec`
