
PolarQuant: Hadamard-rotated Lloyd-Max quantization for LLM compression. Native HF transformers integration + weights + KV cache + CLI.


PolarEngine for vLLM

Custom quantization plugin for vLLM using PolarQuant -- optimal Gaussian quantization via Walsh-Hadamard rotation + Lloyd-Max centroids.

arXiv preprint: arXiv:2603.7424577

Recommended path: For best quality-per-VRAM, use PolarQuant Q5 + torchao INT4 (43.1 tok/s, 6.5 GB VRAM, PPL 6.56). PolarEngine's custom Triton kernel is available for environments where torchao is not an option.


Results (Qwen3.5-9B, RTX PRO 6000 Blackwell)

Method                          tok/s   VRAM      PPL (WikiText-2)   Notes
FP16 baseline                   45.7    17.9 GB   6.37               Reference
PolarQuant Q5 + torchao INT4    43.1    6.5 GB    6.56               Recommended
torchao INT4 (absmax)           43.3    6.3 GB    6.68
BnB NF4                         34.6    7.7 GB    ~6.7
PolarEngine v4 (Triton)         34.2    7.9 GB    6.89               Custom kernel
PolarQuant Q5 dequant FP16      45.9    18.1 GB   6.39               Near-lossless
PolarQuant MLX Q4               19.7    4.8 GB    6.90               Mac mini M4 16 GB

PolarQuant Ablation (Q5, Qwen3.5-9B)

Configuration              PPL      Delta vs FP16
Absmax Q5 (baseline)       6.9030   +0.53
+ Hadamard rotation        6.4010   +0.03
+ Lloyd-Max centroids      6.9139   +0.54
+ Both (PolarQuant Q5)     6.3909   +0.02

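Hadamard rotation alone accounts for 98% of the improvement: it lowers PPL from 6.9030 to 6.4010, against 6.3909 for the full method. Lloyd-Max centroids without rotation do not help (6.9139). The Walsh-Hadamard transform makes weight distributions approximately Gaussian, so even uniform (absmax) quantization is near-optimal after rotation.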
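The 98% figure can be checked directly from the ablation table. The short calculation below uses only the four perplexities listed there; the variable names are illustrative.

```python
# Perplexities copied from the ablation table (Q5, Qwen3.5-9B).
ppl = {
    "absmax": 6.9030,     # baseline
    "hadamard": 6.4010,   # + Hadamard rotation
    "lloyd_max": 6.9139,  # + Lloyd-Max centroids
    "both": 6.3909,       # PolarQuant Q5
}

total_gain = ppl["absmax"] - ppl["both"]         # improvement of the full method
hadamard_gain = ppl["absmax"] - ppl["hadamard"]  # improvement from rotation alone
lloyd_gain = ppl["absmax"] - ppl["lloyd_max"]    # negative: centroids alone do not help

hadamard_share = hadamard_gain / total_gain
print(f"Hadamard share of total gain: {hadamard_share:.1%}")
print(f"Lloyd-Max alone changes PPL by {-lloyd_gain:+.4f}")
```

The share comes out at roughly 98%, and Lloyd-Max centroids only pay off once the rotation has been applied.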


How It Works

PolarQuant quantization:

  1. Normalize weight blocks by L2 norm
  2. Rotate via Walsh-Hadamard Transform (makes weights approximately Gaussian; accounts for 98% of the quality gain)
  3. Quantize using Lloyd-Max optimal centroids for N(0,1)
  4. Store codes (int8/nibble-packed) + per-block norms (fp16)
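The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the function names are hypothetical, the block size of 128 is borrowed from the H128 matrix in the architecture diagram, the centroids are estimated by running Lloyd's algorithm on Gaussian samples, and the rescale by sqrt(block) (so that rotated unit-norm blocks have roughly unit variance) is an assumption about how the N(0,1) centroids are matched to the data.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_gaussian(bits: int, iters: int = 50, n_samples: int = 200_000) -> np.ndarray:
    """Approximate Lloyd-Max centroids for N(0,1) via Lloyd's algorithm on samples."""
    x = np.random.default_rng(0).standard_normal(n_samples)
    k = 2 ** bits
    c = np.quantile(x, (np.arange(k) + 0.5) / k)  # quantile initialisation
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2              # nearest-centroid boundaries
        idx = np.searchsorted(edges, x)
        sums = np.bincount(idx, weights=x, minlength=k)
        counts = np.bincount(idx, minlength=k)
        c = sums / np.maximum(counts, 1)          # move each centroid to its cell mean
    return c

def polar_quantize(W: np.ndarray, bits: int = 5, block: int = 128):
    blocks = W.reshape(-1, block).astype(np.float64)
    norms = np.maximum(np.linalg.norm(blocks, axis=1, keepdims=True), 1e-12)  # step 1
    z = (blocks / norms) @ hadamard(block) * np.sqrt(block)                   # step 2 (+ assumed rescale)
    centroids = lloyd_max_gaussian(bits)
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(edges, z).astype(np.uint8)                        # step 3
    return codes, norms.astype(np.float16), centroids                         # step 4

def polar_dequantize(codes, norms, centroids, shape, block: int = 128):
    z = centroids[codes] / np.sqrt(block)
    blocks = (z @ hadamard(block).T) * norms.astype(np.float64)  # undo rotation, restore scale
    return blocks.reshape(shape)

# Round trip on synthetic Gaussian weights (not a real model layer).
W = np.random.default_rng(1).standard_normal((256, 512)) * 0.02
codes, norms, centroids = polar_quantize(W)
W_hat = polar_dequantize(codes, norms, centroids, W.shape)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Nibble packing of the codes is left out here to keep the sketch short; the codes are stored one per byte.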

Inference keeps weights quantized in GPU VRAM:

  • Triton kernel does centroid lookup + GEMV in one operation
  • Fast Walsh-Hadamard Transform (FWHT) applied to the input (not the weights) -- 25x faster via matmul
  • FWHT cached across Q/K/V projections (69x total speedup)
  • INT4 nibble packing for Q3/Q4 layers (36% VRAM savings)
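The reason the transform can move from the weights to the input is that the Hadamard matrix is orthogonal: a dot product is unchanged when both vectors are rotated by the same matrix. The sketch below illustrates this with unquantized float weights and hypothetical names; it assumes a block size of 128 and an fp32 Hadamard matrix, and it shows one rotated input being reused for the Q, K and V projections.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

BLOCK = 128                                # assumed block size
H = hadamard(BLOCK).astype(np.float32)     # built once and reused for every layer

def rotate(v: np.ndarray) -> np.ndarray:
    """Block-wise Hadamard rotation expressed as a small matmul against the cached H."""
    return (v.reshape(-1, BLOCK) @ H).reshape(v.shape)

rng = np.random.default_rng(0)
d = 256
x = rng.standard_normal(d).astype(np.float32)
Wq, Wk, Wv = (rng.standard_normal((d, d)).astype(np.float32) for _ in range(3))

# Offline: weights are stored in the rotated domain (where the codes live).
Wq_rot, Wk_rot, Wv_rot = (rotate(W) for W in (Wq, Wk, Wv))

# Online: rotate the input once and share it across all three projections.
x_rot = rotate(x)
q, k, v = Wq_rot @ x_rot, Wk_rot @ x_rot, Wv_rot @ x_rot

# Rotating both sides leaves every dot product unchanged.
print(np.allclose(q, Wq @ x, atol=1e-3), np.allclose(k, Wk @ x, atol=1e-3))
print(f"cached H size: {H.nbytes // 1024} KiB")
```

Under these assumptions the cached matrix occupies 128 x 128 x 4 bytes, which matches the 64KB noted for H128 in the architecture diagram.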

Installation

pip install polarengine-vllm

Or from source:

git clone https://github.com/caiovicentino/polarengine-vllm
cd polarengine-vllm
pip install -e .

Optional CUDA kernels (for CUDA graph support):

pip install -e ".[cuda]"

Quick Start

Option A: PolarQuant Q5 + torchao (Recommended)

from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, Int4WeightOnlyConfig
import torch

# Load PolarQuant Q5 model (auto-dequantizes to FP16)
model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Qwen3.5-9B-PolarQuant-Q5",
    dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-PolarQuant-Q5")

# Apply torchao INT4 for fast inference (43 tok/s, 6.5 GB VRAM)
quantize_(model, Int4WeightOnlyConfig(group_size=128))

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Option B: PolarEngine Triton Kernel

1. Quantize a model

python -m polarengine_vllm.quantize \
    --model Qwen/Qwen3.5-9B \
    --output ./Qwen3.5-9B-PolarEngine/

2. Serve with vLLM

vllm serve ./Qwen3.5-9B-PolarEngine/ --quantization polarengine

3. Use from Python

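from vllm import LLM, SamplingParams

llm = LLM("./Qwen3.5-9B-PolarEngine/", quantization="polarengine")
outputs = llm.generate(["Explain quantum computing:"], SamplingParams(max_tokens=100))
# generate() returns one RequestOutput per prompt
print(outputs[0].outputs[0].text)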

Mixed-Bit Assignment

Layer Type                 Bits   Rationale
gate/up proj (MLP)         Q3     Tolerant to quantization
down proj (MLP)            Q4     Moderate sensitivity
Q/K/V proj (Attention)     Q5     Higher precision for attention
O proj (Attention)         Q6     Output projection needs quality
Embeddings                 Q5     Large, benefits from compression
LM Head                    Q6     Critical for token prediction
Norms, biases, router      FP16   Too small to quantize
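One way to express this table in code is a small name-based lookup. The sketch below is hypothetical: the patterns follow common Hugging Face parameter names (`gate_proj`, `q_proj`, `embed_tokens`, and so on) and are an assumption, not the rules the quantizer actually uses.

```python
import re
from typing import Optional

# Parameters matching these stay in FP16 (norms, biases, router). Checked first.
KEEP_FP16 = re.compile(r"norm|\.bias$|router")

# Hypothetical (pattern, bits) rules mirroring the table; first match wins.
BIT_RULES = [
    (re.compile(r"gate_proj|up_proj"), 3),
    (re.compile(r"down_proj"), 4),
    (re.compile(r"q_proj|k_proj|v_proj"), 5),
    (re.compile(r"o_proj"), 6),
    (re.compile(r"embed_tokens"), 5),
    (re.compile(r"lm_head"), 6),
]

def bits_for(param_name: str) -> Optional[int]:
    """Return the bit width for a parameter, or None to keep it in FP16."""
    if KEEP_FP16.search(param_name):
        return None
    for pattern, bits in BIT_RULES:
        if pattern.search(param_name):
            return bits
    return None  # unknown parameters are left unquantized

for name in [
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.input_layernorm.weight",
]:
    print(name, "->", bits_for(name) or "FP16")
```

Checking the FP16 exclusions first keeps a bias or norm inside an attention block from being picked up by the projection patterns.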

Architecture

Input x -> Pad -> FWHT(x) via matmul -> Triton GEMV Kernel -> Output
                  ^                        ^
          H128 (cached, 64KB)    codes + norms + centroids
                                 (quantized, in VRAM)
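The data flow in the diagram can be written as an unfused NumPy reference. This is a sketch of what such a kernel computes, not the Triton kernel itself: the function names, the low-nibble-first packing order, and the per-block scale of norm / sqrt(128) are assumptions, and the 16-entry centroid table is a placeholder rather than real Lloyd-Max values.

```python
import numpy as np

BLOCK = 128  # assumed block size (H128 in the diagram)

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first (assumed order)."""
    codes = codes.astype(np.uint8)
    return codes[..., 0::2] | (codes[..., 1::2] << 4)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out

def polar_gemv(x, packed, norms, centroids, H):
    """Pad -> rotate input -> centroid lookup + GEMV, as separate steps."""
    out_features, n_blocks = norms.shape
    x_pad = np.zeros(n_blocks * BLOCK, dtype=np.float32)
    x_pad[: x.size] = x                                   # Pad
    x_rot = x_pad.reshape(n_blocks, BLOCK) @ H            # FWHT(x) via matmul
    codes = unpack_nibbles(packed).reshape(out_features, n_blocks, BLOCK)
    w_rot = centroids[codes]                              # centroid lookup
    scale = norms / np.sqrt(BLOCK)                        # assumed per-block scale
    return np.einsum("obk,bk,ob->o", w_rot, x_rot, scale)

# Synthetic example; in_f is not a multiple of 128, so padding is exercised.
rng = np.random.default_rng(0)
out_f, in_f = 8, 200
n_blocks = -(-in_f // BLOCK)
H = hadamard(BLOCK).astype(np.float32)
centroids = np.linspace(-2.0, 2.0, 16).astype(np.float32)  # placeholder table
codes = rng.integers(0, 16, size=(out_f, n_blocks * BLOCK), dtype=np.uint8)
norms = rng.uniform(0.5, 1.5, size=(out_f, n_blocks)).astype(np.float32)
x = rng.standard_normal(in_f).astype(np.float32)

y = polar_gemv(x, pack_nibbles(codes), norms, centroids, H)

# Dense check: dequantize the weights fully, then do an ordinary matvec.
W = (centroids[codes].reshape(out_f, n_blocks, BLOCK) / np.sqrt(BLOCK)) @ H.T
W = (W * norms[..., None]).reshape(out_f, -1)[:, :in_f]
print(np.allclose(y, W @ x, atol=1e-3))
```

The dense check confirms that working in the rotated domain gives the same result as dequantizing the weights first, while only the packed codes, norms and centroid table need to stay resident.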

Published Models

Model                               Link          Notes
Qwen3.5-9B PolarQuant Q5            HuggingFace   Recommended, 9.1 GB
Qwen3.5-9B PolarQuant MLX 4-bit     HuggingFace   Apple Silicon
Qwen3.5-9B PolarEngine v4           HuggingFace   Triton kernel

See the main EOQ repository for additional models and full documentation.


Citation

@article{vicentino2026polarquant,
    title={PolarQuant: Near-Lossless LLM Quantization via Walsh-Hadamard Rotation
           and Entropy-Optimal Coding},
    author={Vicentino, Caio},
    journal={arXiv preprint arXiv:2603.7424577},
    year={2026}
}

License

Apache 2.0
