
Near-optimal weight quantization for LLMs using the TurboQuant algorithm

Project description

turboQuant

Go to https://pypi.org/project/turboquant-hf/

Weight quantization for Large Language Models
Adapted from the TurboQuant algorithm by Google Research (ICLR 2026)

Quantize any HuggingFace model to 2-4 bits per weight with minimal quality loss. No calibration data required.

Features

  • 2/3/4-bit weight quantization using TurboQuant's random-rotation + Lloyd-Max pipeline
  • No calibration data needed -- uses mathematical properties of random rotations, not model-specific tuning
  • Residual quantization -- optional second pass (e.g., 4+4 = 8 bit) for near-lossless compression
  • Wide HuggingFace model support -- works with Llama, Mistral, Qwen, Gemma, Phi, and other models using nn.Linear (CausalLM, Seq2Seq, classification via --model-class)
  • KV cache compression -- runtime cache compression with proper bit-packing for longer contexts
  • CLI included -- quantize models from the command line
  • Pure PyTorch -- no CUDA/Triton dependency required

Installation

# Core (PyTorch only)
pip install turboquant-hf

# With HuggingFace support (recommended)
pip install turboquant-hf[transformers]

# Development
pip install turboquant-hf[dev]

Or install from source:

git clone https://github.com/singhsidhukuldeep/turboQuant.git
cd turboQuant
pip install -e ".[all]"

Quick Start

Python API

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantConfig, quantize_model, save_quantized, load_quantized

# Load a model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Quantize to 4-bit
config = TurboQuantConfig(bit_width=4, group_size=128)
model = quantize_model(model, config)

# Save
save_quantized(model, config, "./qwen-0.5b-tq4", save_tokenizer=True, tokenizer=tokenizer)

# Load later
model = load_quantized("Qwen/Qwen2.5-0.5B", "./qwen-0.5b-tq4")

# Generate
inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Higher Quality with Residual Quantization

# 4+4 = 8-bit total, near-lossless
config = TurboQuantConfig(bit_width=4, residual_bit_width=4, group_size=128)
model = quantize_model(model, config)
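
The residual pass quantizes the error left over from the first pass, then adds the two reconstructions. A minimal sketch of that idea, using a plain uniform quantizer as a hypothetical stand-in for the library's full rotate + Lloyd-Max pass (`fake_quantize` is illustrative, not the library API):

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(x, bits):
    # Illustrative symmetric uniform quantizer (stand-in for the
    # library's rotate + Lloyd-Max pipeline)
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

w = rng.standard_normal(4096)
w4 = fake_quantize(w, 4)              # first 4-bit pass
w44 = w4 + fake_quantize(w - w4, 4)   # second 4-bit pass on the residual

rel = lambda a: np.linalg.norm(w - a) / np.linalg.norm(w)
print(rel(w4), rel(w44))  # the residual pass sharply reduces the error
```

Because the residual is much smaller than the original weights, the second pass gets its own, much finer scale, which is why 4+4 lands close to lossless.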

KV Cache Compression

from turboquant import TurboQuantKVCache

cache = TurboQuantKVCache(key_bits=4, value_bits=4, residual_window=128)

# During generation, compress and store:
# cache.update(layer_idx, key_states, value_states)
# keys, values = cache.get(layer_idx)

Command Line

# Quantize a model
turboquant quantize \
    --model Qwen/Qwen2.5-0.5B \
    --output ./quantized \
    --bits 4 \
    --group-size 128

# Estimate compression ratio
turboquant estimate --model Qwen/Qwen2.5-0.5B --bits 4

# Generate text with a quantized model
turboquant generate \
    --model Qwen/Qwen2.5-0.5B \
    --quantized ./quantized \
    --prompt "Hello, world!"

# Inspect a quantized model
turboquant info ./quantized

How It Works

This library adapts the TurboQuant vector quantization algorithm for model weight compression:

  1. Normalize: Extract per-group norms (stored in float32)
  2. Rotate: Apply a random orthogonal transform (Walsh-Hadamard or Haar). After rotation, each coordinate follows a known Beta((d-1)/2, (d-1)/2) distribution regardless of the original weight values
  3. Quantize: Apply a precomputed Lloyd-Max optimal scalar quantizer tailored to this Beta distribution
  4. Pack: Bit-pack the quantization indices for compact storage

The key insight from the TurboQuant paper is that random rotation makes the statistical properties of rotated coordinates predictable and universal, enabling an optimal quantizer without any calibration data.
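
The four steps above can be sketched end to end in NumPy. This is an illustrative reimplementation, not the library's code: the codebook is fit on the fly with a small Lloyd-Max loop over freshly rotated unit vectors, standing in for the precomputed Beta-based tables, and the final bit-packing step is elided:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    # Haar-random orthogonal matrix via QR of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def lloyd_max(samples, levels, iters=60):
    # Lloyd-Max: alternate nearest-centroid partitioning with
    # recomputing each cell's centroid
    centers = np.quantile(samples, np.linspace(0, 1, levels + 2)[1:-1])
    for _ in range(iters):
        edges = (centers[:-1] + centers[1:]) / 2
        idx = np.searchsorted(edges, samples)
        for k in range(levels):
            cell = samples[idx == k]
            if cell.size:
                centers[k] = cell.mean()
    return centers

d, bits = 128, 4
w = rng.standard_normal(d)

# 1. Normalize: keep the group norm in full precision
norm = np.linalg.norm(w)

# 2. Rotate: coordinates of the rotated unit vector follow a fixed,
#    data-independent distribution (a scaled Beta in the paper)
R = random_rotation(d, rng)
z = R @ (w / norm)

# 3. Quantize with a codebook fit once to that universal distribution
#    (trained here on fresh rotated unit vectors)
train = rng.standard_normal((2000, d))
train = (train / np.linalg.norm(train, axis=1, keepdims=True)).ravel()
centers = lloyd_max(train, 2 ** bits)
edges = (centers[:-1] + centers[1:]) / 2
idx = np.searchsorted(edges, z)  # 4. these 4-bit indices get bit-packed

# Dequantize: look up centroids, invert the rotation, rescale
w_hat = norm * (R.T @ centers[idx])
rel_err = np.linalg.norm(w - w_hat) / norm
print(f"relative error: {rel_err:.3f}")
```

Note that the same `centers` codebook serves any weight group of this dimension, which is exactly what makes the method calibration-free.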

Scope and Relationship to the Paper

The TurboQuant paper (Zandieh et al., 2025) focuses on KV cache compression and nearest-neighbor search. This library applies the paper's core technique (Algorithm 1: random rotation + Lloyd-Max scalar quantization) to weight quantization, which is a community-driven adaptation not covered in the original paper.

The paper also describes a two-stage approach (Algorithm 2: MSE + 1-bit QJL correction) for unbiased inner product estimation. Community testing has shown that the QJL correction degrades quality in practice for softmax attention and weight reconstruction, so this library uses the MSE-only quantizer (Algorithm 1) and offers an optional multi-bit residual pass instead.

Why No Calibration?

Traditional quantization methods (GPTQ, AWQ) require calibration data to determine optimal quantization parameters per-layer. TurboQuant sidesteps this: after rotation, the coordinate distribution is determined by dimensionality alone. The optimal codebook is precomputed once for each (dimension, bit-width) pair.
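
A quick experiment illustrates the claim (illustrative NumPy, not the library's API): two weight vectors with completely different shapes produce statistically identical coordinates once normalized and randomly rotated, with mean 0 and variance exactly 1/d.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256

def rotated_coords(w, trials=200):
    # Normalize, then apply a fresh Haar-random rotation each trial
    u = w / np.linalg.norm(w)
    out = []
    for _ in range(trials):
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        out.append((q * np.sign(np.diag(r))) @ u)
    return np.concatenate(out)

# Two weight vectors with very different shapes...
z_gauss = rotated_coords(rng.standard_normal(d))
z_spiky = rotated_coords(np.eye(d)[0])  # a single spike

# ...yield the same coordinate statistics after rotation
print(z_gauss.var() * d, z_spiky.var() * d)  # both ≈ 1.0
```

Since the coordinate distribution depends on d alone, one codebook per (dimension, bit-width) pair covers every layer of every model.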

Supported Configurations

Compression ratios account for per-group float32 norms and remainder column overhead at group_size=128:

| Config       | Total Bits | Approx. Compression | Quality                           |
|--------------|------------|---------------------|-----------------------------------|
| 4-bit        | 4          | ~3.7x               | Good for most tasks               |
| 3-bit        | 3          | ~4.8x               | Acceptable for large models (7B+) |
| 2-bit        | 2          | ~6.6x               | Aggressive, some quality loss     |
| 4+4 residual | 8          | ~1.9x               | Near-lossless                     |
| 4+2 residual | 6          | ~2.5x               | Balanced                          |
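
The ~3.7x figure for 4-bit, for example, follows from simple arithmetic: each group of 128 weights stores 128 indices plus one float32 norm, i.e. 4 + 32/128 = 4.25 bits per weight against 16 bits for a float16 source. A quick back-of-envelope check (this ignores the remainder-column overhead the table also counts, so the low-bit ratios come out slightly high):

```python
def approx_ratio(total_bits, group_size=128, source_bits=16):
    # Bits per weight = quantization indices + the group's share
    # of one float32 norm
    bits_per_weight = total_bits + 32 / group_size
    return source_bits / bits_per_weight

for b in (2, 3, 4, 6, 8):
    print(f"{b}-bit total: ~{approx_ratio(b):.2f}x")
# e.g. 4 bits -> 16 / 4.25 ≈ 3.76x, matching the ~3.7x above
```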

Citations

Adapted from research by Google:

  • TurboQuant (Zandieh et al., 2025) -- Online vector quantization with near-optimal distortion rate. ICLR 2026.
  • QJL (Zandieh et al., 2024) -- 1-bit quantized JL transform for KV cache quantization with zero overhead. AAAI 2025.
  • PolarQuant (Han et al., 2025) -- Quantizing KV caches with polar transformation. AISTATS 2026.



Download files

Download the file for your platform.

Source Distribution

turboquant_hf-0.1.1.tar.gz (43.1 kB)

Built Distribution


turboquant_hf-0.1.1-py3-none-any.whl (37.7 kB)

File details

Details for the file turboquant_hf-0.1.1.tar.gz.

File metadata

  • Download URL: turboquant_hf-0.1.1.tar.gz
  • Upload date:
  • Size: 43.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for turboquant_hf-0.1.1.tar.gz
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 895bded0f8498105e5872536c16a833b25797295dc96c062955fca75e94a697a |
| MD5         | 08e371402def29d4eb7985d0dfb83387                                 |
| BLAKE2b-256 | 5224f77b0949c8d4752672219ebb586b9a2db3b4b2b970d837eb598087eb6cbb |


Provenance

The following attestation bundles were made for turboquant_hf-0.1.1.tar.gz:

Publisher: publish.yml on singhsidhukuldeep/turboQuant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file turboquant_hf-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: turboquant_hf-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 37.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for turboquant_hf-0.1.1-py3-none-any.whl
| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 1f4cd238d4a500ce9ff4b66d1ef4fc4dda13256b69ec2c552a7f80c6dd16b39b |
| MD5         | 5b168b289cc5aadb2c2f000fa24498e5                                 |
| BLAKE2b-256 | 11f6dd949d64ac17e377ef8894ab7cdf5be13980181c0ef407dd10776cbbec61 |


Provenance

The following attestation bundles were made for turboquant_hf-0.1.1-py3-none-any.whl:

Publisher: publish.yml on singhsidhukuldeep/turboQuant

