
Near-optimal weight quantization for LLMs using the TurboQuant algorithm

Project description

turboQuant

Weight quantization for Large Language Models, adapted from the TurboQuant algorithm by Google Research (ICLR 2026).

Quantize any HuggingFace model to 2-4 bits per weight with minimal quality loss. No calibration data required.

Features

  • 2/3/4-bit weight quantization using TurboQuant's random-rotation + Lloyd-Max pipeline
  • No calibration data needed -- uses mathematical properties of random rotations, not model-specific tuning
  • Residual quantization -- optional second pass (e.g., 4+4 for 8 bits total) for near-lossless compression
  • Wide HuggingFace model support -- works with Llama, Mistral, Qwen, Gemma, Phi, and other models using nn.Linear (CausalLM, Seq2Seq, classification via --model-class)
  • KV cache compression -- runtime cache compression with proper bit-packing for longer contexts
  • CLI included -- quantize models from the command line
  • Pure PyTorch -- no CUDA/Triton dependency required

Installation

# Core (PyTorch only)
pip install turboquant-hf

# With HuggingFace support (recommended)
pip install turboquant-hf[transformers]

# Development
pip install turboquant-hf[dev]

Or install from source:

git clone https://github.com/singhsidhukuldeep/turboQuant.git
cd turboQuant
pip install -e ".[all]"

Quick Start

Python API

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantConfig, quantize_model, save_quantized, load_quantized

# Load a model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Quantize to 4-bit
config = TurboQuantConfig(bit_width=4, group_size=128)
model = quantize_model(model, config)

# Save
save_quantized(model, config, "./qwen-0.5b-tq4", save_tokenizer=True, tokenizer=tokenizer)

# Load later
model = load_quantized("Qwen/Qwen2.5-0.5B", "./qwen-0.5b-tq4")

# Generate
inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Higher Quality with Residual Quantization

# 4+4 = 8-bit total, near-lossless
config = TurboQuantConfig(bit_width=4, residual_bit_width=4, group_size=128)
model = quantize_model(model, config)
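
For intuition, here is a minimal sketch of the residual idea using a stand-in uniform quantizer (fake_quant is illustrative only; the library's actual quantizer is the rotate + Lloyd-Max pipeline described under "How It Works"):

import torch

def fake_quant(w, bits):
    # Stand-in symmetric uniform quantizer, purely for illustration.
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    q = (w / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

w = torch.randn(128) * 0.02
w_hat = fake_quant(w, 4)                       # first 4-bit pass
r_hat = fake_quant(w - w_hat, 4)               # second 4-bit pass on the error
print((w - w_hat).norm(), (w - (w_hat + r_hat)).norm())  # error drops sharply

The second pass spends its full bit budget on a much smaller signal (the first pass's error), which is why 4+4 lands close to lossless.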

KV Cache Compression

from turboquant import TurboQuantKVCache

cache = TurboQuantKVCache(key_bits=4, value_bits=4, residual_window=128)

# During generation, compress and store:
# cache.update(layer_idx, key_states, value_states)
# keys, values = cache.get(layer_idx)
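
As a rough back-of-envelope estimate: with key_bits=4 and value_bits=4, each cached element takes 4 bits instead of fp16's 16, so the cache shrinks about 4x (minus a small per-group norm overhead), letting the same memory budget hold roughly 4x more context tokens.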

Command Line

# Quantize a model
turboquant quantize \
    --model Qwen/Qwen2.5-0.5B \
    --output ./quantized \
    --bits 4 \
    --group-size 128

# Estimate compression ratio
turboquant estimate --model Qwen/Qwen2.5-0.5B --bits 4

# Generate text with a quantized model
turboquant generate \
    --model Qwen/Qwen2.5-0.5B \
    --quantized ./quantized \
    --prompt "Hello, world!"

# Inspect a quantized model
turboquant info ./quantized

How It Works

This library adapts the TurboQuant vector quantization algorithm for model weight compression:

  1. Normalize: Extract per-group norms (stored in float32)
  2. Rotate: Apply a random orthogonal transform (Walsh-Hadamard or Haar). After rotation, each coordinate of a normalized group follows a known shifted-and-scaled Beta((d-1)/2, (d-1)/2) distribution, regardless of the original weight values
  3. Quantize: Apply a precomputed Lloyd-Max optimal scalar quantizer tailored to this Beta distribution
  4. Pack: Bit-pack the quantization indices for compact storage
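
Step 4 is plain bit manipulation. A minimal sketch for the 2-bit case, packing four indices per byte (illustrative code, not the library's packing routine):

import torch

idx = torch.randint(0, 4, (128,), dtype=torch.uint8)       # 2-bit codes in [0, 3]
packed = idx[0::4] | (idx[1::4] << 2) | (idx[2::4] << 4) | (idx[3::4] << 6)
assert packed.numel() == idx.numel() // 4                  # 4x fewer bytes
unpacked = torch.stack([(packed >> s) & 3 for s in (0, 2, 4, 6)], dim=1).flatten()
assert torch.equal(unpacked, idx)                          # round-trips exactly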

The key insight from the TurboQuant paper is that random rotation makes the statistical properties of rotated coordinates predictable and universal, enabling an optimal quantizer without any calibration data.
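
To make steps 1-3 concrete, here is a self-contained sketch (hadamard_rotate, quantize_group, and dequantize_group are hypothetical helper names, not the turboquant API, and the codebook below is a toy stand-in for the precomputed Lloyd-Max one):

import torch

def hadamard_rotate(x):
    # Orthonormal fast Walsh-Hadamard transform along the last dim
    # (assumes that dim is a power of two). Self-inverse: H(H(x)) == x.
    d = x.shape[-1]
    y = x.reshape(-1, d).clone()
    h = 1
    while h < d:
        y = y.view(-1, d // (2 * h), 2, h)
        a, b = y[:, :, 0, :], y[:, :, 1, :]
        y = torch.stack((a + b, a - b), dim=2).reshape(-1, d)
        h *= 2
    return (y / d ** 0.5).reshape(x.shape)

def quantize_group(w, codebook):
    norm = w.norm()                                      # step 1: per-group norm
    signs = (torch.randint(0, 2, w.shape) * 2 - 1).to(w.dtype)
    z = hadamard_rotate(signs * w / norm)                # step 2: random rotation H·D
    idx = (z.unsqueeze(-1) - codebook).abs().argmin(-1)  # step 3: nearest codeword
    return idx, norm, signs

def dequantize_group(idx, norm, signs, codebook):
    # Invert the rotation: z = H(D·w)  =>  w = D·H(z), since H and the
    # sign flip D are both self-inverse; then restore the stored norm.
    return norm * signs * hadamard_rotate(codebook[idx])

codebook = torch.tensor([-0.15, -0.05, 0.05, 0.15])      # toy 2-bit codebook
w = torch.randn(128)
idx, norm, signs = quantize_group(w, codebook)
w_hat = dequantize_group(idx, norm, signs, codebook)
print(f"relative error: {(w - w_hat).norm() / w.norm():.3f}")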

Scope and Relationship to the Paper

The TurboQuant paper (Zandieh et al., 2025) focuses on KV cache compression and nearest-neighbor search. This library applies the paper's core technique (Algorithm 1: random rotation + Lloyd-Max scalar quantization) to weight quantization, which is a community-driven adaptation not covered in the original paper.

The paper also describes a two-stage approach (Algorithm 2: MSE + 1-bit QJL correction) for unbiased inner product estimation. Community testing has shown that the QJL correction degrades quality in practice for softmax attention and weight reconstruction, so this library uses the MSE-only quantizer (Algorithm 1) and offers an optional multi-bit residual pass instead.

Why No Calibration?

Traditional quantization methods (GPTQ, AWQ) require calibration data to determine optimal quantization parameters per-layer. TurboQuant sidesteps this: after rotation, the coordinate distribution is determined by dimensionality alone. The optimal codebook is precomputed once for each (dimension, bit-width) pair.
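
For intuition, such a codebook could be precomputed by running Lloyd's algorithm on samples of the known post-rotation coordinate distribution (a sketch under that Beta assumption; lloyd_max_codebook is a hypothetical helper, not part of this package):

import torch

def lloyd_max_codebook(dim, bits, iters=50):
    # A coordinate x of a random unit vector in R^dim satisfies
    # (x + 1) / 2 ~ Beta((dim - 1) / 2, (dim - 1) / 2); sample it directly.
    beta = torch.distributions.Beta((dim - 1) / 2, (dim - 1) / 2)
    samples = beta.sample((200_000,)) * 2 - 1
    k = 2 ** bits
    # Initialize codewords at evenly spaced quantiles, then alternate
    # nearest-codeword assignment and centroid updates (1-D k-means).
    levels = torch.quantile(samples, torch.linspace(0, 1, k + 2)[1:-1])
    for _ in range(iters):
        idx = (samples.unsqueeze(1) - levels).abs().argmin(1)
        for j in range(k):
            cell = samples[idx == j]
            if cell.numel():
                levels[j] = cell.mean()
    return levels

print(lloyd_max_codebook(dim=128, bits=2))  # four symmetric levels near zero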

Supported Configurations

Compression ratios account for per-group float32 norms and remainder column overhead at group_size=128:

Config         Total Bits   Approx. Compression   Quality
4-bit          4            ~3.7x                 Good for most tasks
3-bit          3            ~4.8x                 Acceptable for large models (7B+)
2-bit          2            ~6.6x                 Aggressive, some quality loss
4+4 residual   8            ~1.9x                 Near-lossless
4+2 residual   6            ~2.5x                 Balanced
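
To see where these numbers come from: at 4 bits with group_size=128, each group of 128 weights stores 128 × 4 = 512 bits of indices plus one float32 norm (32 bits), i.e. (512 + 32) / 128 = 4.25 bits per weight; against an fp16 baseline that is 16 / 4.25 ≈ 3.8x, which remainder-column overhead brings down to the listed ~3.7x.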

Citations

Adapted from research by Google:

  • TurboQuant (Zandieh et al., 2025) -- Online vector quantization with near-optimal distortion rate. ICLR 2026.
  • QJL (Zandieh et al., 2024) -- 1-bit quantized JL transform for KV cache quantization with zero overhead. AAAI 2025.
  • PolarQuant (Han et al., 2025) -- Quantizing KV caches with polar transformation. AISTATS 2026.

Download files

Download the file for your platform.

Source Distribution

turboquant_hf-0.1.0.tar.gz (41.7 kB)


Built Distribution


turboquant_hf-0.1.0-py3-none-any.whl (36.9 kB)


File details

Details for the file turboquant_hf-0.1.0.tar.gz.

File metadata

  • Download URL: turboquant_hf-0.1.0.tar.gz
  • Upload date:
  • Size: 41.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for turboquant_hf-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       3ccda181ce2d75ffbddde254f5b953dd62bc627c5a659c3c61eaeb4b02d0ce86
MD5          15948f257c18c44ada8464f2eb7e4d5e
BLAKE2b-256  a5f1ea60a28554dfc205c4713ff11b330967d9943f7feff4ec83f37b8263cc60


Provenance

The following attestation bundles were made for turboquant_hf-0.1.0.tar.gz:

Publisher: publish.yml on singhsidhukuldeep/turboQuant


File details

Details for the file turboquant_hf-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: turboquant_hf-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 36.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for turboquant_hf-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       33528ec41644623b2f8f5dc2c46c84200b7e11d05c8d2ba9ef4a4d92e8561def
MD5          5f307c0fab662d0b3c5e65a316d15c66
BLAKE2b-256  ea68403d26a60a35591434aceeca3ca68333050b7972f0e96880569d4a450d08


Provenance

The following attestation bundles were made for turboquant_hf-0.1.0-py3-none-any.whl:

Publisher: publish.yml on singhsidhukuldeep/turboQuant

