TurboQuant-PyTorch
High-Performance Vector Quantization Engine with Lloyd-Max and QJL
A PyTorch C++ implementation of the TurboQuant paper.
TurboQuant is a specialized, high-performance vector quantization library designed for Large Language Models (LLMs) and vector search applications. By offloading core computations to C++ and applying unbiased residual compensation, it significantly reduces memory overhead while maintaining near-lossless precision.
Key Features
- Turbo-Charged C++ Core: Core operations like rotation, projection, and quantization are implemented in optimized C++ for millisecond-level inference.
- Lloyd-Max Optimization: Automatically computes the most efficient centroids for Gaussian distributions using Scipy's K-Means.
- Unbiased Residual Compensation: Uses QJL signs to preserve vector magnitude and direction, minimizing cumulative error in deep networks.
- Smart Matrix Caching: Automatically caches trained centroids ($\mathcal{C}$) and orthogonal matrices ($\Pi, S$) for instant engine startup.
- Adaptive Dimension Support: Fully compatible with any dimension $d$ and any bit-rate $b$ (from 1-bit to 8-bit).
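As a sanity check of the Lloyd-Max idea, the optimal centroids for a unit Gaussian can be approximated with a generic k-means fit on random samples. The sketch below uses SciPy directly and is independent of TurboQuant's internal implementation; the specific sample size and seed are arbitrary choices:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# For a scalar Gaussian source, Lloyd-Max quantization is equivalent to
# k-means on samples drawn from N(0, 1): the fitted centroids converge
# toward the optimal quantization levels (for b=2: roughly ±0.45 and ±1.51).
rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000).reshape(-1, 1)

b = 2                          # target bit-rate
k = 2 ** b                     # number of quantization levels
centroids, distortion = kmeans(samples, k, seed=0)

print(np.sort(centroids.ravel()))
```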
Comparison
How to read the benchmark chart:
- Y-Axis (Fidelity): Higher Cosine Similarity means more accurate vector reconstruction.
- X-Axis (Latency): Lower values indicate faster C++/LibTorch execution.
- Bubble Size (Memory): Larger bubbles represent higher memory compression ratios.
- 1-bit: 32x Compression (Largest Bubble)
- 2-bit: 16x Compression
- 4-bit: 8x Compression
- Int8 (Baseline): 4x Compression (Smallest Bubble)
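The ratios above are just bit-width arithmetic against an FP32 baseline (ignoring small per-vector overheads such as the scaling factor):

```python
FP32_BITS = 32  # baseline: 32 bits per value

# Compression ratio = baseline bits per value / quantized bits per value
for bits in (1, 2, 4, 8):
    print(f"{bits}-bit: {FP32_BITS // bits}x compression")
# 1-bit: 32x compression
# 2-bit: 16x compression
# 4-bit: 8x compression
# 8-bit: 4x compression
```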
Installation
# Clone the repository
git clone https://github.com/ericoder960803/TurboQuant.git
cd TurboQuant
# Install in editable mode (Builds C++ extension automatically)
pip install -e .
Usage
TurboQuant provides a seamless PyTorch-like API. You can easily integrate it into your inference pipeline.
Quick Start
import torch
from turboquant import TurboQuantEngine
# 1. Initialize the engine
# d: vector dimension, b: target bit-rate (1, 2, 4, or 8)
d = 1024
b = 2
engine = TurboQuantEngine(d=d, b=b, cache=True)
# 2. Prepare your high-precision vector (FP32)
x = torch.randn(d)
# 3. Encode (Compression)
# idx: Lloyd-Max centroids indices
# qjl: 1-bit residual signs
# gamma: Dynamic scaling factor for reconstruction
idx, qjl, gamma = engine.encode(x)
# 4. Decode (Decompression)
x_hat = engine.decode(idx, qjl, gamma)
# 5. Check Fidelity
similarity = torch.nn.functional.cosine_similarity(x.unsqueeze(0), x_hat.unsqueeze(0))
print(f"Reconstruction Cosine Similarity: {similarity.item():.4f}")
Usage Examples
1. LLM KV-Cache Management (16x Memory Saving)
Ideal for long-context LLM inference (e.g., Llama-3) where KV-cache memory is the primary bottleneck.
import torch
from turboquant import TurboQuantEngine

class TurboQuantKVCache:
    def __init__(self, dim=4096, bits=2):
        # 2-bit quantization: a 4096-dim FP32 key (16 KB) shrinks to ~1 KB
        self.engine = TurboQuantEngine(d=dim, b=bits, cache=True)
        self.cache = []

    def push(self, key_tensor):
        """Compress and store in cache."""
        packet = self.engine.encode(key_tensor)
        self.cache.append(packet)

    def fetch_all(self):
        """Decompress all vectors for the attention calculation."""
        if not self.cache:
            return None
        return torch.stack([self.engine.decode(*p) for p in self.cache])

# Usage
kv_manager = TurboQuantKVCache(dim=4096, bits=2)
kv_manager.push(torch.randn(4096))   # Encode new key
keys = kv_manager.fetch_all()        # Restore for attention: [seq_len, dim]
2. High-Speed Vector Search
Enables high-fidelity vector databases or RAG systems with minimal storage footprint.
import torch
from turboquant import TurboQuantEngine
# 1. Setup Database (10,000 vectors)
D, B = 1024, 2
engine = TurboQuantEngine(d=D, b=B)
database = torch.randn(10000, D)
# 2. Offline Compression
compressed_db = [engine.encode(v) for v in database]
# 3. Online Search
query = torch.randn(D)
reconstructed_db = torch.stack([engine.decode(*p) for p in compressed_db])
scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), reconstructed_db)
# 4. Get Top-K
top_values, top_indices = torch.topk(scores, k=5)
print(f"Top Indices: {top_indices.tolist()}")
Mathematical Foundation
The reconstruction $\hat{x}$ is computed as: $$\hat{x} = \Pi^T \left( \mathcal{C}_{idx} + \gamma \cdot \sqrt{\frac{\pi}{2d}} \cdot S^T q_{jl} \right)$$ Where:
- $\Pi$: Orthogonal Rotation Matrix
- $\mathcal{C}$: Lloyd-Max Optimal Centroids
- $S$: QJL Projection Matrix
- $\gamma$: Dynamic scaling factor
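The formula above can be sketched directly in PyTorch. All state below (`Pi`, `S`, the centroid levels, and the encoded quantities) consists of illustrative stand-ins, not the library's actual trained matrices; the point is only to show how the terms combine:

```python
import math
import torch

torch.manual_seed(0)
d, b = 8, 2
k = 2 ** b

# Illustrative stand-ins for the engine's trained/cached state
Pi, _ = torch.linalg.qr(torch.randn(d, d))          # orthogonal rotation matrix
S, _ = torch.linalg.qr(torch.randn(d, d))           # QJL projection matrix
levels = torch.tensor([-1.51, -0.45, 0.45, 1.51])   # approx. 2-bit Lloyd-Max levels

# Encoded quantities, as engine.encode(x) would produce them
idx = torch.randint(k, (d,))                        # per-coordinate centroid indices
qjl = torch.sign(torch.randn(d))                    # 1-bit residual signs
gamma = 0.7                                         # dynamic scaling factor

# x_hat = Pi^T (C_idx + gamma * sqrt(pi / (2d)) * S^T q_jl)
x_hat = Pi.T @ (levels[idx] + gamma * math.sqrt(math.pi / (2 * d)) * (S.T @ qjl))
print(x_hat.shape)  # torch.Size([8])
```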
Citation
If you find TurboQuant-PyTorch useful in your research or project, please cite the original paper:
Original Paper (arXiv:2504.19874)
@article{zandieh2025turboquant,
title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
journal={arXiv preprint arXiv:2504.19874},
year={2025}
}
@misc{ericliam2026turboquant,
author = {Eric Liam},
title = {TurboQuant-PyTorch: High-Performance C++/LibTorch Implementation},
year = {2026},
publisher = {GitHub},
howpublished = {\url{https://github.com/ericoder960803/TurboQuant-PyTorch}}
}