
High-performance Quantization Engine with Lloyd-Max and QJL

Project description

TurboQuant-Pytorch

High-Performance Vector Quantization Engine

A PyTorch/C++ implementation of the TurboQuant paper

License: MIT

English Version | 繁體中文版本


English Version

TurboQuant is a specialized, high-performance vector quantization library designed for Large Language Models (LLMs) and vector search applications. By offloading core computations to C++ and integrating mathematical optimization, it significantly reduces memory overhead while maintaining near-lossless precision through unbiased residual compensation.

Key Features

  • Turbo-Charged C++ Core: Core operations like rotation, projection, and quantization are implemented in optimized C++ for millisecond-level inference.
  • Lloyd-Max Optimization: Automatically computes the most efficient centroids for Gaussian distributions using Scipy's K-Means.
  • Unbiased Residual Compensation: Uses QJL signs to preserve vector magnitude and direction, minimizing cumulative error in deep networks.
  • Smart Matrix Caching: Automatically caches trained centroids ($\mathcal{C}$) and orthogonal matrices ($\Pi, S$) for instant engine startup.
  • Adaptive Dimension Support: Fully compatible with any dimension $d$ and any bit-rate $b$ (from 1-bit to 8-bit).

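The Lloyd-Max step above fits scalar centroids to a Gaussian source. As a rough illustration of the idea (not the library's implementation, which is described as using Scipy's K-Means), here is a minimal NumPy sketch of plain Lloyd iteration; `lloyd_max_centroids` is a hypothetical helper name:

```python
import numpy as np

def lloyd_max_centroids(samples, b, iters=50):
    """Fit 2**b scalar centroids by plain Lloyd iteration: alternate a
    nearest-centroid assignment step with a cluster-mean update step."""
    k = 2 ** b
    # Initialize at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assignment: map every sample to its nearest centroid.
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        # Update: each centroid becomes the mean of its cluster.
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = samples[idx == j].mean()
    return np.sort(centroids)

rng = np.random.default_rng(0)
c = lloyd_max_centroids(rng.standard_normal(20_000), b=2)
print(c)  # 4 centroids, roughly symmetric around 0
```

For a standard Gaussian at $b=2$, the fitted levels land near the classical Lloyd-Max values of roughly ±0.45 and ±1.51.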

Comparison

TurboQuant Performance Benchmark

How to read the benchmark chart:

  • Y-Axis (Fidelity): Higher Cosine Similarity means more accurate vector reconstruction.
  • X-Axis (Latency): Lower values indicate faster C++/LibTorch execution.
  • Bubble Size (Memory): Larger bubbles represent higher memory compression ratios.
    • 1-bit: 32x Compression (Largest Bubble)
    • 2-bit: 16x Compression
    • 4-bit: 8x Compression
    • Int8 (Baselines): 4x Compression (Smallest Bubble)
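The compression ratios above follow directly from the bit-rate: an FP32 component occupies 32 bits, so quantizing to $b$ bits gives a nominal $32/b$ ratio. A quick check (per-vector overhead such as the QJL sign bit and the scale factor is ignored here):

```python
# FP32 stores 32 bits per vector component; quantizing to b bits gives a
# nominal 32 / b compression ratio.
for b in (1, 2, 4, 8):
    print(f"{b}-bit: {32 // b}x compression")
```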

Installation

# Clone the repository
git clone https://github.com/ericoder960803/TurboQuant.git

cd TurboQuant

# Install in editable mode (Builds C++ extension automatically)
pip install -e .

Usage

TurboQuant provides a seamless PyTorch-like API. You can easily integrate it into your inference pipeline.

Quick Start

import torch
from turboquant import TurboQuantEngine

# 1. Initialize the engine
# d: vector dimension, b: target bit-rate (1, 2, 4, or 8)
d = 1024
b = 2
engine = TurboQuantEngine(d=d, b=b, cache=True)

# 2. Prepare your high-precision vector (FP32)
x = torch.randn(d)

# 3. Encode (Compression)
# idx: Lloyd-Max centroids indices
# qjl: 1-bit residual signs
# gamma: Dynamic scaling factor for reconstruction
idx, qjl, gamma = engine.encode(x)

# 4. Decode (Decompression)
x_hat = engine.decode(idx, qjl, gamma)

# 5. Check Fidelity
similarity = torch.nn.functional.cosine_similarity(x.unsqueeze(0), x_hat.unsqueeze(0))
print(f"Reconstruction Cosine Similarity: {similarity.item():.4f}")

Usage Examples

1. LLM KV-Cache Management (16x Memory Saving)

Ideal for long-context LLM inference (e.g., Llama-3) where KV-cache memory is the primary bottleneck.

import torch
from turboquant import TurboQuantEngine

class TurboQuantKVCache:
    def __init__(self, dim=4096, bits=2):
        # 2-bit quantization compresses a 16 KB FP32 key (4096 x 4 bytes) into ~1 KB
        self.engine = TurboQuantEngine(d=dim, b=bits, cache=True)
        self.cache = [] 

    def push(self, key_tensor):
        """Compress and store in cache"""
        packet = self.engine.encode(key_tensor)
        self.cache.append(packet)

    def fetch_all(self):
        """Decompress all vectors for Attention calculation"""
        if not self.cache: return None
        return torch.stack([self.engine.decode(*p) for p in self.cache])

# Usage
kv_manager = TurboQuantKVCache(dim=4096, bits=2)
kv_manager.push(torch.randn(4096)) # Encode new key
keys = kv_manager.fetch_all()      # Restore for Attention [Seq_Len, Dim]
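The 16x figure can be sanity-checked with back-of-envelope arithmetic: a 4096-dimensional FP32 key occupies 4096 × 4 B = 16 KB, while bit-packed 2-bit codes occupy 4096 × 2 / 8 = 1 KB (the QJL sign bits and the γ scalar add a small overhead, ignored here):

```python
# Back-of-envelope memory math for one 4096-dim key vector.
dim = 4096
fp32_bytes = dim * 4              # 32-bit float = 4 bytes per component
packed_2bit_bytes = dim * 2 // 8  # 2 bits per component, bit-packed

print(fp32_bytes)                       # 16384 bytes (16 KB)
print(packed_2bit_bytes)                # 1024 bytes (1 KB)
print(fp32_bytes // packed_2bit_bytes)  # 16x
```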

2. High-Speed Vector Search

Enables high-fidelity vector databases or RAG systems with minimal storage footprint.

import torch
from turboquant import TurboQuantEngine

# 1. Setup Database (10,000 vectors)
D, B = 1024, 2
engine = TurboQuantEngine(d=D, b=B)
database = torch.randn(10000, D)

# 2. Offline Compression
compressed_db = [engine.encode(v) for v in database]

# 3. Online Search
query = torch.randn(D)
reconstructed_db = torch.stack([engine.decode(*p) for p in compressed_db])
scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), reconstructed_db)

# 4. Get Top-K
top_values, top_indices = torch.topk(scores, k=5)
print(f"Top Indices: {top_indices.tolist()}")

Mathematical Foundation

The reconstruction $\hat{x}$ is computed as:

$$\hat{x} = \Pi^T \left( \mathcal{C}_{idx} + \gamma \cdot \sqrt{\frac{\pi}{2d}} \cdot S^T q_{jl} \right)$$

Where:

  • $\Pi$: Orthogonal Rotation Matrix
  • $\mathcal{C}$: Lloyd-Max Optimal Centroids
  • $S$: QJL Projection Matrix
  • $q_{jl}$: 1-bit QJL residual signs
  • $\gamma$: Dynamic scaling factor
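As an illustration of this formula (not the library's actual implementation), the NumPy sketch below builds a random orthogonal $\Pi$ via QR decomposition and a random sign matrix as a stand-in for $S$; the centroids, indices, signs, and $\gamma$ are likewise dummies, since in practice the engine produces them:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 16, 2

# Stand-ins for the engine's trained/cached quantities.
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal rotation matrix
S = rng.choice([-1.0, 1.0], size=(d, d))           # QJL-style sign projection
C = np.linspace(-1.5, 1.5, 2 ** b)                 # dummy Lloyd-Max centroids

# Stand-ins for one encoded vector.
idx = rng.integers(0, 2 ** b, size=d)   # centroid indices
q_jl = rng.choice([-1.0, 1.0], size=d)  # 1-bit residual signs
gamma = 0.1                             # dynamic scaling factor

# x_hat = Pi^T (C[idx] + gamma * sqrt(pi / (2 d)) * S^T q_jl)
x_hat = Pi.T @ (C[idx] + gamma * np.sqrt(np.pi / (2 * d)) * (S.T @ q_jl))
print(x_hat.shape)  # (16,)
```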

Citation

If you find TurboQuant-PyTorch useful in your research or project, please cite the original paper:

Original Paper (arXiv:2504.19874)

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
@misc{ericliam2026turboquant,
  author = {Eric Liam},
  title = {TurboQuant-PyTorch: High-Performance C++/LibTorch Implementation},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/ericoder960803/TurboQuant-PyTorch}}
}

Download files

Download the file for your platform.

Source Distribution

turboquant_pytorch-0.1.1.tar.gz (11.4 kB)

Uploaded Source

File details

Details for the file turboquant_pytorch-0.1.1.tar.gz.

File metadata

  • Download URL: turboquant_pytorch-0.1.1.tar.gz
  • Upload date:
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for turboquant_pytorch-0.1.1.tar.gz:

  • SHA256: 015002e7ff26c0152273624f74e836882cf565122c5b378f98345f22e405e683
  • MD5: 6112acd0ccee0157de2cb83cedf60db8
  • BLAKE2b-256: 6b5c62c5c6714e00c1eb455f3808613c494fa2fc83ce12b14fd4a12236c952bf


Provenance

The following attestation bundles were made for turboquant_pytorch-0.1.1.tar.gz:

Publisher: python-publish.yml on ericoder960803/TurboQuant-PyTorch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
