
High-performance Quantization Engine with Lloyd-Max and QJL

Project description

TurboQuant-PyTorch

High-Performance Vector Quantization Engine

A PyTorch/C++ implementation of the TurboQuant paper

License: MIT

English Version | Traditional Chinese Version


English Version

TurboQuant is a specialized, high-performance vector quantization library designed for Large Language Models (LLMs) and vector search applications. By offloading core computations to C++ and combining Lloyd-Max quantization with QJL residual compensation, it significantly reduces memory overhead while maintaining near-lossless precision.

Key Features

  • Turbo-Charged C++ Core: Core operations like rotation, projection, and quantization are implemented in optimized C++ for millisecond-level inference.
  • Lloyd-Max Optimization: Automatically computes the most efficient centroids for Gaussian distributions using SciPy's k-means.
  • Unbiased Residual Compensation: Uses QJL signs to preserve vector magnitude and direction, minimizing cumulative error in deep networks.
  • Smart Matrix Caching: Automatically caches trained centroids ($\mathcal{C}$) and orthogonal matrices ($\Pi, S$) for instant engine startup.
  • Adaptive Dimension Support: Fully compatible with any dimension $d$ and any bit-rate $b$ (from 1-bit to 8-bit).

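The Lloyd-Max centroid computation described above can be sketched with SciPy's k-means on samples drawn from a standard Gaussian. This is a minimal illustration of the idea, not the library's internal code:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Lloyd-Max quantization of a Gaussian source reduces to k-means on
# N(0, 1) samples: the converged centroids minimize mean-squared error.
rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000).reshape(-1, 1)

b = 2            # target bit-rate
k = 2 ** b       # 4 quantization levels for 2-bit
codebook, distortion = kmeans(samples, k, seed=0)
centroids = np.sort(codebook.ravel())

# The classic 2-bit Lloyd-Max levels for N(0, 1) are roughly
# [-1.51, -0.45, 0.45, 1.51]; k-means converges close to these.
print(centroids)
```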

Comparison

TurboQuant Performance Benchmark

How to read the benchmark chart:

  • Y-Axis (Fidelity): Higher Cosine Similarity means more accurate vector reconstruction.
  • X-Axis (Latency): Lower values indicate faster C++/LibTorch execution.
  • Bubble Size (Memory): Larger bubbles represent higher memory compression ratios.
    • 1-bit: 32x Compression (Largest Bubble)
    • 2-bit: 16x Compression
    • 4-bit: 8x Compression
    • Int8 (Baseline): 4x Compression (Smallest Bubble)
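The ratios above follow directly from the FP32 baseline: reducing a 32-bit float to $b$ bits yields a $32/b\times$ compression factor (the per-vector QJL sign bits and scaling factor add a small overhead that this sketch ignores):

```python
# Compression ratio relative to FP32 (32 bits per value), ignoring the
# small per-vector overhead of the QJL sign bits and the scaling factor.
ratios = {bits: 32 // bits for bits in (1, 2, 4, 8)}
for bits, ratio in ratios.items():
    print(f"{bits}-bit: {ratio}x compression")
```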

Installation

# Clone the repository
git clone https://github.com/ericoder960803/TurboQuant.git

cd TurboQuant

# Install in editable mode (Builds C++ extension automatically)
pip install -e .

Usage

TurboQuant provides a seamless PyTorch-like API. You can easily integrate it into your inference pipeline.

Quick Start

import torch
from turboquant import TurboQuantEngine

# 1. Initialize the engine
# d: vector dimension, b: target bit-rate (1, 2, 4, or 8)
d = 1024
b = 2
engine = TurboQuantEngine(d=d, b=b, cache=True)

# 2. Prepare your high-precision vector (FP32)
x = torch.randn(d)

# 3. Encode (Compression)
# idx: Lloyd-Max centroids indices
# qjl: 1-bit residual signs
# gamma: Dynamic scaling factor for reconstruction
idx, qjl, gamma = engine.encode(x)

# 4. Decode (Decompression)
x_hat = engine.decode(idx, qjl, gamma)

# 5. Check Fidelity
similarity = torch.nn.functional.cosine_similarity(x.unsqueeze(0), x_hat.unsqueeze(0))
print(f"Reconstruction Cosine Similarity: {similarity.item():.4f}")

Usage Examples

1. LLM KV-Cache Management (16x Memory Saving)

Ideal for long-context LLM inference (e.g., Llama-3) where KV-cache memory is the primary bottleneck.

import torch
from turboquant import TurboQuantEngine

class TurboQuantKVCache:
    def __init__(self, dim=4096, bits=2):
        # 2-bit quantization compresses 4KB into 0.25KB
        self.engine = TurboQuantEngine(d=dim, b=bits, cache=True)
        self.cache = [] 

    def push(self, key_tensor):
        """Compress and store in cache"""
        packet = self.engine.encode(key_tensor)
        self.cache.append(packet)

    def fetch_all(self):
        """Decompress all vectors for Attention calculation"""
        if not self.cache: return None
        return torch.stack([self.engine.decode(*p) for p in self.cache])
# Usage
kv_manager = TurboQuantKVCache(dim=4096, bits=2)
kv_manager.push(torch.randn(4096)) # Encode new key
keys = kv_manager.fetch_all()      # Restore for Attention [Seq_Len, Dim]

2. High-Speed Vector Search

Enables high-fidelity vector databases or RAG systems with minimal storage footprint.

import torch
from turboquant import TurboQuantEngine

# 1. Setup Database (10,000 vectors)
D, B = 1024, 2
engine = TurboQuantEngine(d=D, b=B)
database = torch.randn(10000, D)

# 2. Offline Compression
compressed_db = [engine.encode(v) for v in database]

# 3. Online Search
query = torch.randn(D)
reconstructed_db = torch.stack([engine.decode(*p) for p in compressed_db])
scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), reconstructed_db)

# 4. Get Top-K
top_values, top_indices = torch.topk(scores, k=5)
print(f"Top Indices: {top_indices.tolist()}")

Mathematical Foundation

The reconstruction $\hat{x}$ is computed as:

$$\hat{x} = \Pi^T \left( \mathcal{C}_{idx} + \gamma \cdot \sqrt{\frac{\pi}{2d}} \cdot S^T q_{\mathrm{qjl}} \right)$$

Where:

  • $\Pi$: Orthogonal Rotation Matrix
  • $\mathcal{C}$: Lloyd-Max Optimal Centroids
  • $S$: QJL Projection Matrix
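A minimal NumPy sketch of this reconstruction, using randomly generated orthogonal matrices and a toy codebook as stand-ins for the engine's trained $\Pi$, $\mathcal{C}$, and $S$ (these stand-ins are assumptions for illustration, not the library's cached matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4                                         # toy dimension, 2-bit codebook

# Stand-ins for the engine's trained/cached matrices.
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal rotation
S, _ = np.linalg.qr(rng.standard_normal((d, d)))    # QJL projection matrix
C = np.array([-1.510, -0.4528, 0.4528, 1.510])      # 2-bit Lloyd-Max centroids

idx = rng.integers(0, k, size=d)       # per-coordinate centroid indices
q = rng.choice([-1.0, 1.0], size=d)    # 1-bit QJL residual signs
gamma = 0.3                            # dynamic scaling factor

# x_hat = Pi^T ( C[idx] + gamma * sqrt(pi / (2d)) * S^T q )
x_hat = Pi.T @ (C[idx] + gamma * np.sqrt(np.pi / (2 * d)) * (S.T @ q))
print(x_hat.shape)
```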

Citation

If you find TurboQuant-PyTorch useful in your research or project, please cite the original paper:

Original Paper (arXiv:2504.19874)

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
@misc{ericliam2026turboquant,
  author = {Eric Liam},
  title = {TurboQuant-PyTorch: High-Performance C++/LibTorch Implementation},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/ericoder960803/TurboQuant-PyTorch}}
}

Project details


Download files

Download the file for your platform.

Source Distribution

turboquant_pytorch-0.1.2.tar.gz (11.4 kB)

Uploaded Source

Built Distribution


turboquant_pytorch-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)

Uploaded: CPython 3.10 · manylinux: glibc 2.17+ · x86-64

File details

Details for the file turboquant_pytorch-0.1.2.tar.gz.

File metadata

  • Download URL: turboquant_pytorch-0.1.2.tar.gz
  • Upload date:
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for turboquant_pytorch-0.1.2.tar.gz
Algorithm Hash digest
SHA256 8da66aa046dbe6041ad29fd83a5aad783107c3af10ed028d8125e1c2e9c965dd
MD5 2fda98b2ab8a72708cf8aab5728b731a
BLAKE2b-256 5eb9c833764ea3c7aa6f53dea3670d6c8653c1cdc426e5c6c1339748d924c5b0
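The published digests can be verified locally with Python's hashlib; this is a generic sketch, and the file path is whatever you downloaded:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the hex SHA256 digest of a file, read in 8 KiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest published above:
# assert sha256_of("turboquant_pytorch-0.1.2.tar.gz") == (
#     "8da66aa046dbe6041ad29fd83a5aad783107c3af10ed028d8125e1c2e9c965dd"
# )
```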


Provenance

The following attestation bundles were made for turboquant_pytorch-0.1.2.tar.gz:

Publisher: python-publish.yml on ericoder960803/TurboQuant-PyTorch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file turboquant_pytorch-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for turboquant_pytorch-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d772d8069b3437c56b7159bbf9662795def0c6098f7796d2e74a9ccf07c50cfd
MD5 ce8ec03dca0ef3540db1a9c7d4d81ec5
BLAKE2b-256 c9c3006b3d430abb56c5cf4074c4cdcc87a56a5db15b590a282ead2fd03bb9f7


Provenance

The following attestation bundles were made for turboquant_pytorch-0.1.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: python-publish.yml on ericoder960803/TurboQuant-PyTorch

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
