

Project description

TurboQuant Pro

License: MIT · Python 3.9+

PCA-Matryoshka dimension reduction + TurboQuant scalar quantization for embedding compression, LLM KV caches, pgvector, and NATS transport.

Up to 27x compression with 0.979 cosine similarity. Works on consumer GPUs (Volta+) and CPU.

What's New in v0.3.0

  • PCA-Matryoshka (PCAMatryoshka): Training-free PCA rotation enables effective dimension truncation for non-Matryoshka embedding models (Bond, IEEE TAI 2026).
  • Combined pipeline (PCAMatryoshkaPipeline): PCA reduction + TurboQuant quantization in one call -- 27x compression on BGE-M3.
  • Incremental PCA (partial_fit): Update the PCA basis as new data arrives without storing the full dataset.
  • Serialization: Save/load fitted PCA models to .npz files.

Installation

pip install turboquant-pro

# With GPU support (CUDA 12.x)
pip install turboquant-pro[gpu]

# With pgvector support (PostgreSQL)
pip install turboquant-pro[pgvector]

# With NATS transport support
pip install turboquant-pro[nats]

# Everything
pip install turboquant-pro[all]

Quick Start

import numpy as np
from turboquant_pro import TurboQuantKV

# Example KV tensor with shape (batch, heads, seq, head_dim)
kv_tensor = np.random.randn(1, 16, 1024, 256).astype(np.float16)

tq = TurboQuantKV(head_dim=256, n_heads=16, bits=3, use_gpu=False)
compressed = tq.compress(kv_tensor, packed=True)    # 5.1x smaller
reconstructed = tq.decompress(compressed)           # cos_sim > 0.978

PCA-Matryoshka Compression (NEW in v0.3)

PCA-Matryoshka applies a PCA rotation to any non-Matryoshka embedding model's output, reordering dimensions by explained variance so that truncation becomes effective without retraining. Combined with TurboQuant quantization, this achieves up to 27x compression.

from turboquant_pro import PCAMatryoshka

# Fit PCA on a sample of embeddings (5-10K vectors are typically enough)
pca = PCAMatryoshka(input_dim=1024, output_dim=384)
result = pca.fit(sample_embeddings)
print(f"Variance explained: {result.total_variance_explained:.1%}")

# Create the full pipeline: PCA-384 + TurboQuant 3-bit
pipeline = pca.with_quantizer(bits=3)  # ~27x compression
print(pipeline)  # PCAMatryoshkaPipeline(1024 -> PCA-384 -> TQ3-bit, ~27.7x)

# Compress/decompress
compressed = pipeline.compress(embedding)      # 4096 bytes -> ~148 bytes
reconstructed = pipeline.decompress(compressed)  # cosine ~0.979

# Batch operations
compressed_batch = pipeline.compress_batch(embeddings_2d)
reconstructed_batch = pipeline.decompress_batch(compressed_batch)

# Measure quality
mean_cos, min_cos, std_cos = pipeline.batch_cosine_similarity(test_set)

# Save/load the fitted PCA model
pca.save("pca_bge_m3_384.npz")
pca_loaded = PCAMatryoshka.load("pca_bge_m3_384.npz")

# Incremental PCA updates (no need to store full dataset)
pca.partial_fit(new_embeddings)
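
For intuition, here is a NumPy-only sketch of the underlying idea rather than the library's internals: estimate the principal directions once from a sample, rotate each embedding into that basis, and keep only the leading coordinates. The random matrix standing in for real embeddings and the cut-off of 384 dimensions are illustrative; on real embeddings the variance concentrates in the leading directions, which is what makes the truncation cheap.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 1024)).astype(np.float32)  # stand-in for real embeddings

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)   # rows of Vt = principal directions

k = 384
x = X[0]
x_rot = Vt[:k] @ (x - mean)           # rotate, keep the top-k highest-variance coordinates
x_hat = Vt[:k].T @ x_rot + mean       # approximate reconstruction from k coordinates

cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
print(f"cosine after truncating to {k} dims: {cos:.3f}")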

PCA-Matryoshka + TurboQuant compression on BGE-M3 (1024-dim):

Configuration    Ratio   Cosine Sim   Recall@10
PCA-512 + TQ3    20.9x   0.984        78.0%
PCA-384 + TQ3    27.7x   0.979        76.4%
PCA-256 + TQ3    41.0x   0.963        78.2%
PCA-128 + TQ3    78.8x   0.923        73.0%
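
The ratios above are consistent with a payload of output_dim * bits / 8 bytes of packed indices plus a 4-byte float32 norm per vector; that per-vector layout is an assumption inferred from the numbers, not a documented format. A quick check of the PCA-384 + TQ3 row:

# Back-of-the-envelope check of the PCA-384 + TQ3 row (assumed payload layout)
dim_in, dim_out, bits = 1024, 384, 3
float32_bytes = dim_in * 4                 # 4096 bytes per original vector
payload_bytes = dim_out * bits // 8 + 4    # 144 bytes of indices + 4-byte norm = 148
print(round(float32_bytes / payload_bytes, 1))   # ~27.7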

Storage savings with PCA-Matryoshka (PCA-384 + TQ3, BGE-M3):

Dataset         Vectors   Float32    PCA+TQ3   Ratio   Saved
RAG chunks      112K      437 MB     16 MB     27.3x   421 MB
Ethics chunks   2.4M      9,375 MB   343 MB    27.3x   9,032 MB

How It Works

TurboQuant Pro implements the PolarQuant + QJL algorithm from Zandieh et al. (ICLR 2026) for compressing the key-value cache in transformer inference:

                    KV Tensor (B, H, S, D)
                           |
                    [L2 Norm Extract]
                           |
                    [Unit Normalize]
                           |
                   [Random Rotation Pi]        <-- QR of Gaussian matrix
                           |
                [Lloyd-Max Scalar Quantize]    <-- b-bit per coordinate
                           |
                     [Bit-Pack Indices]        <-- 8x3-bit = 3 bytes
                           |
              CompressedKV {indices, norms, bits}
                           |
                     [Unpack + Lookup]
                           |
                   [Inverse Rotation]
                           |
                    [Scale by Norms]
                           |
                Reconstructed KV Tensor

Key idea: A random orthogonal rotation maps head-dimension vectors onto the unit hypersphere, making coordinates approximately i.i.d. Gaussian. This enables efficient scalar quantization with precomputed Lloyd-Max codebooks.
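
The sketch below illustrates the rotate-then-scalar-quantize idea in plain NumPy. It is not the library's implementation: the random rotation comes from a QR decomposition as described above, but a uniform codebook stands in for the precomputed Lloyd-Max codebook, and bit-packing is omitted (eight 3-bit indices would pack into 3 bytes).

import numpy as np

rng = np.random.default_rng(42)
d, bits = 256, 3

# Random orthogonal rotation: QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d).astype(np.float32)   # one head-dim vector
norm = np.linalg.norm(x)                        # stored alongside the indices
z = Q @ (x / norm)                              # unit-normalize, then rotate

# Uniform codebook as a stand-in for the precomputed Lloyd-Max codebook
codebook = np.linspace(z.min(), z.max(), 2 ** bits)
idx = np.abs(z[:, None] - codebook[None, :]).argmin(axis=1)   # b-bit index per coordinate

# Decompress: look up, apply the inverse rotation, rescale by the stored norm
x_hat = norm * (Q.T @ codebook[idx])
cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
print(f"cosine after {bits}-bit round-trip: {cos:.3f}")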

Benchmark Results

Compression quality and ratios on random Gaussian KV tensors (head_dim=256, n_heads=16, fp16 baseline):

Bits   Compression Ratio   Cosine Similarity   MSE
2      7.5x                0.926               0.001178
3      5.1x                0.978               0.000349
4      3.9x                0.995               0.000082
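
As with the PCA pipeline, these ratios line up with b-bit packed indices plus an assumed 4-byte norm per head-dim vector, measured against the fp16 baseline:

# Assumed per-vector layout; consistent with the 2/3/4-bit rows above
head_dim = 256
for bits in (2, 3, 4):
    fp16_bytes = head_dim * 2                   # 512 bytes per head-dim vector
    packed_bytes = head_dim * bits // 8 + 4     # packed indices + 4-byte norm
    print(bits, round(fp16_bytes / packed_bytes, 1))   # 7.5, 5.1, 3.9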

Memory estimates for popular models at 8K context (3-bit, packed):

Model           Original   Compressed   Saved      Ratio
Llama 3.1 8B    0.500 GB   0.098 GB     0.402 GB   5.1x
Llama 3.1 70B   1.250 GB   0.244 GB     1.006 GB   5.1x
Gemma 4 27B     1.125 GB   0.220 GB     0.905 GB   5.1x
Mistral 7B      2.000 GB   0.391 GB     1.609 GB   5.1x

Streaming Cache

TurboQuant Pro includes a streaming tiered cache for autoregressive generation:

  • L1 (hot window): Recent tokens stored uncompressed for zero-latency attention
  • L2 (cold storage): Older tokens bit-packed at b-bit precision (~5x compression)

from turboquant_pro import TurboQuantKVCache

cache = TurboQuantKVCache(head_dim=256, n_heads=16, bits=3, hot_window=512)

for token in tokens:
    k, v = model.forward_one(token)
    cache.append(k, v)                          # auto-compresses old entries
    keys = cache.get_keys(0, cache.length)       # seamless hot+cold retrieval
    values = cache.get_values(0, cache.length)

pgvector Embedding Compression

TurboQuant Pro can compress high-dimensional embeddings stored in PostgreSQL pgvector, reducing storage by 10x (from float32) or 5x (from float16):

from turboquant_pro import TurboQuantPGVector

tq = TurboQuantPGVector(dim=1024, bits=3, seed=42)

# Compress a single embedding (4096 bytes -> 388 bytes)
compressed = tq.compress_embedding(embedding_float32)

# Store as bytea in PostgreSQL
bytea_data = compressed.to_pgbytea()

# Batch compress for bulk operations
compressed_batch = tq.compress_batch(embeddings_array)

# Search compressed embeddings
scores = tq.compressed_cosine_similarity(query, compressed_batch)

# PostgreSQL integration
tq.create_compressed_table(conn, "embeddings_compressed")
tq.insert_compressed(conn, "embeddings_compressed", ids, embeddings)
results = tq.search_compressed(conn, "embeddings_compressed", query, top_k=10)

Storage savings for real workloads (1024-dim BGE-M3, 3-bit):

Dataset         Vectors   Float32    Compressed   Ratio   Saved
RAG chunks      112K      437 MB     41 MB        10.5x   396 MB
Ethics chunks   2.4M      9,375 MB   893 MB       10.5x   8,482 MB
Publications    824K      3,222 MB   307 MB       10.5x   2,915 MB

NATS Transport Codec

Compress embeddings for transmission over NATS JetStream or any message bus:

from turboquant_pro import TurboQuantNATSCodec

codec = TurboQuantNATSCodec(dim=1024, bits=3, seed=42)

# Encode for transport (4096 bytes -> 392 bytes)
payload = codec.encode(embedding_float32)

# Decode on the receiving end
embedding_approx = codec.decode(payload)

# Batch operations
payloads = codec.encode_batch(embeddings_2d)
embeddings = codec.decode_batch(payloads)

# Check compression stats
print(codec.stats())
# {'dim': 1024, 'bits': 3, 'payload_bytes': 392,
#  'float32_bytes': 4096, 'compression_ratio': 10.45, ...}
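
A minimal sketch of wiring the codec into a NATS publisher with nats-py is shown below. It assumes a running NATS server with a JetStream stream bound to the (made-up) embeddings.* subject; the stand-in embedding is random data, and subscribers would reconstruct with codec.decode(msg.data) using an identically configured codec.

import asyncio
import numpy as np
import nats
from turboquant_pro import TurboQuantNATSCodec

async def main():
    codec = TurboQuantNATSCodec(dim=1024, bits=3, seed=42)
    embedding = np.random.randn(1024).astype(np.float32)   # stand-in embedding

    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # Publish the compressed payload; the consumer decodes msg.data
    # with an identically configured codec.
    await js.publish("embeddings.compressed", codec.encode(embedding))
    await nc.close()

asyncio.run(main())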

Components

Class                    Purpose
PCAMatryoshka            PCA rotation + truncation for dimension reduction
PCAMatryoshkaPipeline    Combined PCA + TurboQuant end-to-end pipeline
TurboQuantKV             Stateless compress/decompress with optional bit-packing
TurboQuantKVCache        Streaming L1/L2 tiered cache for autoregressive inference
CompressedKV             Container dataclass for compressed tensors
TurboQuantPGVector       Compress pgvector embeddings for PostgreSQL storage
CompressedEmbedding      Container for a single compressed embedding
PCACompressedEmbedding   Container for a PCA-reduced + quantized embedding
TurboQuantNATSCodec      Encode/decode embeddings for NATS transport

Integration Options

llama.cpp / llama-cpp-python

See examples/llama_integration.py for a wrapper pattern that intercepts KV tensors and stores them in a TurboQuantKVCache.

vLLM

TurboQuant Pro can be integrated into vLLM's PagedAttention by compressing cold KV pages:

# Conceptual: compress a page of KV cache
tq = TurboQuantKV(head_dim=128, n_heads=8, bits=3)
compressed_page = tq.compress(kv_page, packed=True)
# Store compressed_page instead of raw fp16

HuggingFace Transformers

Wrap the KV cache in generate() by subclassing the model's attention:

from turboquant_pro import TurboQuantKV

# Conceptual: override the cache update in the attention layer
tq = TurboQuantKV(head_dim=128, n_heads=32, bits=3)  # match the model's KV layout
compressed_k = tq.compress(key_states, packed=True)
compressed_v = tq.compress(value_states, packed=True)
# Decompress with tq.decompress(...) when computing attention scores

GPU Acceleration

When CuPy is available, TurboQuant Pro uses CUDA RawKernels for bit-packing operations. All kernels are Volta-compatible (compute capability 7.0+).

tq = TurboQuantKV(head_dim=256, n_heads=16, bits=3, use_gpu=True)
# Automatically uses CuPy for rotation, quantization, and bit-packing

Falls back to NumPy automatically when CuPy is not installed.
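
A common pattern for this kind of optional-dependency fallback looks roughly like the following; it illustrates the behavior described above and is not the library's actual code.

# Illustrative fallback: prefer CuPy when installed, otherwise use NumPy
try:
    import cupy as xp
    HAS_GPU = True
except ImportError:
    import numpy as xp
    HAS_GPU = False

def rotate(vectors, rotation):
    # Same array API either way; xp is cupy on GPU, numpy on CPU
    return xp.asarray(rotation) @ xp.asarray(vectors)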

Citation

If you use TurboQuant Pro in your research, please cite both this implementation and the original algorithm:

@software{bond2026turboquantpro,
  title={TurboQuant Pro: PCA-Matryoshka + TurboQuant Compression for Embeddings and LLM KV Caches},
  author={Bond, Andrew H.},
  year={2026},
  url={https://github.com/ahb-sjsu/turboquant-pro},
  license={MIT}
}

@article{bond2026pcamatryoshka,
  title={PCA-Matryoshka: Enabling Effective Dimension Reduction for Non-Matryoshka Embedding Models with Applications to Vector Database Compression},
  author={Bond, Andrew H.},
  journal={IEEE Transactions on Artificial Intelligence},
  year={2026}
}

@inproceedings{zandieh2026sublinear,
  title={Sub-linear Memory Inference via PolarQuant and QJL},
  author={Zandieh, Amir and Han, Insu and Daliri, Majid and Karbasi, Amin},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}

Acknowledgments

  • Algorithm: Zandieh, Han, Daliri, and Karbasi -- "Sub-linear Memory Inference via PolarQuant and QJL" (ICLR 2026)
  • Origin: Adapted from the Theory Radar project's TurboBeam beam-search compression, which first implemented PolarQuant+QJL in Python
  • Author: Andrew H. Bond, San Jose State University

License

MIT License. See LICENSE for details.



Download files

Download the file for your platform.

Source Distribution

turboquant_pro-0.4.0.tar.gz (2.6 MB)

Built Distribution

turboquant_pro-0.4.0-py3-none-any.whl (41.9 kB)

File details

Details for the file turboquant_pro-0.4.0.tar.gz.

File metadata

  • Download URL: turboquant_pro-0.4.0.tar.gz
  • Size: 2.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for turboquant_pro-0.4.0.tar.gz

Algorithm     Hash digest
SHA256        0fcb7a73adc9a4adb31ddfca28b91b5c3ba0b60a206c68451e7e4725a9647dd7
MD5           b4199c62c1fd1ce790297986432f2394
BLAKE2b-256   53a82ef68800e97a882da9bfacce64c921e246e2075a099be5f4acd702cf4e04


File details

Details for the file turboquant_pro-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: turboquant_pro-0.4.0-py3-none-any.whl
  • Size: 41.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for turboquant_pro-0.4.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        be83de42fcf3918b79e548100af94afefde184524c384c12c9cd554c987a4acb
MD5           49191c67e645b71e8201249ad6ffff2d
BLAKE2b-256   2e28b75cb04eed0a0add662ca8fcf7adb518c600bd536cb122a83ab8ffa39ee6

