TurboQuant vector store for LangChain — 6x memory reduction with training-free quantization

langchain-turboquant

The first LangChain integration for TurboQuant - Google Research's training-free vector compression algorithm (ICLR 2026).

Drop-in replacement for any LangChain vector store with ~6x memory reduction and near-zero accuracy loss. No GPU required.

Korean README

Why langchain-turboquant?

Large-scale RAG pipelines store millions of embedding vectors in memory. At 1536 dimensions (OpenAI text-embedding-3-small), each float32 vector takes 6 KB (1536 x 4 bytes). A million vectors need about 6 GB for the embeddings alone.

TurboQuant compresses these vectors to ~1 KB each (3-bit quantization), cutting memory by 6x while preserving search accuracy. Unlike Product Quantization (PQ) or IVFPQ, TurboQuant requires no codebook training - it works out of the box on any embedding.
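The uncompressed figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the helper name `embedding_memory_bytes` is hypothetical and not part of the package, and it counts only the raw float32 values (no metadata or index overhead).

```python
def embedding_memory_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw memory for dense embeddings; float32 means 4 bytes per value."""
    return num_vectors * dim * bytes_per_value


per_vector = embedding_memory_bytes(1, 1536)      # one 1536-dim float32 vector
total = embedding_memory_bytes(1_000_000, 1536)   # one million such vectors

print(per_vector, "bytes per vector")             # 6144 bytes = 6 KB
print(total / 1e9, "GB for one million vectors")  # about 6 GB
```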

Feature	langchain-turboquant	FAISS (PQ)	Chroma
Compression ratio	~6x (3-bit)	~4x (8-bit PQ)	1x (none)
Training required	No	Yes (codebook)	N/A
Drop-in LangChain	Yes	Partial	Yes
GPU required	No	Optional	No
Asymmetric search	Yes	Yes	N/A

How It Works

TurboQuant implements the two-stage compression algorithm from Google Research (ICLR 2026):

Stage 1: PolarQuant (MSE-optimal scalar quantization)

Random orthogonal rotation: Multiply the vector by a random orthogonal matrix. This "isotropizes" the coordinates so each one follows the same distribution (the hypersphere marginal).
Lloyd-Max quantization: Quantize each rotated coordinate independently using a pre-computed optimal codebook for the hypersphere marginal PDF.

The codebook is computed analytically from the distribution - no training data needed.
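To make the two Stage 1 steps concrete, here is a rough NumPy sketch. It is not the package's implementation: all function names are hypothetical, and the codebook is fitted with an empirical Lloyd-Max iteration on sampled coordinates of random unit vectors rather than computed analytically from the hypersphere marginal PDF as described above.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the rotation is uniformly distributed


def lloyd_max_codebook(samples, bits, iters=50):
    """Empirical 1-D Lloyd-Max: alternate nearest-centroid assignment and mean update."""
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(boundaries, samples)
        for j in range(levels):
            members = samples[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids


def encode(unit_vectors, rotation, centroids):
    """Rotate each row, then map every coordinate to its nearest centroid index."""
    rotated = unit_vectors @ rotation.T
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(boundaries, rotated).astype(np.uint8)


def decode(codes, rotation, centroids):
    """Look up centroids and undo the rotation."""
    return centroids[codes] @ rotation


dim, bits = 64, 3
rng = np.random.default_rng(1)
x = rng.standard_normal((500, dim))
x /= np.linalg.norm(x, axis=1, keepdims=True)  # unit vectors to compress

rotation = random_rotation(dim)

# Stand-in for the analytic codebook: fit on coordinates of random unit vectors.
train = rng.standard_normal((2000, dim))
train /= np.linalg.norm(train, axis=1, keepdims=True)
centroids = lloyd_max_codebook(train.ravel(), bits)

codes = encode(x, rotation, centroids)
x_hat = decode(codes, rotation, centroids)
mse = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
print(f"mean squared reconstruction error at {bits} bits: {mse:.4f}")
```

Because the codebook depends only on the coordinate distribution after rotation, the same centroids can be reused for any input vectors of that dimension, which is why no training data is needed.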

Stage 2: QJL (Quantized Johnson-Lindenstrauss residual correction)

Compute the quantization residual (difference between original and Stage 1 reconstruction).
Project the residual through a random Gaussian matrix.
Store only the sign bits (1 bit per dimension) of the projection.

At query time, an asymmetric estimator computes approximate inner products directly on compressed data - the query stays in full precision while stored vectors remain compressed.
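The residual correction and the asymmetric estimator can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the package's code: the function names are hypothetical, and the estimator uses the standard identity E[(s·q) sign(s·r)] = sqrt(2/pi) <q, r> / ||r|| for a Gaussian vector s, so the stored residual norm is needed to rescale the result.

```python
import numpy as np


def qjl_encode(residual, projection):
    """Store only the sign bits of the projected residual, plus the residual norm."""
    signs = np.sign(projection @ residual)
    signs[signs == 0] = 1.0
    return signs.astype(np.int8), float(np.linalg.norm(residual))


def qjl_inner_product(query, signs, residual_norm, projection):
    """Asymmetric estimate of <query, residual>; the query stays in full precision."""
    m = projection.shape[0]
    projected_query = projection @ query
    return float(np.sqrt(np.pi / 2) / m * residual_norm * (projected_query @ signs))


dim, m = 128, 128
rng = np.random.default_rng(0)
query = rng.standard_normal(dim)
residual = 0.1 * rng.standard_normal(dim)  # stand-in for a Stage 1 residual

# Average over many independent projections to illustrate that the estimator is unbiased.
estimates = []
for _ in range(2000):
    projection = rng.standard_normal((m, dim))
    signs, norm = qjl_encode(residual, projection)
    estimates.append(qjl_inner_product(query, signs, norm, projection))

exact = float(query @ residual)
mean_estimate = float(np.mean(estimates))
print(f"exact={exact:.4f}  mean estimate={mean_estimate:.4f}")
```

In a store, a single projection matrix (fixed by the seed) would be shared by all vectors; a single estimate is noisy, but it only has to correct the small Stage 1 residual term of the inner product.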

Compression Math

For dimension d with b-bit quantization and QJL dimension m:

Compressed bits per vector = d * b + m * 1 + 32 + 32
                           = d * (b + 1) + 64        (when m = d, the default)

Original bits per vector   = d * 32

Compression ratio          = 32d / (d * (b+1) + 64)

At d=1536, b=3, m=d: ratio = 49152 / 6208 ≈ 7.9x by this formula (theoretical) / ~6x (practical with uint8 storage)
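The formula can be evaluated directly for other configurations. The helpers below are a hypothetical sketch (they are not exported by the package) and compute the bit count exactly as written above; the ratio achieved in memory depends on how codes are packed and may be lower.

```python
from typing import Optional


def compressed_bits(dim: int, bits: int, qjl_dim: Optional[int] = None) -> int:
    """d*b code bits + m sign bits + two 32-bit scalars per vector."""
    m = dim if qjl_dim is None else qjl_dim
    return dim * bits + m + 32 + 32


def compression_ratio(dim: int, bits: int, qjl_dim: Optional[int] = None) -> float:
    """Ratio of float32 storage (32 bits per dimension) to compressed storage."""
    return 32 * dim / compressed_bits(dim, bits, qjl_dim)


for dim, bits in [(384, 3), (1536, 2), (1536, 3), (1536, 4)]:
    ratio = compression_ratio(dim, bits)
    print(f"d={dim} b={bits}: {compressed_bits(dim, bits)} bits, ratio {ratio:.2f}x")
```

Passing a smaller `qjl_dim` trades estimator accuracy for a higher ratio, since each QJL dimension costs one bit.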

Installation

pip install langchain-turboquant

Or install from source:

git clone https://github.com/wjddusrb03/langchain-turboquant.git
cd langchain-turboquant
pip install -e ".[dev]"

Dependencies

Python >= 3.9
NumPy >= 1.21
SciPy >= 1.7
LangChain Core >= 0.3

Quick Start

from langchain_turboquant import TurboQuantVectorStore
from langchain_openai import OpenAIEmbeddings

# Create a compressed vector store (3-bit = ~6x compression)
store = TurboQuantVectorStore(embedding=OpenAIEmbeddings(), bits=3)

# Add documents - just like any LangChain vector store
store.add_texts(
    ["TurboQuant compresses vectors by 6x",
     "LangChain is a framework for LLM applications",
     "RAG combines retrieval with generation"],
    metadatas=[{"topic": "compression"}, {"topic": "framework"}, {"topic": "rag"}]
)

# Search
results = store.similarity_search("How does compression work?", k=2)
for doc in results:
    print(doc.page_content)

# Check memory savings
print(store.memory_stats())
# {'num_documents': 3, 'dimension': 1536, 'bits': 3,
#  'compression_ratio': '7.7x', 'memory_saved_pct': '87.0%'}

Use as a LangChain Retriever

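from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# `store` is the TurboQuantVectorStore created in Quick Start
retriever = store.as_retriever(search_kwargs={"k": 3})

# Example prompt (wording is illustrative); it must accept `context` and `question`
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI()
)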

No API Key Demo

Run the included demo with fake embeddings (no API key needed):

python examples/rag_demo.py

API Reference

TurboQuantVectorStore

TurboQuantVectorStore(
    embedding: Embeddings,          # Any LangChain embedding model
    bits: int = 3,                  # Quantization bits (1-4, recommended: 3)
    qjl_dim: Optional[int] = None,  # QJL dimensions (default: same as embedding dim)
    seed: int = 42,                 # Random seed for reproducibility
)

Methods:

Method	Description
`add_texts(texts, metadatas, ids)`	Embed, compress, and store texts
`similarity_search(query, k)`	Return top-k most similar documents
`similarity_search_with_score(query, k)`	Return top-k with cosine similarity scores
`similarity_search_by_vector(vector, k)`	Search by pre-computed embedding vector
`from_texts(texts, embedding, ...)`	Class method to create and populate store
`delete(ids)`	Delete documents by ID
`get_by_ids(ids)`	Retrieve documents by ID
`as_retriever(**kwargs)`	Convert to LangChain Retriever
`save(path)`	Persist store to disk
`load(path, embedding)`	Load store from disk
`memory_stats()`	Get compression statistics

TurboQuantizer (Low-level API)

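import numpy as np
from langchain_turboquant import TurboQuantizer

quantizer = TurboQuantizer(dim=1536, bits=3)

# Example inputs (random data; the float32 dtype and 1-D query shape are assumptions)
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 1536)).astype(np.float32)
query_vector = rng.standard_normal(1536).astype(np.float32)

# Compress vectors
compressed = quantizer.quantize(vectors)  # (n, 1536) -> CompressedVectors

# Asymmetric search (query in full precision, database compressed)
scores = quantizer.cosine_scores(query_vector, compressed)

# Reconstruct (for evaluation)
reconstructed = quantizer.dequantize(compressed)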

Compression Ratios by Configuration

Dimension	Bits	Theoretical Ratio	Memory Saved
384	3	5.8x	82.8%
768	3	6.8x	85.3%
1536	3	7.3x	86.3%
3072	3	7.7x	87.0%
1536	2	9.5x	89.5%
1536	4	6.1x	83.6%

Higher dimensions benefit more from compression (the fixed 64-bit overhead for norms/gammas becomes negligible).

Testing

The project includes 296 comprehensive tests covering:

Mathematical correctness (83 tests): Lloyd-Max codebook properties, rotation matrix orthogonality, MSE bounds, PDF integration, centroid conditions
Edge cases (35 tests): NaN/Inf vectors, empty arrays, Unicode text, dim=1/2/3, zero vectors, large batches
Search recall (44 tests): Top-k recall at various k/n/dim/bits, cluster discrimination, asymmetric estimator statistics, Pearson correlation
Persistence (29 tests): Save/load roundtrips, serialization formats, state consistency after add/delete cycles
Rigorous validation (68 tests): Compression ratios, performance benchmarks, score ordering, reconstruction quality
Core functionality (37 tests): VectorStore CRUD, quantizer operations, LangChain integration

# Run all tests
pytest tests/ -v

# Run specific test suite
pytest tests/test_math_stress.py -v     # Mathematical properties
pytest tests/test_recall_extensive.py -v # Search recall
pytest tests/test_edge_cases.py -v       # Edge cases
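For readers unfamiliar with the search-recall metric mentioned above, a top-k recall check can be sketched like this. It is a generic illustration, not code from the test suite: `recall_at_k` is a hypothetical helper, and the noisy scores are only a stand-in for scores computed on compressed vectors.

```python
import numpy as np


def recall_at_k(exact_scores, approx_scores, k):
    """Fraction of the exact top-k neighbours that also appear in the approximate top-k."""
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    hits = [len(set(e.tolist()) & set(a.tolist())) for e, a in zip(exact_top, approx_top)]
    return float(np.mean(hits)) / k


rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 64))
queries = rng.standard_normal((20, 64))

exact = queries @ database.T
# Stand-in for compressed-domain scores: exact scores plus small noise.
noisy = exact + 0.05 * rng.standard_normal(exact.shape)

print("recall@10, exact vs exact:", recall_at_k(exact, exact, 10))
print("recall@10, exact vs noisy:", recall_at_k(exact, noisy, 10))
```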

Architecture

langchain-turboquant/
├── src/langchain_turboquant/
│   ├── __init__.py          # Package exports
│   ├── lloyd_max.py         # Lloyd-Max optimal codebook computation
│   ├── quantizer.py         # TurboQuantizer (PolarQuant + QJL)
│   └── vectorstore.py       # LangChain VectorStore integration
├── tests/
│   ├── test_quantizer.py    # Core quantizer tests
│   ├── test_vectorstore.py  # VectorStore API tests
│   ├── test_rigorous.py     # Rigorous validation
│   ├── test_math_stress.py  # Mathematical properties
│   ├── test_edge_cases.py   # Edge cases
│   ├── test_recall_extensive.py  # Search recall
│   └── test_persistence.py  # Persistence tests
├── examples/
│   └── rag_demo.py          # Working RAG demo (no API key needed)
├── pyproject.toml
├── LICENSE
└── README.md

References

TurboQuant: Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (ICLR 2026). arXiv:2504.19874
PolarQuant: Han et al., "PolarQuant: Quantizing KV Caches with Polar Transformation" (AISTATS 2026). arXiv:2502.02617
QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AAAI 2025). arXiv:2406.03482
LangChain: langchain.com

Contributing

Contributions are welcome! If you find a bug, have a feature request, or want to improve the code:

Open an Issue describing the problem or idea
Fork the repo and create a branch
Write tests for your changes
Submit a Pull Request

Please report any problems or suggestions in the Issues tab. All feedback is appreciated!

License

MIT License - see LICENSE for details.

Version 0.1.0, released Mar 27, 2026
