TurboQuant vector store for LangChain — 6x memory reduction with training-free quantization
langchain-turboquant
The first LangChain integration for TurboQuant - Google Research's training-free vector compression algorithm (ICLR 2026).
Drop-in replacement for any LangChain vector store with ~6x memory reduction and near-zero accuracy loss. No GPU required.
Why langchain-turboquant?
Large-scale RAG pipelines store millions of embedding vectors in memory. At 1536 dimensions (OpenAI text-embedding-3-small), each vector takes 6 KB. A million vectors = 6 GB just for embeddings.
TurboQuant compresses these vectors to ~1 KB each (3-bit quantization), cutting memory by 6x while preserving search accuracy. Unlike Product Quantization (PQ) or IVFPQ, TurboQuant requires no codebook training - it works out of the box on any embedding.
| Feature | langchain-turboquant | FAISS (PQ) | Chroma |
|---|---|---|---|
| Compression ratio | ~6x (3-bit) | ~4x (8-bit PQ) | 1x (none) |
| Training required | No | Yes (codebook) | N/A |
| Drop-in LangChain | Yes | Partial | Yes |
| GPU required | No | Optional | No |
| Asymmetric search | Yes | Yes | N/A |
How It Works
TurboQuant implements the two-stage compression algorithm from Google Research (ICLR 2026):
Stage 1: PolarQuant (MSE-optimal scalar quantization)
- Random orthogonal rotation: Multiply the vector by a random orthogonal matrix. This "isotropizes" the coordinates so each one follows the same distribution (the hypersphere marginal).
- Lloyd-Max quantization: Quantize each rotated coordinate independently using a pre-computed optimal codebook for the hypersphere marginal PDF.
The codebook is computed analytically from the distribution - no training data needed.
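The two Stage 1 steps can be sketched in plain NumPy. This is an illustrative sketch, not the package's implementation: the function names are hypothetical, the codebook is found by running Lloyd iterations on a numerical grid of the hypersphere marginal density (proportional to (1 - x^2)^((d-3)/2)), and it assumes dim >= 3.

```python
import numpy as np


def random_rotation(dim, seed=42):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flipping column signs by sign(diag(r)) makes the draw uniform (Haar).
    return q * np.sign(np.diag(r))


def lloyd_max_codebook(dim, bits, grid_size=20001, iters=200):
    """Lloyd-Max centroids for one coordinate of a random unit vector (dim >= 3)."""
    x = np.linspace(-1.0, 1.0, grid_size)
    pdf = (1.0 - x**2) ** ((dim - 3) / 2.0)  # hypersphere marginal, unnormalized
    span = min(1.0, 3.0 / np.sqrt(dim))      # start within about 3 standard deviations
    centroids = span * np.linspace(-1.0, 1.0, 2**bits + 2)[1:-1]
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2.0  # nearest-centroid boundaries
        cell = np.searchsorted(edges, x)
        mass = np.bincount(cell, weights=pdf, minlength=centroids.size)
        moment = np.bincount(cell, weights=pdf * x, minlength=centroids.size)
        centroids = moment / mass  # centroid condition: weighted mean of each cell
    return centroids


def quantize_stage1(vectors, rotation, centroids):
    """Normalize, rotate, and map each coordinate to its nearest centroid index."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    rotated = (vectors / np.maximum(norms, 1e-12)) @ rotation.T
    edges = (centroids[:-1] + centroids[1:]) / 2.0
    return np.searchsorted(edges, rotated).astype(np.uint8), norms[:, 0]


def dequantize_stage1(codes, norms, rotation, centroids):
    """Look up centroids, undo the rotation, and restore the original norms."""
    return (centroids[codes] @ rotation) * norms[:, None]


rotation = random_rotation(64)
codebook = lloyd_max_codebook(64, bits=3)
vectors = np.random.default_rng(0).standard_normal((100, 64))
codes, norms = quantize_stage1(vectors, rotation, codebook)
approx = dequantize_stage1(codes, norms, rotation, codebook)
rel_error = np.linalg.norm(vectors - approx) / np.linalg.norm(vectors)
print(f"relative reconstruction error: {rel_error:.3f}")
```

Because the codebook depends only on the dimension and bit width, it can be computed once and reused for every vector.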
Stage 2: QJL (Quantized Johnson-Lindenstrauss residual correction)
- Compute the quantization residual (difference between original and Stage 1 reconstruction).
- Project the residual through a random Gaussian matrix.
- Store only the sign bits (1 bit per dimension) of the projection.
At query time, an asymmetric estimator computes approximate inner products directly on compressed data - the query stays in full precision while stored vectors remain compressed.
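A minimal sketch of the sign-bit encoding and the asymmetric estimator follows. The function names are hypothetical, and storing the residual norm as a per-vector scale ("gamma") is an assumption about how the correction is rescaled. The estimator uses the identity E[(s·q) sign(s·r)] = sqrt(2/pi) <q, r> / ||r|| for a Gaussian row s.

```python
import numpy as np


def qjl_encode(residuals, projection):
    """Keep only the sign of each projected residual coordinate, plus the residual norm."""
    signs = np.where(residuals @ projection.T >= 0, 1, -1).astype(np.int8)
    gammas = np.linalg.norm(residuals, axis=1)  # assumed per-vector scale
    return signs, gammas


def qjl_inner_product(query, signs, gammas, projection):
    """Estimate <query, residual> from sign bits; the query stays in full precision."""
    m = projection.shape[0]
    projected_query = projection @ query
    return np.sqrt(np.pi / 2.0) / m * gammas * (signs @ projected_query)


rng = np.random.default_rng(0)
d, m = 32, 4096
projection = rng.standard_normal((m, d))       # random Gaussian matrix
residuals = rng.standard_normal((5, d))
residuals /= np.linalg.norm(residuals, axis=1, keepdims=True)
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

signs, gammas = qjl_encode(residuals, projection)
estimate = qjl_inner_product(query, signs, gammas, projection)
exact = residuals @ query
print(np.abs(estimate - exact).max())
```

In the full pipeline this estimate would be added to the inner product between the query and the Stage 1 reconstruction. A large m is used here only to make the estimate visibly tight.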
Compression Math
For dimension d with b-bit quantization and QJL dimension m:
Compressed bits per vector = d * b + m * 1 + 32 + 32    (32-bit norm + 32-bit gamma)
                           = d * (b + 1) + 64           (with m = d, the default)
Original bits per vector   = d * 32
Compression ratio          = 32d / (d * (b + 1) + 64)
At d=1536, b=3: ratio = 49152 / 6208 ≈ 7.9x (theoretical) / ~6x (practical with uint8 storage)
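The formula above can be evaluated directly. The helper below is a hypothetical sketch that implements only the bit-count formula as written. The library's own memory_stats() may account for storage differently.

```python
def compression_ratio(dim, bits, qjl_dim=None, overhead_bits=64):
    """Theoretical ratio of float32 storage to compressed storage, per vector."""
    m = dim if qjl_dim is None else qjl_dim      # QJL dimension defaults to dim
    compressed_bits = dim * bits + m + overhead_bits
    return 32 * dim / compressed_bits


for dim in (384, 768, 1536, 3072):
    print(dim, round(compression_ratio(dim, bits=3), 2))
```

A smaller qjl_dim raises the ratio, at the cost of a noisier residual correction.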
Installation
pip install langchain-turboquant
Or install from source:
git clone https://github.com/wjddusrb03/langchain-turboquant.git
cd langchain-turboquant
pip install -e ".[dev]"
Dependencies
- Python >= 3.9
- NumPy >= 1.21
- SciPy >= 1.7
- LangChain Core >= 0.3
Quick Start
from langchain_turboquant import TurboQuantVectorStore
from langchain_openai import OpenAIEmbeddings
# Create a compressed vector store (3-bit = ~6x compression)
store = TurboQuantVectorStore(embedding=OpenAIEmbeddings(), bits=3)
# Add documents - just like any LangChain vector store
store.add_texts(
    ["TurboQuant compresses vectors by 6x",
     "LangChain is a framework for LLM applications",
     "RAG combines retrieval with generation"],
    metadatas=[{"topic": "compression"}, {"topic": "framework"}, {"topic": "rag"}],
)
# Search
results = store.similarity_search("How does compression work?", k=2)
for doc in results:
    print(doc.page_content)
# Check memory savings
print(store.memory_stats())
# {'num_documents': 3, 'dimension': 1536, 'bits': 3,
# 'compression_ratio': '7.7x', 'memory_saved_pct': '87.0%'}
Use as a LangChain Retriever
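from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
retriever = store.as_retriever(search_kwargs={"k": 3})
# Example prompt; any template with {context} and {question} works
prompt = ChatPromptTemplate.from_template(
    "Answer using this context:\n{context}\n\nQuestion: {question}"
)
# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI()
)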
No API Key Demo
Run the included demo with fake embeddings (no API key needed):
python examples/rag_demo.py
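For readers who want to see what a fake embedding model can look like, here is a minimal sketch. It is not the contents of examples/rag_demo.py, and the class name HashEmbeddings is hypothetical. It follows the two-method shape of LangChain embedding models (embed_documents and embed_query) and returns deterministic pseudo-random unit vectors, so results carry no semantic meaning.

```python
import hashlib

import numpy as np


class HashEmbeddings:
    """Hypothetical stand-in embedding model: deterministic pseudo-random unit vectors."""

    def __init__(self, size=384):
        self.size = size

    def _embed(self, text):
        # Seed a generator from the text so the same text always maps to the same vector.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        vec = np.random.default_rng(seed).standard_normal(self.size)
        return (vec / np.linalg.norm(vec)).tolist()

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)


fake = HashEmbeddings(size=384)
doc_vectors = fake.embed_documents(["alpha", "beta"])
query_vector = fake.embed_query("alpha")
```

To pass such a class to a vector store, it may need to subclass langchain_core.embeddings.Embeddings. That detail is left out here to keep the sketch dependency-free.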
API Reference
TurboQuantVectorStore
TurboQuantVectorStore(
    embedding: Embeddings,   # Any LangChain embedding model
    bits: int = 3,           # Quantization bits (1-4, recommended: 3)
    qjl_dim: int = None,     # QJL dimensions (default: same as embedding dim)
    seed: int = 42,          # Random seed for reproducibility
)
Methods:
| Method | Description |
|---|---|
| add_texts(texts, metadatas, ids) | Embed, compress, and store texts |
| similarity_search(query, k) | Return top-k most similar documents |
| similarity_search_with_score(query, k) | Return top-k with cosine similarity scores |
| similarity_search_by_vector(vector, k) | Search by pre-computed embedding vector |
| from_texts(texts, embedding, ...) | Class method to create and populate store |
| delete(ids) | Delete documents by ID |
| get_by_ids(ids) | Retrieve documents by ID |
| as_retriever(**kwargs) | Convert to LangChain Retriever |
| save(path) | Persist store to disk |
| load(path, embedding) | Load store from disk |
| memory_stats() | Get compression statistics |
TurboQuantizer (Low-level API)
import numpy as np
from langchain_turboquant import TurboQuantizer
quantizer = TurboQuantizer(dim=1536, bits=3)
# Placeholder data; replace with real embeddings
vectors = np.random.randn(1000, 1536)
query_vector = np.random.randn(1536)
# Compress vectors
compressed = quantizer.quantize(vectors)  # (n, 1536) -> CompressedVectors
# Asymmetric search (query in full precision, database compressed)
scores = quantizer.cosine_scores(query_vector, compressed)
# Reconstruct (for evaluation)
reconstructed = quantizer.dequantize(compressed)
Compression Ratios by Configuration
| Dimension | Bits | Theoretical Ratio | Memory Saved |
|---|---|---|---|
| 384 | 3 | 5.8x | 82.8% |
| 768 | 3 | 6.8x | 85.3% |
| 1536 | 3 | 7.3x | 86.3% |
| 3072 | 3 | 7.7x | 87.0% |
| 1536 | 2 | 9.5x | 89.5% |
| 1536 | 4 | 6.1x | 83.6% |
Higher dimensions benefit more from compression (the fixed 64-bit overhead for norms/gammas becomes negligible).
Testing
The project includes 296 comprehensive tests covering:
- Mathematical correctness (83 tests): Lloyd-Max codebook properties, rotation matrix orthogonality, MSE bounds, PDF integration, centroid conditions
- Edge cases (35 tests): NaN/Inf vectors, empty arrays, Unicode text, dim=1/2/3, zero vectors, large batches
- Search recall (44 tests): Top-k recall at various k/n/dim/bits, cluster discrimination, asymmetric estimator statistics, Pearson correlation
- Persistence (29 tests): Save/load roundtrips, serialization formats, state consistency after add/delete cycles
- Rigorous validation (68 tests): Compression ratios, performance benchmarks, score ordering, reconstruction quality
- Core functionality (37 tests): VectorStore CRUD, quantizer operations, LangChain integration
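Top-k recall, the metric named in the search recall item above, compares the neighbours returned from compressed scores against those from exact scores. The helper below is a hypothetical sketch of that metric, not code taken from the project's test suite.

```python
import numpy as np


def recall_at_k(exact_scores, approx_scores, k):
    """Fraction of the exact top-k neighbours that the approximate ranking also returns.

    Both inputs have shape (num_queries, num_documents); higher score means more similar.
    """
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    hits = [len(set(e) & set(a)) for e, a in zip(exact_top, approx_top)]
    return float(np.mean(hits)) / k


exact = np.array([[0.9, 0.1, 0.5, 0.3]])
approx = np.array([[0.1, 0.9, 0.5, 0.3]])  # documents 0 and 1 swapped by quantization noise
print(recall_at_k(exact, exact, k=2))   # 1.0
print(recall_at_k(exact, approx, k=2))  # 0.5
```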
# Run all tests
pytest tests/ -v
# Run specific test suite
pytest tests/test_math_stress.py -v # Mathematical properties
pytest tests/test_recall_extensive.py -v # Search recall
pytest tests/test_edge_cases.py -v # Edge cases
Architecture
langchain-turboquant/
├── src/langchain_turboquant/
│   ├── __init__.py               # Package exports
│   ├── lloyd_max.py              # Lloyd-Max optimal codebook computation
│   ├── quantizer.py              # TurboQuantizer (PolarQuant + QJL)
│   └── vectorstore.py            # LangChain VectorStore integration
├── tests/
│   ├── test_quantizer.py         # Core quantizer tests
│   ├── test_vectorstore.py       # VectorStore API tests
│   ├── test_rigorous.py          # Rigorous validation
│   ├── test_math_stress.py       # Mathematical properties
│   ├── test_edge_cases.py        # Edge cases
│   ├── test_recall_extensive.py  # Search recall
│   └── test_persistence.py       # Persistence tests
├── examples/
│   └── rag_demo.py               # Working RAG demo (no API key needed)
├── pyproject.toml
├── LICENSE
└── README.md
References
- TurboQuant: Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (ICLR 2026). arXiv:2504.19874
- PolarQuant: Han et al., "PolarQuant: Quantizing KV Caches with Polar Transformation" (AISTATS 2026). arXiv:2502.02617
- QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AAAI 2025). arXiv:2406.03482
- LangChain: langchain.com
Contributing
Contributions are welcome! If you find a bug, have a feature request, or want to improve the code:
- Open an Issue describing the problem or idea
- Fork the repo and create a branch
- Write tests for your changes
- Submit a Pull Request
Please report any problems or suggestions in the Issues tab. All feedback is appreciated!
License
MIT License - see LICENSE for details.