TurboQuant vector store for LangChain — 6x memory reduction with training-free quantization

langchain-turboquant

The first LangChain integration for TurboQuant - Google Research's training-free vector compression algorithm (ICLR 2026).

Drop-in replacement for any LangChain vector store with ~6x memory reduction and near-zero accuracy loss. No GPU required.

Python 3.9+ | License: MIT | Tests: 296 passed

Korean README


Why langchain-turboquant?

Large-scale RAG pipelines store millions of embedding vectors in memory. At 1536 dimensions (OpenAI text-embedding-3-small), each float32 vector takes 6 KB, so a million vectors need about 6 GB just for embeddings.

TurboQuant compresses these vectors to ~1 KB each (3-bit quantization), cutting memory by 6x while preserving search accuracy. Unlike Product Quantization (PQ) or IVFPQ, TurboQuant requires no codebook training - it works out of the box on any embedding.
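The uncompressed figures above follow from simple arithmetic. A minimal sketch, assuming embeddings are held as float32 (4 bytes per coordinate); the helper name `float32_footprint_bytes` is illustrative and not part of the package:

```python
def float32_footprint_bytes(num_vectors: int, dim: int) -> int:
    """Bytes needed to hold num_vectors float32 embeddings of size dim."""
    return num_vectors * dim * 4  # 4 bytes per float32 coordinate


per_vector = float32_footprint_bytes(1, 1536)
total = float32_footprint_bytes(1_000_000, 1536)

print(f"per vector: {per_vector} bytes ({per_vector / 1024:.1f} KiB)")
print(f"1M vectors: {total / 1e9:.2f} GB")
```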

| Feature | langchain-turboquant | FAISS (PQ) | Chroma |
|---|---|---|---|
| Compression ratio | ~6x (3-bit) | ~4x (8-bit PQ) | 1x (none) |
| Training required | No | Yes (codebook) | N/A |
| Drop-in LangChain | Yes | Partial | Yes |
| GPU required | No | Optional | No |
| Asymmetric search | Yes | Yes | N/A |

How It Works

TurboQuant implements the two-stage compression algorithm from Google Research (ICLR 2026):

Stage 1: PolarQuant (MSE-optimal scalar quantization)

  1. Random orthogonal rotation: Multiply the vector by a random orthogonal matrix. This "isotropizes" the coordinates so each one follows the same distribution (the hypersphere marginal).
  2. Lloyd-Max quantization: Quantize each rotated coordinate independently using a pre-computed optimal codebook for the hypersphere marginal PDF.

The codebook is computed analytically from the distribution - no training data needed.
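The two Stage 1 steps can be sketched in NumPy. This is an illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, inputs are assumed to be unit vectors, and the codebook is fitted here with sample-based Lloyd iterations rather than computed analytically as the library does.

```python
import numpy as np


def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix makes the draw uniform over orthogonal matrices


def fit_codebook(samples: np.ndarray, bits: int, iters: int = 50) -> np.ndarray:
    """1-D Lloyd iterations: boundaries at midpoints, centroids at cell means."""
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        boundaries = (centroids[:-1] + centroids[1:]) / 2
        cells = np.searchsorted(boundaries, samples)
        for j in range(levels):
            members = samples[cells == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids


def quantize(unit_vectors, rotation, centroids):
    """Rotate each row, then map every coordinate to its nearest centroid index."""
    rotated = unit_vectors @ rotation.T
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    return np.searchsorted(boundaries, rotated).astype(np.uint8)


def dequantize(codes, rotation, centroids):
    """Look up centroids and undo the rotation."""
    return centroids[codes] @ rotation


dim, bits = 64, 3
rng = np.random.default_rng(1)


def random_unit_vectors(n):
    x = rng.standard_normal((n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rotation = random_rotation(dim)
# Coordinates of random unit vectors are samples from the hypersphere marginal.
centroids = fit_codebook(random_unit_vectors(2000).ravel(), bits)

vectors = random_unit_vectors(500)
codes = quantize(vectors, rotation, centroids)
reconstructed = dequantize(codes, rotation, centroids)
relative_mse = np.mean(np.sum((vectors - reconstructed) ** 2, axis=1))
print(f"mean squared reconstruction error at {bits} bits: {relative_mse:.3f}")
```

Because every rotated coordinate shares one distribution, a single small codebook (8 levels at 3 bits) serves all dimensions, which is why no per-dataset training is needed.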

Stage 2: QJL (Quantized Johnson-Lindenstrauss residual correction)

  1. Compute the quantization residual (difference between original and Stage 1 reconstruction).
  2. Project the residual through a random Gaussian matrix.
  3. Store only the sign bits (1 bit per dimension) of the projection.

At query time, an asymmetric estimator computes approximate inner products directly on compressed data - the query stays in full precision while stored vectors remain compressed.
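A minimal sketch of the sign-bit encoding and the asymmetric estimate, assuming the standard QJL estimator (scale the sign correlation by sqrt(pi/2) and the residual norm). Function names are hypothetical, and the projection dimension is set far larger than the package default purely so that the estimate is visibly close to the exact value.

```python
import numpy as np


def qjl_encode(residual, projection):
    """Keep only the sign pattern of the projected residual, plus its norm."""
    signs = np.sign(projection @ residual).astype(np.int8)
    return signs, float(np.linalg.norm(residual))


def qjl_inner_product(query, signs, residual_norm, projection):
    """Estimate <query, residual>; the query is projected in full precision."""
    m = projection.shape[0]
    correlation = float((projection @ query) @ signs)
    return np.sqrt(np.pi / 2) * residual_norm / m * correlation


rng = np.random.default_rng(0)
dim, m = 32, 20_000  # m is deliberately large here so the estimate is tight
projection = rng.standard_normal((m, dim))
residual = rng.standard_normal(dim) * 0.1
query = rng.standard_normal(dim)

signs, norm = qjl_encode(residual, projection)
estimate = qjl_inner_product(query, signs, norm, projection)
exact = float(query @ residual)
print(f"exact={exact:.4f} estimate={estimate:.4f}")
```

The estimate is added to the inner product against the Stage 1 reconstruction, so the stored vector never has to be decompressed to full precision.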

Compression Math

For dimension d with b-bit quantization and QJL dimension m:

Compressed bits per vector = d * b + m * 1 + 32 + 32     (indices + QJL sign bits + two 32-bit floats: norm and gamma)
                           = d * (b + 1) + 64          (with the default m = d)

Original bits per vector   = d * 32

Compression ratio          = 32d / (d * (b+1) + 64)

At d=1536, b=3, m=d: ratio = 49152 / 6208 ≈ 7.9x (theoretical) / ~6x (practical with uint8 storage)
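The formula can be evaluated directly. The sketch below implements it exactly as written above, defaulting the QJL dimension m to d; the function names are illustrative and not part of the package, and the result is the theoretical bit count only, without any storage padding.

```python
def compressed_bits(dim: int, bits: int, qjl_dim=None) -> int:
    """d*b index bits + m sign bits + two 32-bit floats per vector."""
    m = dim if qjl_dim is None else qjl_dim
    return dim * bits + m + 32 + 32


def compression_ratio(dim: int, bits: int, qjl_dim=None) -> float:
    """Ratio of float32 storage (32 bits per coordinate) to compressed storage."""
    return 32 * dim / compressed_bits(dim, bits, qjl_dim)


for dim in (384, 768, 1536, 3072):
    print(dim, compressed_bits(dim, 3), round(compression_ratio(dim, 3), 2))
```

Passing a smaller `qjl_dim` raises the ratio, since fewer sign bits are stored per vector.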

Installation

pip install langchain-turboquant

Or install from source:

git clone https://github.com/wjddusrb03/langchain-turboquant.git
cd langchain-turboquant
pip install -e ".[dev]"

Dependencies

  • Python >= 3.9
  • NumPy >= 1.21
  • SciPy >= 1.7
  • LangChain Core >= 0.3

Quick Start

from langchain_turboquant import TurboQuantVectorStore
from langchain_openai import OpenAIEmbeddings

# Create a compressed vector store (3-bit = ~6x compression)
store = TurboQuantVectorStore(embedding=OpenAIEmbeddings(), bits=3)

# Add documents - just like any LangChain vector store
store.add_texts(
    ["TurboQuant compresses vectors by 6x",
     "LangChain is a framework for LLM applications",
     "RAG combines retrieval with generation"],
    metadatas=[{"topic": "compression"}, {"topic": "framework"}, {"topic": "rag"}]
)

# Search
results = store.similarity_search("How does compression work?", k=2)
for doc in results:
    print(doc.page_content)

# Check memory savings
print(store.memory_stats())
# {'num_documents': 3, 'dimension': 1536, 'bits': 3,
#  'compression_ratio': '7.7x', 'memory_saved_pct': '87.0%'}

Use as a LangChain Retriever

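from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# `store` is the TurboQuantVectorStore built in Quick Start
retriever = store.as_retriever(search_kwargs={"k": 3})

# Prompt with the two variables the chain fills in
prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI()
)

answer = chain.invoke("How does compression work?")
print(answer.content)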

No API Key Demo

Run the included demo with fake embeddings (no API key needed):

python examples/rag_demo.py
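The demo script itself is not reproduced here. As a rough idea of what "fake embeddings" can look like, the sketch below derives a deterministic unit vector from a hash of the text. `HashEmbeddings` is a hypothetical class, not part of the package or of the demo; it only mirrors the two methods of the LangChain embeddings interface (`embed_documents` and `embed_query`), and its vectors carry no semantic meaning.

```python
import hashlib
import math
import random
from typing import List


class HashEmbeddings:
    """Hypothetical stand-in for a real embedding model (no API key needed)."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Seed a private RNG from the text so equal texts map to equal vectors.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        vec = [rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return self._embed(text)


emb = HashEmbeddings(dim=8)
a = emb.embed_query("hello")
b = emb.embed_documents(["hello", "world"])
print(a == b[0], a == b[1])
```

To pass such a class to a vector store, it would normally subclass `langchain_core.embeddings.Embeddings`; that base class is left out here so the sketch runs with the standard library alone.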

API Reference

TurboQuantVectorStore

TurboQuantVectorStore(
    embedding: Embeddings,  # Any LangChain embedding model
    bits: int = 3,          # Quantization bits (1-4, recommended: 3)
    qjl_dim: Optional[int] = None,  # QJL dimensions (default: same as embedding dim)
    seed: int = 42,         # Random seed for reproducibility
)

Methods:

| Method | Description |
|---|---|
| add_texts(texts, metadatas, ids) | Embed, compress, and store texts |
| similarity_search(query, k) | Return top-k most similar documents |
| similarity_search_with_score(query, k) | Return top-k with cosine similarity scores |
| similarity_search_by_vector(vector, k) | Search by pre-computed embedding vector |
| from_texts(texts, embedding, ...) | Class method to create and populate store |
| delete(ids) | Delete documents by ID |
| get_by_ids(ids) | Retrieve documents by ID |
| as_retriever(**kwargs) | Convert to LangChain Retriever |
| save(path) | Persist store to disk |
| load(path, embedding) | Load store from disk |
| memory_stats() | Get compression statistics |

TurboQuantizer (Low-level API)

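import numpy as np
from langchain_turboquant import TurboQuantizer

# Placeholder data: 1,000 random float32 vectors and one query vector
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 1536)).astype(np.float32)
query_vector = rng.standard_normal(1536).astype(np.float32)

quantizer = TurboQuantizer(dim=1536, bits=3)

# Compress vectors
compressed = quantizer.quantize(vectors)  # (n, 1536) -> CompressedVectors

# Asymmetric search (query in full precision, database compressed)
scores = quantizer.cosine_scores(query_vector, compressed)

# Reconstruct (for evaluation)
reconstructed = quantizer.dequantize(compressed)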

Compression Ratios by Configuration

| Dimension | Bits | Theoretical Ratio | Memory Saved |
|---|---|---|---|
| 384 | 3 | 5.8x | 82.8% |
| 768 | 3 | 6.8x | 85.3% |
| 1536 | 3 | 7.3x | 86.3% |
| 3072 | 3 | 7.7x | 87.0% |
| 1536 | 2 | 9.5x | 89.5% |
| 1536 | 4 | 6.1x | 83.6% |

Higher dimensions benefit more from compression (the fixed 64-bit overhead for norms/gammas becomes negligible).

Testing

The project includes 296 comprehensive tests covering:

  • Mathematical correctness (83 tests): Lloyd-Max codebook properties, rotation matrix orthogonality, MSE bounds, PDF integration, centroid conditions
  • Edge cases (35 tests): NaN/Inf vectors, empty arrays, Unicode text, dim=1/2/3, zero vectors, large batches
  • Search recall (44 tests): Top-k recall at various k/n/dim/bits, cluster discrimination, asymmetric estimator statistics, Pearson correlation
  • Persistence (29 tests): Save/load roundtrips, serialization formats, state consistency after add/delete cycles
  • Rigorous validation (68 tests): Compression ratios, performance benchmarks, score ordering, reconstruction quality
  • Core functionality (37 tests): VectorStore CRUD, quantizer operations, LangChain integration

# Run all tests
pytest tests/ -v

# Run specific test suite
pytest tests/test_math_stress.py -v     # Mathematical properties
pytest tests/test_recall_extensive.py -v # Search recall
pytest tests/test_edge_cases.py -v       # Edge cases
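For readers unfamiliar with the "top-k recall" metric named in the search recall suite, here is a minimal sketch of how it is commonly computed: rank items by exact scores and by approximate (compressed-domain) scores, then measure the overlap of the two top-k sets. The function names and the score lists are made up for illustration; they are not taken from the test suite.

```python
from typing import Sequence, Set


def top_k(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def recall_at_k(exact_scores: Sequence[float], approx_scores: Sequence[float], k: int) -> float:
    """Fraction of the true top-k that the approximate ranking also returns."""
    return len(top_k(exact_scores, k) & top_k(approx_scores, k)) / k


# Toy scores: full-precision similarities vs. similarities from compressed vectors.
exact = [0.91, 0.15, 0.78, 0.40, 0.66]
approx = [0.88, 0.20, 0.61, 0.43, 0.70]

print(recall_at_k(exact, approx, k=2))
print(recall_at_k(exact, approx, k=3))
```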

Architecture

langchain-turboquant/
├── src/langchain_turboquant/
│   ├── __init__.py          # Package exports
│   ├── lloyd_max.py         # Lloyd-Max optimal codebook computation
│   ├── quantizer.py         # TurboQuantizer (PolarQuant + QJL)
│   └── vectorstore.py       # LangChain VectorStore integration
├── tests/
│   ├── test_quantizer.py    # Core quantizer tests
│   ├── test_vectorstore.py  # VectorStore API tests
│   ├── test_rigorous.py     # Rigorous validation
│   ├── test_math_stress.py  # Mathematical properties
│   ├── test_edge_cases.py   # Edge cases
│   ├── test_recall_extensive.py  # Search recall
│   └── test_persistence.py  # Persistence tests
├── examples/
│   └── rag_demo.py          # Working RAG demo (no API key needed)
├── pyproject.toml
├── LICENSE
└── README.md

References

  • TurboQuant: Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (ICLR 2026). arXiv:2504.19874
  • PolarQuant: Han et al., "PolarQuant: Quantizing KV Caches with Polar Transformation" (AISTATS 2026). arXiv:2502.02617
  • QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AAAI 2025). arXiv:2406.03482
  • LangChain: langchain.com

Contributing

Contributions are welcome! If you find a bug, have a feature request, or want to improve the code:

  1. Open an Issue describing the problem or idea
  2. Fork the repo and create a branch
  3. Write tests for your changes
  4. Submit a Pull Request

Please report any problems or suggestions in the Issues tab. All feedback is appreciated!

License

MIT License - see LICENSE for details.
