Compressed vector index for RAG pipelines — 7-10x storage reduction, no training required

nanoindex

Compressed vector index for RAG pipelines: 7-10× smaller, with no training step. Recall depends on the bit width you choose (see Benchmarks).

Drop-in replacement for FAISS or your vector store's quantization layer, based on TurboQuant (Google Research, ICLR 2026). The first open-source Python implementation.

from nanoindex import NanoIndex

idx = NanoIndex(dim=384, bits=4)
idx.add(embeddings, metadata)          # compress 34 MB → 4.6 MB
results = idx.search(query, k=10)     # 2ms on 22K vectors

Why

Vector storage is the quiet cost in every RAG system. A modest corpus of 500K documents at 384 dimensions takes about 768 MB as float32 (500,000 × 384 × 4 bytes). That's before replication, before backups, before you add more models.
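The raw float32 footprint is just vectors × dimensions × 4 bytes. A minimal sketch of that arithmetic (the helper name `float32_index_bytes` is ours for illustration, not part of nanoindex):

```python
def float32_index_bytes(n_vectors: int, dim: int) -> int:
    """Raw size of an uncompressed float32 embedding matrix, in bytes."""
    bytes_per_float32 = 4
    return n_vectors * dim * bytes_per_float32


size = float32_index_bytes(500_000, 384)
print(f"{size / 1e6:.0f} MB")  # 768 MB
```

Replication and backups multiply this figure, which is why the per-vector size matters more than it first appears.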

NanoIndex compresses embeddings to 3–4 bits per value using a two-stage algorithm that requires no training data and no codebooks, just your vectors. You get a 7-10× smaller index. In the benchmarks in this README, the 4-bit setting reaches higher Recall@10 than a trained Faiss PQ baseline, although that baseline compresses more aggressively (32×).


Install

pip install nanoindex-rag                # core only (numpy)
pip install "nanoindex-rag[fast]"        # + Numba JIT (~15× faster search)
pip install "nanoindex-rag[langchain]"
pip install "nanoindex-rag[llamaindex]"

Quick start

import numpy as np
from nanoindex import NanoIndex

# Any float32 embeddings; model and domain agnostic.
# `encoder`, `texts`, and `sources` are placeholders for your own model and data.
embeddings = encoder.encode(texts)   # (N, dim) float32, L2-normalised
N = len(texts)

metadata = [{"id": f"doc_{i}", "text": texts[i], "source": sources[i]} for i in range(N)]

idx = NanoIndex(dim=384, bits=4)
idx.add(embeddings, metadata)
idx.save("my_index")

# Later
idx = NanoIndex.load("my_index")
results = idx.search(query_vec, k=10)

for r in results:
    print(r.score, r.text, r.metadata)

Batch search

# Search multiple queries at once
results = idx.search(query_matrix, k=10)   # (M, dim) → list[list[SearchResult]]

Metadata filters

# Equality filter (case-insensitive, supports lists)
results = idx.search(q, k=10, filters={"source": "arxiv"})
results = idx.search(q, k=10, filters={"author": ["Smith", "Jones"]})

# Range filters — any numeric field with _min / _max suffix
results = idx.search(q, k=10, filters={"year_min": 2022, "score_max": 0.9})

# Combined
results = idx.search(q, k=10, filters={"source": "arxiv", "year_min": 2023})
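To make the filter semantics concrete, here is a minimal sketch of how one metadata dict could be checked against a filter dict. The function `matches` is hypothetical and is not nanoindex's implementation; it assumes range bounds are inclusive and that a record missing the filtered field never matches.

```python
from numbers import Number


def matches(metadata: dict, filters: dict) -> bool:
    """Return True if one metadata dict satisfies every filter (hypothetical helper)."""
    for key, wanted in filters.items():
        # Range filters: "<field>_min" / "<field>_max" on numeric fields.
        if key.endswith("_min") or key.endswith("_max"):
            field, bound = key[:-4], key[-3:]
            value = metadata.get(field)
            if not isinstance(value, Number):
                return False
            if bound == "min" and value < wanted:   # assumed inclusive lower bound
                return False
            if bound == "max" and value > wanted:   # assumed inclusive upper bound
                return False
            continue
        # Equality filters: a single value or a list of accepted values,
        # compared case-insensitively as strings.
        if key not in metadata:
            return False
        accepted = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        accepted = {str(a).lower() for a in accepted}
        if str(metadata[key]).lower() not in accepted:
            return False
    return True


doc = {"source": "ArXiv", "author": "Smith", "year": 2023}
print(matches(doc, {"source": "arxiv", "year_min": 2023}))  # True
print(matches(doc, {"author": ["Jones"]}))                  # False
```

One consequence of the suffix convention is that a field whose own name ends in `_min` or `_max` cannot be used in an equality filter under this scheme.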

LangChain integration

from langchain.chains import RetrievalQA   # import path in LangChain 0.x
from langchain_openai import OpenAIEmbeddings
from nanoindex.integrations.langchain import NanoVectorStore

embeddings = OpenAIEmbeddings()

# Build (replaces FAISS.from_texts())
store = NanoVectorStore.from_texts(texts, embeddings, bits=4)
store.save_local("my_index")

# Load
store = NanoVectorStore.load_local("my_index", embeddings)

# Use as retriever; `llm` is any LangChain LLM or chat model you have already created
retriever = store.as_retriever(search_kwargs={"k": 5})
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

LlamaIndex integration

from nanoindex.integrations.llamaindex import NanoVectorStore
from llama_index.core import VectorStoreIndex, StorageContext

vector_store = NanoVectorStore(dim=1536, bits=4)
storage_ctx  = StorageContext.from_defaults(vector_store=vector_store)
index        = VectorStoreIndex.from_documents(docs, storage_context=storage_ctx)

query_engine = index.as_query_engine()
response     = query_engine.query("What is the capital of France?")

Benchmarks

Measured on 22K vectors, dim=384 (2023 F1 Bahrain GP telemetry embeddings, all-MiniLM-L6-v2), Apple M-series, single query, Numba enabled.

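| Method                 | Compression | Index size | Recall@10 | p50 latency |
|------------------------|-------------|------------|-----------|-------------|
| Brute-force float32    | -           | 34.1 MB    | 1.000     | 0.6 ms      |
| NanoIndex (4-bit)      | 7.4×        | 4.6 MB     | 0.744     | 2.0 ms      |
| NanoIndex (3-bit)      | 9.6×        | 3.5 MB     | 0.514     | 2.0 ms      |
| Faiss PQ (m=48, 8-bit) | 32×         | 1.1 MB     | 0.369     | 0.3 ms      |
| Faiss SQ8              | -           | 8.5 MB     | 0.975     | 1.6 ms      |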

NanoIndex at 4-bit reaches higher recall than Faiss PQ (0.744 vs 0.369) while requiring zero training, although the PQ configuration compresses far more (32× vs 7.4×), so this is not an equal-size comparison. The 2 ms latency reflects a pure Python/Numba implementation; the original CUDA kernel in the paper achieves 8× GPU throughput vs float32.

Run benchmarks on your own data:

pip install "nanoindex-rag[bench]"
python benchmarks/run_benchmarks.py --embeddings your_embeddings.npz --bits 4
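If you want to sanity-check Recall@10 yourself, the metric is straightforward to compute against a brute-force baseline. The sketch below is independent of the benchmark script; `recall_at_k` and `exact_top_k` are illustrative names, and it assumes you have already mapped your approximate results to integer row ids.

```python
import numpy as np


def exact_top_k(vectors: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Brute-force top-k row ids by inner product (the float32 baseline)."""
    scores = queries @ vectors.T                 # (n_queries, n_vectors)
    return np.argsort(-scores, axis=1)[:, :k]    # highest score first


def recall_at_k(exact_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    """Mean fraction of the exact top-k neighbours found in the approximate top-k.

    Both arrays have shape (n_queries, k) and hold row ids.
    """
    hits = [
        len(set(exact) & set(approx))
        for exact, approx in zip(exact_ids.tolist(), approx_ids.tolist())
    ]
    return float(np.mean(hits)) / exact_ids.shape[1]


exact = np.array([[0, 1, 2, 3]])
approx = np.array([[0, 1, 9, 8]])
print(recall_at_k(exact, approx))  # 0.5
```

A recall of 0.744 at k=10 therefore means that, on average, about 7 of the 10 true nearest neighbours appear in the returned list.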

How it works

NanoIndex implements TurboQuant — a two-stage, data-oblivious vector compression algorithm.

Stage 1 — PolarQuant

Each embedding is randomly rotated (shared orthogonal matrix R), then converted from Cartesian to polar coordinates. The angle components are uniformly quantized to b bits. The radii are recursively paired and quantized across 9 levels until a single scalar radius remains. This stores d-1 quantized angles + 1 float32 radius per vector.
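The following is a minimal NumPy sketch of the idea described above, not nanoindex's actual code. It assumes the dimension is a power of two, uses plain uniform bins with bin-centre reconstruction, and all function names are ours. Level 0 pairs signed coordinates (angles in [-π, π]); deeper levels pair non-negative radii (angles in [0, π/2]).

```python
import numpy as np


def random_rotation(dim: int, seed: int = 42) -> np.ndarray:
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def _angle_range(level: int):
    # Level 0 pairs signed coordinates; deeper levels pair non-negative radii.
    return (-np.pi, np.pi) if level == 0 else (0.0, np.pi / 2)


def _quantize(theta, lo, hi, bits):
    step = (hi - lo) / (1 << bits)
    return np.clip(np.floor((theta - lo) / step), 0, (1 << bits) - 1).astype(np.uint8)


def _dequantize(codes, lo, hi, bits):
    step = (hi - lo) / (1 << bits)
    return lo + (codes.astype(np.float64) + 0.5) * step  # bin centres


def polar_encode(x, rotation, bits):
    """Rotate x, then repeatedly pair values into (quantized angle, radius)."""
    if x.size & (x.size - 1):
        raise ValueError("this sketch assumes a power-of-two dimension")
    r = rotation @ x
    levels = []
    while r.size > 1:
        a, b = r[0::2], r[1::2]
        lo, hi = _angle_range(len(levels))
        levels.append(_quantize(np.arctan2(b, a), lo, hi, bits))
        r = np.hypot(a, b)          # radii feed the next level
    return levels, float(r[0])      # d-1 angle codes + one scalar radius


def polar_decode(levels, radius, rotation, bits):
    """Invert polar_encode using bin-centre angles, then undo the rotation."""
    r = np.array([radius])
    for level in range(len(levels) - 1, -1, -1):
        lo, hi = _angle_range(level)
        theta = _dequantize(levels[level], lo, hi, bits)
        out = np.empty(2 * r.size)
        out[0::2], out[1::2] = r * np.cos(theta), r * np.sin(theta)
        r = out
    return rotation.T @ r


dim, bits = 8, 8
R = random_rotation(dim)
x = np.random.default_rng(0).standard_normal(dim)
levels, radius = polar_encode(x, R, bits)
x_hat = polar_decode(levels, radius, R, bits)
n_angles = sum(level.size for level in levels)
print(n_angles)  # 7, i.e. dim - 1
```

Because every level reconstructs each pair as radius × (cos θ, sin θ), the decoded vector keeps the original norm; only the direction carries quantization error.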

Stage 2 — QJL residual correction

The quantization error from Stage 1 is projected through a random Johnson-Lindenstrauss matrix S ∈ ℝᵐˣᵈ. Only the sign bits of the projection are stored (1 bit each). At query time, a bias-corrected estimator adds back the residual correction without any decompression.
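As an illustration of the sign-bit idea, the sketch below uses the standard unbiased estimator for Gaussian projections: for a row s of S, E[(s·q)·sign(s·e)] = sqrt(2/π)·⟨q, e⟩/‖e‖, so rescaling by sqrt(π/2)·‖e‖/m recovers ⟨q, e⟩ on average. Whether nanoindex uses exactly this scaling is an assumption, and the function names are ours.

```python
import numpy as np


def qjl_encode(residual: np.ndarray, S: np.ndarray):
    """Keep only the sign bits of the projection, plus the residual's norm."""
    signs = np.sign(S @ residual)
    signs[signs == 0] = 1.0                     # treat exact zeros as +1
    return signs.astype(np.int8), float(np.linalg.norm(residual))


def qjl_inner_product(query, signs, residual_norm, S) -> float:
    """Estimate <query, residual> from the stored sign bits alone."""
    m = S.shape[0]
    return float(np.sqrt(np.pi / 2) / m * residual_norm * ((S @ query) @ signs))


rng = np.random.default_rng(0)
dim, m = 16, 40_000            # m is large here only to make the estimate tight
S = rng.standard_normal((m, dim))
residual = 0.1 * rng.standard_normal(dim)   # stands in for the Stage 1 error
query = rng.standard_normal(dim)

signs, residual_norm = qjl_encode(residual, S)
estimate = qjl_inner_product(query, signs, residual_norm, S)
exact = float(query @ residual)
print(estimate, exact)
```

The estimator's noise shrinks roughly as 1/sqrt(m), which is the trade-off behind a larger `qjl_m` giving a better correction at the cost of more stored bits.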

Inner product in the compressed domain

Approximate inner products are computed directly on compressed representations — no decompression step. The Numba-accelerated kernel processes 22K vectors in 2ms using parallel threads.

⟨q, v⟩ ≈ PolarQuant_IP(q, angles, radius) + QJL_correction(q, sign_bits, residual_norm)

Configuration

NanoIndex(
    dim   = 384,   # embedding dimension
    bits  = 4,     # bits per angle (3–8); 4-bit recommended
    qjl_m = 64,    # QJL projection dimensions; higher = better correction
    seed  = 42,    # for reproducible rotation matrices
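)

| bits | Compression | Typical Recall@10 | Use when                                   |
|------|-------------|-------------------|--------------------------------------------|
| 3    | ~9-10×      | 0.50–0.55         | Maximum compression, quality less critical |
| 4    | ~7-8×       | 0.70–0.75         | Recommended default                        |
| 6    | ~5×         | 0.85–0.90         | High recall requirements                   |
| 8    | ~4×         | 0.93+             | Near-lossless                              |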
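The compression figures can be roughly reproduced from the storage layout described under "How it works". The sketch below assumes each vector stores d-1 angle codes, `qjl_m` sign bits, and two float32 scalars (the radius and the residual norm); it ignores metadata and the shared rotation and projection matrices. The helper names are illustrative.

```python
def bytes_per_vector(dim: int, bits: int, qjl_m: int = 64) -> float:
    """Estimated compressed size of one vector under the assumed layout."""
    angle_bits = (dim - 1) * bits   # Stage 1: d-1 quantized angles
    sign_bits = qjl_m               # Stage 2: one sign bit per QJL projection
    float32_scalars = 2 * 4         # assumed: radius + residual norm, 4 bytes each
    return (angle_bits + sign_bits) / 8 + float32_scalars


def compression_ratio(dim: int, bits: int, qjl_m: int = 64) -> float:
    """float32 size divided by the estimated compressed size."""
    return dim * 4 / bytes_per_vector(dim, bits, qjl_m)


for bits in (3, 4, 6, 8):
    print(bits, round(compression_ratio(384, bits), 1))
```

Under these assumptions, dim=384 comes out at about 9.6× for 3 bits and 7.4× for 4 bits, in line with the benchmark table.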

SearchResult fields

from dataclasses import dataclass

@dataclass
class SearchResult:
    rank:     int         # 0-indexed rank in result list
    score:    float       # approximate cosine similarity
    id:       str         # from metadata["id"]
    text:     str         # from metadata["text"]
    metadata: dict        # all other fields from your metadata dict

Requirements

  • Python ≥ 3.10
  • numpy ≥ 1.24
  • numba ≥ 0.58 (optional, recommended: pip install "nanoindex-rag[fast]")

Algorithm credit

NanoIndex implements the TurboQuant algorithm from:

TurboQuant: Redefining AI Efficiency with Extreme Compression
Google Research · Blog post · arXiv:2504.19874 · ICLR 2026

This is an independent open-source Python/Numba implementation. The original paper's performance numbers were obtained using a custom CUDA kernel.


License

MIT
