Shard-first late-interaction retrieval for ColBERT and ColPali style workloads with CPU/GPU modes, Triton MaxSim, BM25 hybrid search, durable CRUD/WAL, multimodal preprocessing, and base64-ready reference APIs.
voyager-index
Late-interaction retrieval for on-prem AI systems. Runs on a single machine, supports CPU or GPU execution, and keeps MaxSim as the truth scorer.
voyager-index is built for teams that want multi-vector-native, ColBERT/ColPali-style retrieval quality without adopting a distributed search stack. It combines proxy routing, exact or quantized MaxSim, multimodal preprocessing, and database-grade operations behind one API.
The OSS engine stays fast on its own. An optional Latence graph sidecar adds graph-aware candidate rescue, provenance, and freshness-aware metadata as a premium plane without becoming a hard dependency of the base retrieval path.
pip install "voyager-index[server,shard,gpu]"
voyager-index-server # OpenAPI at :8080/docs
For developers: one retrieval contract across CPU and GPU, with real APIs and real failure recovery. For infrastructure leaders: strong late-interaction search on modest hardware, without taking on distributed search complexity.
Start Here
Use the shard-first production lane first. The canonical CPU-safe production install is one command, then you can layer GPU support on top when needed:
pip install "voyager-index[full]" # full public CPU surface
pip install "voyager-index[full,gpu]" # add Triton GPU scoring on CUDA hosts
HOST=0.0.0.0 WORKERS=4 voyager-index-server
# OpenAPI docs at http://127.0.0.1:8080/docs
Fine-grained profiles stay available:
pip install "voyager-index[shard]" # minimal shard path
pip install "voyager-index[server,shard]" # reference API
pip install "voyager-index[server,shard,shard-native]" # Rust shard CPU fast-path
pip install "voyager-index[server,shard,solver]" # Tabu Search solver only
pip install "voyager-index[server,shard,native]" # both public native wheels
pip install "voyager-index[server,shard,latence-graph]" # graph lane without native extras
If you are evaluating quickly:
- run the Quickstart
- use the Reference API Tutorial for the HTTP path
- use the Shard Engine Guide for the high-performance lane
- use the Latence Graph Sidecar Guide for the optional premium lane
Who This Fits
voyager-index is a strong fit when:
- late-interaction retrieval quality matters
- on-prem deployment matters
- single-node operability matters
- CPU and GPU flexibility matters
- you want an API-facing retrieval service, not just an offline benchmark artifact
It is probably not the first choice if you need:
- a large distributed control plane across many nodes
- purely dense ANN retrieval at extreme scale
- a hosted multi-tenant SaaS search platform
Why
Most retrieval systems optimize the shortlist and treat late interaction as an add-on. voyager-index is built the other way around: MaxSim is the final scorer, and the rest of the system exists to make that practical in production.
- Proxy routing instead of mandatory graph dependency. A learned proxy router collapses multi-vector candidate generation to ANN over compact routing representations, then hands off to exact or quantized MaxSim. The optional Latence graph lane augments after first-stage retrieval instead of replacing the router.
- Fast CPU and GPU execution. Rust fused scoring for CPU, Triton kernels for GPU, with the same retrieval contract across both modes.
- Operational features included. CRUD, WAL, checkpoint, recovery, metadata, and API serving are part of the system, not an afterthought.
- Built for single-node deployment. No distributed control plane required for the common on-prem use case.
Current Public Proof
voyager-index has two public proof layers, and they should be read together:
- Core production lane: the shard-first route is proven by the BEIR shard benchmark in benchmarks/beir_benchmark.py. That harness measures search-only GPU-corpus Triton MaxSim and CPU multi-worker fused Rust scoring on the same production lane the API serves, with current public results showing 2.6-5.0 ms GPU P95, 164.8-346.8 GPU QPS, and 41.6-271.7 CPU QPS across the listed BEIR sets.
- Optional Latence graph lane: the graph lane is proven separately by tools/benchmarks/benchmark_latence_graph_quality.py plus the graph tests. In the current representative harness it delivers +0.75 recall, +0.333 NDCG, and +0.75 support coverage on graph-shaped queries, 0.0 ordinary-query deltas, 57% graph activation, 3.5 average added candidates on graph-shaped queries, and passing route-conformance checks.
The graph proof is intentionally scoped: it shows the shipped graph contract, additive rescue semantics, provenance tagging, and retrieval uplift on graph-shaped fixtures. It is not presented as a graph-on BEIR table.
The graph data itself comes from structured Latence graph data derived from the indexed corpus and synchronized into the sidecar as target-linked graph contracts. The public guide explains the architecture and provenance model without exposing proprietary internals.
For full methodology and benchmark caveats, see Benchmarks And Methodology.
BEIR Benchmark
Measured on NVIDIA RTX A5000 (24 GB) using lightonai/GTE-ModernColBERT-v1.
Numbers below are search-only and exclude query encoding. CPU results use
8 native Rust workers. These are full-query-set results, not a sampled
subset.
These results are meant to show three things:
- Retrieval quality on standard BEIR datasets
- Search latency and throughput under realistic conditions
- What is achievable on modest on-prem hardware
| Dataset | Documents | MAP@100 | NDCG@10 | NDCG@100 | Recall@10 | Recall@100 | GPU QPS | GPU P95 (ms) | CPU QPS | CPU P95 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| arguana | 8,674 | 0.2598 | 0.3679 | 0.4171 | 0.7402 | 0.9586 | 270.0 | 4.1 | 41.6 | 202.7 |
| fiqa | 57,638 | 0.3818 | 0.4436 | 0.5049 | 0.5059 | 0.7297 | 164.8 | 5.0 | 80.2 | 115.7 |
| nfcorpus | 3,633 | 0.1963 | 0.3833 | 0.3485 | 0.3404 | 0.3348 | 282.6 | 3.8 | 123.3 | 84.4 |
| quora | 15,675 | 0.9686 | 0.9766 | 0.9790 | 0.9930 | 0.9993 | 346.8 | 2.6 | 271.7 | 46.9 |
| scidocs | 25,657 | 0.1383 | 0.1977 | 0.2763 | 0.2070 | 0.4369 | 246.8 | 4.3 | 83.9 | 111.8 |
| scifact | 5,183 | 0.7141 | 0.7544 | 0.7730 | 0.8766 | 0.9567 | 263.4 | 4.0 | 69.1 | 138.4 |
How to read these results
- GPU P95 under 6 ms across all listed datasets shows the fast path is practical on A5000-class hardware.
- CPU mode remains viable when GPU capacity is limited or reserved for model serving.
- Quality metrics are strong while using the same shard/Rust/Triton retrieval stack that powers the production API.
Comparison note: next-plaid
next-plaid is an important open-source reference for ColBERT-style serving. Their published numbers are measured on NVIDIA H100 80 GB with the same embedding model. Our numbers above are measured on an RTX A5000 and are search-only; their reported QPS includes encoding. Quora is omitted below because their README uses a much larger corpus for that dataset.
| Dataset | System | NDCG@10 | MAP@100 | Recall@100 | GPU QPS | GPU P95 (ms) | CPU QPS | CPU P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| arguana | voyager | 0.3679 | 0.2598 | 0.9586 | 270.0 | 4.1 | 41.6 | 202.7 |
| arguana | next-plaid | 0.3499 | 0.2457 | 0.9337 | 13.6 | 170.1 | 17.4 | 454.7 |
| fiqa | voyager | 0.4436 | 0.3818 | 0.7297 | 164.8 | 5.0 | 80.2 | 115.7 |
| fiqa | next-plaid | 0.4506 | 0.3871 | 0.7459 | 18.2 | 170.6 | 17.6 | 259.1 |
| nfcorpus | voyager | 0.3833 | 0.1963 | 0.3348 | 282.6 | 3.8 | 123.3 | 84.4 |
| nfcorpus | next-plaid | 0.3828 | 0.1870 | 0.3228 | 6.6 | 262.1 | 16.9 | 219.4 |
| scidocs | voyager | 0.1977 | 0.1383 | 0.4369 | 246.8 | 4.3 | 83.9 | 111.8 |
| scidocs | next-plaid | 0.1914 | 0.1352 | 0.4418 | 17.5 | 139.3 | 16.5 | 281.7 |
| scifact | voyager | 0.7544 | 0.7141 | 0.9567 | 263.4 | 4.0 | 69.1 | 138.4 |
| scifact | next-plaid | 0.7593 | 0.7186 | 0.9633 | 7.9 | 169.5 | 16.9 | 305.4 |
In our current benchmark setup, voyager-index is competitive or better on retrieval quality across most listed datasets and shows materially higher search throughput with much lower P95 latency on an RTX A5000. This is not a fully apples-to-apples comparison: next-plaid reports H100 numbers and includes encoding in QPS, while our numbers are search-only on a smaller GPU. The table above uses full-query evaluation specifically to avoid publishing a flattering slice.
Architecture
voyager-index separates the problem into routing, storage, exact scoring, optimization, durability, and serving. This keeps the retrieval contract stable across CPU, GPU, and mixed deployment modes.
query vectors (token / patch embeddings)
→ LEMUR routing MLP
→ FAISS ANN over routing representations
→ candidate document IDs
→ optional BM25 fusion when query_text is available
→ optional centroid-approx or doc-mean proxy pruning
→ optional ColBANDIT query-time pruning
→ exact or quantized MaxSim
  - Rust fused exact (CPU, mmap, SIMD, GIL-free)
  - Triton FP16 / INT8 / FP8 / ROQ-4 (GPU)
  - GPU-corpus gather + rerank
→ optional Latence graph augmentation
→ optional solver/context packing
→ top-K results or packed context
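The pipeline ends with MaxSim as the truth scorer. As an illustrative sketch only (the shipped kernels are Rust and Triton, not numpy), the late-interaction scoring at the exact stage can be written as:

```python
import numpy as np

def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the
    best-matching document token similarity, then sum over query tokens."""
    # query: (n_query_tokens, dim), doc: (n_doc_tokens, dim)
    sims = query @ doc.T                  # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # best doc token per query token

rng = np.random.default_rng(7)
query = rng.normal(size=(16, 128)).astype("float32")
docs = [rng.normal(size=(16, 128)).astype("float32") for _ in range(32)]

# Rank all candidates by exact MaxSim and keep the top 5 — what the
# exact stage does after routing has shrunk the candidate set.
scores = np.array([maxsim(query, d) for d in docs])
top5 = np.argsort(scores)[::-1][:5]
```

The routing stages above exist purely to shrink the set of documents this exact scorer has to touch.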
| Layer | What it does | Why it matters |
|---|---|---|
| Routing | LEMUR MLP, FAISS MIPS, candidate budgets | Makes late interaction tractable without graph construction |
| Storage | Safetensors shards, merged mmap, GPU-resident corpus | Honest CPU and GPU layouts for any corpus size |
| Exact scoring | Triton MaxSim, Rust fused MaxSim, quantized kernels | MaxSim stays the truth scorer across all deployment shapes |
| Optimization | ColBANDIT pruning, centroid approximation, ROQ-4 | Moves the latency/recall frontier without changing the retrieval contract |
| Optional graph plane | Latence graph sidecar, target-linked graph contracts, additive rescue, provenance | Keeps graph awareness premium and post-retrieval |
| Durability | WAL, memtable, checkpoint, crash recovery | A retrieval engine that behaves like a real database |
| Serving | FastAPI, base64 transport, multi-worker, OpenAPI | One pip install, one server, one API contract |
What Makes It Different
No mandatory graph dependency
voyager-index uses proxy routing plus exact MaxSim reranking without requiring a graph build step in the OSS serving path. When installed, the optional Latence graph sidecar is invoked after first-stage retrieval and merged additively. That keeps the system simpler to operate while preserving a premium graph lane.
Rust + Triton hot paths
The CPU path is a native Rust extension (latence_shard_engine) with
memory-mapped shards, fused MaxSim, SIMD acceleration, and GIL-free execution.
The GPU path uses Triton kernels for exact and quantized scoring with
variable-length document scheduling.
Research-backed features in the serving path
LEMUR routing, ColBANDIT query-time pruning, ROQ rotational quantization, and budget-aware context optimization are integrated into the shipped system rather than isolated in research notebooks.
Operational features, not just benchmarking
CRUD, WAL, checkpoint, crash recovery, payload metadata, scroll, and retrieve are included because retrieval systems in production need operational discipline, not just benchmark wins.
Multimodal native
The same serving stack supports text token embeddings (ColBERT) and image patch embeddings (ColPali/ColQwen), with preprocessing for PDF, DOCX, XLSX, and image inputs.
Quickstart
Install
pip install "voyager-index[full]" # full public CPU surface
pip install "voyager-index[full,gpu]" # + Triton GPU kernels on CUDA hosts
pip install "voyager-index[server,shard]" # + FastAPI server only
pip install "voyager-index[server,shard,shard-native]" # + Rust shard CPU fast-path
pip install "voyager-index[server,shard,solver]" # + Tabu Search solver only
Python SDK
import numpy as np
from voyager_index import Index
rng = np.random.default_rng(7)
docs = [rng.normal(size=(16, 128)).astype("float32") for _ in range(32)]
query = rng.normal(size=(16, 128)).astype("float32")
idx = Index(
"demo-index",
dim=128,
engine="shard",
n_shards=32,
k_candidates=256,
compression="fp16",
)
idx.add(docs, ids=list(range(len(docs))))
results = idx.search(query, k=5)
print(results[0])
idx.close()
HTTP API
HOST=0.0.0.0 WORKERS=4 voyager-index-server
# OpenAPI docs at http://127.0.0.1:8080/docs
import numpy as np
import requests
from voyager_index import encode_vector_payload
query = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
response = requests.post(
"http://127.0.0.1:8080/collections/my-shard/search",
json={
"vectors": encode_vector_payload(query, dtype="float16"),
"top_k": 5,
"quantization_mode": "fp8",
"use_colbandit": True,
},
timeout=30,
)
print(response.json()["results"][0])
Docker
docker build -f deploy/reference-api/Dockerfile -t voyager-index .
docker run -p 8080:8080 -v "$(pwd)/data:/data" voyager-index
Execution Modes
| Mode | Corpus placement | Best for |
|---|---|---|
| CPU exact | Disk/mmap → Rust fused MaxSim | Simplest deployment, no GPU required |
| GPU streamed | Disk/CPU → GPU transfer → Triton MaxSim | Large corpora that don't fit in VRAM |
| GPU corpus | Fully resident in VRAM | Lowest latency when corpus fits |
All three modes share the same collection format, API contract, and retrieval semantics. Start with CPU, add GPU when latency matters.
Engineering Knobs
Routing and candidate budgets
- `k_candidates` — LEMUR candidate budget before exact scoring
- `max_docs_exact` — hard ceiling for the exact-stage document set
- `n_full_scores` — proxy shortlist size before full MaxSim
- `use_colbandit` — enable query-time pruning
Storage and transfer
- `n_shards` — number of storage shards
- `compression` — `fp16`, `int8`, or `roq4`
- `transfer_mode` — `pageable`, `pinned`, or `double_buffered`
Scoring and hardware
- `quantization_mode` — `exact`, `int8`, `fp8`, or `roq4`
- `router_device` — where LEMUR executes (`cpu` or `cuda`)
- `gpu_corpus_rerank_topn` — GPU rerank frontier for corpus-resident mode
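To build intuition for what the quantized scoring modes trade away, here is a self-contained sketch of symmetric per-document int8 quantization with exact-versus-quantized MaxSim. This is an illustration of the general technique, not the shipped int8/fp8/ROQ-4 kernels, which are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(doc: np.ndarray):
    """Symmetric int8 quantization with one scale per document."""
    scale = float(np.abs(doc).max()) / 127.0
    q = np.clip(np.round(doc / scale), -127, 127).astype(np.int8)
    return q, scale

def maxsim_int8(query: np.ndarray, q_doc: np.ndarray, scale: float) -> float:
    # Dequantize once, then score; real kernels accumulate in integer
    # arithmetic and rescale at the end.
    sims = query @ (q_doc.astype(np.float32) * scale).T
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 64)).astype("float32")
doc = rng.normal(size=(12, 64)).astype("float32")

exact = float((query @ doc.T).max(axis=1).sum())
q_doc, scale = quantize_int8(doc)
approx = maxsim_int8(query, q_doc, scale)
rel_err = abs(approx - exact) / abs(exact)  # small relative to exact score
```

The payoff is bandwidth: int8 halves the bytes moved per document versus fp16, which is usually where quantized MaxSim wins on memory-bound hardware.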
Hybrid and Multimodal
- BM25 + dense fusion via `rrf` or Tabu Search refinement
- Optional Latence graph augmentation via `graph_mode`, independent graph budgets, and `graph_explain`
- Multimodal collections share the same base64 vector contract as text
- Document preprocessing handles PDF, DOCX, XLSX, and images via `render_documents()` and the `/reference/preprocess/documents` API endpoint
Production note:
- `dense` HTTP collections are where BM25 + `query_text` hybrid search lives today
- `shard` HTTP collections are the high-performance vector route and can still use the optional graph lane additively through `graph_mode` and `query_payload`
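Reciprocal Rank Fusion, the `rrf` option mentioned above, is a simple rank-based combiner. A minimal sketch of the standard algorithm (not the library's internal implementation; document IDs here are made up):

```python
def rrf_fuse(rankings, k: int = 60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank).
    k=60 is the conventional default; it damps the influence of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # lexical shortlist
dense_ranking = ["d1", "d9", "d3", "d2"]  # late-interaction shortlist
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because RRF only consumes ranks, it needs no score normalization between BM25 and MaxSim, which is why it is a common default for hybrid fusion.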
API Surface
| Endpoint | Purpose |
|---|---|
| `POST /collections/{name}` | Create collection |
| `GET /collections/{name}/info` | Inspect collection tuning, health, and graph sync state |
| `POST /collections/{name}/points` | Add / upsert documents |
| `POST /collections/{name}/search` | Search |
| `POST /collections/{name}/search/batch` | Batch search |
| `GET /collections/{name}/scroll` | Scroll through results |
| `POST /collections/{name}/retrieve` | Retrieve by ID |
| `DELETE /collections/{name}/points` | Delete documents |
| `POST /collections/{name}/checkpoint` | Checkpoint WAL |
| `GET /collections/{name}/wal/status` | WAL status |
| `POST /encode` | Encode text/images to vectors |
| `POST /rerank` | Rerank results |
| `POST /reference/optimize` | Context packing (Tabu Search) |
Graph-aware search uses the same search endpoint and adds:
- `graph_mode`
- `graph_local_budget`
- `graph_community_budget`
- `graph_evidence_budget`
- `graph_explain`
- `query_payload` for ontology hints, workflow hints, and vector-only graph policy steering
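A graph-aware request is just the ordinary search body plus the fields above. The following sketch shows one plausible shape; the budget values, the `graph_mode` value, and the `query_payload` contents are illustrative assumptions, and `"vectors"` would normally come from `encode_vector_payload()` as in the Quickstart:

```python
import json

# Hypothetical graph-aware search body. Field names come from the
# parameter list above; the concrete values are made up for illustration.
body = {
    "vectors": "<base64-encoded float16 query tokens>",
    "top_k": 10,
    "graph_mode": "additive",          # assumed value, not a documented enum
    "graph_local_budget": 8,
    "graph_community_budget": 4,
    "graph_evidence_budget": 4,
    "graph_explain": True,
    "query_payload": {"ontology_hints": ["incident", "root_cause"]},
}
payload = json.dumps(body)  # ready to POST to /collections/{name}/search
```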
Public Python Surface
- `Index` and `IndexBuilder` — local shard collections
- `SearchPipeline` — dense + sparse fusion in-process
- `ColbertIndex` — late-interaction text workflows
- `ColPaliEngine` and `MultiModalEngine` — multimodal retrieval
- `encode_vector_payload()`, `decode_payload()` — base64 transport helpers
- `voyager-index-server` — reference HTTP server
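The base64 transport helpers exist so multi-vector payloads travel cleanly over JSON. As a sketch of the general idea only — the field names and layout below are hypothetical, not the actual `encode_vector_payload()` wire format — a float16 round-trip looks like:

```python
import base64
import numpy as np

def encode_vectors(vectors: np.ndarray, dtype: str = "float16") -> dict:
    """Pack a (tokens, dim) array into a JSON-safe base64 payload.
    Illustrative helper: payload field names are assumptions."""
    arr = np.ascontiguousarray(vectors.astype(dtype))
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": dtype,
        "shape": list(arr.shape),
    }

def decode_vectors(payload: dict) -> np.ndarray:
    raw = base64.b64decode(payload["data"])
    return np.frombuffer(raw, dtype=payload["dtype"]).reshape(payload["shape"])

q = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
roundtrip = decode_vectors(encode_vectors(q))
# float16 round-trip is lossy but close for unit-scale embeddings
```

Base64-over-JSON trades ~33% payload inflation for compatibility with any HTTP client, which is why the reference API standardizes on it.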
Documentation
- Quickstart
- Installation
- Python API Reference
- Reference API Tutorial
- Shard Engine Guide
- Latence Graph Sidecar Guide
- Enterprise Control Plane Boundary
- Max-Performance Guide
- Scaling Guide
- Benchmarks And Methodology
- Production Notes
- Contributing
- Releasing
- Security
Install From Source
git clone https://github.com/ddickmann/voyager-index.git
cd voyager-index
bash scripts/install_from_source.sh --cpu
Project Health
- PRs: pull request template
- Issues: bug report and feature request
- Community: Code of Conduct
- Security: see SECURITY.md
- Release process: see RELEASING.md
License
Apache-2.0. See LICENSE.