Shard-first late-interaction retrieval for ColBERT and ColPali style workloads with CPU/GPU modes, Triton MaxSim, BM25 hybrid search, durable CRUD/WAL, multimodal preprocessing, and base64-ready reference APIs.
voyager-index
Late-interaction retrieval for on-prem AI systems. Runs on a single machine, supports CPU or GPU execution, and keeps MaxSim as the truth scorer.
voyager-index is built for teams that want multi-vector-native, ColBERT/ColPali-style retrieval quality without adopting a distributed search stack. It combines proxy routing, exact or quantized MaxSim, multimodal preprocessing, and database-grade operations behind one API.
The OSS engine stays fast on its own. An optional Latence graph sidecar adds graph-aware candidate rescue, provenance, and freshness-aware metadata as a premium plane without becoming a hard dependency of the base retrieval path.
pip install "voyager-index[server,shard,gpu]"
voyager-index-server # OpenAPI at :8080/docs
For developers: one retrieval contract across CPU and GPU, with real APIs and real failure recovery. For infrastructure leaders: strong late-interaction search on modest hardware, without taking on distributed search complexity.
Start Here
Use the shard-first production lane first. The canonical CPU-safe production install is one command, then you can layer GPU support on top when needed:
pip install "voyager-index[full]" # full public CPU surface
pip install "voyager-index[full,gpu]" # add Triton GPU scoring on CUDA hosts
HOST=0.0.0.0 WORKERS=4 voyager-index-server
# OpenAPI docs at http://127.0.0.1:8080/docs
Fine-grained profiles stay available:
pip install "voyager-index[shard]" # minimal shard path
pip install "voyager-index[server,shard]" # reference API
pip install "voyager-index[server,shard,shard-native]" # Rust shard CPU fast-path
pip install "voyager-index[server,shard,solver]" # Tabu Search solver only
pip install "voyager-index[server,shard,native]" # both public native wheels
pip install "voyager-index[server,shard,latence-graph]" # graph lane without native extras
If you are evaluating quickly:
- run the Quickstart
- use the Reference API Tutorial for the HTTP path
- use the Shard Engine Guide for the high-performance lane
- use the Latence Graph Sidecar Guide for the optional premium lane
Who This Fits
voyager-index is a strong fit when:
- late-interaction retrieval quality matters
- on-prem deployment matters
- single-node operability matters
- CPU and GPU flexibility matters
- you want an API-facing retrieval service, not just an offline benchmark artifact
It is probably not the first choice if you need:
- a large distributed control plane across many nodes
- purely dense ANN retrieval at extreme scale
- a hosted multi-tenant SaaS search platform
Why
Most retrieval systems optimize the shortlist and treat late interaction as an add-on. voyager-index is built the other way around: MaxSim is the final scorer, and the rest of the system exists to make that practical in production.
- Proxy routing instead of mandatory graph dependency. A learned proxy router collapses multi-vector candidate generation to ANN over compact routing representations, then hands off to exact or quantized MaxSim. The optional Latence graph lane augments after first-stage retrieval instead of replacing the router.
- Fast CPU and GPU execution. Rust fused scoring for CPU, Triton kernels for GPU, with the same retrieval contract across both modes.
- Operational features included. CRUD, WAL, checkpoint, recovery, metadata, and API serving are part of the system, not an afterthought.
- Built for single-node deployment. No distributed control plane required for the common on-prem use case.
Current Public Proof
voyager-index has two public proof layers, and they should be read together:
- Core production lane: the shard-first route is proven by the BEIR shard benchmark in benchmarks/beir_benchmark.py. That harness measures search-only GPU-corpus Triton MaxSim and CPU multi-worker fused Rust scoring on the same production lane the API serves, with current public results showing 2.6-5.0 ms GPU P95, 164.8-346.8 GPU QPS, and 41.6-271.7 CPU QPS across the listed BEIR sets.
- Optional Latence graph lane: the graph lane is proven separately by tools/benchmarks/benchmark_latence_graph_quality.py plus the graph tests. In the current representative harness it delivers +0.75 recall, +0.333 NDCG, and +0.75 support coverage on graph-shaped queries, 0.0 ordinary-query deltas, 57% graph activation, 3.5 average added candidates on graph-shaped queries, and passing route-conformance checks.
The graph proof is intentionally scoped: it shows the shipped graph contract, additive rescue semantics, provenance tagging, and retrieval uplift on graph-shaped fixtures. It is not presented as a graph-on BEIR table.
The graph data itself comes from structured Latence graph data derived from the indexed corpus and synchronized into the sidecar as target-linked graph contracts. The public guide explains the architecture and provenance model without exposing proprietary internals.
For full methodology and benchmark caveats, see Benchmarks And Methodology.
BEIR Benchmark
Measured on NVIDIA RTX A5000 (24 GB) using lightonai/GTE-ModernColBERT-v1.
Numbers below are search-only and exclude query encoding. CPU results use
8 native Rust workers. These are full-query-set results, not a sampled
subset.
These results are meant to show three things:
- Retrieval quality on standard BEIR datasets
- Search latency and throughput under realistic conditions
- What is achievable on modest on-prem hardware
| Dataset | Documents | MAP@100 | NDCG@10 | NDCG@100 | Recall@10 | Recall@100 | GPU QPS | GPU P95 (ms) | CPU QPS | CPU P95 (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| arguana | 8,674 | 0.2598 | 0.3679 | 0.4171 | 0.7402 | 0.9586 | 270.0 | 4.1 | 41.6 | 202.7 |
| fiqa | 57,638 | 0.3818 | 0.4436 | 0.5049 | 0.5059 | 0.7297 | 164.8 | 5.0 | 80.2 | 115.7 |
| nfcorpus | 3,633 | 0.1963 | 0.3833 | 0.3485 | 0.3404 | 0.3348 | 282.6 | 3.8 | 123.3 | 84.4 |
| quora | 15,675 | 0.9686 | 0.9766 | 0.9790 | 0.9930 | 0.9993 | 346.8 | 2.6 | 271.7 | 46.9 |
| scidocs | 25,657 | 0.1383 | 0.1977 | 0.2763 | 0.2070 | 0.4369 | 246.8 | 4.3 | 83.9 | 111.8 |
| scifact | 5,183 | 0.7141 | 0.7544 | 0.7730 | 0.8766 | 0.9567 | 263.4 | 4.0 | 69.1 | 138.4 |
How to read these results
- GPU P95 under 6 ms across all listed datasets shows the fast path is practical on A5000-class hardware.
- CPU mode remains viable when GPU capacity is limited or reserved for model serving.
- Quality metrics are strong while using the same shard/Rust/Triton retrieval stack that powers the production API.
Comparison note: next-plaid
next-plaid is an important open-source reference for ColBERT-style serving. Their published numbers are measured on NVIDIA H100 80 GB with the same embedding model. Our numbers above are measured on an RTX A5000 and are search-only; their reported QPS includes encoding. Quora is omitted below because their README uses a much larger corpus for that dataset.
| Dataset | System | NDCG@10 | MAP@100 | Recall@100 | GPU QPS | GPU P95 (ms) | CPU QPS | CPU P95 (ms) |
|---|---|---|---|---|---|---|---|---|
| arguana | voyager | 0.3679 | 0.2598 | 0.9586 | 270.0 | 4.1 | 41.6 | 202.7 |
| arguana | next-plaid | 0.3499 | 0.2457 | 0.9337 | 13.6 | 170.1 | 17.4 | 454.7 |
| fiqa | voyager | 0.4436 | 0.3818 | 0.7297 | 164.8 | 5.0 | 80.2 | 115.7 |
| fiqa | next-plaid | 0.4506 | 0.3871 | 0.7459 | 18.2 | 170.6 | 17.6 | 259.1 |
| nfcorpus | voyager | 0.3833 | 0.1963 | 0.3348 | 282.6 | 3.8 | 123.3 | 84.4 |
| nfcorpus | next-plaid | 0.3828 | 0.1870 | 0.3228 | 6.6 | 262.1 | 16.9 | 219.4 |
| scidocs | voyager | 0.1977 | 0.1383 | 0.4369 | 246.8 | 4.3 | 83.9 | 111.8 |
| scidocs | next-plaid | 0.1914 | 0.1352 | 0.4418 | 17.5 | 139.3 | 16.5 | 281.7 |
| scifact | voyager | 0.7544 | 0.7141 | 0.9567 | 263.4 | 4.0 | 69.1 | 138.4 |
| scifact | next-plaid | 0.7593 | 0.7186 | 0.9633 | 7.9 | 169.5 | 16.9 | 305.4 |
In our current benchmark setup, voyager-index is competitive or better on retrieval quality across most listed datasets and shows materially higher search throughput with much lower P95 latency on an RTX A5000. This is not a fully apples-to-apples comparison: next-plaid reports H100 numbers and includes encoding in QPS, while our numbers are search-only on a smaller GPU. The table above uses full-query evaluation specifically to avoid publishing a flattering slice.
Architecture
voyager-index separates the problem into routing, storage, exact scoring, optimization, durability, and serving. This keeps the retrieval contract stable across CPU, GPU, and mixed deployment modes.
query vectors (token / patch embeddings)
→ LEMUR routing MLP
→ FAISS ANN over routing representations
→ candidate document IDs
→ optional BM25 fusion when query_text is available
→ optional centroid-approx or doc-mean proxy pruning
→ optional ColBANDIT query-time pruning
→ exact or quantized MaxSim
  - Rust fused exact (CPU, mmap, SIMD, GIL-free)
  - Triton FP16 / INT8 / FP8 / ROQ-4 (GPU)
  - GPU-corpus gather + rerank
→ optional Latence graph augmentation
→ optional solver/context packing
→ top-K results or packed context
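The pipeline ends with MaxSim as the truth scorer. As an illustrative sketch only (the shipped kernels are Rust and Triton, not numpy), the late-interaction scoring at the exact stage can be written as:

```python
import numpy as np

def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the
    best-matching document token similarity, then sum over query tokens."""
    # query: (n_query_tokens, dim), doc: (n_doc_tokens, dim)
    sims = query @ doc.T                  # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # best doc token per query token

rng = np.random.default_rng(7)
query = rng.normal(size=(16, 128)).astype("float32")
docs = [rng.normal(size=(16, 128)).astype("float32") for _ in range(32)]

# Rank all candidates by exact MaxSim and keep the top 5 — what the
# exact stage does after routing has shrunk the candidate set.
scores = np.array([maxsim(query, d) for d in docs])
top5 = np.argsort(scores)[::-1][:5]
```

The routing stages above exist purely to shrink the set of documents this exact scorer has to touch.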
| Layer | What it does | Why it matters |
|---|---|---|
| Routing | LEMUR MLP, FAISS MIPS, candidate budgets | Makes late interaction tractable without graph construction |
| Storage | Safetensors shards, merged mmap, GPU-resident corpus | Honest CPU and GPU layouts for any corpus size |
| Exact scoring | Triton MaxSim, Rust fused MaxSim, quantized kernels | MaxSim stays the truth scorer across all deployment shapes |
| Optimization | ColBANDIT pruning, centroid approximation, ROQ-4 | Moves the latency/recall frontier without changing the retrieval contract |
| Optional graph plane | Latence graph sidecar, target-linked graph contracts, additive rescue, provenance | Keeps graph awareness premium and post-retrieval |
| Durability | WAL, memtable, checkpoint, crash recovery | A retrieval engine that behaves like a real database |
| Serving | FastAPI, base64 transport, multi-worker, OpenAPI | One pip install, one server, one API contract |
What Makes It Different
No mandatory graph dependency
voyager-index uses proxy routing plus exact MaxSim reranking without requiring a graph build step in the OSS serving path. When installed, the optional Latence graph sidecar is invoked after first-stage retrieval and merged additively. That keeps the system simpler to operate while preserving a premium graph lane.
Rust + Triton hot paths
The CPU path is a native Rust extension (latence_shard_engine) with
memory-mapped shards, fused MaxSim, SIMD acceleration, and GIL-free execution.
The GPU path uses Triton kernels for exact and quantized scoring with
variable-length document scheduling.
Research-backed features in the serving path
LEMUR routing, ColBANDIT query-time pruning, ROQ rotational quantization, and budget-aware context optimization are integrated into the shipped system rather than isolated in research notebooks.
Operational features, not just benchmarking
CRUD, WAL, checkpoint, crash recovery, payload metadata, scroll, and retrieve are included because retrieval systems in production need operational discipline, not just benchmark wins.
Multimodal native
The same serving stack supports text token embeddings (ColBERT) and image patch embeddings (ColPali/ColQwen), with preprocessing for PDF, DOCX, XLSX, and image inputs.
Quickstart
Install
pip install "voyager-index[full]" # full public CPU surface
pip install "voyager-index[full,gpu]" # + Triton GPU kernels on CUDA hosts
pip install "voyager-index[server,shard]" # + FastAPI server only
pip install "voyager-index[server,shard,shard-native]" # + Rust shard CPU fast-path
pip install "voyager-index[server,shard,solver]" # + Tabu Search solver only
Python SDK
import numpy as np
from voyager_index import Index
rng = np.random.default_rng(7)
docs = [rng.normal(size=(16, 128)).astype("float32") for _ in range(32)]
query = rng.normal(size=(16, 128)).astype("float32")
idx = Index(
"demo-index",
dim=128,
engine="shard",
n_shards=32,
k_candidates=256,
compression="fp16",
)
idx.add(docs, ids=list(range(len(docs))))
results = idx.search(query, k=5)
print(results[0])
idx.close()
HTTP API
HOST=0.0.0.0 WORKERS=4 voyager-index-server
# OpenAPI docs at http://127.0.0.1:8080/docs
import numpy as np
import requests
from voyager_index import encode_vector_payload
query = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
response = requests.post(
"http://127.0.0.1:8080/collections/my-shard/search",
json={
"vectors": encode_vector_payload(query, dtype="float16"),
"top_k": 5,
"quantization_mode": "fp8",
"use_colbandit": True,
},
timeout=30,
)
print(response.json()["results"][0])
Docker
docker build -f deploy/reference-api/Dockerfile -t voyager-index .
docker run -p 8080:8080 -v "$(pwd)/data:/data" voyager-index
Execution Modes
| Mode | Corpus placement | Best for |
|---|---|---|
| CPU exact | Disk/mmap → Rust fused MaxSim | Simplest deployment, no GPU required |
| GPU streamed | Disk/CPU → GPU transfer → Triton MaxSim | Large corpora that don't fit in VRAM |
| GPU corpus | Fully resident in VRAM | Lowest latency when corpus fits |
All three modes share the same collection format, API contract, and retrieval semantics. Start with CPU, add GPU when latency matters.
Engineering Knobs
Routing and candidate budgets
- `k_candidates` — LEMUR candidate budget before exact scoring
- `max_docs_exact` — hard ceiling for the exact-stage document set
- `n_full_scores` — proxy shortlist size before full MaxSim
- `use_colbandit` — enable query-time pruning
Storage and transfer
- `n_shards` — number of storage shards
- `compression` — `fp16`, `int8`, or `roq4`
- `transfer_mode` — `pageable`, `pinned`, or `double_buffered`
Scoring and hardware
- `quantization_mode` — `exact`, `int8`, `fp8`, or `roq4`
- `router_device` — where LEMUR executes (`cpu` or `cuda`)
- `gpu_corpus_rerank_topn` — GPU rerank frontier for corpus-resident mode
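To build intuition for what the quantized scoring modes trade away, here is a self-contained sketch of symmetric per-document int8 quantization with exact-versus-quantized MaxSim. This is an illustration of the general technique, not the shipped int8/fp8/ROQ-4 kernels, which are considerably more sophisticated:

```python
import numpy as np

def quantize_int8(doc: np.ndarray):
    """Symmetric int8 quantization with one scale per document."""
    scale = float(np.abs(doc).max()) / 127.0
    q = np.clip(np.round(doc / scale), -127, 127).astype(np.int8)
    return q, scale

def maxsim_int8(query: np.ndarray, q_doc: np.ndarray, scale: float) -> float:
    # Dequantize once, then score; real kernels accumulate in integer
    # arithmetic and rescale at the end.
    sims = query @ (q_doc.astype(np.float32) * scale).T
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 64)).astype("float32")
doc = rng.normal(size=(12, 64)).astype("float32")

exact = float((query @ doc.T).max(axis=1).sum())
q_doc, scale = quantize_int8(doc)
approx = maxsim_int8(query, q_doc, scale)
rel_err = abs(approx - exact) / abs(exact)  # small relative to exact score
```

The payoff is bandwidth: int8 halves the bytes moved per document versus fp16, which is usually where quantized MaxSim wins on memory-bound hardware.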
Hybrid and Multimodal
- BM25 + dense fusion via `rrf` or Tabu Search refinement
- Optional Latence graph augmentation via `graph_mode`, independent graph budgets, and `graph_explain`
- Multimodal collections share the same base64 vector contract as text
- Document preprocessing handles PDF, DOCX, XLSX, and images via `render_documents()` and the `/reference/preprocess/documents` API endpoint
Production note:
- `dense` HTTP collections are where BM25 + `query_text` hybrid search lives today
- `shard` HTTP collections are the high-performance vector route and can still use the optional graph lane additively through `graph_mode` and `query_payload`
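Reciprocal Rank Fusion, the `rrf` option mentioned above, is a simple rank-based combiner. A minimal sketch of the standard algorithm (not the library's internal implementation; document IDs here are made up):

```python
def rrf_fuse(rankings, k: int = 60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank).
    k=60 is the conventional default; it damps the influence of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # lexical shortlist
dense_ranking = ["d1", "d9", "d3", "d2"]  # late-interaction shortlist
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because RRF only consumes ranks, it needs no score normalization between BM25 and MaxSim, which is why it is a common default for hybrid fusion.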
API Surface
| Endpoint | Purpose |
|---|---|
| `POST /collections/{name}` | Create collection |
| `GET /collections/{name}/info` | Inspect collection tuning, health, and graph sync state |
| `POST /collections/{name}/points` | Add / upsert documents |
| `POST /collections/{name}/search` | Search |
| `POST /collections/{name}/search/batch` | Batch search |
| `GET /collections/{name}/scroll` | Scroll through results |
| `POST /collections/{name}/retrieve` | Retrieve by ID |
| `DELETE /collections/{name}/points` | Delete documents |
| `POST /collections/{name}/checkpoint` | Checkpoint WAL |
| `GET /collections/{name}/wal/status` | WAL status |
| `POST /encode` | Encode text/images to vectors |
| `POST /rerank` | Rerank results |
| `POST /reference/optimize` | Context packing (Tabu Search) |
Graph-aware search uses the same search endpoint and adds:
- `graph_mode`
- `graph_local_budget`
- `graph_community_budget`
- `graph_evidence_budget`
- `graph_explain`
- `query_payload` for ontology hints, workflow hints, and vector-only graph policy steering
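A graph-aware request is just the ordinary search body plus the fields above. The following sketch shows one plausible shape; the budget values, the `graph_mode` value, and the `query_payload` contents are illustrative assumptions, and `"vectors"` would normally come from `encode_vector_payload()` as in the Quickstart:

```python
import json

# Hypothetical graph-aware search body. Field names come from the
# parameter list above; the concrete values are made up for illustration.
body = {
    "vectors": "<base64-encoded float16 query tokens>",
    "top_k": 10,
    "graph_mode": "additive",          # assumed value, not a documented enum
    "graph_local_budget": 8,
    "graph_community_budget": 4,
    "graph_evidence_budget": 4,
    "graph_explain": True,
    "query_payload": {"ontology_hints": ["incident", "root_cause"]},
}
payload = json.dumps(body)  # ready to POST to /collections/{name}/search
```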
Public Python Surface
- `Index` and `IndexBuilder` — local shard collections
- `SearchPipeline` — dense + sparse fusion in-process
- `ColbertIndex` — late-interaction text workflows
- `ColPaliEngine` and `MultiModalEngine` — multimodal retrieval
- `encode_vector_payload()`, `decode_payload()` — base64 transport helpers
- `voyager-index-server` — reference HTTP server
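The base64 transport helpers exist so multi-vector payloads travel cleanly over JSON. As a sketch of the general idea only — the field names and layout below are hypothetical, not the actual `encode_vector_payload()` wire format — a float16 round-trip looks like:

```python
import base64
import numpy as np

def encode_vectors(vectors: np.ndarray, dtype: str = "float16") -> dict:
    """Pack a (tokens, dim) array into a JSON-safe base64 payload.
    Illustrative helper: payload field names are assumptions."""
    arr = np.ascontiguousarray(vectors.astype(dtype))
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": dtype,
        "shape": list(arr.shape),
    }

def decode_vectors(payload: dict) -> np.ndarray:
    raw = base64.b64decode(payload["data"])
    return np.frombuffer(raw, dtype=payload["dtype"]).reshape(payload["shape"])

q = np.random.default_rng(7).normal(size=(16, 128)).astype("float32")
roundtrip = decode_vectors(encode_vectors(q))
# float16 round-trip is lossy but close for unit-scale embeddings
```

Base64-over-JSON trades ~33% payload inflation for compatibility with any HTTP client, which is why the reference API standardizes on it.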
Documentation
- Quickstart
- Installation
- Python API Reference
- Reference API Tutorial
- Shard Engine Guide
- Latence Graph Sidecar Guide
- Enterprise Control Plane Boundary
- Max-Performance Guide
- Scaling Guide
- Benchmarks And Methodology
- Production Notes
- Contributing
- Releasing
- Security
Install From Source
git clone https://github.com/ddickmann/voyager-index.git
cd voyager-index
bash scripts/install_from_source.sh --cpu
Project Health
- PRs: pull request template
- Issues: bug report and feature request
- Community: Code of Conduct
- Security: see SECURITY.md
- Release process: see RELEASING.md
License
Apache-2.0. See LICENSE.