Fixed Dimensional Encodings for multi-vector retrieval (MUVERA) — Python port of Google's graph-mining implementation

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

smarthi

These details have not been verified by PyPI

Project links

Documentation

Project description

Implementation of MUVERA: Retrieval via Fixed Dimension Encodings

Fixed Dimensional Encodings for multi-vector retrieval.

A Python port of Google's graph-mining MUVERA implementation.
Paper: MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings (Dhulipala et al., 2024).

What is MUVERA?

Late-interaction retrieval models like ColBERT, ColPali, and ColQwen2 represent each query and document as a variable-length set of token embeddings rather than a single vector. Scoring two sets requires the computationally expensive MaxSim (Chamfer Similarity) operation:

Chamfer(Q, D) = Σ_{q ∈ Q} max_{d ∈ D} cos(q, d)

This makes large-scale ANN retrieval impractical with standard indexes.
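
For reference, the formula above can be computed directly in NumPy. This is a minimal sketch and not part of pymuvera; the function name `chamfer_similarity` is hypothetical.

```python
import numpy as np


def chamfer_similarity(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Exact Chamfer (MaxSim) score: sum over query tokens of the best cosine match."""
    # L2-normalise rows so that a plain dot product equals cosine similarity.
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sims = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())


# Two query tokens and three document tokens, in 2-D for readability.
Q = np.array([[1.0, 0.0], [0.0, 2.0]])
D = np.array([[3.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
score = chamfer_similarity(Q, D)        # 1 + 1/sqrt(2)
```

Every query token is compared against every document token, so scoring a single pair costs O(|Q| × |D| × d). Repeating that for each document in a corpus is what rules out a linear scan at scale.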

MUVERA solves this by converting each multi-vector set into a single fixed-dimensional vector (FDE) such that:

fde_query(Q) · fde_doc(D)  ≈  Chamfer(Q, D)

Standard ANN libraries (FAISS, ScaNN, OpenSearch k-NN) can then index FDE vectors directly, restoring sub-linear retrieval for late-interaction models.

Installation

pip install pymuvera

Requires Python ≥ 3.12, NumPy ≥ 1.24, Pydantic ≥ 2.0.

Quick start

import numpy as np
from muvera_fde import MUVERAEncoder

# One encoder instance for both queries and documents — seed must match
enc = MUVERAEncoder(
    dimension=128,              # ColBERT / ColQwen2 token embedding dimension
    num_simhash_projections=4,  # 2^4 = 16 partitions per repetition
    num_repetitions=2,          # 2 independent repetitions
    seed=42,
)

print(enc)
# MUVERAEncoder(dimension=128, num_simhash_projections=4, num_repetitions=2,
#               projection_type=DEFAULT_IDENTITY, fde_dimension=4096)

query_tokens = np.random.randn(32,  128).astype(np.float32)   # 32 query tokens
doc_tokens   = np.random.randn(512, 128).astype(np.float32)   # 512 document tokens

q_fde = enc.encode_query(query_tokens)    # shape: (4096,)
d_fde = enc.encode_document(doc_tokens)   # shape: (4096,)

# Approximate Chamfer Similarity — drop into any ANN index as a float32 vector
score = float(q_fde @ d_fde)

API reference

`MUVERAEncoder`

The primary entry point. Initialise once and reuse for all queries and documents — the random partition structure (SimHash matrices, Count Sketch parameters) must be identical on both sides.

MUVERAEncoder(
    dimension: int = 128,
    num_simhash_projections: int = 4,
    num_repetitions: int = 1,
    seed: int = 1,
    projection_type: ProjectionType = ProjectionType.DEFAULT_IDENTITY,
    projection_dimension: int | None = None,
    simhash_rank: int = 1,
    fill_empty_partitions: bool = False,
    final_projection_dimension: int | None = None,
)

Parameter	Default	Description
`dimension`	128	Token embedding dimension
`num_simhash_projections`	4	SimHash bits k; partitions = 2^k
`num_repetitions`	1	Independent repetitions (more → better approximation)
`seed`	1	Shared RNG seed — must match query and document sides
`projection_type`	`DEFAULT_IDENTITY`	`DEFAULT_IDENTITY`, `AMS_SKETCH` (Count Sketch on token embeddings), or `LOW_RANK_GAUSSIAN` (low-rank factored SimHash)
`projection_dimension`	`None`	Target dim after Count Sketch; required for `AMS_SKETCH`
`simhash_rank`	1	Rank r for `LOW_RANK_GAUSSIAN`; must satisfy `1 ≤ r < num_simhash_projections`. r=4 is a practical sweet spot for ColQwen2 (d=128, k≥8)
`fill_empty_partitions`	`False`	Document side: fill empty slots via Hamming-nearest-neighbour
`final_projection_dimension`	`None`	Post-accumulation Count Sketch compression

Property: fde_dimension — output vector length.
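
The worked examples in this README are consistent with a simple size rule: repetitions × 2^k partitions × per-partition width, with `final_projection_dimension` overriding the result when it is set. The helper below is a hypothetical sketch of that arithmetic, inferred from those examples rather than taken from the library.

```python
from typing import Optional


def expected_fde_dimension(
    dimension: int,
    num_simhash_projections: int,
    num_repetitions: int,
    projection_dimension: Optional[int] = None,
    final_projection_dimension: Optional[int] = None,
) -> int:
    """Output length implied by the README examples (an inferred formula)."""
    if final_projection_dimension is not None:
        # Post-accumulation compression fixes the output length outright.
        return final_projection_dimension
    # Per-partition width: the sketch width if set, else the raw token dimension.
    width = projection_dimension if projection_dimension is not None else dimension
    return num_repetitions * (2 ** num_simhash_projections) * width


quick_start = expected_fde_dimension(128, 4, 2)                                 # 4096
sketched = expected_fde_dimension(128, 4, 4, projection_dimension=32)           # 2048
compressed = expected_fde_dimension(128, 4, 4, final_projection_dimension=512)  # 512
low_rank = expected_fde_dimension(128, 8, 4)                                    # 131072
```

Treat `enc.fde_dimension` as the source of truth when sizing an index; the helper is only useful for estimating memory before constructing an encoder.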

Encoding single inputs

enc = MUVERAEncoder(dimension=128, num_simhash_projections=4, num_repetitions=2)

# Query: SUM aggregation — token embeddings summed into their SimHash partition
q_fde = enc.encode_query(query_tokens)    # (num_tokens, 128) → (fde_dim,)

# Document: AVERAGE aggregation — centroid of tokens per partition
d_fde = enc.encode_document(doc_tokens)   # (num_tokens, 128) → (fde_dim,)

# Both also accept flat 1-D input (num_tokens * dimension,)
q_fde = enc.encode_query(query_tokens.flatten())
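
To make the SUM versus AVERAGE distinction concrete, here is a simplified single-repetition sketch in plain NumPy. It is not the library's code: the helper name `toy_fde` is hypothetical, the partition index is a plain binary encoding of the sign bits (an assumption; the upstream implementation may number partitions differently), and no projection or empty-slot filling is applied.

```python
import numpy as np


def toy_fde(tokens: np.ndarray, hyperplanes: np.ndarray, average: bool) -> np.ndarray:
    """One-repetition FDE sketch: bucket tokens by SimHash sign pattern, then aggregate."""
    num_bits = hyperplanes.shape[1]
    num_partitions = 2 ** num_bits
    # SimHash: one bit per hyperplane, set when the projection is positive.
    bits = (tokens @ hyperplanes > 0).astype(np.int64)       # (num_tokens, k)
    partition = bits @ (1 << np.arange(num_bits))             # (num_tokens,)
    out = np.zeros((num_partitions, tokens.shape[1]), dtype=np.float64)
    np.add.at(out, partition, tokens)                         # SUM per partition
    if average:
        counts = np.bincount(partition, minlength=num_partitions)
        nonempty = counts > 0
        out[nonempty] /= counts[nonempty, None]               # AVERAGE per partition
    return out.ravel()                                        # (2^k * dimension,)


rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 3))          # d=16, k=3 -> 8 partitions
query = rng.standard_normal((5, 16))
doc = rng.standard_normal((40, 16))

q_vec = toy_fde(query, planes, average=False)  # query side: SUM
d_vec = toy_fde(doc, planes, average=True)     # document side: AVERAGE
approx_score = float(q_vec @ d_vec)
```

The dot product pairs each query partition sum with the centroid of the document tokens that landed in the same partition, which is how a single inner product stands in for per-token maximum matching.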

Batch encoding

queries   = [np.random.randn(32,  128).astype(np.float32) for _ in range(100)]
documents = [np.random.randn(512, 128).astype(np.float32) for _ in range(1000)]

Q = enc.encode_queries_batch(queries)     # shape: (100,  fde_dimension)
D = enc.encode_documents_batch(documents) # shape: (1000, fde_dimension)

# All-pairs approximate Chamfer Similarities in one matmul
scores = Q @ D.T   # shape: (100, 1000)
top_k  = np.argsort(scores, axis=1)[:, ::-1][:, :10]  # top-10 per query
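
A full argsort sorts every row completely. When the corpus is large and only a few results are needed, `np.argpartition` followed by a sort of just the k winners gives the same ranking (ties aside). A sketch with a hypothetical helper name `top_k_per_row`:

```python
import numpy as np


def top_k_per_row(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores in each row, best first."""
    # Partition so the k largest entries of each row occupy the last k columns (unordered).
    part = np.argpartition(scores, -k, axis=1)[:, -k:]
    # Sort only those k candidates by descending score.
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)
    return np.take_along_axis(part, order, axis=1)


rng = np.random.default_rng(0)
demo_scores = rng.standard_normal((100, 1000))
fast = top_k_per_row(demo_scores, 10)                      # shape: (100, 10)
full = np.argsort(demo_scores, axis=1)[:, ::-1][:, :10]    # reference ranking
```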

Reducing FDE size

Two orthogonal compression knobs:

Option A — per-partition Count Sketch (reduces width before accumulation):

from muvera_fde import ProjectionType

enc = MUVERAEncoder(
    dimension=128,
    num_simhash_projections=4,
    num_repetitions=4,
    projection_type=ProjectionType.AMS_SKETCH,
    projection_dimension=32,   # 128 → 32 per partition slot
)
# fde_dimension = 4 reps × 16 partitions × 32 = 2048  (vs 8192 without)

Option B — post-accumulation Count Sketch (compresses the final vector):

enc = MUVERAEncoder(
    dimension=128,
    num_simhash_projections=4,
    num_repetitions=4,
    final_projection_dimension=512,   # 8192 → 512
)
# fde_dimension = 512

Both preserve dot products in expectation: E[⟨sketch(x), sketch(y)⟩] = ⟨x, y⟩.
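
The unbiasedness claim can be checked empirically. The following is a generic Count Sketch in NumPy, written as an illustration under the assumption of fully random bucket and sign assignments; it is not taken from pymuvera, and `count_sketch` is a hypothetical name.

```python
import numpy as np


def count_sketch(x: np.ndarray, buckets: np.ndarray, signs: np.ndarray, width: int) -> np.ndarray:
    """Add each signed coordinate of x into its assigned bucket."""
    return np.bincount(buckets, weights=signs * x, minlength=width)


rng = np.random.default_rng(0)
dim, width, trials = 128, 32, 4000
x = rng.standard_normal(dim)
y = x + 0.5 * rng.standard_normal(dim)      # correlated with x, so <x, y> is well away from 0
exact = float(x @ y)

estimates = np.empty(trials)
for t in range(trials):
    # Fresh random assignments per trial; x and y must share them within a trial.
    buckets = rng.integers(0, width, size=dim)
    signs = rng.choice([-1.0, 1.0], size=dim)
    estimates[t] = count_sketch(x, buckets, signs, width) @ count_sketch(y, buckets, signs, width)

mean_estimate = float(estimates.mean())     # close to `exact`; individual estimates are noisy
```

Only the average over many sketches is close to the exact value. Any single sketch carries noise that grows as the sketch width shrinks, so a smaller `projection_dimension` or `final_projection_dimension` trades accuracy for size.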

Low-rank SimHash — faster partition assignment

Replaces the full (d × k) SimHash matrix with two smaller factors A ∈ ℝ^{d×r} and B ∈ ℝ^{k×r}, so the partition cost drops from O(N × d × k) to O(N × d × r + N × r × k).

from muvera_fde import ProjectionType

enc = MUVERAEncoder(
    dimension=128,
    num_simhash_projections=8,   # 2^8 = 256 partitions
    num_repetitions=4,
    projection_type=ProjectionType.LOW_RANK_GAUSSIAN,
    simhash_rank=4,              # r=4; cost: O(N×128×4 + N×4×8) = O(544N) vs O(1024N)
    seed=42,
)
# fde_dimension = 4 × 256 × 128 = 131072 (same formula as DEFAULT_IDENTITY)

q_fde = enc.encode_query(query_tokens)
d_fde = enc.encode_document(doc_tokens)
score = float(q_fde @ d_fde)

Convergence guarantee (EGGROLL, Sarkar et al. 2025, Theorem 4): the low-rank sign pattern converges to the full-rank Gaussian sign pattern at rate O(1/r), faster than the standard CLT rate O(1/√r), because symmetry cancels all odd cumulants in the Edgeworth expansion.

Practical targets for ColQwen2 (d=128):

`simhash_rank`	Variance vs full-rank	SimHash cost vs full-rank (k=8)
1	~100% baseline	136N vs 1024N — 7.5× faster
4	~25% increase	544N vs 1024N — 1.9× faster
8	~12% increase	1088N vs 1024N — breakeven

Note: Sign assignments are invariant to positive scaling (sign(αx) = sign(x) for any α > 0), so the 1/√r normalisation common in low-rank approximations is omitted. It has no effect on partition assignments.
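
The factorisation, the cost figures in the table, and the scaling remark can be illustrated in a few lines. This sketch assumes Gaussian factors and counts multiply-adds per token the same way the table does; `low_rank_simhash_bits` and `simhash_cost` are hypothetical names, not pymuvera functions.

```python
import numpy as np


def simhash_cost(d: int, k: int, r=None) -> int:
    """Multiply-adds per token: d*k for the full matrix, d*r + r*k when factored."""
    return d * k if r is None else d * r + r * k


def low_rank_simhash_bits(tokens: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Sign bits of tokens @ (A @ B.T), computed without forming the d-by-k product."""
    return (tokens @ A) @ B.T > 0     # (N, d)(d, r) -> (N, r); (N, r)(r, k) -> (N, k)


rng = np.random.default_rng(0)
d, k, r = 128, 8, 4
A = rng.standard_normal((d, r))
B = rng.standard_normal((k, r))
tokens = rng.standard_normal((32, d))

bits = low_rank_simhash_bits(tokens, A, B)                       # (32, 8) boolean
bits_scaled = low_rank_simhash_bits(tokens, A / np.sqrt(r), B)   # with 1/sqrt(r) scaling

costs = {rank: simhash_cost(d, k, rank) for rank in (1, 4, 8)}   # factored cost per token
full_cost = simhash_cost(d, k)                                   # full-rank cost per token
```

`bits` and `bits_scaled` are identical, which is the scale-invariance point made in the note above.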

Filling empty partition slots

With few document tokens and many partitions (large k), many slots will be empty (all-zero). Enabling fill_empty_partitions copies the projection of the nearest token by SimHash Hamming distance into each empty slot, improving recall for short documents:

enc = MUVERAEncoder(
    dimension=128,
    num_simhash_projections=4,
    num_repetitions=2,
    fill_empty_partitions=True,   # document side only; queries ignore this flag
)

short_doc_tokens = np.random.randn(8, 128).astype(np.float32)
d_fde = enc.encode_document(short_doc_tokens)   # no all-zero partition blocks
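
One way to picture the fill step: each partition corresponds to a k-bit sign pattern, and an empty partition borrows from the token whose own sign pattern differs from it in the fewest bits. The sketch below assumes plain binary partition numbering and breaks ties by lowest token index; both are assumptions made for illustration, and `nearest_token_for_empty` is a hypothetical helper.

```python
import numpy as np


def nearest_token_for_empty(token_bits: np.ndarray) -> dict:
    """Map each empty partition id to the index of its Hamming-nearest token."""
    num_tokens, k = token_bits.shape
    weights = 1 << np.arange(k)
    occupied = set((token_bits @ weights).tolist())
    fills = {}
    for pid in range(2 ** k):
        if pid in occupied:
            continue
        pattern = (pid >> np.arange(k)) & 1               # bit pattern of this partition
        hamming = (token_bits != pattern).sum(axis=1)     # distance to every token
        fills[pid] = int(hamming.argmin())                # ties -> lowest token index
    return fills


# Three tokens with k=2 sign bits each (least-significant bit first).
token_bits = np.array([[0, 0], [1, 0], [1, 0]])
fills = nearest_token_for_empty(token_bits)               # {2: 0, 3: 1}
```

Partitions 0 and 1 are occupied. Partition 2 (bits `[0, 1]`) is one bit away from token 0, and partition 3 (bits `[1, 1]`) is one bit away from tokens 1 and 2, so it takes token 1.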

Low-level functional API

Bypass the encoder class entirely when you need to manage parameters manually (e.g. distributed indexing where workers share pre-built parameters):

from muvera_fde import FDEConfig, generate_query_fde, generate_document_fde

config = FDEConfig(
    dimension=128,
    num_repetitions=2,
    num_simhash_projections=4,
    seed=42,
)

q_fde = generate_query_fde(query_tokens, config)
d_fde = generate_document_fde(doc_tokens, config)

# Pass pre-built RepParams to skip RNG sampling on every call.
# Note: _rep_params is a private attribute, so it may change between releases.
enc = MUVERAEncoder(dimension=128, num_repetitions=2, num_simhash_projections=4, seed=42)
q_fde = generate_query_fde(query_tokens, config, enc._rep_params)

`FDEConfig` serialization

FDEConfig is a frozen Pydantic model — save it alongside your ANN index so the encoder configuration is always recoverable:

import json
from muvera_fde import FDEConfig

config = FDEConfig(dimension=128, num_repetitions=4, num_simhash_projections=4, seed=42)

# Save (mode="json" converts non-JSON-native field values, such as enums, before dumping)
with open("fde_config.json", "w") as f:
    json.dump(config.model_dump(mode="json"), f)

# Load
with open("fde_config.json") as f:
    config2 = FDEConfig(**json.load(f))

assert config == config2

Two-stage retrieval pipeline

The intended production pattern for ColQwen2 / ColBERT:

Offline:
  doc token embeddings  →  encode_document()  →  FDE vector  →  ANN index

Online:
  query token embeddings  →  encode_query()  →  FDE vector
                                                     │
                                              ANN search (fast, sub-linear)
                                                     │
                                            top-K candidate docs
                                                     │
                                       MaxSim re-rank on raw token embeddings
                                                     │
                                               final top-K results

Stage 1 (ANN on FDE vectors) eliminates 99%+ of the corpus cheaply. Stage 2 (exact MaxSim on raw token embeddings) reranks the small candidate set for full accuracy.
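
Stage 2 can be sketched as follows. The candidate ids would normally come back from the ANN search; here they are hard-coded, and the `doc_store` dictionary mapping ids to raw token embeddings is a hypothetical stand-in for whatever storage holds them. The scoring uses raw dot products, so normalise the embeddings first if cosine similarity is wanted.

```python
import numpy as np


def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Exact MaxSim: for each query token take its best dot product over document tokens."""
    return float((query_tokens @ doc_tokens.T).max(axis=1).sum())


def rerank(query_tokens: np.ndarray, candidate_ids: list, doc_store: dict, top_k: int) -> list:
    """Re-score stage-1 candidates exactly and return the best top_k ids, best first."""
    scored = [(maxsim(query_tokens, doc_store[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


rng = np.random.default_rng(0)
doc_store = {i: rng.standard_normal((20, 8)) for i in range(50)}   # toy corpus
query = rng.standard_normal((4, 8))
doc_store[7] = np.vstack([doc_store[7], 10.0 * query])             # plant a strong match

candidates = [3, 7, 19, 42]          # pretend these came back from the ANN index
final_ids = rerank(query, candidates, doc_store, top_k=2)
```

Because exact scoring only touches the candidate set, its cost scales with the number of candidates rather than with corpus size.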

Minimal FAISS integration

import faiss
import numpy as np
from muvera_fde import MUVERAEncoder

enc = MUVERAEncoder(dimension=128, num_simhash_projections=4, num_repetitions=2, seed=42)
dim = enc.fde_dimension  # 4096

# Build index
# IndexFlatIP is an exact brute-force inner-product index; swap in an
# approximate index (e.g. faiss.IndexHNSWFlat) for sub-linear search.
index = faiss.IndexFlatIP(dim)

# Index documents (offline)
doc_embeddings = [...]   # list of (num_tokens, 128) float32 arrays
D = enc.encode_documents_batch(doc_embeddings)   # (N, 4096)
faiss.normalize_L2(D)    # in-place; scores become cosine similarity between FDEs
index.add(D)

# Query (online)
query_tokens = np.random.randn(32, 128).astype(np.float32)
q_fde = enc.encode_query(query_tokens).reshape(1, -1)
faiss.normalize_L2(q_fde)

_, candidate_ids = index.search(q_fde, k=100)   # stage 1: candidate retrieval
# stage 2: MaxSim re-rank candidate_ids with raw token embeddings ...

Attribution

Python port of the C++ implementation in Google's graph-mining project, licensed under Apache 2.0.

See NOTICE for the full upstream attribution.

License

Apache 2.0 — see LICENSE.

Project details

Release history

Version	Released
0.4.1	May 6, 2026
0.4.0	Apr 30, 2026
0.3.2	Apr 28, 2026
0.3.1	Apr 26, 2026
0.3.0	Apr 26, 2026
0.2.2	Apr 26, 2026
0.2.1	Apr 26, 2026
0.2.0 (this version)	Apr 26, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymuvera-0.2.0.tar.gz (24.2 kB)

Uploaded Apr 26, 2026 Source

Built Distribution

pymuvera-0.2.0-py3-none-any.whl (28.3 kB)

Uploaded Apr 26, 2026 Python 3

File details

Details for the file pymuvera-0.2.0.tar.gz.

File metadata

Download URL: pymuvera-0.2.0.tar.gz
Upload date: Apr 26, 2026
Size: 24.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pymuvera-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`92b05e18397a0cc5e5a9140c36a3f22e64636be892921a372dd30eea2aa168fe`
MD5	`60ced25bf45a06e25ab9e03128e42a7f`
BLAKE2b-256	`53e8eb60b96d647ce45bc84b3ce1543541451cde529c15d225c6c892ef36e05e`

Provenance

The following attestation bundles were made for pymuvera-0.2.0.tar.gz:

Publisher: ci.yml on smarthi/muvera-fde

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pymuvera-0.2.0.tar.gz
- Subject digest: 92b05e18397a0cc5e5a9140c36a3f22e64636be892921a372dd30eea2aa168fe
- Sigstore transparency entry: 1387577235
- Sigstore integration time: Apr 26, 2026
Source repository:
- Permalink: smarthi/muvera-fde@9a288ccf7450fbb52866466a49f73db0d6c4d778
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/smarthi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@9a288ccf7450fbb52866466a49f73db0d6c4d778
- Trigger Event: push

File details

Details for the file pymuvera-0.2.0-py3-none-any.whl.

File metadata

Download URL: pymuvera-0.2.0-py3-none-any.whl
Upload date: Apr 26, 2026
Size: 28.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pymuvera-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`606bd183a92650fcf2d97a10c6bcca26f7db854d23f0a42e2223084ebdc70ac2`
MD5	`cc4951774d5a171949a1526d3ee180b6`
BLAKE2b-256	`fd9a2684c4f4eb64b8930550ea51535873576b6bc90f06c705385e06a15d274c`

Provenance

The following attestation bundles were made for pymuvera-0.2.0-py3-none-any.whl:

Publisher: ci.yml on smarthi/muvera-fde

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pymuvera-0.2.0-py3-none-any.whl
- Subject digest: 606bd183a92650fcf2d97a10c6bcca26f7db854d23f0a42e2223084ebdc70ac2
- Sigstore transparency entry: 1387577515
- Sigstore integration time: Apr 26, 2026
Source repository:
- Permalink: smarthi/muvera-fde@9a288ccf7450fbb52866466a49f73db0d6c4d778
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/smarthi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@9a288ccf7450fbb52866466a49f73db0d6c4d778
- Trigger Event: push
