Python bindings for Google's graph-mining Fixed Dimensional Encoding (FDE) from MUVERA

MUVERA FDE

Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm from the graph-mining project, as described in the MUVERA paper.

Overview

This library provides Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm, a key component of MUVERA (Multi-Vector Retrieval Algorithm). FDE encodes a variable-sized set of vectors into a single fixed-dimensional vector while approximately preserving the similarity between sets.

Key Applications

Multi-vector retrieval: Efficiently search collections of multi-vector data (e.g., video segments, document chunks)
Chamfer similarity approximation: Approximate the Chamfer similarity between point clouds
Fixed-size embeddings: Convert variable-length sequences into fixed-size vectors for ML models
Efficient similarity search: Enable fast nearest-neighbor search on multi-vector data

Algorithm Details

FDE uses SimHash projections to partition the input space: each point is assigned to a partition by the signs of its projections onto random hyperplanes, so k projections yield 2^k partitions. The points that fall in each partition are aggregated into one block of the output vector, which preserves the similarity structure in a compact fixed-dimensional representation. The algorithm supports both sum aggregation (for queries) and average aggregation (for documents), making it suitable for asymmetric similarity search.
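The partition-and-aggregate step described above can be sketched in plain NumPy. This is a minimal illustration for a single repetition, not this library's implementation: the function names are hypothetical, the partition index is a plain binary code of the sign bits, and empty partitions are simply left as zeros.

```python
import numpy as np

def simhash_partition_ids(points, hyperplanes):
    """Map each point to a partition index from the signs of its projections."""
    bits = (points @ hyperplanes) > 0                 # (n, k) sign bits
    weights = 1 << np.arange(hyperplanes.shape[1])    # bit i contributes 2**i
    return bits.astype(np.int64) @ weights            # (n,) values in [0, 2**k)

def fde_single_repetition(points, hyperplanes, average=False):
    """Aggregate the points of each partition into one block of the output."""
    dim = points.shape[1]
    num_partitions = 1 << hyperplanes.shape[1]
    ids = simhash_partition_ids(points, hyperplanes)
    blocks = np.zeros((num_partitions, dim), dtype=np.float64)
    np.add.at(blocks, ids, points)                    # sum aggregation (queries)
    if average:                                       # average aggregation (documents)
        counts = np.bincount(ids, minlength=num_partitions)
        nonempty = counts > 0
        blocks[nonempty] /= counts[nonempty, None]
    return blocks.reshape(-1)                         # length 2**k * dim

rng = np.random.default_rng(0)
hyperplanes = rng.standard_normal((3, 4))             # dimension 3, k = 4 projections
points = rng.standard_normal((100, 3))
query_fde = fde_single_repetition(points, hyperplanes, average=False)
doc_fde = fde_single_repetition(points, hyperplanes, average=True)
print(query_fde.shape)  # (48,)
```

With several repetitions, each one would draw its own hyperplanes and the resulting vectors would be concatenated.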

Relationship to MUVERA

MUVERA (Multi-Vector Retrieval Algorithm) is Google's system for efficient multi-vector search, which reduces multi-vector retrieval to single-vector search. FDE is a core component of MUVERA that enables:

Compression: Reducing multiple vectors to a single fixed-size representation
Similarity Preservation: Approximating the Chamfer similarity between point sets
Asymmetric Search: Different encodings for queries (sum) and documents (average)
Scalability: Enabling billion-scale multi-vector retrieval

This implementation focuses on the FDE component, which can be used standalone or as part of a larger retrieval system.
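For reference, the quantity that FDE approximates can be computed exactly for small inputs. The sketch below assumes the Chamfer similarity as defined in the MUVERA paper (for each query point, the best inner product with any document point, summed over the query); the function name is illustrative and not part of this library.

```python
import numpy as np

def chamfer_similarity(query_points, document_points):
    """Sum, over query points, of the maximum inner product with any document point."""
    sims = query_points @ document_points.T    # (num_query, num_document)
    return float(sims.max(axis=1).sum())

query_points = np.array([[1.0, 0.0], [0.0, 1.0]])
document_points = np.array([[1.0, 0.0], [0.5, 0.5]])
print(chamfer_similarity(query_points, document_points))  # 1.5
```

The exact computation costs one inner product per query-document point pair, which is the cost a single FDE dot product is meant to avoid.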

Features

Zero-copy numpy integration: Efficient memory usage with numpy arrays
Flexible encoding types: Support for both query (sum) and document (average) encodings
Configurable parameters: Control dimensionality, projections, and aggregation methods
Modern Python packaging: Built with uv and scikit-build-core

Installation

From source with uv

cd muvera-fde
uv pip install -e .

Development installation

uv pip install -e ".[dev]"

Quick Start

import numpy as np
from muvera_fde import (
    FixedDimensionalEncodingConfig,
    FixedDimensionalEncoder,
    EncodingType
)

# Create configuration
config = FixedDimensionalEncodingConfig(
    dimension=3,                    # Input point dimension
    num_simhash_projections=8,      # Number of SimHash projections
    num_repetitions=2,              # Number of repetitions for robustness
    encoding_type=EncodingType.DEFAULT_SUM  # Sum aggregation for queries
)

# Initialize encoder
encoder = FixedDimensionalEncoder(config)

# Generate random point cloud
points = np.random.randn(100, 3).astype(np.float32)

# Encode the point cloud
encoding = encoder.encode(points)
print(f"Encoding shape: {encoding.shape}")
print(f"Output dimension: {encoder.output_dimension}")

Configuration Options

The FixedDimensionalEncodingConfig dataclass supports the following parameters:

dimension: Dimension of input points (default: 3)
num_repetitions: Number of encoding repetitions (default: 1)
num_simhash_projections: Number of SimHash projections (default: 8)
seed: Random seed for reproducibility (default: 1)
encoding_type: EncodingType.DEFAULT_SUM for queries, EncodingType.AVERAGE for documents
projection_dimension: Optional dimension to project points before encoding
projection_type: ProjectionType.DEFAULT_IDENTITY or ProjectionType.AMS_SKETCH
fill_empty_partitions: Whether to fill empty partitions instead of leaving them as zeros; in the MUVERA paper an empty partition takes the point nearest to it (default: False)
final_projection_dimension: Optional final output dimension
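As a rough guide to how these parameters determine the output size, the helper below follows the formula in the MUVERA paper: repetitions times 2^projections times the per-point dimension, unless a final projection is set. The helper is hypothetical and is an assumption about this library's behavior; encoder.output_dimension remains the authoritative value.

```python
def expected_fde_dimension(dimension, num_simhash_projections, num_repetitions,
                           projection_dimension=None,
                           final_projection_dimension=None):
    """Estimate the FDE length from the configuration parameters."""
    if final_projection_dimension is not None:
        # A final projection fixes the output length directly.
        return final_projection_dimension
    # Each partition stores one vector of the (possibly projected) point dimension.
    per_partition = dimension if projection_dimension is None else projection_dimension
    return num_repetitions * (2 ** num_simhash_projections) * per_partition

# Quick Start settings: 2 repetitions * 2**8 partitions * 3 dimensions
print(expected_fde_dimension(3, 8, 2))  # 1536
```

Because the length doubles with every extra SimHash projection, projection_dimension and final_projection_dimension are the usual ways to keep high-dimensional inputs manageable.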

Advanced Usage

Query vs Document Encoding

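# Assumes `np` and `encoder` from the Quick Start (dimension=3); the point sets are illustrative
query_points = np.random.randn(32, 3).astype(np.float32)
document_points = np.random.randn(100, 3).astype(np.float32)

# Query encoding (sum aggregation)
query_encoding = encoder.encode_query(query_points)

# Document encoding (average aggregation)
doc_encoding = encoder.encode_document(document_points)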
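Once queries and documents are encoded, retrieval reduces to inner products between fixed-size vectors. The sketch below uses small hand-written arrays in place of real encodings; rank_documents is a hypothetical helper, not part of this library.

```python
import numpy as np

def rank_documents(query_encoding, document_encodings, top_k=3):
    """Return the indices and scores of the top_k documents by inner product."""
    scores = document_encodings @ query_encoding   # one score per document
    order = np.argsort(-scores)[:top_k]            # highest score first
    return order, scores[order]

# Stand-ins for one query encoding and a stacked matrix of document encodings.
query_encoding = np.array([1.0, 0.0, 2.0])
document_encodings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
order, scores = rank_documents(query_encoding, document_encodings, top_k=2)
print(order.tolist(), scores.tolist())  # [3, 2] [3.0, 2.0]
```

In a larger system the brute-force matrix product would typically be replaced by an approximate nearest-neighbor index over the document encodings.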

Dimensionality Reduction

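from muvera_fde import FixedDimensionalEncodingConfig, ProjectionType

# Use AMS sketch projection: reduce 128-dimensional points to 32 dimensions
# before encoding, then project the final encoding down to 256 dimensions
config = FixedDimensionalEncodingConfig(
    dimension=128,
    projection_dimension=32,
    projection_type=ProjectionType.AMS_SKETCH,
    final_projection_dimension=256
)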
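To give a feel for what a sketch-style projection does, the example below builds a sparse random sign matrix in which every input coordinate is sent to one output coordinate with a random sign. This is one common construction for such sketches; whether ProjectionType.AMS_SKETCH in this library matches it exactly is an assumption, and sign_sketch_matrix is a hypothetical name.

```python
import numpy as np

def sign_sketch_matrix(input_dim, output_dim, seed=1):
    """Build an (input_dim, output_dim) matrix with one random +/-1 entry per row."""
    rng = np.random.default_rng(seed)
    cols = rng.integers(0, output_dim, size=input_dim)   # target coordinate per input
    signs = rng.choice([-1.0, 1.0], size=input_dim)      # random sign per input
    matrix = np.zeros((input_dim, output_dim))
    matrix[np.arange(input_dim), cols] = signs
    return matrix

points = np.random.default_rng(0).standard_normal((100, 128))
projected = points @ sign_sketch_matrix(128, 32)
print(projected.shape)  # (100, 32)
```

The same seed must be used for queries and documents so that both sides are projected by the same matrix.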

Building from Source

The project uses CMake and scikit-build-core for building the C++ extension. Eigen is automatically downloaded during the build process.

# Clean build
uv pip install --force-reinstall -e .

# Run tests
uv run pytest

Citation

If you use this library in your research, please cite the MUVERA paper:

@article{muvera2024,
  title={MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings},
  author={Google Research Team},
  journal={Google Research Blog},
  year={2024},
  url={https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/}
}

License

This project is licensed under the Apache License 2.0, the same license as the original Google graph-mining project. See the LICENSE file for details.

Acknowledgments

This library provides Python bindings for the Fixed Dimensional Encoding implementation from Google's graph-mining project.

The core C++ implementation is adapted from Google's original code with modifications for Python integration.

Version 0.1.0 (released Jul 27, 2025)

Download files

Source Distribution

muvera_fde-0.1.0.tar.gz (120.5 kB)

Uploaded Jul 27, 2025 Source

File details

Details for the file muvera_fde-0.1.0.tar.gz.

File metadata

Download URL: muvera_fde-0.1.0.tar.gz
Upload date: Jul 27, 2025
Size: 120.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.3

File hashes

Hashes for muvera_fde-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8bedbbb70d5332de53cba91d835315d480008355fa189b4b885dc2d625d0b50d`
MD5	`6c876814f146fd5b4753f0c05aea53e5`
BLAKE2b-256	`46750af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b`
