Python bindings for Google's graph-mining Fixed Dimensional Encoding (FDE) from MUVERA
Project description
MUVERA FDE
Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm from the graph-mining project, as described in the MUVERA paper.
Overview
This library provides Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm, a key component of the MUVERA (Multi-Vector Retrieval Aggregation) system. FDE is a technique for encoding variable-sized sets of vectors into fixed-dimensional representations while preserving similarity structure.
Key Applications
- Multi-vector retrieval: Efficiently search collections of multi-vector data (e.g., video segments, document chunks)
- Chamfer similarity approximation: Approximate Chamfer distance between point clouds
- Fixed-size embeddings: Convert variable-length sequences into fixed-size vectors for ML models
- Efficient similarity search: Enable fast nearest-neighbor search on multi-vector data
Algorithm Details
FDE uses SimHash projections to partition the input space and aggregates points within each partition. This approach preserves the similarity structure while providing a compact fixed-dimensional representation. The algorithm supports both sum aggregation (for queries) and average aggregation (for documents), making it suitable for asymmetric similarity search scenarios.
Relationship to MUVERA
MUVERA (Multi-Vector Retrieval Aggregation) is Google's system for efficient multi-vector search that achieves single-vector search speeds. FDE is a core component of MUVERA that enables:
- Compression: Reducing multiple vectors to a single fixed-size representation
- Similarity Preservation: Maintaining approximate Chamfer distances between point sets
- Asymmetric Search: Different encodings for queries (sum) and documents (average)
- Scalability: Enabling billion-scale multi-vector retrieval
This implementation focuses on the FDE component, which can be used standalone or as part of a larger retrieval system.
Features
- Zero-copy numpy integration: Efficient memory usage with numpy arrays
- Flexible encoding types: Support for both query (sum) and document (average) encodings
- Configurable parameters: Control dimensionality, projections, and aggregation methods
- Modern Python packaging: Built with
uvandscikit-build-core
Installation
From source with uv
cd muvera-fde
uv pip install -e .
Development installation
uv pip install -e ".[dev]"
Quick Start
import numpy as np
from muvera_fde import (
FixedDimensionalEncodingConfig,
FixedDimensionalEncoder,
EncodingType
)
# Create configuration
config = FixedDimensionalEncodingConfig(
dimension=3, # Input point dimension
num_simhash_projections=8, # Number of SimHash projections
num_repetitions=2, # Number of repetitions for robustness
encoding_type=EncodingType.DEFAULT_SUM # Sum aggregation for queries
)
# Initialize encoder
encoder = FixedDimensionalEncoder(config)
# Generate random point cloud
points = np.random.randn(100, 3).astype(np.float32)
# Encode the point cloud
encoding = encoder.encode(points)
print(f"Encoding shape: {encoding.shape}")
print(f"Output dimension: {encoder.output_dimension}")
Configuration Options
The FixedDimensionalEncodingConfig dataclass supports the following parameters:
dimension: Dimension of input points (default: 3)num_repetitions: Number of encoding repetitions (default: 1)num_simhash_projections: Number of SimHash projections (default: 8)seed: Random seed for reproducibility (default: 1)encoding_type:EncodingType.DEFAULT_SUMfor queries,EncodingType.AVERAGEfor documentsprojection_dimension: Optional dimension to project points before encodingprojection_type:ProjectionType.DEFAULT_IDENTITYorProjectionType.AMS_SKETCHfill_empty_partitions: Whether to fill empty partitions with zeros (default: False)final_projection_dimension: Optional final output dimension
Advanced Usage
Query vs Document Encoding
# Query encoding (sum aggregation)
query_encoding = encoder.encode_query(query_points)
# Document encoding (average aggregation)
doc_encoding = encoder.encode_document(document_points)
Dimensionality Reduction
# Use AMS sketch projection
config = FixedDimensionalEncodingConfig(
dimension=128,
projection_dimension=32,
projection_type=ProjectionType.AMS_SKETCH,
final_projection_dimension=256
)
Building from Source
The project uses CMake and scikit-build-core for building the C++ extension. Eigen is automatically downloaded during the build process.
# Clean build
uv pip install --force-reinstall -e .
# Run tests
uv run pytest
Citation
If you use this library in your research, please cite the MUVERA paper:
@article{muvera2024,
title={MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings},
author={Google Research Team},
journal={Google Research Blog},
year={2024},
url={https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/}
}
License
This project is licensed under the Apache License 2.0, the same license as the original Google graph-mining project. See the LICENSE file for details.
Acknowledgments
This library provides Python bindings for the Fixed Dimensional Encoding implementation from:
The core C++ implementation is adapted from Google's original code with modifications for Python integration.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file muvera_fde-0.1.0.tar.gz.
File metadata
- Download URL: muvera_fde-0.1.0.tar.gz
- Upload date:
- Size: 120.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8bedbbb70d5332de53cba91d835315d480008355fa189b4b885dc2d625d0b50d
|
|
| MD5 |
6c876814f146fd5b4753f0c05aea53e5
|
|
| BLAKE2b-256 |
46750af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b
|