Skip to main content

Python bindings for Google's graph-mining Fixed Dimensional Encoding (FDE) from MUVERA

Project description

MUVERA FDE

Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm from the graph-mining project, as described in the MUVERA paper.

Overview

This library provides Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm, a key component of the MUVERA (Multi-Vector Retrieval Aggregation) system. FDE is a technique for encoding variable-sized sets of vectors into fixed-dimensional representations while preserving similarity structure.

Key Applications

  • Multi-vector retrieval: Efficiently search collections of multi-vector data (e.g., video segments, document chunks)
  • Chamfer similarity approximation: Approximate Chamfer distance between point clouds
  • Fixed-size embeddings: Convert variable-length sequences into fixed-size vectors for ML models
  • Efficient similarity search: Enable fast nearest-neighbor search on multi-vector data

Algorithm Details

FDE uses SimHash projections to partition the input space and aggregates points within each partition. This approach preserves the similarity structure while providing a compact fixed-dimensional representation. The algorithm supports both sum aggregation (for queries) and average aggregation (for documents), making it suitable for asymmetric similarity search scenarios.

Relationship to MUVERA

MUVERA (Multi-Vector Retrieval Aggregation) is Google's system for efficient multi-vector search that achieves single-vector search speeds. FDE is a core component of MUVERA that enables:

  1. Compression: Reducing multiple vectors to a single fixed-size representation
  2. Similarity Preservation: Maintaining approximate Chamfer distances between point sets
  3. Asymmetric Search: Different encodings for queries (sum) and documents (average)
  4. Scalability: Enabling billion-scale multi-vector retrieval

This implementation focuses on the FDE component, which can be used standalone or as part of a larger retrieval system.

Features

  • Zero-copy numpy integration: Efficient memory usage with numpy arrays
  • Flexible encoding types: Support for both query (sum) and document (average) encodings
  • Configurable parameters: Control dimensionality, projections, and aggregation methods
  • Modern Python packaging: Built with uv and scikit-build-core

Installation

From source with uv

cd muvera-fde
uv pip install -e .

Development installation

uv pip install -e ".[dev]"

Quick Start

import numpy as np
from muvera_fde import (
    FixedDimensionalEncodingConfig,
    FixedDimensionalEncoder,
    EncodingType
)

# Create configuration
config = FixedDimensionalEncodingConfig(
    dimension=3,                    # Input point dimension
    num_simhash_projections=8,      # Number of SimHash projections
    num_repetitions=2,              # Number of repetitions for robustness
    encoding_type=EncodingType.DEFAULT_SUM  # Sum aggregation for queries
)

# Initialize encoder
encoder = FixedDimensionalEncoder(config)

# Generate random point cloud
points = np.random.randn(100, 3).astype(np.float32)

# Encode the point cloud
encoding = encoder.encode(points)
print(f"Encoding shape: {encoding.shape}")
print(f"Output dimension: {encoder.output_dimension}")

Configuration Options

The FixedDimensionalEncodingConfig dataclass supports the following parameters:

  • dimension: Dimension of input points (default: 3)
  • num_repetitions: Number of encoding repetitions (default: 1)
  • num_simhash_projections: Number of SimHash projections (default: 8)
  • seed: Random seed for reproducibility (default: 1)
  • encoding_type: EncodingType.DEFAULT_SUM for queries, EncodingType.AVERAGE for documents
  • projection_dimension: Optional dimension to project points before encoding
  • projection_type: ProjectionType.DEFAULT_IDENTITY or ProjectionType.AMS_SKETCH
  • fill_empty_partitions: Whether to fill empty partitions with zeros (default: False)
  • final_projection_dimension: Optional final output dimension

Advanced Usage

Query vs Document Encoding

# Query encoding (sum aggregation)
query_encoding = encoder.encode_query(query_points)

# Document encoding (average aggregation)
doc_encoding = encoder.encode_document(document_points)

Dimensionality Reduction

# Use AMS sketch projection
config = FixedDimensionalEncodingConfig(
    dimension=128,
    projection_dimension=32,
    projection_type=ProjectionType.AMS_SKETCH,
    final_projection_dimension=256
)

Building from Source

The project uses CMake and scikit-build-core for building the C++ extension. Eigen is automatically downloaded during the build process.

# Clean build
uv pip install --force-reinstall -e .

# Run tests
uv run pytest

Citation

If you use this library in your research, please cite the MUVERA paper:

@article{muvera2024,
  title={MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings},
  author={Google Research Team},
  journal={Google Research Blog},
  year={2024},
  url={https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/}
}

License

This project is licensed under the Apache License 2.0, the same license as the original Google graph-mining project. See the LICENSE file for details.

Acknowledgments

This library provides Python bindings for the Fixed Dimensional Encoding implementation from:

The core C++ implementation is adapted from Google's original code with modifications for Python integration.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

muvera_fde-0.1.0.tar.gz (120.5 kB view details)

Uploaded Source

File details

Details for the file muvera_fde-0.1.0.tar.gz.

File metadata

  • Download URL: muvera_fde-0.1.0.tar.gz
  • Upload date:
  • Size: 120.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.3

File hashes

Hashes for muvera_fde-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8bedbbb70d5332de53cba91d835315d480008355fa189b4b885dc2d625d0b50d
MD5 6c876814f146fd5b4753f0c05aea53e5
BLAKE2b-256 46750af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page