A package to aggregate embeddings in a Faiss vector store based on metadata columns.
Faiss Embeddings Aggregation Library
This Python library provides various methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It supports a wide range of aggregation techniques, from simple averaging to advanced methods like PCA and Attentive Pooling.
Features
- Simple Average: Compute the arithmetic mean of embeddings.
- Weighted Average: Compute a weighted average of embeddings.
- Geometric Mean: Compute the geometric mean across embeddings.
- Harmonic Mean: Compute the harmonic mean across embeddings.
- Centroid (K-Means): Use K-Means clustering to find the centroid of the embeddings.
- Principal Component (PCA): Use PCA to reduce embeddings to a single principal component.
- Median: Compute the element-wise median of embeddings.
- Trimmed Mean: Compute the mean after trimming outliers.
- Max-Pooling: Take the maximum value for each dimension across embeddings.
- Min-Pooling: Take the minimum value for each dimension across embeddings.
- Entropy-Weighted Average: Weight embeddings by their entropy (information content).
- Attentive Pooling: Use an attention mechanism to learn the weights for combining embeddings.
- Tukey's Biweight: A robust method to down-weight outliers.
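To make the simpler methods above concrete, here is a minimal NumPy sketch of the element-wise aggregators. It is an illustration under stated assumptions, not the library's implementation: the `aggregate` helper and its `trim_fraction` parameter are hypothetical names, and the method strings are borrowed from the Parameters section.

```python
import numpy as np

def aggregate(embeddings, method="average", trim_fraction=0.1):
    """Collapse an (n, d) array of embeddings into one (d,) vector.

    Hypothetical helper for illustration only; not part of the package.
    """
    x = np.asarray(embeddings, dtype=float)
    n = x.shape[0]
    if method == "average":
        return x.mean(axis=0)
    if method == "geometric_mean":
        # Only defined for strictly positive values.
        return np.exp(np.log(x).mean(axis=0))
    if method == "harmonic_mean":
        # Only defined for strictly positive values.
        return n / (1.0 / x).sum(axis=0)
    if method == "median":
        return np.median(x, axis=0)
    if method == "trimmed_mean":
        # Drop the k smallest and k largest values in every dimension.
        k = int(trim_fraction * n)
        ordered = np.sort(x, axis=0)
        return ordered[k:n - k].mean(axis=0)
    if method == "max_pooling":
        return x.max(axis=0)
    if method == "min_pooling":
        return x.min(axis=0)
    raise ValueError(f"Unsupported method: {method!r}")

# Three 2-dimensional embeddings belonging to one document.
chunks = np.array([[1.0, 4.0], [2.0, 8.0], [4.0, 16.0]])
geometric = aggregate(chunks, "geometric_mean")
median = aggregate(chunks, "median")
pooled_max = aggregate(chunks, "max_pooling")
```

Geometric and harmonic means assume strictly positive components, which raw embeddings often do not satisfy, so check your data before choosing them.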
Installation
To install the package, you can use pip:
pip install faiss_vector_aggregator
Usage
Here are some examples of how to use the library to aggregate embeddings.
Example 1: Simple Average Aggregation
Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging. Here's how you can do it:
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using simple averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="average"
)
In this example:
- input_folder: Path to the folder containing the input FAISS index and metadata.
- column_name: The column or metadata field by which to aggregate embeddings.
- output_folder: Path where the output FAISS index and metadata will be saved.
- method="average": Specifies that the average method should be used for aggregation.
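Conceptually, aggregating by a metadata column means grouping vectors that share a value and reducing each group to one vector. The sketch below shows that idea in plain Python. The `average_by_key` function and the sample data are hypothetical, and the library's on-disk format and internals may differ.

```python
from collections import defaultdict

def average_by_key(vectors, keys):
    """Group vectors by key and return the element-wise mean per group.

    Hypothetical helper for illustration only; not part of the package.
    """
    groups = defaultdict(list)
    for vector, key in zip(vectors, keys):
        groups[key].append(vector)
    return {
        key: [sum(column) / len(members) for column in zip(*members)]
        for key, members in groups.items()
    }

# Two chunks of "doc-a" and one chunk of "doc-b".
vectors = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
ids = ["doc-a", "doc-a", "doc-b"]
averaged = average_by_key(vectors, ids)
```

Each distinct value of the grouping column yields exactly one output vector, so the result has as many entries as there are unique IDs.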
Example 2: Weighted Average Aggregation
If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings. For instance:
from faiss_vector_aggregator import aggregate_embeddings

# Example weights for the embeddings
weights = [0.1, 0.3, 0.6]

# Aggregate embeddings using weighted averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="weighted_average",
    weights=weights
)
In this example:
- weights: A list of weights corresponding to each embedding.
- method="weighted_average": Specifies that the weighted average method should be used for aggregation.
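For reference, a weighted average divides the weighted sum of the vectors by the sum of the weights. The following NumPy sketch shows the arithmetic for a single group of three embeddings. The sample vectors are made up, and how the library maps `weights` onto groups of different sizes is not stated here, so treat this as an assumption-laden illustration.

```python
import numpy as np

# Three embeddings for one document, with one weight per embedding.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = [0.1, 0.3, 0.6]

# np.average computes sum(w_i * v_i) / sum(w_i) along the chosen axis.
weighted = np.average(embeddings, axis=0, weights=weights)

# Because of the division by sum(w_i), unnormalized weights in the
# same proportions give the same result.
same_result = np.average(embeddings, axis=0, weights=[1, 3, 6])
```

With these inputs the first dimension receives 0.1 + 0.6 and the second 0.3 + 0.6 of the total weight.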
Example 3: Principal Component Analysis (PCA) Aggregation
If you want to reduce high-dimensional embeddings to a single vector using Principal Component Analysis (PCA), you can use the following approach:
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using PCA
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="pca"
)
In this example:
- method="pca": Specifies that PCA should be used to reduce and aggregate the embeddings into a single vector.
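One common way to obtain a single vector from PCA is to take the first principal component, that is, the direction of greatest variance among a group's embeddings. The sketch below computes it with an SVD in NumPy. The `first_principal_component` function is hypothetical, and the library may scale, orient, or post-process its result differently.

```python
import numpy as np

def first_principal_component(embeddings):
    """Return the unit-length direction of greatest variance.

    Hypothetical helper for illustration only; not part of the package.
    """
    x = np.asarray(embeddings, dtype=float)
    # PCA works on mean-centered data.
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# Points that lie on the line y = 2x.
points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
direction = first_principal_component(points)
```

The sign of a principal component is arbitrary, and centering discards the group's mean, so this vector describes how the embeddings vary rather than where they sit.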
Example 4: Centroid Aggregation (K-Means)
To find the centroid of each document ID's embeddings with K-Means clustering:
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using K-Means clustering to find the centroid
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="centroid"
)
In this example:
- method="centroid": Specifies that K-Means clustering should be used to find the centroid for aggregation.
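To illustrate what a K-Means centroid is, here is a small NumPy sketch of Lloyd's algorithm. This page does not say how many clusters the library uses, so the `kmeans_centroids` function, its `k` and `iterations` parameters, and the first-k-points initialization are all assumptions made for the example.

```python
import numpy as np

def kmeans_centroids(points, k=1, iterations=10):
    """Run a basic Lloyd's-algorithm K-Means and return the k centroids.

    Hypothetical helper for illustration only; not part of the package.
    """
    x = np.asarray(points, dtype=float)
    # Deterministic initialization: use the first k points as centroids.
    centroids = x[:k].copy()
    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

# Four embeddings at the corners of a square.
chunks = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
centroid = kmeans_centroids(chunks, k=1)[0]
```

With a single cluster, the K-Means centroid coincides with the arithmetic mean of the embeddings. The two methods only differ when more than one cluster is involved.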
Parameters
- input_folder: Path to the folder containing the input FAISS index and metadata.
- column_name: The column or metadata field by which to aggregate embeddings (e.g., 'id').
- output_folder: Path to the folder where the output FAISS index and metadata will be saved.
- method: The aggregation method to use (average, weighted_average, geometric_mean, harmonic_mean, centroid, pca, median, trimmed_mean, max_pooling, min_pooling, entropy_weighted_average, attentive_pooling, tukeys_biweight).
- weights: Optional weights for the weighted_average method.
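Since the method is passed as a string, a typo might only surface once processing has started. A small guard like the following can catch it earlier. The `SUPPORTED_METHODS` set and `check_method` function are hypothetical and simply mirror the method names listed above. They are not part of the package.

```python
# Method names copied from the parameter list above.
SUPPORTED_METHODS = frozenset({
    "average", "weighted_average", "geometric_mean", "harmonic_mean",
    "centroid", "pca", "median", "trimmed_mean", "max_pooling",
    "min_pooling", "entropy_weighted_average", "attentive_pooling",
    "tukeys_biweight",
})

def check_method(method):
    """Return the method name unchanged, or raise if it is not listed.

    Hypothetical helper for illustration only; not part of the package.
    """
    if method not in SUPPORTED_METHODS:
        raise ValueError(
            f"Unknown aggregation method {method!r}; "
            f"expected one of {sorted(SUPPORTED_METHODS)}"
        )
    return method

method = check_method("pca")
```

The checked value can then be passed as the method argument of aggregate_embeddings.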
Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Hashes for faiss_vector_aggregator-0.2.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | a765d09d8049f64fefd4e2c22f3d41ec47917276216d6e76cacf04ac72359f88
MD5 | c377cf469394d6f68d90c6cee2b97518
BLAKE2b-256 | 1d15931d68b2cecddc39de23e2f8b8a915cd881291a572c22ee5c19748708880
Hashes for faiss_vector_aggregator-0.2.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | f9f0d9ab523ecc4e423f78a098c4e4086eb4e9e8336598f900daf201a9461cf6
MD5 | 0749921682c5a5ec23afdc1aa31c3d63
BLAKE2b-256 | 6dbb33dced895e1dfa2f0bdb49541783cbcb829a486169a66f1492efe41545a9