A package to aggregate embeddings in a Faiss vector store based on metadata columns.
Project description
Faiss Embeddings Aggregation Library
This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It supports a wide range of aggregation techniques, from simple averaging to sophisticated methods like PCA and Attentive Pooling.
Table of Contents
Features
- Simple Average: Compute the arithmetic mean of embeddings.
- Weighted Average: Compute a weighted average of embeddings.
- Geometric Mean: Compute the geometric mean across embeddings (for positive values).
- Harmonic Mean: Compute the harmonic mean across embeddings (for positive values).
- Centroid (K-Means): Use K-Means clustering to find the centroid of the embeddings.
- Principal Component Analysis (PCA): Use PCA to reduce embeddings to a single representative vector.
- Median: Compute the element-wise median of embeddings.
- Trimmed Mean: Compute the mean after trimming outliers.
- Max-Pooling: Take the maximum value for each dimension across embeddings.
- Min-Pooling: Take the minimum value for each dimension across embeddings.
- Entropy-Weighted Average: Weight embeddings by their entropy (information content).
- Attentive Pooling: Use an attention mechanism to learn the weights for combining embeddings.
- Tukey's Biweight: A robust method to down-weight outliers.
- Exemplar: Select the embedding that best represents the group by minimizing average distance.
Installation
To install the package, you can use pip:
pip install faiss_vector_aggregator
Usage
Below are examples demonstrating how to use the library to aggregate embeddings using different methods.
Example 1: Simple Average Aggregation
Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging.
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using simple averaging
aggregate_embeddings(
input_folder="data/input",
column_name="id",
output_folder="data/output",
method="average"
)
- Parameters:
input_folder
: Path to the folder containing the input FAISS index and metadata.column_name
: The metadata field by which to aggregate embeddings (e.g.,'id'
).output_folder
: Path where the output FAISS index and metadata will be saved.method="average"
: Specifies the aggregation method.
Example 2: Weighted Average Aggregation
If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings.
from faiss_vector_aggregator import aggregate_embeddings
# Example weights for the embeddings
weights = [0.1, 0.3, 0.6]
# Aggregate embeddings using weighted averaging
aggregate_embeddings(
input_folder="data/input",
column_name="id",
output_folder="data/output",
method="weighted_average",
weights=weights
)
- Parameters:
weights
: A list or array of weights corresponding to each embedding.method="weighted_average"
: Specifies the weighted average method.
Example 3: Principal Component Analysis (PCA) Aggregation
To reduce high-dimensional embeddings to a single representative vector using PCA:
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using PCA
aggregate_embeddings(
input_folder="data/input",
column_name="id",
output_folder="data/output",
method="pca"
)
- Parameters:
method="pca"
: Specifies that PCA should be used for aggregation.
Example 4: Centroid Aggregation (K-Means)
Use K-Means clustering to find the centroid of embeddings for each document ID.
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using K-Means clustering to find the centroid
aggregate_embeddings(
input_folder="data/input",
column_name="id",
output_folder="data/output",
method="centroid"
)
- Parameters:
method="centroid"
: Specifies that K-Means clustering should be used.
Example 5: Attentive Pooling Aggregation
To use an attention mechanism for aggregating embeddings:
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using Attentive Pooling
aggregate_embeddings(
input_folder="data/input",
column_name="id",
output_folder="data/output",
method="attentive_pooling"
)
- Parameters:
method="attentive_pooling"
: Specifies the attentive pooling method.
Aggregation Methods
Below is a detailed description of each aggregation method supported by the library:
- average: Compute the arithmetic mean of embeddings.
- weighted_average: Compute a weighted average of embeddings. Requires
weights
. - geometric_mean: Compute the geometric mean across embeddings. Only for positive values.
- harmonic_mean: Compute the harmonic mean across embeddings. Only for positive values.
- median: Compute the element-wise median of embeddings.
- trimmed_mean: Compute the mean after trimming a percentage of outliers. Use
trim_percentage
parameter. - centroid: Use K-Means clustering to find the centroid of the embeddings.
- pca: Use Principal Component Analysis to project embeddings onto the first principal component.
- exemplar: Select the embedding that minimizes the average cosine distance to others.
- max_pooling: Take the maximum value for each dimension across embeddings.
- min_pooling: Take the minimum value for each dimension across embeddings.
- entropy_weighted_average: Weight embeddings by their entropy (information content).
- attentive_pooling: Use an attention mechanism based on similarity to aggregate embeddings.
- tukeys_biweight: A robust method to down-weight outliers in the embeddings.
Parameters
input_folder
(str): Path to the folder containing the input FAISS index (index.faiss
) and metadata (index.pkl
).column_name
(str): The metadata field by which to aggregate embeddings (e.g.,'id'
).output_folder
(str): Path where the output FAISS index and metadata will be saved.method
(str): The aggregation method to use. Options include:'average'
,'weighted_average'
,'geometric_mean'
,'harmonic_mean'
,'centroid'
,'pca'
,'median'
,'trimmed_mean'
,'max_pooling'
,'min_pooling'
,'entropy_weighted_average'
,'attentive_pooling'
,'tukeys_biweight'
,'exemplar'
.
weights
(list or np.ndarray, optional): Weights for theweighted_average
method.trim_percentage
(float, optional): Fraction to trim from each end fortrimmed_mean
. Should be between 0 and less than 0.5.weights
(list or np.ndarray, optional): Weights for theweighted_average
method.
Dependencies
Ensure you have the following packages installed:
- faiss: For handling FAISS indexes.
- numpy: For numerical computations.
- scipy: For statistical functions.
- scikit-learn: For PCA and K-Means clustering.
- langchain: For handling document stores and vector stores.
You can install the dependencies using:
pip install faiss-cpu numpy scipy scikit-learn langchain
Note: Replace faiss-cpu
with faiss-gpu
if you prefer to use the GPU version of FAISS.
Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue on the GitHub repository.
When contributing, please ensure that your code adheres to the following guidelines:
- Follow PEP 8 coding standards.
- Include docstrings and comments where necessary.
- Write unit tests for new features or bug fixes.
- Update the documentation to reflect changes.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Additional Notes
- Usage with LangChain:
- This library is compatible with LangChain's
FAISS
vector store. Ensure that your embeddings and indexes are handled consistently when integrating with LangChain.
- This library is compatible with LangChain's
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for faiss_vector_aggregator-0.3.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | d673b7584d02a056ee14d565013d7c6f643391e123cd7812663cbfac0abdbeb2 |
|
MD5 | fbdd17183ace84bcd1735a3bd97fc03d |
|
BLAKE2b-256 | 355ebadaf299c8fbb3ebfc3aeb7f68ab8630fa5e2af90ea44f752772c3aeeb19 |
Hashes for faiss_vector_aggregator-0.3.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 50b6a96462ad640be9a69569eaca1f41e603a202cdfd1d400d4c3617d5e11417 |
|
MD5 | 25d9094de50cac39ed019f007549a979 |
|
BLAKE2b-256 | 1a0c051c3d406f9a0905cf36f21c07a3ef90b5356475709e54dd9488e90f8a11 |