A package to aggregate embeddings in a Faiss vector store based on metadata columns.
Faiss Embeddings Aggregation Library
This Python library provides methods for aggregating multiple embeddings associated with a single document or entity into one representative embedding. The supported techniques range from simple averaging to PCA and attentive pooling.
Features
- Simple Average: Compute the arithmetic mean of embeddings.
- Weighted Average: Compute a weighted average of embeddings.
- Geometric Mean: Compute the geometric mean across embeddings (for positive values).
- Harmonic Mean: Compute the harmonic mean across embeddings (for positive values).
- Centroid (K-Means): Use K-Means clustering to find the centroid of the embeddings.
- Principal Component Analysis (PCA): Use PCA to reduce embeddings to a single representative vector.
- Median: Compute the element-wise median of embeddings.
- Trimmed Mean: Compute the mean after trimming outliers.
- Max-Pooling: Take the maximum value for each dimension across embeddings.
- Min-Pooling: Take the minimum value for each dimension across embeddings.
- Entropy-Weighted Average: Weight embeddings by their entropy (information content).
- Attentive Pooling: Use an attention mechanism to learn the weights for combining embeddings.
- Tukey's Biweight: A robust method to down-weight outliers.
- Exemplar: Select the embedding that best represents the group by minimizing average distance.
Installation
To install the package, you can use pip:
    pip install faiss_vector_aggregator
Usage
Below are examples demonstrating how to use the library to aggregate embeddings using different methods.
Example 1: Simple Average Aggregation
Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging.
    from faiss_vector_aggregator import aggregate_embeddings

    # Aggregate embeddings using simple averaging
    aggregate_embeddings(
        input_folder="data/input",
        column_name="id",
        output_folder="data/output",
        method="average"
    )
- Parameters:
  - input_folder: Path to the folder containing the input FAISS index and metadata.
  - column_name: The metadata field by which to aggregate embeddings (e.g., 'id').
  - output_folder: Path where the output FAISS index and metadata will be saved.
  - method="average": Specifies the aggregation method.
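To picture what grouping by a metadata field and averaging amounts to, here is a minimal NumPy sketch. It is not the library's implementation: `average_by_key` and the sample data are hypothetical names introduced for illustration, and it works on in-memory arrays instead of a FAISS index.

```python
from collections import defaultdict

import numpy as np


def average_by_key(embeddings, keys):
    """Group row vectors by key and return {key: mean vector}."""
    groups = defaultdict(list)
    for vector, key in zip(embeddings, keys):
        groups[key].append(vector)
    # One averaged vector per distinct key.
    return {key: np.mean(np.stack(vectors), axis=0) for key, vectors in groups.items()}


# Three chunk embeddings; the first two belong to the same document.
embeddings = np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]])
doc_ids = ["doc-a", "doc-a", "doc-b"]

aggregated = average_by_key(embeddings, doc_ids)
```

Each distinct value of the grouping column ends up with exactly one vector, which is the shape of result the output index is described as holding.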
Example 2: Weighted Average Aggregation
If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings.
    from faiss_vector_aggregator import aggregate_embeddings

    # Example weights for the embeddings
    weights = [0.1, 0.3, 0.6]

    # Aggregate embeddings using weighted averaging
    aggregate_embeddings(
        input_folder="data/input",
        column_name="id",
        output_folder="data/output",
        method="weighted_average",
        weights=weights
    )
- Parameters:
  - weights: A list or array of weights corresponding to each embedding.
  - method="weighted_average": Specifies the weighted average method.
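The arithmetic behind a weighted average can be sketched with `numpy.average`. The helper name `weighted_average` and the sample vectors are assumptions for illustration; how the library matches a single weights list to groups of different sizes is not stated here, so this sketch covers one group only.

```python
import numpy as np


def weighted_average(vectors, weights):
    """Weighted mean of row vectors; expects one weight per row."""
    vectors = np.asarray(vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (vectors.shape[0],):
        raise ValueError("expected exactly one weight per embedding")
    # np.average divides by the sum of the weights, so they need not sum to 1.
    return np.average(vectors, axis=0, weights=weights)


chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
result = weighted_average(chunks, [0.1, 0.3, 0.6])
```

With these inputs the third vector dominates, and the result is approximately `[0.7, 0.9]`.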
Example 3: Principal Component Analysis (PCA) Aggregation
To reduce high-dimensional embeddings to a single representative vector using PCA:
    from faiss_vector_aggregator import aggregate_embeddings

    # Aggregate embeddings using PCA
    aggregate_embeddings(
        input_folder="data/input",
        column_name="id",
        output_folder="data/output",
        method="pca"
    )
- Parameters:
  - method="pca": Specifies that PCA should be used for aggregation.
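One way to obtain a first principal component is a singular value decomposition of the mean-centered embeddings. The sketch below uses only NumPy and a hypothetical helper, `first_principal_component`; whether the library returns this direction itself or a vector derived from it is an assumption, not something the source confirms.

```python
import numpy as np


def first_principal_component(vectors):
    """Unit vector along the direction of greatest variance."""
    vectors = np.asarray(vectors, dtype=float)
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Points that lie on the line y = x, so the main direction is the diagonal.
chunks = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
component = first_principal_component(chunks)
```

The sign of a principal component is arbitrary, so the result may point along the diagonal in either direction.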
Example 4: Centroid Aggregation (K-Means)
Use K-Means clustering to find the centroid of embeddings for each document ID.
    from faiss_vector_aggregator import aggregate_embeddings

    # Aggregate embeddings using K-Means clustering to find the centroid
    aggregate_embeddings(
        input_folder="data/input",
        column_name="id",
        output_folder="data/output",
        method="centroid"
    )
- Parameters:
  - method="centroid": Specifies that K-Means clustering should be used.
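The source does not say how many clusters the centroid method fits. Assuming a single cluster per group, the centroid K-Means converges to is the arithmetic mean of the group, which the bare-bones Lloyd iteration below illustrates. `kmeans_centroids` is a hypothetical helper written for this sketch, not the library's code.

```python
import numpy as np


def kmeans_centroids(vectors, k=1, iterations=10, seed=0):
    """Plain Lloyd's algorithm; returns an array of k centroids."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = np.array([
            vectors[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids


chunks = np.array([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]])
centroid = kmeans_centroids(chunks, k=1)[0]
```

With `k=1` every vector is assigned to the same cluster, so the first update already lands on the group mean.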
Example 5: Attentive Pooling Aggregation
To use an attention mechanism for aggregating embeddings:
    from faiss_vector_aggregator import aggregate_embeddings

    # Aggregate embeddings using Attentive Pooling
    aggregate_embeddings(
        input_folder="data/input",
        column_name="id",
        output_folder="data/output",
        method="attentive_pooling"
    )
- Parameters:
  - method="attentive_pooling": Specifies the attentive pooling method.
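A similarity-based attention step can be sketched as a softmax over dot products with a query vector. The choices below (using the group mean as the query, unscaled dot products, and the helper name `attentive_pool`) are assumptions made for illustration; the library's actual scoring may differ.

```python
import numpy as np


def attentive_pool(vectors):
    """Softmax-weighted sum of row vectors, scored against their mean."""
    vectors = np.asarray(vectors, dtype=float)
    query = vectors.mean(axis=0)
    scores = vectors @ query
    # Subtract the maximum before exponentiating for numerical stability.
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ vectors, weights


chunks = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pooled, weights = attentive_pool(chunks)
```

Vectors that align with the query receive larger weights, so an embedding that disagrees with the rest of the group contributes less to the pooled result.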
Aggregation Methods
Each aggregation method supported by the library is described below:
- average: Compute the arithmetic mean of embeddings.
- weighted_average: Compute a weighted average of embeddings. Requires the weights parameter.
- geometric_mean: Compute the geometric mean across embeddings. Only for positive values.
- harmonic_mean: Compute the harmonic mean across embeddings. Only for positive values.
- median: Compute the element-wise median of embeddings.
- trimmed_mean: Compute the mean after trimming a percentage of outliers. Use the trim_percentage parameter.
- centroid: Use K-Means clustering to find the centroid of the embeddings.
- pca: Use Principal Component Analysis to project embeddings onto the first principal component.
- exemplar: Select the embedding that minimizes the average cosine distance to others.
- max_pooling: Take the maximum value for each dimension across embeddings.
- min_pooling: Take the minimum value for each dimension across embeddings.
- entropy_weighted_average: Weight embeddings by their entropy (information content).
- attentive_pooling: Use an attention mechanism based on similarity to aggregate embeddings.
- tukeys_biweight: A robust method to down-weight outliers in the embeddings.
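Two of the less familiar entries in this list, the trimmed mean and the exemplar, can be sketched in a few lines of NumPy. The helper names and sample data are hypothetical, and the details (per-dimension trimming, truncating the cut count toward zero) are assumptions, not a description of the library's code.

```python
import numpy as np


def trimmed_mean(vectors, trim_percentage=0.1):
    """Per-dimension mean after dropping a fraction of values from each end."""
    vectors = np.asarray(vectors, dtype=float)
    if not 0 <= trim_percentage < 0.5:
        raise ValueError("trim_percentage must be in [0, 0.5)")
    ordered = np.sort(vectors, axis=0)  # sort each dimension independently
    cut = int(trim_percentage * len(ordered))
    return ordered[cut:len(ordered) - cut].mean(axis=0)


def exemplar(vectors):
    """Return the row with the smallest average cosine distance to all rows."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosine_distance = 1.0 - unit @ unit.T
    return vectors[cosine_distance.mean(axis=1).argmin()]


chunks = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
    [10.0, 0.0],  # large-magnitude outlier
])
robust = trimmed_mean(chunks, trim_percentage=0.2)
representative = exemplar(chunks)
```

The trimmed mean discards the extreme value in each dimension before averaging, while the exemplar always returns one of the original embeddings instead of a synthetic vector.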
Parameters
- input_folder (str): Path to the folder containing the input FAISS index (index.faiss) and metadata (index.pkl).
- column_name (str): The metadata field by which to aggregate embeddings (e.g., 'id').
- output_folder (str): Path where the output FAISS index and metadata will be saved.
- method (str): The aggregation method to use. Options include: 'average', 'weighted_average', 'geometric_mean', 'harmonic_mean', 'centroid', 'pca', 'median', 'trimmed_mean', 'max_pooling', 'min_pooling', 'entropy_weighted_average', 'attentive_pooling', 'tukeys_biweight', 'exemplar'.
- weights (list or np.ndarray, optional): Weights for the weighted_average method.
- trim_percentage (float, optional): Fraction to trim from each end for trimmed_mean. Must be at least 0 and less than 0.5.
Dependencies
Ensure you have the following packages installed:
- faiss: For handling FAISS indexes.
- numpy: For numerical computations.
- scipy: For statistical functions.
- scikit-learn: For PCA and K-Means clustering.
- langchain: For handling document stores and vector stores.
You can install the dependencies using:
    pip install faiss-cpu numpy scipy scikit-learn langchain

Note: Replace faiss-cpu with faiss-gpu if you prefer to use the GPU version of FAISS.
Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue on the GitHub repository.
When contributing, please ensure that your code adheres to the following guidelines:
- Follow PEP 8 coding standards.
- Include docstrings and comments where necessary.
- Write unit tests for new features or bug fixes.
- Update the documentation to reflect changes.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Additional Notes
- Usage with LangChain: This library is compatible with LangChain's FAISS vector store. Ensure that your embeddings and indexes are handled consistently when integrating with LangChain.