A package to aggregate embeddings in a Faiss vector store based on metadata columns.
Project description
Faiss Embeddings Aggregation Library
This Python library provides various methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It supports a wide range of aggregation techniques, from simple averaging to advanced methods like PCA and Attentive Pooling.
Features
- Simple Average: Compute the arithmetic mean of embeddings.
- Weighted Average: Compute a weighted average of embeddings.
- Geometric Mean: Compute the geometric mean across embeddings.
- Harmonic Mean: Compute the harmonic mean across embeddings.
- Centroid (K-Means): Use K-Means clustering to find the centroid of the embeddings.
- Principal Component (PCA): Use PCA to reduce embeddings to a single principal component.
- Median: Compute the element-wise median of embeddings.
- Trimmed Mean: Compute the mean after trimming outliers.
- Max-Pooling: Take the maximum value for each dimension across embeddings.
- Min-Pooling: Take the minimum value for each dimension across embeddings.
- Entropy-Weighted Average: Weight embeddings by their entropy (information content).
- Attentive Pooling: Use an attention mechanism to learn the weights for combining embeddings.
- Tukey's Biweight: A robust method to down-weight outliers.
Installation
To install the package, you can use pip:
```bash pip install faiss_vector_aggregator ```
Usage
Here are some examples of how to use the library to aggregate embeddings.
Example 1: Simple Average Aggregation
Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging. Here's how you can do it:
```python from your_package_name import aggregate_embeddings
Aggregate embeddings using simple averaging
aggregate_embeddings( input_folder="data/input", column_name="id", output_folder="data/output", method="average" ) ```
In this example:
input_folder
: Path to the folder containing the input FAISS index and metadata.column_name
: The column or metadata field by which to aggregate embeddings.output_folder
: Path where the output FAISS index and metadata will be saved.method="average"
: Specifies that the average method should be used for aggregation.
Example 2: Weighted Average Aggregation
If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings. For instance:
```python from your_package_name import aggregate_embeddings
Example weights for the embeddings
weights = [0.1, 0.3, 0.6]
Aggregate embeddings using weighted averaging
aggregate_embeddings( input_folder="data/input", column_name="id", output_folder="data/output", method="weighted_average", weights=weights ) ```
In this example:
weights
: A list of weights corresponding to each embedding.method="weighted_average"
: Specifies that the weighted average method should be used for aggregation.
Example 3: Principal Component Analysis (PCA) Aggregation
If you want to reduce high-dimensional embeddings to a single vector using Principal Component Analysis (PCA), you can use the following approach:
```python from your_package_name import aggregate_embeddings
Aggregate embeddings using PCA
aggregate_embeddings( input_folder="data/input", column_name="id", output_folder="data/output", method="pca" ) ```
In this example:
method="pca"
: Specifies that PCA should be used to reduce and aggregate the embeddings into a single vector.
Example 4: Centroid Aggregation (K-Means)
To use K-Means clustering for finding the centroid of embeddings for each document ID:
```python from your_package_name import aggregate_embeddings
Aggregate embeddings using K-Means clustering to find the centroid
aggregate_embeddings( input_folder="data/input", column_name="id", output_folder="data/output", method="centroid" ) ```
In this example:
method="centroid"
: Specifies that K-Means clustering should be used to find the centroid for aggregation.
Parameters
input_folder
: Path to the folder containing the input FAISS index and metadata.column_name
: The column or metadata field by which to aggregate embeddings (e.g., 'id').output_folder
: Path to the folder where the output FAISS index and metadata will be saved.method
: The aggregation method to use (average
,weighted_average
,geometric_mean
,harmonic_mean
,centroid
,pca
,median
,trimmed_mean
,max_pooling
,min_pooling
,entropy_weighted_average
,attentive_pooling
,tukeys_biweight
).weights
: Optional weights for theweighted_average
method.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an Issue.
License
This project is licensed under the MIT License. See the LICENSE
file for details.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for faiss_vector_aggregator-0.2.0.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | e8c0b7fcf3655ba9751dd6fb05a19d7698c3007d6e4348d8d814219fa66b30bf |
|
MD5 | b89672c544887bf653660e5c6ad89ea2 |
|
BLAKE2b-256 | 2baf87a410cd98b0de580c854db79fa52cb06e3d49bbdcba582dc88b4dfea48a |
Hashes for faiss_vector_aggregator-0.2.0-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 41f53c803399fa23e6c82ba40eef963c8ef852e2e86bf79dedcb3841fb64fa92 |
|
MD5 | 63534bcddf5a96b454a214241571306f |
|
BLAKE2b-256 | 543da9592f2202da0af2c11f9f9861e995e40a3771b8029017c3321902dd52f1 |