A package to aggregate embeddings in a Chroma vector store based on metadata columns.
Chroma Embeddings Aggregation Library
This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It is designed to work with Chroma vector stores and is compatible with LangChain's Chroma integration.
Features
- Multiple aggregation methods (average, weighted average, geometric mean, harmonic mean, centroid, PCA, etc.)
- Compatible with Chroma vector stores and LangChain
- Easy-to-use API for aggregating embeddings
Installation
To install the package, you can use pip:
pip install chroma_vector_aggregator
Usage
Here's an example demonstrating how to use the library to aggregate embeddings using simple averaging:
Example: Simple Average Aggregation
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import FakeEmbeddings
from langchain.schema import Document
from chroma_vector_aggregator import aggregate_embeddings

# Create a sample Chroma collection
embeddings = FakeEmbeddings(size=10)
documents = [
    Document(page_content="Test document 1", metadata={"id": "group1"}),
    Document(page_content="Test document 2", metadata={"id": "group1"}),
    Document(page_content="Test document 3", metadata={"id": "group2"}),
    Document(page_content="Test document 4", metadata={"id": "group2"}),
]
chroma_collection = Chroma.from_documents(documents, embeddings)

# Aggregate embeddings using simple averaging
aggregated_collection = aggregate_embeddings(
    chroma_collection=chroma_collection,
    column_name="id",
    method="average"
)

# Use the aggregated collection for similarity search
results = aggregated_collection.similarity_search("Test query", k=2)
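To make the idea behind the example concrete, here is a minimal pure-Python sketch of grouping vectors by a metadata field and averaging each group element-wise. It is not the library's implementation: the helper name average_by_key and the toy two-dimensional vectors are assumptions made for illustration only.

```python
from collections import defaultdict


def average_by_key(embeddings, metadatas, column_name):
    """Group vectors by a metadata value and average each group element-wise.

    Hypothetical helper for illustration; not part of chroma_vector_aggregator.
    """
    groups = defaultdict(list)
    for vector, metadata in zip(embeddings, metadatas):
        groups[metadata[column_name]].append(vector)
    # zip(*vectors) walks a group one dimension at a time.
    return {
        key: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for key, vectors in groups.items()
    }


# Toy data: four 2-dimensional vectors in two groups, mirroring the "id" field above.
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 10.0], [2.0, 20.0]]
metadatas = [{"id": "group1"}, {"id": "group1"}, {"id": "group2"}, {"id": "group2"}]

aggregated = average_by_key(embeddings, metadatas, "id")
print(aggregated)
```

Each group collapses to one vector, which is why the aggregated collection in the example above would be expected to hold one entry per distinct "id" value.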
Aggregation Methods
- average: Compute the arithmetic mean of embeddings.
- weighted_average: Compute a weighted average of embeddings.
- geometric_mean: Compute the geometric mean across embeddings.
- harmonic_mean: Compute the harmonic mean across embeddings.
- median: Compute the element-wise median of embeddings.
- trimmed_mean: Compute the mean after trimming outliers.
- centroid: Use K-Means clustering to find the centroid of the embeddings.
- pca: Use Principal Component Analysis to reduce embeddings.
- exemplar: Select the embedding that best represents the group.
- max_pooling: Take the maximum value for each dimension across embeddings.
- min_pooling: Take the minimum value for each dimension across embeddings.
- entropy_weighted_average: Weight embeddings by their entropy.
- attentive_pooling: Use an attention mechanism to aggregate embeddings.
- tukeys_biweight: A robust method to down-weight outliers.
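As a rough illustration of a few of the methods listed above, the following pure-Python sketch computes max pooling, min pooling, the element-wise median, and an exemplar for one small group. The helper names are hypothetical, and the exemplar criterion (the member closest to the group mean by Euclidean distance) is an assumption; the library may define "best represents the group" differently.

```python
import math
from statistics import median


def elementwise(vectors, reducer):
    """Apply a reducer (e.g. max, min, median) to each dimension across vectors."""
    return [reducer(dim) for dim in zip(*vectors)]


def exemplar(vectors):
    """Return the member closest to the group mean.

    Assumed criterion (Euclidean distance to the mean), for illustration only.
    """
    mean = elementwise(vectors, lambda dim: sum(dim) / len(dim))
    return min(vectors, key=lambda v: math.dist(v, mean))


# One group of three 2-dimensional vectors.
group = [[1.0, 5.0], [2.0, 1.0], [9.0, 3.0]]

max_pooled = elementwise(group, max)
min_pooled = elementwise(group, min)
median_vec = elementwise(group, median)
representative = exemplar(group)

print(max_pooled, min_pooled, median_vec, representative)
```

Pooling and the median produce a new vector that may not match any member of the group, whereas an exemplar always returns one of the original embeddings.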
Parameters
- chroma_collection: The Chroma collection to aggregate embeddings from.
- column_name: The metadata field by which to aggregate embeddings (e.g., 'id').
- method: The aggregation method to use.
- weights (optional): Weights for the weighted_average method.
- trim_percentage (optional): Fraction to trim from each end for trimmed_mean.
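The two optional parameters can be understood through a small sketch of the underlying arithmetic. The functions below are hypothetical stand-ins, not the library's code: they assume one weight per embedding in a group, and that trim_percentage is applied per dimension by dropping that fraction of sorted values from each end.

```python
def weighted_average(vectors, weights):
    """Weighted element-wise mean; assumes one weight per vector."""
    if len(vectors) != len(weights):
        raise ValueError("weights must have one entry per vector")
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, dim)) / total
        for dim in zip(*vectors)
    ]


def trimmed_mean(vectors, trim_percentage):
    """Per-dimension mean after dropping a fraction of values from each end."""
    n = len(vectors)
    k = int(n * trim_percentage)  # values dropped from each end
    if 2 * k >= n:
        raise ValueError("trim_percentage removes every value")
    result = []
    for dim in zip(*vectors):
        kept = sorted(dim)[k:n - k]
        result.append(sum(kept) / len(kept))
    return result


# The second vector counts three times as much as the first.
weighted = weighted_average([[0.0, 2.0], [4.0, 6.0]], weights=[1.0, 3.0])

# With four vectors and trim_percentage=0.25, one value is dropped from each end,
# so the outlier vector [100.0, -500.0] does not affect either dimension.
trimmed = trimmed_mean(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -500.0]],
    trim_percentage=0.25,
)

print(weighted, trimmed)
```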
Dependencies
- chromadb
- numpy
- scipy
- scikit-learn
- langchain
Contributing
Contributions are welcome! Please feel free to submit a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.