A package to aggregate embeddings in a Chroma vector store based on metadata columns.
Chroma Embeddings Aggregation Library
This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It is designed to work with Chroma vector stores and is compatible with LangChain's Chroma integration.
Features
- Multiple aggregation methods (average, weighted average, geometric mean, harmonic mean, centroid, PCA, etc.)
- Compatible with Chroma vector stores and LangChain
- Easy-to-use API for aggregating embeddings
Installation
To install the package, you can use pip:
pip install chroma_vector_aggregator
Usage
Here's an example demonstrating how to use the library to aggregate embeddings using simple averaging:
Example: Simple Average Aggregation
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import FakeEmbeddings
from langchain.schema import Document
from chroma_vector_aggregator import aggregate_embeddings

# Create a sample Chroma collection
embeddings = FakeEmbeddings(size=10)
documents = [
    Document(page_content="Test document 1", metadata={"id": "group1"}),
    Document(page_content="Test document 2", metadata={"id": "group1"}),
    Document(page_content="Test document 3", metadata={"id": "group2"}),
    Document(page_content="Test document 4", metadata={"id": "group2"}),
]
chroma_collection = Chroma.from_documents(documents, embeddings)

# Aggregate embeddings using simple averaging
aggregated_collection = aggregate_embeddings(
    chroma_collection=chroma_collection,
    column_name="id",
    method="average",
)

# Use the aggregated collection for similarity search
results = aggregated_collection.similarity_search("Test query", k=2)
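Conceptually, aggregating by a metadata column means grouping the stored vectors by that column's value and reducing each group to a single vector. The sketch below illustrates the idea with plain NumPy. `average_by_metadata` is a hypothetical helper written for this illustration; it is not part of the package's API, it does not touch Chroma, and it is not necessarily how the library implements the `average` method.

```python
from collections import defaultdict

import numpy as np


def average_by_metadata(vectors, metadatas, column_name):
    """Group vectors by metadatas[i][column_name] and average each group.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(list)
    for vector, metadata in zip(vectors, metadatas):
        groups[metadata[column_name]].append(np.asarray(vector, dtype=float))
    # One arithmetic-mean vector per distinct metadata value
    return {key: np.mean(np.stack(members), axis=0) for key, members in groups.items()}


vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
metadatas = [{"id": "group1"}, {"id": "group1"}, {"id": "group2"}]

aggregated = average_by_metadata(vectors, metadatas, "id")
for key, vector in aggregated.items():
    print(key, vector)
```

Three input vectors collapse into two, one per distinct value of `id`, which mirrors the four-documents-to-two-groups shape of the example above.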
Aggregation Methods
- average: Compute the arithmetic mean of embeddings.
- weighted_average: Compute a weighted average of embeddings.
- geometric_mean: Compute the geometric mean across embeddings.
- harmonic_mean: Compute the harmonic mean across embeddings.
- median: Compute the element-wise median of embeddings.
- trimmed_mean: Compute the mean after trimming outliers.
- centroid: Use K-Means clustering to find the centroid of the embeddings.
- pca: Use Principal Component Analysis to reduce embeddings.
- exemplar: Select the embedding that best represents the group.
- max_pooling: Take the maximum value for each dimension across embeddings.
- min_pooling: Take the minimum value for each dimension across embeddings.
- entropy_weighted_average: Weight embeddings by their entropy.
- attentive_pooling: Use an attention mechanism to aggregate embeddings.
- tukeys_biweight: A robust method to down-weight outliers.
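To make a few of these descriptions concrete, here is a minimal NumPy sketch of the element-wise median, max and min pooling, and one possible reading of "exemplar". The function names are hypothetical and the code is not taken from the library. In particular, choosing the exemplar as the member with the highest cosine similarity to the group mean is an assumption, since the description above does not say which criterion is used. The sketch also assumes non-zero vectors.

```python
import numpy as np


def median_embedding(vectors):
    """Element-wise median across the group (axis 0 indexes group members)."""
    return np.median(vectors, axis=0)


def max_pooling(vectors):
    """Maximum value of each dimension across the group."""
    return np.max(vectors, axis=0)


def min_pooling(vectors):
    """Minimum value of each dimension across the group."""
    return np.min(vectors, axis=0)


def exemplar(vectors):
    """Return the member most cosine-similar to the group mean.

    Assumed criterion for illustration; requires non-zero vectors.
    """
    vectors = np.asarray(vectors, dtype=float)
    mean = vectors.mean(axis=0)
    similarities = (vectors @ mean) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(mean)
    )
    return vectors[np.argmax(similarities)]


group = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

print(median_embedding(group))
print(max_pooling(group))
print(min_pooling(group))
print(exemplar(group))
```

Unlike the averaging methods, the exemplar is always one of the original embeddings, which can matter when the aggregated vector should still correspond to a real document.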
Parameters
- chroma_collection: The Chroma collection to aggregate embeddings from.
- column_name: The metadata field by which to aggregate embeddings (e.g., 'id').
- method: The aggregation method to use.
- weights (optional): Weights for the weighted_average method.
- trim_percentage (optional): Fraction to trim from each end for trimmed_mean.
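The two optional parameters are easiest to understand through the arithmetic they control. The following is a sketch under stated assumptions, not the library's code. It assumes `weights` holds one weight per embedding in a group, and that `trim_percentage` is applied independently to each dimension after sorting. Both function names are hypothetical.

```python
import numpy as np


def weighted_average(vectors, weights):
    """Weighted mean across the group; assumes one weight per embedding."""
    return np.average(vectors, axis=0, weights=weights)


def trimmed_mean(vectors, trim_percentage):
    """Mean per dimension after dropping a fraction of values from each end.

    Assumes trimming is applied independently to every dimension.
    """
    ordered = np.sort(np.asarray(vectors, dtype=float), axis=0)
    n = ordered.shape[0]
    k = int(n * trim_percentage)  # number of values cut from each end
    if 2 * k >= n:
        raise ValueError("trim_percentage removes every value in the group")
    return ordered[k : n - k].mean(axis=0)


pair = np.array([[1.0, 0.0], [0.0, 1.0]])
print(weighted_average(pair, weights=[3, 1]))

# The last row is an outlier in both dimensions
group = np.array(
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [100.0, -100.0]]
)
print(trimmed_mean(group, trim_percentage=0.2))
```

With five embeddings and a fraction of 0.2, one value is removed from each end of every dimension, so the outlier row no longer influences the result.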
Dependencies
- chromadb
- numpy
- scipy
- scikit-learn
- langchain
Contributing
Contributions are welcome! Please feel free to submit a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Source Distribution
Hashes for chroma_vector_aggregator-0.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 888568a2fa98864d515e7393c6f09f24aca7b290087e75e3f4a45eac93601a87
MD5 | 29452503829a23d987e5316b8a12b7db
BLAKE2b-256 | 6fbc8b0a01a52d177cda98583bfa649504a4d7cf1dc1bce3a75bd53ea09d658f
Built Distribution
Hashes for chroma_vector_aggregator-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | eb5b7a3ce0a2c291e1a895abdcaf259254d013485fee567e907e697901e15df0
MD5 | ec107ebf220b44fae52f8f96ac397bc8
BLAKE2b-256 | 2a4783a2495ba4acc2eff18be1d1d0bffa464939886d0282f2cf2ae5c0d4971b