A package to aggregate embeddings in a Chroma vector store based on metadata columns.

Chroma Embeddings Aggregation Library

This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It is designed to work with Chroma vector stores and is compatible with LangChain's Chroma integration.

Features

  • Multiple aggregation methods (average, weighted average, geometric mean, harmonic mean, centroid, PCA, etc.)
  • Compatible with Chroma vector stores and LangChain
  • Easy-to-use API for aggregating embeddings

Installation

To install the package, you can use pip:

pip install chroma_vector_aggregator

Usage

Here's an example demonstrating how to aggregate embeddings with simple averaging:

Example: Simple Average Aggregation

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import FakeEmbeddings
from langchain.schema import Document
from chroma_vector_aggregator import aggregate_embeddings

# Create a sample Chroma collection
embeddings = FakeEmbeddings(size=10)  # random 10-dimensional embeddings, for demonstration only
documents = [
    Document(page_content="Test document 1", metadata={"id": "group1"}),
    Document(page_content="Test document 2", metadata={"id": "group1"}),
    Document(page_content="Test document 3", metadata={"id": "group2"}),
    Document(page_content="Test document 4", metadata={"id": "group2"}),
]
chroma_collection = Chroma.from_documents(documents, embeddings)

# Aggregate embeddings using simple averaging
aggregated_collection = aggregate_embeddings(
    chroma_collection=chroma_collection,
    column_name="id",
    method="average"
)

# Use the aggregated collection for similarity search
results = aggregated_collection.similarity_search("Test query", k=2)
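
similarity_search returns LangChain Document objects, so the matched text and metadata are available directly:

for doc in results:
    print(doc.page_content, doc.metadata)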

Aggregation Methods

The method parameter selects one of the following strategies; a brief NumPy sketch of several appears after the list.

  • average: Compute the arithmetic mean of embeddings.
  • weighted_average: Compute a weighted average of embeddings.
  • geometric_mean: Compute the geometric mean across embeddings.
  • harmonic_mean: Compute the harmonic mean across embeddings.
  • median: Compute the element-wise median of embeddings.
  • trimmed_mean: Compute the mean after trimming outliers.
  • centroid: Use K-Means clustering to find the centroid of the embeddings.
  • pca: Use Principal Component Analysis to derive a single representative embedding.
  • exemplar: Select the embedding that best represents the group.
  • max_pooling: Take the maximum value for each dimension across embeddings.
  • min_pooling: Take the minimum value for each dimension across embeddings.
  • entropy_weighted_average: Weight embeddings by their entropy.
  • attentive_pooling: Use an attention mechanism to aggregate embeddings.
  • tukeys_biweight: A robust method to down-weight outliers.
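
To make the element-wise reductions concrete, here is a minimal NumPy sketch of what several of these methods compute over a stacked (n_embeddings, dim) array. It illustrates the arithmetic only and is not the library's internal implementation:

import numpy as np

# Stack one group's embeddings into an (n_embeddings, dim) array.
embs = np.array([[0.2, 0.9],
                 [0.4, 0.7],
                 [0.6, 0.8]])

average    = embs.mean(axis=0)        # arithmetic mean per dimension
median     = np.median(embs, axis=0)  # element-wise median
max_pooled = embs.max(axis=0)         # max_pooling: max per dimension
min_pooled = embs.min(axis=0)         # min_pooling: min per dimension

# geometric_mean assumes positive components; real embeddings may need
# shifting or clipping before taking logs.
geometric = np.exp(np.log(embs).mean(axis=0))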

Parameters

aggregate_embeddings accepts the following arguments; an illustrative call appears after the list.

  • chroma_collection: The Chroma collection to aggregate embeddings from.
  • column_name: The metadata field by which to aggregate embeddings (e.g., 'id').
  • method: The aggregation method to use.
  • weights (optional): Weights for the weighted_average method.
  • trim_percentage (optional): Fraction to trim from each end for trimmed_mean.
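
For example, the optional arguments pair with their methods as shown below. The exact shape expected for weights (here assumed to be one weight per embedding in the group) is an assumption; check the package source if your groups vary in size.

# Weighted average; `weights` is assumed to be one weight per embedding.
weighted_collection = aggregate_embeddings(
    chroma_collection=chroma_collection,
    column_name="id",
    method="weighted_average",
    weights=[0.7, 0.3],
)

# Trimmed mean: trim 10% from each end of the sorted values per dimension.
trimmed_collection = aggregate_embeddings(
    chroma_collection=chroma_collection,
    column_name="id",
    method="trimmed_mean",
    trim_percentage=0.1,
)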

Dependencies

  • chromadb
  • numpy
  • scipy
  • scikit-learn
  • langchain

Contributing

Contributions are welcome! Please feel free to submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
