A package to aggregate embeddings in a Chroma vector store based on metadata columns.

Chroma Embeddings Aggregation Library

This Python library provides methods for aggregating the multiple embeddings associated with a single document or entity into one representative embedding. It works with Chroma vector stores and is compatible with LangChain's Chroma integration.

Features

  • Multiple aggregation methods (average, weighted average, geometric mean, harmonic mean, centroid, PCA, etc.)
  • Compatible with Chroma vector stores and LangChain
  • Easy-to-use API for aggregating embeddings

Installation

Install the package with pip:

pip install chroma_vector_aggregator

Usage

The following example aggregates embeddings by simple averaging:

Example: Simple Average Aggregation

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import FakeEmbeddings
from langchain.schema import Document
from chroma_vector_aggregator import aggregate_embeddings

# Create a sample Chroma collection
embeddings = FakeEmbeddings(size=10)
documents = [
    Document(page_content="Test document 1", metadata={"id": "group1"}),
    Document(page_content="Test document 2", metadata={"id": "group1"}),
    Document(page_content="Test document 3", metadata={"id": "group2"}),
    Document(page_content="Test document 4", metadata={"id": "group2"}),
]
chroma_collection = Chroma.from_documents(documents, embeddings)

# Average the embeddings of documents that share the same "id" metadata value
aggregated_collection = aggregate_embeddings(
    chroma_collection=chroma_collection,
    column_name="id",
    method="average"
)

# Use the aggregated collection for similarity search
results = aggregated_collection.similarity_search("Test query", k=2)
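
To make the grouping step concrete, here is a minimal NumPy sketch of what averaging by a metadata field amounts to conceptually. It does not call the library: `group_and_average` is a hypothetical helper written for illustration, and the toy vectors are made up. The library's internals may differ.

```python
from collections import defaultdict

import numpy as np


def group_and_average(embeddings, metadatas, column_name):
    """Hypothetical helper: average all vectors that share a metadata value."""
    groups = defaultdict(list)
    for vector, metadata in zip(embeddings, metadatas):
        groups[metadata[column_name]].append(vector)
    # One representative vector per distinct metadata value
    return {
        key: np.mean(np.asarray(vectors, dtype=float), axis=0)
        for key, vectors in groups.items()
    }


# Toy data mirroring the example above: two documents per group
embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [20.0, 0.0]]
metadatas = [{"id": "group1"}, {"id": "group1"}, {"id": "group2"}, {"id": "group2"}]

aggregated = group_and_average(embeddings, metadatas, "id")
```

Four input vectors collapse into two, one per distinct value of "id", which is why the aggregated collection in the example above can return at most two results.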

Aggregation Methods

  • average: Compute the arithmetic mean of embeddings.
  • weighted_average: Compute a weighted average of embeddings.
  • geometric_mean: Compute the geometric mean across embeddings.
  • harmonic_mean: Compute the harmonic mean across embeddings.
  • median: Compute the element-wise median of embeddings.
  • trimmed_mean: Compute the mean after trimming outliers.
  • centroid: Use K-Means clustering to find the centroid of the embeddings.
  • pca: Use Principal Component Analysis to reduce embeddings.
  • exemplar: Select the embedding that best represents the group.
  • max_pooling: Take the maximum value for each dimension across embeddings.
  • min_pooling: Take the minimum value for each dimension across embeddings.
  • entropy_weighted_average: Weight embeddings by their entropy.
  • attentive_pooling: Use an attention mechanism to aggregate embeddings.
  • tukeys_biweight: A robust method to down-weight outliers.
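
As a rough illustration of how some of these methods differ, the sketch below applies their standard textbook definitions to one toy group using NumPy and SciPy. This is not the library's code. The data is made up, and the exemplar rule shown here (the member closest to the arithmetic mean) is an assumption, since the list above does not say how "best represents" is measured.

```python
import numpy as np
from scipy import stats

# Four toy 2-dimensional embeddings for one group; the last row is an
# outlier in the first dimension.
group = np.array([
    [1.0, 4.0],
    [2.0, 5.0],
    [3.0, 6.0],
    [100.0, 7.0],
])

average = group.mean(axis=0)         # arithmetic mean per dimension
median = np.median(group, axis=0)    # element-wise median
max_pooled = group.max(axis=0)       # largest value per dimension
min_pooled = group.min(axis=0)       # smallest value per dimension

# Trimmed mean: drop 25% of the values from each end of every dimension,
# then average what remains.
trimmed = stats.trim_mean(group, proportiontocut=0.25, axis=0)

# Exemplar (assumed definition): the existing member closest to the mean.
distances = np.linalg.norm(group - average, axis=1)
exemplar = group[np.argmin(distances)]
```

With this data the outlier pulls the average's first dimension up to 26.5, while the median and the trimmed mean both stay at 2.5. Note also that the standard geometric and harmonic means are defined for positive values only, and embeddings often contain negative components; how the library handles that case is not stated here.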

Parameters

  • chroma_collection: The Chroma collection to aggregate embeddings from.
  • column_name: The metadata field by which to aggregate embeddings (e.g., 'id').
  • method: The aggregation method to use.
  • weights (optional): Weights for the weighted_average method.
  • trim_percentage (optional): Fraction to trim from each end for trimmed_mean.
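
The exact format the library expects for `weights` is not documented above. As a sketch of the underlying arithmetic only, the example below assumes one weight per embedding in a group and uses NumPy's `np.average`; the vectors and weights are made up.

```python
import numpy as np

# Three toy embeddings belonging to the same group
group = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Assumption: one weight per embedding. np.average normalizes by the sum
# of the weights, so they do not have to add up to 1.
weights = np.array([2.0, 1.0, 1.0])

weighted = np.average(group, axis=0, weights=weights)
# (2*[1, 0] + 1*[0, 1] + 1*[1, 1]) / 4 = [0.75, 0.5]
```

By the same reading, `trim_percentage` would play the role of the fraction cut from each end before averaging, for example 0.1 to drop the lowest and highest 10% of values in each dimension.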

Dependencies

  • chromadb
  • numpy
  • scipy
  • scikit-learn
  • langchain

Contributing

Contributions are welcome! Please feel free to submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
