
Chroma Datasets

Easy-to-use datasets for vector retrieval. Making it easy to load data into Chroma since 2023.

pip install chroma_datasets

Current Datasets

| Dataset | Size | Contributor | Python Class |
|---|---|---|---|
| State of the Union | 51 kB | Chroma | from chroma_datasets import StateOfTheUnion |
| Paul Graham Essay | 1.3 MB | Chroma | from chroma_datasets import PaulGrahamEssay |
| SciQ | 2.8 MB | Hugging Face | from chroma_datasets import SciQ |
| Huberman Podcasts | 4.3 MB | Dexa AI | from chroma_datasets import HubermanPodcasts |

chroma_datasets is generally backed by Hugging Face datasets, but that is not a requirement.

How to use

The following will:

  1. Download the 2022 State of the Union
  2. Chunk it up for you
  3. Embed it using Chroma's default open-source embedding function
  4. Import it into Chroma

import chromadb
from chroma_datasets import StateOfTheUnion
from chroma_datasets.utils import import_into_chroma

# Create an in-memory Chroma client and import the dataset into a collection
chroma_client = chromadb.Client()
collection = import_into_chroma(chroma_client=chroma_client, dataset=StateOfTheUnion)

# Query the collection; Chroma's default embedding function handles the query text
result = collection.query(query_texts=["The United States of America"])
print(result)

Adding a New Dataset

We welcome new datasets!

Datasets can be anything generally useful for developer education around processing and using embeddings.

Datasets can take one of three forms (sketched below):

  • raw text (like StateOfTheUnion)
  • pre-chunked data (like SciQ)
  • chunked and embedded (like PaulGrahamEssay)
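
A rough sketch of what each form might look like. The field names follow the columns documented in the PaulGrahamEssay class below; the values are purely illustrative:

# 1. Raw text: a single string that the loader will chunk for you
raw = "Madam Speaker, Madam Vice President, our First Lady..."

# 2. Pre-chunked: records with ids, documents, and metadata, but no embeddings yet
chunked = [
    {"id": "sotu-0", "document": "Madam Speaker, ...", "metadata": {"source": "sotu"}},
]

# 3. Chunked and embedded: records that also carry precomputed embedding vectors
embedded = [
    {"id": "pg-0",
     "document": "What I Worked On...",
     "embedding": [0.013, -0.021, 0.044],
     "metadata": {"source": "paul_graham_essay"}},
]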

See examples/upload.ipynb for an example of how to create a dataset on Hugging Face (the default path).

Create a new dataset from a Chroma Collection

(more examples of this in examples/upload.ipynb and examples/upload_embeddings.ipynb)

Install dependencies

pip install datasets huggingface_hub chromadb

Login into Hugging Face

huggingface-cli login

Upload an existing collection to Hugging Face. Note that Hugging Face requires the data to have a "split name"; we suggest using a default of "data".

import chromadb
from chroma_datasets.utils import export_collection_to_hf_dataset

# Open the persistent Chroma instance that holds the collection
client = chromadb.PersistentClient(path="./chroma_data")

# Export the collection to an in-memory Hugging Face dataset
dataset = export_collection_to_hf_dataset(
    chroma_client=client,
    collection_name="paul_graham_essay",
    license="MIT")

# Push it to the Hugging Face Hub under the required split name
dataset.push_to_hub(
    repo_id="chromadb/paul_graham_essay",
    split="data")
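
To sanity-check the upload, you can load the dataset back from the Hub with the datasets library. This uses the repo id and split from the example above; the expected columns are the ones documented for this dataset below:

from datasets import load_dataset

ds = load_dataset("chromadb/paul_graham_essay", split="data")
print(ds.column_names)  # id, document, embedding, metadata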

Create a Dataset Class

  • Set the string name of the embedding function you used to embed the data; this makes it possible for users to use the embeddings. Please also customize the helpful error message so that users who pass no embedding function, or the wrong one, get guidance.
  • raw_text is optional.
  • chunked and to_chroma can be copied letter for letter if you uploaded with the method above.

from typing import List

# These helpers and types ship with chroma_datasets; the exact module paths
# below are assumed - adjust them to match the package layout
from chroma_datasets.types import Dataset, Datapoint, AddEmbedding
from chroma_datasets.utils import load_huggingface_dataset, to_chroma_schema

class PaulGrahamEssay(Dataset):
    """
    http://www.paulgraham.com/worked.html

    Columns:
        - id: unique identifier for each chunk
        - document: the text of the chunk
        - embedding: the embedding of the chunk (OpenAI-ada-002)
        - metadata: metadata about the chunk
    """
    hf_data = None
    embedding_function = "OpenAIEmbeddingFunction" # name of embedding function inside Chroma
    embedding_function_instructions = """
        from chromadb.utils import embedding_functions
        openai_ef = embedding_functions.OpenAIEmbeddingFunction(
            api_key="YOUR_API_KEY",
            model_name="text-embedding-ada-002"
        )
    """

    @classmethod
    def load_data(cls):
        cls.hf_data = load_huggingface_dataset(
            "chromadb/paul_graham_essay",
            split_name="data"
        )

    @classmethod
    def raw_text(cls) -> str:
        if cls.hf_data is None:
            cls.load_data()
        return "\n".join(cls.hf_data["document"])
    
    @classmethod
    def chunked(cls) -> List[Datapoint]:
        if cls.hf_data is None:
            cls.load_data()
        return cls.hf_data
    
    @classmethod
    def to_chroma(cls) -> AddEmbedding:
        return to_chroma_schema(cls.chunked())

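With the class defined, importing it into Chroma looks just like the earlier example. This sketch assumes import_into_chroma accepts an embedding_function keyword, mirroring the import helpers documented below:

import chromadb
from chroma_datasets.utils import import_into_chroma
from chromadb.utils import embedding_functions

# The dataset is pre-embedded with OpenAI ada-002, so attach the matching
# embedding function (see embedding_function_instructions above) for querying
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="text-embedding-ada-002"
)

client = chromadb.Client()
collection = import_into_chroma(
    chroma_client=client,
    dataset=PaulGrahamEssay,
    embedding_function=openai_ef  # assumed keyword, as in the helpers below
)
print(collection.query(query_texts=["What did the author work on?"]))
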
Add it to the manifest at chroma_datasets/__init__.py to make it easy for people to retrieve (optional).

Utility API Documentation

These methods are convenience helpers that make it easy to save and load Chroma collections to disk. See ./examples/example_export.ipynb for example usage.

from chroma_datasets.utils import (
    export_collection_to_hf_dataset,
    export_collection_to_hf_dataset_to_disk,
    import_chroma_exported_hf_dataset_from_disk,
    import_chroma_exported_hf_dataset
)

# Exports a Chroma collection to an in-memory HuggingFace Dataset
def export_collection_to_hf_dataset(chroma_client, collection_name, license="MIT"):

# Exports a Chroma collection to a HF dataset and saves to the path
def export_collection_to_hf_dataset_to_disk(chroma_client, collection_name, path, license="MIT"):

# Imports a HuggingFace Dataset into a Chroma Collection
def import_chroma_exported_hf_dataset(chroma_client, dataset, collection_name, embedding_function=None):

# Imports a HuggingFace Dataset from Disk and loads it into a Chroma Collection
def import_chroma_exported_hf_dataset_from_disk(chroma_client, path, collection_name, embedding_function=None):
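
As a minimal round-trip sketch using the signatures above (the paths and collection names are illustrative):

import chromadb
from chroma_datasets.utils import (
    export_collection_to_hf_dataset_to_disk,
    import_chroma_exported_hf_dataset_from_disk
)

client = chromadb.PersistentClient(path="./chroma_data")

# Snapshot a collection to disk as a Hugging Face dataset...
export_collection_to_hf_dataset_to_disk(
    chroma_client=client,
    collection_name="paul_graham_essay",
    path="./backup/paul_graham_essay"
)

# ...then restore it later, here into a new collection on the same client
restored = import_chroma_exported_hf_dataset_from_disk(
    chroma_client=client,
    path="./backup/paul_graham_essay",
    collection_name="paul_graham_essay_restored"
)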

Todo

  • Add a test suite covering the critical paths
  • Add automated PyPI releases
  • Add loaders for other locations (remote, like S3; local, like CSV; etc.)
  • A super easy Streamlit/Gradio wrapper to push up a collection and interact with it
