Skip to main content

Chroma Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB

Project description

ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB

ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".

Roadmap:

  • ✅ Integration with LangChain 🦜🔗
  • 🚫 Integration with LlamaIndex 🦙
  • ✅ Support more than all-MiniLM-L6-v2 as embedding functions (head over to Embedding Processors for more info)
  • 🚫 Multimodal support
  • ♾️ Much more!

Installation

pip install chromadb-data-pipes

Usage

Get help:

cdp --help

Example Use Cases

This is a short list of use cases to evaluate whether this is the right tool for your needs:

  • Importing large datasets from local documents (PDF, TXT, etc.), from HuggingFace, from local persisted Chroma DB or even another remote Chroma DB.
  • Exporting large dataset to HuggingFace or any other dataformat supported by the library (if your format is not supported, either implement it in a small function or open an issue)
  • Create a dataset from your data that you can share with others (including the embeddings)
  • Clone Collection with different embedding function, distance function, and other HNSW fine-tuning parameters
  • Re-embed documents in a collection with a different embedding function
  • Backup your data to a jsonl file
  • Use other existing unix or other tools to transform your data after exporting from or before importing into Chroma DB

Importing

Import data from HuggingFace Datasets to .jsonl file:

cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl

Import data from HuggingFace Datasets to Chroma DB:

The below command will import the train split of the given dataset to Chroma chroma-qna chroma-qna collection. The collection will be created if it does not exist and documents will be upserted.

cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create

Importing from a directory with PDF files into Local Persisted Chroma DB:

cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create

Note: The above command will import the first PDF file from the sample-data/papers/ directory, chunk it into 500 word chunks, embed each chunk and import the chunks to the my-pdfs collection in Chroma DB.

Exporting

Export data from Local Persisted Chroma DB to .jsonl file:

The below command will export the first 10 documents from the chroma-qna collection to chroma-qna.jsonl file.

cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl

Export data from Local Persisted Chroma DB to .jsonl file with filter:

The below command will export data from local persisted Chroma DB to a .jsonl file using a where filter to select the documents to export.

cdp export "file://chroma-data/chroma-qna" --where '{"document_id": "123"}' > chroma-qna.jsonl

Export data from Chroma DB to HuggingFace Datasets:

The below command will export the first 10 documents with offset 10 from the chroma-qna collection to HuggingFace Datasets tazarov/chroma-qna dataset. The dataset will be uploaded to HF.

HF Auth and Privacy: Make sure you have HF_TOKEN=hf_.... environment variable set. If you want your dataset to be private, add --private flag to the cdp ds-put command.

cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"

To export a dataset to a file, use --uri with file:// prefix:

cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"

File Location The file is relative to the current working directory.

Processing

Copy collection from one Chroma collection to another and re-embed the documents:

cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create

Note: See Embedding Processors for more info about supported embedding functions.

Import dataset from HF to Local Persisted Chroma and embed the documents:

cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create

Chunk Large Documents:

cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500

Misc

Count the number of documents in a collection:

cdp export "http://localhost:8000/chroma-qna" | wc -l

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chromadb_data_pipes-0.0.12.tar.gz (22.5 kB view details)

Uploaded Source

Built Distribution

chromadb_data_pipes-0.0.12-py3-none-any.whl (31.6 kB view details)

Uploaded Python 3

File details

Details for the file chromadb_data_pipes-0.0.12.tar.gz.

File metadata

  • Download URL: chromadb_data_pipes-0.0.12.tar.gz
  • Upload date:
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.4 CPython/3.9.20 Linux/6.5.0-1025-azure

File hashes

Hashes for chromadb_data_pipes-0.0.12.tar.gz
Algorithm Hash digest
SHA256 5fbe147fa2c6c7f9e79b8c1b2a6049af090fbea532e64992575842934b8d0293
MD5 bc3506cd1266a2d52ade522f4b7a2fd5
BLAKE2b-256 9b4a5606ab53dad47f2fbdf006c1f842bcd3b6559203cd993ae3910d5f49cd1f

See more details on using hashes here.

File details

Details for the file chromadb_data_pipes-0.0.12-py3-none-any.whl.

File metadata

File hashes

Hashes for chromadb_data_pipes-0.0.12-py3-none-any.whl
Algorithm Hash digest
SHA256 92be2ca5e27cabdd8b881e86c8c6887ed4d1737f603a3f650e3222b7e6a593fc
MD5 1350c057bbbb319586be8bf88f7bb11c
BLAKE2b-256 f80defedeac38ef22d88d4655597bd56b087d47b342a4e11d90269eb82db74f8

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page