# ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB

ChromaDB Data Pipes is a collection of tools for building data pipelines for ChromaDB, inspired by the Unix philosophy of "do one thing and do it well".
Roadmap:

- ✅ Integration with LangChain 🦜🔗
- 🚫 Integration with LlamaIndex 🦙
- ✅ Support for embedding functions other than `all-MiniLM-L6-v2` (head over to Embedding Processors for more info)
- 🚫 Multimodal support
- ♾️ Much more!
## Installation

```bash
pip install chromadb-data-pipes
```

## Usage

Get help:

```bash
cdp --help
```
## Example Use Cases

This is a short list of use cases to help you evaluate whether this is the right tool for your needs:

- Import large datasets from local documents (PDF, TXT, etc.), from HuggingFace, from a local persisted ChromaDB, or even from another remote ChromaDB.
- Export large datasets to HuggingFace or any other data format supported by the library (if your format is not supported, either implement it in a small function or open an issue).
- Create a dataset from your data that you can share with others (including the embeddings).
- Clone a collection with a different embedding function, distance function, and other HNSW fine-tuning parameters.
- Re-embed documents in a collection with a different embedding function.
- Back up your data to a `jsonl` file.
- Use existing Unix (or other) tools to transform your data after exporting from, or before importing into, ChromaDB.
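Because `cdp` streams newline-delimited JSON between commands, a pipeline stage can be any script that reads and writes `.jsonl`. The sketch below shows such a stage; the record shape (`id`, `document`, `metadata` fields) is an assumption for illustration, not the documented cdp schema, so adjust it to match your actual export:

```python
import json
import sys


def transform(line: str) -> str:
    """Parse one exported record, tag its metadata, and re-serialize it.

    Assumes a record shape like {"id": ..., "document": ..., "metadata": {...}}.
    """
    record = json.loads(line)
    metadata = record.get("metadata") or {}
    metadata["reviewed"] = True  # example transformation
    record["metadata"] = metadata
    return json.dumps(record)


if __name__ == "__main__":
    # Usage: cdp export ... | python tag_records.py | cdp import ...
    for line in sys.stdin:
        if line.strip():
            print(transform(line))
```

A stage like this can sit anywhere in the pipe, for example between `cdp export` and `cdp import`.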
## Importing

Import data from HuggingFace Datasets to a `.jsonl` file:

```bash
cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl
```
Import data from HuggingFace Datasets to ChromaDB:

The command below imports the `train` split of the given dataset into the `chroma-qna` collection. The collection will be created if it does not exist, and documents will be upserted.

```bash
cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create
```
Import a directory of PDF files into a local persisted ChromaDB:

```bash
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create
```

Note: The above command imports the first matching PDF file from the `sample-data/papers/` directory, chunks it into 500-word chunks, embeds each chunk, and imports the chunks into the `my-pdfs` collection in ChromaDB.
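Conceptually, the 500-word chunking step behaves like the sketch below. This is a simplified stand-in for illustration, not cdp's actual implementation, which may split on different boundaries:

```python
def chunk_words(text: str, size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Example: a 1200-word document yields chunks of 500, 500, and 200 words.
```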
## Exporting

Export data from a local persisted ChromaDB to a `.jsonl` file:

The command below exports the first 10 documents from the `chroma-qna` collection to the `chroma-qna.jsonl` file.

```bash
cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl
```
Export data from a local persisted ChromaDB to a `.jsonl` file with a filter:

The command below uses a `where` filter to select which documents to export.

```bash
cdp export "file://chroma-data/chroma-qna" --where '{"document_id": "123"}' > chroma-qna.jsonl
```
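The `where` filter above selects documents by metadata equality. The same selection can also be done after export, on the `.jsonl` stream itself; the sketch below assumes each record carries a `metadata` object (an assumption about the export format):

```python
import json


def matches_where(record: dict, where: dict) -> bool:
    """True if every key/value pair in `where` equals the record's metadata."""
    metadata = record.get("metadata") or {}
    return all(metadata.get(key) == value for key, value in where.items())


def filter_records(lines, where):
    """Yield only the jsonl lines whose metadata matches `where`."""
    for line in lines:
        if line.strip() and matches_where(json.loads(line), where):
            yield line
```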
Export data from ChromaDB to HuggingFace Datasets:

The command below exports 10 documents, starting at offset 10, from the `chroma-qna` collection to the HuggingFace Datasets `tazarov/chroma-qna-modified` dataset. The dataset will be uploaded to HF.

HF Auth and Privacy: Make sure you have the `HF_TOKEN=hf_....` environment variable set. If you want your dataset to be private, add the `--private` flag to the `cdp ds-put` command.

```bash
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
```
To export a dataset to a file, use `--uri` with the `file://` prefix:

```bash
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"
```

File location: the file is relative to the current working directory.
## Processing

Copy one Chroma collection to another and re-embed the documents:

```bash
cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create
```

Note: See Embedding Processors for more info about supported embedding functions.
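Conceptually, re-embedding replaces each record's `embedding` with the output of the new embedding function before the records are re-imported. The sketch below illustrates the idea with a deterministic toy embedding; this is a stand-in for illustration only (`--ef default` uses a real model such as `all-MiniLM-L6-v2`):

```python
import hashlib


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in embedding; NOT a real model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def re_embed(records: list[dict]) -> list[dict]:
    """Overwrite each record's embedding, as the embed stage does conceptually."""
    for record in records:
        record["embedding"] = toy_embed(record["document"])
    return records
```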
Import a dataset from HF to a local persisted ChromaDB and embed the documents:

```bash
cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create
```

Chunk large documents:

```bash
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500
```
## Misc

Count the number of documents in a collection:

```bash
cdp export "http://localhost:8000/chroma-qna" | wc -l
```