ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB
ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".
Roadmap:
- ✅ Integration with LangChain 🦜🔗
- 🚫 Integration with LlamaIndex 🦙
- ✅ Support for embedding functions other than all-MiniLM-L6-v2 (head over to Embedding Processors for more info)
- 🚫 Multimodal support
- ♾️ Much more!
Installation
pip install chromadb-data-pipes
Usage
Get help:
cdp --help
Importing
Import data from HuggingFace Datasets to a .jsonl file:
cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl
Import data from HuggingFace Datasets to Chroma DB:
The command below imports the train split of the given dataset into the chroma-qna collection in Chroma. The collection is created if it does not exist, and documents are upserted.
cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create
Importing from a directory with PDF files into Local Persisted Chroma DB:
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create
Note: The above command reads the PDF files in the sample-data/papers/ directory, keeps the first record matching 2401.02412.pdf, chunks it into 500-word chunks, embeds each chunk and imports the chunks into the my-pdfs collection in Chroma DB.
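The grep and head stages in the pipeline above are plain line filters over the stream between cdp commands. The sketch below shows that selection step on its own. It assumes one JSON record per line; the sample records, their field names and the second file name are hypothetical stand-ins, not the actual output of cdp imp pdf.

```shell
# Hypothetical JSONL stream standing in for the output of `cdp imp pdf`.
# The field names are illustrative only, not the tool's actual schema.
sample_stream() {
  printf '%s\n' \
    '{"document": "page one", "metadata": {"source": "sample-data/papers/2401.02412.pdf"}}' \
    '{"document": "page two", "metadata": {"source": "sample-data/papers/2401.02412.pdf"}}' \
    '{"document": "other paper", "metadata": {"source": "sample-data/papers/2312.00001.pdf"}}'
}

# grep keeps only the lines that contain the file name,
# head -1 keeps the first of those lines and drops the rest.
# -F makes grep match the dots in the file name literally.
selected=$(sample_stream | grep -F "2401.02412.pdf" | head -1)
echo "$selected"
```

Without -F, each dot in the pattern matches any character, which is usually harmless for file names but can select more lines than intended.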
Exporting
Export data from Local Persisted Chroma DB to a .jsonl file:
The command below exports the first 10 documents from the chroma-qna collection to the chroma-qna.jsonl file.
cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl
Export data from Local Persisted Chroma DB to a .jsonl file with a filter:
The command below exports data from a local persisted Chroma DB to a .jsonl file, using a where filter to select the documents to export.
cdp export "file://chroma-data/chroma-qna" --where '{"document_id": "123"}' > chroma-qna.jsonl
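The where filter above is a JSON string, so the single quotes matter: they keep the shell from stripping the inner double quotes. When the value comes from a variable, one way to build the same string is sketched below. The variable names are hypothetical, and the sketch assumes the value contains no double quotes or backslashes, since it does no JSON escaping.

```shell
# Hypothetical variable holding the value to filter on.
doc_id="123"

# Build the JSON filter; printf substitutes the value into the %s slot.
where=$(printf '{"document_id": "%s"}' "$doc_id")
echo "$where"   # prints {"document_id": "123"}

# The result can then be passed in double quotes so it stays one argument:
#   cdp export "file://chroma-data/chroma-qna" --where "$where" > chroma-qna.jsonl
```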
Export data from Chroma DB to HuggingFace Datasets:
The command below exports 10 documents, starting at offset 10, from the chroma-qna collection to the tazarov/chroma-qna-modified dataset on HuggingFace Datasets. The dataset will be uploaded to HF.
HF Auth and Privacy: Make sure the HF_TOKEN=hf_.... environment variable is set. If you want your dataset to be private, add the --private flag to the cdp ds-put command.
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
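Because the upload depends on HF_TOKEN, it can help to fail early when the variable is missing. The helper below is a minimal sketch of such a guard. The function name require_hf_token is hypothetical and not part of cdp; the check only tests that the variable is non-empty, not that the token is valid.

```shell
# Return non-zero, with a message on stderr, when HF_TOKEN is unset or empty.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running cdp ds-put" >&2
    return 1
  fi
}

# Usage: run the export and upload only when the guard passes.
#   require_hf_token && cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
```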
To export a dataset to a file, pass a URI with the file:// prefix:
cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"
File Location: The file path is relative to the current working directory.
Processing
Copy documents from one Chroma collection to another and re-embed them:
cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create
Note: See Embedding Processors for more info about supported embedding functions.
Import dataset from HF to Local Persisted Chroma and embed the documents:
cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create
Chunk Large Documents:
cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500
Misc
Count the number of documents in a collection:
cdp export "http://localhost:8000/chroma-qna" | wc -l
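Counting with wc -l works when the export stream carries one record per line, as in the .jsonl files used above. The sketch below shows the same count on a small sample file; the three records are hypothetical.

```shell
# Write three hypothetical JSONL records to a temporary file.
tmp=$(mktemp)
printf '%s\n' '{"id": "1"}' '{"id": "2"}' '{"id": "3"}' > "$tmp"

# wc -l counts newline characters; tr strips the padding some platforms add.
count=$(wc -l < "$tmp" | tr -d ' ')
echo "$count"   # prints 3

rm -f "$tmp"
```

Note that wc -l counts newlines, so a file whose last line lacks a trailing newline is reported as one record short.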
Hashes for chromadb_data_pipes-0.0.5.tar.gz
Algorithm | Hash digest
---|---
SHA256 | ce1cb5d95cbf9a5d7f02d782ac7d344204087f539bfc50f43d218353db6227a5
MD5 | ade5f232c2791d09556b5f9423277025
BLAKE2b-256 | 38c9e74cae23e5e45a0cf3420844c430182744bc5edcacc3bebaa9224a428b67
Hashes for chromadb_data_pipes-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6718fb1dc9ac5e2f3489589040ade86d73eedaf87db81435a78e3f582439c8e6
MD5 | 943fd5f71b34a73f2046139c1566d7ba
BLAKE2b-256 | 7f04a3ad2697480b2ec77b2a9380aba291164416418fd5c155935b90b0841d15