List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.
HecVec
List directories with a safe root, filter .txt/.md files, read them as text, and optionally chunk and push to Chroma — library only, no API.
Install
pip install hecvec
For the one-call pipeline (list → filter → token-chunk → Chroma), install the chroma extra:
pip install "hecvec[chroma]"
For chunking only (no Chroma), install the chunk extra:
pip install "hecvec[chunk]"
Usage
One-call pipeline (list → filter → chunk → Chroma)
Everything runs inside the library; there is no separate API service. You need a running Chroma server (e.g. docker run -p 8000:8000 chromadb/chroma) and OPENAI_API_KEY set, either in the environment or in a .env file. When hecvec[chroma] is installed, the library loads .env via python-dotenv.
import hecvec
# Class-style: use defaults, then slice
test = hecvec.Slicer()
result = test.slice(path="/path/to/folder")
# → {"files": N, "chunks": M, "collection": "hecvec"}
# Or call slice on the class (same flow)
result = hecvec.Slicer.slice(path="/path/to/folder")
Flow: resolve path → listdir → filter .txt/.md → token-chunk (200 tokens, cl100k_base) → embed with OpenAI → push to Chroma.
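The token-chunk step in the flow above can be pictured as a sliding window over a token sequence. The following is a minimal sketch, not hecvec's implementation: window_chunks is a hypothetical helper, whitespace-separated words stand in for cl100k_base tokens, and the default overlap of 0 is an assumption.

```python
from typing import List


def window_chunks(tokens: List[str], chunk_size: int = 200, chunk_overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    Hypothetical helper for illustration; not part of hecvec.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: whitespace "tokens" stand in for real tokenizer output.
words = "one two three four five six seven".split()
for chunk in window_chunks(words, chunk_size=3, chunk_overlap=1):
    print(" ".join(chunk))
```

With chunk_size=3 and chunk_overlap=1 the window advances two tokens at a time, so each chunk begins with the last token of the previous one.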
Optional config (instance or Slicer.slice(..., key=value)):
- root, collection_name, chroma_host, chroma_port
- embedding_model, chunk_size, chunk_overlap, encoding_name, batch_size
- openai_api_key (or set OPENAI_API_KEY in the environment or in a .env file; optional dotenv_path to point to a specific .env)
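How batch_size is applied internally is not documented here. A plausible reading is that chunks are sent for embedding and insertion in groups of at most batch_size. The sketch below shows that grouping only; batched is a hypothetical helper, not a hecvec function.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size items.

    Hypothetical helper for illustration; not part of hecvec.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


chunks = [f"chunk {i}" for i in range(5)]
for batch in batched(chunks, batch_size=2):
    # In a real pipeline each batch would go to one embedding request.
    print(len(batch), batch)
```

Smaller batches mean more requests but smaller payloads per request; the last batch may be shorter than the rest.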
Low-level building blocks
from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText
root = Path("/path/to/repo")
# List all entries under a path (restricted to root)
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)
# Only .txt and .md files, recursively
text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")
# Read each file as text
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))
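The "restricted to root" behaviour of the listers can be illustrated with plain pathlib. This is a sketch of the general technique under the assumption that a safe root means "resolve the path, then reject anything outside the root"; resolve_under_root is a hypothetical name and hecvec's actual check may differ.

```python
import tempfile
from pathlib import Path


def resolve_under_root(root: Path, relative: str) -> Path:
    """Resolve `relative` against `root` and refuse anything that escapes it.

    Hypothetical helper for illustration; not part of hecvec.
    """
    root = root.resolve()
    candidate = (root / relative).resolve()
    try:
        candidate.relative_to(root)  # raises ValueError if outside root
    except ValueError:
        raise ValueError(f"{relative!r} escapes root {root}") from None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    safe_root = Path(tmp)
    (safe_root / "docs").mkdir()
    print(resolve_under_root(safe_root, "docs").name)
    try:
        resolve_under_root(safe_root, "../outside")
    except ValueError as exc:
        print("rejected:", exc)
```

Resolving before comparing matters: it collapses ".." segments and follows symlinks, so a path cannot sneak outside the root through either.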
Chunking (optional)
With pip install hecvec[chunk]:
from pathlib import Path
from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents
root = Path("/path/to/repo")
lister = ListDirTextFiles(root=root)
paths = lister.listdir_recursive_txt_md(".")
reader = ReadText(paths)
path_and_text = reader.read_all()
chunks = chunk_documents(path_and_text)
# list of {"path": "...", "chunk_index": 0, "content": "..."}
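Chunk records in the shape shown above map naturally onto the parallel ids, documents, and metadatas lists that Chroma's collection.add accepts. The sketch below only builds those lists; to_chroma_payload and the "path::index" ID scheme are assumptions for illustration, not what hecvec does internally.

```python
from typing import Dict, List, Tuple, Union

ChunkRecord = Dict[str, Union[str, int]]


def to_chroma_payload(chunks: List[ChunkRecord]) -> Tuple[List[str], List[str], List[dict]]:
    """Turn chunk records into parallel id / document / metadata lists.

    Hypothetical helper; the ID format is an assumption.
    """
    ids = [f"{c['path']}::{c['chunk_index']}" for c in chunks]
    documents = [str(c["content"]) for c in chunks]
    metadatas = [{"path": c["path"], "chunk_index": c["chunk_index"]} for c in chunks]
    return ids, documents, metadatas


example = [
    {"path": "docs/a.md", "chunk_index": 0, "content": "first part"},
    {"path": "docs/a.md", "chunk_index": 1, "content": "second part"},
]
ids, documents, metadatas = to_chroma_payload(example)
print(ids)  # ['docs/a.md::0', 'docs/a.md::1']
```

Deriving the ID from path and chunk index keeps IDs unique within a run and stable across re-runs on unchanged files.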
CLI
hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]
Test the full pipeline (the method that does everything)
From the project root, with Chroma running and OPENAI_API_KEY set (e.g. in .env):
# Start Chroma (one terminal)
docker run -p 8000:8000 chromadb/chroma
# Run the test script (another terminal)
uv run python scripts/test_slice.py
# or: python scripts/test_slice.py
The script creates a temp folder with two .txt files, runs Slicer.slice(path=...), and prints PASS or FAIL with the result (files, chunks, collection).
Modular layout (easy to study)
Each step of the pipeline lives in its own module:
| Module | Responsibility |
|---|---|
| hecvec.env | Load .env and OPENAI_API_KEY |
| hecvec.listdir | List dirs under a safe root; filter by extension (.txt/.md) |
| hecvec.reading | Read files as text (UTF-8 / latin-1 / cp1252 fallback) |
| hecvec.token_splitter | Token-based chunking (TokenTextSplitter) |
| hecvec.chunking | Recursive-character chunking (RecursiveCharacterTextSplitter) |
| hecvec.embeddings | OpenAI embeddings (embed_texts) |
| hecvec.chroma_client | Chroma client, get/create collection, add documents |
| hecvec.chroma_list | List Chroma collections and counts |
| hecvec.pipeline | Orchestrator: Slicer and slice(path=...) |
Example: use one step on its own:
from hecvec import embed_texts, token_chunk_text, list_collections
chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...")
names_and_counts = list_collections(host="localhost", port=8000)
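The encoding fallback listed for hecvec.reading in the module table follows a common pattern: try each codec in order and keep the first that decodes cleanly. A minimal sketch of that pattern, assuming the order given in the table; decode_with_fallback is a hypothetical name, not the library's function.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "latin-1", "cp1252"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first codec that decodes data.

    Hypothetical helper for illustration; not part of hecvec.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next codec in the list
    raise ValueError("none of the given encodings could decode the data")


print(decode_with_fallback("café".encode("utf-8")))    # ('café', 'utf-8')
print(decode_with_fallback("café".encode("latin-1")))  # ('café', 'latin-1')
```

Note that latin-1 maps every byte value to a character, so in this order any codec listed after it is never reached; it acts as the catch-all.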
Development
From the repo root:
uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"
License
MIT