List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.
HecVec
HecVec is a Python library that discovers .txt and .md files, chunks them (token, text, semantic, or LLM-based), embeds with OpenAI, and stores vectors in Chroma. It is library-only — no HTTP API. All work runs in-process.
Table of contents
- Install
- Requirements to run the pipeline
- Workflow
- Quick start
- Parameters
- API reference: methods and parameters
- Chunking methods
- Chroma server
- Environment and API key
- Collection naming
- Building blocks
- Module layout
- Development
- License
Install
Full pipeline (list → verify Chroma is up → read → chunk → embed → Chroma):
pip install hecvec
A Chroma server must be running; the pipeline connects only to that server (see Chroma server). There is no local/ephemeral mode.
Requirements to run the pipeline
To use the full Slicer.slice(...) pipeline you need:
- Python 3.9–3.13.
- Dependencies installed via pip install hecvec.
- OpenAI API key for embeddings (and for semantic/llm chunking). Set OPENAI_API_KEY in the environment or in a .env file (see Environment and API key).
- Chroma server: a Chroma server must be listening at host:port (default localhost:8000). If nothing is listening, the pipeline raises. Start one, e.g. with docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma (the bind mount keeps data across container restarts).
Workflow
The main entry point is Slicer.slice(path=..., **kwargs). It runs seven logged steps, numbered 0 to 6:
| Step | Description |
|---|---|
| 0 | Resolve path, resolve collection name (base_name + _ + chunking_method + _ + chunk config; see Collection naming). |
| 1 | Discover files: single .txt/.md file or recursive list under a directory. |
| 2 | Chroma server check: connect to the server and fail fast if nothing is listening (before read/chunk/embed so you don’t pay for OpenAI when Chroma is down). The client is reused for the final write. |
| 3 | Read file contents as text (UTF-8 with fallbacks). |
| 4 | Chunk using the chosen method (token, text, semantic, or llm). |
| 5 | Generate embeddings with OpenAI. |
| 6 | Connect (already verified in step 2), list collections; if the collection already exists, skip adding. Otherwise create the collection and add documents. |
Progress is logged as [0/6] … [6/6] with timings. If the collection already exists, the log states that clearly after embeddings and no new documents are added.
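The fail-fast ordering of step 2 can be illustrated with a plain TCP probe. This is a minimal sketch, not HecVec's implementation: `server_is_listening` and `run_pipeline` are hypothetical names, and the real pipeline connects through the Chroma client rather than a raw socket.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections at host:port (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_pipeline(check_server, expensive_steps):
    """Run the server check first so the costly steps never start when it fails."""
    if not check_server():
        raise RuntimeError("Chroma server is not reachable; aborting before read/chunk/embed")
    return expensive_steps()
```

A caller would pass something like `lambda: server_is_listening("localhost", 8000)` as `check_server`, so that no embedding request is made while the server is down.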
Quick start
import hecvec
# Default: token chunking, Chroma at localhost:8000
result = hecvec.Slicer.slice(path="/path/to/folder_or_file")
# → {"files": N, "chunks": M, "collection": "folder_or_file_name_token_cs200_ov0_enccl100k_base"}
# Custom host/port and semantic chunking
result = hecvec.Slicer.slice(
path="/path/to/docs",
host="localhost",
port=8000,
chunking_method="semantic",
)
Or use an instance:
slicer = hecvec.Slicer(
host="chroma", # e.g. Docker Compose service name (see `.devcontainer/docker-compose.yml`)
port=8000,
chunking_method="token",
)
result = slicer.slice(path="/data/myfile.txt")
Run the test script (from repo root, with Chroma running and OPENAI_API_KEY set):
# Terminal 1: start Chroma
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma
# Terminal 2: run pipeline
uv run python scripts/test_slice.py
# Or with a path:
uv run python scripts/test_slice.py /path/to/file_or_folder
Parameters
All of these can be passed to Slicer(...) or to Slicer.slice(..., key=value).
| Parameter | Default | Description |
|---|---|---|
| path | (required) | File or directory to process (.txt/.md only). |
| root | path.parent (file) or path (dir) | Safe root for resolving paths (used when listing under a directory). |
| collection_name | "hecvec" | Base name for the Chroma collection. If "hecvec", it is replaced by the file stem or directory name; the final name includes method + chunk config (so different chunk_size/chunk_overlap don’t collide). |
| db | "chroma" | Database to use. Only "chroma" is supported. When db="chroma", connection uses host and port. |
| host | "localhost" | Server host (used when db="chroma"). Server must be listening. |
| port | 8000 | Server port (used when db="chroma"). |
| user | None | Optional Basic Auth username for Chroma. Use together with password. |
| password | None | Optional Basic Auth password for Chroma. Use together with user. |
| chunking_method | "token" | Chunking strategy: "token" \| "text" \| "semantic" \| "llm". See Chunking methods. |
| chunk_size | 200 | Target chunk size (tokens for token, characters for text; also used by llm). |
| chunk_overlap | 0 | Overlap between consecutive chunks. |
| encoding_name | "cl100k_base" | Tiktoken encoding for token chunking. |
| embedding_model | "text-embedding-3-small" | OpenAI embedding model. |
| batch_size | 100 | Batch size for embedding API calls. |
| openai_api_key | from env / .env | OpenAI API key. Overrides env if provided. |
| dotenv_path | None | Path to .env file for loading OPENAI_API_KEY. |
API reference: methods and parameters
Public methods and functions with their parameters. All are available from import hecvec unless a submodule is noted.
Pipeline
Slicer(root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, user=None, password=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, openai_api_key=None, dotenv_path=None)
| Parameter | Type | Default | Description |
|---|---|---|---|
| root | str \| Path \| None | None (→ cwd) | Safe root for path resolution. |
| collection_name | str | "hecvec" | Base collection name; see Collection naming. |
| db | DbType | "chroma" | Database backend. Only "chroma" is supported. |
| host | str | "localhost" | Server host (when db="chroma"). |
| port | int | 8000 | Server port (when db="chroma"). |
| user | str \| None | None | Optional Basic Auth username for Chroma. Use together with password. |
| password | str \| None | None | Optional Basic Auth password for Chroma. Use together with user. |
| embedding_model | str | "text-embedding-3-small" | OpenAI embedding model. |
| chunk_size | int | 200 | Chunk size (tokens or chars by method). |
| chunk_overlap | int | 0 | Overlap between chunks. |
| encoding_name | str | "cl100k_base" | Tiktoken encoding. |
| batch_size | int | 100 | Embedding batch size. |
| openai_api_key | str \| None | None | OpenAI key (else env / .env). |
| dotenv_path | str \| Path \| None | None | Path to .env file. |
Slicer.slice(path, *, root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, user=None, password=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, chunking_method="token", openai_api_key=None, dotenv_path=None)
Same parameters as above, plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str \| Path | (required) | File or directory to process (.txt/.md). |
Returns: dict with files, chunks, collection, and optionally message (e.g. when collection already exists).
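Because message is optional, code that consumes the result should read it defensively. The helper below is a hypothetical illustration (summarize_result is not part of HecVec); it relies only on the four keys named above.

```python
def summarize_result(result: dict) -> str:
    """Turn a slice() result dict into a one-line summary (hypothetical helper)."""
    summary = f"{result['files']} file(s), {result['chunks']} chunk(s) -> {result['collection']}"
    # "message" is only present in some cases, e.g. when the collection already exists.
    message = result.get("message")
    return f"{summary} ({message})" if message else summary
```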
Listing and reading
ListDir(root)
| Parameter | Type | Description |
|---|---|---|
| root | str \| Path | Root directory; all listed paths are under this. |
ListDir.listdir(path=".") → list[str]
List one level under path (relative to root). Returns sorted relative path strings (dirs first, then files).
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str \| Path | "." | Path under root. |
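The idea of a safe root with a directories-first listing can be sketched with pathlib alone. This is an illustration under stated assumptions, not the ListDir source: safe_listdir is a hypothetical name, and the rejection of escaping paths via ValueError is a choice made for this sketch.

```python
from pathlib import Path


def safe_listdir(root: Path, path: str = ".") -> list[str]:
    """List one level under root/path and refuse paths that escape root."""
    root = root.resolve()
    target = (root / path).resolve()
    if not target.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{path!r} escapes the root {root}")
    # Sort key puts directories (is_file() == False) before files, then orders by name.
    entries = sorted(target.iterdir(), key=lambda p: (p.is_file(), p.name))
    return [str(p.relative_to(root)) for p in entries]
```

Resolving both paths before comparing them means that inputs such as ".." or symlinks pointing outside the root are caught by the same check.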
ListDir.listdir_recursive(path=".", max_depth=None) → list[str]
List all entries under path recursively.
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str \| Path | "." | Path under root. |
| max_depth | int \| None | None | Max depth; None = unlimited. |
ListDirTextFiles(root, allowed_extensions=(".txt", ".md"))
Subclass of ListDir that filters to .txt/.md only.
ListDirTextFiles.filter_txt_md(relative_paths) → list[Path]
From relative path strings, return full paths of files with allowed extensions.
ListDirTextFiles.listdir_txt_md(path=".") → list[Path]
One-level list of .txt/.md files under path.
ListDirTextFiles.listdir_recursive_txt_md(path=".", max_depth=None) → list[Path]
Recursive list of .txt/.md files under path.
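A recursive, extension-filtered listing with a depth limit might look like the sketch below. The name find_text_files, the depth convention (files directly in the root count as depth 1), and the case-insensitive suffix match are all assumptions of this example rather than documented HecVec behavior.

```python
from pathlib import Path
from typing import Optional

ALLOWED_EXTENSIONS = (".txt", ".md")


def find_text_files(root: Path, max_depth: Optional[int] = None) -> list[Path]:
    """Recursively collect .txt/.md files under root, sorted by path."""
    found = []
    for path in root.rglob("*"):
        # Depth is the number of path components below root (assumed convention).
        depth = len(path.relative_to(root).parts)
        if max_depth is not None and depth > max_depth:
            continue
        if path.is_file() and path.suffix.lower() in ALLOWED_EXTENSIONS:
            found.append(path)
    return sorted(found)
```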
ReadText(paths, encoding="utf-8")
| Parameter | Type | Default | Description |
|---|---|---|---|
| paths | list[str] \| list[Path] | (required) | File paths to read. |
| encoding | str | "utf-8" | Preferred encoding; fallbacks are latin-1, cp1252. |
ReadText.read_all() → list[tuple[Path, str]]
Read all files; returns (path, text) pairs. Skips non-files and unreadable paths.
ReadText is iterable: for path, text in reader: yields (path, text).
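Reading with encoding fallbacks amounts to trying each codec in order and keeping the first successful decode. The function below is a minimal sketch of that idea; read_text_with_fallbacks is a hypothetical name and the error raised when every codec fails is a choice of this example.

```python
from pathlib import Path


def read_text_with_fallbacks(path: Path, encodings=("utf-8", "latin-1", "cp1252")) -> str:
    """Try each encoding in order and return the first successful decode."""
    data = path.read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes; try the next one
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

In this sketch latin-1 maps every byte to a character, so with the order shown the cp1252 attempt is never reached; reordering the tuple changes which fallback wins.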
Chunking
chunk_text(text, chunk_size=400, chunk_overlap=0, separators=None) → list[str]
Single-document recursive character split. Requires hecvec[chunk].
Note: Slicer.slice does not expose separators directly; it uses the defaults from the low-level chunker.
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | (required) | Document text. |
| chunk_size | int | 400 | Max characters per chunk. |
| chunk_overlap | int | 0 | Overlap. |
| separators | list[str] \| None | None | Split order; default ["\n\n\n", "\n\n", "\n", ". ", " ", ""]. |
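To make the split order concrete, here is a simplified recursive character splitter. It is a sketch of the general technique, not the code behind chunk_text: recursive_split is a hypothetical name, overlap is omitted for brevity, and separators that fall on a chunk boundary are dropped.

```python
DEFAULT_SEPARATORS = ["\n\n\n", "\n\n", "\n", ". ", " ", ""]


def recursive_split(text, chunk_size=400, separators=None):
    """Split text into chunks of at most chunk_size characters.

    The coarsest separator is tried first; finer ones are used only for
    pieces that are still too long.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    separators = DEFAULT_SEPARATORS if separators is None else separators
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Merging adjacent pieces before emitting a chunk keeps paragraphs and sentences together whenever they fit, which is the main reason to prefer this over a fixed-width cut.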
chunk_documents(path_and_texts, chunk_size=400, chunk_overlap=0, separators=None) → list[dict]
Multiple documents, recursive character split. Each dict: {"path", "chunk_index", "content"}. Requires hecvec[chunk].
Note: Slicer.slice does not expose separators directly; it uses the defaults from the low-level chunker.
token_chunk_text(text, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base") → list[str]
Single-document token split (tiktoken).
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str | (required) | Document text. |
| chunk_size | int | 200 | Max tokens per chunk. |
| chunk_overlap | int | 0 | Overlap. |
| encoding_name | str | "cl100k_base" | Tiktoken encoding. |
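The interaction between chunk_size and chunk_overlap is easiest to see as a sliding window over a token list. The sketch below is tokenizer-agnostic and uses a hypothetical name, window_tokens; in a real token chunker the input would be tiktoken token ids, which is an assumption here rather than something shown in this README.

```python
def window_tokens(tokens, chunk_size=200, chunk_overlap=0):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks
```

With ten tokens, chunk_size=4 and chunk_overlap=1, the windows start at positions 0, 3 and 6, so each chunk repeats the last token of the previous one.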
token_chunk_documents(path_and_texts, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base") → tuple[list[str], list[str]]
Multiple documents, token split. Returns (ids, documents) with ids like chunk_0, chunk_1, ...
chunk_documents_by_method(path_and_texts, method="token", *, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", separators=None, openai_api_key=None, semantic_max_chunk_size=400, semantic_min_chunk_size=50, llm_model="gpt-4o-mini") → tuple[list[str], list[str]]
Chunk by method: "token" | "text" | "semantic" | "llm". Returns (ids, documents).
Note: this is a low-level helper with advanced knobs. Slicer.slice forwards only:
chunk_size, chunk_overlap, encoding_name, chunking_method (as method), and (when needed) openai_api_key.
So you only need chunk_size + chunk_overlap at the Slicer level; separators, semantic_max_chunk_size, semantic_min_chunk_size, and llm_model stay at their defaults unless you call this helper directly.
| Parameter | Type | Default | Description |
|---|---|---|---|
| path_and_texts | list[tuple[Path, str]] | (required) | (path, text) pairs. |
| method | ChunkingMethod | "token" | "token" \| "text" \| "semantic" \| "llm". |
| chunk_size | int | 200 | Used by token, text, llm. |
| chunk_overlap | int | 0 | Used by token, text. |
| encoding_name | str | "cl100k_base" | Token method. |
| separators | list[str] \| None | None | Text method only. |
| openai_api_key | str \| None | None | Required for semantic/llm. |
| semantic_max_chunk_size | int | 400 | Semantic method. |
| semantic_min_chunk_size | int | 50 | Semantic method. |
| llm_model | str | "gpt-4o-mini" | LLM method. |
Embeddings and Chroma
embed_texts(texts, api_key, model="text-embedding-3-small", batch_size=100) → list[list[float]]
OpenAI embeddings for a list of strings.
| Parameter | Type | Default | Description |
|---|---|---|---|
| texts | list[str] | (required) | Texts to embed. |
| api_key | str | (required) | OpenAI API key. |
| model | str | "text-embedding-3-small" | Embedding model. |
| batch_size | int | 100 | Request batch size. |
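The effect of batch_size can be sketched without calling any API: the texts are cut into consecutive slices and each slice becomes one request. In the example below, batched and embed_in_batches are hypothetical names, and embed_batch stands in for whatever function actually sends a batch to the embedding service.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Call embed_batch once per slice and concatenate the vectors in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Five texts with batch_size=2 therefore produce three requests of sizes 2, 2 and 1, and the returned vectors line up index by index with the input texts.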
get_client(host="localhost", port=8000, user=None, password=None) → Chroma HttpClient
Connects to the Chroma server at host:port. Raises if nothing is listening. If user and password are provided, uses Chroma Basic Auth.
| Parameter | Type | Default | Description |
|---|---|---|---|
| host | str | "localhost" | Chroma server host. |
| port | int | 8000 | Chroma server port. |
| user | str \| None | None | Basic Auth username for Chroma. Use together with password. |
| password | str \| None | None | Basic Auth password for Chroma. Use together with user. |
get_or_create_collection(client, name, metadata=None)
Get or create a Chroma collection (default cosine similarity). metadata default: {"hnsw:space": "cosine"}.
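For readers unfamiliar with the "cosine" space, the sketch below computes cosine similarity and the corresponding distance in plain Python. It only illustrates the metric; it assumes the usual definition of cosine distance as 1 minus cosine similarity, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity, so 0 means same direction."""
    return 1.0 - cosine_similarity(a, b)
```

Because the metric depends only on direction, scaling a vector does not change its distance to any other vector.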
add_documents(client, collection_name, ids, embeddings, documents) → dict
Add documents to a collection. Returns {"collection_existed": bool}.
list_collections(host="localhost", port=8000) → list[tuple[str, int]]
List collection names and document counts on a Chroma server: [(name, count), ...].
Environment
load_dotenv_if_available(dotenv_path=None)
Load .env into os.environ if python-dotenv is installed. No-op otherwise.
load_openai_key(dotenv_path=None) → str | None
Load .env if available, then return os.environ.get("OPENAI_API_KEY").
Chunking methods
| Method | Description | Requires |
|---|---|---|
| token | Split by token count (tiktoken, cl100k_base). Fast and deterministic. | (none) |
| text | Recursive character splitter (paragraph → line → sentence, etc.). | (none) |
| semantic | Embed small segments, then group by similarity (DP) into larger chunks. | OPENAI_API_KEY |
| llm | Use an LLM to choose split points for thematic sections. | OPENAI_API_KEY |
Use chunking_method="token" or "text" to avoid API calls during chunking. Use "semantic" or "llm" for more coherent, topic-aware chunks (at the cost of extra OpenAI usage).
Chroma server
The pipeline connects only to a Chroma server. There is no ephemeral or persistent local client; a server must be running at host:port (default localhost:8000). If nothing is listening, slice() and collections() raise RuntimeError.
Start a server (e.g. Docker):
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma
-v is a Docker bind mount: -v ./chroma-data:/chroma/chroma maps your local ./chroma-data directory into the container.
In practice, “persistent data” means Chroma’s database files (collections, vectors, metadata) are written to disk and survive docker stop / docker start (and even container recreation), so reruns can append without losing history.
Two common ways to run Chroma persistently:
- Plain Docker on the host: run the docker run ... -v ./chroma-data:... command above.
- Inside the provided devcontainer: use the compose setup in .devcontainer/docker-compose.yml (the devcontainer “compose” config in this repo). It starts a chroma service with a persistent Docker volume (chroma-data) mounted at /chroma/data and IS_PERSISTENT=TRUE, so reopening the devcontainer keeps your vectors.
Containers: If the app runs in a devcontainer and Chroma is in the same Docker Compose stack, use the service name as host (in this repo: host="chroma"). If Chroma is on the host and the app in the container, use host="host.docker.internal".
Environment and API key
- OpenAI: The pipeline (and semantic/llm chunking) needs an API key. It is read in this order:
  - Argument openai_api_key=...
  - Environment variable OPENAI_API_KEY
  - A .env file in the current working directory (loaded via python-dotenv when you use hecvec)
- .env: Create a .env in your project root (or set dotenv_path to point to one): OPENAI_API_KEY=sk-...
- Do not commit .env or expose the key in logs or source code.
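The lookup order above can be sketched in a few lines of standard-library Python. This is an illustration only: resolve_openai_key and read_dotenv_value are hypothetical names, the tiny .env parser is a stand-in for python-dotenv, and the environ parameter exists purely so the example can be exercised without touching the real environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def read_dotenv_value(dotenv_path: Path, name: str) -> Optional[str]:
    """Tiny stand-in for python-dotenv: find NAME=value in a .env file."""
    if not dotenv_path.is_file():
        return None
    for line in dotenv_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip("'\"")
    return None


def resolve_openai_key(
    openai_api_key: Optional[str] = None,
    dotenv_path: Union[str, Path, None] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve the key: explicit argument, then environment, then .env file."""
    environ = os.environ if environ is None else environ
    if openai_api_key:
        return openai_api_key
    if environ.get("OPENAI_API_KEY"):
        return environ["OPENAI_API_KEY"]
    return read_dotenv_value(Path(dotenv_path or ".env"), "OPENAI_API_KEY")
```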
Collection naming
- If you pass collection_name="hecvec" (default), the base name is taken from the input:
  - Single file: path.stem (e.g. mydoc)
  - Directory: path.name (e.g. docs)
- The final collection name is always: {base_name}_{chunking_method}_{chunk_config}. Examples:
  - token: mydoc_token_cs200_ov0_enccl100k_base
  - text: mydoc_text_cs400_ov0
  - llm/semantic: mydoc_llm_cs200
- If a collection with that name already exists, the pipeline does not add documents again. It logs that the collection already exists and returns something like:
  {"files": N, "chunks": 0, "collection": "...", "message": "Collection already exists; no documents added."}
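The naming scheme can be reproduced from the examples above. The sketch below is inferred from those three examples only: build_collection_name and default_base_name are hypothetical names, and the assumption that semantic follows the same pattern as llm (method name plus cs suffix) is not confirmed by this README.

```python
from pathlib import Path


def default_base_name(path: Path) -> str:
    """File stem for a single file, directory name otherwise."""
    return path.stem if path.is_file() else path.name


def build_collection_name(
    base_name: str,
    chunking_method: str = "token",
    chunk_size: int = 200,
    chunk_overlap: int = 0,
    encoding_name: str = "cl100k_base",
) -> str:
    """Assemble {base_name}_{chunking_method}_{chunk_config} as in the examples."""
    parts = [base_name, chunking_method, f"cs{chunk_size}"]
    if chunking_method in ("token", "text"):
        parts.append(f"ov{chunk_overlap}")  # overlap appears for token and text only
    if chunking_method == "token":
        parts.append(f"enc{encoding_name}")  # the encoding matters only for token chunking
    return "_".join(parts)
```

Encoding the chunk configuration in the name is what lets runs with different chunk_size or chunk_overlap values coexist on one server without overwriting each other.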
Building blocks
You can use the pipeline step-by-step.
List and read:
from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText
root = Path("/path/to/repo")
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)
text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))
Chunk only (e.g. recursive character, with hecvec[chunk]):
from pathlib import Path
from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents
root = Path("/path/to/repo")  # same safe root as in the previous example
paths = ListDirTextFiles(root=root).listdir_recursive_txt_md(".")
path_and_text = ReadText(paths).read_all()
chunks = chunk_documents(path_and_text) # list of {"path", "chunk_index", "content"}
Token chunk + embed + list Chroma collections:
from hecvec import token_chunk_text, embed_texts, list_collections
chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...")
names_and_counts = list_collections(host="localhost", port=8000)
CLI (list directory under a root):
hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]
Module layout
| Module | Responsibility |
|---|---|
| hecvec.env | Load .env and OPENAI_API_KEY |
| hecvec.listdir | List dirs under a safe root; filter .txt/.md |
| hecvec.reading | Read files as text (UTF-8 / latin-1 / cp1252 fallback) |
| hecvec.token_splitter | Token-based chunking (tiktoken) |
| hecvec.chunking | Recursive character chunking (chunk_documents, chunk_text) |
| hecvec.chunkers | Multi-method chunking: token, text, semantic, llm |
| hecvec.embeddings | OpenAI embeddings (embed_texts) |
| hecvec.chroma_client | Chroma client, get/create collection, add documents |
| hecvec.chroma_list | List Chroma collections and counts |
| hecvec.pipeline | Orchestrator: Slicer and slice(path=...) |
Development
From the repo root:
uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"
License
MIT