List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.

HecVec

HecVec is a Python library that discovers .txt and .md files, chunks them (token, text, semantic, or LLM-based), embeds with OpenAI, and stores vectors in Chroma. It is library-only — no HTTP API. All work runs in-process.



Install

Full pipeline (list → verify Chroma is up → read → chunk → embed → Chroma):

pip install hecvec

A Chroma server must be reachable: either a self-hosted server (see Self-hosted Chroma server) or Chroma Cloud (see Chroma Cloud). There is no local/ephemeral mode.


Requirements to run the pipeline

To use the full Slicer.slice(...) pipeline you need:

  1. Python 3.9–3.13.
  2. Dependencies installed via pip install hecvec.
  3. OpenAI API key for embeddings (and for semantic / llm chunking). Set OPENAI_API_KEY in the environment or in a .env file (see Environment and API key).
  4. Chroma server: a Chroma server must be listening at host:port (default localhost:8000). If nothing is listening, the pipeline raises an error. Start one with, for example, docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma (the bind mount keeps data across container restarts).

Workflow

The main entry point is Slicer.slice(path=..., **kwargs). It runs seven logged steps, numbered 0 to 6:

Step Description
0 Resolve path, resolve collection name (base_name + _ + chunking_method + _cs + chunk_size; see Collection naming).
1 Discover files: single .txt/.md file or recursive list under a directory.
2 Chroma server check: connect to the server and fail fast if nothing is listening (before read/chunk/embed so you don’t pay for OpenAI when Chroma is down). The client is reused for the final write.
3 Read file contents as text (UTF-8 with fallbacks).
4 Chunk using the chosen method (token, text, semantic, or llm).
5 Generate embeddings. The API provider is inferred from embedding_model (e.g. text-embedding-3-small → OpenAI).
6 Write vectors to Chroma. If the collection already exists, new chunks are appended (same collection name).

Progress is logged as [0/6] through [6/6], with timings.
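The fail-fast ordering in step 2 can be illustrated with a plain TCP probe. This is a minimal sketch using only the standard library; `server_is_listening` and `run_pipeline_sketch` are hypothetical names, and hecvec's actual check (which goes through the Chroma client) may differ.

```python
import socket


def server_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


def run_pipeline_sketch(host: str = "localhost", port: int = 8000) -> None:
    """Check the server first, so no paid embedding call is made when it is down."""
    if not server_is_listening(host, port):
        raise ConnectionError(f"No Chroma server listening at {host}:{port}")
    # read -> chunk -> embed -> write would follow here.
```

Probing before the read/chunk/embed steps means a misconfigured host or port costs a fraction of a second rather than an embedding bill.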


Quick start

import hecvec

# Default: token chunking, Chroma at localhost:8000
result = hecvec.Slicer.slice(
    path="/path/to/folder_or_file",
    embedding_model="text-embedding-3-small",
)
# → {"files": N, "chunks": M, "collection": "folder_or_file_name_token_cs200"}

# Custom host/port and semantic chunking
result = hecvec.Slicer.slice(
    path="/path/to/docs",
    host="localhost",
    port=8000,
    embedding_model="text-embedding-3-small",
    chunking_method="semantic",
)

Or use an instance:

slicer = hecvec.Slicer(
    host="chroma",  # e.g. Docker Compose service name (see `.devcontainer/docker-compose.yml`)
    port=8000,
    embedding_model="text-embedding-3-small",
    chunking_method="token",
)
result = slicer.slice(path="/data/myfile.txt")

Run the test script (from repo root, with Chroma running and OPENAI_API_KEY set):

# Terminal 1: start Chroma
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma

# Terminal 2: run pipeline
uv run python scripts/test_slice.py
# Or with a path:
uv run python scripts/test_slice.py /path/to/file_or_folder

Parameters

All of these can be passed to Slicer(...) or to Slicer.slice(..., key=value).

| Parameter | Default | Description |
| --- | --- | --- |
| path | (required) | File or directory to process (.txt/.md only). |
| root | path.parent (file) or path (dir) | Safe root for resolving paths (used when listing under a directory). |
| collection_name | "hecvec" | Base name for the Chroma collection. If "hecvec", it is replaced by the file stem or directory name; the final name includes method + chunk size. Full config is recorded in collection/collections_info.md when a new collection is created. |
| server | "chroma" | Backend server to use: "chroma" (self-hosted) or "chroma_cloud" (Chroma Cloud). Alias: db. |
| host | "localhost" | Server host (only for server="chroma"). |
| port | 8000 | Server port (only for server="chroma"). |
| user | None | Optional Basic Auth username (only for server="chroma"). Use together with password. |
| password | None | Optional Basic Auth password (only for server="chroma"). Use together with user. |
| cloud_api_key | None | Chroma Cloud API key (only for server="chroma_cloud"). If not passed, hecvec reads it from .env/env. |
| cloud_tenant | None | Optional Chroma Cloud tenant (only for server="chroma_cloud"). If not passed, hecvec reads CHROMA_TENANT from .env/env. |
| cloud_database | None | Optional Chroma Cloud database (only for server="chroma_cloud"). If not passed, hecvec reads CHROMA_DATABASE from .env/env. |
| chunking_method | "token" | Chunking strategy: "token", "text", "semantic", or "llm". See Chunking methods. |
| chunk_size | 200 | Target chunk size (tokens for token, characters for text; also used by llm). |
| chunk_overlap | 0 | Overlap between consecutive chunks. |
| encoding_name | "cl100k_base" | Tiktoken encoding for token chunking. |
| embedding_model | "text-embedding-3-small" | Embedding model id; provider is inferred from the name (OpenAI for text-embedding-*, etc.). Aliases: llm_model, embeding_model (deprecated typo). |
| batch_size | 100 | Batch size for embedding API calls. |
| token_llm | from env / .env | OpenAI API key for embeddings (and forwarded for chunking where needed). Overrides OPENAI_API_KEY from env / .env if set. |
| dotenv_path | None | Path to .env file for loading OPENAI_API_KEY. |

API reference: methods and parameters

Public methods and functions with their parameters. All are available from import hecvec unless a submodule is noted.

Pipeline

Slicer(root=None, collection_name="hecvec", server="chroma", host="localhost", port=8000, user=None, password=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, token_llm=None, dotenv_path=None)

Parameter Type Default Description
root str | Path | None None (→ cwd) Safe root for path resolution.
collection_name str "hecvec" Base collection name; see Collection naming.
server DbType "chroma" Backend server: "chroma" (self-hosted) or "chroma_cloud" (Chroma Cloud). Alias: db.
host str "localhost" Server host (when server="chroma").
port int 8000 Server port (when server="chroma").
user str | None None Optional Basic Auth username for Chroma. Use together with password.
password str | None None Optional Basic Auth password for Chroma. Use together with user.
cloud_api_key str | None None Chroma Cloud API key (required when server="chroma_cloud").
cloud_tenant str | None None Optional Chroma Cloud tenant. If provided, cloud_database must also be provided.
cloud_database str | None None Optional Chroma Cloud database. If provided, cloud_tenant must also be provided.
embedding_model str "text-embedding-3-small" Embedding model id; provider is inferred (see infer_embedding_provider). Aliases: llm_model, embeding_model (deprecated typo).
chunk_size int 200 Chunk size (tokens or chars by method).
chunk_overlap int 0 Overlap between chunks.
encoding_name str "cl100k_base" Tiktoken encoding.
batch_size int 100 Embedding batch size.
token_llm str | None None OpenAI API key for embeddings; else OPENAI_API_KEY from env / .env.
dotenv_path str | Path | None None Path to .env file.

Slicer.slice(path, *, root=None, collection_name="hecvec", server="chroma", host="localhost", port=8000, user=None, password=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, chunking_method="token", token_llm=None, dotenv_path=None)

Same parameters as above, plus:

Parameter Type Default Description
path str | Path (required) File or directory to process (.txt/.md).

Returns: dict with files, chunks, collection, and optionally message (e.g. when collection already exists).


Listing and reading

ListDir(root)

Parameter Type Description
root str | Path Root directory; all listed paths are under this.

ListDir.listdir(path=".") → list[str]
List one level under path (relative to root). Returns sorted relative path strings (dirs first, then files).

Parameter Type Default Description
path str | Path "." Path under root.

ListDir.listdir_recursive(path=".", max_depth=None) → list[str]
List all entries under path recursively.

Parameter Type Default Description
path str | Path "." Path under root.
max_depth int | None None Max depth; None = unlimited.
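The "safe root" idea behind ListDir can be sketched as follows: resolve the requested path against the root and reject anything that escapes it. This is an illustration under assumptions, not hecvec's implementation; `resolve_under_root` and `listdir_sorted` are hypothetical helpers.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def resolve_under_root(root: PathLike, path: PathLike = ".") -> Path:
    """Resolve path relative to root and refuse results outside root."""
    root_resolved = Path(root).resolve()
    candidate = (root_resolved / path).resolve()
    if candidate != root_resolved and root_resolved not in candidate.parents:
        raise ValueError(f"{str(path)!r} escapes root {root_resolved}")
    return candidate


def listdir_sorted(root: PathLike, path: PathLike = ".") -> List[str]:
    """List one level under path: directories first, then files, each sorted by name."""
    root_resolved = Path(root).resolve()
    base = resolve_under_root(root_resolved, path)
    entries = sorted(base.iterdir(), key=lambda p: (not p.is_dir(), p.name))
    return [str(p.relative_to(root_resolved)) for p in entries]
```

Resolving before comparing means inputs such as ".." or an absolute path outside the root are rejected rather than silently followed.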

ListDirTextFiles(root, allowed_extensions=(".txt", ".md"))
Subclass of ListDir that filters to .txt/.md only.

ListDirTextFiles.filter_txt_md(relative_paths) → list[Path]
From relative path strings, return full paths of files with allowed extensions.

ListDirTextFiles.listdir_txt_md(path=".") → list[Path]
One-level list of .txt/.md files under path.

ListDirTextFiles.listdir_recursive_txt_md(path=".", max_depth=None) → list[Path]
Recursive list of .txt/.md files under path.
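As a rough illustration of the extension filter, the sketch below keeps only regular files whose suffix is in the allowed set. `filter_text_files` is a hypothetical stand-in, and the case-insensitive suffix comparison is an assumption that the source does not confirm.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union


def filter_text_files(
    root: Union[str, Path],
    relative_paths: Iterable[str],
    allowed_extensions: Tuple[str, ...] = (".txt", ".md"),
) -> List[Path]:
    """Return full paths of regular files under root with an allowed extension."""
    root_path = Path(root)
    selected = []
    for rel in relative_paths:
        full = root_path / rel
        # Assumption: compare suffixes case-insensitively; skip directories.
        if full.is_file() and full.suffix.lower() in allowed_extensions:
            selected.append(full)
    return selected
```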

ReadText(paths, encoding="utf-8")

Parameter Type Default Description
paths list[str] | list[Path] File paths to read.
encoding str "utf-8" Preferred encoding; fallbacks are latin-1, cp1252.

ReadText.read_all() → list[tuple[Path, str]]
Read all files; returns (path, text) pairs. Skips non-files and unreadable paths.

ReadText is iterable: for path, text in reader: yields (path, text).
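The encoding fallback can be pictured as trying each encoding in order until one decodes. This is a sketch only; `read_text_with_fallbacks` is a hypothetical name, and returning None for skipped paths is an assumption about how "skips" could be modeled.

```python
from pathlib import Path
from typing import Optional, Tuple, Union


def read_text_with_fallbacks(
    path: Union[str, Path],
    encoding: str = "utf-8",
    fallbacks: Tuple[str, ...] = ("latin-1", "cp1252"),
) -> Optional[str]:
    """Decode a file with the preferred encoding, then each fallback in turn."""
    p = Path(path)
    if not p.is_file():
        return None  # non-files are skipped
    try:
        raw = p.read_bytes()
    except OSError:
        return None  # unreadable paths are skipped
    for enc in (encoding, *fallbacks):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None
```

Note that latin-1 maps every byte to a character, so with this fallback order the cp1252 attempt is only reached if latin-1 is removed or reordered.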


Chunking

chunk_text(text, chunk_size=400, chunk_overlap=0, separators=None) → list[str]
Single-document recursive character split. Requires hecvec[chunk].

Note: Slicer.slice does not expose separators directly; it uses the defaults from the low-level chunker.

Parameter Type Default Description
text str Document text.
chunk_size int 400 Max characters per chunk.
chunk_overlap int 0 Overlap.
separators list[str] | None None Split order; default ["\n\n\n", "\n\n", "\n", ". ", " ", ""].
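To show how a separator order like the default above drives the split, here is a simplified recursive character splitter. It is a sketch, not hecvec's chunker: `recursive_split` is a hypothetical name, and chunk_overlap is deliberately left out to keep the recursion easy to follow.

```python
from typing import List, Tuple

DEFAULT_SEPARATORS = ("\n\n\n", "\n\n", "\n", ". ", " ", "")


def recursive_split(
    text: str,
    chunk_size: int = 400,
    separators: Tuple[str, ...] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators are tried first; a piece that is still too long is
    split again with the next, finer separator.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

For example, `recursive_split("aaaa bbbb cccc", chunk_size=9)` merges the first two words into one chunk and leaves the third on its own.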

chunk_documents(path_and_texts, chunk_size=400, chunk_overlap=0, separators=None) → list[dict]
Multiple documents, recursive character split. Each dict: {"path", "chunk_index", "content"}. Requires hecvec[chunk].

Note: Slicer.slice does not expose separators directly; it uses the defaults from the low-level chunker.

token_chunk_text(text, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base") → list[str]
Single-document token split (tiktoken).

Parameter Type Default Description
text str Document text.
chunk_size int 200 Max tokens per chunk.
chunk_overlap int 0 Overlap.
encoding_name str "cl100k_base" Tiktoken encoding.

token_chunk_documents(path_and_texts, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base") → tuple[list[str], list[str]]
Multiple documents, token split. Returns (ids, documents) with ids like chunk_0, chunk_1, ...
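The interaction between chunk_size and chunk_overlap is easiest to see on a plain list of token ids. The sketch below windows any token sequence; in hecvec the tokens come from tiktoken, which is not used here so the example stays dependency-free. `window_tokens` and `make_ids` are hypothetical names.

```python
from typing import List, Sequence


def window_tokens(tokens: Sequence[int], chunk_size: int = 200, chunk_overlap: int = 0) -> List[List[int]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


def make_ids(count: int) -> List[str]:
    """Sequential ids in the chunk_0, chunk_1, ... style described above."""
    return [f"chunk_{i}" for i in range(count)]
```

With ten tokens, chunk_size=4 and chunk_overlap=1, the windows start at positions 0, 3 and 6, so each chunk repeats the last token of the previous one.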

chunk_documents_by_method(path_and_texts, method="token", *, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", separators=None, openai_api_key=None, semantic_max_chunk_size=400, semantic_min_chunk_size=50, llm_model="gpt-4o-mini") → tuple[list[str], list[str]]
Chunk by method: "token" | "text" | "semantic" | "llm". Returns (ids, documents).

Note: this is a low-level helper with advanced knobs. Slicer.slice forwards only: chunk_size, chunk_overlap, encoding_name, chunking_method (as method), and (when needed) openai_api_key.
So you only need chunk_size + chunk_overlap at the Slicer level; separators, semantic_max_chunk_size, semantic_min_chunk_size, and llm_model stay at their defaults unless you call this helper directly.

Parameter Type Default Description
path_and_texts list[tuple[Path, str]] (path, text) pairs.
method ChunkingMethod "token" "token" | "text" | "semantic" | "llm".
chunk_size int 200 Used by token, text, llm.
chunk_overlap int 0 Used by token, text.
encoding_name str "cl100k_base" Token method.
separators list[str] | None None Text method only.
openai_api_key str | None None Required for semantic/llm.
semantic_max_chunk_size int 400 Semantic method.
semantic_min_chunk_size int 50 Semantic method.
llm_model str "gpt-4o-mini" LLM method.

Embeddings and Chroma

embed_texts(texts, *, api_key, embedding_model="text-embedding-3-small", batch_size=100) → list[list[float]]
Embeddings for a list of strings. Provider is inferred from embedding_model.

Parameter Type Default Description
texts list[str] Texts to embed.
api_key str API key for the inferred provider (OpenAI today).
embedding_model str "text-embedding-3-small" Model id. Aliases when calling: llm_model, model.
batch_size int 100 Request batch size.

infer_embedding_provider(embedding_model) → "openai"
Returns the provider for a model name, or raises ValueError if unknown / unsupported (e.g. Gemini ids are rejected until supported).
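Provider inference and request batching can be sketched without calling any API. Everything below is illustrative: `infer_provider_sketch`, `batched`, and `embed_in_batches` are hypothetical names, the prefix rule is an assumption based on the "text-embedding-*" wording above, and `embed_batch` stands in for the real provider call.

```python
from typing import Callable, Iterator, List, Sequence


def infer_provider_sketch(embedding_model: str) -> str:
    """Map a model id to a provider; unknown ids raise ValueError."""
    if embedding_model.startswith("text-embedding-"):
        return "openai"
    raise ValueError(f"Unsupported embedding model: {embedding_model!r}")


def batched(items: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Call embed_batch once per batch and concatenate vectors in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Keeping the provider call behind a callable makes the batching logic testable with a fake embedder that needs no API key.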

get_client(host="localhost", port=8000, user=None, password=None) → Chroma HttpClient
Connects to the Chroma server at host:port. Raises if nothing is listening. If user and password are provided, uses Chroma Basic Auth.

Parameter Type Default Description
host str "localhost" Chroma server host.
port int 8000 Chroma server port.
user str | None None Basic Auth username for Chroma. Use together with password.
password str | None None Basic Auth password for Chroma. Use together with user.

get_or_create_collection(client, name, metadata=None)
Get or create a Chroma collection (default cosine similarity). metadata default: {"hnsw:space": "cosine"}.

add_documents(client, collection_name, ids, embeddings, documents) → dict
Add documents to a collection. Returns {"collection_existed": bool}.

list_collections(host="localhost", port=8000, *, server="chroma", db=None, ...) → list[tuple[str, int]]
List collection names and document counts: [(name, count), ...].

  • Self-hosted (server="chroma", default): uses host / port (and optional user / password). Same as Slicer(..., server="chroma").
  • Chroma Cloud (server="chroma_cloud"): pass cloud_api_key (or use .env / env vars); host and port are not used. Alias: db= (same as server=).

If you wrote to Cloud with Slicer.slice(..., server="chroma_cloud"), list with list_collections(server="chroma_cloud", dotenv_path="...") (or pass the same cloud kwargs)—not host="localhost", port=8000.


Environment

load_dotenv_if_available(dotenv_path=None)
Load .env into os.environ if python-dotenv is installed. No-op otherwise.

load_openai_key(dotenv_path=None) → str | None
Load .env if available, then return os.environ.get("OPENAI_API_KEY").


Chunking methods

| Method | Description | Requires |
| --- | --- | --- |
| token | Split by token count (tiktoken, cl100k_base). Fast and deterministic. | |
| text | Recursive character splitter (paragraph → line → sentence, etc.). | |
| semantic | Embed small segments, then group by similarity (DP) into larger chunks. | OPENAI_API_KEY |
| llm | Use an LLM to choose split points for thematic sections. | OPENAI_API_KEY |

Use chunking_method="token" or "text" to avoid API calls during chunking. Use "semantic" or "llm" for more coherent, topic-aware chunks (at the cost of extra OpenAI usage).


Self-hosted Chroma server (server="chroma")

Use this when you run Chroma yourself (Docker, EC2, devcontainer, etc). hecvec connects to a Chroma server over HTTP using host and port.

Start a server (e.g. Docker):

docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma

-v is a Docker bind mount: -v ./chroma-data:/chroma/chroma maps your local ./chroma-data directory into the container.

In practice, “persistent data” means Chroma’s database files (collections, vectors, metadata) are written to disk and survive docker stop / docker start (and even container recreation), so reruns can append without losing history.

Parameters used by server="chroma":

  • Required: host, port
  • Optional: user, password (Basic Auth header; both-or-neither)

Authentication (recommended)

Modern Chroma releases (the chromadb/chroma:latest image) do not reliably support the older CHROMA_SERVER_AUTHN_* env-var based auth providers. If you need enforced auth on a self-hosted instance (e.g. EC2), the simplest reliable pattern is:

  • Run Chroma on a private network interface (or at least don’t expose it directly).
  • Put a reverse proxy in front that enforces Basic Auth (or token/JWT) and TLS.

When you pass user= and password= to hecvec.Slicer(...), hecvec will send a standard HTTP Authorization: Basic ... header, which works with a proxy-enforced Basic Auth setup.
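For reference, a standard Basic Auth header is the base64 encoding of "user:password". The sketch below builds one with the standard library and applies the both-or-neither rule stated above; `basic_auth_header` is a hypothetical helper, not a hecvec function.

```python
import base64
from typing import Dict, Optional


def basic_auth_header(user: Optional[str], password: Optional[str]) -> Dict[str, str]:
    """Build an HTTP Basic Auth header, or an empty dict when no credentials are given."""
    if (user is None) != (password is None):
        raise ValueError("user and password must be provided together")
    if user is None:
        return {}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Because base64 is an encoding and not encryption, such a header should only travel over TLS, which is why the proxy above is expected to terminate HTTPS.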

Two common ways to run Chroma persistently:

  1. Plain Docker on the host: run the docker run ... -v ./chroma-data:... command above.
  2. Inside the provided devcontainer: use the compose setup in .devcontainer/docker-compose.yml (the devcontainer “compose” config in this repo). It starts a chroma service with a persistent Docker volume (chroma-data) mounted at /chroma/data and IS_PERSISTENT=TRUE, so reopening the devcontainer keeps your vectors.

Containers: If the app runs in a devcontainer and Chroma is in the same Docker Compose stack, use the service name as host (in this repo: host="chroma"). If Chroma is on the host and the app in the container, use host="host.docker.internal".


Chroma Cloud (server="chroma_cloud")

Use this when you want a managed Chroma deployment with built-in auth and TLS. hecvec uses the Chroma Python client’s Cloud client under the hood.

Parameters used by server="chroma_cloud":

  • Required: cloud_api_key (or set via .env/env)
  • Optional: cloud_tenant and cloud_database (either provide both, or omit both)
  • Ignored: host, port, user, password

Environment / .env variables (loaded automatically):

  • CHROMA_CLOUD_API_KEY=... (preferred) or CHROMA_API_KEY=...
  • CHROMA_TENANT=... (optional; used when cloud_tenant not passed)
  • CHROMA_DATABASE=... (optional; used when cloud_database not passed)
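The precedence between explicit arguments and environment variables can be sketched as below. This illustrates the rules listed above under stated assumptions: `resolve_cloud_settings` is a hypothetical name, and the exact errors hecvec raises are not documented here.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_cloud_settings(
    cloud_api_key: Optional[str] = None,
    cloud_tenant: Optional[str] = None,
    cloud_database: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Explicit arguments win; otherwise fall back to environment variables."""
    env = os.environ if env is None else env
    # CHROMA_CLOUD_API_KEY is preferred over CHROMA_API_KEY.
    api_key = cloud_api_key or env.get("CHROMA_CLOUD_API_KEY") or env.get("CHROMA_API_KEY")
    if not api_key:
        raise ValueError("A Chroma Cloud API key is required")
    tenant = cloud_tenant or env.get("CHROMA_TENANT")
    database = cloud_database or env.get("CHROMA_DATABASE")
    if bool(tenant) != bool(database):
        raise ValueError("Provide both tenant and database, or neither")
    return {"api_key": api_key, "tenant": tenant, "database": database}
```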

Example:

import hecvec

# Minimal: only API key needed (tenant/database optional depending on your Cloud setup)
slicer = hecvec.Slicer(server="chroma_cloud", cloud_api_key="ck-...")
result = slicer.slice(path="/path/to/docs")

Environment and API key

  • OpenAI: The pipeline (and semantic / llm chunking) needs an API key. It is read in this order:

    1. Argument token_llm=... (for embeddings; provider is inferred from embedding_model)
    2. Environment variable OPENAI_API_KEY
    3. A .env file in the current working directory (loaded via python-dotenv when you use hecvec)
  • .env: Create a .env in your project root (or set dotenv_path to point to one):

    OPENAI_API_KEY=sk-...
    
  • Do not commit .env or expose the key in logs or source code.
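The lookup order above can be sketched in a few lines. The .env parsing here is a deliberately tiny stand-in for python-dotenv so the example has no dependencies; `read_dotenv_value` and `resolve_openai_key` are hypothetical names and do not reproduce hecvec's loader.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def read_dotenv_value(dotenv_path: Union[str, Path], name: str) -> Optional[str]:
    """Return the value of NAME=value from a .env file, or None if absent."""
    p = Path(dotenv_path)
    if not p.is_file():
        return None
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip('"').strip("'")
    return None


def resolve_openai_key(
    token_llm: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    dotenv_path: Union[str, Path] = ".env",
) -> Optional[str]:
    """Order: explicit argument, then environment variable, then .env file."""
    env = os.environ if env is None else env
    return (
        token_llm
        or env.get("OPENAI_API_KEY")
        or read_dotenv_value(dotenv_path, "OPENAI_API_KEY")
    )
```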


Collection naming

  • If you pass collection_name="hecvec" (default), the base name is taken from the input:

    • Single file: path.stem (e.g. mydoc)
    • Directory: path.name (e.g. docs)
  • The final collection name is always:

    {base_name}_{chunking_method}_cs{chunk_size}

    Examples:

    • token: mydoc_token_cs200
    • text: mydoc_text_cs400
    • llm: mydoc_llm_cs200
    • semantic: mydoc_semantic_cs200
  • Detailed collection configuration is persisted to:

    • collection/collections_info.md
    • A new row is appended only when a collection is newly created.
  • If a collection with that name already exists, the pipeline appends new chunks to it and increments append_runs in collection/collections_info.md when possible.
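The naming rule can be written down as a small function. This is a sketch of the documented format only: `final_collection_name` is a hypothetical helper, and it guesses "file vs directory" from the .txt/.md suffix instead of touching the filesystem, which is an assumption made to keep the example self-contained.

```python
from pathlib import Path
from typing import Union


def final_collection_name(
    path: Union[str, Path],
    chunking_method: str = "token",
    chunk_size: int = 200,
    collection_name: str = "hecvec",
) -> str:
    """Build {base_name}_{chunking_method}_cs{chunk_size}."""
    p = Path(path)
    if collection_name == "hecvec":
        # Default name: derive the base from the input path.
        is_text_file = p.suffix.lower() in (".txt", ".md")
        base_name = p.stem if is_text_file else p.name
    else:
        base_name = collection_name
    return f"{base_name}_{chunking_method}_cs{chunk_size}"
```

Since the method and chunk size are part of the name, re-running with different chunking settings targets a different collection, while re-running with the same settings appends to the existing one.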


Building blocks

You can use the pipeline step-by-step.

List and read:

from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText

root = Path("/path/to/repo")
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)

text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))

Chunk only (e.g. recursive character, with hecvec[chunk]):

from pathlib import Path

from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents

root = Path("/path/to/repo")  # same root as in the listing example
paths = ListDirTextFiles(root=root).listdir_recursive_txt_md(".")
path_and_text = ReadText(paths).read_all()  # list of (path, text) pairs
chunks = chunk_documents(path_and_text)  # list of {"path", "chunk_index", "content"}

Token chunk + embed + list Chroma collections:

from hecvec import token_chunk_text, embed_texts, list_collections

chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...", embedding_model="text-embedding-3-small")
# Self-hosted Chroma on localhost:8000
names_and_counts = list_collections(host="localhost", port=8000)
# Chroma Cloud (same backend as Slicer with server="chroma_cloud")
names_and_counts = list_collections(server="chroma_cloud", dotenv_path=".env")

CLI (list directory under a root):

hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]

Module layout

Module Responsibility
hecvec.env Load .env and OPENAI_API_KEY
hecvec.listdir List dirs under a safe root; filter .txt/.md
hecvec.reading Read files as text (UTF-8 / latin-1 / cp1252 fallback)
hecvec.token_splitter Token-based chunking (tiktoken)
hecvec.chunking Recursive character chunking (chunk_documents, chunk_text)
hecvec.chunkers Multi-method chunking: token, text, semantic, llm
hecvec.embeddings embed_texts, infer_embedding_provider
hecvec.chroma_client Chroma client, get/create collection, add documents
hecvec.chroma_list List Chroma collections and counts
hecvec.pipeline Orchestrator: Slicer and slice(path=...)

Development

From the repo root:

uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"

License

MIT
