
List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.

Project description

HecVec

HecVec is a Python library that discovers .txt and .md files, chunks them (token, text, semantic, or LLM-based), embeds with OpenAI, and stores vectors in Chroma. It is library-only — no HTTP API. All work runs in-process.


Install

Full pipeline (list → read → chunk → embed → Chroma):

pip install hecvec

A Chroma server must be running; the pipeline connects only to that server (see Chroma server). There is no local/ephemeral mode.


Requirements to run the pipeline

To use the full Slicer.slice(...) pipeline you need:

  1. Python 3.9–3.13.
  2. Dependencies installed via pip install hecvec.
  3. OpenAI API key for embeddings (and for semantic / llm chunking). Set OPENAI_API_KEY in the environment or in a .env file (see Environment and API key).
  4. Chroma server — A Chroma server must be listening at host:port (default localhost:8000). If nothing is listening, the pipeline raises. Start one e.g. with docker run -p 8000:8000 chromadb/chroma.
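Both runtime prerequisites can be checked before calling the pipeline. The helper below is illustrative (not part of hecvec): it only verifies that the key variable is set and that something is listening on the Chroma port.

```python
import os
import socket

def check_prereqs(host="localhost", port=8000):
    """Return (key_set, chroma_reachable) without raising (illustrative helper)."""
    key_set = bool(os.environ.get("OPENAI_API_KEY"))
    try:
        # A plain TCP connect is enough to see whether anything listens on host:port.
        with socket.create_connection((host, port), timeout=1):
            reachable = True
    except OSError:
        reachable = False
    return key_set, reachable
```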

Workflow

The main entry point is Slicer.slice(path=..., **kwargs). It runs a setup step (step 0) followed by five pipeline steps:

Step Description
0 Resolve path, resolve collection name (base_name + _ + chunking_method).
1 Discover files: single .txt/.md file or recursive list under a directory.
2 Read file contents as text (UTF-8 with fallbacks).
3 Chunk using the chosen method (token, text, semantic, or llm).
4 Generate embeddings with OpenAI.
5 Connect to Chroma; if the collection already exists, skip adding (no duplicate docs). Otherwise create the collection and add documents.

Progress is logged as [0/5] through [5/5] with timings. If the collection already exists, the log states that clearly and no new documents are added.


Quick start

import hecvec

# Default: token chunking, Chroma at localhost:8000
result = hecvec.Slicer.slice(path="/path/to/folder_or_file")
# → {"files": N, "chunks": M, "collection": "folder_or_file_name_token"}

# Custom host/port and semantic chunking
result = hecvec.Slicer.slice(
    path="/path/to/docs",
    host="localhost",
    port=8000,
    chunking_method="semantic",
)

Or use an instance:

slicer = hecvec.Slicer(
    host="chroma",  # e.g. Docker service name
    port=8000,
    chunking_method="token",
)
result = slicer.slice(path="/data/myfile.txt")

Run the test script (from repo root, with Chroma running and OPENAI_API_KEY set):

# Terminal 1: start Chroma
docker run -p 8000:8000 chromadb/chroma

# Terminal 2: run pipeline
uv run python scripts/test_slice.py
# Or with a path:
uv run python scripts/test_slice.py /path/to/file_or_folder

Parameters

All of these can be passed to Slicer(...) or to Slicer.slice(..., key=value).

Parameter Default Description
path (required) File or directory to process (.txt/.md only).
root path.parent (file) or path (dir) Safe root for resolving paths (used when listing under a directory).
collection_name "hecvec" Base name for the Chroma collection. If "hecvec", it is replaced by the file stem or directory name; the final name is always {collection_name}_{chunking_method} (e.g. mydoc_token).
db "chroma" Database to use. Only "chroma" is supported. When db="chroma", connection uses host and port.
host "localhost" Server host (used when db="chroma"). Server must be listening.
port 8000 Server port (used when db="chroma").
auth None Optional Basic Auth credentials for db="chroma" as "username:password" (matches Chroma Basic Auth).
chunking_method "token" Chunking strategy: "token" | "text" | "semantic" | "llm". See Chunking methods.
chunk_size 200 Target chunk size (tokens for token, characters for text; also used by llm).
chunk_overlap 0 Overlap between consecutive chunks.
encoding_name "cl100k_base" Tiktoken encoding for token chunking.
embedding_model "text-embedding-3-small" OpenAI embedding model.
batch_size 100 Batch size for embedding API calls.
openai_api_key from env / .env OpenAI API key. Overrides env if provided.
dotenv_path None Path to .env file for loading OPENAI_API_KEY.

API reference: methods and parameters

Public methods and functions with their parameters. All are available from import hecvec unless a submodule is noted.

Pipeline

Slicer(root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, auth=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, openai_api_key=None, dotenv_path=None)

Parameter Type Default Description
root str | Path | None None (→ cwd) Safe root for path resolution.
collection_name str "hecvec" Base collection name; see Collection naming.
db DbType "chroma" Database backend. Only "chroma" is supported.
host str "localhost" Server host (when db="chroma").
port int 8000 Server port (when db="chroma").
auth str | None None Optional Basic Auth credentials for Chroma as "username:password".
embedding_model str "text-embedding-3-small" OpenAI embedding model.
chunk_size int 200 Chunk size (tokens or chars by method).
chunk_overlap int 0 Overlap between chunks.
encoding_name str "cl100k_base" Tiktoken encoding.
batch_size int 100 Embedding batch size.
openai_api_key str | None None OpenAI key (else env / .env).
dotenv_path str | Path | None None Path to .env file.

Slicer.slice(path, *, root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, auth=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, chunking_method="token", openai_api_key=None, dotenv_path=None)

Same parameters as above, plus:

Parameter Type Default Description
path str | Path (required) File or directory to process (.txt/.md).

Returns: dict with files, chunks, collection, and optionally message (e.g. when collection already exists).
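Because message is optional, callers should not assume chunks > 0. A hypothetical result dict and the corresponding check:

```python
# Hypothetical return value for the case where the collection already existed
result = {
    "files": 2,
    "chunks": 0,
    "collection": "docs_token",
    "message": "Collection already exists; no documents added.",
}

# "message" is only present when nothing was added, so branch on it
if "message" in result:
    summary = result["message"]
else:
    summary = f"Added {result['chunks']} chunks to {result['collection']}"
```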


Listing and reading

ListDir(root)

Parameter Type Description
root str | Path Root directory; all listed paths are under this.

ListDir.listdir(path=".") → list[str]
List one level under path (relative to root). Returns sorted relative path strings (dirs first, then files).

Parameter Type Default Description
path str | Path "." Path under root.

ListDir.listdir_recursive(path=".", max_depth=None) → list[str]
List all entries under path recursively.

Parameter Type Default Description
path str | Path "." Path under root.
max_depth int | None None Max depth; None = unlimited.
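A rough re-implementation of the recursive listing, shown only to illustrate the relative-path and max_depth semantics (the actual ListDir may differ in ordering details):

```python
from pathlib import Path

def listdir_recursive(root, path=".", max_depth=None):
    """Illustrative sketch: relative paths under root, descending up to max_depth."""
    root = Path(root).resolve()
    out = []

    def walk(directory, depth):
        if max_depth is not None and depth > max_depth:
            return
        for entry in sorted(directory.iterdir()):
            out.append(str(entry.relative_to(root)))
            if entry.is_dir():
                walk(entry, depth + 1)

    walk((root / path).resolve(), 1)
    return out
```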

ListDirTextFiles(root, allowed_extensions=(".txt", ".md"))
Subclass of ListDir that filters to .txt/.md only.

ListDirTextFiles.filter_txt_md(relative_paths) → list[Path]
From relative path strings, return full paths of files with allowed extensions.

ListDirTextFiles.listdir_txt_md(path=".") → list[Path]
One-level list of .txt/.md files under path.

ListDirTextFiles.listdir_recursive_txt_md(path=".", max_depth=None) → list[Path]
Recursive list of .txt/.md files under path.
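The extension filter itself is simple; a sketch equivalent to the behavior described (assuming case-insensitive suffix matching, which may differ from the library):

```python
from pathlib import Path

def filter_txt_md(relative_paths, root=".", allowed_extensions=(".txt", ".md")):
    """Keep only paths whose suffix is in allowed_extensions, resolved under root."""
    return [
        Path(root) / p
        for p in relative_paths
        if Path(p).suffix.lower() in allowed_extensions
    ]
```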

ReadText(paths, encoding="utf-8")

Parameter Type Default Description
paths list[str] | list[Path] (required) File paths to read.
encoding str "utf-8" Preferred encoding; fallbacks are latin-1, cp1252.

ReadText.read_all() → list[tuple[Path, str]]
Read all files; returns (path, text) pairs. Skips non-files and unreadable paths.

ReadText is iterable: for path, text in reader: yields (path, text).
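The fallback chain can be sketched like this (note that latin-1 decodes any byte sequence, so in practice the chain never reaches cp1252; the library's error handling may differ):

```python
from pathlib import Path

def read_with_fallbacks(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each encoding in order; illustrative version of the fallback behavior."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters rather than failing
    return data.decode("utf-8", errors="replace")
```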


Chunking

chunk_text(text, chunk_size=400, chunk_overlap=0, separators=None) → list[str]
Single-document recursive character split. Requires hecvec[chunk].

Note: Slicer.slice does not expose separators directly; it uses the defaults from the low-level chunker.

Parameter Type Default Description
text str (required) Document text.
chunk_size int 400 Max characters per chunk.
chunk_overlap int 0 Overlap.
separators list[str] | None None Split order; default ["\n\n\n", "\n\n", "\n", ". ", " ", ""].
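A condensed sketch of the recursive strategy: try the coarsest separator first, merge pieces up to chunk_size, and recurse with finer separators on oversized pieces. This is illustrative only; it ignores chunk_overlap, and its separator handling is an approximation of the real splitter's.

```python
def recursive_split(text, chunk_size=400, separators=None):
    """Illustrative recursive character splitter (no overlap handling)."""
    seps = separators if separators is not None else ["\n\n\n", "\n\n", "\n", ". ", " ", ""]
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = (seps[0], seps[1:]) if seps else ("", [])
    parts = text.split(sep) if sep else list(text)
    chunks, buf = [], ""
    for part in parts:
        piece = part + sep  # keep the separator attached (an approximation)
        if len(buf) + len(piece) <= chunk_size:
            buf += piece
        else:
            if buf:
                chunks.append(buf)
            buf = ""
            if len(piece) > chunk_size:
                # This piece alone is too big: recurse with finer separators
                chunks.extend(recursive_split(part, chunk_size, rest))
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```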

chunk_documents(path_and_texts, chunk_size=400, chunk_overlap=0, separators=None) → list[dict]
Multiple documents, recursive character split. Each dict: {"path", "chunk_index", "content"}. Requires hecvec[chunk].

Note: Slicer.slice does not expose separators directly; it uses the defaults from the low-level chunker.

token_chunk_text(text, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base") → list[str]
Single-document token split (tiktoken).

Parameter Type Default Description
text str (required) Document text.
chunk_size int 200 Max tokens per chunk.
chunk_overlap int 0 Overlap.
encoding_name str "cl100k_base" Tiktoken encoding.

token_chunk_documents(path_and_texts, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base") → tuple[list[str], list[str]]
Multiple documents, token split. Returns (ids, documents) with ids like chunk_0, chunk_1, ...
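Token splitting is a sliding window over the encoded token sequence. Ignoring the tiktoken encode/decode step, the window arithmetic and id scheme look like this (illustrative, not the library's code):

```python
def sliding_windows(tokens, chunk_size=200, chunk_overlap=0):
    """Split a token list into windows of chunk_size sharing chunk_overlap tokens."""
    step = max(chunk_size - chunk_overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

def make_ids(windows):
    # ids follow the chunk_0, chunk_1, ... pattern mentioned above
    return [f"chunk_{i}" for i in range(len(windows))]
```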

chunk_documents_by_method(path_and_texts, method="token", *, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", separators=None, openai_api_key=None, semantic_max_chunk_size=400, semantic_min_chunk_size=50, llm_model="gpt-4o-mini") → tuple[list[str], list[str]]
Chunk by method: "token" | "text" | "semantic" | "llm". Returns (ids, documents).

Note: this is a low-level helper with advanced knobs. Slicer.slice forwards only: chunk_size, chunk_overlap, encoding_name, chunking_method (as method), and (when needed) openai_api_key.
So you only need chunk_size + chunk_overlap at the Slicer level; separators, semantic_max_chunk_size, semantic_min_chunk_size, and llm_model stay at their defaults unless you call this helper directly.

Parameter Type Default Description
path_and_texts list[tuple[Path, str]] (required) (path, text) pairs.
method ChunkingMethod "token" "token" | "text" | "semantic" | "llm".
chunk_size int 200 Used by token, text, llm.
chunk_overlap int 0 Used by token, text.
encoding_name str "cl100k_base" Token method.
separators list[str] | None None Text method only.
openai_api_key str | None None Required for semantic/llm.
semantic_max_chunk_size int 400 Semantic method.
semantic_min_chunk_size int 50 Semantic method.
llm_model str "gpt-4o-mini" LLM method.

Embeddings and Chroma

embed_texts(texts, api_key, model="text-embedding-3-small", batch_size=100) → list[list[float]]
OpenAI embeddings for a list of strings.

Parameter Type Default Description
texts list[str] (required) Texts to embed.
api_key str (required) OpenAI API key.
model str "text-embedding-3-small" Embedding model.
batch_size int 100 Request batch size.
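batch_size controls how texts are grouped into OpenAI requests: each batch becomes one API call. The grouping itself is just list slicing (illustrative):

```python
def batched(texts, batch_size=100):
    """Yield consecutive slices of at most batch_size items (one API call each)."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]
```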

get_client(host="localhost", port=8000, auth=None) → Chroma HttpClient
Connects to the Chroma server at host:port. Raises if nothing is listening. If auth is provided, it is "username:password" for Chroma Basic Auth.

Parameter Type Default Description
host str "localhost" Chroma server host.
port int 8000 Chroma server port.
auth str | None None Chroma Basic Auth credentials as "username:password".

get_or_create_collection(client, name, metadata=None)
Get or create a Chroma collection (default cosine similarity). metadata default: {"hnsw:space": "cosine"}.

add_documents(client, collection_name, ids, embeddings, documents) → dict
Add documents to a collection. Returns {"collection_existed": bool}.

list_collections(host="localhost", port=8000) → list[tuple[str, int]]
List collection names and document counts on a Chroma server: [(name, count), ...].


Environment

load_dotenv_if_available(dotenv_path=None)
Load .env into os.environ if python-dotenv is installed. No-op otherwise.

load_openai_key(dotenv_path=None) → str | None
Load .env if available, then return os.environ.get("OPENAI_API_KEY").


Chunking methods

Method Description Requires
token Split by token count (tiktoken, cl100k_base). Fast and deterministic. —
text Recursive character splitter (paragraph → line → sentence, etc.). —
semantic Embed small segments, then group by similarity (DP) into larger chunks. OPENAI_API_KEY
llm Use an LLM to choose split points for thematic sections. OPENAI_API_KEY

Use chunking_method="token" or "text" to avoid API calls during chunking. Use "semantic" or "llm" for more coherent, topic-aware chunks (at the cost of extra OpenAI usage).


Chroma server

The pipeline connects only to a Chroma server. There is no ephemeral or persistent local client; a server must be running at host:port (default localhost:8000). If nothing is listening, slice() and collections() raise RuntimeError.

Start a server (e.g. Docker):

docker run -p 8000:8000 chromadb/chroma

Containers: If the app runs in a devcontainer and Chroma is in the same Docker Compose stack, use the service name as host (e.g. chroma). If Chroma is on the host and the app in the container, use host="host.docker.internal".


Environment and API key

  • OpenAI: The pipeline (and semantic / llm chunking) needs an API key. It is read in this order:

    1. Argument openai_api_key=...
    2. Environment variable OPENAI_API_KEY
    3. A .env file in the current working directory (loaded via python-dotenv when you use hecvec)
  • .env: Create a .env in your project root (or set dotenv_path to point to one):

    OPENAI_API_KEY=sk-...
    
  • Do not commit .env or expose the key in logs or source code.
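The precedence above can be summarized as a tiny helper (illustrative; hecvec's actual resolution also loads .env first via load_dotenv_if_available, which populates the environment before the lookup):

```python
import os

def resolve_api_key(openai_api_key=None):
    """Explicit argument wins; otherwise fall back to the environment."""
    return openai_api_key or os.environ.get("OPENAI_API_KEY")
```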


Collection naming

  • If you pass collection_name="hecvec" (default), the base name is taken from the input:

    • Single file: path.stem (e.g. mydoc)
    • Directory: path.name (e.g. docs)
  • The final collection name is always:

    {base_name}_{chunking_method}

    Examples: mydoc_token, docs_semantic, CNSF-S0043-0032-2025_CONDUSEF-005190-08_token.

  • If a collection with that name already exists, the pipeline does not add documents again. It logs that the collection already exists and returns something like:

    {"files": N, "chunks": 0, "collection": "...", "message": "Collection already exists; no documents added."}
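The naming rule can be expressed compactly. This sketch uses the presence of a file suffix to distinguish files from directories, which is an assumption; the library likely checks the filesystem instead:

```python
from pathlib import Path

def resolve_collection_name(path, collection_name="hecvec", chunking_method="token"):
    """Illustrative version of the naming rule described above."""
    p = Path(path)
    base = collection_name
    if collection_name == "hecvec":
        # Assumption: a suffix means "file" (use stem), otherwise "directory" (use name)
        base = p.stem if p.suffix else p.name
    return f"{base}_{chunking_method}"
```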


Building blocks

You can use the pipeline step-by-step.

List and read:

from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText

root = Path("/path/to/repo")
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)

text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))

Chunk only (e.g. recursive character, with hecvec[chunk]):

from pathlib import Path
from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents

root = Path("/path/to/repo")
paths = ListDirTextFiles(root=root).listdir_recursive_txt_md(".")
path_and_text = ReadText(paths).read_all()
chunks = chunk_documents(path_and_text)  # list of {"path", "chunk_index", "content"}

Token chunk + embed + list Chroma collections:

from hecvec import token_chunk_text, embed_texts, list_collections

chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...")
names_and_counts = list_collections(host="localhost", port=8000)

CLI (list directory under a root):

hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]

Module layout

Module Responsibility
hecvec.env Load .env and OPENAI_API_KEY
hecvec.listdir List dirs under a safe root; filter .txt/.md
hecvec.reading Read files as text (UTF-8 / latin-1 / cp1252 fallback)
hecvec.token_splitter Token-based chunking (tiktoken)
hecvec.chunking Recursive character chunking (chunk_documents, chunk_text)
hecvec.chunkers Multi-method chunking: token, text, semantic, llm
hecvec.embeddings OpenAI embeddings (embed_texts)
hecvec.chroma_client Chroma client, get/create collection, add documents
hecvec.chroma_list List Chroma collections and counts
hecvec.pipeline Orchestrator: Slicer and slice(path=...)

Development

From the repo root:

uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"

License

MIT

Project details


Download files


Source Distribution

hecvec-6.1.0.tar.gz (249.2 kB view details)

Uploaded Source

Built Distribution


hecvec-6.1.0-py3-none-any.whl (26.6 kB view details)

Uploaded Python 3

File details

Details for the file hecvec-6.1.0.tar.gz.

File metadata

  • Download URL: hecvec-6.1.0.tar.gz
  • Upload date:
  • Size: 249.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for hecvec-6.1.0.tar.gz
Algorithm Hash digest
SHA256 8d060d7ba0828a2cfdfcf0561fa70cbe9efd57a04b7f30b023db3075f43ec356
MD5 a1ad0825280e1e3858cf9507056c60e7
BLAKE2b-256 d98696f30804c895d60772e37b610876aaf92b33b3ffb03ae67ce13ada7586a9


File details

Details for the file hecvec-6.1.0-py3-none-any.whl.

File metadata

  • Download URL: hecvec-6.1.0-py3-none-any.whl
  • Upload date:
  • Size: 26.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for hecvec-6.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ba9fefdc13474877954b4596b4395c6a75fbeeac82cc80f6b4351601811c79d0
MD5 d62309746d912c86df6c519f70038f31
BLAKE2b-256 46c69ec27bf2f701f41a814b4581ed867f0a944b88e23fba4b227c5e8fb6874d

