Structure-aware deterministic chunking for code, prose, and markup.
omnichunk is a Python library that splits files into smaller pieces while keeping useful context:
- Code: respects function/class boundaries, includes scope and import information
- Markdown: respects headings and sections
- JSON/YAML/TOML: splits by top-level keys/sections
- HTML/XML: splits by elements
- Mixed files: handles notebooks and Python files with long docstrings
Each chunk includes:
- The original text slice
- Byte and line ranges for lossless reconstruction
- Context (scope, entities, headings, imports, siblings)
- Optional `contextualized_text` for embeddings
The library is deterministic and works without external APIs.
Installation
pip install omnichunk
Optional extras:
pip install omnichunk[tiktoken] # tiktoken tokenizer support
pip install omnichunk[transformers] # HuggingFace tokenizer support
pip install omnichunk[all-languages] # Extended language grammars
pip install omnichunk[langchain] # LangChain Document export support
pip install omnichunk[llamaindex] # LlamaIndex Document export support
pip install omnichunk[profiling] # py-spy / line-profiler helpers
pip install omnichunk[rust] # maturin tooling for Rust backend PoC
pip install omnichunk[dev] # Development tools
pip install omnichunk[pinecone] # Vector DB adapter extra (no client lib)
pip install omnichunk[weaviate] # Vector DB adapter extra (no client lib)
pip install omnichunk[supabase] # Vector DB adapter extra (no client lib)
pip install omnichunk[vectordb] # Meta-group for all vector export extras (empty deps)
pip install omnichunk[semantic] # Marker extra (semantic stack uses core numpy only)
pip install omnichunk[graph] # Marker extra (GraphRAG uses existing chunk entities)
pip install omnichunk[pdf] # PDF text extraction (pypdf)
pip install omnichunk[docx] # Word documents (python-docx)
pip install omnichunk[formats] # pdf + docx
v0.9 adds multiformat chunking (.ipynb, .tex, optional .pdf / .docx), near-duplicate removal (dedup_chunks), and offline evaluation (evaluate_chunks). Jupyter and LaTeX need no extra packages; PDF and DOCX use the extras above. Chunker.chunk_file() picks loaders by extension. Evaluate saved JSONL with:
omnichunk eval ./chunks.jsonl --metrics all --source ./original.txt
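As a rough sketch of what a reconstruction-style metric can verify over saved JSONL, the snippet below round-trips hand-built records and checks that the chunks rebuild the source. The field names (`index`, `text`, `byte_range`) are assumptions based on the Chunk model described later, not the exact on-disk schema.

```python
import json

# Hypothetical JSONL records mirroring the documented Chunk fields;
# the real eval command's schema may differ.
source = "import os\n\ndef hello():\n    return 1\n"
records = [
    {"index": 0, "text": source[:11], "byte_range": {"start": 0, "end": 11}},
    {"index": 1, "text": source[11:], "byte_range": {"start": 11, "end": len(source)}},
]

jsonl = "\n".join(json.dumps(r) for r in records)

# Reconstruction check: concatenating chunk texts in order yields the source.
parsed = [json.loads(line) for line in jsonl.splitlines()]
rebuilt = "".join(r["text"] for r in parsed)
assert rebuilt == source
```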
Examples
Runnable scripts and Jupyter notebooks live under examples/. They cover chunking, hierarchical trees, incremental diffs, token budgets, semantic boundaries, GraphRAG, vector export shapes, and the plugin API. See examples/README.md for how to run them.
CLI
omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk app.py --max-size 256 --size-unit chars --nws-backend python
omnichunk README.md --format csv --output chunks.csv
Quick start
One-shot API
from omnichunk import chunk
code = """
import os
def hello(name: str) -> str:
return f"hello {name}"
"""
chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")
for c in chunks:
print(c.index, c.byte_range, c.context.breadcrumb)
print(c.contextualized_text)
Reusable Chunker
from omnichunk import Chunker
chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)

chunks = chunker.chunk("api.py", source_code)

for c in chunker.stream("large.py", large_source):
    consume(c)
Async API
import asyncio
from omnichunk import Chunker
chunker = Chunker(max_chunk_size=1024, size_unit="tokens")
# Single file async
chunks = asyncio.run(chunker.achunk("api.py", source_code))
# Async streaming
async def process():
    async for chunk in chunker.astream("large.py", large_source):
        consume(chunk)
# Async batch (concurrent)
results = asyncio.run(chunker.abatch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
    ],
    concurrency=8,
))

batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)

directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)
all_chunks = [chunk for result in directory_results for chunk in result.chunks]
jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)
stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)
langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)
# Vector DB–ready rows (you compute embeddings elsewhere)
from omnichunk import chunks_to_pinecone_vectors, chunks_to_supabase_rows
emb = [[0.1, 0.2, 0.3] for _ in all_chunks] # same length as chunks
pinecone_batch = chunks_to_pinecone_vectors(all_chunks, emb, namespace="my_ns")
weaviate_batch = chunker.to_weaviate_objects(all_chunks, emb, class_name="Doc")
supabase_rows = chunks_to_supabase_rows(all_chunks, emb)
Semantic chunking
Embedding boundaries are user-supplied (semantic_embed_fn). Omnichunk never calls an external API.
import numpy as np
from omnichunk import Chunker
def embed(texts):
    # Replace with your actual embedding model
    return np.random.default_rng(0).standard_normal((len(texts), 384))
chunker = Chunker(max_chunk_size=512, size_unit="tokens")
essay = "Your prose here…"
chunks = chunker.semantic_chunk("essay.md", essay, embed_fn=embed)
For code and other non-prose content types, structural engines are used even if semantic=True.
Topic shift detection
from omnichunk.semantic import detect_topic_shifts, split_sentences
text = "Your document…"
sentences_with_offsets = split_sentences(text)
sentences = [s for s, _, _ in sentences_with_offsets]
shifts = detect_topic_shifts(sentences, window=5, threshold=0.4)
GraphRAG: entity-chunk graph
from omnichunk import Chunker, build_chunk_graph
source = "class MyClass:\n pass\n"
chunks = Chunker().chunk("repo.py", source)
graph = build_chunk_graph(chunks)
print(graph.entity_chunks("MyClass")) # chunk indices containing MyClass
print(graph.chunk_neighbors(0)) # chunks sharing entities with chunk 0
data = graph.to_dict() # JSON-serializable
Hierarchical chunking (multi-level RAG)
from omnichunk import Chunker
chunker = Chunker(size_unit="chars")
source = "..." # your file contents
tree = chunker.hierarchical_chunk(
    "service.py", source,
    levels=[64, 256, 1024],  # leaf → root
)
small_chunks = tree.leaves() # embed and index these
large_chunks = tree.roots() # pass these to LLM as context
parent = tree.parent(small_chunks[0]) # navigate up
Incremental / differential chunking
from omnichunk import Chunker
chunker = Chunker(max_chunk_size=512, size_unit="chars")
new_source = "..." # updated file contents
diff = chunker.chunk_diff(
    "api.py",
    new_source,
    previous_chunks=old_chunks,
)
# diff.added → upsert to vector DB
# diff.removed_ids → delete from vector DB
# diff.unchanged → skip re-embedding
Token budget optimizer
from omnichunk.budget import TokenBudgetOptimizer
optimizer = TokenBudgetOptimizer(budget=4096, strategy="greedy")
result = optimizer.select(retrieved_chunks, scores=relevance_scores)
# result.selected → pass to LLM
Vector database export (serialization)
Adapters produce plain dicts/lists only—no Pinecone, Weaviate, or Supabase client is installed by these extras. You compute embeddings yourself and pass parallel lists:
- `chunks_to_pinecone_vectors` / `Chunker.to_pinecone_vectors`: `id`, `values`, `metadata` (+ optional `namespace` per row)
- `chunks_to_weaviate_objects` / `Chunker.to_weaviate_objects`: `class`, `vector`, `properties`
- `chunks_to_supabase_rows` / `Chunker.to_supabase_rows`: `content`, `embedding`, plus flat metadata columns
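A minimal illustration of the parallel-lists contract and the Pinecone-style row shape, using hand-built dicts in place of Chunk objects (the real adapters derive these rows from chunks; only the output field names come from the list above):

```python
# Hand-built stand-ins for chunks and their embeddings (computed elsewhere).
chunks = [
    {"id": "api.py:0", "text": "def f(): ...", "filepath": "api.py"},
    {"id": "api.py:1", "text": "def g(): ...", "filepath": "api.py"},
]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # same length as chunks

# Pinecone-style rows: id, values, metadata (+ optional namespace per row).
rows = [
    {
        "id": c["id"],
        "values": vec,
        "metadata": {"filepath": c["filepath"], "text": c["text"]},
        "namespace": "my_ns",
    }
    for c, vec in zip(chunks, embeddings)
]
assert len(rows) == len(chunks)
```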
Plugin API
Register custom parsers or formatters at import time (no edits to omnichunk core):
from omnichunk import register_parser, register_formatter, Chunker
def my_parse(filepath: str, content: str):
    # Return a tree-sitter-like tree, or None to use the built-in parser.
    return None

register_parser("python", my_parse, overwrite=True)

def my_fmt(chunks):
    return str(len(chunks))

register_formatter("count", my_fmt)
File API
from omnichunk import chunk_file
chunks = chunk_file("path/to/file.py")
Directory API
from omnichunk import chunk_directory
results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")
for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))
Chunk model
Every Chunk includes raw content, exact offsets, and rich context:
- `text`: exact source slice (lossless reconstruction)
- `contextualized_text`: embedding-ready representation
- `byte_range`, `line_range`
- `context`: scope, entities, siblings, imports, headings, section metadata
- `token_count`, `char_count`, `nws_count`
Supported content
Code
- Python
- JavaScript / TypeScript
- Rust
- Go
- Java
- C / C++ / C#
- Ruby / PHP / Kotlin / Swift (grammar-dependent)
Prose
- Markdown
- Plaintext
Markdown fenced blocks are delegated by language:
- fenced code (`python`, `ts`, etc.) routes to CodeEngine
- fenced markup (`json`, `yaml`, `toml`, `html`, `xml`) routes to MarkupEngine
Markup
- JSON
- YAML
- TOML
- HTML / XML
Hybrid
- Python with heavy docstrings
- Notebook-style `# %%` cell files
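The `# %%` marker is the cell convention itself; a minimal sketch of how such a file divides into cells (this illustrates the format, not omnichunk's hybrid engine):

```python
import re

notebook_style = (
    "# %%\n"
    "import os\n"
    "# %% [markdown]\n"
    "# Some notes\n"
    "# %%\n"
    "print(os.name)\n"
)

# Split at lines beginning with "# %%", keeping each cell's body.
cells = re.split(r"(?m)^# %%.*\n", notebook_style)
cells = [c for c in cells if c]  # drop the empty leading piece
assert cells == ["import os\n", "# Some notes\n", "print(os.name)\n"]
```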
Architecture
src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│ ├── router.py
│ ├── code_engine.py
│ ├── prose_engine.py
│ ├── markup_engine.py
│ └── hybrid_engine.py
├── parser/
│ ├── tree_sitter.py
│ ├── markdown_parser.py
│ ├── html_parser.py
│ └── languages.py
├── context/
│ ├── entities.py
│ ├── scope.py
│ ├── siblings.py
│ ├── imports.py
│ └── format.py
├── sizing/
│ ├── nws.py
│ ├── tokenizers.py
│ └── counter.py
└── windowing/
├── greedy.py
├── merge.py
├── split.py
└── overlap.py
Determinism & integrity guarantees
omnichunk is built to preserve source fidelity:
- Chunk boundaries are deterministic
- Empty/whitespace-only chunks are dropped
- Chunks are contiguous and non-overlapping in source order
- Byte range integrity is validated in tests:
original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text
Testing
Run the test suite:
pytest -q
Run benchmark scenarios:
python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py
python benchmarks/run_large_corpus.py --mode mega-python --repeat 120
python benchmarks/run_hotspot_profile.py --mode mega-python --repeat 120 --limit 30
Run repository checks:
python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality
Current suite covers:
- API usage (`chunk`, `chunk_file`, `Chunker`)
- Code/prose/markup/hybrid behavior
- Context metadata (imports, siblings, scope, headings)
- Sizing/tokenization/NWS logic
- Overlap behavior
- Edge cases (empty input, unicode, malformed syntax, range contiguity)
Contributing
Contribution and project process files:
- CONTRIBUTING.md
- CODE_OF_CONDUCT.md
- SECURITY.md
- GOVERNANCE.md
- MAINTAINERS.md
- ROADMAP.md
- ARCHITECTURE.md
Install dev tooling and run pre-commit hooks:
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
Notes
- Tree-sitter grammars are resolved dynamically and cached per language.
- If a parser is unavailable, the system degrades gracefully with fallback heuristics.
- `contextualized_text` is optimized for embedding quality while preserving raw `text` separately.