Structure-aware deterministic chunking for code, prose, and markup.
Chunk code, prose, and markup files with structure awareness.
omnichunk is a Python library that splits files into smaller pieces while keeping useful context:
- Code: respects function/class boundaries, includes scope and import information
- Markdown: respects headings and sections
- JSON/YAML/TOML: splits by top-level keys/sections
- HTML/XML: splits by elements
- Mixed files: handles notebooks and Python files with long docstrings
Each chunk includes:
- The original text slice
- Byte and line ranges for lossless reconstruction
- Context (scope, entities, headings, imports, siblings)
- Optional contextualized_text for embeddings
The library is deterministic and works without external APIs.
Installation
pip install omnichunk
Optional extras:
pip install omnichunk[tiktoken] # tiktoken tokenizer support
pip install omnichunk[transformers] # HuggingFace tokenizer support
pip install omnichunk[all-languages] # Extended language grammars
pip install omnichunk[dev] # Development tools
Quick start
One-shot API
from omnichunk import chunk
code = """
import os
def hello(name: str) -> str:
    return f"hello {name}"
"""
chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")
for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)
Reusable Chunker
from omnichunk import Chunker
chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)
chunks = chunker.chunk("api.py", source_code)
for c in chunker.stream("large.py", large_source):
    consume(c)
batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)
File API
from omnichunk import chunk_file
chunks = chunk_file("path/to/file.py")
Chunk model
Every Chunk includes raw content, exact offsets, and rich context:
- text: exact source slice (lossless reconstruction)
- contextualized_text: embedding-ready representation
- byte_range, line_range
- context: scope, entities, siblings, imports, headings, section metadata
- token_count, char_count, nws_count
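The size fields can be pictured with a small standard-library sketch. It assumes that nws_count means the number of non-whitespace characters, which this description does not define explicitly. The helper name size_metrics and the byte_count key are hypothetical and not part of omnichunk; token_count is left out because it depends on the configured tokenizer.

```python
def size_metrics(text: str) -> dict:
    """Hypothetical helper: character, non-whitespace, and UTF-8 byte counts."""
    return {
        "char_count": len(text),
        # Assumption: NWS stands for non-whitespace characters.
        "nws_count": sum(1 for ch in text if not ch.isspace()),
        # Byte length differs from char_count once non-ASCII text appears.
        "byte_count": len(text.encode("utf-8")),
    }


metrics = size_metrics("def f():\n    return 'é'\n")
print(metrics)
```

A whitespace-insensitive count of this kind is less sensitive to indentation style than a raw character count, which may be why both are reported.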
Supported content
Code
- Python
- JavaScript / TypeScript
- Rust
- Go
- Java
- C / C++ / C#
- Ruby / PHP / Kotlin / Swift (grammar-dependent)
Prose
- Markdown
- Plaintext
Markdown fenced blocks are delegated by language:
- fenced code (python, ts, etc.) routes to CodeEngine
- fenced markup (json, yaml, toml, html, xml) routes to MarkupEngine
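The delegation rule above can be sketched as a lookup on the fence's language tag. This is an illustration only, not omnichunk's router: the language sets, the route_fences function, and the "ProseEngine" fallback for unknown tags are all assumptions. Only the CodeEngine and MarkupEngine names come from this description.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # built programmatically so this example nests cleanly

# Assumed language sets; the real mapping may differ.
CODE_LANGS = {"python", "py", "ts", "typescript", "js", "javascript"}
MARKUP_LANGS = {"json", "yaml", "toml", "html", "xml"}

# Opening fence with a language tag, lazy body, closing fence on its own line.
FENCE_RE = re.compile(
    rf"^{FENCE}(\w+)[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def route_fences(markdown: str) -> List[Tuple[str, str]]:
    """Return (language, engine name) for each tagged fenced block."""
    routed = []
    for match in FENCE_RE.finditer(markdown):
        lang = match.group(1).lower()
        if lang in CODE_LANGS:
            engine = "CodeEngine"
        elif lang in MARKUP_LANGS:
            engine = "MarkupEngine"
        else:
            engine = "ProseEngine"  # assumed fallback for unknown tags
        routed.append((lang, engine))
    return routed


doc = "\n".join([
    "# Title",
    FENCE + "python", "print('hi')", FENCE,
    "",
    FENCE + "yaml", "a: 1", FENCE,
    "",
])
routes = route_fences(doc)
print(routes)
```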
Markup
- JSON
- YAML
- TOML
- HTML / XML
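As a rough picture of splitting by top-level keys, the sketch below groups the keys of a JSON object under a character budget. It is not omnichunk's implementation: split_top_level and max_chars are hypothetical names, and because the sketch re-serializes values with json.dumps it does not preserve the original byte ranges the way omnichunk's chunks are described to.

```python
import json
from typing import Dict, List


def split_top_level(json_text: str, max_chars: int) -> List[Dict[str, object]]:
    """Group top-level keys so each group's serialized size stays near max_chars."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    groups: List[Dict[str, object]] = []
    current: Dict[str, object] = {}
    current_size = 0
    for key, value in data.items():
        size = len(json.dumps({key: value}))
        # Start a new group when adding this key would exceed the budget.
        if current and current_size + size > max_chars:
            groups.append(current)
            current, current_size = {}, 0
        current[key] = value  # an oversized key still gets its own group
        current_size += size
    if current:
        groups.append(current)
    return groups


config = '{"name": "demo", "deps": ["a", "b"], "scripts": {"test": "pytest -q"}}'
groups = split_top_level(config, max_chars=40)
print([list(g) for g in groups])
```

Keys are never split internally here, so a single large value can exceed the budget; a fuller splitter would recurse into it.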
Hybrid
- Python with heavy docstrings
- Notebook-style # %% cell files
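For notebook-style files, one plausible first step is to cut the source at each # %% marker so that every cell can be handled separately. The sketch below shows that idea with character offsets. The function name split_cells is hypothetical, and whether omnichunk treats cell markers exactly this way is an assumption.

```python
import re
from typing import List, Tuple

# A cell starts at any line beginning with "# %%" (optionally followed by a title).
CELL_MARKER = re.compile(r"^# %%.*$", re.MULTILINE)


def split_cells(source: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of cells that tile the source exactly."""
    starts = [m.start() for m in CELL_MARKER.finditer(source)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # text before the first marker forms its own span
    ends = starts[1:] + [len(source)]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


source = "import os\n# %% load\nx = 1\n# %% [markdown]\n# Notes\n"
cells = split_cells(source)
for start, end in cells:
    print(repr(source[start:end]))
```

Because the spans are contiguous, joining the slices in order gives back the original source.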
Architecture
src/omnichunk/
├── chunker.py
├── types.py
├── engine/
│   ├── router.py
│   ├── code_engine.py
│   ├── prose_engine.py
│   ├── markup_engine.py
│   └── hybrid_engine.py
├── parser/
│   ├── tree_sitter.py
│   ├── markdown_parser.py
│   ├── html_parser.py
│   └── languages.py
├── context/
│   ├── entities.py
│   ├── scope.py
│   ├── siblings.py
│   ├── imports.py
│   └── format.py
├── sizing/
│   ├── nws.py
│   ├── tokenizers.py
│   └── counter.py
└── windowing/
    ├── greedy.py
    ├── merge.py
    ├── split.py
    └── overlap.py
Determinism & integrity guarantees
omnichunk is built to preserve source fidelity:
- Chunk boundaries are deterministic
- Empty/whitespace-only chunks are dropped
- Chunks are contiguous and non-overlapping in source order
- Byte range integrity is validated in tests:
original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text
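The ordering guarantees can be checked in a similar way. The sketch below works on plain (start, end) byte tuples rather than omnichunk objects, so it runs without the library. The name verify_ranges is hypothetical, and the rule that gaps between ranges may only contain whitespace is an assumption drawn from the statement that empty or whitespace-only chunks are dropped.

```python
from typing import List, Tuple


def verify_ranges(source: str, ranges: List[Tuple[int, int]]) -> bool:
    """Check that byte ranges are ordered, non-overlapping, and skip only whitespace."""
    data = source.encode("utf-8")
    cursor = 0
    for start, end in ranges:
        if start < cursor:
            raise ValueError(f"range ({start}, {end}) overlaps or is out of order")
        if data[cursor:start].strip():
            raise ValueError(f"non-whitespace bytes skipped before offset {start}")
        # Raises UnicodeDecodeError if a range cuts through a multi-byte character.
        data[start:end].decode("utf-8")
        cursor = end
    if data[cursor:].strip():
        raise ValueError("non-whitespace bytes left after the last range")
    return True


source = "α = 1\n\nβ = 2\n"  # α and β each take two bytes in UTF-8
ok = verify_ranges(source, [(0, 7), (8, 15)])
print(ok)
```

Working in bytes rather than characters matters here: a range such as (0, 1) would split the first character and fail the decode step.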
Testing
Run the test suite:
pytest -q
Run benchmark scenarios:
python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py
Run repository checks:
python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality
Current suite covers:
- API usage (chunk, chunk_file, Chunker)
- Code/prose/markup/hybrid behavior
- Context metadata (imports, siblings, scope, headings)
- Sizing/tokenization/NWS logic
- Overlap behavior
- Edge cases (empty input, unicode, malformed syntax, range contiguity)
Contributing
Contribution and project process files:
- CONTRIBUTING.md
- CODE_OF_CONDUCT.md
- SECURITY.md
- GOVERNANCE.md
- MAINTAINERS.md
- ROADMAP.md
- ARCHITECTURE.md
Install dev tooling and run pre-commit hooks:
pip install -e .[dev]
pre-commit install
pre-commit run --all-files
Notes
- Tree-sitter grammars are resolved dynamically and cached per language.
- If a parser is unavailable, the system degrades gracefully with fallback heuristics.
- contextualized_text is optimized for embedding quality, while the raw text is preserved separately.