
omnichunk


Chunk code, prose, and markup files with structure awareness.

omnichunk is a Python library that splits files into smaller pieces while keeping useful context:

  • Code: respects function/class boundaries, includes scope and import information
  • Markdown: respects headings and sections
  • JSON/YAML/TOML: splits by top-level keys/sections
  • HTML/XML: splits by elements
  • Mixed files: handles notebooks and Python files with long docstrings

Each chunk includes:

  • The original text slice
  • Byte and line ranges for lossless reconstruction
  • Context (scope, entities, headings, imports, siblings)
  • Optional contextualized_text for embeddings

The library is deterministic and works without external APIs.
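The lossless-reconstruction guarantee can be illustrated with a minimal sketch. This is not omnichunk's implementation: the names `ByteRange`, `Chunk`, and `naive_chunks` are simplified stand-ins, and the splitter here is byte-naive rather than structure-aware.

```python
from dataclasses import dataclass

@dataclass
class ByteRange:
    start: int
    end: int

@dataclass
class Chunk:
    text: str
    byte_range: ByteRange

def naive_chunks(source: str, size: int) -> list[Chunk]:
    # Split on fixed byte boundaries; a real chunker would respect structure
    # (and would never cut a multi-byte UTF-8 character in half).
    data = source.encode("utf-8")
    out = []
    for start in range(0, len(data), size):
        end = min(start + size, len(data))
        out.append(Chunk(data[start:end].decode("utf-8"), ByteRange(start, end)))
    return out

source = "def hello():\n    return 'hi'\n"
chunks = naive_chunks(source, 16)
# Contiguous, non-overlapping chunks reconstruct the source exactly.
assert "".join(c.text for c in chunks) == source
```

The same invariant (concatenating chunk texts in order yields the original file) is what the byte/line ranges above make checkable.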

Installation

pip install omnichunk

Optional extras:

pip install omnichunk[tiktoken]        # tiktoken tokenizer support
pip install omnichunk[transformers]    # HuggingFace tokenizer support
pip install omnichunk[all-languages]   # Extended language grammars
pip install omnichunk[langchain]       # LangChain Document export support
pip install omnichunk[llamaindex]      # LlamaIndex Document export support
pip install omnichunk[dev]             # Development tools

CLI

omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk README.md --format csv --output chunks.csv

Quick start

One-shot API

from omnichunk import chunk

code = """
import os

def hello(name: str) -> str:
    return f"hello {name}"
"""

chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")

for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)

Reusable Chunker

from omnichunk import Chunker

chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)

chunks = chunker.chunk("api.py", source_code)

for c in chunker.stream("large.py", large_source):
    consume(c)

batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)

directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)

all_chunks = [chunk for result in directory_results for chunk in result.chunks]

jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)

stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)

langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)
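Downstream consumption of the JSONL export is one line of JSON per chunk. The sketch below fakes a payload rather than calling `to_jsonl`; the field names `index` and `text` are assumptions about the serialized schema, not a documented contract.

```python
import json

# Stand-in for chunker.to_jsonl(...) output; the actual schema may differ.
jsonl_payload = "\n".join([
    json.dumps({"index": 0, "text": "import os\n"}),
    json.dumps({"index": 1, "text": "def hello(): ...\n"}),
])

# Parse one record per non-empty line and restore source order by index.
records = [json.loads(line) for line in jsonl_payload.splitlines() if line.strip()]
records.sort(key=lambda r: r["index"])
texts = [r["text"] for r in records]
```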

File API

from omnichunk import chunk_file

chunks = chunk_file("path/to/file.py")

Directory API

from omnichunk import chunk_directory

results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")

for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))

Chunk model

Every Chunk includes raw content, exact offsets, and rich context:

  • text: exact source slice (lossless reconstruction)
  • contextualized_text: embedding-ready representation
  • byte_range, line_range
  • context: scope, entities, siblings, imports, headings, section metadata
  • token_count, char_count, nws_count
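Presumably `nws_count` is a non-whitespace character count used as a whitespace-insensitive size metric; a sketch under that assumption:

```python
def nws_count(text: str) -> int:
    """Count non-whitespace characters (assumed meaning of nws_count)."""
    return sum(1 for ch in text if not ch.isspace())

# Indentation and newlines do not inflate the size.
assert nws_count("def f():\n    pass\n") == 11
```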

Supported content

Code

  • Python
  • JavaScript / TypeScript
  • Rust
  • Go
  • Java
  • C / C++ / C#
  • Ruby / PHP / Kotlin / Swift (grammar-dependent)

Prose

  • Markdown
  • Plaintext

Markdown fenced blocks are delegated by language:

  • fenced code (python, ts, etc.) routes to CodeEngine
  • fenced markup (json, yaml, toml, html, xml) routes to MarkupEngine
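The routing above can be sketched roughly as follows. The engine names follow the README; the dispatch logic, the regex, and the `ProseEngine` default for untagged fences are illustrative assumptions, not omnichunk's code.

```python
import re

MARKUP_LANGS = {"json", "yaml", "toml", "html", "xml"}

def route_fence(lang: str) -> str:
    # Decide which engine a fenced block's info string maps to.
    lang = lang.lower()
    if lang in MARKUP_LANGS:
        return "MarkupEngine"
    if lang:
        return "CodeEngine"
    return "ProseEngine"  # untagged fences stay with the prose engine

FENCE_RE = re.compile(r"^```(\w*)\n(.*?)^```", re.M | re.S)

md = "intro\n\n```python\nprint('hi')\n```\n\n```json\n{}\n```\n"
routes = [(m.group(1), route_fence(m.group(1))) for m in FENCE_RE.finditer(md)]
```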

Markup

  • JSON
  • YAML
  • TOML
  • HTML / XML

Hybrid

  • Python with heavy docstrings
  • Notebook-style # %% cell files

Architecture

src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│   ├── router.py
│   ├── code_engine.py
│   ├── prose_engine.py
│   ├── markup_engine.py
│   └── hybrid_engine.py
├── parser/
│   ├── tree_sitter.py
│   ├── markdown_parser.py
│   ├── html_parser.py
│   └── languages.py
├── context/
│   ├── entities.py
│   ├── scope.py
│   ├── siblings.py
│   ├── imports.py
│   └── format.py
├── sizing/
│   ├── nws.py
│   ├── tokenizers.py
│   └── counter.py
└── windowing/
    ├── greedy.py
    ├── merge.py
    ├── split.py
    └── overlap.py
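The windowing stage (greedy packing of structural pieces, merging undersized ones) might look conceptually like this simplified sketch; `greedy_pack` is illustrative, not the actual `windowing/greedy.py` implementation.

```python
def greedy_pack(pieces: list[str], max_size: int) -> list[str]:
    """Greedily pack adjacent pieces into windows of at most max_size
    characters (the real library also supports token-based sizing)."""
    windows: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_size:
            windows.append(current)  # close the window, start a new one
            current = piece
        else:
            current += piece
    if current:
        windows.append(current)
    return windows

pieces = ["import os\n", "def a(): ...\n", "def b(): ...\n"]
windows = greedy_pack(pieces, max_size=24)
assert "".join(windows) == "".join(pieces)  # order and content preserved
```

Packing only adjacent pieces keeps the output contiguous and in source order, which is what makes the byte-range guarantees below possible.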

Determinism & integrity guarantees

omnichunk is built to preserve source fidelity:

  • Chunk boundaries are deterministic
  • Empty/whitespace-only chunks are dropped
  • Chunks are contiguous and non-overlapping in source order
  • Byte range integrity is validated in tests:

original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text

Testing

Run the test suite:

pytest -q

Run benchmark scenarios:

python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py

Run repository checks:

python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality

Current suite covers:

  • API usage (chunk, chunk_file, Chunker)
  • Code/prose/markup/hybrid behavior
  • Context metadata (imports, siblings, scope, headings)
  • Sizing/tokenization/NWS logic
  • Overlap behavior
  • Edge cases (empty input, unicode, malformed syntax, range contiguity)

Contributing

Contribution and project process files:

  • CONTRIBUTING.md
  • CODE_OF_CONDUCT.md
  • SECURITY.md
  • GOVERNANCE.md
  • MAINTAINERS.md
  • ROADMAP.md
  • ARCHITECTURE.md

Install dev tooling and run pre-commit hooks:

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

Notes

  • Tree-sitter grammars are resolved dynamically and cached per language.
  • If a parser is unavailable, the system degrades gracefully with fallback heuristics.
  • contextualized_text is optimized for embedding quality while preserving raw text separately.
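A fallback heuristic of the kind described might split on blank-line boundaries when no grammar is available. This is illustrative only; the real fallback logic lives inside the engines.

```python
import re

def fallback_split(source: str) -> list[str]:
    """Split on runs of blank lines when structural parsing is unavailable."""
    blocks = re.split(r"\n{2,}", source.strip())
    return [b for b in blocks if b.strip()]

source = "def a():\n    pass\n\n\ndef b():\n    pass\n"
blocks = fallback_split(source)
```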
