
omnichunk


Chunk code, prose, and markup files with structure awareness.

omnichunk is a Python library that splits files into smaller pieces while keeping useful context:

  • Code: respects function/class boundaries, includes scope and import information
  • Markdown: respects headings and sections
  • JSON/YAML/TOML: splits by top-level keys/sections
  • HTML/XML: splits by elements
  • Mixed files: handles notebooks and Python files with long docstrings

Each chunk includes:

  • The original text slice
  • Byte and line ranges for lossless reconstruction
  • Context (scope, entities, headings, imports, siblings)
  • Optional contextualized_text for embeddings

The library is deterministic and works without external APIs.
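The lossless-reconstruction guarantee can be illustrated with a minimal sketch. This is not omnichunk's implementation: the names `ByteRange`, `Chunk`, and `naive_chunks` are simplified stand-ins, and the splitter here is byte-naive rather than structure-aware.

```python
from dataclasses import dataclass

@dataclass
class ByteRange:
    start: int
    end: int

@dataclass
class Chunk:
    text: str
    byte_range: ByteRange

def naive_chunks(source: str, size: int) -> list[Chunk]:
    # Split on fixed byte boundaries; a real chunker would respect structure
    # (and would never cut a multi-byte UTF-8 character in half).
    data = source.encode("utf-8")
    out = []
    for start in range(0, len(data), size):
        end = min(start + size, len(data))
        out.append(Chunk(data[start:end].decode("utf-8"), ByteRange(start, end)))
    return out

source = "def hello():\n    return 'hi'\n"
chunks = naive_chunks(source, 16)
# Contiguous, non-overlapping chunks reconstruct the source exactly.
assert "".join(c.text for c in chunks) == source
```

The same invariant (concatenating chunk texts in order yields the original file) is what the byte/line ranges above make checkable.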

Installation

pip install omnichunk

Optional extras:

pip install omnichunk[tiktoken]        # tiktoken tokenizer support
pip install omnichunk[transformers]    # HuggingFace tokenizer support
pip install omnichunk[all-languages]   # Extended language grammars
pip install omnichunk[langchain]       # LangChain Document export support
pip install omnichunk[llamaindex]      # LlamaIndex Document export support
pip install omnichunk[dev]             # Development tools

CLI

omnichunk ./src --glob "**/*.py" --max-size 512 --size-unit chars --format jsonl > chunks.jsonl
omnichunk app.py --max-size 256 --size-unit chars --stats
omnichunk README.md --format csv --output chunks.csv

Quick start

One-shot API

from omnichunk import chunk

code = """
import os

def hello(name: str) -> str:
    return f"hello {name}"
"""

chunks = chunk("example.py", code, max_chunk_size=128, size_unit="chars")

for c in chunks:
    print(c.index, c.byte_range, c.context.breadcrumb)
    print(c.contextualized_text)

Reusable Chunker

from omnichunk import Chunker

chunker = Chunker(
    max_chunk_size=1024,
    min_chunk_size=80,
    tokenizer="cl100k_base",
    context_mode="full",
    overlap=0.1,
    overlap_lines=1,
)

chunks = chunker.chunk("api.py", source_code)

for c in chunker.stream("large.py", large_source):
    consume(c)

batch_results = chunker.batch(
    [
        {"filepath": "a.py", "code": code_a},
        {"filepath": "b.ts", "code": code_b},
        {"filepath": "README.md", "code": readme_md},
    ],
    concurrency=8,
)

directory_results = chunker.chunk_directory(
    "./src",
    glob="**/*.py",
    exclude=["**/tests/**"],
    concurrency=8,
)

all_chunks = [chunk for result in directory_results for chunk in result.chunks]

jsonl_payload = chunker.to_jsonl(all_chunks)
csv_payload = chunker.to_csv(all_chunks)

stats = chunker.chunk_stats(all_chunks, size_unit="chars")
quality = chunker.quality_scores(
    all_chunks,
    min_chunk_size=80,
    max_chunk_size=1024,
    size_unit="chars",
)

langchain_docs = chunker.to_langchain_docs(all_chunks)
llamaindex_docs = chunker.to_llamaindex_docs(all_chunks)
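Downstream consumption of the JSONL export is one line of JSON per chunk. The sketch below fakes a payload rather than calling `to_jsonl`; the field names `index` and `text` are assumptions about the serialized schema, not a documented contract.

```python
import json

# Stand-in for chunker.to_jsonl(...) output; the actual schema may differ.
jsonl_payload = "\n".join([
    json.dumps({"index": 0, "text": "import os\n"}),
    json.dumps({"index": 1, "text": "def hello(): ...\n"}),
])

# Parse one record per non-empty line and restore source order by index.
records = [json.loads(line) for line in jsonl_payload.splitlines() if line.strip()]
records.sort(key=lambda r: r["index"])
texts = [r["text"] for r in records]
```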

File API

from omnichunk import chunk_file

chunks = chunk_file("path/to/file.py")

Directory API

from omnichunk import chunk_directory

results = chunk_directory("./src", glob="**/*.py", max_chunk_size=512, size_unit="chars")

for result in results:
    if result.error:
        print("error", result.filepath, result.error)
    else:
        print(result.filepath, len(result.chunks))

Chunk model

Every Chunk includes raw content, exact offsets, and rich context:

  • text: exact source slice (lossless reconstruction)
  • contextualized_text: embedding-ready representation
  • byte_range, line_range
  • context: scope, entities, siblings, imports, headings, section metadata
  • token_count, char_count, nws_count
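Presumably `nws_count` is a non-whitespace character count used as a whitespace-insensitive size metric; a sketch under that assumption:

```python
def nws_count(text: str) -> int:
    """Count non-whitespace characters (assumed meaning of nws_count)."""
    return sum(1 for ch in text if not ch.isspace())

# Indentation and newlines do not inflate the size.
assert nws_count("def f():\n    pass\n") == 11
```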

Supported content

Code

  • Python
  • JavaScript / TypeScript
  • Rust
  • Go
  • Java
  • C / C++ / C#
  • Ruby / PHP / Kotlin / Swift (grammar-dependent)

Prose

  • Markdown
  • Plaintext

Markdown fenced blocks are delegated by language:

  • fenced code (python, ts, etc.) routes to CodeEngine
  • fenced markup (json, yaml, toml, html, xml) routes to MarkupEngine
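The routing above can be sketched roughly as follows. The engine names follow the README; the dispatch logic, the regex, and the `ProseEngine` default for untagged fences are illustrative assumptions, not omnichunk's code.

```python
import re

MARKUP_LANGS = {"json", "yaml", "toml", "html", "xml"}

def route_fence(lang: str) -> str:
    # Decide which engine a fenced block's info string maps to.
    lang = lang.lower()
    if lang in MARKUP_LANGS:
        return "MarkupEngine"
    if lang:
        return "CodeEngine"
    return "ProseEngine"  # untagged fences stay with the prose engine

FENCE_RE = re.compile(r"^```(\w*)\n(.*?)^```", re.M | re.S)

md = "intro\n\n```python\nprint('hi')\n```\n\n```json\n{}\n```\n"
routes = [(m.group(1), route_fence(m.group(1))) for m in FENCE_RE.finditer(md)]
```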

Markup

  • JSON
  • YAML
  • TOML
  • HTML / XML

Hybrid

  • Python with heavy docstrings
  • Notebook-style # %% cell files

Architecture

src/omnichunk/
├── chunker.py
├── cli.py
├── quality.py
├── serialization.py
├── types.py
├── engine/
│   ├── router.py
│   ├── code_engine.py
│   ├── prose_engine.py
│   ├── markup_engine.py
│   └── hybrid_engine.py
├── parser/
│   ├── tree_sitter.py
│   ├── markdown_parser.py
│   ├── html_parser.py
│   └── languages.py
├── context/
│   ├── entities.py
│   ├── scope.py
│   ├── siblings.py
│   ├── imports.py
│   └── format.py
├── sizing/
│   ├── nws.py
│   ├── tokenizers.py
│   └── counter.py
└── windowing/
    ├── greedy.py
    ├── merge.py
    ├── split.py
    └── overlap.py
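The windowing stage (greedy packing of structural pieces, merging undersized ones) might look conceptually like this simplified sketch; `greedy_pack` is illustrative, not the actual `windowing/greedy.py` implementation.

```python
def greedy_pack(pieces: list[str], max_size: int) -> list[str]:
    """Greedily pack adjacent pieces into windows of at most max_size
    characters (the real library also supports token-based sizing)."""
    windows: list[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_size:
            windows.append(current)  # close the window, start a new one
            current = piece
        else:
            current += piece
    if current:
        windows.append(current)
    return windows

pieces = ["import os\n", "def a(): ...\n", "def b(): ...\n"]
windows = greedy_pack(pieces, max_size=24)
assert "".join(windows) == "".join(pieces)  # order and content preserved
```

Packing only adjacent pieces keeps the output contiguous and in source order, which is what makes the byte-range guarantees below possible.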

Determinism & integrity guarantees

omnichunk is built to preserve source fidelity:

  • Chunk boundaries are deterministic
  • Empty/whitespace-only chunks are dropped
  • Chunks are contiguous and non-overlapping in source order
  • Byte range integrity is validated in tests:

original_bytes = source.encode("utf-8")
for chunk in chunks:
    assert original_bytes[chunk.byte_range.start:chunk.byte_range.end].decode("utf-8") == chunk.text

Testing

Run the test suite:

pytest -q

Run benchmark scenarios:

python benchmarks/run_benchmarks.py
python benchmarks/run_comparisons.py
python benchmarks/run_quality_report.py

Run repository checks:

python scripts/check_ai_rules_sync.py
python scripts/check_benchmarks.py
python scripts/check_benchmarks.py --run-quality

Current suite covers:

  • API usage (chunk, chunk_file, Chunker)
  • Code/prose/markup/hybrid behavior
  • Context metadata (imports, siblings, scope, headings)
  • Sizing/tokenization/NWS logic
  • Overlap behavior
  • Edge cases (empty input, unicode, malformed syntax, range contiguity)

Contributing

Contribution and project process files:

  • CONTRIBUTING.md
  • CODE_OF_CONDUCT.md
  • SECURITY.md
  • GOVERNANCE.md
  • MAINTAINERS.md
  • ROADMAP.md
  • ARCHITECTURE.md

Install dev tooling and run pre-commit hooks:

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

Notes

  • Tree-sitter grammars are resolved dynamically and cached per language.
  • If a parser is unavailable, the system degrades gracefully with fallback heuristics.
  • contextualized_text is optimized for embedding quality while preserving raw text separately.
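A fallback heuristic of the kind described might split on blank-line boundaries when no grammar is available. This is illustrative only; the real fallback logic lives inside the engines.

```python
import re

def fallback_split(source: str) -> list[str]:
    """Split on runs of blank lines when structural parsing is unavailable."""
    blocks = re.split(r"\n{2,}", source.strip())
    return [b for b in blocks if b.strip()]

source = "def a():\n    pass\n\n\ndef b():\n    pass\n"
blocks = fallback_split(source)
```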
