Markdown chunker that respects your document's structure. Tables split at rows (not mid-row), headers are never orphaned from their content, and every chunk carries a section path. 192 tests, zero dependencies, Python 3.9+.

structchunk

Structure-aware text chunking for RAG pipelines. v0.1.0

Pure-Python, zero external dependencies. Two algorithms for markdown chunking: hierarchical (section-tree based, semantically coherent chunks) and linear (greedy block-by-block, fast). Every chunk carries a header breadcrumb for full document context, and Snowflake-like BIGINT IDs for database-friendly storage.

Chinese version

Features

structchunk solves the core problems that hurt retrieval quality in RAG pipelines: headers orphaned from content, tables broken mid-row, chunks missing section context. It works on markdown documents and produces chunks that respect the document structure.

  • Structure-aware parsing: respects markdown headers, code fences, tables, and lists to find natural break points. Naive splitters (LangChain CharacterTextSplitter, etc.) split on character count and break tables mid-row.

  • Two algorithms: hierarchical (default, section-tree based) produces chunks that always start at a section header. linear (greedy block-by-block) gives fine-grained control over split points.

  • Header-prefix breadcrumbs: each chunk carries an in-document-order breadcrumb with markdown-level prefix (e.g., ['# H1', '## H2', '### H3']) that becomes part of the chunk content. Embeddings see the full section context.

  • H1 in every chunk: the document title is injected into every chunk via a post-pass, so no chunk is contextually orphaned and deeply nested sections retain the document-level context.

  • Sentence-boundary splitting: long paragraphs are split at sentence boundaries in both Chinese (。！？) and English (.!?). Single sentences are never broken unless they exceed the hard max size.

  • Table row-boundary splitting: oversized tables are split at row boundaries with column headers re-prepended to every continuation chunk. Lists split at item boundaries, code blocks at line boundaries.

  • Context absorption: when a table or list starts a new chunk group, the algorithm looks back for the most recent non-blank paragraph and absorbs it as context within the hard limit.

  • Snowflake BIGINT chunk IDs: each chunk gets a 64-bit Snowflake-like int that maps directly to a SQL BIGINT PRIMARY KEY column. Sortable by creation time. The embedded timestamp is recoverable via chunk_id_timestamp_ms().

  • Zero runtime dependencies: pure Python with no required external packages. Only pytest is needed for the test suite.

  • Fork-safe and clock-resilient: ID generation uses os.register_at_fork (POSIX) so worker processes never generate colliding IDs. System clock jumps are handled by spin-waiting up to 10 ms, then raising RuntimeError.
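
To make the last two bullets more concrete, here is a minimal sketch of a Snowflake-like generator. It is not structchunk's implementation: the bit layout (41-bit timestamp, 10-bit worker, 12-bit sequence), the custom epoch, the pid-derived worker field, and the names SnowflakeSketch and timestamp_ms_of are all assumptions made for illustration. Only the 10 ms wait followed by RuntimeError and the use of os.register_at_fork come from the description above.

```python
import os
import threading
import time

EPOCH_MS = 1_600_000_000_000          # assumed custom epoch, not structchunk's
WORKER_BITS, SEQUENCE_BITS = 10, 12   # classic Snowflake layout (assumed)
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1
MAX_CLOCK_WAIT_MS = 10                # wait budget taken from the feature list


def _now_ms():
    return time.time_ns() // 1_000_000


class SnowflakeSketch:
    """Hypothetical generator of sortable IDs that fit a signed 64-bit BIGINT."""

    def __init__(self):
        self._lock = threading.Lock()
        self._reset()
        if hasattr(os, "register_at_fork"):  # POSIX only
            os.register_at_fork(after_in_child=self._reset_after_fork)

    def _reset(self):
        # Deriving the worker field from the pid is an assumption; two pids
        # can still share their low 10 bits, so this is not collision-proof.
        self._worker = os.getpid() & ((1 << WORKER_BITS) - 1)
        self._last_ms = -1
        self._sequence = 0

    def _reset_after_fork(self):
        self._lock = threading.Lock()  # the child must not inherit a held lock
        self._reset()

    def next_id(self):
        with self._lock:
            now = _now_ms()
            if now < self._last_ms:
                # The clock moved backwards: spin briefly, then give up.
                deadline = time.monotonic() + MAX_CLOCK_WAIT_MS / 1000
                while now < self._last_ms:
                    if time.monotonic() > deadline:
                        raise RuntimeError("system clock moved backwards")
                    now = _now_ms()
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted in this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = _now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker << SEQUENCE_BITS)
                | self._sequence
            )


def timestamp_ms_of(chunk_id):
    """Recover the Unix timestamp in milliseconds embedded in an ID."""
    return (chunk_id >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS
```

Because the timestamp occupies the high bits, sorting by ID sorts by creation time, which is what makes such IDs convenient as a BIGINT primary key.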

Installation

pip install structchunk

From source (includes test dependencies):

git clone https://github.com/yzp0111/structchunk
cd structchunk
pip install -e ".[test]"

Via uv:

uv pip install structchunk

Requires Python 3.9 or later. No runtime dependencies beyond the standard library.

Quick Start

import structchunk

chunks = structchunk.chunk(
    "# Title\n\nSome content with a long paragraph that needs splitting.",
    max_chars=500,
)

for c in chunks:
    print(f"[{c.metadata.chunk_index}] {c.metadata.header_breadcrumb}")
    print(c.content)
    print()

Output (default hierarchical algorithm):

[0] ['# Title']
# Title

Some content with a long paragraph that needs splitting.

The chunk() function is the main entry point. It accepts markdown text and returns a list of MarkdownChunk objects. The max_chars parameter caps every chunk at the given size. Additional keyword arguments are forwarded to the algorithm's chunk function.

The breadcrumb entry includes the # prefix, distinguishing header levels (# H1, ## H2, ### H3). The H1 document title is present in every chunk, not just the first one, so downstream embeddings always have the document-level context.
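
The breadcrumb behaviour can be approximated with a simple header stack. The sketch below is an illustration only, not structchunk's parser: the names HEADER_RE and breadcrumbs are hypothetical, and it ignores code fences, so a line starting with # inside a code block would be misread as a header.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def breadcrumbs(markdown):
    """Return (breadcrumb, line) pairs for every non-blank, non-header line."""
    stack = []  # (level, "## Title") entries for the currently open sections
    out = []
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            # A new header closes every open section at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, f"{match.group(1)} {match.group(2)}"))
        elif line.strip():
            out.append(([text for _, text in stack], line))
    return out


doc = "# Guide\n\nintro\n\n## Setup\n\nstep\n\n### Linux\n\napt\n\n## Usage\n\nrun"
for crumb, line in breadcrumbs(doc):
    print(crumb, line)
```

Keeping the # prefix in each entry is what lets a consumer tell an H2 from an H3 without re-parsing the document.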

Each chunk also carries a Snowflake-like chunk_id (a Python int ready for SQL BIGINT), source_element_type and source_element_position for provenance tracking, character offsets into the original document, pre-computed character counts, and prev_chunk_id / next_chunk_id pointers for linked-list traversal. Call chunk.expand(include_breadcrumb=True) to get a retrieval-ready view with breadcrumb prepended to content.
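
One common use of the prev/next pointers is widening a retrieval hit with its neighbours. The following is a sketch under assumptions: StubChunk is a stand-in class, not structchunk's MarkdownChunk, the flat field layout is hypothetical, and a missing neighbour is assumed to be represented by None.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StubChunk:
    """Stand-in for a chunk; only the three ID field names come from the text above."""
    chunk_id: int
    content: str
    prev_chunk_id: Optional[int] = None
    next_chunk_id: Optional[int] = None


def with_neighbors(index: Dict[int, StubChunk], chunk_id: int, window: int = 1) -> List[StubChunk]:
    """Return the hit plus up to `window` chunks on each side, in document order."""
    hit = index[chunk_id]
    before: List[StubChunk] = []
    cursor = hit
    for _ in range(window):
        if cursor.prev_chunk_id is None:
            break
        cursor = index[cursor.prev_chunk_id]
        before.append(cursor)
    after: List[StubChunk] = []
    cursor = hit
    for _ in range(window):
        if cursor.next_chunk_id is None:
            break
        cursor = index[cursor.next_chunk_id]
        after.append(cursor)
    return before[::-1] + [hit] + after


a = StubChunk(1, "A", None, 2)
b = StubChunk(2, "B", 1, 3)
c = StubChunk(3, "C", 2, None)
index = {chunk.chunk_id: chunk for chunk in (a, b, c)}
print([chunk.content for chunk in with_neighbors(index, 2)])
```

In a database-backed setup the dictionary lookup would become a primary-key query, which is where integer IDs pay off.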

For file input, use chunk_file():

chunks = structchunk.chunk_file("path/to/document.md", max_chars=500)

The file's absolute path is used as the doc_id automatically. For JSON serialization:

dicts = structchunk.chunk_to_dicts(chunks)

Algorithms

| Algorithm | Default | When to use |
|---|---|---|
| hierarchical | Yes | Documents with clear section hierarchy (technical docs, reports, books). Produces semantically coherent chunks that always start at a section header. |
| linear | No | Documents without strict section structure, or when you want fine-grained control over split points. Fast greedy assembly with type-specific sub-splitters. |

# Hierarchical (default, section-tree based)
chunks = structchunk.chunk(content, algorithm="hierarchical", max_chars=500)

# Linear (greedy block-by-block)
chunks = structchunk.chunk(content, algorithm="linear", max_chars=500)

The hierarchical algorithm builds a section tree from the document's header hierarchy, walks the tree bottom-up, and emits one chunk for each section that fits within the size cap. Oversized sections are sub-split at natural boundaries (sentence, table row, list item, code line). Adjacent same-level sibling sections are greedily merged when they fit together, subject to a section-complete invariant: a complete section can merge with its siblings, but a residual tail left over from a split section cannot, which keeps content from different sections out of the same chunk. Hierarchical is the default because it produces the most semantically coherent chunks; it is the right choice for technical docs, reports, books, or any content with a clear heading structure.
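
The core idea of the section tree can be sketched in a few lines. This is a simplified illustration, not structchunk's algorithm: it omits sibling merging, sub-splitting of oversized bodies, breadcrumbs, and code-fence handling, and the names Section, build_tree, and emit are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List

HEADER_RE = re.compile(r"^(#{1,6})\s+\S")


@dataclass
class Section:
    level: int
    lines: List[str] = field(default_factory=list)      # header line plus own body
    children: List["Section"] = field(default_factory=list)

    def text(self) -> str:
        """Own body followed by the text of all descendants."""
        parts = ["\n".join(self.lines).strip()] + [c.text() for c in self.children]
        return "\n\n".join(p for p in parts if p)


def build_tree(markdown: str) -> Section:
    root = Section(level=0)
    stack = [root]
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            level = len(match.group(1))
            while stack[-1].level >= level:  # the root (level 0) is never popped
                stack.pop()
            node = Section(level=level, lines=[line])
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].lines.append(line)
    return root


def emit(section: Section, max_chars: int) -> List[str]:
    """Emit a whole subtree if it fits, otherwise its own body and then its children."""
    whole = section.text()
    if whole and len(whole) <= max_chars:
        return [whole]
    chunks = []
    own = "\n".join(section.lines).strip()
    if own:
        chunks.append(own)  # may still exceed max_chars; a real splitter would sub-split
    for child in section.children:
        chunks.extend(emit(child, max_chars))
    return chunks


doc = "# T\n\nintro\n\n## A\n\naaa\n\n## B\n\nbbb"
print(emit(build_tree(doc), 12))
```

Every chunk produced this way starts at a header line, which matches the property the hierarchical algorithm is described as guaranteeing.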

The linear algorithm uses greedy block-by-block assembly. Each block (paragraph, table, list, code fence) is added to the current chunk until it would exceed the size cap, then a new chunk starts. Oversized blocks are delegated to type-specific sub-splitters: paragraphs split at sentence boundaries, tables at row boundaries, lists at item boundaries, code fences at line boundaries. The linear algorithm is simpler and faster, making it a good choice for flat documents without section hierarchy.
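
Greedy assembly with a sentence-level fallback can be sketched as follows. This is an approximation under assumptions, not the library's code: blocks are taken to be blank-line-separated, only the paragraph sub-splitter is shown, a single sentence longer than the cap is left whole, and the names split_sentences, pack, and linear_chunks are hypothetical.

```python
import re
from typing import List

# Split after English terminators when whitespace follows, and directly after
# Chinese terminators (which are not followed by a space).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])(?=\s)|(?<=[。！？])")


def split_sentences(paragraph: str) -> List[str]:
    """Split into sentences; each piece keeps its leading whitespace."""
    return [s for s in SENTENCE_BOUNDARY.split(paragraph) if s]


def pack(units: List[str], max_chars: int, sep: str) -> List[str]:
    """Greedily join units, starting a new piece when the cap would be exceeded."""
    pieces, current = [], ""
    for unit in units:
        candidate = unit if not current else current + sep + unit
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = unit
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def linear_chunks(markdown: str, max_chars: int) -> List[str]:
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    units: List[str] = []
    for block in blocks:
        if len(block) > max_chars:
            # Oversized block: fall back to sentence boundaries.
            units.extend(p.strip() for p in pack(split_sentences(block), max_chars, ""))
        else:
            units.append(block)
    return pack(units, max_chars, "\n\n")


text = "# T\n\nOne. Two. Three.\n\nshort"
print(linear_chunks(text, 14))
```

The same pack helper serves both levels, which is the appeal of the greedy approach: one pass, no tree, and the only state is the chunk currently being filled.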

Both algorithms share the same configuration parameters: max_chars, max_chunk_size, hard_max_size, min_chunk_size, sub_split_paragraph, sub_split_table, sub_split_code, sub_split_list, preserve_table_header, preserve_code_fence, forward_intro_text, and doc_id. See the API reference for details on each parameter.
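
As a rough picture of what table sub-splitting with a preserved header involves, the sketch below splits a pipe table at row boundaries and repeats the header and delimiter rows in every piece. It is illustrative only: split_table is a hypothetical name, and the sketch assumes a well-formed table whose first two lines are the header and the delimiter row.

```python
from typing import List


def split_table(table: str, max_chars: int) -> List[str]:
    """Split a markdown pipe table at row boundaries, repeating the header."""
    lines = table.strip().splitlines()
    header, rows = lines[:2], lines[2:]  # header row plus |---| delimiter row
    if not rows:
        return [table.strip()]
    header_len = len("\n".join(header))
    pieces: List[str] = []
    current: List[str] = []
    size = header_len
    for row in rows:
        row_len = len(row) + 1  # +1 for the joining newline
        if current and size + row_len > max_chars:
            pieces.append("\n".join(header + current))
            current, size = [], header_len
        current.append(row)
        size += row_len
    pieces.append("\n".join(header + current))
    return pieces


table = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |\n| 5 | 6 |"
for piece in split_table(table, max_chars=40):
    print(piece)
    print()
```

Repeating the header costs a few characters per piece, but each piece then remains a valid table that can be embedded and read on its own.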

CLI

After installation, the structchunk command is available as a console script:

structchunk document.md                                  # hierarchical, 500-char cap
structchunk document.md --algorithm linear               # greedy block-by-block
structchunk document.md --max-chars 300 --format json    # 300-char cap, JSON output
structchunk document.md --quiet                          # suppress summary
structchunk document.md --output-dir /tmp/chunks         # custom output directory

| Flag | Default | Description |
|---|---|---|
| --algorithm | hierarchical | Chunking algorithm: hierarchical or linear |
| --max-chars | 500 | Hard cap on chunk size in characters |
| --format | both | Output format: json, md, or both |
| --quiet | False | Only save files, don't print summary |
| --output-dir | ./test_result/ | Directory for output files |

Output files include the input file stem, algorithm name, and a timestamp in their filename:

  • document-hierarchical-20250101_120000.json
  • document-hierarchical-20250101_120000.md
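
If you need to predict or match these filenames in a script, the pattern can be reproduced as below. This is a sketch inferred from the two example names, not taken from the CLI source: the function name output_paths is hypothetical, and the timestamp format %Y%m%d_%H%M%S and the use of local time are assumptions.

```python
from datetime import datetime
from pathlib import Path


def output_paths(input_file, algorithm, output_dir="./test_result/", now=None):
    """Build the JSON and Markdown output paths for one CLI run (assumed pattern)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    base = f"{Path(input_file).stem}-{algorithm}-{stamp}"
    out = Path(output_dir)
    return out / f"{base}.json", out / f"{base}.md"


json_path, md_path = output_paths(
    "docs/document.md", "hierarchical", now=datetime(2025, 1, 1, 12, 0, 0)
)
print(json_path.name)
print(md_path.name)
```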

JSON output contains the full chunk list with all metadata fields serialized as dicts, suitable for programmatic consumption. Markdown output renders each chunk as a human-readable section with breadcrumb, source element type, character range, chunk ID, and linked-list pointers.

When --quiet is omitted, the CLI prints a summary table showing each chunk's index, character count, source type, and breadcrumb path, along with aggregate statistics: total chunks, size range, type distribution, continuation count, and elapsed time.

The output directory defaults to ./test_result/ and is created automatically if it does not exist.

Contributing

Contributions are welcome. See CONTRIBUTING.md for:

  • Development setup and installation from source
  • Project layout and module overview
  • Running the test suite
  • Submitting pull requests and reporting bugs

Bug reports and pull requests are welcome on GitHub.

License

MIT
