Markdown chunker that respects your document's structure. Tables split at rows (not mid-row), headers are never orphaned from their content, and every chunk carries a section path. 192 tests, zero dependencies, Python 3.9+.
structchunk
Structure-aware text chunking for RAG pipelines. v0.1.0
Pure-Python, zero external dependencies. Two algorithms for markdown chunking: hierarchical (section-tree based, semantically coherent chunks) and linear (greedy block-by-block, fast). Every chunk carries a header breadcrumb for full document context, and Snowflake-like BIGINT IDs for database-friendly storage.
Features
structchunk solves the core problems that hurt retrieval quality in RAG pipelines: headers orphaned from content, tables broken mid-row, chunks missing section context. It works on markdown documents and produces chunks that respect the document structure.
- Structure-aware parsing: respects markdown headers, code fences, tables, and lists to find natural break points. Naive splitters (LangChain CharacterTextSplitter, etc.) split on character count and break tables mid-row.
- Two algorithms: hierarchical (default, section-tree based) produces chunks that always start at a section header. linear (greedy block-by-block) gives fine-grained control over split points.
- Header-prefix breadcrumbs: each chunk carries an in-document-order breadcrumb with markdown-level prefix (e.g., ['# H1', '## H2', '### H3']) that becomes part of the chunk content. Embeddings see the full section context.
- H1 in every chunk: the document title is injected into every chunk via a post-pass. No chunk is contextually orphaned. Deep-nested sections retain the document-level context.
- Sentence-boundary splitting: long paragraphs are split at sentence boundaries in both Chinese (。!?) and English (.!?). Single sentences are never broken unless they exceed the hard max size.
- Table row-boundary splitting: oversized tables are split at row boundaries with column headers re-prepended to every continuation chunk. Lists split at item boundaries, code blocks at line boundaries.
- Context absorption: when a table or list starts a new chunk group, the algorithm looks back for the most recent non-blank paragraph and absorbs it as context within the hard limit.
- Snowflake BIGINT chunk IDs: each chunk gets a 64-bit Snowflake-like int that maps directly to a SQL BIGINT PRIMARY KEY column. Sortable by creation time. The embedded timestamp is recoverable via chunk_id_timestamp_ms().
- Zero runtime dependencies: pure Python with no required external packages. Only pytest is needed for the test suite.
- Fork-safe and clock-resilient: ID generation uses os.register_at_fork (POSIX) so worker processes never generate colliding IDs. System clock jumps are handled by spin-waiting up to 10 ms, then raising RuntimeError.
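The table row-boundary rule from the list above can be illustrated with a small standalone sketch. It does not use structchunk's internals: the function name split_table_rows and the way characters are counted are assumptions made for illustration only.

```python
def split_table_rows(table: str, max_chars: int) -> list[str]:
    """Split a markdown table at row boundaries, repeating the header.

    Hypothetical helper, not part of structchunk's API.
    """
    lines = table.strip().splitlines()
    # The first two lines are the column header row and the |---| separator.
    header, rows = lines[:2], lines[2:]
    header_len = sum(len(h) + 1 for h in header)  # +1 for each newline

    chunks, current, size = [], [], header_len
    for row in rows:
        row_len = len(row) + 1
        # Start a new chunk when this row would overflow, but never emit
        # a chunk without rows: a single oversized row stays whole.
        if current and size + row_len > max_chars:
            chunks.append("\n".join(header + current))
            current, size = [], header_len
        current.append(row)
        size += row_len
    if current:
        chunks.append("\n".join(header + current))
    return chunks


table = "| a | b |\n|---|---|\n| 1 | 2 |\n| 3 | 4 |\n| 5 | 6 |"
for part in split_table_rows(table, max_chars=40):
    print(part, end="\n\n")
```

Every continuation chunk starts with the same two header lines, so a retrieved fragment still shows which column each cell belongs to.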
Installation
pip install structchunk
From source (includes test dependencies):
git clone https://github.com/yzp0111/structchunk
cd structchunk
pip install -e ".[test]"
Via uv:
uv pip install structchunk
Requires Python 3.9 or later. No runtime dependencies beyond the standard library.
Quick Start
import structchunk
chunks = structchunk.chunk(
    "# Title\n\nSome content with a long paragraph that needs splitting.",
    max_chars=500,
)
for c in chunks:
    print(f"[{c.metadata.chunk_index}] {c.metadata.header_breadcrumb}")
    print(c.content)
    print()
Output (default hierarchical algorithm):
[0] ['# Title']
# Title
Some content with a long paragraph that needs splitting.
The chunk() function is the main entry point. It accepts markdown text and returns
a list of MarkdownChunk objects. The max_chars parameter caps every chunk at the
given size. Additional keyword arguments are forwarded to the algorithm's chunk function.
The breadcrumb entry includes the # prefix, distinguishing header levels (# H1,
## H2, ### H3). The H1 document title is present in every chunk, not just the first
one, so downstream embeddings always have the document-level context.
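To make the breadcrumb format concrete, here is a minimal standalone sketch of how such a header trail could be derived from markdown. It is an illustration under assumptions, not structchunk's implementation: the breadcrumbs function and HEADER_RE pattern are hypothetical, and only ATX-style headers (# through ######) are handled.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker; headers inside fences are ignored


def breadcrumbs(markdown: str) -> list[list[str]]:
    """Return the breadcrumb in effect at each header, in document order."""
    stack: list[tuple[int, str]] = []  # (level, "## Title") of open ancestors
    trail = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        match = None if in_fence else HEADER_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        # Headers at the same or a deeper level are no longer ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, f"{match.group(1)} {match.group(2)}"))
        trail.append([text for _, text in stack])
    return trail


doc = "# Guide\n\n## Install\n\n### Linux\n\n## Usage\n"
for crumb in breadcrumbs(doc):
    print(crumb)
```

Keeping the # prefix in each entry preserves the header level, which is why '## Usage' replaces both '## Install' and '### Linux' in the last breadcrumb.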
Each chunk also carries a Snowflake-like chunk_id (a Python int ready for SQL
BIGINT), source_element_type and source_element_position for provenance tracking,
character offsets into the original document, pre-computed character counts, and
prev_chunk_id / next_chunk_id pointers for linked-list traversal. Call
chunk.expand(include_breadcrumb=True) to get a retrieval-ready view with breadcrumb
prepended to content.
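For readers unfamiliar with Snowflake-style IDs, the sketch below shows the general idea of packing a millisecond timestamp into a sortable integer that fits a signed 64-bit column. The bit layout (41 timestamp bits, 10 worker bits, 12 sequence bits), the custom epoch, and the names IdGenerator and timestamp_ms_of are assumptions borrowed from the classic Snowflake scheme; structchunk's actual layout is not described here and may differ. Fork handling is omitted.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 UTC, an arbitrary custom epoch
WORKER_BITS, SEQ_BITS = 10, 12  # assumed layout, classic Snowflake style


class IdGenerator:
    """Hypothetical Snowflake-like generator, for illustration only."""

    def __init__(self, worker_id: int = 0):
        self.worker_id = worker_id & ((1 << WORKER_BITS) - 1)
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self.lock:
            now = self._now_ms()
            # Clock moved backwards: spin briefly, then give up.
            deadline = time.monotonic() + 0.010
            while now < self.last_ms:
                if time.monotonic() > deadline:
                    raise RuntimeError("system clock moved backwards")
                now = self._now_ms()
            if now == self.last_ms:
                self.seq = (self.seq + 1) & ((1 << SEQ_BITS) - 1)
                if self.seq == 0:  # sequence exhausted, wait for next ms
                    while now <= self.last_ms:
                        now = self._now_ms()
            else:
                self.seq = 0
            self.last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                | (self.worker_id << SEQ_BITS)
                | self.seq
            )


def timestamp_ms_of(chunk_id: int) -> int:
    """Recover the Unix timestamp (ms) embedded in an ID."""
    return (chunk_id >> (WORKER_BITS + SEQ_BITS)) + EPOCH_MS


gen = IdGenerator(worker_id=3)
ids = [gen.next_id() for _ in range(5)]
print(ids == sorted(ids), timestamp_ms_of(ids[0]))
```

Because the timestamp occupies the high bits, ordering by the integer orders by creation time, which is what makes such IDs convenient as a BIGINT primary key.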
For file input, use chunk_file():
chunks = structchunk.chunk_file("path/to/document.md", max_chars=500)
The file's absolute path is used as the doc_id automatically. For JSON serialization:
dicts = structchunk.chunk_to_dicts(chunks)
Algorithms
| Algorithm | Default | When to use |
|---|---|---|
| hierarchical | Yes | Documents with clear section hierarchy (technical docs, reports, books). Produces semantically coherent chunks that always start at a section header. |
| linear | No | Documents without strict section structure, or when you want fine-grained control over split points. Fast greedy assembly with type-specific sub-splitters. |
# Hierarchical (default, section-tree based)
chunks = structchunk.chunk(content, algorithm="hierarchical", max_chars=500)
# Linear (greedy block-by-block)
chunks = structchunk.chunk(content, algorithm="linear", max_chars=500)
The hierarchical algorithm builds a section tree from the document's header hierarchy. It walks the tree bottom-up and emits one chunk per section that fits within the size cap. It is the default because it produces the most semantically coherent chunks. Oversized sections are sub-split at natural boundaries (sentence, table row, list item, code line). Adjacent same-level sibling sections are greedily merged when they fit together, subject to a section-complete invariant: a complete section can merge with siblings, but a residual tail from a split section cannot. This prevents cross-contamination between different sections. Hierarchical is the right choice for technical docs, reports, books, or any content with a clear heading structure.
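The section-tree idea can be sketched in a few lines. The version below is a deliberate simplification and an assumption about the general shape, not structchunk's algorithm: it emits a section whole when its rendered text fits the cap and otherwise descends into its children, and it omits sibling merging, sub-splitting of oversized leaves, and code-fence awareness. Section, build_tree, render, and chunk_sections are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    level: int
    title: str
    body: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def build_tree(markdown: str) -> Section:
    """Nest sections by header level under a synthetic level-0 root."""
    root = Section(level=0, title="")
    stack = [root]
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.+)$", line)
        if match:
            level = len(match.group(1))
            while stack[-1].level >= level:  # close deeper or equal sections
                stack.pop()
            node = Section(level, match.group(2).strip())
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].body.append(line)
    return root


def render(section: Section) -> str:
    """Render a section with its header, body, and all descendants."""
    head = [f"{'#' * section.level} {section.title}"] if section.level else []
    text = "\n".join(head + section.body).strip()
    for child in section.children:
        text = (text + "\n\n" + render(child)).strip()
    return text


def chunk_sections(section: Section, max_chars: int) -> list[str]:
    whole = render(section)
    if whole and len(whole) <= max_chars:
        return [whole]  # the complete section fits in one chunk
    # Otherwise emit the section's own text, then handle each child.
    own = render(Section(section.level, section.title, section.body))
    chunks = [own] if own else []
    for child in section.children:
        chunks.extend(chunk_sections(child, max_chars))
    return chunks


doc = "# T\n\nintro\n\n## A\n\naaa\n\n## B\n\nbbb"
print(chunk_sections(build_tree(doc), max_chars=20))
```

With a generous cap the whole document comes back as a single chunk; with a tight cap each chunk still begins at a section header, which is the property the hierarchical algorithm is described as guaranteeing.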
The linear algorithm uses greedy block-by-block assembly. Each block (paragraph, table, list, code fence) is added to the current chunk until it would exceed the size cap, then a new chunk starts. Oversized blocks are delegated to type-specific sub-splitters: paragraphs split at sentence boundaries, tables at row boundaries, lists at item boundaries, code fences at line boundaries. The linear algorithm is simpler and faster, making it a good choice for flat documents without section hierarchy.
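Greedy block-by-block assembly is easy to demonstrate in isolation. The sketch below is a hedged approximation rather than structchunk's code: it treats blank-line-separated text as blocks, splits only oversized paragraphs at sentence boundaries, and ignores tables, lists, and code fences. The names split_sentences, pack, and linear_chunks are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after . ! ? followed by whitespace, or directly after 。！？."""
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    return [p for p in parts if p]


def pack(units: list[str], max_chars: int, sep: str) -> list[str]:
    """Greedily join units until adding the next one would exceed the cap."""
    chunks, current = [], ""
    for unit in units:
        candidate = unit if not current else current + sep + unit
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def linear_chunks(markdown: str, max_chars: int) -> list[str]:
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    units = []
    for block in blocks:
        if len(block) > max_chars:
            # Oversized paragraph: fall back to sentence-level pieces.
            units.extend(pack(split_sentences(block), max_chars, " "))
        else:
            units.append(block)
    return pack(units, max_chars, "\n\n")


text = "Intro.\n\nOne two three. Four five six. Seven eight nine.\n\nEnd."
for chunk in linear_chunks(text, max_chars=32):
    print(repr(chunk))
```

A single sentence longer than the cap is kept whole in this sketch, which mirrors the rule that sentences are only broken when they exceed a separate hard maximum.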
Both algorithms share the same configuration parameters: max_chars, max_chunk_size,
hard_max_size, min_chunk_size, sub_split_paragraph, sub_split_table,
sub_split_code, sub_split_list, preserve_table_header, preserve_code_fence,
forward_intro_text, and doc_id. See the API reference for details on each parameter.
CLI
After installation, the structchunk command is available as a console script:
structchunk document.md # hierarchical, 500-character cap
structchunk document.md --algorithm linear # greedy block-by-block
structchunk document.md --max-chars 300 --format json # 300-character cap, JSON output
structchunk document.md --quiet # suppress summary
structchunk document.md --output-dir /tmp/chunks # custom output directory
| Flag | Default | Description |
|---|---|---|
| --algorithm | hierarchical | Chunking algorithm: hierarchical or linear |
| --max-chars | 500 | Hard cap on chunk size in characters |
| --format | both | Output format: json, md, or both |
| --quiet | False | Only save files, don't print summary |
| --output-dir | ./test_result/ | Directory for output files |
Output files include the input file stem, algorithm name, and a timestamp in their filename:
document-hierarchical-20250101_120000.json
document-hierarchical-20250101_120000.md
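The naming pattern shown above could be reproduced along these lines. This is a sketch inferred from the example filenames only: output_paths is a hypothetical helper, and the timestamp format string is an assumption that happens to match the sample names.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def output_paths(
    input_file: str,
    algorithm: str,
    output_dir: str = "./test_result/",
    now: Optional[datetime] = None,
) -> tuple[Path, Path]:
    """Build <stem>-<algorithm>-<YYYYMMDD_HHMMSS>.json and .md paths."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    base = f"{Path(input_file).stem}-{algorithm}-{stamp}"
    out = Path(output_dir)
    return out / f"{base}.json", out / f"{base}.md"


json_path, md_path = output_paths(
    "docs/document.md", "hierarchical", now=datetime(2025, 1, 1, 12, 0, 0)
)
print(json_path.name)
print(md_path.name)
```

A caller would typically run Path(output_dir).mkdir(parents=True, exist_ok=True) before writing, so that a missing output directory is created on first use.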
JSON output contains the full chunk list with all metadata fields serialized as dicts, suitable for programmatic consumption. Markdown output renders each chunk as a human-readable section with breadcrumb, source element type, character range, chunk ID, and linked-list pointers.
When --quiet is omitted, the CLI prints a summary table showing each chunk's index,
character count, source type, and breadcrumb path, along with aggregate statistics:
total chunks, size range, type distribution, continuation count, and elapsed time.
The output directory defaults to ./test_result/ and is created automatically if it
does not exist.
Documentation
- Quick Start
- Algorithms (sentence splitting, header pull-up, context absorption, breadcrumb construction, sibling merge)
- API Reference (chunk(), chunk_file(), chunk_to_dicts(), keyword arguments)
- CLI Usage (flags, output formats, examples)
- Metadata Reference (all fields on ChunkMetadata)
- Why structchunk? (design rationale, UUID4 vs Snowflake BIGINT, fork safety)
- Database Schema (PostgreSQL schema with BIGINT primary key and pgvector column)
Contributing
Contributions are welcome. See CONTRIBUTING.md for:
- Development setup and installation from source
- Project layout and module overview
- Running the test suite
- Submitting pull requests and reporting bugs
Bug reports and pull requests are welcome on GitHub.
License