Structure-aware text chunking for RAG. Zero dependencies.

These details have not been verified by PyPI

Project links

Project description

chunkweaver

The structure-aware chunking layer between your extractor and your vector DB.

Where chunkweaver fits

  PDF / DOCX / HTML              Your vector DB
        │                              ▲
        ▼                              │
  ┌───────────┐    ┌──────────────┐    │
  │ Extractor │───▶│ chunkweaver  │────┘
  │           │    │              │
  │ unstructured   │ boundaries   │  Embed + upsert
  │ marker-pdf     │ detectors    │  into Pinecone,
  │ docling        │ presets      │  Qdrant, Weaviate,
  │ pdfminer       │ overlap      │  ChromaDB, etc.
  └───────────┘    └──────────────┘

Your extractor turns files into text. Your vector DB stores embeddings. chunkweaver sits in the middle — splitting that text at structural boundaries so each chunk is a coherent unit of meaning, not an arbitrary slice of characters.

The problem

Standard chunkers — including LangChain's RecursiveCharacterTextSplitter — are paragraph-aware but not structure-aware. They don't know that "Article 17" starts a new legal section, or that a financial table's header row belongs with its data. The result: chunks that split mid-section, producing blurry embeddings and incomplete retrievals.

Fixed-size chunking cuts through sentences; structure-aware splits follow logical sections

Our LLM-as-judge benchmark on 11 structured documents (GDPR, EU AI Act, CCPA, 8 IETF RFCs) and 58 queries shows:

Baseline	CW wins	Baseline wins	p-value
Naive 600-char	15	4	0.019
LangChain RCTS	11	4	0.119

chunkweaver significantly outperforms naive chunking (p < 0.02). Against LangChain RCTS, the win ratio is similar (11:4) but not significant at this sample size — RCTS's paragraph heuristic captures some structural signal, narrowing the gap. The advantage is clearest on documents with explicit section markers. See benchmark/ for full results, methodology, and reproduction steps.

What chunkweaver does

Two layers of structure-aware splitting:

Regex boundaries — you tell the chunker where sections start (^Article \d+, ^## , ^Item 1.)
Heuristic detectors — the chunker discovers structure itself (headings by casing/whitespace, tables by numeric patterns)

Both layers work together. Detectors can emit split points ("start a new chunk here") or keep-together regions ("don't split this table"). When they conflict, keep-together wins.

Zero dependencies — stdlib only, no LangChain/LlamaIndex tax
User-defined boundaries — regex patterns, not hard-coded heuristics
Heuristic detectors — HeadingDetector, TableDetector for semi-structured documents
Semantic overlap — sentences, not characters
Full metadata — offsets, boundary types, overlap tracking
Integrations — LangChain and LlamaIndex drop-ins; Unstructured planned

Install

pip install chunkweaver

Extras:

pip install chunkweaver[cli]        # CLI with click
pip install chunkweaver[langchain]   # LangChain TextSplitter integration
pip install chunkweaver[llamaindex]  # LlamaIndex NodeParser integration
pip install chunkweaver[dev]         # pytest + coverage

Quick start

from chunkweaver import Chunker

# Minimal — just better defaults
chunker = Chunker(target_size=1024)
chunks = chunker.chunk(text)

# Full configuration
chunker = Chunker(
    target_size=1024,
    overlap=2,
    overlap_unit="sentence",
    boundaries=[
        r"^Article\s+\d+",   # GDPR articles
        r"^#{1,3}\s",        # Markdown headers
        r"^\d+\.\d+\s",      # Numbered sections (RFC)
        r"^TABLE\s+",        # Table headers
    ],
    fallback="paragraph",
    min_size=200,
)

# List of strings
chunks = chunker.chunk(text)

# With metadata
chunks = chunker.chunk_with_metadata(text)
for c in chunks:
    print(c.text)            # full chunk content
    print(c.start, c.end)    # character offsets in original
    print(c.boundary_type)   # "section" | "paragraph" | "sentence" | "word"
    print(c.overlap_text)    # the overlap prefix (if any)
    print(c.content_text)    # text without overlap (for dedup)

Vector DB integration

chunkweaver is designed for vector database ingest pipelines:

from chunkweaver import Chunker
from chunkweaver.presets import LEGAL_EU

chunker = Chunker(
    target_size=1024,
    overlap=2,
    overlap_unit="sentence",
    boundaries=LEGAL_EU,
)

chunks = chunker.chunk_with_metadata(document_text)

# Prepare records for your vector DB
records = [
    {
        "id": f"doc-{doc_id}-chunk-{c.index}",
        "text": c.text,
        "metadata": {
            "source": filename,
            "start": c.start,
            "end": c.end,
            "boundary_type": c.boundary_type,
            "has_overlap": bool(c.overlap_text),
        },
    }
    for c in chunks
]

# Embed and upsert into Pinecone / Qdrant / Weaviate / ChromaDB / etc.

Presets

Built-in boundary patterns for common document types:

from chunkweaver.presets import (
    LEGAL_EU, LEGAL_US, RFC, MARKDOWN,
    CHAT, CLINICAL, FINANCIAL, FINANCIAL_TABLE, PLAIN,
)

Preset	Domain	Detects
`LEGAL_EU`	EU legislation	`Article N`, `CHAPTER`, `SECTION`, `(1)` recitals
`LEGAL_US`	US law / contracts	`§ N`, `Section N`, `WHEREAS`, `1.1` clauses
`RFC`	IETF RFCs	`1. Intro`, `3.1 Overview`, `Appendix A`
`MARKDOWN`	Markdown	`# headings`, `---` rules
`CHAT`	Chat logs	`[14:30]`, ISO timestamps, `speaker:` turns
`CLINICAL`	Medical notes	`HPI:`, `ASSESSMENT:`, `PLAN:`, etc.
`FINANCIAL`	SEC filings	`Item 1.`, `PART I`, `NOTE 1`, `Schedule A`
`FINANCIAL_TABLE`	Data tables	`TABLE N`, markdown/ASCII separators
`PLAIN`	Any	No boundaries — pure paragraph/sentence fallback

Combine presets freely:

boundaries = LEGAL_EU + [r"^TABLE\s+", r"^Annex\s+"]
boundaries = FINANCIAL + FINANCIAL_TABLE

LangChain integration

Drop-in replacement for RecursiveCharacterTextSplitter:

from chunkweaver.integrations.langchain import ChunkWeaverSplitter

splitter = ChunkWeaverSplitter(
    target_size=1024,
    overlap=2,
    boundaries=[r"^#{1,3}\s"],
)

# Works with LangChain document loaders
docs = splitter.create_documents([text])

Requires: pip install chunkweaver[langchain]

LlamaIndex integration

Drop-in NodeParser for LlamaIndex ingestion pipelines:

from chunkweaver.integrations.llamaindex import ChunkWeaverNodeParser
from chunkweaver.presets import LEGAL_EU

parser = ChunkWeaverNodeParser(
    target_size=1024,
    boundaries=LEGAL_EU,
    overlap=2,
)

# Works with LlamaIndex document loaders and ingestion pipelines
nodes = parser.get_nodes_from_documents(documents)

Supports all chunkweaver features including detectors:

from chunkweaver.detector_heading import HeadingDetector

parser = ChunkWeaverNodeParser(
    target_size=1024,
    detectors=[HeadingDetector()],
)

Requires: pip install chunkweaver[llamaindex]

CLI

# Basic usage
chunkweaver document.txt --size 1024 --overlap 2

# With boundary patterns
chunkweaver file.txt --boundaries "^Article\s+\d+" "^CHAPTER\s+"

# Use a preset
chunkweaver legal_doc.txt --preset legal-eu --format json

# JSONL output (one chunk per line, pipe-friendly)
chunkweaver file.txt --preset markdown --format jsonl

# Preview boundary detection (tune your patterns)
chunkweaver file.txt --detect-boundaries --boundaries "^Article\s+\d+"

# Pipe from stdin
cat document.txt | chunkweaver --size 1024 --preset rfc

Customization cookbook

chunkweaver is designed to be adapted to any text type. Here are recipes for common scenarios.

Chat logs & customer support transcripts

Chat text is informal — no uppercase after periods, no paragraph structure. The default sentence regex ([.!?]\s+(?=[A-Z"(])) won't split it well. Use the CHAT preset for turn-level boundaries and SENTENCE_END_PERMISSIVE for overlap that works with lowercase text:

from chunkweaver import Chunker, SENTENCE_END_PERMISSIVE
from chunkweaver.presets import CHAT

chunker = Chunker(
    target_size=512,
    overlap=1,
    overlap_unit="sentence",
    boundaries=CHAT,
    sentence_pattern=SENTENCE_END_PERMISSIVE,
    min_size=0,
)

chat_log = """[14:30] Agent: Welcome to support. How can I help?
[14:31] Customer: My order hasn't arrived. It's been 10 days.
[14:32] Agent: I'm sorry to hear that. Let me look into it.
[14:33] Customer: The order number is 12345.
[14:34] Agent: I see it was shipped Jan 5. It appears to be delayed."""

chunks = chunker.chunk(chat_log)
# Each speaker turn becomes its own chunk

Chinese / Japanese / Korean text

CJK languages use different sentence-ending punctuation (。！？). The default regex won't detect these:

from chunkweaver import Chunker, SENTENCE_END_CJK

chunker = Chunker(
    target_size=512,
    overlap=1,
    overlap_unit="sentence",
    sentence_pattern=SENTENCE_END_CJK,
)

text = "第一条规定了保护范围。第二条界定了适用条件。第三条明确了领土管辖权。"
chunks = chunker.chunk(text)

For mixed-language documents, use a combined pattern:

import re

chunker = Chunker(
    target_size=512,
    overlap=1,
    sentence_pattern=re.compile(r'([.!?。！？])(\s*)'),
)

Healthcare / clinical notes

Discharge summaries and clinical notes have predictable section headers. The CLINICAL preset recognizes CHIEF COMPLAINT:, HPI:, ASSESSMENT:, PLAN:, and many more:

from chunkweaver import Chunker
from chunkweaver.presets import CLINICAL

chunker = Chunker(
    target_size=1024,
    overlap=1,
    overlap_unit="sentence",
    boundaries=CLINICAL,
    min_size=50,   # merge very short sections like "ALLERGIES: NKDA"
)

note = """CHIEF COMPLAINT: Chest pain and shortness of breath.
HPI: 65-year-old male presenting with acute onset chest pain.
ASSESSMENT: Acute coronary syndrome, rule out MI.
PLAN: Admit to telemetry. Serial troponins q6h."""

chunks = chunker.chunk(note)
# Each clinical section stays intact

Financial documents (tables + headings)

The biggest problems with financial document chunking: tables get split in half, and section headings get separated from their content.

Option A — regex keep_together (simple, for known table markers):

from chunkweaver import Chunker
from chunkweaver.presets import FINANCIAL, FINANCIAL_TABLE

chunker = Chunker(
    target_size=1024,
    boundaries=FINANCIAL + FINANCIAL_TABLE,
    keep_together=[r"^TABLE\s+\d+"],
)

Option B — heuristic detectors (discovers structure automatically):

from chunkweaver import Chunker
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector
from chunkweaver.presets import FINANCIAL

chunker = Chunker(
    target_size=1024,
    boundaries=FINANCIAL,
    detectors=[HeadingDetector(), TableDetector()],
)

Option B finds headings by casing/whitespace patterns and tables by numeric run detection — no regex tuning needed. On SEC 10-K filings, TableDetector keeps 80% of financial tables intact vs. 21% without it.

US contracts & legal filings

US legal documents use §, Section, WHEREAS, and numbered clauses:

from chunkweaver import Chunker
from chunkweaver.presets import LEGAL_US

chunker = Chunker(
    target_size=1024,
    overlap=2,
    boundaries=LEGAL_US,
)

contract = """WHEREAS, the parties wish to enter into an agreement;
WHEREAS, the terms have been negotiated in good faith;
NOW, THEREFORE the parties agree as follows:
Section 1 Definitions.
1.1 "Agreement" means this document.
Section 2 Obligations.
§ 3 Governing law."""

chunks = chunker.chunk(contract)

Custom boundaries for any domain

You're not limited to presets. Any regex that matches line starts works:

# Jupyter notebooks (markdown cells)
boundaries = [r"^# In\[\d+\]", r"^#{1,3}\s"]

# Log files
boundaries = [r"^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}"]

# Email threads
boundaries = [r"^From:", r"^On .+ wrote:", r"^>+ On"]

# LaTeX
boundaries = [r"^\\section\{", r"^\\subsection\{", r"^\\chapter\{"]

# reStructuredText
boundaries = [r"^={3,}\s*$", r"^-{3,}\s*$", r"^\.\.\s+\w+::"]

chunker = Chunker(target_size=1024, boundaries=boundaries)

Tuning tips

Choosing target_size: Larger chunks = better retrieval but fewer results per query. Start with 1024 for dense prose, 512 for short-form content (chat, clinical notes), 2048 for legal/technical documents with long sections.

Choosing overlap: 2 sentences is a good default. Use 0 when chunks are already small or when you need exact deduplication. Use overlap_unit="chars" with overlap=100 for predictable sizing.

Choosing min_size: Set to 0 when every boundary should produce a chunk (e.g., chat turns). Set to 200+ when standalone headings should merge with their body text.

Debugging boundaries: Use --detect-boundaries on the CLI to preview what your patterns match before chunking:

chunkweaver doc.txt --detect-boundaries --boundaries "^Article\s+\d+"
# line 5: [^Article\s+\d+] 'Article 1'
# line 23: [^Article\s+\d+] 'Article 2'

How it works

Run detectors — heuristic detectors analyze the full text and produce split points + keep-together regions
Detect boundaries — scan each line against your regex patterns, merge with detector split points, suppress splits inside keep-together regions
Split at boundaries — create one segment per structural section
Isolate protected regions — carve keep-together regions (tables) into their own segments
Sub-split oversized segments — break large sections at paragraph → sentence → word boundaries; allow protected regions to overshoot target_size
Merge undersized segments — combine tiny segments (like standalone headings) with their body text
Add overlap — prepend the last N sentences/paragraphs/chars from the previous chunk
Return — chunks with full metadata (offsets, boundary type, overlap tracking)

Heuristic detectors

For documents without clean regex-matchable section markers — SEC filings, scanned contracts, extracted PDFs — chunkweaver provides heuristic detectors that discover structure from text patterns.

HeadingDetector

Scores each line on multiple signals (casing, length, whitespace context, known prefixes) and emits split points at probable headings.

from chunkweaver import Chunker
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.presets import FINANCIAL

chunker = Chunker(
    target_size=1024,
    boundaries=FINANCIAL,
    detectors=[HeadingDetector(min_score=4.0)],
)

Works well on documents with Title Case or ALL CAPS headings — SEC filings, legal contracts, government reports, technical manuals.

TableDetector

Identifies runs of numeric data lines, extends backward to include column headers, and marks them as keep-together regions. The chunker will not split inside a protected table (allowing up to 1.5x target_size overshoot).

from chunkweaver import Chunker
from chunkweaver.detector_table import TableDetector
from chunkweaver.presets import FINANCIAL

chunker = Chunker(
    target_size=1024,
    boundaries=FINANCIAL,
    detectors=[TableDetector()],
)

On SEC 10-K filings, TableDetector keeps 80% of financial tables intact (vs. 21% without it).

Custom detectors

Implement the BoundaryDetector ABC to add your own structure detection:

from chunkweaver import BoundaryDetector, SplitPoint, KeepTogetherRegion

class MyDetector(BoundaryDetector):
    def detect(self, text):
        results = []
        # Emit SplitPoint where you want chunk breaks
        # Emit KeepTogetherRegion for ranges that must stay whole
        return results

chunker = Chunker(
    target_size=1024,
    detectors=[MyDetector()],
)

Architecture

chunkweaver/
├── __init__.py            # Public API: Chunker, Chunk, detectors, sentence patterns
├── chunker.py             # Core algorithm + detector integration
├── detectors.py           # BoundaryDetector ABC, SplitPoint, KeepTogetherRegion
├── detector_heading.py    # HeadingDetector — heuristic heading detection
├── detector_table.py      # TableDetector — financial table keep-together
├── models.py              # Chunk dataclass
├── boundaries.py          # Regex boundary detection engine
├── sentences.py           # Configurable sentence splitting (regex, no NLP)
├── presets.py             # 9 domain presets (legal, clinical, chat, etc.)
├── cli.py                 # CLI entry point
└── integrations/
    ├── langchain.py       # LangChain TextSplitter wrapper
    └── llamaindex.py      # LlamaIndex NodeParser wrapper

Design principles:

Each module has a single responsibility
No deeply nested conditionals — small, testable functions
All decisions are logged/exposed via chunk metadata
Zero dependencies for core; optional extras for CLI and integrations
Detectors are composable — stack any combination without conflicts

API reference

`Chunker(target_size, overlap, overlap_unit, boundaries, fallback, min_size, sentence_pattern, keep_together, detectors)`

Parameter	Type	Default	Description
`target_size`	`int`	`1024`	Target chunk size in characters
`overlap`	`int`	`2`	Number of overlap units from previous chunk
`overlap_unit`	`str`	`"sentence"`	`"sentence"`, `"paragraph"`, or `"chars"`
`boundaries`	`list[str]`	`[]`	Regex patterns marking section starts
`fallback`	`str`	`"paragraph"`	Sub-split strategy: `"paragraph"`, `"sentence"`, `"word"`
`min_size`	`int`	`200`	Minimum chunk size (merge smaller segments)
`sentence_pattern`	`str \| Pattern \| None`	`None`	Custom regex for sentence detection (default: English)
`keep_together`	`list[str] \| None`	`None`	Patterns for lines that must stay with next segment
`detectors`	`list[BoundaryDetector] \| None`	`None`	Heuristic detectors for structure discovery

`Chunker.chunk(text) → list[str]`

Returns a list of chunk strings.

`Chunker.chunk_with_metadata(text) → list[Chunk]`

Returns a list of Chunk objects with full metadata.

`Chunk`

Attribute	Type	Description
`text`	`str`	Full chunk content (including overlap)
`start`	`int`	Start offset in original text (excluding overlap)
`end`	`int`	End offset in original text
`index`	`int`	Zero-based chunk index
`boundary_type`	`str`	What triggered the split
`overlap_text`	`str`	The overlap prefix
`content_text`	`str`	Text without overlap (property)

Known limitations

Sentence detection defaults to a simple regex ([.!?]\s+(?=[A-Z"(])). Abbreviations like "Dr. Smith" may cause false splits. For non-English or informal text, pass sentence_pattern — built-in alternatives: SENTENCE_END_CJK, SENTENCE_END_PERMISSIVE.
Boundaries are line-level regex matches — they won't detect inline structural markers.
No tokenizer awareness — target_size is in characters, not tokens. For token budgets, estimate tokens ≈ chars / 4.
Flat hierarchy — all boundary patterns are equal. A (1) inside Article 5 matches the same as (1) at document level. For deeply nested structures, consider scoping your patterns more tightly.

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.4.0

Mar 25, 2026

0.2.3

Mar 24, 2026

0.2.2

Mar 24, 2026

This version

0.2.1

Mar 24, 2026

0.2.0

Mar 24, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunkweaver-0.2.1.tar.gz (37.4 kB view details)

Uploaded Mar 24, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

chunkweaver-0.2.1-py3-none-any.whl (29.8 kB view details)

Uploaded Mar 24, 2026 Python 3

File details

Details for the file chunkweaver-0.2.1.tar.gz.

File metadata

Download URL: chunkweaver-0.2.1.tar.gz
Upload date: Mar 24, 2026
Size: 37.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for chunkweaver-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`4fd48aa3e55c4641888074e141d0920800d3f23306d08a2a9660c8942282b03d`
MD5	`a3b63c05469b39ff623cb294dadc54a6`
BLAKE2b-256	`dee3725343b9af39980be211db5f147db2dea71d6814d692cb3783809555f734`

See more details on using hashes here.

File details

Details for the file chunkweaver-0.2.1-py3-none-any.whl.

File metadata

Download URL: chunkweaver-0.2.1-py3-none-any.whl
Upload date: Mar 24, 2026
Size: 29.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.2

File hashes

Hashes for chunkweaver-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b5e5a7b0f463edb7ea77c2b32b484b38756473accf615aa489243042734d3764`
MD5	`e264e32c7295c7229c2611e371690fe7`
BLAKE2b-256	`c02e3bb154e1e732bfcbe9daaf94b5f08bd24f98e90b7c1677b1abbc7bfdef7f`

See more details on using hashes here.

chunkweaver 0.2.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

chunkweaver

Where chunkweaver fits

The problem

What chunkweaver does

Install

Quick start

Vector DB integration

Presets

LangChain integration

LlamaIndex integration

CLI

Customization cookbook

Chat logs & customer support transcripts

Chinese / Japanese / Korean text

Healthcare / clinical notes

Financial documents (tables + headings)

US contracts & legal filings

Custom boundaries for any domain

Tuning tips

How it works

Heuristic detectors

HeadingDetector

TableDetector

Custom detectors

Architecture

API reference

Chunker(target_size, overlap, overlap_unit, boundaries, fallback, min_size, sentence_pattern, keep_together, detectors)

Chunker.chunk(text) → list[str]

Chunker.chunk_with_metadata(text) → list[Chunk]

Chunk

Known limitations

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`Chunker(target_size, overlap, overlap_unit, boundaries, fallback, min_size, sentence_pattern, keep_together, detectors)`

`Chunker.chunk(text) → list[str]`

`Chunker.chunk_with_metadata(text) → list[Chunk]`

`Chunk`