# chunkweaver

RAG chunker that respects document structure.
## Install

```
pip install chunkweaver
```

Extras:

```
pip install chunkweaver[langchain]   # LangChain TextSplitter integration
pip install chunkweaver[llamaindex]  # LlamaIndex NodeParser integration
pip install chunkweaver[cli]         # CLI (included in base install)
```
## Quick start

```python
from chunkweaver import Chunker

# Minimal — just better defaults
chunker = Chunker(target_size=1024)
chunks = chunker.chunk(text)

# With structure-aware boundaries
from chunkweaver.presets import LEGAL_EU

chunker = Chunker(
    target_size=1024,
    overlap=2,
    overlap_unit="sentence",
    boundaries=LEGAL_EU,
)
chunks = chunker.chunk(text)

# With metadata
chunks = chunker.chunk_with_metadata(text)
for c in chunks:
    print(c.text)           # full chunk content
    print(c.start, c.end)   # character offsets in original
    print(c.boundary_type)  # "section" | "paragraph" | "sentence" | "word" | "keep_together"
    print(c.overlap_text)   # the overlap prefix (if any)
    print(c.content_text)   # text without overlap (for dedup)
```
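The `content_text` field makes incremental re-ingestion cheap: hash it to detect which chunks actually changed. A minimal sketch (the `Chunk` namedtuple below is a stand-in with the attributes documented above, not the library's actual class):

```python
import hashlib
from collections import namedtuple

# Stand-in for the chunk metadata shape described above (illustrative,
# not the library's real class).
Chunk = namedtuple("Chunk", ["text", "content_text", "overlap_text"])

def content_hash(chunk):
    """Stable ID derived from a chunk's non-overlap content."""
    return hashlib.sha256(chunk.content_text.encode("utf-8")).hexdigest()

def chunks_to_upsert(chunks, already_stored):
    """Return only chunks whose content hash is not yet in the store."""
    return [c for c in chunks if content_hash(c) not in already_stored]

old = Chunk("intro. body A", "body A", "intro. ")
new = Chunk("intro. body B", "body B", "intro. ")
stored = {content_hash(old)}
todo = chunks_to_upsert([old, new], stored)  # only the changed chunk remains
```

Because the overlap prefix is excluded from the hash, a change in chunk N does not force re-embedding of chunk N+1 whose only difference is the inherited overlap.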
## Where chunkweaver fits

```
PDF / DOCX / HTML                        Your vector DB
      │                                        ▲
      ▼                                        │
┌──────────────┐      ┌──────────────┐         │
│  Extractor   │─────▶│  chunkweaver │─────────┘
│              │      │              │
│ unstructured │      │  boundaries  │  Embed + upsert
│ marker-pdf   │      │  detectors   │  into Pinecone,
│ docling      │      │  presets     │  Qdrant, Weaviate,
│ pdfminer     │      │  overlap     │  ChromaDB, etc.
└──────────────┘      └──────────────┘
```
Your extractor turns files into text. Your vector DB stores embeddings. chunkweaver sits in the middle — splitting that text at structural boundaries so each chunk is a coherent unit of meaning, not an arbitrary slice of characters.
## Why it matters

Standard chunkers, including LangChain's RecursiveCharacterTextSplitter, don't know that "Article 17" starts a new legal section, or that a table's header row belongs with its data. The result: chunks that straddle topic boundaries, producing blurry embeddings and incomplete retrievals.

The fix is cheap: surface markers (headings, article numbers, table rules) are reliable proxies for where topics change. Detecting them costs O(n) character comparisons — orders of magnitude less than computing semantic boundaries from embeddings.
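That linear scan can be pictured in a few lines (a standalone sketch of the idea, not the library's internals): one pass over the lines, one regex match per pattern per line.

```python
import re

def find_boundaries(text, patterns):
    """Return character offsets where any pattern matches at a line start."""
    compiled = [re.compile(p) for p in patterns]
    offsets, pos = [], 0
    for line in text.splitlines(keepends=True):
        if any(rx.match(line) for rx in compiled):
            offsets.append(pos)
        pos += len(line)  # offsets stay anchored to the original text
    return offsets

doc = "Article 1\nScope.\nArticle 2\nDefinitions.\n"
print(find_boundaries(doc, [r"^Article \d+"]))  # [0, 17]
```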
Our LLM-as-judge benchmark on 11 documents across four domains and 58 queries:
| Baseline | chunkweaver wins | Baseline wins | p-value |
|---|---|---|---|
| Naive 600-char | 15 | 4 | 0.019 |
| LangChain RCTS | 11 | 4 | 0.119 |
See benchmark/ for full results, methodology, and reproduction steps.
## Features

- Zero dependencies — stdlib only, no LangChain/LlamaIndex tax
- Regex boundaries — you tell the chunker where sections start (`^Article \d+`, `^##`, `^Item 1.`)
- Hierarchical levels — CHAPTER > Section > Article > clause; split only as deep as needed
- Heuristic detectors — `HeadingDetector`, `TableDetector` discover structure from text patterns
- Annotation ingestion — accept pre-computed structure from any extractor
- Semantic overlap — sentences, not characters
- Full metadata — offsets, boundary types, hierarchy levels, overlap tracking
- Integrations — LangChain and LlamaIndex drop-ins
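The "semantic overlap" bullet means the overlap window is measured in sentences rather than characters, so a chunk never starts mid-sentence. A sketch of the idea (illustrative only, using a naive splitter; the library's sentence pattern is configurable):

```python
import re

# Naive sentence boundary: terminal punctuation followed by whitespace.
SENT = re.compile(r'(?<=[.!?])\s+')

def with_sentence_overlap(chunks, overlap=2):
    """Prefix each chunk with the last `overlap` sentences of its predecessor."""
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        tail = SENT.split(prev.strip())[-overlap:]
        out.append(" ".join(tail) + " " + cur)
    return out

parts = ["One. Two. Three.", "Four. Five."]
print(with_sentence_overlap(parts, overlap=2))
# ['One. Two. Three.', 'Two. Three. Four. Five.']
```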
## Presets

Built-in boundary patterns for common document types:

```python
from chunkweaver.presets import (
    LEGAL_EU, LEGAL_US, RFC, MARKDOWN,
    CHAT, CLINICAL, FINANCIAL, FINANCIAL_TABLE,
    SEC_10K, FDA_LABEL, PLAIN,
)
```
| Preset | Domain | Detects |
|---|---|---|
| `LEGAL_EU` | EU legislation | `Article N`, `CHAPTER`, `SECTION`, `(1)` recitals |
| `LEGAL_US` | US law / contracts | `§ N`, `Section N`, `WHEREAS`, `1.1` clauses |
| `RFC` | IETF RFCs | `1. Intro`, `3.1 Overview`, `Appendix A` |
| `MARKDOWN` | Markdown | `#` headings, `---` rules |
| `CHAT` | Chat logs | `[14:30]`, ISO timestamps, `speaker:` turns |
| `CLINICAL` | Medical notes | `HPI:`, `ASSESSMENT:`, `PLAN:`, etc. |
| `FINANCIAL` | SEC filings | `Item 1.`, `PART I`, `NOTE 1`, `Schedule A` |
| `FINANCIAL_TABLE` | Data tables | `TABLE N`, markdown/ASCII separators |
| `SEC_10K` | SEC annual reports | PART I–IV, `Item N.`, ALL-CAPS sub-headings |
| `FDA_LABEL` | Drug labels | `1 INDICATIONS`, `## 2.1 Adult Dosage` |
| `PLAIN` | Any | No boundaries — pure paragraph/sentence fallback |
Combine presets freely:

```python
boundaries = LEGAL_EU + [r"^TABLE\s+", r"^Annex\s+"]
boundaries = FINANCIAL + FINANCIAL_TABLE
```
Most presets have `_LEVELED` variants with hierarchical splits: `LEGAL_EU_LEVELED`, `LEGAL_US_LEVELED`, `RFC_LEVELED`, `MARKDOWN_LEVELED`, `FINANCIAL_LEVELED`, `SEC_10K_LEVELED`, `FDA_LABEL_LEVELED`. See docs/cookbook.md for details.
## CLI

```
chunkweaver document.txt --size 1024 --overlap 2
chunkweaver legal_doc.txt --preset legal-eu --format json
chunkweaver file.txt --detect-boundaries --boundaries "^Article\s+\d+"
cat document.txt | chunkweaver --size 1024 --preset rfc
chunkweaver --recommend my_document.txt
chunkweaver --inspect my_document.txt
```
The `--recommend` flag analyzes a document and suggests the right config:

```
=== chunkweaver recommend ===
Document: 12,340 chars, 380 lines, 45 paragraphs

--- Preset matching ---
legal-eu    8 hits   <-- best

--- Detectors ---
HeadingDetector: YES (12 headings found)
TableDetector:   YES (2 tables found)

--- Python snippet ---
from chunkweaver import Chunker
from chunkweaver.presets import LEGAL_EU
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector

chunker = Chunker(
    target_size=1024,
    overlap=2,
    boundaries=LEGAL_EU,
    detectors=[HeadingDetector(), TableDetector()],
)
```
## Integrations

### LangChain

Drop-in replacement for `RecursiveCharacterTextSplitter`:

```python
from chunkweaver.integrations.langchain import ChunkWeaverSplitter

splitter = ChunkWeaverSplitter(
    target_size=1024,
    overlap=2,
    boundaries=[r"^#{1,3}\s"],
)
docs = splitter.create_documents([text])
```

Requires: `pip install chunkweaver[langchain]`
### LlamaIndex

Drop-in NodeParser for ingestion pipelines:

```python
from chunkweaver.integrations.llamaindex import ChunkWeaverNodeParser
from chunkweaver.presets import LEGAL_EU

parser = ChunkWeaverNodeParser(
    target_size=1024,
    boundaries=LEGAL_EU,
    overlap=2,
)
nodes = parser.get_nodes_from_documents(documents)
```

Requires: `pip install chunkweaver[llamaindex]`
## Heuristic detectors

For documents without clean section markers — SEC filings, scanned contracts, extracted PDFs — heuristic detectors discover structure from text patterns.

```python
from chunkweaver import Chunker
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector

chunker = Chunker(
    target_size=1024,
    detectors=[HeadingDetector(), TableDetector()],
)
```
`HeadingDetector` scores lines on casing, length, whitespace context, and known prefixes. It works well on Title Case and ALL CAPS headings.
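The scoring idea can be approximated in a few lines (an illustrative heuristic combining the signals named above, not the library's actual scorer or thresholds):

```python
def looks_like_heading(line):
    """Crude heading score from casing, length, prefix, and punctuation cues."""
    s = line.strip()
    if not s or len(s) > 80:          # headings are short
        return False
    score = 0
    if s.isupper():
        score += 2                    # ALL CAPS
    if s.istitle():
        score += 1                    # Title Case
    if any(s.startswith(p) for p in ("Section", "Chapter", "Article", "PART")):
        score += 2                    # known structural prefix
    if not s.endswith((".", ",", ";")):
        score += 1                    # headings rarely end in punctuation
    return score >= 3

print(looks_like_heading("RISK FACTORS"))          # True
print(looks_like_heading("Section 3 Definitions")) # True
```

A real detector would also weigh surrounding blank lines (the "whitespace context" signal), which a line-local function cannot see.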
`TableDetector` identifies runs of numeric data and marks them as keep-together regions. On SEC 10-K filings it keeps 80% of financial tables intact, vs. 21% without it.
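A rough stand-in for this kind of detector (illustrative only, not the shipped implementation): flag consecutive numeric-heavy lines as a region the chunker must not split.

```python
import re

NUMERIC = re.compile(r'[\d$%.,()-]')  # characters typical of financial tables

def numeric_ratio(line):
    """Fraction of non-space characters that look numeric/tabular."""
    stripped = line.replace(" ", "")
    return sum(bool(NUMERIC.match(ch)) for ch in stripped) / max(len(stripped), 1)

def table_regions(lines, threshold=0.5, min_run=3):
    """Yield (start, end) line-index ranges that look like data tables."""
    regions, run_start = [], None
    for i, line in enumerate(lines + [""]):      # sentinel flushes final run
        if line.strip() and numeric_ratio(line) >= threshold:
            run_start = i if run_start is None else run_start
        else:
            if run_start is not None and i - run_start >= min_run:
                regions.append((run_start, i))
            run_start = None
    return regions
```

Note this simple version misses the header row ("Revenue breakdown:" scores as prose); keeping headers attached to their data is exactly the harder part the real detector handles.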
Custom detectors: subclass `BoundaryDetector` and return `SplitPoint` / `KeepTogetherRegion` objects. See examples/ml-detectors/ for scikit-learn examples.
## Documentation

- Cookbook — domain recipes (clinical, FDA, SEC, financial, legal, chat, CJK), hierarchical boundaries, annotation ingestion, vector DB integration, tuning tips
- API Reference — full parameter tables, `Chunk` attributes, algorithm details, architecture
- FAQ — "What if I need to..." answers for common questions
- Benchmark — LLM-as-judge methodology, reproduction steps, raw results
## Ecosystem
Part of a RAG tools suite for retrieval quality:
| Tool | Role | What it measures |
|---|---|---|
| chunkweaver | Ingestion | Structure-aware chunking — controls what text enters the prompt |
| ragtune | Evaluation | Retrieval metrics (Recall@K, MRR, bootstrap CI) — measures how well your pipeline retrieves |
| ragprobe | Pre-deployment | Domain difficulty analysis — predicts how hard retrieval will be before you build |
```
chunkweaver --recommend my_doc.txt        # what config to use
chunkweaver my_doc.txt --preset legal-eu \
    --export-dir ./chunks/                # chunk → one .txt per chunk
ragtune ingest ./chunks/ --pre-chunked    # embed + store
ragtune simulate --queries golden.json    # measure retrieval
ragprobe analyze --corpus ./docs          # how hard is this domain
```
`--export-dir` writes ragtune-compatible files directly — no glue script needed. Use `--format json` with `--recommend` or `--inspect` for CI-friendly output.
## Known limitations

- Sentence detection defaults to a simple regex (`[.!?]\s+(?=[A-Z"(])`). Abbreviations like "Dr. Smith" may cause false splits. For non-English or informal text, pass `sentence_pattern` — built-in alternatives: `SENTENCE_END_CJK`, `SENTENCE_END_PERMISSIVE`. See the cookbook for Cyrillic, Spanish, and other script-specific guidance.
- Boundaries are line-level regex matches — they won't detect inline structural markers.
- No tokenizer awareness — `target_size` is in characters, not tokens. For token budgets, estimate `tokens ≈ chars / 4`.
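The chars-per-token rule of thumb above translates a model's token budget into a character `target_size` (a sketch; the real ratio varies by tokenizer and language, so leave headroom):

```python
CHARS_PER_TOKEN = 4  # rough average for English text; varies by tokenizer

def target_size_for_tokens(max_tokens, safety=0.9):
    """Character target that should stay under `max_tokens` with headroom."""
    return int(max_tokens * CHARS_PER_TOKEN * safety)

print(target_size_for_tokens(512))  # 1843
```

For hard token limits, verify the resulting chunks with your model's actual tokenizer rather than trusting the estimate.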
## Author

Oleksii Alexapolsky (𝕏) — building retrieval quality tools: chunkweaver, ragtune, ragprobe.

## License

MIT