
chunkweaver

RAG chunker that respects document structure.

Python 3.9+ · MIT license

Install

pip install chunkweaver

Extras:

pip install chunkweaver[langchain]   # LangChain TextSplitter integration
pip install chunkweaver[llamaindex]  # LlamaIndex NodeParser integration
pip install chunkweaver[cli]         # CLI (included in base install)

Quick start

from chunkweaver import Chunker

# Minimal — just better defaults
chunker = Chunker(target_size=1024)
chunks = chunker.chunk(text)

# With structure-aware boundaries
from chunkweaver.presets import LEGAL_EU

chunker = Chunker(
    target_size=1024,
    overlap=2,
    overlap_unit="sentence",
    boundaries=LEGAL_EU,
)

chunks = chunker.chunk(text)

# With metadata
chunks = chunker.chunk_with_metadata(text)
for c in chunks:
    print(c.text)            # full chunk content
    print(c.start, c.end)    # character offsets in original
    print(c.boundary_type)   # "section" | "paragraph" | "sentence" | "word" | "keep_together"
    print(c.overlap_text)    # the overlap prefix (if any)
    print(c.content_text)    # text without overlap (for dedup)
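
The overlap=2, overlap_unit="sentence" setting above repeats whole sentences across chunk borders rather than a fixed character count. A rough stdlib sketch of the idea (illustrative only — add_sentence_overlap is a hypothetical stand-in, not part of chunkweaver):

```python
import re

def add_sentence_overlap(chunks, n=2):
    """Prefix each chunk with the last n sentences of the previous chunk."""
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        # naive sentence split for the demo; chunkweaver has its own logic
        sentences = re.split(r'(?<=[.!?])\s+', prev.strip())
        overlap = " ".join(sentences[-n:])
        out.append(overlap + " " + cur)
    return out

chunks = ["First point. Second point. Third point.", "New topic begins here."]
print(add_sentence_overlap(chunks, n=2))
```

Sentence-level overlap keeps the carried-over context grammatically whole, which is why content_text (chunk minus overlap) is useful for deduplication downstream.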

Where chunkweaver fits

  PDF / DOCX / HTML                  Your vector DB
        │                                   ▲
        ▼                                   │
  ┌──────────────┐    ┌──────────────┐      │
  │  Extractor   │───▶│ chunkweaver  │──────┘
  │              │    │              │
  │ unstructured │    │ boundaries   │  Embed + upsert
  │ marker-pdf   │    │ detectors    │  into Pinecone,
  │ docling      │    │ presets      │  Qdrant, Weaviate,
  │ pdfminer     │    │ overlap      │  ChromaDB, etc.
  └──────────────┘    └──────────────┘

Your extractor turns files into text. Your vector DB stores embeddings. chunkweaver sits in the middle — splitting that text at structural boundaries so each chunk is a coherent unit of meaning, not an arbitrary slice of characters.

Why it matters

Standard chunkers, including LangChain's RecursiveCharacterTextSplitter, don't know that "Article 17" starts a new legal section, or that a table's header row belongs with its data. The result: chunks that straddle topic boundaries, producing blurry embeddings and incomplete retrievals.

The fix is cheap: surface markers (headings, article numbers, table rules) are reliable proxies for where topics change. Detecting them costs O(n) character comparisons — orders of magnitude less than computing semantic boundaries from embeddings.
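
To make the cost claim concrete, a single-pass line scan is all that boundary detection requires (an illustrative sketch, not chunkweaver's implementation — find_boundaries is a made-up name): each line is tested once against the patterns and the character offset of every match is recorded.

```python
import re

def find_boundaries(text, patterns):
    """Return character offsets of lines matching any boundary pattern."""
    compiled = [re.compile(p) for p in patterns]
    offsets, pos = [], 0
    for line in text.splitlines(keepends=True):
        if any(rx.match(line) for rx in compiled):
            offsets.append(pos)
        pos += len(line)       # one pass, O(n) in document length
    return offsets

doc = "Article 1\nScope.\nArticle 2\nDefinitions.\n"
print(find_boundaries(doc, [r"^Article \d+"]))  # → [0, 17]
```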

Our LLM-as-judge benchmark on 11 documents across four domains and 58 queries:

Baseline         CW wins  Baseline wins  p-value
Naive 600-char   15       4              0.019
LangChain RCTS   11       4              0.119

See benchmark/ for full results, methodology, and reproduction steps.

Features

  • Zero dependencies — stdlib only, no LangChain/LlamaIndex tax
  • Regex boundaries — you tell the chunker where sections start (^Article \d+, ^## , ^Item 1.)
  • Hierarchical levels — CHAPTER > Section > Article > clause; split only as deep as needed
  • Heuristic detectors — HeadingDetector, TableDetector discover structure from text patterns
  • Annotation ingestion — accept pre-computed structure from any extractor
  • Semantic overlap — sentences, not characters
  • Full metadata — offsets, boundary types, hierarchy levels, overlap tracking
  • Integrations — LangChain and LlamaIndex drop-ins
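
The hierarchical-levels idea can be sketched in a few lines (illustrative only; split_hierarchical is a hypothetical stand-in for the library's leveled splitting): split at the coarsest pattern first, and recurse into finer levels only for pieces that still exceed the target size.

```python
import re

def split_hierarchical(text, levels, target_size):
    """Split at the coarsest level; recurse into pieces still over target_size."""
    if len(text) <= target_size or not levels:
        return [text]
    # split immediately before each match of the current level's pattern
    parts = [p for p in re.split(f"(?m)(?={levels[0]})", text) if p]
    out = []
    for part in parts:
        if len(part) > target_size:
            out.extend(split_hierarchical(part, levels[1:], target_size))
        else:
            out.append(part)
    return out

doc = "CHAPTER I\nArticle 1\naaaa\nArticle 2\nbbbb\n"
print(split_hierarchical(doc, [r"^CHAPTER ", r"^Article "], target_size=20))
```

Here the CHAPTER level alone leaves a 40-character piece over the target, so the sketch descends one level and splits at Article boundaries instead.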

Presets

Built-in boundary patterns for common document types:

from chunkweaver.presets import (
    LEGAL_EU, LEGAL_US, RFC, MARKDOWN,
    CHAT, CLINICAL, FINANCIAL, FINANCIAL_TABLE,
    SEC_10K, FDA_LABEL, PLAIN,
)
Preset           Domain              Detects
LEGAL_EU         EU legislation      Article N, CHAPTER, SECTION, (1) recitals
LEGAL_US         US law / contracts  § N, Section N, WHEREAS, 1.1 clauses
RFC              IETF RFCs           1. Intro, 3.1 Overview, Appendix A
MARKDOWN         Markdown            # headings, --- rules
CHAT             Chat logs           [14:30], ISO timestamps, speaker: turns
CLINICAL         Medical notes       HPI:, ASSESSMENT:, PLAN:, etc.
FINANCIAL        SEC filings         Item 1., PART I, NOTE 1, Schedule A
FINANCIAL_TABLE  Data tables         TABLE N, markdown/ASCII separators
SEC_10K          SEC annual reports  PART I–IV, Item N., ALL-CAPS sub-headings
FDA_LABEL        Drug labels         1 INDICATIONS, ## 2.1 Adult Dosage
PLAIN            Any                 No boundaries — pure paragraph/sentence fallback

Combine presets freely:

boundaries = LEGAL_EU + [r"^TABLE\s+", r"^Annex\s+"]
boundaries = FINANCIAL + FINANCIAL_TABLE

Most presets have _LEVELED variants with hierarchical splits: LEGAL_EU_LEVELED, LEGAL_US_LEVELED, RFC_LEVELED, MARKDOWN_LEVELED, FINANCIAL_LEVELED, SEC_10K_LEVELED, FDA_LABEL_LEVELED. See docs/cookbook.md for details.

CLI

chunkweaver document.txt --size 1024 --overlap 2
chunkweaver legal_doc.txt --preset legal-eu --format json
chunkweaver file.txt --detect-boundaries --boundaries "^Article\s+\d+"
cat document.txt | chunkweaver --size 1024 --preset rfc
chunkweaver --recommend my_document.txt
chunkweaver --inspect my_document.txt

The --recommend flag analyzes a document and suggests the right config:

=== chunkweaver recommend ===

Document: 12,340 chars, 380 lines, 45 paragraphs

--- Preset matching ---
  legal-eu                8 hits <-- best

--- Detectors ---
  HeadingDetector: YES (12 headings found)
  TableDetector:   YES (2 tables found)

--- Python snippet ---
from chunkweaver import Chunker
from chunkweaver.presets import LEGAL_EU
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector

chunker = Chunker(
    target_size=1024,
    overlap=2,
    boundaries=LEGAL_EU,
    detectors=[HeadingDetector(), TableDetector()],
)

Integrations

LangChain

Drop-in replacement for RecursiveCharacterTextSplitter:

from chunkweaver.integrations.langchain import ChunkWeaverSplitter

splitter = ChunkWeaverSplitter(
    target_size=1024,
    overlap=2,
    boundaries=[r"^#{1,3}\s"],
)
docs = splitter.create_documents([text])

Requires: pip install chunkweaver[langchain]

LlamaIndex

Drop-in NodeParser for ingestion pipelines:

from chunkweaver.integrations.llamaindex import ChunkWeaverNodeParser
from chunkweaver.presets import LEGAL_EU

parser = ChunkWeaverNodeParser(
    target_size=1024,
    boundaries=LEGAL_EU,
    overlap=2,
)
nodes = parser.get_nodes_from_documents(documents)

Requires: pip install chunkweaver[llamaindex]

Heuristic detectors

For documents without clean section markers — SEC filings, scanned contracts, extracted PDFs — heuristic detectors discover structure from text patterns.

from chunkweaver import Chunker
from chunkweaver.detector_heading import HeadingDetector
from chunkweaver.detector_table import TableDetector

chunker = Chunker(
    target_size=1024,
    detectors=[HeadingDetector(), TableDetector()],
)

HeadingDetector scores lines on casing, length, whitespace context, and known prefixes. Works well on Title Case and ALL CAPS headings.

TableDetector identifies numeric data runs and marks them as keep-together regions. On SEC 10-K filings, keeps 80% of financial tables intact vs. 21% without it.
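
One way to approximate that behavior (a hedged sketch, not TableDetector's actual algorithm): treat runs of consecutive lines that are mostly digits and amount punctuation as a single keep-together region.

```python
def numeric_line(line, threshold=0.4):
    """True if a non-empty line is mostly digits/amount punctuation."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    numeric = sum(c.isdigit() or c in ".,$%()-" for c in chars)
    return numeric / len(chars) >= threshold

def keep_together_regions(lines, min_rows=2):
    """Return (start, end) line ranges of consecutive numeric-heavy lines."""
    regions, start = [], None
    for i, line in enumerate(lines + [""]):       # sentinel flushes last run
        if numeric_line(line):
            start = i if start is None else start
        elif start is not None:
            if i - start >= min_rows:
                regions.append((start, i))
            start = None
    return regions

table = ["Revenue", "2022   1,204", "2023   1,530", "Net income rose."]
print(keep_together_regions(table))  # → [(1, 3)]
```

A chunker can then refuse to place a split point inside any returned range, which is what keeps a table's rows in one chunk.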

Custom detectors: subclass BoundaryDetector and return SplitPoint / KeepTogetherRegion. See examples/ml-detectors/ for scikit-learn examples.

Documentation

  • Cookbook — domain recipes (clinical, FDA, SEC, financial, legal, chat, CJK), hierarchical boundaries, annotation ingestion, vector DB integration, tuning tips
  • API Reference — full parameter tables, Chunk attributes, algorithm details, architecture
  • FAQ — "What if I need to..." for common questions
  • Benchmark — LLM-as-judge methodology, reproduction steps, raw results

Ecosystem

Part of a RAG tools suite for retrieval quality:

Tool         Role            What it measures
chunkweaver  Ingestion       Structure-aware chunking — controls what text enters the prompt
ragtune      Evaluation      Retrieval metrics (Recall@K, MRR, bootstrap CI) — measures how well your pipeline retrieves
ragprobe     Pre-deployment  Domain difficulty analysis — predicts how hard retrieval will be before you build

chunkweaver --recommend my_doc.txt              # what config to use
chunkweaver my_doc.txt --preset legal-eu \
    --export-dir ./chunks/                      # chunk → one .txt per chunk
ragtune ingest ./chunks/ --pre-chunked          # embed + store
ragtune simulate --queries golden.json          # measure retrieval
ragprobe analyze --corpus ./docs                # how hard is this domain

--export-dir writes ragtune-compatible files directly — no glue script needed. Use --format json with --recommend or --inspect for CI-friendly output.

Known limitations

  • Sentence detection defaults to a simple regex ([.!?]\s+(?=[A-Z"(])). Abbreviations like "Dr. Smith" may cause false splits. For non-English or informal text, pass sentence_pattern — built-in alternatives: SENTENCE_END_CJK, SENTENCE_END_PERMISSIVE. See the cookbook for Cyrillic, Spanish, and other script-specific guidance.
  • Boundaries are line-level regex matches — they won't detect inline structural markers.
  • No tokenizer awareness — target_size is in characters, not tokens. For token budgets, estimate tokens ≈ chars / 4.
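
The abbreviation pitfall from the first limitation is easy to reproduce with the default pattern and Python's re module:

```python
import re

SENTENCE_END = r'[.!?]\s+(?=[A-Z"(])'  # the default pattern quoted above

text = 'Dr. Smith arrived. He sat down.'
print(re.split(SENTENCE_END, text))  # → ['Dr', 'Smith arrived', 'He sat down.']
```

"Dr." is followed by a space and a capital letter, so the regex treats it as a sentence end — exactly the false split the limitation describes.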

Author

Oleksii Alexapolsky — building retrieval quality tools: chunkweaver, ragtune, ragprobe.

License

MIT
