entropy-chunker
Adaptive Information-Aware Chunking for RAG and Agentic Systems. Chunk boundaries follow information density instead of a fixed token count, with no embeddings and no language model anywhere in the chunking decision.
pip install entropy-chunker
from entropy_chunker import InfoTheoreticChunker
chunker = InfoTheoreticChunker()
chunks = chunker.split_text(my_document_text)
Why
Most chunkers assume equal token count ≈ equal information content. It's usually false: a legal contract's definitions section, packed with entities referenced throughout the document, gets split as arbitrarily as its boilerplate governing-law clause under a fixed 512-token splitter.
entropy-chunker instead scores each sentence on three signals, then
walks through the document and emits a chunk boundary when accumulated
information, not accumulated tokens, crosses a threshold:
| Signal | What it measures | How |
|---|---|---|
| Compression | redundancy vs. recent context | marginal gzip-compressed size |
| Lexical novelty | new vocabulary introduced | running vocabulary set |
| Word-frequency entropy | rarity vs. the rest of this document | classical Shannon surprisal |
All three are closed-form statistics over the text itself: no vector embeddings, no pretrained model, no API calls. This also makes it fast (the ~740K-character Finance benchmark corpus chunks in well under a second) and fully deterministic.
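As a rough illustration of what signals like these can look like, here is a minimal standard-library sketch. It is not the package's implementation: the function names, the tokenizer regex, and the use of zlib (the same DEFLATE algorithm gzip uses, minus the file header) are all assumptions made for this example.

```python
import math
import re
import zlib
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def compression_signal(context, sentence):
    # Extra compressed bytes the sentence costs on top of recent context.
    # A sentence that repeats the context compresses well and scores low.
    base = len(zlib.compress(context.encode("utf-8")))
    joined = len(zlib.compress((context + " " + sentence).encode("utf-8")))
    return max(joined - base, 0)


def novelty_signal(seen_vocab, sentence):
    # Fraction of the sentence's distinct words not yet in the running vocabulary.
    words = set(tokenize(sentence))
    if not words:
        return 0.0
    return len(words - seen_vocab) / len(words)


def surprisal_signal(doc_counts, sentence):
    # Mean Shannon surprisal, -log2 p(w), with p(w) estimated from word
    # frequencies in this document only. Words absent from the counts are skipped.
    total = sum(doc_counts.values())
    words = [w for w in tokenize(sentence) if doc_counts.get(w, 0) > 0]
    if not words:
        return 0.0
    return sum(-math.log2(doc_counts[w] / total) for w in words) / len(words)
```

In this sketch, `doc_counts` would be built once per document, for example with `Counter(tokenize(document))`, while `seen_vocab` grows as sentences are consumed.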
Where it outshines standard and embedding-based chunking
Benchmarked against Chroma's chunking evaluation methodology (472 real queries, 5 corpora), using the same embedding model as the paper's primary table:
| Metric | entropy-chunker | Best paper baseline |
|---|---|---|
| Precision | 8.93 | 7.0 (Recursive-200) |
| Precision-Ω | 37.68 | 29.9 (Recursive-200) |
| IoU | 8.84 | 6.9 (Recursive-200) |
| Recall | 83.9 | 91.9 (LLM-GPT4o) |
Precision, Precision-Ω, and IoU beat every baseline tested,
including ClusterSemanticChunker (which uses embeddings directly to pick
boundaries) and LLMSemanticChunker (which prompts GPT-4o). Recall trails
the best baseline by 8.0 points (83.9 vs. 91.9): smaller, more targeted
chunks retrieve precisely but are slightly more likely to split a long
excerpt across a boundary. That tradeoff is the honest cost of the
precision gain, not a bug.
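To make the precision/recall tradeoff concrete, here is a simplified sketch of token-level precision, recall, and IoU over sets of token positions. The function name and the toy ranges are assumptions, and the actual evaluation framework may treat overlapping or duplicated retrieved chunks differently.

```python
def retrieval_metrics(relevant, retrieved):
    # relevant / retrieved: sets of token positions in the corpus.
    overlap = len(relevant & retrieved)
    union = len(relevant | retrieved)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    iou = overlap / union if union else 0.0
    return precision, recall, iou


excerpt = set(range(100, 150))      # 50 relevant tokens
large_chunk = set(range(0, 400))    # big chunk that fully contains the excerpt
small_chunk = set(range(100, 140))  # tight chunk that misses the excerpt's tail

print(retrieval_metrics(excerpt, large_chunk))  # high recall, low precision and IoU
print(retrieval_metrics(excerpt, small_chunk))  # high precision and IoU, lower recall
```

The large chunk recovers the whole excerpt but drags in 350 irrelevant tokens; the small chunk is almost all signal but loses the last 10 tokens of the excerpt.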
Chunk sizes also vary 4-5x more than a fixed-size baseline across every corpus tested, direct evidence that boundaries track information density rather than token count.
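If you want to check size variation on your own corpus, one option is the coefficient of variation of chunk lengths. This is only a sketch: the choice of statistic and the helper name are assumptions, and the benchmark may have measured spread differently.

```python
import statistics


def size_variation(chunks):
    # Coefficient of variation: population std dev of chunk lengths / mean length.
    lengths = [len(c) for c in chunks]
    if not lengths:
        return 0.0
    mean = statistics.fmean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0
```

Computing this for the output of `chunker.split_text(...)` and for a fixed-size splitter on the same text gives a ratio you can compare.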
Full methodology, per-corpus breakdowns, and the weight-sensitivity analysis behind the presets below: BENCHMARKS.md.
Presets
Equal weighting is a safe default, but the benchmark sweeps found it was never optimal. Three tuned presets, backed by that sweep data, are available alongside the balanced default:
InfoTheoreticChunker(preset="precise") # best IoU/precision in benchmarks
InfoTheoreticChunker(preset="recall_focused") # trades some IoU for recall
InfoTheoreticChunker(preset="tabular") # for tables, boilerplate-heavy docs
InfoTheoreticChunker(preset="balanced") # equal weighting -- this is the package default
Omitting the preset argument is equivalent to preset="balanced". If you
don't know your corpus's structure in advance, precise is the better
starting point for most prose/document use cases. The balanced preset
remains the default only because it was never the worst option in any
benchmark corpus, which makes it the safer, unbiased choice absent more
information.
Or set weights directly: InfoTheoreticChunker(w_compression=0.1, w_novelty=0.0, w_entropy=0.9).
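One plausible reading of these weights is a normalized weighted sum of the three per-sentence signals. The sketch below assumes exactly that, along with signals already scaled to comparable ranges and equal default weights of 1/3; the helper name is hypothetical and the package may combine or normalize the signals differently.

```python
def combine_signals(compression, novelty, entropy,
                    w_compression=1 / 3, w_novelty=1 / 3, w_entropy=1 / 3):
    # Weighted average of the three signals; weights need not sum to 1.
    total = w_compression + w_novelty + w_entropy
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = (w_compression * compression
                + w_novelty * novelty
                + w_entropy * entropy)
    return weighted / total
```

Under this reading, w_novelty=0.0 removes lexical novelty from the score entirely, and w_entropy=0.9 lets word-frequency surprisal dominate.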
Tunable parameters
InfoTheoreticChunker(
    info_threshold=1.0,  # cumulative info score that triggers a boundary
    max_tokens=800,      # hard ceiling, regardless of info score
    preset="precise",    # or set w_compression/w_novelty/w_entropy directly
)
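The interaction between info_threshold and max_tokens can be pictured with a small accumulation loop. This is a sketch under assumptions, not the package's code: the function name is hypothetical, per-sentence scores are taken as given, and token counts use a rough chars/4 estimate.

```python
def chunk_by_information(sentences, scores, info_threshold=1.0, max_tokens=800):
    # sentences: list of str; scores: per-sentence information scores, same length.
    chunks, current, info, tokens = [], [], 0.0, 0
    for sentence, score in zip(sentences, scores):
        n = max(1, len(sentence) // 4)  # rough chars/4 token estimate
        # Hard ceiling: close the chunk before it would exceed max_tokens.
        if current and tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, info, tokens = [], 0.0, 0
        current.append(sentence)
        info += score
        tokens += n
        # Information-driven boundary: enough information has accumulated.
        if info >= info_threshold:
            chunks.append(" ".join(current))
            current, info, tokens = [], 0.0, 0
    if current:
        chunks.append(" ".join(current))  # flush the trailing partial chunk
    return chunks
```

Dense sentences reach the threshold quickly and produce short chunks; low-information runs keep accumulating until either the threshold or the token ceiling ends the chunk.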
Honest limitations
- Regex-based sentence splitting (by design: no model in the pipeline), which can misfire on unusual punctuation or line-wrap conventions.
- Recall trails embedding- and LLM-based chunkers by several points; see BENCHMARKS.md for the full tradeoff discussion.
- Validated on prose/structured-text corpora; code as a domain hasn't been separately benchmarked.
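To see the kind of misfire the first limitation refers to, consider a common naive splitting pattern. The regex below is an illustrative assumption, not necessarily the one the package uses.

```python
import re


def split_sentences(text):
    # Split after ., !, or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(split_sentences("Revenue rose 3.5 percent. Costs fell."))
print(split_sentences("Dr. Smith arrived. He left."))
```

The decimal in "3.5" survives because no whitespace follows the period, but the abbreviation "Dr." is treated as a sentence end, so the second input yields three pieces instead of two.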
Installation extras
pip install "entropy-chunker[tokens]" # exact token counting via tiktoken
pip install "entropy-chunker[eval]" # for running the benchmark yourself
pip install git+https://github.com/brandonstarxel/chunking_evaluation.git # required for [eval]; not on PyPI
Without [tokens], token counting falls back to a chars/4 approximation.
This affects only the precision of the max_tokens ceiling, not where
chunk boundaries are placed.
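A fallback of this kind can be sketched as follows. The helper names, the rounding direction, and the choice of the cl100k_base encoding are assumptions for illustration; the package's own fallback may differ in detail.

```python
import math


def approx_token_count(text):
    # chars/4 heuristic; rounding up is a choice made for this sketch.
    return math.ceil(len(text) / 4)


def count_tokens(text, encoding_name="cl100k_base"):
    # Use tiktoken when it is installed, otherwise fall back to the heuristic.
    try:
        import tiktoken
    except ImportError:
        return approx_token_count(text)
    return len(tiktoken.get_encoding(encoding_name).encode(text))
```

Because the estimate only feeds the ceiling check, a small counting error shifts where an oversized chunk gets cut, not how sentences are scored.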
License
MIT