Adaptive Information-Aware Chunking for RAG and Agentic Systems, driven by information density instead of fixed token counts.

entropy-chunker

Adaptive Information-Aware Chunking for RAG and Agentic Systems. Chunk boundaries follow information density instead of a fixed token count; no embeddings and no language model are involved anywhere in the chunking decision.

pip install entropy-chunker

from entropy_chunker import InfoTheoreticChunker

chunker = InfoTheoreticChunker()  # no arguments: same as preset="balanced"
chunks = chunker.split_text(my_document_text)  # my_document_text: your document text

Why

Most chunkers assume equal token count ≈ equal information content. That assumption is usually false: under a fixed 512-token splitter, a legal contract's definitions section, packed with entities referenced throughout the document, is split as arbitrarily as its boilerplate governing-law clause.

entropy-chunker instead scores each sentence on three signals, then walks through the document and emits a chunk boundary when accumulated information, not accumulated tokens, crosses a threshold:

Signal	What it measures	How
Compression	redundancy vs. recent context	marginal gzip-compressed size
Lexical novelty	new vocabulary introduced	running vocabulary set
Word-frequency entropy	rarity vs. the rest of this document	classical Shannon surprisal

All three are closed-form statistics over the text itself — no vector embeddings, no pretrained model, no API calls. This also means it's fast (the ~740K-character Finance benchmark corpus chunks in well under a second) and fully deterministic.
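To make the three signals concrete, here is a minimal sketch of how each could be computed with the standard library alone. It is not the package's implementation: the function names, the word tokenizer, the normalization of the novelty score, and the add-one smoothing are all assumptions made for illustration.

```python
import gzip
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")


def words(text):
    """Lowercased word tokens; a stand-in for whatever tokenizer the package uses."""
    return WORD_RE.findall(text.lower())


def compression_gain(context, sentence):
    """Extra gzip bytes needed to append `sentence` to `context`."""
    base = len(gzip.compress(context.encode("utf-8")))
    both = len(gzip.compress((context + " " + sentence).encode("utf-8")))
    return max(both - base, 0)


def lexical_novelty(seen, sentence):
    """Fraction of the sentence's distinct words not yet in the `seen` set."""
    distinct = set(words(sentence))
    if not distinct:
        return 0.0
    return len(distinct - seen) / len(distinct)


def mean_surprisal(doc_counts, sentence):
    """Average -log2 p(word) under the document's own word frequencies.

    Add-one smoothing is an assumption made here so unseen words stay finite.
    """
    toks = words(sentence)
    if not toks:
        return 0.0
    total = sum(doc_counts.values()) + len(doc_counts) + 1
    return sum(-math.log2((doc_counts[t] + 1) / total) for t in toks) / len(toks)
```

A sentence that repeats recent context adds few compressed bytes, introduces no new vocabulary, and consists of words that are frequent in the document, so it scores low on all three signals.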

Where it outshines standard and embedding-based chunking

Benchmarked with Chroma's chunking evaluation methodology (472 real queries, 5 corpora), using the same embedding model as the paper's primary table:

Metric	entropy-chunker	Best paper baseline
Precision	8.93	7.0 (Recursive-200)
Precision-Ω	37.68	29.9 (Recursive-200)
IoU	8.84	6.9 (Recursive-200)
Recall	83.9	91.9 (LLM-GPT4o)

Precision, Precision-Ω, and IoU all beat every baseline tested, including ClusterSemanticChunker (which uses embeddings directly to pick boundaries) and LLMSemanticChunker (which prompts GPT-4o). Recall trails the best baseline by 8.0 points (83.9 vs. 91.9 for LLM-GPT4o): smaller, more targeted chunks retrieve precisely but are more likely to split a long excerpt across a boundary. That tradeoff is the cost of the precision gain, not a bug.
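The precision/recall tradeoff is easier to see with a simplified version of overlap-based retrieval metrics. The sketch below treats the relevant excerpt and the retrieved chunk as sets of character positions. It is an illustration only: the benchmark's exact definitions (including Precision-Ω) live in its own documentation, and `overlap_metrics` is a hypothetical helper, not part of either package.

```python
def overlap_metrics(relevant, retrieved):
    """Precision, recall and IoU between two sets of character positions."""
    inter = len(relevant & retrieved)
    union = len(relevant | retrieved)
    precision = inter / len(retrieved) if retrieved else 0.0
    recall = inter / len(relevant) if relevant else 0.0
    iou = inter / union if union else 0.0
    return precision, recall, iou


excerpt = set(range(100, 200))      # the 100 characters a query actually needs
small_chunk = set(range(120, 180))  # tight chunk that cuts the excerpt short
large_chunk = set(range(0, 400))    # loose chunk that swallows the excerpt

print(overlap_metrics(excerpt, small_chunk))  # (1.0, 0.6, 0.6)
print(overlap_metrics(excerpt, large_chunk))  # (0.25, 1.0, 0.25)
```

The small chunk wastes nothing but misses part of the excerpt; the large chunk covers everything but is mostly irrelevant text.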

Chunk sizes also vary 4-5x more than a fixed-size baseline across every corpus tested, which is what you would expect if boundaries track information density rather than token count.
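If you want to check size variation on your own corpus, one option is the coefficient of variation of chunk lengths. The README does not say which measure produced the 4-5x figure, so treating it as a coefficient of variation is an assumption, and `size_variation` is a hypothetical helper.

```python
from statistics import mean, pstdev


def size_variation(chunks):
    """Coefficient of variation (std / mean) of chunk lengths in characters."""
    sizes = [len(c) for c in chunks]
    if not sizes or mean(sizes) == 0:
        return 0.0
    return pstdev(sizes) / mean(sizes)
```

Comparing `size_variation(adaptive_chunks) / size_variation(fixed_chunks)` for the same document gives a ratio; a perfectly uniform fixed-size split has variation 0, so guard against dividing by zero.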

Full methodology, per-corpus breakdowns, and the weight-sensitivity analysis behind the presets below: BENCHMARKS.md.

Presets

Equal weighting is a safe default, but benchmark sweeps found it's never actually optimal. Three tuned presets, backed by sweep data, are available alongside the equal-weight default:

InfoTheoreticChunker(preset="precise")         # best IoU/precision in benchmarks
InfoTheoreticChunker(preset="recall_focused")   # trades some IoU for recall
InfoTheoreticChunker(preset="tabular")          # for tables, boilerplate-heavy docs
InfoTheoreticChunker(preset="balanced")         # equal weighting -- this is the package default

Omitting the preset argument is equivalent to preset="balanced". If you don't know your corpus's structure in advance, precise is the better starting point for most prose and document use cases. balanced remains the default only because it was never the worst option on any benchmark corpus, which makes it the safer, unbiased choice when nothing else is known.

Or set weights directly: InfoTheoreticChunker(w_compression=0.1, w_novelty=0.0, w_entropy=0.9).
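As a rough mental model of what the weights do, the per-sentence score can be thought of as a weighted mix of the three signals. The sketch below assumes each signal has already been scaled to a comparable range and that weights are normalized to sum to 1; neither detail is confirmed by this README, and `combined_score` is a hypothetical function.

```python
def combined_score(compression, novelty, entropy,
                   w_compression=1 / 3, w_novelty=1 / 3, w_entropy=1 / 3):
    """Weighted average of three per-sentence signals (assumed pre-scaled)."""
    total = w_compression + w_novelty + w_entropy
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = (w_compression * compression
                + w_novelty * novelty
                + w_entropy * entropy)
    return weighted / total
```

With `w_compression=0.1, w_novelty=0.0, w_entropy=0.9`, a sentence's novelty is ignored entirely and its score is driven almost completely by word-frequency entropy.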

Tunable parameters

InfoTheoreticChunker(
    info_threshold=1.0,   # cumulative info score that triggers a boundary
    max_tokens=800,       # hard ceiling, regardless of info score
    preset="precise",     # or set w_compression/w_novelty/w_entropy directly
)
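The interaction between info_threshold and max_tokens can be sketched as a greedy walk over scored sentences. This is an illustration of the idea described above, not the package's code: `chunk_by_information` is a hypothetical function, the scores are supplied by the caller, and token counts use a chars/4 estimate.

```python
def chunk_by_information(sentences, scores, info_threshold=1.0, max_tokens=800):
    """Group sentences until accumulated score or the token ceiling is hit."""
    chunks, current = [], []
    acc, tokens = 0.0, 0

    def flush():
        nonlocal current, acc, tokens
        if current:
            chunks.append(" ".join(current))
        current, acc, tokens = [], 0.0, 0

    for sentence, score in zip(sentences, scores):
        sent_tokens = max(1, len(sentence) // 4)  # rough estimate
        # Hard ceiling: close the chunk before it would overflow.
        if current and tokens + sent_tokens > max_tokens:
            flush()
        current.append(sentence)
        acc += score
        tokens += sent_tokens
        # Soft boundary: enough information has accumulated.
        if acc >= info_threshold:
            flush()
    flush()  # emit whatever is left
    return chunks
```

Lowering info_threshold yields smaller chunks in dense passages, while max_tokens only matters in long low-information stretches where the threshold is slow to trigger.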

Honest limitations

Regex-based sentence splitting (by design — no model in the pipeline), which can misfire on unusual punctuation or line-wrap conventions.
Recall trails embedding- and LLM-based chunkers by several points; see BENCHMARKS.md for the full tradeoff discussion.
Validated on prose/structured-text corpora; code as a domain hasn't been separately benchmarked.
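To see how regex-based sentence splitting can misfire, consider a generic splitter that breaks after sentence-ending punctuation. The pattern below is a common textbook one chosen for illustration; the package's actual regex is not shown in this README.

```python
import re

# Split on whitespace that follows '.', '!' or '?'.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def naive_sentences(text):
    """Split text into sentences with a simple punctuation rule."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if s]


print(naive_sentences("Dr. Smith arrived. He sat."))
# ['Dr.', 'Smith arrived.', 'He sat.']
```

The abbreviation "Dr." is treated as a sentence end. If your documents are full of abbreviations, citations, or hard-wrapped lines, inspect a sample of chunks before relying on the output.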

Installation extras

pip install "entropy-chunker[tokens]"  # exact token counting via tiktoken
pip install "entropy-chunker[eval]"    # for running the benchmark yourself
pip install git+https://github.com/brandonstarxel/chunking_evaluation.git  # required for [eval]; not on PyPI

Without [tokens], token counting falls back to a chars/4 approximation — this only affects the precision of the max_tokens ceiling, not where chunk boundaries are placed.
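The fallback behavior can be pictured as follows. This is a sketch under assumptions: the function names are hypothetical, the rounding rule for chars/4 is a guess, and the `cl100k_base` encoding is only an example since the README does not say which tiktoken encoding the package uses.

```python
def approx_tokens(text):
    """Fallback estimate: roughly one token per four characters."""
    return len(text) // 4


def count_tokens(text, encoding_name="cl100k_base"):
    """Exact count via tiktoken when installed, else the chars/4 estimate."""
    try:
        import tiktoken
    except ImportError:
        return approx_tokens(text)
    return len(tiktoken.get_encoding(encoding_name).encode(text))
```

Because the estimate only feeds the max_tokens check, a chunk may end slightly over or under the ceiling without [tokens], but information-driven boundaries are unaffected.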

License

MIT

This version

0.1.0

Jun 29, 2026

Download files

Source Distribution

entropy_chunker-0.1.0.tar.gz (22.2 kB)

Uploaded Jun 29, 2026 Source

Built Distribution

entropy_chunker-0.1.0-py3-none-any.whl (19.0 kB)

Uploaded Jun 29, 2026 Python 3

File details

Details for the file entropy_chunker-0.1.0.tar.gz.

File metadata

Download URL: entropy_chunker-0.1.0.tar.gz
Upload date: Jun 29, 2026
Size: 22.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for entropy_chunker-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6b45622a952a5fcd7d96e67c7e415d3e456337399736410215f94e0747736f84`
MD5	`9b8164a107dbcf3bdbeff12bd8e62504`
BLAKE2b-256	`6ac6299c65b77e89dac10423912559cb441f9ec0fd55ca7bb4d1e0cce8917f2b`

File details

Details for the file entropy_chunker-0.1.0-py3-none-any.whl.

File metadata

Download URL: entropy_chunker-0.1.0-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 19.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for entropy_chunker-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fcdef745d5e34e06ce8945261493e9e8409c08800b28aa610a038ec81325b9c6`
MD5	`ed885fb450537a4c53bebf58bbb2f4b6`
BLAKE2b-256	`b72095bc1bc9e2453279cbdcc09513e3e6a05634f17ae843b3962ddfe8a4f14d`
