Adaptive Information-Aware Chunking for RAG and Agentic Systems, driven by information density instead of fixed token counts.

entropy-chunker

Adaptive Information-Aware Chunking for RAG and Agentic Systems. Chunk boundaries follow information density instead of a fixed token count, with no embeddings and no language model anywhere in the chunking decision.

pip install entropy-chunker

from entropy_chunker import InfoTheoreticChunker

chunker = InfoTheoreticChunker()
chunks = chunker.split_text(my_document_text)

Why

Most chunkers assume that equal token counts carry roughly equal information. That assumption is usually false: under a fixed 512-token splitter, a legal contract's definitions section, packed with entities referenced throughout the document, is split as arbitrarily as its boilerplate governing-law clause.

entropy-chunker instead scores each sentence on three signals, then walks through the document and emits a chunk boundary when accumulated information, not accumulated tokens, crosses a threshold:

| Signal | What it measures | How |
| --- | --- | --- |
| Compression | redundancy vs. recent context | marginal gzip-compressed size |
| Lexical novelty | new vocabulary introduced | running vocabulary set |
| Word-frequency entropy | rarity vs. the rest of this document | classical Shannon surprisal |

All three are closed-form statistics over the text itself — no vector embeddings, no pretrained model, no API calls. This also means it's fast (the ~740K-character Finance benchmark corpus chunks in well under a second) and fully deterministic.
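To make the three signals concrete, here is a minimal sketch of how each could be computed with the standard library alone. The function names, the tokenizing regex, and the exact formulas are assumptions for illustration; the package's internal implementation is not shown in this description and may differ.

```python
import gzip
import math
import re
from collections import Counter

# Hypothetical word tokenizer, chosen only for this illustration.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def words(text: str) -> list:
    """Lowercased word tokens of `text`."""
    return [w.lower() for w in WORD_RE.findall(text)]


def compression_gain(context: str, sentence: str) -> int:
    """Extra gzip bytes needed once `sentence` is appended to `context`.

    A sentence that repeats recent context compresses almost for free,
    so a small gain indicates redundancy.
    """
    base = len(gzip.compress(context.encode("utf-8")))
    joined = len(gzip.compress((context + " " + sentence).encode("utf-8")))
    return joined - base


def lexical_novelty(seen: set, sentence: str) -> float:
    """Fraction of the sentence's distinct words not yet in `seen`."""
    vocab = set(words(sentence))
    if not vocab:
        return 0.0
    return len(vocab - seen) / len(vocab)


def mean_surprisal(doc_counts: Counter, sentence: str) -> float:
    """Average of -log2 p(word), with p estimated from the whole document."""
    total = sum(doc_counts.values())
    tokens = [w for w in words(sentence) if doc_counts[w] > 0]
    if not tokens:
        return 0.0
    return sum(-math.log2(doc_counts[w] / total) for w in tokens) / len(tokens)
```

In this sketch, `seen` would be updated with each sentence's vocabulary as the walk proceeds, and `doc_counts` would be built once from `Counter(words(document))` before scoring.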

Where it outshines standard and embedding-based chunking

Benchmarked against Chroma's chunking evaluation methodology (472 real queries, 5 corpora), using the same embedding model as the paper's primary table:

| Metric | entropy-chunker | Best paper baseline |
| --- | --- | --- |
| Precision | 8.93 | 7.0 (Recursive-200) |
| Precision-Ω | 37.68 | 29.9 (Recursive-200) |
| IoU | 8.84 | 6.9 (Recursive-200) |
| Recall | 83.9 | 91.9 (LLM-GPT4o) |

Precision, Precision-Ω, and IoU all beat every baseline tested, including ClusterSemanticChunker (which uses embeddings directly to pick boundaries) and LLMSemanticChunker (which prompts GPT-4o). Recall trails the best baseline by 8.0 points (83.9 vs. 91.9): smaller, more targeted chunks retrieve precisely but are slightly more likely to split a long excerpt across a boundary. That tradeoff is the honest cost of the precision gain, not a bug.
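The precision/recall tradeoff can be illustrated with a simplified, token-position version of these metrics. This is a sketch under stated assumptions: it treats the relevant excerpt and the retrieved chunk as sets of token positions, and it does not reproduce the benchmark's exact definitions (Precision-Ω in particular is left out). The function name and the example ranges are hypothetical.

```python
def retrieval_metrics(relevant: set, retrieved: set) -> dict:
    """Set-overlap recall, precision, and IoU over token positions."""
    overlap = len(relevant & retrieved)
    union = len(relevant | retrieved)
    return {
        "recall": overlap / len(relevant) if relevant else 0.0,
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        "iou": overlap / union if union else 0.0,
    }


# A 100-token relevant excerpt at positions 150..249.
excerpt = set(range(150, 250))

# One large chunk covers the whole excerpt plus a lot of unrelated text.
large = retrieval_metrics(excerpt, set(range(0, 400)))

# A tight chunk holds only relevant tokens but misses the excerpt's tail.
small = retrieval_metrics(excerpt, set(range(150, 230)))
```

Here the large chunk reaches full recall at low precision, while the tight chunk gives up some recall for much higher precision and IoU, which is the shape of the tradeoff described above.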

Chunk sizes also vary 4-5x more than a fixed-size baseline across every corpus tested — direct evidence boundaries track real information density rather than token count.
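One way to check size variation on your own corpus is to compare the spread of chunk lengths between two chunkers. The sketch below uses the coefficient of variation (standard deviation divided by mean) over character lengths; that choice of statistic and the function name are assumptions, and the benchmark may measure variation differently.

```python
import statistics


def size_spread(chunks: list) -> float:
    """Coefficient of variation of chunk lengths, in characters."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return 0.0
    mean = statistics.mean(lengths)
    if mean == 0:
        return 0.0
    return statistics.pstdev(lengths) / mean
```

Dividing `size_spread` for entropy-chunker's output by the same figure for a fixed-size splitter gives a rough ratio to compare against the claim above.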

Full methodology, per-corpus breakdowns, and the weight-sensitivity analysis behind the presets below: BENCHMARKS.md.

Presets

Equal weighting is a safe default, but the benchmark sweeps found it is never actually optimal. Three tuned presets, backed by that sweep data, are available alongside the equal-weight default:

InfoTheoreticChunker(preset="precise")         # best IoU/precision in benchmarks
InfoTheoreticChunker(preset="recall_focused")   # trades some IoU for recall
InfoTheoreticChunker(preset="tabular")          # for tables, boilerplate-heavy docs
InfoTheoreticChunker(preset="balanced")         # equal weighting -- this is the package default

No preset argument is equivalent to preset="balanced". If you don't know your corpus's structure in advance, precise is the better starting point for most prose/document use cases — balanced is kept as the actual default only because it was never the worst option in any benchmark corpus, a safer unbiased choice absent more information.

Or set weights directly: InfoTheoreticChunker(w_compression=0.1, w_novelty=0.0, w_entropy=0.9).
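As a rough mental model of what the weights do, the sketch below combines three per-sentence signal values into one score with a weighted average. It assumes the signals have already been scaled to comparable ranges and that weights are normalized by their sum; neither detail is confirmed by this description, and `combine_signals` is a hypothetical helper, not part of the package's API.

```python
def combine_signals(
    compression: float,
    novelty: float,
    entropy: float,
    w_compression: float = 1 / 3,
    w_novelty: float = 1 / 3,
    w_entropy: float = 1 / 3,
) -> float:
    """Weighted average of the three signals (illustrative only)."""
    total = w_compression + w_novelty + w_entropy
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = (
        w_compression * compression
        + w_novelty * novelty
        + w_entropy * entropy
    )
    return weighted / total
```

With `w_novelty=0.0`, as in the example above, the novelty signal drops out of the score entirely and word-frequency entropy dominates.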

Tunable parameters

InfoTheoreticChunker(
    info_threshold=1.0,   # cumulative info score that triggers a boundary
    max_tokens=800,       # hard ceiling, regardless of info score
    preset="precise",     # or set w_compression/w_novelty/w_entropy directly
)
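The interaction between info_threshold and max_tokens can be sketched as a single pass over pre-scored sentences. This is an illustration of the idea, not the package's code: the function name is hypothetical, the per-sentence scores are taken as given, and token counts use the chars/4 approximation.

```python
def chunk_by_information(sentences, scores, info_threshold=1.0, max_tokens=800):
    """Group sentences, closing a chunk when accumulated score crosses
    `info_threshold` or when adding a sentence would exceed `max_tokens`."""
    chunks, current = [], []
    info, tokens = 0.0, 0
    for sentence, score in zip(sentences, scores):
        n_tokens = max(1, len(sentence) // 4)  # chars/4 approximation
        # Hard ceiling: flush before this sentence would overflow the chunk.
        if current and tokens + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, info, tokens = [], 0.0, 0
        current.append(sentence)
        info += score
        tokens += n_tokens
        # Information-driven boundary.
        if info >= info_threshold:
            chunks.append(" ".join(current))
            current, info, tokens = [], 0.0, 0
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch, lowering info_threshold produces more, smaller chunks, while max_tokens only matters in low-information stretches where the threshold is slow to trigger.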

Honest limitations

  • Regex-based sentence splitting (by design — no model in the pipeline), which can misfire on unusual punctuation or line-wrap conventions.
  • Recall trails embedding- and LLM-based chunkers by several points; see BENCHMARKS.md for the full tradeoff discussion.
  • Validated on prose/structured-text corpora; code as a domain hasn't been separately benchmarked.
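The first limitation is easy to reproduce with any simple regex splitter. The pattern below is a stand-in chosen for illustration, not the expression the package uses: it splits on whitespace that follows a period, question mark, or exclamation mark.

```python
import re

# Hypothetical pattern: whitespace preceded by sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list:
    """Naive regex sentence splitter (illustrative only)."""
    return [part for part in SENTENCE_END.split(text.strip()) if part]


clean = split_sentences("It works. Does it? Yes!")
misfire = split_sentences("Dr. Smith arrived. He left.")
```

Ordinary prose splits as expected, but the abbreviation "Dr." is treated as a sentence end, so `misfire` contains three pieces instead of two.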

Installation extras

pip install "entropy-chunker[tokens]"  # exact token counting via tiktoken
pip install "entropy-chunker[eval]"    # for running the benchmark yourself
pip install git+https://github.com/brandonstarxel/chunking_evaluation.git  # required for [eval]; not on PyPI

Without [tokens], token counting falls back to a chars/4 approximation. This affects only how precisely the max_tokens ceiling is enforced, not where information-driven chunk boundaries fall.
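A fallback of this kind is commonly written as an optional import. The sketch below assumes floor division for the chars/4 estimate and the `cl100k_base` encoding for exact counts; both are assumptions, since this description does not say which rounding or encoding the package uses, and the function names are hypothetical.

```python
try:
    import tiktoken  # optional dependency
except ImportError:
    tiktoken = None


def approx_token_count(text: str) -> int:
    """Estimate tokens as characters divided by four (floor)."""
    return len(text) // 4


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Exact count via tiktoken when installed, otherwise the estimate."""
    if tiktoken is None:
        return approx_token_count(text)
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))
```

Because the estimate is only used to enforce the ceiling, an error of a few percent shifts where an oversized chunk gets cut but leaves the scoring untouched.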

License

MIT
