chunk-bench

Benchmark RAG chunking strategies on your own documents. Compare fixed, sliding, paragraph, and recursive chunking with real retrieval metrics.


Why chunk-bench?

Chunking strategy is one of the highest-impact decisions in a RAG pipeline: 2026 benchmarks show up to a 9% recall gap between the best and worst strategies on the same corpus. Yet most teams pick a strategy once and never measure whether it actually works for their documents.

chunk-bench runs all four strategies against your own text and queries in one call and returns recall, precision, MRR, and F1 scores, with no embeddings and no external services required.


Install

pip install chunk-bench

Quick start

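from pathlib import Path

from chunk_bench import ChunkBench

# Load the document to benchmark (any plain-text string works).
your_document_text = Path("document.txt").read_text(encoding="utf-8")

bench = ChunkBench()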

report = bench.run(
    text=your_document_text,
    queries=["What is retrieval augmented generation?",
             "How does chunking affect retrieval quality?",
             "What is the difference between fixed and semantic chunking?"],
)

print(report.summary_table())

Output:

Strategy           Chunks  Avg Tokens   Recall  Precision     MRR      F1
────────────────────────────────────────────────────────────────────────
recursive               18         94    0.857      0.733   0.833   0.790
paragraph               12        142    0.810      0.700   0.810   0.752
fixed                   24         71    0.762      0.667   0.762   0.711
sliding                 31         55    0.714      0.633   0.714   0.671
────────────────────────────────────────────────────────────────────────

  Best recall:    recursive (0.857)
  Best F1:        recursive (0.790)
  Best MRR:       recursive (0.833)
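
The F1 column in the sample output looks consistent with the usual harmonic mean of precision and recall. The sketch below recomputes it from the rounded values shown above. `f1_score` is a hypothetical helper written for illustration, not part of chunk-bench, and the last digit can differ slightly because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) copied from the sample output above.
rows = {
    "recursive": (0.857, 0.733, 0.790),
    "paragraph": (0.810, 0.700, 0.752),
    "fixed": (0.762, 0.667, 0.711),
    "sliding": (0.714, 0.633, 0.671),
}

for name, (recall, precision, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name:<10} computed={computed:.3f} reported={reported:.3f}")
```

Because F1 penalizes an imbalance between the two inputs, a strategy cannot rank well on it through recall or precision alone.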

Provide your own relevance terms

report = bench.run(
    text=your_document,
    queries=["What is GDPR?", "What are data subject rights?"],
    relevant_terms=[
        ["GDPR", "General Data Protection", "regulation"],
        ["subject", "rights", "access", "erasure", "portability"],
    ],
    top_k=5,
)
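
How `relevant_terms` and `top_k` feed into the scores is not spelled out here. One plausible reading, shown purely as a sketch: a chunk counts as relevant to a query if it contains any of that query's terms, chunks are ranked by word overlap with the query, and the metrics are computed over the top `top_k` results. The names `tokenize`, `is_relevant`, `rank_chunks`, and `score_query` are hypothetical, and chunk-bench's actual scoring may differ.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a deliberately simple stand-in for real tokenization."""
    return set(re.findall(r"\w+", text.lower()))


def is_relevant(chunk: str, terms: list[str]) -> bool:
    """Assumption: a chunk is relevant if it contains any relevance term."""
    lowered = chunk.lower()
    return any(term.lower() in lowered for term in terms)


def rank_chunks(query: str, chunks: list[str]) -> list[str]:
    """Rank chunks by how many query words they share (no embeddings)."""
    query_words = tokenize(query)
    return sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)


def score_query(query: str, chunks: list[str], terms: list[str], top_k: int = 5) -> dict[str, float]:
    """Precision, recall, and reciprocal rank for one query over the top_k chunks."""
    retrieved = rank_chunks(query, chunks)[:top_k]
    hits = [is_relevant(c, terms) for c in retrieved]
    total_relevant = sum(is_relevant(c, terms) for c in chunks)

    precision = sum(hits) / len(retrieved) if retrieved else 0.0
    recall = sum(hits) / total_relevant if total_relevant else 0.0
    # Reciprocal rank: 1 / position of the first relevant chunk, else 0.
    reciprocal_rank = next((1 / (i + 1) for i, hit in enumerate(hits) if hit), 0.0)
    return {"precision": precision, "recall": recall, "rr": reciprocal_rank}


chunks = [
    "GDPR is a regulation on data protection in the EU.",
    "Data subject rights include access, erasure, and portability.",
    "Chunk size affects retrieval quality.",
]
scores = score_query(
    "What is GDPR?",
    chunks,
    ["GDPR", "General Data Protection", "regulation"],
    top_k=2,
)
print(scores)
```

Averaging the `rr` value across all queries gives MRR. Under this reading, broader term lists mark more chunks as relevant, which tends to raise precision and lower recall.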

Use specific strategies

report = bench.run(
    text=text,
    queries=queries,
    strategies=["fixed", "recursive"],  # skip sliding and paragraph
    chunk_size=512,
    overlap=50,
)
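
To make the interaction between `chunk_size` and `overlap` concrete, here is a minimal sketch of fixed-size chunking. It assumes both values are measured in characters and that each window starts `chunk_size - overlap` characters after the previous one. `fixed_chunks` is a hypothetical function for illustration, not chunk-bench's implementation.

```python
def fixed_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into chunk_size-character windows sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive window starts
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "x" * 1200
pieces = fixed_chunks(sample, chunk_size=512, overlap=50)
print([len(p) for p in pieces])  # [512, 512, 276]
```

With these settings, every 462 new characters produce one chunk, so a larger `overlap` increases the chunk count for the same document.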

Chunk any text directly

from chunk_bench import chunk

chunks = chunk(text, strategy="recursive", chunk_size=512, overlap=50)
for c in chunks:
    print(f"[{c.index}] ~{c.token_count} tokens: {c.text[:60]}")

CLI

chunk-bench document.txt --queries "What is X?" "How does Y work?" --strategies fixed recursive --json

Strategies

Strategy   Description                                        Best for
──────────────────────────────────────────────────────────────────────────────────────────────────────
fixed      Split at regular character intervals with overlap  Simple, uniform documents
sliding    Overlapping windows stepping forward               When context preservation matters most
paragraph  Split on double newlines, merge small paragraphs   Structured documents with clear sections
recursive  Try paragraph → sentence → word boundaries         Most document types; a good default
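
The recursive row above describes a fallback chain: split on paragraphs first, then on sentences, then on words. A rough sketch of that idea follows, assuming character-based sizes and no overlap. `SEPARATORS` and `recursive_split` are hypothetical names, and chunk-bench's own implementation may differ in its details.

```python
import re

# Separators tried in order: paragraph breaks, sentence ends, then whitespace.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def recursive_split(text: str, chunk_size: int = 512, level: int = 0) -> list[str]:
    """Split text so every piece fits chunk_size, preferring coarse boundaries."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if level >= len(SEPARATORS):
        # No boundaries left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    pieces = [p for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        # Merged pieces are joined with a single space in this sketch.
        candidate = f"{buffer} {piece}" if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # merge small neighbours into one chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, level + 1))
    if buffer:
        chunks.append(buffer)
    return chunks


text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is short.\n\n" + "word " * 30
)
demo_chunks = recursive_split(text, chunk_size=60)
for c in demo_chunks:
    print(len(c), repr(c[:40]))
```

In this example the two short paragraphs stay whole, and only the long run of words is broken at word boundaries. That behavior is why a recursive strategy tends to work as a default across mixed document types.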

Linda Oraegbunam | LinkedIn | Twitter | GitHub
