
A chunk-source-agnostic evaluation harness for RAG chunking strategies


chunkbench

PyPI version · MIT license · Python 3.12 | 3.13 · mypy --strict

Every RAG tutorial picks a chunk size, shrugs, and moves on. chunkbench is what happens after you stop shrugging.


The problem, in one sentence

You split your documents into chunks somehow — fixed size, paragraphs, a chunking library's default recipe, vibes — and that one decision quietly determines whether your retriever can ever find the right answer. Most teams never measure it. They just ship the first thing that seemed to work on three test queries and hope.

chunkbench replaces the hoping with a number. Feed it a corpus, a handful of chunking strategies, and a set of real questions with known-correct answers, and it tells you — with recall, precision, and cost figures side by side — which strategy actually retrieves the right information, instead of which one merely feels right.

It does not chunk your documents. It does not pick your embedding model. It does not talk you into using its favorite LLM. Those are your calls, made with your tools — chunkbench just tells you, honestly, whether the call you made was any good. Think of it less as a library and more as the friend who actually reads the whole receipt before saying "yeah, that seems fair."

Full design rationale — why golden questions live at the section level, what each metric actually measures, and the exact list of things chunkbench deliberately refuses to do — lives in docs/chunkbench.md.
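
For readers who want the metrics pinned down before opening the docs, here is a minimal sketch of the textbook recall@k and precision@k, scored at the section level. It is not chunkbench's implementation: the function names, and the assumption that each retrieved chunk is reduced to its section slug before scoring, are illustrative only. docs/chunkbench.md remains the authority on what chunkbench actually computes.

```python
def recall_at_k(retrieved_sections: list[str], golden_sections: set[str], k: int) -> float:
    """Fraction of the golden sections that show up among the top-k results."""
    if not golden_sections:
        return 0.0
    top_k = set(retrieved_sections[:k])
    return len(top_k & golden_sections) / len(golden_sections)


def precision_at_k(retrieved_sections: list[str], golden_sections: set[str], k: int) -> float:
    """Fraction of the top-k results whose section is a golden section."""
    top_k = retrieved_sections[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for section in top_k if section in golden_sections)
    return hits / len(top_k)


# Hypothetical data: sections of the retrieved chunks, best match first.
retrieved = ["install", "cli", "install"]
golden = {"install", "license"}
print(recall_at_k(retrieved, golden, k=2))     # one of two golden sections found
print(precision_at_k(retrieved, golden, k=2))  # one of two results is relevant
```

Recall asks "did the right material come back at all?" while precision asks "how much of what came back was right?", which is why the two are worth reading side by side.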

Install

pip install chunkbench-rag

(The PyPI distribution is chunkbench-rag because chunkbench alone was too close to an existing project's name, but the import and the CLI command are both still plain chunkbench.)

Core install is three dependencies deep (pydantic, pyyaml, numpy) — no embedding SDK, no LLM SDK, no chunking library, because chunkbench isn't going to make that decision for you. The one shipped convenience extra:

pip install chunkbench-rag[openai]   # adds chunkbench.embedding.providers.openai
                                      # and chunkbench.generation.providers.openai

Using something else (chonkie, Gemini, Cohere, a model you trained in your garage)? See Bring your own everything below. No extra is required; it's a ~15-line function either way.

60-second quickstart

from chunkbench import run_comparison
from chunkbench.corpus import directory_corpus_loader

report = run_comparison(
    corpus=directory_corpus_loader("examples/quickstart/corpus", extensions=(".md",)),
    embedder=toy_embedder,               # any Embedder — see below
    golden_set="examples/quickstart/golden_qa.yaml",
    chunk_sources={
        "whole_section": whole_section_chunker,
        "paragraph": paragraph_chunker,
    },
    k=2,
)

report.to_markdown("report.md")
report.to_json("report.json")

toy_embedder, whole_section_chunker, and paragraph_chunker are tiny example functions defined in examples/quickstart/quickstart.py, alongside this snippet. The script runs today, unmodified, with no API key and no network call:

python examples/quickstart/quickstart.py
whole_section: recall@2=1.00
paragraph: recall@2=1.00
Wrote examples/quickstart/report.md and examples/quickstart/report.json

The embedder there is a dependency-free hashing stand-in, good for proving the plumbing works and not much else. Swap it for something real before trusting the numbers.
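
To make "hashing stand-in" concrete, here is a rough sketch of what such an embedder can look like. It is not the toy_embedder from the quickstart: hashing_embedder, its dim parameter, and the local Vector alias (assumed here to be a plain list of floats) are all illustrative. The only thing taken from this README is the function shape, a list of texts in and a list of vectors out.

```python
import hashlib
import math

Vector = list[float]  # assumption: stand-in for chunkbench's Vector type


def hashing_embedder(dim: int = 64):
    """Return an embed(texts) function that hashes tokens into dim buckets."""

    def embed(texts: list[str]) -> list[Vector]:
        vectors = []
        for text in texts:
            vec = [0.0] * dim
            for token in text.lower().split():
                # A stable hash, so the same token always lands in the same bucket.
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % dim
                vec[bucket] += 1.0
            # L2-normalize so dot products behave like cosine similarity.
            norm = math.sqrt(sum(x * x for x in vec))
            vectors.append([x / norm for x in vec] if norm else vec)
        return vectors

    return embed


embed = hashing_embedder(dim=16)
first, second = embed(["chunk size matters", "chunk size matters"])
print(first == second)  # deterministic: identical text gives identical vectors
```

An embedder like this only notices shared tokens, not meaning, so two paraphrases with no words in common look unrelated. That is fine for checking the plumbing and useless for comparing strategies.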

Bring your own everything

There is exactly one base class in chunkbench you're required to inherit from: none. ChunkSource, Embedder, Generator, and Judge are all plain function shapes (Callable[...]) — wrap whatever you already use and hand it over.

Chunking, with chonkie:

from chonkie import RecursiveChunker
from chunkbench import Chunk, Document

def chonkie_chunker(document: Document) -> list[Chunk]:
    chunker = RecursiveChunker()
    chunks = []
    for slug, section_text in _sections(document.content):   # your own section splitter
        for i, piece in enumerate(chunker(section_text)):
            chunks.append(Chunk(
                id=f"{document.id}-{slug}-{i}", doc_id=document.id,
                section=slug, text=piece.text,
            ))
    return chunks

Embedding, with Gemini's gemini-embedding-001:

from google import genai
from chunkbench import Embedder, Vector

def gemini_embedder(model: str = "gemini-embedding-001") -> Embedder:
    client = genai.Client()
    def embed(texts: list[str]) -> list[Vector]:
        return [e.values for e in client.models.embed_content(model=model, contents=texts).embeddings]
    return embed

from chunkbench import run_comparison

report = run_comparison(
    corpus=my_corpus_loader,
    embedder=gemini_embedder(),
    chunk_sources={"chonkie_recursive": chonkie_chunker},
    golden_set="golden_qa.yaml",
    k=5,
)

Neither chonkie nor google-genai is a chunkbench dependency — install what you need yourself. Full runnable versions, plus the same pattern applied to a judge model, live in docs/providers.md and examples/providers/. Swap in Cohere, Voyage, sentence-transformers, or an in-house model gateway the same way — chunkbench genuinely does not care.

The composable API

For finer control — running only part of the pipeline, or scoring a custom metric:

from chunkbench import Pipeline, registry

@registry.metric("my_custom_metric")
class MyMetric:
    def score(self, retrieved, golden) -> float:
        ...

pipeline = Pipeline(embedder=my_embed_function, golden_set=my_golden_set)
chunks = pipeline.run_chunking(corpus, chunk_source=my_semantic_chunker)
results = pipeline.run_retrieval(chunks, k=5)
scores = pipeline.score(results, metrics=["recall", "precision", "my_custom_metric"])

docs/api-stability.md names exactly which extension points (chunk-source contract, metric registry, embedder/vector-store interfaces) carry a semver stability guarantee — the short version: the things listed above, forever; the internals, whenever we find a better way.
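
As an illustration of the score(self, retrieved, golden) shape shown above, here is a sketch of a reciprocal-rank metric. The argument types are an assumption, since this README does not spell them out: retrieved is treated as a ranked list of section slugs and golden as a collection of correct slugs. The registry decorator is left off so the block runs without chunkbench installed.

```python
class ReciprocalRank:
    """Score 1 / rank of the first correct result, or 0.0 if none is correct."""

    def score(self, retrieved, golden) -> float:
        # Assumption: `retrieved` is ordered best-first and holds section slugs;
        # `golden` supports `in`. chunkbench's real argument types may differ.
        for rank, section in enumerate(retrieved, start=1):
            if section in golden:
                return 1.0 / rank
        return 0.0


metric = ReciprocalRank()
print(metric.score(["cli", "install", "license"], {"install"}))  # first hit at rank 2
```

A rank-aware metric like this complements recall: two strategies can both retrieve the right section within k, while only one of them puts it first.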

CLI

# Config-file-driven — see docs/chunkbench.md for the full schema.
chunkbench run --config chunkbench.yaml

# Flag-driven, for one-off use. --chunkers/--embedder/--generator/--judge
# all take 'module:attribute' import strings — chunkbench doesn't ship
# chunking algorithms or provider integrations, so these point at your
# own code, importable from wherever you run the command.
chunkbench run \
  --corpus ./docs \
  --golden golden_qa.yaml \
  --chunkers whole_section=mypkg.chunkers:whole_section,semantic=mypkg.chunkers:semantic \
  --embedder mypkg.providers:gemini_embedder \
  --k 5

# Re-render a previous run's results.json in another format.
chunkbench report --from results.json --format html
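
The 'module:attribute' strings passed to --chunkers, --embedder, and friends follow a convention common across Python tooling. As a sketch of how such a string is typically resolved (not chunkbench's actual loader, and resolve_import_string is a made-up name), it comes down to importlib plus getattr:

```python
import importlib
from typing import Any


def resolve_import_string(spec: str) -> Any:
    """Resolve 'package.module:attribute' to the object it names."""
    module_name, sep, attribute = spec.partition(":")
    if not sep or not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    obj = importlib.import_module(module_name)
    # Allow dotted attributes such as 'module:Class.method'.
    for part in attribute.split("."):
        obj = getattr(obj, part)
    return obj


sqrt = resolve_import_string("math:sqrt")
print(sqrt(9.0))
```

The practical consequence is the one the comment above already states: the module has to be importable from wherever you run the command, so an ImportError usually means a wrong working directory or a package that is not installed in the active environment.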

A regression_gate section in the config file makes chunkbench run exit non-zero when a metric drops below a threshold ("fail if recall_at_k for semantic drops below 0.8") — drop it into CI as a quality gate on chunking changes instead of finding out in production.
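
Conceptually, a gate like this is a threshold check that turns into a process exit code. The sketch below shows the idea in plain Python. It does not use chunkbench's config schema or internals: the nested-dict layout and both function names are hypothetical, and the real regression_gate keys are documented in docs/chunkbench.md.

```python
def failed_gates(scores: dict[str, dict[str, float]],
                 thresholds: dict[str, dict[str, float]]) -> list[str]:
    """Return one message per (strategy, metric) that misses its minimum."""
    failures = []
    for strategy, minimums in thresholds.items():
        for metric, minimum in minimums.items():
            actual = scores.get(strategy, {}).get(metric)
            if actual is None:
                # A gated metric that was never computed should not pass silently.
                failures.append(f"{strategy}.{metric}: no score reported")
            elif actual < minimum:
                failures.append(f"{strategy}.{metric}: {actual} < {minimum}")
    return failures


def gate_exit_code(scores, thresholds) -> int:
    """0 when every gate passes, 1 otherwise (the value a CI job keys off)."""
    return 1 if failed_gates(scores, thresholds) else 0


# Hypothetical numbers, mirroring the example in the paragraph above.
scores = {"semantic": {"recall_at_k": 0.76}, "whole_section": {"recall_at_k": 0.91}}
thresholds = {"semantic": {"recall_at_k": 0.8}}
for problem in failed_gates(scores, thresholds):
    print("regression gate failed:", problem)
print("exit code:", gate_exit_code(scores, thresholds))
```

In a script you would pass the result of gate_exit_code to sys.exit, and any non-zero status fails the CI step.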

What you get back

A Report, in three flavors: a Python object (iterable/indexable per approach and per question), Markdown (drop into a PR description), and JSON (the stable integration point — schema pinned in docs/report-schema.json).

License

MIT — see LICENSE.
