MTCB - Massive Text Chunking Benchmark. Evaluate your RAG chunking with ease.

🔬 mtcb ✨

The benchmark for evaluating chunking strategies in RAG pipelines.

MTCB (Massive Text Chunking Benchmark) is a standardized evaluation framework for text chunking in RAG systems. It measures how well a chunking and embedding strategy retrieves relevant passages across 9 diverse domains, from legal contracts to scientific papers. It is built on top of Chonkie.

📦 Installation

pip install mtcb

🚀 Quick Start

Run the lightweight nano benchmark to evaluate a chunking strategy in minutes:

from mtcb import NanoBenchmark
from chonkie import RecursiveChunker

benchmark = NanoBenchmark()
result = benchmark.evaluate(
    chunker=RecursiveChunker(chunk_size=512),
    embedding_model="voyage-3-large",
    k=[1, 5, 10],
)
print(result)

🧩 Available Benchmarks

Full Benchmark

The full MTCB benchmark spans 9 domains with ~17k questions across ~3k documents:

Dataset	Domain	Documents	Questions
🧸 Gacha	Classic Literature (Gutenberg)	100	2,878
💼 Ficha	SEC Financial Filings	88	1,331
📝 Macha	GitHub READMEs	445	1,812
💻 Cocha	Multilingual Code	1,000	2,372
📊 Tacha	Financial Tables (TAT-QA)	349	2,065
🔬 Sencha	Scientific Papers (QASPER)	243	1,507
⚖️ Hojicha	Legal Contracts (CUAD)	194	1,568
🏥 Ryokucha	Medical Guidelines (NICE/CDC/WHO)	241	1,351
🎓 Genmaicha	MIT OCW Lecture Transcripts	250	2,037
	Total	2,910	16,921

Nano Benchmark

For fast iteration during development, MTCB provides a lightweight nano benchmark with 100 questions per dataset (900 questions over 108 documents in total). Documents are selected to maximize question density:

Dataset	Domain	Documents	Questions
🧸 nano-gacha	Classic Literature	5	100
💼 nano-ficha	SEC Financial Filings	5	100
📝 nano-macha	GitHub READMEs	19	100
💻 nano-cocha	Multilingual Code	26	100
📊 nano-tacha	Financial Tables	11	100
🔬 nano-sencha	Scientific Papers	13	100
⚖️ nano-hojicha	Legal Contracts	10	100
🏥 nano-ryokucha	Medical Guidelines	12	100
🎓 nano-genmaicha	Lecture Transcripts	7	100
	Total	108	900
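
"Maximize question density" can be read as a greedy pick of the documents that carry the most questions. The sketch below illustrates that reading only. It is an assumption, not MTCB's actual selection code, and `select_dense_documents` and its parameters are hypothetical names.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def select_dense_documents(
    question_doc_ids: Iterable[Hashable],
    target_questions: int = 100,
) -> Tuple[List[Hashable], int]:
    """Greedily pick the documents with the most questions.

    question_doc_ids holds one entry per question: the id of the document
    that question was written against (hypothetical input format).
    Returns the chosen document ids and the number of questions they cover.
    """
    counts = Counter(question_doc_ids)
    selected: List[Hashable] = []
    covered = 0
    # most_common() yields documents from most to fewest questions.
    for doc_id, n_questions in counts.most_common():
        if covered >= target_questions:
            break
        selected.append(doc_id)
        covered += n_questions
    return selected, covered
```

The covered count can overshoot the target, so a real pipeline would still need to trim the question set to exactly 100 per dataset.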

🔧 Usage

MTCB works with Chonkie: any chunker that extends chonkie.BaseChunker is supported out of the box.

Full Benchmark

Run the complete benchmark across all 9 domains:

from mtcb import Benchmark
from chonkie import RecursiveChunker

benchmark = Benchmark()
result = benchmark.evaluate(
    chunker=RecursiveChunker(chunk_size=512),
    embedding_model="voyage-3-large",
    k=[1, 5, 10],
)
print(result)

Individual Evaluators

Run a single domain-specific evaluator:

from mtcb import GachaEvaluator
from chonkie import RecursiveChunker

evaluator = GachaEvaluator(
    chunker=RecursiveChunker(chunk_size=1000),
    embedding_model="voyage-3-large",
    cache_dir="./cache"
)

result = evaluator.evaluate(k=[1, 3, 5, 10])
print(result)

Custom Datasets

Evaluate on your own corpus using SimpleEvaluator:

from mtcb import SimpleEvaluator
from chonkie import RecursiveChunker

evaluator = SimpleEvaluator(
    corpus=["Your document text here...", "Another document..."],
    questions=["What is X?", "How does Y work?"],
    relevant_passages=["passage that must be in retrieved chunk", "another passage"],
    chunker=RecursiveChunker(chunk_size=1000),
    embedding_model="voyage-3-large",
)

result = evaluator.evaluate(k=[1, 3, 5, 10])
print(result)
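
Since each entry in `relevant_passages` is text that must appear in a retrieved chunk, it can help to confirm beforehand that every passage occurs verbatim somewhere in the corpus. This is a minimal sketch, assuming one passage per question (parallel lists) and exact substring matching. `find_unmatched_passages` is a hypothetical helper, not part of MTCB.

```python
from typing import List, Sequence


def find_unmatched_passages(
    corpus: Sequence[str],
    questions: Sequence[str],
    relevant_passages: Sequence[str],
) -> List[int]:
    """Return indices of questions whose passage is found in no document.

    Assumes questions[i] pairs with relevant_passages[i] and that a passage
    counts as present only if it is an exact substring of a document.
    """
    if len(questions) != len(relevant_passages):
        raise ValueError(
            f"{len(questions)} questions but {len(relevant_passages)} passages"
        )
    return [
        i
        for i, passage in enumerate(relevant_passages)
        if not any(passage in document for document in corpus)
    ]
```

An empty result means every passage is recoverable in principle. A non-empty one points to questions that no chunking strategy could answer under exact matching.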

Dataset Generation

Generate verified QA datasets from your own documents:

from mtcb import DatasetGenerator

generator = DatasetGenerator(deduplicate=True)
result = generator.generate(
    corpus=["Your document text..."],
    samples_per_document=10,
    output_path="./output.jsonl",
)

print(f"Generated {result.total_verified} verified samples")
for sample in result.samples:
    print(f"Q: {sample.question}")
    print(f"A: {sample.answer}")
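
The generator writes to a `.jsonl` path, which conventionally means one JSON object per line. The sketch below reads such a file with the standard library. The record schema is not documented here, so the helper makes no assumption about field names. `load_jsonl` is a hypothetical helper, not part of MTCB.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def load_jsonl(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records: List[Dict[str, Any]] = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example (path taken from the snippet above):
# records = load_jsonl("./output.jsonl")
# print(len(records), sorted(records[0].keys()) if records else [])
```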

📊 Metrics

MTCB evaluates retrieval quality using:

Recall@k: Percentage of questions for which the relevant passage appears in the top-k results
Precision@k: Fraction of the top-k results that are relevant
MRR@k: Mean Reciprocal Rank, the average over questions of 1/rank of the first relevant result within the top k
NDCG@k: Normalized Discounted Cumulative Gain, a position-weighted relevance score in which hits near the top count more
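
These four metrics can be sketched per question from a ranked list of binary relevance flags, then averaged over all questions. The code below uses the textbook definitions and is not taken from MTCB's source. The function names, the zero score when nothing relevant is retrieved, and the choice to compute the ideal ranking from the retrieved list alone are all assumptions.

```python
import math
from typing import Sequence


def recall_at_k(relevance: Sequence[bool], k: int) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k slots filled by relevant results."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def mrr_at_k(relevance: Sequence[bool], k: int) -> float:
    """Reciprocal rank of the first relevant result in the top k, else 0.0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def _dcg(relevance: Sequence[bool]) -> float:
    # Each hit is discounted by log2(rank + 1), so rank 1 counts fully.
    return sum(
        is_relevant / math.log2(rank + 1)
        for rank, is_relevant in enumerate(relevance, start=1)
    )


def ndcg_at_k(relevance: Sequence[bool], k: int) -> float:
    """DCG of the top k divided by the DCG of the best possible reordering."""
    ideal = sorted(relevance, reverse=True)[:k]
    ideal_dcg = _dcg(ideal)
    return _dcg(relevance[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean(values: Sequence[float]) -> float:
    """Average per-question scores into a benchmark-level score."""
    return sum(values) / len(values) if values else 0.0
```

For example, with flags `[False, True, False, True, False]` and k = 3, the first hit sits at rank 2, so the reciprocal rank is 0.5 and one of the three slots is relevant.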

📚 Citation

If you use MTCB in your research, please cite:

@software{mtcb2025,
  author = {Bhavnick Minhas and Shreyash Nigam},
  title = {MTCB: Massive Text Chunking Benchmark},
  url = {https://github.com/chonkie-inc/mtcb},
  version = {0.1.0},
  year = {2025},
}

This version: 0.1.0 (Feb 7, 2026)

Download files

Source Distribution

mtcb-0.1.0.tar.gz (37.8 kB)

Uploaded Feb 7, 2026 Source

Built Distribution

mtcb-0.1.0-py3-none-any.whl (47.4 kB)

Uploaded Feb 7, 2026 Python 3

File details

Details for the file mtcb-0.1.0.tar.gz.

File metadata

Download URL: mtcb-0.1.0.tar.gz
Upload date: Feb 7, 2026
Size: 37.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for mtcb-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5d2175203845db3d66030aabbffbe0dd9dc831a5c32800f8e9a31b552b378917`
MD5	`13d9f6fdbd8c00c5a9c1f579be4e4691`
BLAKE2b-256	`684f47ea337c09fdb6ae3aa976bb3ab035f51b6466fea3a750b3b76110f33254`

File details

Details for the file mtcb-0.1.0-py3-none-any.whl.

File metadata

Download URL: mtcb-0.1.0-py3-none-any.whl
Upload date: Feb 7, 2026
Size: 47.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.10.0 {"installer":{"name":"uv","version":"0.10.0","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for mtcb-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f0f9cfea1c34714530db4bdf0cc303527801cd2c8043e4362f6870cec862c675`
MD5	`8c97dcdbfe618981ae004b352dab2532`
BLAKE2b-256	`2d677c4618ef8c3f4d28a921404e8a4a85294b6f0cfe54915cc2668e8d2f6cce`
