MTCB - Massive Text Chunking Benchmark. Evaluate your RAG chunking with ease.
🔬 mtcb ✨
The benchmark for evaluating chunking strategies in RAG pipelines.
MTCB (Massive Text Chunking Benchmark) is a standardized evaluation framework for text chunking in RAG systems. It measures how well your chunking and embedding strategy retrieves relevant passages across 9 diverse domains, from legal contracts to scientific papers. Built on top of Chonkie.
📦 Installation
pip install mtcb
🚀 Quick Start
Run the lightweight nano benchmark to evaluate a chunking strategy in minutes:
from mtcb import NanoBenchmark
from chonkie import RecursiveChunker

benchmark = NanoBenchmark()
result = benchmark.evaluate(
    chunker=RecursiveChunker(chunk_size=512),
    embedding_model="voyage-3-large",
    k=[1, 5, 10],
)
print(result)
🧩 Available Benchmarks
Full Benchmark
The full MTCB benchmark spans 9 domains with ~17k questions across ~3k documents:
| Dataset | Domain | Documents | Questions |
|---|---|---|---|
| 🧸 Gacha | Classic Literature (Gutenberg) | 100 | 2,878 |
| 💼 Ficha | SEC Financial Filings | 88 | 1,331 |
| 📝 Macha | GitHub READMEs | 445 | 1,812 |
| 💻 Cocha | Multilingual Code | 1,000 | 2,372 |
| 📊 Tacha | Financial Tables (TAT-QA) | 349 | 2,065 |
| 🔬 Sencha | Scientific Papers (QASPER) | 243 | 1,507 |
| ⚖️ Hojicha | Legal Contracts (CUAD) | 194 | 1,568 |
| 🏥 Ryokucha | Medical Guidelines (NICE/CDC/WHO) | 241 | 1,351 |
| 🎓 Genmaicha | MIT OCW Lecture Transcripts | 250 | 2,037 |
| Total | | 2,910 | 16,921 |
Nano Benchmark
For fast iteration during development, MTCB provides a lightweight nano benchmark with 100 questions per dataset (900 in total). Documents are selected to maximize question density:
| Dataset | Domain | Documents | Questions |
|---|---|---|---|
| 🧸 nano-gacha | Classic Literature | 5 | 100 |
| 💼 nano-ficha | SEC Financial Filings | 5 | 100 |
| 📝 nano-macha | GitHub READMEs | 19 | 100 |
| 💻 nano-cocha | Multilingual Code | 26 | 100 |
| 📊 nano-tacha | Financial Tables | 11 | 100 |
| 🔬 nano-sencha | Scientific Papers | 13 | 100 |
| ⚖️ nano-hojicha | Legal Contracts | 10 | 100 |
| 🏥 nano-ryokucha | Medical Guidelines | 12 | 100 |
| 🎓 nano-genmaicha | Lecture Transcripts | 7 | 100 |
| Total | | 108 | 900 |
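One plausible reading of "maximize question density" is a greedy pick of the documents with the most questions until the question budget is met. The sketch below illustrates that idea only; the function name `select_dense_documents`, its arguments, and the greedy rule are assumptions, not MTCB's actual selection code.

```python
def select_dense_documents(questions_per_doc: dict[str, int], target: int = 100) -> list[str]:
    """Greedily pick the documents with the most questions until `target` is covered.

    Hypothetical helper: `questions_per_doc` maps a document id to the number of
    questions written against that document.
    """
    selected: list[str] = []
    covered = 0
    # Visit documents from most to fewest questions.
    ranked = sorted(questions_per_doc.items(), key=lambda item: item[1], reverse=True)
    for doc_id, question_count in ranked:
        if covered >= target:
            break
        selected.append(doc_id)
        covered += question_count
    return selected


if __name__ == "__main__":
    counts = {"doc-a": 60, "doc-b": 10, "doc-c": 45, "doc-d": 5}
    print(select_dense_documents(counts, target=100))
```

Picking dense documents keeps the corpus small, so chunking and embedding stay cheap while each dataset still contributes a full set of questions.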
🔧 Usage
MTCB works with Chonkie: any chunker that extends chonkie.BaseChunker is supported out of the box.
Full Benchmark
Run the complete benchmark across all 9 domains:
from mtcb import Benchmark
from chonkie import RecursiveChunker

benchmark = Benchmark()
result = benchmark.evaluate(
    chunker=RecursiveChunker(chunk_size=512),
    embedding_model="voyage-3-large",
    k=[1, 5, 10],
)
print(result)
Individual Evaluators
Run a single domain-specific evaluator:
from mtcb import GachaEvaluator
from chonkie import RecursiveChunker

evaluator = GachaEvaluator(
    chunker=RecursiveChunker(chunk_size=1000),
    embedding_model="voyage-3-large",
    cache_dir="./cache",
)
result = evaluator.evaluate(k=[1, 3, 5, 10])
print(result)
Custom Datasets
Evaluate on your own corpus using SimpleEvaluator:
from mtcb import SimpleEvaluator
from chonkie import RecursiveChunker

evaluator = SimpleEvaluator(
    corpus=["Your document text here...", "Another document..."],
    questions=["What is X?", "How does Y work?"],
    relevant_passages=["passage that must be in retrieved chunk", "another passage"],
    chunker=RecursiveChunker(chunk_size=1000),
    embedding_model="voyage-3-large",
)
result = evaluator.evaluate(k=[1, 3, 5, 10])
print(result)
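The relevant_passages argument implies that a retrieved chunk counts as a hit when it contains the expected passage. A minimal sketch of such a containment check follows; the helper names and the whitespace normalization are assumptions for illustration, and MTCB's own matching logic may differ.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not defeat matching (assumed behavior)."""
    return " ".join(text.split())


def chunk_contains_passage(chunk_text: str, passage: str) -> bool:
    """True if the passage appears verbatim (modulo whitespace) inside the chunk."""
    return normalize(passage) in normalize(chunk_text)


def relevance_flags(ranked_chunks: list[str], passage: str) -> list[bool]:
    """One flag per retrieved chunk, in rank order."""
    return [chunk_contains_passage(chunk, passage) for chunk in ranked_chunks]


if __name__ == "__main__":
    # The passage sits fully inside the second chunk.
    print(relevance_flags(["foo bar", "the  quick\nbrown fox"], "quick brown"))
    # A passage split across a chunk boundary is found in neither chunk.
    print(relevance_flags(["alpha beta", "gamma delta"], "beta gamma"))
```

Under this reading, a chunker that cuts through a relevant passage scores a miss even when both halves are retrieved, which is why chunk size and boundary placement affect the results.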
Dataset Generation
Generate verified QA datasets from your own documents:
from mtcb import DatasetGenerator

generator = DatasetGenerator(deduplicate=True)
result = generator.generate(
    corpus=["Your document text..."],
    samples_per_document=10,
    output_path="./output.jsonl",
)
print(f"Generated {result.total_verified} verified samples")
for sample in result.samples:
    print(f"Q: {sample.question}")
    print(f"A: {sample.answer}")
📊 Metrics
MTCB evaluates retrieval quality using:
- Recall@k: Percentage of questions where the relevant passage appears in the top-k results
- Precision@k: Fraction of the top-k retrieved chunks that are relevant
- MRR@k: Mean Reciprocal Rank, the reciprocal of the rank of the first relevant result within the top k, averaged over questions
- NDCG@k: Normalized Discounted Cumulative Gain, a position-weighted relevance score that rewards relevant results ranked higher
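To make the four metrics concrete, here is a minimal per-question sketch using the standard binary-relevance definitions. It is not MTCB's implementation: the function names are hypothetical, the input is assumed to be a rank-ordered list of booleans (one per retrieved chunk), and the NDCG ideal ranking is assumed to place all relevant chunks from that list first. Benchmark-level scores would be the mean of these values over all questions.

```python
import math
from typing import Sequence


def recall_at_k(relevant: Sequence[bool], k: int) -> float:
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(relevant[:k]) else 0.0


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the k result slots filled by a relevant chunk."""
    return sum(relevant[:k]) / k


def mrr_at_k(relevant: Sequence[bool], k: int) -> float:
    """Reciprocal rank of the first relevant result in the top k, or 0.0 if none."""
    for rank, hit in enumerate(relevant[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevant: Sequence[bool], k: int) -> float:
    """Binary-gain DCG over the top k, divided by the DCG of an ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, hit in enumerate(relevant[:k], start=1)
        if hit
    )
    # Assumption: the ideal ranking puts every relevant chunk in the list first.
    ideal_hits = min(sum(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    flags = [False, True, False, True, False]  # relevance of ranks 1..5
    for k in (1, 3, 5):
        print(
            k,
            recall_at_k(flags, k),
            round(precision_at_k(flags, k), 3),
            mrr_at_k(flags, k),
            round(ndcg_at_k(flags, k), 3),
        )
```

Recall@k and MRR@k depend only on the first relevant chunk, while Precision@k and NDCG@k also respond to how many relevant chunks appear and where they sit in the ranking.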
📚 Citation
If you use MTCB in your research, please cite:
@software{mtcb2025,
  author = {Bhavnick Minhas and Shreyash Nigam},
  title = {MTCB: Massive Text Chunking Benchmark},
  url = {https://github.com/chonkie-inc/mtcb},
  version = {0.1.0},
  year = {2025},
}