MTCB - Massive Text Chunking Benchmark. Evaluate your RAG chunking with ease.

🔬 mtcb ✨

The benchmark for evaluating chunking strategies in RAG pipelines.

MTCB (Massive Text Chunking Benchmark) is a standardized evaluation framework for text chunking in RAG systems. It measures how well your chunking and embedding strategy retrieves relevant passages across 9 diverse domains, from legal contracts to scientific papers. Built on top of Chonkie.

📦 Installation

pip install mtcb

🚀 Quick Start

Run the lightweight nano benchmark to evaluate a chunking strategy in minutes:

from mtcb import NanoBenchmark
from chonkie import RecursiveChunker

benchmark = NanoBenchmark()
result = benchmark.evaluate(
    chunker=RecursiveChunker(chunk_size=512),
    embedding_model="voyage-3-large",
    k=[1, 5, 10],
)
print(result)

🧩 Available Benchmarks

Full Benchmark

The full MTCB benchmark spans 9 domains with ~17k questions across ~3k documents:

| Dataset | Domain | Documents | Questions |
|---|---|---|---|
| 🧸 Gacha | Classic Literature (Gutenberg) | 100 | 2,878 |
| 💼 Ficha | SEC Financial Filings | 88 | 1,331 |
| 📝 Macha | GitHub READMEs | 445 | 1,812 |
| 💻 Cocha | Multilingual Code | 1,000 | 2,372 |
| 📊 Tacha | Financial Tables (TAT-QA) | 349 | 2,065 |
| 🔬 Sencha | Scientific Papers (QASPER) | 243 | 1,507 |
| ⚖️ Hojicha | Legal Contracts (CUAD) | 194 | 1,568 |
| 🏥 Ryokucha | Medical Guidelines (NICE/CDC/WHO) | 241 | 1,351 |
| 🎓 Genmaicha | MIT OCW Lecture Transcripts | 250 | 2,037 |
| Total | | 2,910 | 16,921 |

Nano Benchmark

For fast iteration during development, MTCB provides a lightweight nano benchmark with 100 questions per dataset (900 in total). Documents are selected to maximize question density, so each subset needs only a handful of documents:

| Dataset | Domain | Documents | Questions |
|---|---|---|---|
| 🧸 nano-gacha | Classic Literature | 5 | 100 |
| 💼 nano-ficha | SEC Financial Filings | 5 | 100 |
| 📝 nano-macha | GitHub READMEs | 19 | 100 |
| 💻 nano-cocha | Multilingual Code | 26 | 100 |
| 📊 nano-tacha | Financial Tables | 11 | 100 |
| 🔬 nano-sencha | Scientific Papers | 13 | 100 |
| ⚖️ nano-hojicha | Legal Contracts | 10 | 100 |
| 🏥 nano-ryokucha | Medical Guidelines | 12 | 100 |
| 🎓 nano-genmaicha | Lecture Transcripts | 7 | 100 |
| Total | | 108 | 900 |
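
This README does not describe how the nano documents were picked beyond "maximize question density". As a rough illustration only, one plausible reading is a greedy selection: count the questions attached to each document, then take documents from most to fewest questions until the target is covered. The function name `select_dense_documents` and its parameters are hypothetical and not part of MTCB.

```python
from collections import Counter
from typing import Iterable


def select_dense_documents(
    question_doc_ids: Iterable[str], target_questions: int = 100
) -> list[str]:
    """Greedy sketch (an assumption, not MTCB's actual procedure).

    `question_doc_ids` holds one document id per question. Documents are
    taken in order of decreasing question count until at least
    `target_questions` questions are covered.
    """
    counts = Counter(question_doc_ids)
    selected: list[str] = []
    covered = 0
    # most_common() yields (doc_id, count) pairs from most to fewest questions.
    for doc_id, n_questions in counts.most_common():
        if covered >= target_questions:
            break
        selected.append(doc_id)
        covered += n_questions
    return selected
```

Under this reading, a dataset whose questions are concentrated in a few long documents (such as the literature set) would need fewer documents to reach 100 questions than one with many short documents (such as the code set), which is consistent with the document counts in the table.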

🔧 Usage

MTCB works with Chonkie — any chunker that extends chonkie.BaseChunker is supported out of the box.

Full Benchmark

Run the complete benchmark across all 9 domains:

from mtcb import Benchmark
from chonkie import RecursiveChunker

benchmark = Benchmark()
result = benchmark.evaluate(
    chunker=RecursiveChunker(chunk_size=512),
    embedding_model="voyage-3-large",
    k=[1, 5, 10],
)
print(result)

Individual Evaluators

Run a single domain-specific evaluator:

from mtcb import GachaEvaluator
from chonkie import RecursiveChunker

evaluator = GachaEvaluator(
    chunker=RecursiveChunker(chunk_size=1000),
    embedding_model="voyage-3-large",
    cache_dir="./cache"
)

result = evaluator.evaluate(k=[1, 3, 5, 10])
print(result)

Custom Datasets

Evaluate on your own corpus using SimpleEvaluator:

from mtcb import SimpleEvaluator
from chonkie import RecursiveChunker

evaluator = SimpleEvaluator(
    corpus=["Your document text here...", "Another document..."],
    questions=["What is X?", "How does Y work?"],
    relevant_passages=["passage that must be in retrieved chunk", "another passage"],
    chunker=RecursiveChunker(chunk_size=1000),
    embedding_model="voyage-3-large",
)

result = evaluator.evaluate(k=[1, 3, 5, 10])
print(result)
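
The example above suggests that `questions` and `relevant_passages` are index-aligned (question i pairs with passage i) and that a passage has to appear inside a retrieved chunk to count. A small pre-flight check can catch misaligned or unmatched entries before spending time on embedding. The sketch below is not part of MTCB; `find_unmatched_passages` is a hypothetical helper, and verbatim substring matching is an assumption about how passages are compared.

```python
from typing import Sequence


def find_unmatched_passages(
    corpus: Sequence[str],
    questions: Sequence[str],
    relevant_passages: Sequence[str],
) -> list[int]:
    """Return indices of passages that occur verbatim in no corpus document.

    Assumes questions and passages are parallel lists; raises ValueError
    if their lengths differ.
    """
    if len(questions) != len(relevant_passages):
        raise ValueError(
            "questions and relevant_passages must have the same length"
        )
    return [
        index
        for index, passage in enumerate(relevant_passages)
        if not any(passage in document for document in corpus)
    ]
```

An empty result means every passage can be found somewhere in the corpus. A non-empty result often points to whitespace or encoding differences between the passage and the source document.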

Dataset Generation

Generate verified QA datasets from your own documents:

from mtcb import DatasetGenerator

generator = DatasetGenerator(deduplicate=True)
result = generator.generate(
    corpus=["Your document text..."],
    samples_per_document=10,
    output_path="./output.jsonl",
)

print(f"Generated {result.total_verified} verified samples")
for sample in result.samples:
    print(f"Q: {sample.question}")
    print(f"A: {sample.answer}")
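
To inspect the file written to `output_path` later, it can be read back with the standard library alone. This sketch assumes the file follows the usual JSON Lines convention (one JSON object per line), which the `.jsonl` extension suggests but this README does not spell out. It makes no assumption about which keys each record contains, and `read_jsonl` is a hypothetical helper rather than an MTCB function.

```python
import json
from pathlib import Path


def read_jsonl(path: str) -> list[dict]:
    """Parse one JSON object per non-empty line of the file at `path`."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

Calling `read_jsonl("./output.jsonl")` and printing the keys of the first record is a quick way to see what the generator actually stored.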

📊 Metrics

MTCB evaluates retrieval quality using:

  • Recall@k: Percentage of questions where the relevant passage appears in the top-k results
  • Precision@k: Fraction of the top-k retrieved chunks that are relevant
  • MRR@k: Mean Reciprocal Rank — how high the first relevant result ranks
  • NDCG@k: Normalized Discounted Cumulative Gain — position-weighted relevance scoring
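
For readers who want to see the definitions concretely, here is a minimal per-question sketch using binary relevance. It is not MTCB's implementation. It assumes each question yields a ranked list of booleans (True where the retrieved chunk is relevant), that Precision@k divides by k even when fewer chunks are returned, and that the ideal NDCG ordering is taken over the retrieved list itself. A benchmark score would then be the mean of these values over all questions.

```python
import math
from typing import Sequence


def recall_at_k(relevance: Sequence[bool], k: int) -> float:
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the k result slots that hold a relevant chunk."""
    return sum(relevance[:k]) / k


def mrr_at_k(relevance: Sequence[bool], k: int) -> float:
    """Reciprocal rank of the first relevant chunk in the top-k, or 0.0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def _dcg(flags: Sequence[bool]) -> float:
    """Discounted cumulative gain with gain 1 per relevant chunk."""
    return sum(
        1.0 / math.log2(rank + 1)
        for rank, is_relevant in enumerate(flags, start=1)
        if is_relevant
    )


def ndcg_at_k(relevance: Sequence[bool], k: int) -> float:
    """DCG of the top-k divided by the DCG of the best possible reordering."""
    ideal = _dcg(sorted(relevance, reverse=True)[:k])
    return _dcg(relevance[:k]) / ideal if ideal > 0 else 0.0


# Example: relevant chunks were retrieved at ranks 2 and 4.
flags = [False, True, False, True, False]
for k in (1, 3, 5):
    print(k, recall_at_k(flags, k), precision_at_k(flags, k),
          mrr_at_k(flags, k), ndcg_at_k(flags, k))
```

Recall@k and MRR@k only care about the first relevant chunk, while Precision@k and NDCG@k also reward retrieving several relevant chunks, which is why the four numbers can move in different directions as chunk size changes.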

📚 Citation

If you use MTCB in your research, please cite:

@software{mtcb2025,
  author = {Bhavnick Minhas and Shreyash Nigam},
  title = {MTCB: Massive Text Chunking Benchmark},
  url = {https://github.com/chonkie-inc/mtcb},
  version = {0.1.0},
  year = {2025},
}
