MM-RAGBench
MM-RAGBench is a benchmark dataset for evaluating multimodal Retrieval-Augmented Generation systems. It is built as a native companion to mmeval-vrag — install, load, and evaluate in 5 lines.
The Problem
Existing RAG benchmarks evaluate one modality at a time: RAGBench and NoMIRACL cover text-only retrieval and generation, while ViDoRe covers visual document retrieval without generation. Real-world RAG systems retrieve images alongside text: medical scans with clinical notes, product photos with descriptions, diagrams with documentation. None of these benchmarks test whether a system can reason across both modalities and avoid hallucinating visual details.
What MM-RAGBench Provides
- 3,000 multimodal queries across 6 domains requiring cross-modal reasoning
- 12,000 candidate documents (image-text pairs) from open-license sources
- Fine-grained annotations: hallucination traps (object/attribute/relation/fabrication), faithfulness criteria
- Native mmeval-vrag integration: loads directly as EvalSample or QueryItem objects
- All 11 mmeval-vrag metrics work out of the box
Install
pip install mm-ragbench
This pulls in mmeval-vrag automatically.
Quick Start
Option 1: Evaluate pre-computed answers
from mm_ragbench import load_eval_samples
from mmeval_vrag import MultimodalRAGEvaluator, EvalConfig
# Load benchmark as mmeval-vrag EvalSample objects
samples = load_eval_samples(split="test", max_samples=100)
# Your system fills in retrieved docs and generated_answer for each sample
for sample in samples:
    docs = your_retriever(sample.query_text, sample.query_image)
    sample.retrieved = docs
    sample.generated_answer = your_generator(sample.query_text, docs)
# Evaluate with mmeval-vrag (all 11 metrics)
evaluator = MultimodalRAGEvaluator(config=EvalConfig(metrics=["all"]))
results = evaluator.evaluate(samples)
print(results.summary())
Option 2: Evaluate a live pipeline
from mm_ragbench import load_query_items
from mmeval_vrag.evaluators.pipeline import EvalPipeline
from mmeval_vrag import EvalConfig
# Load benchmark as mmeval-vrag QueryItem objects
queries = load_query_items(split="test")
# Plug in your retriever + generator
pipeline = EvalPipeline(
    retriever=my_retriever,  # (query_text, query_image, top_k) → List[RetrievedItem]
    generator=my_generator,  # (query_text, contexts) → str
    config=EvalConfig(metrics=["all"]),
)
results = pipeline.run(queries)
results.to_json("my_system_results.json")
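EvalPipeline only needs two callables with the signatures noted in the comments above. Below is a minimal sketch of what they might look like; the RetrievedItem import path and its field names, as well as my_index and my_vlm, are assumptions for illustration rather than documented API.

```python
from typing import List

# Assumed import path for RetrievedItem; adjust to wherever mmeval-vrag exposes it.
from mmeval_vrag import RetrievedItem


def my_retriever(query_text: str, query_image, top_k: int = 5) -> List[RetrievedItem]:
    # my_index is a placeholder for your own multimodal index / vector store.
    hits = my_index.search(query_text, query_image, top_k=top_k)
    # RetrievedItem field names are assumed here for illustration.
    return [RetrievedItem(doc_id=h.id, text=h.text, image=h.image, score=h.score) for h in hits]


def my_generator(query_text: str, contexts) -> str:
    # Assumption: each context exposes a .text field (adjust if mmeval-vrag passes strings).
    ctx = "\n".join(getattr(c, "text", str(c)) for c in contexts)
    # my_vlm is a placeholder for your own VLM/LLM client.
    return my_vlm.generate(f"{query_text}\n\nContext:\n{ctx}")
```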
Option 3: Use via mmeval-vrag's loader registry
from mmeval_vrag.datasets import load_dataset
import mm_ragbench # registers the "mm_ragbench" loader
samples = load_dataset("mm_ragbench", "EmmanuelleB985/mm-ragbench", split="test")
Metrics (via mmeval-vrag)
All 11 metrics from mmeval-vrag are supported:
| Category | Metric | What it measures |
|---|---|---|
| Retrieval | retrieval_precision | Fraction of top-K retrieved items that are relevant |
| | retrieval_recall | Fraction of relevant items found in the top-K |
| | retrieval_mrr | Reciprocal rank of the first relevant item |
| | retrieval_ndcg | Normalised DCG, accounting for rank position |
| Generation | faithfulness | Are generated claims supported by the retrieved context? |
| | hallucination_rate | Fraction of unsupported claims (lower is better) |
| | answer_relevance | Similarity between the answer and the query |
| | context_relevance | Relevance of retrieved passages to the query |
| Cross-Modal | cross_modal_alignment | CLIP similarity: images ↔ query |
| | visual_grounding | CLIP similarity: images ↔ answer |
| | multimodal_consistency | CLIP similarity within (image, text) pairs |
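The examples above pass metrics=["all"]. If you only want to score part of the table, it is likely that EvalConfig also accepts the individual metric names listed here; a small sketch under that assumption:

```python
from mmeval_vrag import MultimodalRAGEvaluator, EvalConfig

# Assumption: EvalConfig(metrics=[...]) accepts the metric names from the table above,
# not only "all". Here we score just the cross-modal metrics.
config = EvalConfig(metrics=["cross_modal_alignment", "visual_grounding", "multimodal_consistency"])
evaluator = MultimodalRAGEvaluator(config=config)

results = evaluator.evaluate(samples)  # samples prepared as in Quick Start, Option 1
print(results.summary())
```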
Dataset Schema
Each sample maps to mmeval-vrag types:
MM-RAGBench field → mmeval-vrag type
─────────────────────────────────────────────
query / query_image → EvalSample.query_text / query_image
gold_doc_texts/images → EvalSample.retrieved (List[RetrievedItem])
gold_answer → EvalSample.reference_answer
domain, difficulty, ... → EvalSample.metadata
hallucination_traps → EvalSample.metadata["hallucination_traps"]
faithfulness_criteria → EvalSample.metadata["faithfulness_criteria"]
gold_doc_ids → QueryItem.relevant_ids (for EvalPipeline)
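To see this mapping on a concrete record, here is a short sketch, assuming load_eval_samples returns a list of EvalSample objects as in Quick Start:

```python
from mm_ragbench import load_eval_samples

sample = load_eval_samples(split="test", max_samples=1)[0]

print(sample.query_text)                        # the benchmark query
print(sample.reference_answer)                  # gold_answer
print(sample.metadata["domain"], sample.metadata["difficulty"])
print(sample.metadata["hallucination_traps"])   # annotated failure modes
```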
Annotation Fields
Each query includes:
- hallucination_traps: Known failure modes, e.g. {"type": "attribute", "description": "May confuse bridge completion year (1937) with construction start (1933)"}
- faithfulness_criteria: Verifiable checks, e.g. "Must identify bridge type from visual features"
- answer_modality: Whether the answer needs text_only, image_only, or cross_modal reasoning
- difficulty: easy (40%), medium (35%), hard (25%)
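These fields make it easy to slice the benchmark for targeted evaluation. A sketch, assuming answer_modality and difficulty are carried in EvalSample.metadata alongside domain, as the schema above suggests:

```python
from mm_ragbench import load_eval_samples

samples = load_eval_samples(split="test")

# Keep only the hard queries that require reasoning over both modalities.
hard_cross_modal = [
    s for s in samples
    if s.metadata.get("difficulty") == "hard"
    and s.metadata.get("answer_modality") == "cross_modal"
]
print(f"{len(hard_cross_modal)} hard cross-modal queries")
```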
Domains
| Domain | Queries | Sources |
|---|---|---|
| Science & Nature | 500 | Wikipedia diagrams, species photos, experiments |
| Geography & Travel | 500 | Landmarks, maps, cultural sites |
| History & Art | 500 | Historical photos, artworks, architecture |
| Technology | 500 | Product images, diagrams, interfaces |
| Food & Cooking | 500 | Recipe images, ingredients, techniques |
| Daily Life | 500 | Everyday objects, how-to guides, sports |
Build the Dataset from Scratch
from mm_ragbench import MMRAGBenchBuilder
builder = MMRAGBenchBuilder(
    output_dir="data/mm-ragbench",
    storage_mode="urls",                   # "urls" (~10 MB) | "thumbnails" (~300 MB) | "full" (~50 GB)
    llm_provider="anthropic",              # or "openai"
    llm_model="claude-sonnet-4-20250514",  # or "gpt-4o"
)
builder.collect_sources() # Pull from WIT + COCO (CC-BY/CC-BY-SA)
builder.generate_queries() # LLM generates queries + annotations
builder.generate_hard_negatives() # Same-domain distractors
builder.verify_and_balance() # Balance to 3,000 queries
builder.export_jsonl() # JSONL compatible with mmeval-vrag
builder.push_to_hub("EmmanuelleB985/mm-ragbench")
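The export step writes plain JSONL, so the records can also be inspected without mmeval-vrag. A sketch, where the exact file path under output_dir is an assumption and the per-record field names are taken from the Dataset Schema section:

```python
import json
from pathlib import Path

# Assumption: export_jsonl() writes one or more *.jsonl files under output_dir.
path = next(Path("data/mm-ragbench").glob("*.jsonl"))

with open(path) as f:
    records = [json.loads(line) for line in f]

# Field names follow the Dataset Schema section above.
print(len(records), "records")
print(records[0]["query"], "->", records[0]["gold_answer"])
```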
Storage Modes
| Mode | Disk | What's saved | Image access |
|---|---|---|---|
"urls" |
~10 MB | Text + Wikimedia/COCO image URLs | Fetched on demand during eval |
"thumbnails" |
~300 MB | 224px JPEG thumbnails | Local files |
"full" |
~50 GB | Original resolution images | Local files |
The default is "thumbnails" — good balance of quality and size. Use "urls" if disk is tight; images are fetched lazily when mmeval-vrag's CLIP metrics need them.
Per-Domain Analysis
MM-RAGBench metadata enables granular analysis:
from mm_ragbench import load_eval_samples

samples = load_eval_samples(split="test")

# Group per-sample scores by domain. `results` is the object returned by
# evaluator.evaluate(samples) on these samples (see Quick Start), so the two
# lists are aligned.
by_domain = {}
for sample, result in zip(samples, results.results):
    domain = sample.metadata["domain"]
    by_domain.setdefault(domain, []).append(result.scores)

for domain, scores in sorted(by_domain.items()):
    faith = [s["faithfulness"] for s in scores if "faithfulness" in s]
    halluc = [s["hallucination_rate"] for s in scores if "hallucination_rate" in s]
    print(f"{domain}: faithfulness={sum(faith)/len(faith):.3f} hallucination={sum(halluc)/len(halluc):.3f}")
Leaderboard
| System | Retrieval R@5 | Faithfulness | Hallucination ↓ | Cross-Modal | Overall |
|---|---|---|---|---|---|
| Submit yours | — | — | — | — | — |
Submit by opening a PR with your results.json (exported via results.to_json()).
Data Sources & Licensing
| Source | License | Used for |
|---|---|---|
| Wikipedia (via WIT) | CC-BY-SA 3.0 | Article text + images |
| Wikimedia Commons | CC-BY / CC-BY-SA | Images |
| COCO | CC-BY 4.0 | Everyday scene images |
| Generated annotations | CC-BY 4.0 | Queries, answers, traps |
Citation
@software{bourigault2026mmragbench,
  author = {Bourigault, Emmanuelle},
  title  = {MM-RAGBench: Multimodal Benchmark for Evaluating RAG Systems},
  year   = {2026},
  url    = {https://github.com/EmmanuelleB985/mm-ragbench},
}

@software{bourigault2025mmeval,
  author = {Bourigault, Emmanuelle},
  title  = {mmeval-vrag: Evaluation Framework for Multimodal Vision-Language RAG Systems},
  year   = {2025},
  url    = {https://github.com/EmmanuelleB985/mmeval-vrag},
}
License
Code: Apache 2.0 · Dataset: CC-BY 4.0 · Source images: per-sample (all CC-BY or CC-BY-SA)