Benchmark for evaluating grounding quality in multimodal RAG systems
mmrag-eval
📝 Blog post: Beyond Text: Why Multimodal RAG Needs Its Own Evaluation Benchmark
mmrag-eval is an open-source benchmark for evaluating the grounding quality of multimodal Retrieval-Augmented Generation (RAG) systems — measuring not just whether a system retrieves images, but whether its generated answers stay within, and faithfully reflect, the visual evidence it retrieved.
Problem Statement
Multimodal RAG systems retrieve images to ground their responses, but standard benchmarks only measure answer correctness against a reference string. This leaves two critical failure modes undetected:
- Hallucination — answers that are plausible but not supported by the retrieved images.
- Redundancy — retrieval pipelines that pad results with near-duplicate images, inflating apparent recall while reducing evidence diversity.
mmrag-eval provides three independent, composable metrics, reported side by side, that together characterize how well a multimodal RAG system grounds its answers in the images it retrieves.
Dataset
A sample dataset of 50 annotated image–query pairs is available on HuggingFace:
from datasets import load_dataset
dataset = load_dataset("ritaban-b/mmrag-eval", split="train")
→ ritaban-b/mmrag-eval on HuggingFace
All 50 records were manually reviewed by the author. See the dataset card for schema, category breakdown, and annotation quality notes.
Metrics
1. Grounding Fidelity (grounding_fidelity)
Measures how well a generated answer is grounded in the retrieved image rather than hallucinated. Uses CLIP image–text similarity as a proxy score. A higher score means the answer text is more consistent with the image's semantic content.
Hook for GPT-4V: pass a custom grounding_fn(image_path, text) -> float to evaluate() to replace CLIP with any vision-language judge.
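To make the idea of a similarity proxy concrete, here is a minimal sketch of scoring an image embedding against a text embedding with cosine similarity. It is an illustration only: the function names `cosine_similarity` and `proxy_grounding_score` are hypothetical, the embeddings are assumed to come from a CLIP-style encoder run elsewhere, and clamping to [0, 1] is an assumption of this sketch rather than a statement about how mmrag-eval scales its score.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def proxy_grounding_score(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Hypothetical proxy: cosine similarity clamped to [0, 1] (assumption)."""
    return max(0.0, min(1.0, cosine_similarity(image_emb, text_emb)))
```

A higher value means the two embeddings point in a more similar direction, which matches the reading of the metric above: the answer text is more consistent with the image's semantic content.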
2. Retrieval Quality (retrieval_quality)
Measures the standard IR effectiveness of the image retrieval step:
- nDCG@K — normalized Discounted Cumulative Gain at K: rewards retrieving relevant images at higher ranks.
- Recall@K — fraction of ground-truth relevant images recovered in the top-K results.
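The two retrieval metrics above can be sketched for the binary-relevance case, where an image is either in the ground-truth set or not. This is a minimal reference sketch, not the package's own code: the function names and the handling of edge cases (empty ground truth, duplicate retrieved paths) are assumptions.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant images that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0  # assumption: no ground truth means nothing can be recalled
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    seen = set()
    dcg = 0.0
    for rank, item in enumerate(retrieved[:k], start=1):
        # Count each relevant image once, discounted by log2(rank + 1).
        if item in relevant and item not in seen:
            dcg += 1.0 / math.log2(rank + 1)
        seen.add(item)
    # The ideal ranking places every relevant image at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

With one relevant image retrieved at rank 1, both functions return 1.0. Moving that image to rank 2 leaves recall unchanged but lowers nDCG to about 0.63, which is the rank sensitivity nDCG is meant to capture.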
3. Diversity (diversity)
Penalizes retrieval pipelines that return near-duplicate images. Uses perceptual hashing (pHash) to detect visually similar images and returns a score in [0, 1], where 1.0 means all retrieved images are visually distinct.
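One way to turn perceptual hashes into a diversity score is sketched below. It assumes each image has already been reduced to a 64-bit integer hash (the ImageHash package's `phash` function is one possible source), treats two images as near-duplicates when their hashes differ in at most `max_distance` bits, and reports the fraction of images that survive greedy de-duplication. The threshold of 8 bits, the function names, and the scoring formula are all assumptions of this sketch; the package's actual computation may differ.

```python
from typing import List, Sequence


def hamming_distance(h1: int, h2: int) -> int:
    """Number of bit positions in which two integer hashes differ."""
    return bin(h1 ^ h2).count("1")


def diversity_score(hashes: Sequence[int], max_distance: int = 8) -> float:
    """Fraction of images that are not near-duplicates of an earlier image."""
    if not hashes:
        return 1.0  # assumption: an empty result set contains no duplicates
    kept: List[int] = []
    for h in hashes:
        # Keep the image only if it is far from every image kept so far.
        if all(hamming_distance(h, other) > max_distance for other in kept):
            kept.append(h)
    return len(kept) / len(hashes)
```

Under this formula a result set with no near-duplicates scores 1.0, and a set of n copies of the same image scores 1/n, so the lower bound approaches 0 only as the number of duplicates grows.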
Quickstart
pip install git+https://github.com/ritabanb/mmrag-eval.git
from mmrag_eval import evaluate
from mmrag_eval.dataset.loader import MMRagSample
samples = [
    MMRagSample(
        query="What is shown in the diagram?",
        image_path="data/fig1.png",
        reference_answer="A flowchart depicting the training pipeline.",
        grounding_labels=["data/fig1.png"],  # relevant images for this query
    )
]
retrieved_images = [["data/fig1.png", "data/fig2.png"]]
generated_answers = ["The diagram shows a machine learning training pipeline."]
results = evaluate(
    samples=samples,
    retrieved_images=retrieved_images,
    generated_answers=generated_answers,
    k=5,
)
print(results["aggregated"])
# {
#     "grounding_fidelity": 0.83,
#     "ndcg_at_k": 1.0,
#     "recall_at_k": 1.0,
#     "diversity_score": 1.0,
#     "num_samples": 1
# }
CLI
python scripts/run_eval.py dataset.json \
    --retrieved retrieved.json \
    --answers answers.json \
    --k 5 \
    --output results.json
Dataset Format
JSON list of objects with four required fields:
[
  {
    "query": "string",
    "image_path": "path/to/image.png",
    "reference_answer": "string",
    "grounding_labels": ["path/to/relevant_image.png"]
  }
]
Also loadable from HuggingFace datasets:
from mmrag_eval.dataset.loader import load_from_hf
samples = load_from_hf("your-hf-org/mmrag-dataset", split="test")
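Before running an evaluation, it can help to check that a dataset file really contains the four required fields described above. The following is a minimal sketch of such a check using only the standard library. The helper names `validate_records` and `validate_file` are hypothetical and not part of mmrag-eval, and the type expectations (strings for the first three fields, a list of strings for grounding_labels) are inferred from the example record.

```python
import json
from typing import Any, List

# Expected type for each required field, inferred from the example record.
REQUIRED_FIELDS = {
    "query": str,
    "image_path": str,
    "reference_answer": str,
    "grounding_labels": list,
}


def validate_records(records: Any) -> List[str]:
    """Return a list of problems; an empty list means the data looks valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems: List[str] = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: expected an object")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if field not in record:
                problems.append(f"record {i}: missing field '{field}'")
            elif not isinstance(record[field], expected):
                problems.append(f"record {i}: '{field}' should be {expected.__name__}")
        labels = record.get("grounding_labels")
        if isinstance(labels, list) and not all(isinstance(p, str) for p in labels):
            problems.append(f"record {i}: 'grounding_labels' must contain only strings")
    return problems


def validate_file(path: str) -> List[str]:
    """Load a JSON dataset file and validate its records."""
    with open(path, encoding="utf-8") as f:
        return validate_records(json.load(f))
```

Calling `validate_file("dataset.json")` and printing the returned messages gives a quick sanity check before handing the file to the CLI.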
Custom Grounding Function (GPT-4V)
import base64
import openai
def gpt4v_grounding(image_path: str, text: str) -> float:
    # Encode the image as base64 so it can be sent inline as a data URL.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                # The data URL assumes a PNG image; adjust the MIME type for other formats.
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": f"On a scale of 0–1, how well does this text describe only what is visible in the image? Text: '{text}'. Reply with a single float."},
            ],
        }],
        max_tokens=10,
    )
    # Raises ValueError if the reply is not a bare float.
    return float(response.choices[0].message.content.strip())
results = evaluate(samples, retrieved_images, generated_answers, grounding_fn=gpt4v_grounding)
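A language-model judge does not always reply with a bare number, so calling float() directly on the reply can raise ValueError. The sketch below shows one defensive way to extract a score. The helper name `parse_score`, the fallback value of 0.0, and the clamping to [0, 1] are assumptions for illustration and are not part of mmrag-eval.

```python
import re

# Matches the first integer or decimal number in a string, with optional sign.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)")


def parse_score(reply: str, default: float = 0.0) -> float:
    """Extract the first number from a model reply and clamp it to [0, 1]."""
    match = _NUMBER.search(reply)
    if match is None:
        return default  # assumption: an unparseable reply counts as ungrounded
    return max(0.0, min(1.0, float(match.group())))
```

Inside a custom grounding function, the final line could then return `parse_score(response.choices[0].message.content)` instead of calling float() on the raw text. Because the helper takes the first number it finds, it works best when the prompt asks for the score alone.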
Contributing
Contributions are welcome — especially new metrics, dataset loaders, and evaluation integrations.
- Fork the repo and create a feature branch.
- Add tests for any new metric in tests/test_metrics.py.
- Run pytest and make sure all tests pass.
- Open a pull request with a clear description of what you added and why.
Please open an issue first for significant changes so we can discuss the approach.
Citation
If you use mmrag-eval in your research, please cite:
@software{mmrag_eval,
  author = {Bhattacharya, Ritaban},
  title = {{mmrag-eval}: Benchmark for Evaluating Grounding Quality in Multimodal RAG Systems},
  year = {2026},
  url = {https://github.com/ritabanb/mmrag-eval},
  note = {Dataset: https://huggingface.co/datasets/ritaban-b/mmrag-eval},
}
License
MIT — see LICENSE.