Benchmark for evaluating grounding quality in multimodal RAG systems

mmrag-eval

📝 Blog post: Beyond Text: Why Multimodal RAG Needs Its Own Evaluation Benchmark

mmrag-eval is an open-source benchmark for evaluating the grounding quality of multimodal Retrieval-Augmented Generation (RAG) systems — measuring not just whether a system retrieves images, but whether its generated answers stay within, and faithfully reflect, the visual evidence it retrieved.

Problem Statement

Multimodal RAG systems retrieve images to ground their responses, but standard benchmarks only measure answer correctness against a reference string. This leaves two critical failure modes undetected:

Hallucination — answers that are plausible but not supported by the retrieved images.
Redundancy — retrieval pipelines that pad results with near-duplicate images, inflating apparent recall while reducing evidence diversity.

mmrag-eval provides three independent, composable metrics that together give a principled grounding score for any multimodal RAG system.

Dataset

A sample dataset of 50 annotated image–query pairs is available on HuggingFace:

from datasets import load_dataset
dataset = load_dataset("ritaban-b/mmrag-eval", split="train")

→ ritaban-b/mmrag-eval on HuggingFace

All 50 records were manually reviewed by the author. See the dataset card for schema, category breakdown, and annotation quality notes.

Metrics

1. Grounding Fidelity (`grounding_fidelity`)

Measures how well a generated answer is grounded in the retrieved image rather than hallucinated. Uses CLIP image–text similarity as a proxy score. A higher score means the answer text is more consistent with the image's semantic content.

Hook for GPT-4V: pass a custom grounding_fn(image_path, text) -> float to replace CLIP with any vision-language judge.
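
To illustrate what an image–text similarity proxy computes, here is a minimal sketch of cosine similarity between two embedding vectors. It is not the library's implementation: the function name `cosine_similarity` and the toy vectors are assumptions for illustration, and a real run would use embeddings produced by a CLIP model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate embedding: treat as no evidence of grounding
    return dot / (norm_a * norm_b)


# Toy vectors standing in for CLIP image and text embeddings (made up).
image_emb = [0.6, 0.8, 0.0]
grounded_text_emb = [0.6, 0.8, 0.0]    # points the same way as the image
unrelated_text_emb = [0.0, 0.0, 1.0]   # orthogonal to the image

print(cosine_similarity(image_emb, grounded_text_emb))
print(cosine_similarity(image_emb, unrelated_text_emb))
```

An answer whose embedding points in the same direction as the image embedding scores near 1, while an unrelated answer scores near 0.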

2. Retrieval Quality (`retrieval_quality`)

Measures the standard IR effectiveness of the image retrieval step:

nDCG@K — normalized Discounted Cumulative Gain at K: rewards retrieving relevant images at higher ranks.
Recall@K — fraction of ground-truth relevant images recovered in the top-K results.
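
The two metrics above can be sketched in a few lines, assuming binary relevance (an image is either in the ground-truth set or not). The function names `recall_at_k` and `ndcg_at_k` are illustrative and may not match the package's internal API.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant images that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    seen = set()
    dcg = 0.0
    for rank, item in enumerate(retrieved[:k]):
        # Count each relevant image once, even if it is retrieved twice.
        if item in relevant and item not in seen:
            dcg += 1.0 / math.log2(rank + 2)
            seen.add(item)
    # The ideal ranking places all relevant images at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["data/fig1.png", "data/fig2.png"]
relevant = {"data/fig1.png"}
print(recall_at_k(retrieved, relevant, k=5))  # 1.0
print(ndcg_at_k(retrieved, relevant, k=5))    # 1.0
```

If the relevant image were ranked second instead of first, Recall@K would stay at 1.0 but nDCG@K would drop to 1/log2(3), roughly 0.63, which is the rank sensitivity that nDCG adds.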

3. Diversity (`diversity`)

Penalizes retrieval pipelines that return near-duplicate images. Uses perceptual hashing (pHash) to detect visually similar images and returns a score in [0, 1], where 1.0 means all retrieved images are visually distinct.
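
One way such a score could be defined is sketched below, assuming each image has already been reduced to a 64-bit perceptual hash stored as an integer. The pairwise formula, the `max_distance` threshold of 8 bits, and the function names are assumptions for illustration; the package's exact formula is not stated here.

```python
from itertools import combinations
from typing import Sequence


def hamming_distance(h1: int, h2: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(h1 ^ h2).count("1")


def diversity_score(hashes: Sequence[int], max_distance: int = 8) -> float:
    """1 minus the fraction of image pairs that are near-duplicates.

    Two images count as near-duplicates when their hashes differ in at
    most `max_distance` bits (a hypothetical threshold).
    """
    pairs = list(combinations(hashes, 2))
    if not pairs:
        return 1.0  # zero or one image: nothing to be redundant with
    near_dupes = sum(1 for a, b in pairs if hamming_distance(a, b) <= max_distance)
    return 1.0 - near_dupes / len(pairs)


# Made-up hashes: the first and third differ by a single bit.
hashes = [0x0000000000000000, 0xFFFFFFFFFFFFFFFF, 0x0000000000000001]
print(diversity_score(hashes))
```

Under this definition a result set with no near-duplicate pairs scores 1.0 and a set made entirely of near-duplicates scores 0.0.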

Quickstart

pip install git+https://github.com/ritabanb/mmrag-eval.git

from mmrag_eval import evaluate
from mmrag_eval.dataset.loader import MMRagSample

samples = [
    MMRagSample(
        query="What is shown in the diagram?",
        image_path="data/fig1.png",
        reference_answer="A flowchart depicting the training pipeline.",
        grounding_labels=["data/fig1.png"],  # relevant images for this query
    )
]

retrieved_images = [["data/fig1.png", "data/fig2.png"]]
generated_answers = ["The diagram shows a machine learning training pipeline."]

results = evaluate(
    samples=samples,
    retrieved_images=retrieved_images,
    generated_answers=generated_answers,
    k=5,
)

print(results["aggregated"])
# {
#   "grounding_fidelity": 0.83,
#   "ndcg_at_k": 1.0,
#   "recall_at_k": 1.0,
#   "diversity_score": 1.0,
#   "num_samples": 1
# }

CLI

python scripts/run_eval.py dataset.json \
  --retrieved retrieved.json \
  --answers answers.json \
  --k 5 \
  --output results.json

Dataset Format

A JSON list of objects, each with four required fields:

[
  {
    "query": "string",
    "image_path": "path/to/image.png",
    "reference_answer": "string",
    "grounding_labels": ["path/to/relevant_image.png"]
  }
]
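
Before running an evaluation it can help to check a dataset file against this shape. The following is a minimal sketch using only the standard library; `validate_records` and `REQUIRED_FIELDS` are hypothetical helpers, not part of mmrag-eval, and the field types are inferred from the example above.

```python
import json
from typing import Any, List

# Field names come from the format above; the types are inferred from the example.
REQUIRED_FIELDS = {
    "query": str,
    "image_path": str,
    "reference_answer": str,
    "grounding_labels": list,
}


def validate_records(records: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    errors = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            errors.append(f"record {i}: expected an object")
            continue
        for field, expected in REQUIRED_FIELDS.items():
            if field not in rec:
                errors.append(f"record {i}: missing field '{field}'")
            elif not isinstance(rec[field], expected):
                errors.append(f"record {i}: field '{field}' should be {expected.__name__}")
    return errors


sample_json = """
[
  {
    "query": "What is shown in the diagram?",
    "image_path": "data/fig1.png",
    "reference_answer": "A flowchart depicting the training pipeline.",
    "grounding_labels": ["data/fig1.png"]
  }
]
"""
print(validate_records(json.loads(sample_json)))  # []
```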

Datasets can also be loaded from the HuggingFace Hub:

from mmrag_eval.dataset.loader import load_from_hf
samples = load_from_hf("your-hf-org/mmrag-dataset", split="test")

Custom Grounding Function (GPT-4V)

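import base64

import openai  # openai>=1.0; reads the API key from the OPENAI_API_KEY environment variable

def gpt4v_grounding(image_path: str, text: str) -> float:
    # Encode the image as base64 so it can be sent inline as a data URL.
    # The data URL below assumes a PNG; adjust the MIME type for other formats.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": f"On a scale of 0–1, how well does this text describe only what is visible in the image? Text: '{text}'. Reply with a single float."},
            ],
        }],
        max_tokens=10,
    )
    # float() raises ValueError if the model replies with anything but a bare number.
    return float(response.choices[0].message.content.strip())

# Pass the judge to evaluate() to use it in place of the default CLIP scorer.
results = evaluate(samples, retrieved_images, generated_answers, grounding_fn=gpt4v_grounding)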
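
A model-based judge does not always reply with a bare number, and `float()` raises `ValueError` on anything else. A small defensive parser, sketched below, extracts the first number from the reply and clamps it to [0, 1]. The helper name `parse_score` and the fallback default of 0.0 are assumptions, not part of mmrag-eval.

```python
import re

# Matches the first signed integer or decimal in a string.
_FLOAT_RE = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)")


def parse_score(reply: str, default: float = 0.0) -> float:
    """Extract the first number from a model reply and clamp it to [0, 1].

    Falls back to `default` when the reply contains no number at all.
    """
    match = _FLOAT_RE.search(reply)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))


print(parse_score("0.85"))          # 0.85
print(parse_score("Score: 0.7."))   # 0.7
print(parse_score("I can't tell"))  # 0.0
```

In a custom grounding function, the final `float(...)` call could be wrapped as `parse_score(response.choices[0].message.content)`. Because the parser takes the first number it finds, the prompt should still ask for a single float.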

Contributing

Contributions are welcome — especially new metrics, dataset loaders, and evaluation integrations.

1. Fork the repo and create a feature branch.
2. Add tests for any new metric in tests/test_metrics.py.
3. Run pytest and make sure all tests pass.
4. Open a pull request with a clear description of what you added and why.

Please open an issue first for significant changes so we can discuss the approach.

Citation

If you use mmrag-eval in your research, please cite:

@software{mmrag_eval,
  author  = {Bhattacharya, Ritaban},
  title   = {{mmrag-eval}: Benchmark for Evaluating Grounding Quality in Multimodal RAG Systems},
  year    = {2026},
  url     = {https://github.com/ritabanb/mmrag-eval},
  note    = {Dataset: https://huggingface.co/datasets/ritaban-b/mmrag-eval},
}

License

MIT — see LICENSE.

Project details

This version

0.1.0

Jun 4, 2026

Download files

Source Distribution

mmrag_eval-0.1.0.tar.gz (10.7 kB view details)

Uploaded Jun 4, 2026 Source

Built Distribution

mmrag_eval-0.1.0-py3-none-any.whl (9.5 kB view details)

Uploaded Jun 4, 2026 Python 3

File details

Details for the file mmrag_eval-0.1.0.tar.gz.

File metadata

Download URL: mmrag_eval-0.1.0.tar.gz
Upload date: Jun 4, 2026
Size: 10.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for mmrag_eval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c6db71e0d9171cc760d0587b947cf90ea9d28bba790b33f145cdd660d4c3ab3b`
MD5	`8fd779b8503a8e5fd6338dccd4e35f5f`
BLAKE2b-256	`92a258f7b10c0af1f5cc562a5d659a618c71f45a4f2e81cf37e1be284ddb8225`

File details

Details for the file mmrag_eval-0.1.0-py3-none-any.whl.

File metadata

Download URL: mmrag_eval-0.1.0-py3-none-any.whl
Upload date: Jun 4, 2026
Size: 9.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for mmrag_eval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3ab3b65f06b98fd1fd8dd8c4409c649b97a7f0a52cc1d17d6acfc8a566e8437a`
MD5	`a2435da5d8b18131d026e71c1eee9b6c`
BLAKE2b-256	`b43cb092bccc84dc81afc8f9d289e58c6453644535757d78ad2773ba4f0611d5`
