
Benchmark for evaluating grounding quality in multimodal RAG systems


mmrag-eval

Python 3.10+ | License: MIT | PRs welcome

📝 Blog post: Beyond Text: Why Multimodal RAG Needs Its Own Evaluation Benchmark

mmrag-eval is an open-source benchmark for evaluating the grounding quality of multimodal Retrieval-Augmented Generation (RAG) systems — measuring not just whether a system retrieves images, but whether its generated answers stay within, and faithfully reflect, the visual evidence it retrieved.


Problem Statement

Multimodal RAG systems retrieve images to ground their responses, but standard benchmarks only measure answer correctness against a reference string. This leaves two critical failure modes undetected:

  1. Hallucination — answers that are plausible but not supported by the retrieved images.
  2. Redundancy — retrieval pipelines that pad results with near-duplicate images, inflating apparent recall while reducing evidence diversity.

mmrag-eval provides three independent, composable metrics (grounding fidelity, retrieval quality, and diversity) that together characterize how well any multimodal RAG system grounds its answers.


Dataset

A sample dataset of 50 annotated image–query pairs is available on HuggingFace:

from datasets import load_dataset
dataset = load_dataset("ritaban-b/mmrag-eval", split="train")

All 50 records were manually reviewed by the author. See the dataset card for schema, category breakdown, and annotation quality notes.


Metrics

1. Grounding Fidelity (grounding_fidelity)

Measures how well a generated answer is grounded in the retrieved image rather than hallucinated. Uses CLIP image–text similarity as a proxy score. A higher score means the answer text is more consistent with the image's semantic content.

Hook for GPT-4V: pass a custom grounding_fn(image_path, text) -> float to replace CLIP with any vision-language judge.
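
As a rough illustration of the kind of score described above, CLIP-style similarity is usually the cosine similarity between an image embedding and a text embedding. The sketch below computes it for toy vectors in plain Python. The helper name cosine_similarity and the three-dimensional vectors are illustrative assumptions, not part of mmrag-eval, and any rescaling the package applies to the raw similarity is not covered here.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Toy stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
image_emb = [0.2, 0.9, 0.1]
grounded_text_emb = [0.25, 0.85, 0.05]   # text that describes the image
unrelated_text_emb = [0.9, -0.1, 0.4]    # text about something else

grounded = cosine_similarity(image_emb, grounded_text_emb)
unrelated = cosine_similarity(image_emb, unrelated_text_emb)
print(grounded > unrelated)
```

A grounded answer should land closer to the image embedding than an unrelated one, which is the property the metric relies on.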

2. Retrieval Quality (retrieval_quality)

Measures the standard IR effectiveness of the image retrieval step:

  • nDCG@K — normalized Discounted Cumulative Gain at K: rewards retrieving relevant images at higher ranks.
  • Recall@K — fraction of ground-truth relevant images recovered in the top-K results.
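
The two definitions above can be sketched in a few lines of plain Python. This is a minimal version assuming binary relevance (an image is either in the ground-truth set or not); the function names and the extra path data/fig3.png are illustrative, and the package's own implementation may differ in edge cases such as empty ground truth.

```python
import math
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant images that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0  # assumption: undefined recall is reported as 0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)

def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary relevance (1 if the image is relevant, else 0)."""
    seen = set()
    dcg = 0.0
    for rank, image in enumerate(retrieved[:k], start=1):
        # Count each relevant image once, so duplicates cannot inflate the score.
        if image in relevant and image not in seen:
            dcg += 1.0 / math.log2(rank + 1)
            seen.add(image)
    # Ideal ranking: all relevant images packed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["data/fig2.png", "data/fig1.png", "data/fig3.png"]
relevant = {"data/fig1.png"}
print(recall_at_k(retrieved, relevant, k=5))          # 1.0
print(round(ndcg_at_k(retrieved, relevant, k=5), 4))  # 0.6309
```

Recall is perfect because the relevant image was found, while nDCG drops below 1.0 because it was ranked second rather than first.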

3. Diversity (diversity)

Penalizes retrieval pipelines that return near-duplicate images. Uses perceptual hashing (pHash) to detect visually similar images and returns a score in [0, 1], where 1.0 means all retrieved images are visually distinct.
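
One way to turn perceptual hashes into a score like this is to compare them by Hamming distance and count how many images are not near-duplicates of an earlier one. The sketch below works on precomputed 64-bit hashes given as integers (in practice these could come from a library such as ImageHash). The scoring formula, the max_distance threshold of 8, and the function names are assumptions made for illustration; the formula mmrag-eval actually uses may differ.

```python
from typing import Sequence

def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(hash_a ^ hash_b).count("1")

def diversity_score(hashes: Sequence[int], max_distance: int = 8) -> float:
    """Share of images that are not near-duplicates of an earlier image."""
    if not hashes:
        return 1.0  # assumption: an empty result set is treated as fully diverse
    distinct = []
    for h in hashes:
        # Keep the image only if it is far from every image kept so far.
        if all(hamming_distance(h, kept) > max_distance for kept in distinct):
            distinct.append(h)
    return len(distinct) / len(hashes)

# Toy 64-bit hashes: the second differs from the first by a single bit,
# the third differs from the first in every bit.
hashes = [0x0F0F0F0F0F0F0F0F, 0x0F0F0F0F0F0F0F0E, 0xF0F0F0F0F0F0F0F0]
print(round(diversity_score(hashes), 3))  # 0.667
```

With this particular formula a set of n identical images scores 1/n rather than 0, so it is only one of several reasonable ways to map duplicates onto [0, 1].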


Quickstart

pip install git+https://github.com/ritabanb/mmrag-eval.git

from mmrag_eval import evaluate
from mmrag_eval.dataset.loader import MMRagSample

samples = [
    MMRagSample(
        query="What is shown in the diagram?",
        image_path="data/fig1.png",
        reference_answer="A flowchart depicting the training pipeline.",
        grounding_labels=["data/fig1.png"],  # relevant images for this query
    )
]

retrieved_images = [["data/fig1.png", "data/fig2.png"]]
generated_answers = ["The diagram shows a machine learning training pipeline."]

results = evaluate(
    samples=samples,
    retrieved_images=retrieved_images,
    generated_answers=generated_answers,
    k=5,
)

print(results["aggregated"])
# {
#   "grounding_fidelity": 0.83,
#   "ndcg_at_k": 1.0,
#   "recall_at_k": 1.0,
#   "diversity_score": 1.0,
#   "num_samples": 1
# }

CLI

python scripts/run_eval.py dataset.json \
  --retrieved retrieved.json \
  --answers answers.json \
  --k 5 \
  --output results.json

Dataset Format

The dataset is a JSON list of objects, each with four required fields:

[
  {
    "query": "string",
    "image_path": "path/to/image.png",
    "reference_answer": "string",
    "grounding_labels": ["path/to/relevant_image.png"]
  }
]
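
Before running an evaluation it can help to check a dataset file against the four required fields listed above. The following is a small standalone sketch using only the standard library; validate_records is a hypothetical helper, not a function provided by mmrag-eval, and the type check on grounding_labels is inferred from the example rather than from a published schema.

```python
import json
from typing import Any, List

REQUIRED_FIELDS = ("query", "image_path", "reference_answer", "grounding_labels")

def validate_records(records: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(records, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: expected an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {i}: missing field '{field}'")
        # Assumption: grounding_labels is a list of image paths (strings).
        labels = record.get("grounding_labels")
        if labels is not None and not (
            isinstance(labels, list) and all(isinstance(p, str) for p in labels)
        ):
            problems.append(f"record {i}: 'grounding_labels' must be a list of strings")
    return problems

raw = """[
  {"query": "What is shown in the diagram?",
   "image_path": "data/fig1.png",
   "reference_answer": "A flowchart depicting the training pipeline.",
   "grounding_labels": ["data/fig1.png"]},
  {"query": "Missing fields"}
]"""
for problem in validate_records(json.loads(raw)):
    print(problem)
```

The first record passes; the second is reported once for each of its three missing fields.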

Datasets can also be loaded from the HuggingFace Hub:

from mmrag_eval.dataset.loader import load_from_hf
samples = load_from_hf("your-hf-org/mmrag-dataset", split="test")

Custom Grounding Function (GPT-4V)

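import base64

import openai

def gpt4v_grounding(image_path: str, text: str) -> float:
    # Encode the image as base64 so it can be sent inline as a data URL.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                # The MIME type is hard-coded; adjust it for non-PNG images.
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": f"On a scale of 0–1, how well does this text describe only what is visible in the image? Text: '{text}'. Reply with a single float."},
            ],
        }],
        max_tokens=10,
    )
    # Parse the reply as a float (a non-numeric reply raises ValueError)
    # and clamp it to [0, 1] so it stays comparable with other scores.
    score = float(response.choices[0].message.content.strip())
    return min(1.0, max(0.0, score))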

results = evaluate(samples, retrieved_images, generated_answers, grounding_fn=gpt4v_grounding)
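
The function above converts the model's reply directly with float(), which raises ValueError whenever the reply contains anything besides a bare number (for example "Score: 0.7."). A more forgiving parser could be swapped in for that final step. The sketch below is one possibility; parse_score and its default of 0.0 for unparseable replies are assumptions for illustration, not part of mmrag-eval.

```python
import re

# Matches the first integer or decimal number in a string.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")

def parse_score(reply: str, default: float = 0.0) -> float:
    """Extract the first number from a judge reply and clamp it to [0, 1]."""
    match = _NUMBER.search(reply)
    if match is None:
        return default  # assumption: unparseable replies count as ungrounded
    return min(1.0, max(0.0, float(match.group())))

print(parse_score("0.85"))           # 0.85
print(parse_score("Score: 0.7."))    # 0.7
print(parse_score("I cannot tell"))  # 0.0
```

Whether an unparseable reply should score 0.0, be retried, or be excluded from aggregation is a design choice that depends on how noisy the judge is.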

Contributing

Contributions are welcome — especially new metrics, dataset loaders, and evaluation integrations.

  1. Fork the repo and create a feature branch.
  2. Add tests for any new metric in tests/test_metrics.py.
  3. Run pytest and make sure all tests pass.
  4. Open a pull request with a clear description of what you added and why.

Please open an issue first for significant changes so we can discuss the approach.


Citation

If you use mmrag-eval in your research, please cite:

@software{mmrag_eval,
  author  = {Bhattacharya, Ritaban},
  title   = {{mmrag-eval}: Benchmark for Evaluating Grounding Quality in Multimodal RAG Systems},
  year    = {2026},
  url     = {https://github.com/ritabanb/mmrag-eval},
  note    = {Dataset: https://huggingface.co/datasets/ritaban-b/mmrag-eval},
}

License

MIT — see LICENSE.
