# mmeval-vrag

Evaluation framework for Multimodal Vision-Language RAG systems. Measure retrieval quality, hallucination, faithfulness, and cross-modal alignment in one unified pipeline.
## Why mmeval-vrag?
Existing RAG evaluation tools focus on text-only pipelines. Real-world systems increasingly retrieve images alongside text — medical scans with clinical notes, product photos with descriptions, diagrams with documentation. mmeval-vrag is purpose-built for this multimodal setting:
- 11 metrics spanning retrieval, generation, and cross-modal alignment
- Graceful degradation — works with CPU-only token overlap, scales up with sentence-transformers, CLIP, and NLI models
- Pipeline evaluation — plug in your retriever + generator and benchmark end-to-end
- JSONL + VQA loaders — start evaluating in minutes with standard formats
- Extensible — register custom metrics with a single decorator
## Installation

```bash
# Core (numpy + Pillow only)
pip install mmeval-vrag

# With sentence-transformers for embedding-based metrics
pip install mmeval-vrag[transformers]

# With PyTorch + CLIP for cross-modal metrics
pip install mmeval-vrag[torch]

# Everything
pip install mmeval-vrag[full]
```
## Quick Start

```python
from mmeval_vrag import MultimodalRAGEvaluator, EvalConfig
from mmeval_vrag.types import EvalSample, RetrievedItem

sample = EvalSample(
    query_text="What does the chest X-ray show?",
    retrieved=[
        RetrievedItem(
            text="Bilateral infiltrates consistent with pneumonia.",
            is_relevant=True,
        ),
    ],
    generated_answer="The X-ray shows bilateral infiltrates indicating pneumonia.",
    reference_answer="Bilateral infiltrates indicating pneumonia.",
)

evaluator = MultimodalRAGEvaluator(
    config=EvalConfig(metrics=["faithfulness", "hallucination_rate", "retrieval_precision"])
)

results = evaluator.evaluate([sample])
print(results.summary())
```
## Metrics

| Category | Metric | What it measures |
|---|---|---|
| Retrieval | `retrieval_precision` | Fraction of top-K items that are relevant |
| | `retrieval_recall` | Fraction of all relevant items in top-K |
| | `retrieval_mrr` | Reciprocal rank of the first relevant item |
| | `retrieval_ndcg` | Normalised DCG accounting for rank positions |
| Generation | `faithfulness` | Are generated claims supported by context? |
| | `hallucination_rate` | Fraction of unsupported claims (lower = better) |
| | `answer_relevance` | Similarity between answer and query |
| | `context_relevance` | Relevance of retrieved passages to query |
| Cross-Modal | `cross_modal_alignment` | CLIP similarity: retrieved images ↔ query text |
| | `visual_grounding` | CLIP similarity: retrieved images ↔ generated answer |
| | `multimodal_consistency` | CLIP similarity within (image, text) pairs |
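The retrieval metrics follow their standard rank-based definitions. For reference, here is an illustrative reimplementation of three of them over a boolean relevance list (a sketch of the formulas, not the library's internal code):

```python
import math

def precision_at_k(relevant: list[bool], k: int) -> float:
    """Fraction of the top-K retrieved items that are relevant."""
    return sum(relevant[:k]) / k if k else 0.0

def mrr(relevant: list[bool]) -> float:
    """Reciprocal rank of the first relevant item (0 if none)."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant: list[bool], k: int) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevant[:k], start=1))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, sum(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

# A ranking where only the 2nd and 4th items are relevant:
print(precision_at_k([False, True, False, True], k=4))  # 0.5
print(mrr([False, True, False, True]))                  # 0.5
```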
## End-to-End Pipeline Evaluation
Evaluate a live retriever + generator without pre-computing samples:
```python
from mmeval_vrag.evaluators.pipeline import EvalPipeline, QueryItem

pipeline = EvalPipeline(
    retriever=my_retriever,  # (query_text, query_image, top_k) → List[RetrievedItem]
    generator=my_generator,  # (query_text, contexts) → str
    config=EvalConfig(metrics=["all"]),
)

results = pipeline.run([
    QueryItem(query_text="Describe the tumor.", relevant_ids=["doc_42"]),
])
```
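`my_retriever` and `my_generator` above are placeholders. A minimal sketch of callables matching the stated signatures — the toy corpus is invented, and the `doc_id` field name is an assumption (check `mmeval_vrag.types.RetrievedItem` for the actual ID attribute):

```python
from mmeval_vrag.types import RetrievedItem

# Toy corpus standing in for a real vector index.
CORPUS = {
    "doc_42": "The scan shows a 2 cm tumor in the left lobe.",
    "doc_7": "No abnormalities were detected.",
}

def my_retriever(query_text, query_image, top_k):
    """Naive keyword retriever with the (query_text, query_image, top_k) signature."""
    hits = [
        RetrievedItem(text=text, doc_id=doc_id)  # doc_id: hypothetical field name
        for doc_id, text in CORPUS.items()
        if any(tok in text.lower() for tok in query_text.lower().split())
    ]
    return hits[:top_k]

def my_generator(query_text, contexts):
    """Trivial extractive 'generator': echo the first retrieved context."""
    return contexts[0] if contexts else "No relevant context found."
```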
## CLI

```bash
# Evaluate from a JSONL file
mmeval-vrag samples.jsonl -m faithfulness hallucination_rate -o results.json

# All metrics
mmeval-vrag samples.jsonl -m all --device cuda
```
JSONL format (one object per line):
```json
{
  "query": "What is shown in the image?",
  "retrieved": [{"text": "A lesion is visible.", "is_relevant": true}],
  "generated_answer": "The image shows a lesion.",
  "reference_answer": "A lesion."
}
```
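Such a file can be produced with nothing beyond the standard library, for example:

```python
import json

samples = [
    {
        "query": "What is shown in the image?",
        "retrieved": [{"text": "A lesion is visible.", "is_relevant": True}],
        "generated_answer": "The image shows a lesion.",
        "reference_answer": "A lesion.",
    },
]

# One JSON object per line, as the loader expects.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```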
## Custom Metrics

```python
from mmeval_vrag.metrics import BaseMetric, register_metric

@register_metric
class MyCustomMetric(BaseMetric):
    name = "my_custom_metric"

    def compute(self, sample):
        score = len(sample.generated_answer) / 100  # toy example
        return {self.name: min(score, 1.0)}
```

Then use it: `EvalConfig(metrics=["my_custom_metric", "faithfulness"])`.
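Custom metrics can also score against the reference answer. Another sketch along the same lines, assuming `reference_answer` may be unset (hence the guard):

```python
@register_metric
class ExactMatch(BaseMetric):
    """1.0 if the generated answer equals the reference, ignoring case and whitespace."""
    name = "exact_match"

    def compute(self, sample):
        pred = sample.generated_answer.strip().lower()
        ref = (sample.reference_answer or "").strip().lower()  # reference may be missing
        return {self.name: float(pred == ref)}
```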
## Fallback Behaviour
| Component available | Faithfulness / Relevance | Hallucination | Cross-modal |
|---|---|---|---|
| Core only (numpy) | Token overlap (Jaccard) | Token overlap | Skipped (returns 0) |
| + sentence-transformers | Embedding cosine sim | Token overlap | Skipped |
| + transformers (NLI) | Embedding cosine sim | NLI entailment | Skipped |
| + transformers (CLIP) | Embedding cosine sim | NLI entailment | CLIP cosine sim |
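In the CPU-only configuration, faithfulness is scored by Jaccard overlap between token sets. As a rough sketch of what that computes (the library's exact tokenisation may differ):

```python
def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Generated claim vs. retrieved context:
jaccard_overlap(
    "The X-ray shows bilateral infiltrates indicating pneumonia.",
    "Bilateral infiltrates consistent with pneumonia.",
)
```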
## Export & Analysis

```python
# Summary statistics
results.summary()  # {metric: {mean, std, median, min, max, n}}

# Per-sample DataFrame
df = results.to_dataframe()

# JSON export
results.to_json("results.json")
```
## Project Structure

```text
mmeval-vrag/
├── mmeval_vrag/
│   ├── __init__.py            # Public API
│   ├── config.py              # EvalConfig + metric registry
│   ├── types.py               # EvalSample, RetrievedItem, ImageInput
│   ├── results.py             # EvalResult, EvalResultCollection
│   ├── cli.py                 # CLI entry point
│   ├── evaluators/
│   │   ├── multimodal_rag.py  # Main evaluator
│   │   └── pipeline.py        # End-to-end pipeline evaluator
│   ├── metrics/
│   │   ├── __init__.py        # BaseMetric + registry
│   │   ├── retrieval.py       # Precision, Recall, MRR, NDCG
│   │   ├── faithfulness.py    # Faithfulness, Answer/Context Relevance
│   │   ├── hallucination.py   # Hallucination Rate
│   │   └── cross_modal.py     # CLIP-based cross-modal metrics
│   ├── datasets/
│   │   └── loaders.py         # JSONL + VQA dataset loaders
│   └── utils/
│       └── text.py            # Sentence splitting, token overlap
├── tests/
│   └── test_core.py
├── examples/
│   ├── quickstart.py
│   └── pipeline_eval.py
├── pyproject.toml
├── LICENSE
└── README.md
```
## Contributing

Contributions welcome! Please open an issue or PR on GitHub.

```bash
git clone https://github.com/EmmanuelleB985/mmeval-vrag.git
cd mmeval-vrag
pip install -e ".[dev]"
pytest
```
## License

Apache 2.0 — see LICENSE.
## Citation

```bibtex
@software{bourigault2025mmeval,
  author = {Bourigault, Emmanuelle},
  title  = {mmeval-vrag: Evaluation Framework for Multimodal Vision-Language RAG Systems},
  year   = {2025},
  url    = {https://github.com/EmmanuelleB985/mmeval-vrag},
}
```