# mmeval-vrag

Evaluation framework for Multimodal Vision-Language RAG systems. Measure retrieval quality, hallucination, faithfulness, and cross-modal alignment in one unified pipeline.
## Why mmeval-vrag?
Existing RAG evaluation tools focus on text-only pipelines. Real-world systems increasingly retrieve images alongside text — medical scans with clinical notes, product photos with descriptions, diagrams with documentation. mmeval-vrag is purpose-built for this multimodal setting:
- 11 metrics spanning retrieval, generation, and cross-modal alignment
- Graceful degradation — works with CPU-only token overlap, scales up with sentence-transformers, CLIP, and NLI models
- Pipeline evaluation — plug in your retriever + generator and benchmark end-to-end
- JSONL + VQA loaders — start evaluating in minutes with standard formats
- Extensible — register custom metrics with a single decorator
## Installation

```bash
# Core (numpy + Pillow only)
pip install mmeval-vrag

# With sentence-transformers for embedding-based metrics
pip install mmeval-vrag[transformers]

# With PyTorch + CLIP for cross-modal metrics
pip install mmeval-vrag[torch]

# Everything
pip install mmeval-vrag[full]
```
## Quick Start

```python
from mmeval_vrag import MultimodalRAGEvaluator, EvalConfig
from mmeval_vrag.types import EvalSample, RetrievedItem

sample = EvalSample(
    query_text="What does the chest X-ray show?",
    retrieved=[
        RetrievedItem(
            text="Bilateral infiltrates consistent with pneumonia.",
            is_relevant=True,
        ),
    ],
    generated_answer="The X-ray shows bilateral infiltrates indicating pneumonia.",
    reference_answer="Bilateral infiltrates indicating pneumonia.",
)

evaluator = MultimodalRAGEvaluator(
    config=EvalConfig(metrics=["faithfulness", "hallucination_rate", "retrieval_precision"])
)

results = evaluator.evaluate([sample])
print(results.summary())
```
## Metrics

| Category | Metric | What it measures |
|---|---|---|
| Retrieval | `retrieval_precision` | Fraction of top-K items that are relevant |
| | `retrieval_recall` | Fraction of all relevant items in top-K |
| | `retrieval_mrr` | Reciprocal rank of the first relevant item |
| | `retrieval_ndcg` | Normalised DCG accounting for rank positions |
| Generation | `faithfulness` | Are generated claims supported by context? |
| | `hallucination_rate` | Fraction of unsupported claims (lower = better) |
| | `answer_relevance` | Similarity between answer and query |
| | `context_relevance` | Relevance of retrieved passages to query |
| Cross-Modal | `cross_modal_alignment` | CLIP similarity: retrieved images ↔ query text |
| | `visual_grounding` | CLIP similarity: retrieved images ↔ generated answer |
| | `multimodal_consistency` | CLIP similarity within (image, text) pairs |
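The retrieval metrics follow their standard rank-based definitions. For reference, here is an illustrative reimplementation of three of them over a boolean relevance list (a sketch of the formulas, not the library's internal code):

```python
import math

def precision_at_k(relevant: list[bool], k: int) -> float:
    """Fraction of the top-K retrieved items that are relevant."""
    return sum(relevant[:k]) / k if k else 0.0

def mrr(relevant: list[bool]) -> float:
    """Reciprocal rank of the first relevant item (0 if none)."""
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant: list[bool], k: int) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevant[:k], start=1))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, sum(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

# A ranking where only the 2nd and 4th items are relevant:
print(precision_at_k([False, True, False, True], k=4))  # 0.5
print(mrr([False, True, False, True]))                  # 0.5
```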
## End-to-End Pipeline Evaluation
Evaluate a live retriever + generator without pre-computing samples:
```python
from mmeval_vrag.evaluators.pipeline import EvalPipeline, QueryItem

pipeline = EvalPipeline(
    retriever=my_retriever,  # (query_text, query_image, top_k) → List[RetrievedItem]
    generator=my_generator,  # (query_text, contexts) → str
    config=EvalConfig(metrics=["all"]),
)

results = pipeline.run([
    QueryItem(query_text="Describe the tumor.", relevant_ids=["doc_42"]),
])
```
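`my_retriever` and `my_generator` above are placeholders. A minimal sketch of callables matching the stated signatures — the toy corpus is invented, and the `doc_id` field name is an assumption (check `mmeval_vrag.types.RetrievedItem` for the actual ID attribute):

```python
from mmeval_vrag.types import RetrievedItem

# Toy corpus standing in for a real vector index.
CORPUS = {
    "doc_42": "The scan shows a 2 cm tumor in the left lobe.",
    "doc_7": "No abnormalities were detected.",
}

def my_retriever(query_text, query_image, top_k):
    """Naive keyword retriever with the (query_text, query_image, top_k) signature."""
    hits = [
        RetrievedItem(text=text, doc_id=doc_id)  # doc_id: hypothetical field name
        for doc_id, text in CORPUS.items()
        if any(tok in text.lower() for tok in query_text.lower().split())
    ]
    return hits[:top_k]

def my_generator(query_text, contexts):
    """Trivial extractive 'generator': echo the first retrieved context."""
    return contexts[0] if contexts else "No relevant context found."
```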
## CLI

```bash
# Evaluate from a JSONL file
mmeval-vrag samples.jsonl -m faithfulness hallucination_rate -o results.json

# All metrics
mmeval-vrag samples.jsonl -m all --device cuda
```
JSONL format (one object per line):
```json
{
  "query": "What is shown in the image?",
  "retrieved": [{"text": "A lesion is visible.", "is_relevant": true}],
  "generated_answer": "The image shows a lesion.",
  "reference_answer": "A lesion."
}
```
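Such a file can be produced with nothing beyond the standard library, for example:

```python
import json

samples = [
    {
        "query": "What is shown in the image?",
        "retrieved": [{"text": "A lesion is visible.", "is_relevant": True}],
        "generated_answer": "The image shows a lesion.",
        "reference_answer": "A lesion.",
    },
]

# One JSON object per line, as the loader expects.
with open("samples.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```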
## Custom Metrics

```python
from mmeval_vrag.metrics import BaseMetric, register_metric

@register_metric
class MyCustomMetric(BaseMetric):
    name = "my_custom_metric"

    def compute(self, sample):
        score = len(sample.generated_answer) / 100  # toy example
        return {self.name: min(score, 1.0)}
```

Then use it: `EvalConfig(metrics=["my_custom_metric", "faithfulness"])`.
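Custom metrics can also score against the reference answer. Another sketch along the same lines, assuming `reference_answer` may be unset (hence the guard):

```python
@register_metric
class ExactMatch(BaseMetric):
    """1.0 if the generated answer equals the reference, ignoring case and whitespace."""
    name = "exact_match"

    def compute(self, sample):
        pred = sample.generated_answer.strip().lower()
        ref = (sample.reference_answer or "").strip().lower()  # reference may be missing
        return {self.name: float(pred == ref)}
```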
## Fallback Behaviour
| Component available | Faithfulness / Relevance | Hallucination | Cross-modal |
|---|---|---|---|
| Core only (numpy) | Token overlap (Jaccard) | Token overlap | Skipped (returns 0) |
| + sentence-transformers | Embedding cosine sim | Token overlap | Skipped |
| + transformers (NLI) | Embedding cosine sim | NLI entailment | Skipped |
| + transformers (CLIP) | Embedding cosine sim | NLI entailment | CLIP cosine sim |
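In the CPU-only configuration, faithfulness is scored by Jaccard overlap between token sets. As a rough sketch of what that computes (the library's exact tokenisation may differ):

```python
def jaccard_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Generated claim vs. retrieved context:
jaccard_overlap(
    "The X-ray shows bilateral infiltrates indicating pneumonia.",
    "Bilateral infiltrates consistent with pneumonia.",
)
```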
## Export & Analysis

```python
# Summary statistics
results.summary()  # {metric: {mean, std, median, min, max, n}}

# Per-sample DataFrame
df = results.to_dataframe()

# JSON export
results.to_json("results.json")
```
## Project Structure

```text
mmeval-vrag/
├── mmeval_vrag/
│   ├── __init__.py            # Public API
│   ├── config.py              # EvalConfig + metric registry
│   ├── types.py               # EvalSample, RetrievedItem, ImageInput
│   ├── results.py             # EvalResult, EvalResultCollection
│   ├── cli.py                 # CLI entry point
│   ├── evaluators/
│   │   ├── multimodal_rag.py  # Main evaluator
│   │   └── pipeline.py        # End-to-end pipeline evaluator
│   ├── metrics/
│   │   ├── __init__.py        # BaseMetric + registry
│   │   ├── retrieval.py       # Precision, Recall, MRR, NDCG
│   │   ├── faithfulness.py    # Faithfulness, Answer/Context Relevance
│   │   ├── hallucination.py   # Hallucination Rate
│   │   └── cross_modal.py     # CLIP-based cross-modal metrics
│   ├── datasets/
│   │   └── loaders.py         # JSONL + VQA dataset loaders
│   └── utils/
│       └── text.py            # Sentence splitting, token overlap
├── tests/
│   └── test_core.py
├── examples/
│   ├── quickstart.py
│   └── pipeline_eval.py
├── pyproject.toml
├── LICENSE
└── README.md
```
## Contributing

Contributions welcome! Please open an issue or PR on GitHub.

```bash
git clone https://github.com/EmmanuelleB985/mmeval-vrag.git
cd mmeval-vrag
pip install -e ".[dev]"
pytest
```
## License

Apache 2.0 — see LICENSE.
## Citation

```bibtex
@software{bourigault2025mmeval,
  author = {Bourigault, Emmanuelle},
  title  = {mmeval-vrag: Evaluation Framework for Multimodal Vision-Language RAG Systems},
  year   = {2025},
  url    = {https://github.com/EmmanuelleB985/mmeval-vrag},
}
```