RadEval
All-in-one metrics for evaluating AI-generated radiology text
TL;DR
```shell
pip install RadEval
```

```python
from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True,
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```
```json
{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}
```
Installation
```shell
pip install RadEval          # from PyPI
pip install 'RadEval[api]'   # include OpenAI/Gemini for MammoGREEN
```
Or install from source:
```shell
git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
```
Supported Metrics
| Category | Metric | Flag | Modality | Best For | Usage |
|---|---|---|---|---|---|
| Lexical | BLEU | `do_bleu` | -- | Surface-level n-gram overlap | docs |
| | ROUGE | `do_rouge` | -- | Content coverage | docs |
| Semantic | BERTScore | `do_bertscore` | -- | Semantic similarity | docs |
| | RadEval BERTScore | `do_radeval_bertscore` | -- | Domain-adapted radiology semantics | docs |
| Clinical | F1CheXbert | `do_f1chexbert` | CXR | CheXpert finding classification | docs |
| | F1RadBERT-CT | `do_f1radbert_ct` | CT | CT finding classification | docs |
| | F1RadGraph | `do_radgraph` | CXR | Clinical entity/relation accuracy | docs |
| | RaTEScore | `do_ratescore` | CXR | Entity-level synonym-aware scoring | docs |
| Specialized | RadGraph-RadCliQ | `do_radgraph_radcliq` | CXR | Per-pair entity+relation F1 (RadCliQ variant) | docs |
| | RadCliQ-v1 | `do_radcliq` | CXR | Composite clinical relevance | docs |
| | SRRBert | `do_srrbert` | CXR | Structured report evaluation | docs |
| | Temporal F1 | `do_temporal` | CXR | Temporal consistency | docs |
| | GREEN | `do_green` | CXR | LLM-based overall quality (7B model) | docs |
| | MammoGREEN | `do_mammo_green` | Mammo | Mammography-specific LLM scoring | docs |
| | CRIMSON | `do_crimson` | CXR | LLM-based clinical significance scoring | docs |
| | RadFact-CT | `do_radfact_ct` | CT | LLM-based factual precision/recall | docs |
Modality: CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.
Enable only the metrics you need -- each one is loaded lazily.
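The lazy-loading note above can be sketched as follows. This is an illustrative pattern, not RadEval's actual implementation; a toy exact-match scorer stands in for a heavy model that would otherwise be loaded at construction time.

```python
# Illustrative lazy-loading pattern: the (expensive) scorer is only
# built the first time the metric is actually computed.
class LazyMetric:
    def __init__(self, factory):
        self._factory = factory  # callable that builds the heavy scorer
        self._scorer = None

    def __call__(self, refs, hyps):
        if self._scorer is None:  # construct on first use only
            self._scorer = self._factory()
        return self._scorer(refs, hyps)

# Toy factory standing in for e.g. loading a BERTScore checkpoint.
def build_toy_scorer():
    return lambda refs, hyps: sum(r == h for r, h in zip(refs, hyps)) / len(refs)

bleu_like = LazyMetric(build_toy_scorer)
print(bleu_like(["a", "b"], ["a", "c"]))  # scorer is built here; prints 0.5
```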
Per-Sample Output
Pass `do_per_sample=True` to get per-sample scores for every enabled metric. The output uses the same flat keys as the default mode, but each value is a `list[float]` of length `n_samples` instead of a single aggregate.

```python
evaluator = RadEval(do_bleu=True, do_bertscore=True, do_per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]      → [0.85, 0.40, ...] (one score per sample)
# results["bertscore"] → [0.95, 0.89, ...]
```
See docs/metrics.md for the full list of per-sample output keys for each metric.
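Because per-sample values are plain lists, they compose directly with standard Python. For example, a small sketch (with hypothetical scores) that surfaces the worst-scoring sample for inspection:

```python
# `results` mimics the shape of do_per_sample=True output
# (hypothetical scores for illustration).
results = {"bleu": [0.85, 0.40, 0.63]}

# Pair each score with its sample index and sort ascending.
ranked = sorted(enumerate(results["bleu"]), key=lambda pair: pair[1])
worst_idx, worst_score = ranked[0]
print(worst_idx, worst_score)  # → 1 0.4
```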
Detailed Output
Pass `do_details=True` to get additional aggregate scores beyond the defaults: per-label F1 breakdowns for classifiers, BLEU-1/2/3, and standard deviations for LLM-based metrics. Same flat keys as the default mode, with no nesting.

```python
evaluator = RadEval(do_bleu=True, do_f1chexbert=True, do_crimson=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]        → 0.36 (same as default)
# results["bleu_1"]      → 0.55 (extra: BLEU-1)
# results["bleu_2"]      → 0.42 (extra: BLEU-2)
# results["crimson_std"] → 0.15 (extra: standard deviation)
# results["f1chexbert_label_scores_f1"] → {"f1chexbert_5": {"Cardiomegaly": 0.59, ...}, ...}
```
See docs/metrics.md for the full output schema of each metric.
Comparing Systems
Use `compare_systems` to run paired approximate randomization tests between any number of systems:

```python
from RadEval import RadEval, compare_systems

evaluator = RadEval(do_bleu=True)

signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    metrics={'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)
```
See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.
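For intuition, the paired approximate randomization test behind `compare_systems` can be sketched in a few lines. This toy version (not the library's implementation) randomly swaps each pair's per-sample scores and counts how often the shuffled difference is at least as extreme as the observed one:

```python
import random

def approx_randomization(scores_a, scores_b, n_trials=10000, seed=0):
    """Toy paired approximate randomization test on per-sample scores."""
    rng = random.Random(seed)
    # Observed paired score difference between the two systems.
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)))
    hits = 0
    for _ in range(n_trials):
        # Swap each pair's labels with probability 0.5 (null hypothesis:
        # system identity doesn't matter for that sample).
        diff = sum((b - a) if rng.random() < 0.5 else (a - b)
                   for a, b in zip(scores_a, scores_b))
        if abs(diff) >= observed:
            hits += 1
    # Smoothed p-value: fraction of shuffles at least as extreme.
    return (hits + 1) / (n_trials + 1)

p = approx_randomization([0.2, 0.4, 0.5], [0.6, 0.7, 0.9], n_trials=2000)
```

With only three paired samples, exactly 2 of the 8 swap patterns reproduce the full observed difference, so `p` lands near 0.25; more samples give the test finer resolution.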
Documentation
| Page | Contents |
|---|---|
| docs/metrics.md | What each metric measures, do_per_sample / do_details output schemas |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes |
| docs/file_formats.md | Loading data from .tok, .json, and Python lists |
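The loaders described in docs/file_formats.md ultimately produce paired Python lists. A minimal sketch of the JSON route, assuming a hypothetical `{"refs": [...], "hyps": [...]}` layout (see docs/file_formats.md for the actually supported formats):

```python
import json

# Write a tiny example file with a hypothetical layout, then load it
# back into the plain Python lists RadEval expects.
with open("reports.json", "w") as f:
    json.dump({"refs": ["No effusion."],
               "hyps": ["No pleural effusion."]}, f)

with open("reports.json") as f:
    data = json.load(f)

refs, hyps = data["refs"], data["hyps"]
assert len(refs) == len(hyps), "refs and hyps must be aligned pairwise"
```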
RadEval Expert Dataset
A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.
Citation
```bibtex
@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin and
      Zhang, Xi and
      Abderezaei, Javid and
      Bauml, Julie and
      Boodoo, Roger and
      Haghighi, Fatemeh and
      Ganjizadeh, Ali and
      Brattain, Eric and
      Van Veen, Dave and
      Meng, Zaiqiao and
      Eyre, David W and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}
```
Contributors
- Jean-Benoit Delbrouck
- Justin Xu
- Xi Zhang
Acknowledgments
Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, and datasets like MIMIC-CXR.
If you find RadEval useful, please give us a star!