All-in-one metrics for evaluating AI-generated radiology text
RadEval
RadEval (EMNLP, 2025) is a Python framework for evaluating AI-generated radiology reports. It serves two use cases:
- Evaluation: 16 metrics spanning lexical, semantic, clinical, and LLM-based evaluation, all behind a single interface with lazy loading and config-file support.
- Reinforcement-learning (RL) rewards: every RL-eligible metric exposed as a drop-in HuggingFace TRL reward function for GRPO (and other trainers that accept a reward callable).
Table of Contents
- Installation
- Usage: Evaluation
- Usage: RL rewards
- Supported Metrics
- API Keys for LLM Metrics
- Documentation
- Expert Dataset
- Contributing
- Citation
Installation
pip install radeval # from PyPI
pip install radeval[api] # include OpenAI/Gemini for LLM-based metrics
Or install from source:
git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
# Torch wheels are CUDA-version specific. If the default wheel from PyPI does
# not match your local NVIDIA driver, install a matching build first, e.g.:
# pip install --index-url https://download.pytorch.org/whl/cu128 torch==2.9.1 torchvision==0.24.1
Known-good stack (for RadEval 2.1+): Python 3.11, torch==2.9.1+cu128, transformers==5.6.2, tokenizers==0.22.2, huggingface_hub>=1.0, accelerate>=1.1, numpy<3. The full test suite passes on this configuration. For the [rl] extras, add trl>=1.3.0,<2.
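To confirm the install, a quick import check (a minimal sketch; it assumes only that the package exposes RadEval at the top level, as in the usage examples below):
python -c "from radeval import RadEval; print('RadEval import OK')"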
Usage: Evaluation
Basic
Pass a list of metric names. Each metric is loaded lazily; only the ones you enable import their dependencies.
from radeval import RadEval
import json
refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
evaluator = RadEval(metrics=["radgraph", "bleu"])
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=4))
{
    "radgraph_simple": 0.72,
    "radgraph_partial": 0.61,
    "radgraph_complete": 0.61,
    "bleu": 0.36
}
Config file
For per-metric settings (model, provider, concurrency) or reproducible evaluation configs, use a YAML file:
# config.yaml
metrics:
  - bleu
  - rouge
  - crimson:
      provider: openai
      model_name: gpt-4o-mini
  - radfact_ct:
      filter_negatives: true
output:
  mode: per_sample  # or "default" or "detailed"
evaluator = RadEval.from_config("config.yaml")
results = evaluator(refs=refs, hyps=hyps)
See examples/config.yaml for a complete example.
Output modes
| Mode | Flag | Values |
|---|---|---|
| Default | — | float per metric |
| Per-sample | per_sample=True | list[float] per metric (one per report) |
| Detailed | detailed=True | Extra keys: label breakdowns, BLEU-1/2/3, std |
# Per-sample scores
evaluator = RadEval(metrics=["bleu", "bertscore"], per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"] → [0.85, 0.40, ...] (one per sample)
# results["bertscore"] → [0.95, 0.89, ...]
# Detailed output (label F1s, sub-scores, std)
evaluator = RadEval(metrics=["bleu", "f1chexbert"], detailed=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu_1"] → 0.55 (extra: BLEU-1)
# results["bleu_2"] → 0.42 (extra: BLEU-2)
See docs/metrics.md for the full output schema of each metric.
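Per-sample scores are handy for error analysis. A minimal sketch (reusing the refs and hyps lists from the basic example above) that surfaces the lowest-scoring reports under BLEU:
from radeval import RadEval

evaluator = RadEval(metrics=["bleu"], per_sample=True)
results = evaluator(refs=refs, hyps=hyps)

# Pair each hypothesis report with its per-sample BLEU and list the worst cases first.
worst = sorted(zip(results["bleu"], hyps), key=lambda pair: pair[0])
for score, report in worst[:3]:
    print(f"{score:.3f}  {report}")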
Comparing systems
Use compare_systems to run paired approximate randomization tests between any number of systems:
from radeval import RadEval, compare_systems
evaluator = RadEval(metrics=["bleu"])
signatures, scores = compare_systems(
    systems={'baseline': baseline_reports, 'improved': improved_reports},
    metrics={'bleu': lambda hyps, refs: evaluator(refs, hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)
See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.
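More than one metric can be compared in the same run by adding entries to the metrics dict. A sketch following the same lambda pattern as above (the "radgraph_simple" key mirrors the output shown in the basic example; adjust to the keys your metrics actually return):
evaluator = RadEval(metrics=["bleu", "radgraph"])
signatures, scores = compare_systems(
    systems={'baseline': baseline_reports, 'improved': improved_reports},
    metrics={
        'bleu': lambda hyps, refs: evaluator(refs, hyps)['bleu'],
        'radgraph': lambda hyps, refs: evaluator(refs, hyps)['radgraph_simple'],
    },
    references=reference_reports,
    n_samples=10000,
)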
Usage: RL rewards
RadEval metrics aren't just for offline evaluation — every RL-eligible metric is a drop-in HuggingFace TRL reward function. GRPO is the flagship, tested path; RLOO and other TRL trainers that consume a reward-function callable use the same interface.
Three things to look at, in increasing depth:
RL quickstart
pip install radeval[rl] # adds trl>=1.3.0,<2
from radeval.rewards import make_reward_fn
from trl import GRPOTrainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[make_reward_fn("bleu")],  # or bertscore, radgraph (key=...), radcliq, ...
    train_dataset=dataset,                  # must have a "ground_truth" column
)
trainer.train()
Runnable end-to-end: python examples/trl_grpo_quickstart.py.
RL benchmarks: cost & divergence
How expensive is each metric when used as a per-step reward, and how does the choice of reward change what the model learns? See docs/trl_rewards_benchmarks.md for:
- A speed table covering all 16 public metrics, from 0.09 ms/sample (BLEU, CPU) to ~2,200 ms/sample (GREEN, 7B local LLM). RadCliQ, a metric with strong correlation to radiologist preferences, comes in at ~161 ms/sample.
- A reward-divergence gallery: same rollouts, scored by several metrics side-by-side. Headline finding: on a negation flip ("No pleural effusion." → "Pleural effusion."), BERTScore rewards the clinically-wrong rollout at 0.893, nearly its 1.0 ceiling; a GRPO policy trained against BERTScore would be pushed toward this rollout. Clinical metrics penalize the flip, but by widely varying magnitudes: RadGraph drops from 1.0 to 0.50, RadCliQ rises by ~1.7 distance units, and CRIMSON (LLM judge, signed range (−1, 1]) scores −0.333: a single hallucinated abnormal finding against a normal reference. The benchmarks page lays out the full per-metric reaction across several other rollout types.
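The negation-flip comparison is easy to reproduce locally with the evaluation API (a minimal sketch; exact numbers will vary with model and package versions):
from radeval import RadEval

refs = ["No pleural effusion."]
hyps = ["Pleural effusion."]  # clinically wrong: the negation is flipped

evaluator = RadEval(metrics=["bertscore", "radgraph"])
print(evaluator(refs=refs, hyps=hyps))
# Expect a high BERTScore despite the flip, and a sharp drop in the RadGraph scores.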
RL reward API & docs
- docs/trl_rewards.md: the make_reward_fn contract, the required key= for multi-key metrics, conversational-completion handling, multi-metric composition, a VLM pointer, and known limitations.
- Note: for distance metrics (lower = better) such as RadCliQ, use the safe inversion make_reward_fn("radcliq", score_transform=lambda x: -x). A composition sketch combining both ideas follows below.
Supported Metrics
| Category | Metric | Key | Modality | Provider | Best For | Usage |
|---|---|---|---|---|---|---|
| Lexical | BLEU | "bleu" | -- | -- | Surface-level n-gram overlap | docs |
| | ROUGE | "rouge" | -- | -- | Content coverage | docs |
| Semantic | BERTScore | "bertscore" | -- | -- | Semantic similarity | docs |
| | RadEval BERTScore | "radeval_bertscore" | -- | -- | Domain-adapted radiology semantics | docs |
| Clinical | F1CheXbert | "f1chexbert" | CXR | -- | CheXpert finding classification | docs |
| | F1RadBERT-CT | "f1radbert_ct" | CT | -- | CT finding classification | docs |
| | F1RadGraph | "radgraph" | CXR | -- | Clinical entity/relation accuracy | docs |
| | RaTEScore | "ratescore" | CXR | -- | Entity-level synonym-aware scoring | docs |
| Specialized | RadGraph-RadCliQ | "radgraph_radcliq" | CXR | -- | Per-pair entity+relation F1 (RadCliQ variant) | docs |
| | RadCliQ-v1 | "radcliq" | CXR | -- | Composite clinical relevance | docs |
| | SRRBert | "srrbert" | CXR | -- | Structured report evaluation | docs |
| | Temporal F1 | "temporal" | CXR | -- | Temporal consistency | docs |
| | GREEN | "green" | CXR | Local HF | LLM-based overall quality (7B model) | docs |
| | MammoGREEN | "mammo_green" | Mammo | OpenAI / Gemini | Mammography-specific LLM scoring | docs |
| | CRIMSON | "crimson" | CXR | OpenAI / HF | LLM-based clinical significance scoring | docs |
| | RadFact-CT | "radfact_ct" | CT | OpenAI | LLM-based factual precision/recall | docs |
Modality: CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.
Enable only the metrics you need; each one is loaded lazily.
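For example, a chest X-ray evaluation that needs only clinical metrics could enable just these (keys as in the table above); the LLM-based metrics and their dependencies are never imported:
evaluator = RadEval(metrics=["radgraph", "f1chexbert", "radcliq"])
results = evaluator(refs=refs, hyps=hyps)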
API Keys for LLM Metrics
LLM-based metrics (CRIMSON, MammoGREEN, RadFact-CT) share two global API key arguments:
evaluator = RadEval(
    metrics=["crimson", "mammo_green", "radfact_ct"],
    openai_api_key="sk-...",   # used by CRIMSON (openai), MammoGREEN (openai), RadFact-CT
    gemini_api_key="AIza...",  # used by MammoGREEN (gemini)
)
If not passed explicitly, keys fall back to the environment variables OPENAI_API_KEY, GEMINI_API_KEY, or GOOGLE_API_KEY. An error is raised if the chosen provider requires a key that is neither passed nor in the environment.
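Equivalently, the keys can be supplied through the environment; a minimal sketch using the variable names listed above:
import os
from radeval import RadEval

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # or export it in the shell before launching Python
evaluator = RadEval(metrics=["crimson"])           # key is read from the environment, not passed explicitly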
Documentation
| Page | Contents |
|---|---|
| docs/metrics.md | What each metric measures, per_sample / detailed output schemas |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes |
| docs/file_formats.md | Loading data from .tok, .json, and Python lists |
| docs/trl_rewards.md | Using RadEval metrics as RL reward functions with HuggingFace TRL |
| docs/trl_rewards_benchmarks.md | Speed table + reward-divergence gallery for picking an RL reward metric |
RadEval Expert Dataset
A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.
Citation
@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin and
      Zhang, Xi and
      Abderezaei, Javid and
      Bauml, Julie and
      Boodoo, Roger and
      Haghighi, Fatemeh and
      Ganjizadeh, Ali and
      Brattain, Eric and
      Van Veen, Dave and
      Meng, Zaiqiao and
      Eyre, David W and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}
Contributors
- Jean-Benoit Delbrouck
- Justin Xu
- Xi Zhang
- Dave Van Veen
Contributing
RadEval is open source and we welcome contributions from the community. Whether it's a new metric, a bug fix, or improved documentation, feel free to open an issue or submit a pull request on GitHub.
Acknowledgments
Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, CRIMSON, and datasets like MIMIC-CXR.
Please give us a star if you find RadEval useful! ⭐