All-in-one metrics for evaluating AI-generated radiology text
RadEval
RadEval (EMNLP, 2025) is a Python framework for evaluating AI-generated radiology reports. It serves two use cases:
- Evaluation: 16 metrics spanning lexical, semantic, clinical, and LLM-based evaluation, all behind a single interface with lazy loading and config-file support.
- Reinforcement-learning (RL) rewards: every RL-eligible metric exposed as a drop-in HuggingFace TRL reward function for GRPO (and other trainers that accept a reward callable).
Table of Contents
- Installation
- Usage: Evaluation
- Usage: RL rewards
- Supported Metrics
- API Keys for LLM Metrics
- Documentation
- Expert Dataset
- Contributing
- Citation
Installation
pip install radeval # from PyPI
pip install radeval[api] # include OpenAI/Gemini for LLM-based metrics
Or install from source:
git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
# Torch wheels are CUDA-version specific. If the default wheel from PyPI does
# not match your local NVIDIA driver, install a matching build first, e.g.:
# pip install --index-url https://download.pytorch.org/whl/cu128 torch==2.9.1 torchvision==0.24.1
Known-good stack (for RadEval 2.1+): Python 3.11, torch==2.9.1+cu128, transformers==5.6.2, tokenizers==0.22.2, huggingface_hub>=1.0, accelerate>=1.1, numpy<3. The full test suite passes on this configuration. For the [rl] extras, add trl>=1.3.0,<2.
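To confirm the install, a quick import check (a minimal sketch; it assumes only that the package exposes RadEval at the top level, as in the usage examples below):
python -c "from radeval import RadEval; print('RadEval import OK')"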
Usage: Evaluation
Basic
Pass a list of metric names. Each metric is loaded lazily; only the ones you enable import their dependencies.
from radeval import RadEval
import json
refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
evaluator = RadEval(metrics=["radgraph", "bleu"])
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=4))
{
    "radgraph_simple": 0.72,
    "radgraph_partial": 0.61,
    "radgraph_complete": 0.61,
    "bleu": 0.36
}
Config file
For per-metric settings (model, provider, concurrency) or reproducible evaluation configs, use a YAML file:
# config.yaml
metrics:
  - bleu
  - rouge
  - crimson:
      provider: openai
      model_name: gpt-4o-mini
  - radfact_ct:
      filter_negatives: true
output:
  mode: per_sample  # or "default" or "detailed"
evaluator = RadEval.from_config("config.yaml")
results = evaluator(refs=refs, hyps=hyps)
See examples/config.yaml for a complete example.
Output modes
| Mode | Flag | Values |
|---|---|---|
| Default | — | float per metric |
| Per-sample | per_sample=True | list[float] per metric (one per report) |
| Detailed | detailed=True | Extra keys: label breakdowns, BLEU-1/2/3, std |
# Per-sample scores
evaluator = RadEval(metrics=["bleu", "bertscore"], per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"] → [0.85, 0.40, ...] (one per sample)
# results["bertscore"] → [0.95, 0.89, ...]
# Detailed output (label F1s, sub-scores, std)
evaluator = RadEval(metrics=["bleu", "f1chexbert"], detailed=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu_1"] → 0.55 (extra: BLEU-1)
# results["bleu_2"] → 0.42 (extra: BLEU-2)
See docs/metrics.md for the full output schema of each metric.
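Per-sample scores are handy for error analysis. A minimal sketch (reusing the refs and hyps lists from the basic example above) that surfaces the lowest-scoring reports under BLEU:
from radeval import RadEval

evaluator = RadEval(metrics=["bleu"], per_sample=True)
results = evaluator(refs=refs, hyps=hyps)

# Pair each hypothesis report with its per-sample BLEU and list the worst cases first.
worst = sorted(zip(results["bleu"], hyps), key=lambda pair: pair[0])
for score, report in worst[:3]:
    print(f"{score:.3f}  {report}")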
Comparing systems
Use compare_systems to run paired approximate randomization tests between any number of systems:
from radeval import RadEval, compare_systems
evaluator = RadEval(metrics=["bleu"])
signatures, scores = compare_systems(
    systems={'baseline': baseline_reports, 'improved': improved_reports},
    metrics={'bleu': lambda hyps, refs: evaluator(refs, hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)
See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.
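More than one metric can be compared in the same run by adding entries to the metrics dict. A sketch following the same lambda pattern as above (the "radgraph_simple" key mirrors the output shown in the basic example; adjust to the keys your metrics actually return):
evaluator = RadEval(metrics=["bleu", "radgraph"])
signatures, scores = compare_systems(
    systems={'baseline': baseline_reports, 'improved': improved_reports},
    metrics={
        'bleu': lambda hyps, refs: evaluator(refs, hyps)['bleu'],
        'radgraph': lambda hyps, refs: evaluator(refs, hyps)['radgraph_simple'],
    },
    references=reference_reports,
    n_samples=10000,
)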
Usage: RL rewards
RadEval metrics aren't just for offline evaluation — every RL-eligible metric is a drop-in HuggingFace TRL reward function. GRPO is the flagship, tested path; RLOO and other TRL trainers that consume a reward-function callable use the same interface.
Three things to look at, in increasing depth:
RL quickstart
pip install radeval[rl] # adds trl>=1.3.0,<2
from radeval.rewards import make_reward_fn
from trl import GRPOTrainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[make_reward_fn("bleu")],  # or bertscore, radgraph (key=...), radcliq, ...
    train_dataset=dataset,                  # must have a "ground_truth" column
)
trainer.train()
Runnable end-to-end: python examples/trl_grpo_quickstart.py.
RL benchmarks: cost & divergence
How expensive is each metric when used as a per-step reward, and how does the choice of reward change what the model learns? See docs/trl_rewards_benchmarks.md for:
- A speed table covering all 16 public metrics, from 0.09 ms/sample (BLEU, CPU) to ~2,200 ms/sample (GREEN, 7B local LLM). RadCliQ, a metric with strong correlation to radiologist preferences, comes in at ~161 ms/sample.
- A reward-divergence gallery: same rollouts, scored by several metrics side-by-side. Headline finding: on a negation flip ("No pleural effusion." → "Pleural effusion."), BERTScore rewards the clinically-wrong rollout at 0.893, nearly its 1.0 ceiling; a GRPO policy trained against BERTScore would be pushed toward this rollout. Clinical metrics penalize the flip, but by widely varying magnitudes: RadGraph drops from 1.0 to 0.50, RadCliQ rises by ~1.7 distance units, and CRIMSON (LLM judge, signed range (−1, 1]) scores −0.333: a single hallucinated abnormal finding against a normal reference. The benchmarks page lays out the full per-metric reaction across several other rollout types.
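The negation-flip comparison is easy to reproduce locally with the evaluation API (a minimal sketch; exact numbers will vary with model and package versions):
from radeval import RadEval

refs = ["No pleural effusion."]
hyps = ["Pleural effusion."]  # clinically wrong: the negation is flipped

evaluator = RadEval(metrics=["bertscore", "radgraph"])
print(evaluator(refs=refs, hyps=hyps))
# Expect a high BERTScore despite the flip, and a sharp drop in the RadGraph scores.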
RL reward API & docs
- docs/trl_rewards.md: the make_reward_fn contract, the required key= for multi-key metrics, conversational-completion handling, multi-metric composition, a VLM pointer, and known limitations.
- Note: for distance metrics (lower = better) such as RadCliQ, use the safe inversion make_reward_fn("radcliq", score_transform=lambda x: -x). A composition sketch combining both ideas follows below.
Supported Metrics
| Category | Metric | Key | Modality | Provider | Best For | Usage |
|---|---|---|---|---|---|---|
| Lexical | BLEU | "bleu" | -- | -- | Surface-level n-gram overlap | docs |
| | ROUGE | "rouge" | -- | -- | Content coverage | docs |
| Semantic | BERTScore | "bertscore" | -- | -- | Semantic similarity | docs |
| | RadEval BERTScore | "radeval_bertscore" | -- | -- | Domain-adapted radiology semantics | docs |
| Clinical | F1CheXbert | "f1chexbert" | CXR | -- | CheXpert finding classification | docs |
| | F1RadBERT-CT | "f1radbert_ct" | CT | -- | CT finding classification | docs |
| | F1RadGraph | "radgraph" | CXR | -- | Clinical entity/relation accuracy | docs |
| | RaTEScore | "ratescore" | CXR | -- | Entity-level synonym-aware scoring | docs |
| Specialized | RadGraph-RadCliQ | "radgraph_radcliq" | CXR | -- | Per-pair entity+relation F1 (RadCliQ variant) | docs |
| | RadCliQ-v1 | "radcliq" | CXR | -- | Composite clinical relevance | docs |
| | SRRBert | "srrbert" | CXR | -- | Structured report evaluation | docs |
| | Temporal F1 | "temporal" | CXR | -- | Temporal consistency | docs |
| | GREEN | "green" | CXR | Local HF | LLM-based overall quality (7B model) | docs |
| | MammoGREEN | "mammo_green" | Mammo | OpenAI / Gemini | Mammography-specific LLM scoring | docs |
| | CRIMSON | "crimson" | CXR | OpenAI / HF | LLM-based clinical significance scoring | docs |
| | RadFact-CT | "radfact_ct" | CT | OpenAI | LLM-based factual precision/recall | docs |
Modality: CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.
Enable only the metrics you need; each one is loaded lazily.
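For example, a chest X-ray evaluation that needs only clinical metrics could enable just these (keys as in the table above); the LLM-based metrics and their dependencies are never imported:
evaluator = RadEval(metrics=["radgraph", "f1chexbert", "radcliq"])
results = evaluator(refs=refs, hyps=hyps)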
API Keys for LLM Metrics
LLM-based metrics (CRIMSON, MammoGREEN, RadFact-CT) share two global API key arguments:
evaluator = RadEval(
    metrics=["crimson", "mammo_green", "radfact_ct"],
    openai_api_key="sk-...",   # used by CRIMSON (openai), MammoGREEN (openai), RadFact-CT
    gemini_api_key="AIza...",  # used by MammoGREEN (gemini)
)
If not passed explicitly, keys fall back to the environment variables OPENAI_API_KEY, GEMINI_API_KEY, or GOOGLE_API_KEY. An error is raised if the chosen provider requires a key that is neither passed nor in the environment.
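Equivalently, the keys can be supplied through the environment; a minimal sketch using the variable names listed above:
import os
from radeval import RadEval

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # or export it in the shell before launching Python
evaluator = RadEval(metrics=["crimson"])           # key is read from the environment, not passed explicitly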
Documentation
| Page | Contents |
|---|---|
| docs/metrics.md | What each metric measures, per_sample / detailed output schemas |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes |
| docs/file_formats.md | Loading data from .tok, .json, and Python lists |
| docs/trl_rewards.md | Using RadEval metrics as RL reward functions with HuggingFace TRL |
| docs/trl_rewards_benchmarks.md | Speed table + reward-divergence gallery for picking an RL reward metric |
RadEval Expert Dataset
A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.
Citation
@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin and
      Zhang, Xi and
      Abderezaei, Javid and
      Bauml, Julie and
      Boodoo, Roger and
      Haghighi, Fatemeh and
      Ganjizadeh, Ali and
      Brattain, Eric and
      Van Veen, Dave and
      Meng, Zaiqiao and
      Eyre, David W and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}
Contributors
- Jean-Benoit Delbrouck
- Justin Xu
- Xi Zhang
- Dave Van Veen
Contributing
RadEval is open source and we welcome contributions from the community. Whether it's a new metric, a bug fix, or improved documentation, feel free to open an issue or submit a pull request on GitHub.
Acknowledgments
Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, CRIMSON, and datasets like MIMIC-CXR.
Please give us a star if you find RadEval useful! ⭐