RadEval

All-in-one metrics for evaluating AI-generated radiology text

TL;DR

pip install RadEval

from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))

Example output:

{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}
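
The evaluator takes `refs` and `hyps` as parallel lists, so entry i of each list should describe the same study. A minimal pre-flight check, sketched below, can catch misaligned or empty inputs before any metric model is loaded. `check_pairs` is a hypothetical helper, not part of RadEval, and whether RadEval performs similar validation itself is not stated here.

```python
def check_pairs(refs, hyps):
    """Return the number of pairs, or raise ValueError if the lists are misaligned.

    Hypothetical helper (not part of RadEval): checks that both lists have the
    same length and that every entry is a non-empty string.
    """
    if len(refs) != len(hyps):
        raise ValueError(f"{len(refs)} references but {len(hyps)} hypotheses")
    for i, (ref, hyp) in enumerate(zip(refs, hyps)):
        for name, text in (("reference", ref), ("hypothesis", hyp)):
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"{name} at index {i} is empty or not a string")
    return len(refs)


refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

n_pairs = check_pairs(refs, hyps)
print(f"{n_pairs} aligned report pairs")
```

Running the check first keeps a length mismatch from surfacing only after a large model has been downloaded or loaded.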

Installation

pip install RadEval              # from PyPI
pip install RadEval[api]         # include OpenAI/Gemini for MammoGREEN

Or install from source:

git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'

Supported Metrics

| Category | Metric | Flag | Modality | Best For | Usage |
|---|---|---|---|---|---|
| Lexical | BLEU | `do_bleu` | -- | Surface-level n-gram overlap | docs |
| Lexical | ROUGE | `do_rouge` | -- | Content coverage | docs |
| Semantic | BERTScore | `do_bertscore` | -- | Semantic similarity | docs |
| Semantic | RadEval BERTScore | `do_radeval_bertscore` | -- | Domain-adapted radiology semantics | docs |
| Clinical | F1CheXbert | `do_f1chexbert` | CXR | CheXpert finding classification | docs |
| Clinical | F1RadBERT-CT | `do_f1radbert_ct` | CT | CT finding classification | docs |
| Clinical | F1RadGraph | `do_radgraph` | CXR | Clinical entity/relation accuracy | docs |
| Clinical | RaTEScore | `do_ratescore` | CXR | Entity-level synonym-aware scoring | docs |
| Specialized | RadGraph-RadCliQ | `do_radgraph_radcliq` | CXR | Per-pair entity+relation F1 (RadCliQ variant) | docs |
| Specialized | RadCliQ-v1 | `do_radcliq` | CXR | Composite clinical relevance | docs |
| Specialized | SRRBert | `do_srrbert` | CXR | Structured report evaluation | docs |
| Specialized | Temporal F1 | `do_temporal` | CXR | Temporal consistency | docs |
| Specialized | GREEN | `do_green` | CXR | LLM-based overall quality (7B model) | docs |
| Specialized | MammoGREEN | `do_mammo_green` | Mammo | Mammography-specific LLM scoring | docs |
| Specialized | CRIMSON | `do_crimson` | CXR | LLM-based clinical significance scoring | docs |
| Specialized | RadFact-CT | `do_radfact_ct` | CT | LLM-based factual precision/recall | docs |

Modality: CXR = Chest X-Ray, CT = Computed Tomography, Mammo = Mammography, -- = modality-agnostic.

Enable only the metrics you need -- each one is loaded lazily.
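
Because every metric is switched on by its own flag, the flag set can be built programmatically. The sketch below copies the flag-to-modality mapping from the table above and returns the keyword arguments for one modality. `METRIC_MODALITY` and `flags_for_modality` are hypothetical names introduced for illustration; they are not part of RadEval.

```python
# Flag -> modality, copied from the Supported Metrics table.
# "--" marks modality-agnostic metrics.
METRIC_MODALITY = {
    "do_bleu": "--",
    "do_rouge": "--",
    "do_bertscore": "--",
    "do_radeval_bertscore": "--",
    "do_f1chexbert": "CXR",
    "do_f1radbert_ct": "CT",
    "do_radgraph": "CXR",
    "do_ratescore": "CXR",
    "do_radgraph_radcliq": "CXR",
    "do_radcliq": "CXR",
    "do_srrbert": "CXR",
    "do_temporal": "CXR",
    "do_green": "CXR",
    "do_mammo_green": "Mammo",
    "do_crimson": "CXR",
    "do_radfact_ct": "CT",
}


def flags_for_modality(modality, include_agnostic=True):
    """Return {flag: True} for every metric listed for the given modality."""
    if modality not in set(METRIC_MODALITY.values()) - {"--"}:
        raise ValueError(f"unknown modality: {modality!r}")
    wanted = {modality}
    if include_agnostic:
        wanted.add("--")
    return {flag: True for flag, m in METRIC_MODALITY.items() if m in wanted}


ct_flags = flags_for_modality("CT")
print(sorted(ct_flags))

# With RadEval installed, the dict can be unpacked into the constructor:
# from RadEval import RadEval
# evaluator = RadEval(**ct_flags)
```

Some of the selected metrics are LLM-based and may be slow or costly to run, so it is worth trimming the returned dict to the metrics you actually need before constructing the evaluator.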

Per-Sample Output

Pass do_per_sample=True to get per-sample scores for every enabled metric. The output uses the same flat keys as the default mode, but each value is a list[float] of length n_samples instead of a single aggregate.

evaluator = RadEval(do_bleu=True, do_bertscore=True, do_per_sample=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]      → [0.85, 0.40, ...]   (one per sample)
# results["bertscore"] → [0.95, 0.89, ...]

See docs/metrics.md for the full list of per-sample output keys for each metric.
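
Per-sample lists make it easy to summarize scores and locate the weakest reports. The sketch below works on a mock results dict shaped like the output described above, so it runs without RadEval installed; the score values are invented for illustration. `summarize_per_sample` is a hypothetical helper, and it assumes that higher scores are better, which may not hold for every metric.

```python
from statistics import mean, pstdev


def summarize_per_sample(results):
    """Compute mean, population std, and lowest-scoring index per metric.

    Only values that are non-empty lists of numbers are summarized;
    anything else in the dict is skipped.
    """
    summary = {}
    for key, values in results.items():
        is_score_list = (
            isinstance(values, list)
            and len(values) > 0
            and all(isinstance(v, (int, float)) for v in values)
        )
        if not is_score_list:
            continue
        summary[key] = {
            "mean": mean(values),
            "std": pstdev(values),
            # Index of the lowest score (assumes higher is better).
            "worst_index": min(range(len(values)), key=values.__getitem__),
        }
    return summary


# Mock per-sample output; real values come from evaluator(refs=refs, hyps=hyps).
results = {
    "bleu": [0.85, 0.40, 0.62],
    "bertscore": [0.95, 0.89, 0.91],
}

summary = summarize_per_sample(results)
for metric, stats in summary.items():
    print(metric, stats)
```

The mean of per-sample scores may not equal the aggregate returned in the default mode, since some metrics are computed at the corpus level rather than averaged.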

Detailed Output

Pass do_details=True to get additional aggregate scores beyond the defaults: per-label F1 breakdowns for classifiers, BLEU-1/2/3, and standard deviations for LLM-based metrics. The top-level keys stay flat, as in the default mode; per-label breakdowns are returned as dictionaries under a single key.

evaluator = RadEval(do_bleu=True, do_f1chexbert=True, do_crimson=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results["bleu"]       → 0.36     (same as default)
# results["bleu_1"]     → 0.55     (extra: BLEU-1)
# results["bleu_2"]     → 0.42     (extra: BLEU-2)
# results["crimson_std"] → 0.15    (extra: std)
# results["f1chexbert_label_scores_f1"] → {"f1chexbert_5": {"Cardiomegaly": 0.59, ...}, ...}

See docs/metrics.md for the full output schema of each metric.
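
When logging detailed results to a CSV file or an experiment tracker, the per-label dictionaries usually need to be flattened into scalar columns. The sketch below does that on a mock results dict built from the values shown in the comments above; the elided entries are left out rather than guessed. `flatten` is a hypothetical helper, not part of RadEval, and the `/` separator is an arbitrary choice.

```python
def flatten(value, prefix="", sep="/"):
    """Recursively flatten nested dicts into a single {path: leaf} dict."""
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten(child, path, sep))
    return flat


# Mock detailed output; real values come from evaluator(refs=refs, hyps=hyps).
results = {
    "bleu": 0.36,
    "bleu_1": 0.55,
    "bleu_2": 0.42,
    "crimson_std": 0.15,
    "f1chexbert_label_scores_f1": {"f1chexbert_5": {"Cardiomegaly": 0.59}},
}

flat = flatten(results)
for name, score in flat.items():
    print(f"{name}: {score}")
```

Scalar entries pass through unchanged, so the same function can be applied to default-mode output as well.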

Comparing Systems

Use compare_systems to run paired approximate randomization tests between any number of systems:

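from RadEval import RadEval, compare_systems

# baseline_reports, improved_reports, and reference_reports are lists of
# report strings that you supply, aligned so index i refers to the same study.
evaluator = RadEval(do_bleu=True)
signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    # Each metric is a callable that takes (hyps, refs) and returns one score.
    metrics={'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)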

See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.
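
For intuition about what a paired approximate randomization test does, the following is a minimal standalone sketch. It is not RadEval's implementation: it operates on per-sample scores and uses the mean as the test statistic, whereas `compare_systems` takes metric callables. The function name is hypothetical, and reusing the name `n_samples` for the number of randomization trials is an assumption about what that parameter means.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_samples=10000, seed=0):
    """Two-sided paired approximate randomization test on per-sample scores.

    In each trial, every pair (a_i, b_i) is swapped with probability 0.5.
    The p-value is the smoothed fraction of trials whose absolute mean
    difference is at least as large as the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap the system labels for this sample
            total_a += a
            total_b += b
        # Small tolerance guards against floating-point rounding.
        if abs(total_a / n - total_b / n) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_samples + 1)


# Mock per-sample scores for two systems on the same ten reports.
baseline = [0.31, 0.42, 0.28, 0.35, 0.40, 0.22, 0.37, 0.30, 0.33, 0.29]
improved = [0.36, 0.47, 0.30, 0.41, 0.44, 0.29, 0.39, 0.36, 0.38, 0.33]

p_value = paired_randomization_test(baseline, improved, n_samples=2000)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the two systems were interchangeable on these reports; more trials give a finer-grained estimate at higher cost.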

Documentation

| Page | Contents |
|---|---|
| docs/metrics.md | What each metric measures, `do_per_sample` / `do_details` output schemas |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes |
| docs/file_formats.md | Loading data from .tok, .json, and Python lists |

RadEval Expert Dataset

A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.

Citation

@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin  and
      Zhang, Xi  and
      Abderezaei, Javid  and
      Bauml, Julie  and
      Boodoo, Roger  and
      Haghighi, Fatemeh  and
      Ganjizadeh, Ali  and
      Brattain, Eric  and
      Van Veen, Dave  and
      Meng, Zaiqiao  and
      Eyre, David W  and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}

Contributors

Jean-Benoit Delbrouck

Justin Xu

Xi Zhang

Acknowledgments

Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, and datasets like MIMIC-CXR.

If you find RadEval useful, please give us a star!
