
RadEval

All-in-one metrics for evaluating AI-generated radiology text


TL;DR

pip install RadEval
from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
# Output:
{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}

Installation

pip install RadEval                # from PyPI
pip install 'RadEval[api]'         # include OpenAI/Gemini support for MammoGREEN

Or install from source:

git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
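
To verify the installation, a quick smoke test that enables only BLEU (a minimal sketch mirroring the TL;DR example; BLEU is purely lexical, so no large model weights should be needed):

from RadEval import RadEval

# BLEU-only evaluator: the cheapest way to check the package imports and runs
evaluator = RadEval(do_bleu=True)
print(evaluator(refs=["No acute findings."], hyps=["No acute findings."]))
# An identical reference/hypothesis pair should score bleu close to 1.0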

Supported Metrics

| Category    | Metric            | Flag                  | Best For                                      |
|-------------|-------------------|-----------------------|-----------------------------------------------|
| Lexical     | BLEU              | do_bleu               | Surface-level n-gram overlap                  |
| Lexical     | ROUGE             | do_rouge              | Content coverage                              |
| Semantic    | BERTScore         | do_bertscore          | Semantic similarity                           |
| Semantic    | RadEval BERTScore | do_radeval_bertscore  | Domain-adapted radiology semantics            |
| Clinical    | F1CheXbert        | do_chexbert           | CheXpert finding classification               |
| Clinical    | F1RadBERT-CT      | do_f1radbert_ct       | CT finding classification                     |
| Clinical    | F1RadGraph        | do_radgraph           | Clinical entity/relation accuracy             |
| Clinical    | RaTEScore         | do_ratescore          | Entity-level synonym-aware scoring            |
| Specialized | RadGraph-RadCliQ  | do_radgraph_radcliq   | Per-pair entity+relation F1 (RadCliQ variant) |
| Specialized | RadCliQ-v1        | do_radcliq            | Composite clinical relevance                  |
| Specialized | SRR-BERT          | do_srr_bert           | Structured report evaluation                  |
| Specialized | Temporal F1       | do_temporal           | Temporal consistency                          |
| Specialized | GREEN             | do_green              | LLM-based overall quality (7B model)          |
| Specialized | MammoGREEN        | do_mammo_green        | Mammography-specific LLM scoring              |
| Specialized | RadFact-CT        | do_radfact_ct         | LLM-based factual precision/recall for CT     |
| Specialized | CRIMSON           | do_crimson            | LLM-based clinical significance scoring       |

Enable only the metrics you need -- each one is loaded lazily.
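
For example, a clinically focused run might enable just the RadGraph and CheXbert metrics, using the flags from the table above (a sketch, not an official preset):

from RadEval import RadEval

# Only the requested metrics are loaded; all others stay off
evaluator = RadEval(
    do_radgraph=True,   # F1RadGraph: clinical entity/relation accuracy
    do_chexbert=True,   # F1CheXbert: CheXpert finding classification
)
results = evaluator(refs=refs, hyps=hyps)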

Detailed Output

Pass do_details=True to get per-sample scores, label breakdowns, and entity annotations for every enabled metric. See docs/metrics.md for the full output schema of each metric.
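
A minimal sketch, assuming do_details is passed to the constructor like the other do_* flags:

# Assumption: do_details is a constructor flag alongside the metric flags
evaluator = RadEval(do_radgraph=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results now includes per-sample scores and entity annotations in addition
# to the corpus-level numbers; see docs/metrics.md for the exact schema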

Comparing Systems

Use compare_systems to run paired approximate randomization tests between any number of systems:

from RadEval import RadEval, compare_systems

# Toy data; in practice these are lists of generated and reference reports
reference_reports = ["No pleural effusions or pneumothoraces."]
baseline_reports = ["No effusion."]
improved_reports = ["No pleural effusions or pneumothoraces."]

evaluator = RadEval(do_bleu=True)
signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    metrics={'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)
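
Any callable that maps (hyps, refs) to a float can serve as a metric, so several metrics can be compared in one run. A sketch, reusing the radgraph_simple key shown in the TL;DR output:

rad_evaluator = RadEval(do_radgraph=True)

signatures, scores = compare_systems(
    systems={'baseline': baseline_reports, 'improved': improved_reports},
    metrics={
        'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu'],
        'radgraph': lambda hyps, refs: rad_evaluator(refs=refs, hyps=hyps)['radgraph_simple'],
    },
    references=reference_reports,
    n_samples=10000,
)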

See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.

Documentation

| Page                       | Contents                                                 |
|----------------------------|----------------------------------------------------------|
| docs/metrics.md            | What each metric measures; do_details output schemas     |
| docs/configuration.md      | Full parameter reference; example presets                |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes  |
| docs/file_formats.md       | Loading data from .tok, .json, and Python lists          |

RadEval Expert Dataset

A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.
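
It can be loaded with the datasets library; the dataset id below is hypothetical, so check the HuggingFace page for the real one:

from datasets import load_dataset

# NOTE: hypothetical dataset id, shown for illustration only
dataset = load_dataset("radeval/expert-dataset")
print(dataset)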

Citation

@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin  and
      Zhang, Xi  and
      Abderezaei, Javid  and
      Bauml, Julie  and
      Boodoo, Roger  and
      Haghighi, Fatemeh  and
      Ganjizadeh, Ali  and
      Brattain, Eric  and
      Van Veen, Dave  and
      Meng, Zaiqiao  and
      Eyre, David W  and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}

Contributors

Jean-Benoit Delbrouck
Justin Xu
Xi Zhang

Acknowledgments

Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, and datasets like MIMIC-CXR.


If you find RadEval useful, please give us a star!
