
RadEval

All-in-one metrics for evaluating AI-generated radiology text


TL;DR

pip install RadEval
from RadEval import RadEval
import json

refs = [
    "Mild cardiomegaly with small bilateral pleural effusions and basilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]
hyps = [
    "Mildly enlarged cardiac silhouette with small pleural effusions and dependent bibasilar atelectasis.",
    "No pleural effusions or pneumothoraces.",
]

evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True
)

results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
# Output:
{
  "radgraph_simple": 0.72,
  "radgraph_partial": 0.61,
  "radgraph_complete": 0.61,
  "bleu": 0.36
}

Installation

pip install RadEval                # from PyPI
pip install 'RadEval[api]'         # include OpenAI/Gemini support for MammoGREEN

Or install from source:

git clone https://github.com/jbdel/RadEval.git && cd RadEval
conda create -n radeval python=3.11 -y && conda activate radeval
pip install -e '.[api]'
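
To verify the installation, a quick smoke test that enables only BLEU (a minimal sketch mirroring the TL;DR example; BLEU is purely lexical, so no large model weights should be needed):

from RadEval import RadEval

# BLEU-only evaluator: the cheapest way to check the package imports and runs
evaluator = RadEval(do_bleu=True)
print(evaluator(refs=["No acute findings."], hyps=["No acute findings."]))
# An identical reference/hypothesis pair should score bleu close to 1.0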

Supported Metrics

| Category    | Metric            | Flag                  | Best For                                      |
|-------------|-------------------|-----------------------|-----------------------------------------------|
| Lexical     | BLEU              | do_bleu               | Surface-level n-gram overlap                  |
| Lexical     | ROUGE             | do_rouge              | Content coverage                              |
| Semantic    | BERTScore         | do_bertscore          | Semantic similarity                           |
| Semantic    | RadEval BERTScore | do_radeval_bertscore  | Domain-adapted radiology semantics            |
| Clinical    | F1CheXbert        | do_chexbert           | CheXpert finding classification               |
| Clinical    | F1RadBERT-CT      | do_f1radbert_ct       | CT finding classification                     |
| Clinical    | F1RadGraph        | do_radgraph           | Clinical entity/relation accuracy             |
| Clinical    | RaTEScore         | do_ratescore          | Entity-level synonym-aware scoring            |
| Specialized | RadGraph-RadCliQ  | do_radgraph_radcliq   | Per-pair entity+relation F1 (RadCliQ variant) |
| Specialized | RadCliQ-v1        | do_radcliq            | Composite clinical relevance                  |
| Specialized | SRR-BERT          | do_srr_bert           | Structured report evaluation                  |
| Specialized | Temporal F1       | do_temporal           | Temporal consistency                          |
| Specialized | GREEN             | do_green              | LLM-based overall quality (7B model)          |
| Specialized | MammoGREEN        | do_mammo_green        | Mammography-specific LLM scoring              |
| Specialized | RadFact-CT        | do_radfact_ct         | LLM-based factual precision/recall for CT     |
| Specialized | CRIMSON           | do_crimson            | LLM-based clinical significance scoring       |

Enable only the metrics you need -- each one is loaded lazily.
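
For example, a clinically focused run might enable just the RadGraph and CheXbert metrics, using the flags from the table above (a sketch, not an official preset):

from RadEval import RadEval

# Only the requested metrics are loaded; all others stay off
evaluator = RadEval(
    do_radgraph=True,   # F1RadGraph: clinical entity/relation accuracy
    do_chexbert=True,   # F1CheXbert: CheXpert finding classification
)
results = evaluator(refs=refs, hyps=hyps)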

Detailed Output

Pass do_details=True to get per-sample scores, label breakdowns, and entity annotations for every enabled metric. See docs/metrics.md for the full output schema of each metric.
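
A minimal sketch, assuming do_details is passed to the constructor like the other do_* flags:

# Assumption: do_details is a constructor flag alongside the metric flags
evaluator = RadEval(do_radgraph=True, do_details=True)
results = evaluator(refs=refs, hyps=hyps)
# results now includes per-sample scores and entity annotations in addition
# to the corpus-level numbers; see docs/metrics.md for the exact schema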

Comparing Systems

Use compare_systems to run paired approximate randomization tests between any number of systems:

from RadEval import RadEval, compare_systems

# Toy data; in practice these are lists of generated and reference reports
reference_reports = ["No pleural effusions or pneumothoraces."]
baseline_reports = ["No effusion."]
improved_reports = ["No pleural effusions or pneumothoraces."]

evaluator = RadEval(do_bleu=True)
signatures, scores = compare_systems(
    systems={
        'baseline': baseline_reports,
        'improved': improved_reports,
    },
    metrics={'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu']},
    references=reference_reports,
    n_samples=10000,
)
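
Any callable that maps (hyps, refs) to a float can serve as a metric, so several metrics can be compared in one run. A sketch, reusing the radgraph_simple key shown in the TL;DR output:

rad_evaluator = RadEval(do_radgraph=True)

signatures, scores = compare_systems(
    systems={'baseline': baseline_reports, 'improved': improved_reports},
    metrics={
        'bleu': lambda hyps, refs: evaluator(refs=refs, hyps=hyps)['bleu'],
        'radgraph': lambda hyps, refs: rad_evaluator(refs=refs, hyps=hyps)['radgraph_simple'],
    },
    references=reference_reports,
    n_samples=10000,
)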

See docs/hypothesis_testing.md for a full walkthrough and interpretation guide.

Documentation

| Page                       | Contents                                                 |
|----------------------------|----------------------------------------------------------|
| docs/metrics.md            | What each metric measures; do_details output schemas     |
| docs/configuration.md      | Full parameter reference; example presets                |
| docs/hypothesis_testing.md | Statistical background, full example, performance notes  |
| docs/file_formats.md       | Loading data from .tok, .json, and Python lists          |

RadEval Expert Dataset

A curated evaluation set annotated by board-certified radiologists for validating automatic metrics. Available on HuggingFace.
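
It can be loaded with the datasets library; the dataset id below is hypothetical, so check the HuggingFace page for the real one:

from datasets import load_dataset

# NOTE: hypothetical dataset id, shown for illustration only
dataset = load_dataset("radeval/expert-dataset")
print(dataset)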

Citation

@inproceedings{xu-etal-2025-radeval,
    title = "{R}ad{E}val: A framework for radiology text evaluation",
    author = "Xu, Justin  and
      Zhang, Xi  and
      Abderezaei, Javid  and
      Bauml, Julie  and
      Boodoo, Roger  and
      Haghighi, Fatemeh  and
      Ganjizadeh, Ali  and
      Brattain, Eric  and
      Van Veen, Dave  and
      Meng, Zaiqiao  and
      Eyre, David W  and
      Delbrouck, Jean-Benoit",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.40/",
    doi = "10.18653/v1/2025.emnlp-demos.40",
    pages = "546--557",
}

Contributors

Jean-Benoit Delbrouck
Justin Xu
Xi Zhang

Acknowledgments

Built on the work of the radiology AI community: CheXbert, RadGraph, BERTScore, RaTEScore, SRR-BERT, GREEN, and datasets like MIMIC-CXR.


If you find RadEval useful, please give us a star!
