
A comprehensive evaluation toolkit for medical report generation


Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

A Python package for structured chest X-ray report evaluation using LunguageScore, supporting single- and sequential-report assessment.

📰 News

  • [2025/05] Lunguage benchmark paper posted on arXiv.
  • [2024/11] Lunguage++ extended version released: paper, code.
  • [2024/11] Lunguage dataset (v1.0.0) released on PhysioNet.
  • [2025/04] The Lunguage dataset (v1.1.0) and Lunguage++ dataset (v1.0.0) are under review and will be made available on PhysioNet.

📚 Introduction

LunguageScore evaluates chest X-ray report generation models using the Lunguage benchmark. The pipeline has two stages:

| Stage | What it does | Requires |
|---|---|---|
| Structuring | Converts raw reports into SRO (Subject-Relation-Object) structures via LLM | LLM API key or local vLLM server |
| Metric | Computes LunguageScore (Precision / Recall / Structure Score) | Two structured CSVs: predicted + ground truth |

Structuring supports Single SR (single-visit) and Sequential SR (multi-visit temporal grouping).

Metric computation requires structured reports for both the predicted and ground-truth sides. Run structuring first if you don't have them.

๐Ÿ“ Dataset

The Lunguage dataset (v1.0.0) is hosted on PhysioNet and is available to credentialed MIMIC-CXR users under the PhysioNet Data Use Agreement.

Access: https://physionet.org/content/lunguage/1.0.0/ (requires PhysioNet credentialed account with MIMIC-CXR access)

Place the dataset files in your workspace:

```
/your/workspace/
├── dataset/
│   ├── Lunguage.csv            # Gold structured annotations (required)
│   └── Lunguage_vocab.csv      # Entity vocabulary for retrieval (required)
├── benchmark/
│   └── my_model_results.csv    # Your model's predicted reports (see format below)
├── singleSR/                   # Created automatically during structuring
└── sequentialSR/               # Created automatically during sequential structuring
```
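Only the first two directories need to exist before you start; a minimal sketch with `pathlib` (the relative `workspace` path here is a placeholder; substitute your own root):

```python
from pathlib import Path

# Hypothetical workspace root; substitute your own path
work_dir = Path("workspace")

# Only dataset/ and benchmark/ need to exist up front;
# singleSR/ and sequentialSR/ are created by the pipeline itself.
for sub in ("dataset", "benchmark"):
    (work_dir / sub).mkdir(parents=True, exist_ok=True)
```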

Benchmark CSV format (for evaluating your own model):

| Column | Description |
|---|---|
| subject_id | Patient identifier (must match Lunguage.csv) |
| study_id | Study identifier (must match Lunguage.csv) |
| report | Full generated report text |
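As a sanity check before running the pipeline, a minimal sketch (assuming pandas; the IDs and report text below are placeholders, and real rows must match Lunguage.csv) that writes a benchmark CSV with exactly these columns:

```python
import pandas as pd

# Placeholder row; real subject_id/study_id must match Lunguage.csv
rows = [{
    "subject_id": 10000032,
    "study_id": "s50414267",
    "report": "FINDINGS: No focal consolidation. IMPRESSION: No acute process.",
}]
df = pd.DataFrame(rows, columns=["subject_id", "study_id", "report"])
df.to_csv("my_model_results.csv", index=False)  # place under benchmark/
```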

Installation

Package publication is in progress. For now, use source installation.

```shell
git clone https://github.com/supersupermoon/Lunguage.git
cd Lunguage/lunguage_score_package
pip install -e .
```

When PyPI publication is complete, pip install lunguage-score will be supported as the default installation path.

The Metric stage works immediately after install; no LLM is needed. The Structuring stage requires an LLM API key or a running vLLM server. On first use, the semantic model (FremyCompany/BioLORD-2023, ~500 MB) is downloaded automatically.


Quick Start: run_eval_pipeline.py

run_eval_pipeline.py is the recommended way to run the full pipeline from the command line.

Current recommendation (pre-release): run directly from this repository after source installation.

--mode: input report source

| --mode | Use when | --pred-path needed? |
|---|---|---|
| gold_eval | Evaluating against the Lunguage gold standard | No (uses Lunguage.csv directly) |
| chexagent | Your own model's outputs (custom CSV) | Yes |
| rexval / maira / maira_cascade / medversa / rgrg / medgemma / lingshu | Specific benchmarks | Yes |

--stage: pipeline stages

| Stage | Description |
|---|---|
| full-single | Single SR structuring → LunguageScore metric |
| full-sequential | Single SR → Sequential SR → LunguageScore metric (default) |
| structure-single | Single SR structuring only |
| structure-sequential | Single SR → Sequential SR (no metric) |
| metric-only | Metric only (structured CSVs must already exist) |

Supported LLM backends

| Backend | --model prefix | --api-key |
|---|---|---|
| Local vLLM | any | local_LLM (default); requires --port and a running server |
| OpenAI | gpt- | OpenAI key |
| Anthropic Claude | claude- | Anthropic key |
| Google MedGemma | medgemma- | Google AI Studio key, or local_LLM |
| Fireworks AI | deepseek-, llama4-, qwen3- | Fireworks key |

Cloud APIs (OpenAI, Claude) use batch processing, so no local GPU is required.

Examples

```shell
# Evaluate your model's reports using Claude (no GPU needed)
python run_eval_pipeline.py --stage full-single \
    --work-dir /your/workspace \
    --model claude-sonnet-4-6 --api-key sk-ant-... \
    --pred-path benchmark/my_model_results.csv \
    --mode chexagent

# Evaluate on Lunguage gold standard using local MedGemma
python run_eval_pipeline.py --stage full-single \
    --work-dir /your/workspace \
    --model medgemma-27b-text-it --api-key local_LLM \
    --port 8100 --gpu-ids 0,1 \
    --mode gold_eval

# Metric only: structured CSVs already exist
python run_eval_pipeline.py --stage metric-only \
    --work-dir /your/workspace \
    --pred-sr-path ./singleSR/eval/gold_eval/.../pred_SR_df.csv \
    --gt-sr-path   ./singleSR/eval/gold_eval/.../gold_SR_df.csv

# Full sequential pipeline
python run_eval_pipeline.py --stage full-sequential \
    --work-dir /your/workspace \
    --model medgemma-27b-text-it --api-key local_LLM \
    --port 8100 --gpu-ids 0,1 \
    --mode gold_eval --all-eval
```

If you are validating setup before full data preparation, start with --stage metric-only and explicit --pred-sr-path / --gt-sr-path.

Starting a local vLLM server

```shell
CUDA_VISIBLE_DEVICES=0,1 vllm serve google/medgemma-27b-text-it \
    --served-model-name medgemma-27b-text-it \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 16384 \
    --port 8100
```

For reasoning models (Qwen3, DeepSeek) use --max-model-len 32768.

Output

```
{work_dir}/
├── singleSR/eval/{mode}/{n}_{cand}_{model}/...
│   ├── pred_SR_df.csv      # Stage 1 output (structured predictions)
│   └── gold_SR_df.csv      # Matched gold annotations
├── sequentialSR/results/{model}/.../
│   └── final_processed_*.csv   # Stage 2 output (temporal grouping)
└── results/
    └── lunguage_score_results.json   # Final scores
```

Stage 1: Single SR – Report Structuring (Python API)

```python
from lunguage_score import LunguageScorer
from lunguage_score.config import Config, StructuringConfig, SingleSRConfig

single_cfg = SingleSRConfig(
    deployment_name="claude-sonnet-4-6",  # or "gpt-4.1", "medgemma-27b-text-it"
    api_key="sk-ant-...",                 # or "local_LLM" for vLLM
    port=8100,
    work_dir="/your/workspace",
    mode="chexagent",                     # "gold_eval" for Lunguage gold set
    candidate_type="vocab_ent_rcg",
    unit="section",
    n_retrieval=5,
    output_format="SROSRO",
    candidate_usage=1.0,
    run_model=True,
    output_dir="./singleSR/data",
)

config = Config(structuring=StructuringConfig(mode="single", single=single_cfg))
scorer = LunguageScorer(config)
output_dir = scorer.structure_only(reports="./benchmark/my_model_results.csv")
# Output: {work_dir}/singleSR/eval/{mode}/.../pred_SR_df.csv
```

Supported modes:

| mode | Input source |
|---|---|
| gold_eval | Lunguage.csv directly (no custom CSV needed) |
| chexagent | Your model's CSV (subject_id, study_id, report) |
| rexval / maira / maira_cascade / medversa / rgrg / cvt2distilgpt2 / libra | Benchmark-specific formats |

Stage 2: Sequential SR – Temporal Grouping (Python API)

```python
import pandas as pd
from lunguage_score import LunguageScorer
from lunguage_score.config import Config, StructuringConfig, SequentialSRConfig

pred_sr_path = "/your/workspace/singleSR/eval/gold_eval/.../pred_SR_df.csv"

seq_cfg = SequentialSRConfig(
    LLM_name="claude-sonnet-4-6",   # or "gpt-4.1", "medgemma-27b-text-it"
    api_key="sk-ant-...",           # or "local_LLM"
    port=8100,
    work_dir="/your/workspace",
    input_path=pred_sr_path,
    output_path="./sequentialSR/results",
    batch_path="./sequentialSR/batch_files",
    all_eval=True,
    few_shot=False,   # zero-shot recommended for models with a ≤ 8192-token limit
    model_run=True,
    process_missing=True,
)

config = Config(structuring=StructuringConfig(mode="sequential", sequential=seq_cfg))
scorer = LunguageScorer(config)
output_dir = scorer.structure_only(reports=pd.read_csv(pred_sr_path), report_column="entity")
# Output: {work_dir}/sequentialSR/results/{model}/.../final_processed_*.csv
```

Stage 3: LunguageScore Metric (Python API)

No LLM required.

```python
import pandas as pd
from lunguage_score import LunguageScorer
from lunguage_score.config import Config, MetricConfig

pred_df = pd.read_csv("/path/to/pred_SR_df.csv")
gt_df   = pd.read_csv("/path/to/gold_SR_df.csv")

# Lunguage.csv uses 'ent'; structured output uses 'entity'
if 'ent' in gt_df.columns and 'entity' not in gt_df.columns:
    gt_df = gt_df.rename(columns={'ent': 'entity'})

config = Config(metrics=MetricConfig(output_dir="./results", mode="gold_eval"))
scorer = LunguageScorer(config)
results = scorer.calculate_lunguage_score_only(pred_df, gt_df)

print(f"Structure Score: {results['avg_structure_score']:.4f}")
print(f"Precision:       {results['avg_precision']:.4f}")
print(f"Recall:          {results['avg_recall']:.4f}")
```
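For intuition about what precision and recall mean here, a toy set-overlap illustration (the entity names are hypothetical, and the package's actual matching operates on structured triples with semantic similarity rather than exact string sets):

```python
# Hypothetical predicted vs. gold entity sets for one study
pred = {"cardiomegaly", "pleural effusion", "pneumothorax"}
gold = {"cardiomegaly", "pleural effusion", "atelectasis"}

tp = len(pred & gold)            # entities found in both
precision = tp / len(pred)       # fraction of predictions that are correct
recall = tp / len(gold)          # fraction of gold entities recovered
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints precision=0.67 recall=0.67
```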

MetricConfig.mode must match the structuring mode used in Stage 1:

mode Use when
gold_eval Stage 1 mode="gold_eval"
chexagent / medversa / rgrg / rexval Matching Stage 1 mode
single_maira Stage 1 mode="maira" or "maira_cascade"
sequential_maira Stage 2 sequential output

Output JSON:

```json
{
  "lunguage_score": {
    "avg_structure_score": 0.943,
    "avg_precision": 0.943,
    "avg_recall": 0.943,
    "structure_scores": {"s10000032": 0.95, ...},
    "precision_scores": {"s10000032": 0.95, ...},
    "recall_scores":    {"s10000032": 0.95, ...}
  }
}
```
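A minimal sketch of consuming the results file, here with an inline sample mirroring the JSON shape above (the 0.5 threshold is an arbitrary choice for illustration):

```python
import json

# In practice: data = json.load(open("results/lunguage_score_results.json"))
data = json.loads("""
{
  "lunguage_score": {
    "avg_structure_score": 0.943,
    "structure_scores": {"s10000032": 0.95, "s10000099": 0.41}
  }
}
""")

scores = data["lunguage_score"]["structure_scores"]
low = {sid: s for sid, s in scores.items() if s < 0.5}
print(low)  # studies that may warrant manual review
```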

SOTA Metrics (GREEN, RadGraph, BLEU, BERTScore, FineRadScore)
```python
from lunguage_score.metric.run_all_metrics import run_metrics

run_metrics(
    input_path='./results/paired_reports.csv',
    output_path='./results/sota_scores.csv',
    metrics=['green', 'radgraph', 'bleu', 'bertscore'],
    output_dir='./results/metric_artifacts',
)
```

Input CSV format:

```
study_id,report_ref,report_cand
s10000032,"The chest X-ray shows...","The radiograph demonstrates..."
```
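If your references and candidates start in separate tables, a sketch (assuming pandas; the column layouts and report text are placeholders) that joins them on study_id into this paired format:

```python
import pandas as pd

# Hypothetical inputs: reference reports and model-generated reports
refs = pd.DataFrame({"study_id": ["s10000032"],
                     "report": ["The chest X-ray shows..."]})
cands = pd.DataFrame({"study_id": ["s10000032"],
                      "report": ["The radiograph demonstrates..."]})

# Rename each 'report' column, then inner-join on study_id
paired = refs.rename(columns={"report": "report_ref"}).merge(
    cands.rename(columns={"report": "report_cand"}), on="study_id")
paired.to_csv("paired_reports.csv", index=False)
```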

Optional feature: pip install "lunguage-score[sota]" for RaTEScore, BLEU, BERTScore. Some environments may still require additional dependency/version adjustments.

Metric Notes

| Metric | Notes |
|---|---|
| GREEN | Install separately: pip install git+https://github.com/Stanford-AIMI/GREEN.git |
| RadGraph F1 | Included by default |
| BLEU / BERTScore | Install with pip install "lunguage-score[sota]" |
| RaTEScore | Install with pip install "lunguage-score[sota]"; may be unavailable on some numpy versions |
| FineRadScore | Requires an OpenAI API key |

lunguage-score CLI

```shell
lunguage-score evaluate predicted_SR_df.csv gold_SR_df.csv
lunguage-score evaluate predicted_SR_df.csv gold_SR_df.csv --output-dir ./my_results
```

Citation

```bibtex
@article{moon2025lunguage,
  title={Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation},
  author={Moon, Jonghak and others},
  journal={arXiv preprint arXiv:2505.21190},
  year={2025}
}
```
