All-in-one metrics for evaluating AI-generated radiology text
- 🌟 Overview
- ⚙️ Installation
- 🚀 Quick Start
- 📊 Evaluation Metrics
- 🔧 Configuration Options
- 📁 File Format Suggestion
- 🧪 Hypothesis Testing (Significance Evaluation)
- 🧠 RadEval Expert Dataset
- 🚦 Performance Tips
- 📚 Citation
RadEval is a comprehensive evaluation framework specifically designed for assessing the quality of AI-generated radiology text. It provides a unified interface to multiple state-of-the-art evaluation metrics, enabling researchers and practitioners to thoroughly evaluate their radiology text generation models.
Tip
- Domain-Specific: Tailored for radiology text evaluation with medical knowledge integration
- Multi-Metric: Supports 11+ different evaluation metrics in one framework
- Easy to Use: Simple API with flexible configuration options
- Comprehensive: From traditional n-gram metrics to advanced LLM-based evaluations
- Research-Ready: Built for reproducible evaluation in radiology AI research
Note
- Multiple Evaluation Perspectives: Lexical, semantic, clinical, and temporal evaluations
- Statistical Testing: Built-in hypothesis testing for system comparison
- Batch Processing: Efficient evaluation of large datasets
- Flexible Configuration: Enable/disable specific metrics based on your needs
- Detailed Results: Comprehensive output with metric explanations
- File Format Support: Direct evaluation from common file formats (.tok, .txt, .json)
RadEval supports Python 3.10+ and can be installed via PyPI or from source.
pip install RadEval
Tip
We recommend using a virtual environment to avoid dependency conflicts, especially since some metrics require loading large inference models.
Install the most up-to-date version directly from GitHub:
pip install git+https://github.com/jbdel/RadEval.git
This is useful if you want the latest features or bug fixes before the next PyPI release.
# Clone the repository
git clone https://github.com/jbdel/RadEval.git
cd RadEval
# Create and activate a conda environment
conda create -n RadEval python=3.10 -y
conda activate RadEval
# Install in development (editable) mode
pip install -e .
This setup allows you to modify the source code and reflect changes immediately without reinstallation.
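To confirm the install worked, a minimal smoke test (not a required step) is to build a lightweight evaluator:
# Minimal smoke test: construct a lightweight evaluator to confirm the package loads
from RadEval import RadEval

evaluator = RadEval(do_bleu=True)
print("RadEval is ready:", evaluator is not None)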
Evaluate a few reports using selected metrics:
from RadEval import RadEval
import json
refs = [
    "No definite acute cardiopulmonary process.Enlarged cardiac silhouette could be accentuated by patient's positioning.",
    "Increased mild pulmonary edema and left basal atelectasis.",
]
hyps = [
    "Relatively lower lung volumes with no focal airspace consolidation appreciated.",
    "No pleural effusions or pneumothoraces.",
]
evaluator = RadEval(
    do_radgraph=True,
    do_bleu=True
)
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
Output
{
  "radgraph_simple": 0.5,
  "radgraph_partial": 0.5,
  "radgraph_complete": 0.5,
  "bleu": 0.5852363407461811
}
Set do_details=True to enable per-metric detailed outputs, including entity-level comparisons and score-specific breakdowns when supported.
from RadEval import RadEval
import json
evaluator = RadEval(
    do_srr_bert=True,
    do_rouge=True,
    do_details=True
)
refs = [
    "No definite acute cardiopulmonary process.Enlarged cardiac silhouette could be accentuated by patient's positioning.",
    "Increased mild pulmonary edema and left basal atelectasis.",
]
hyps = [
    "Relatively lower lung volumes with no focal airspace consolidation appreciated.",
    "No pleural effusions or pneumothoraces.",
]
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
Output
{
  "rouge": {
    "rouge1": {
      "mean_score": 0.04,
      "sample_scores": [
        0.08,
        0.0
      ]
    },
    "rouge2": {
      "mean_score": 0.0,
      "sample_scores": [
        0.0,
        0.0
      ]
    },
    "rougeL": {
      "mean_score": 0.04,
      "sample_scores": [
        0.08,
        0.0
      ]
    }
  },
  "srr_bert": {
    "srr_bert_weighted_f1": 0.16666666666666666,
    "srr_bert_weighted_precision": 0.125,
    "srr_bert_weighted_recall": 0.25,
    "label_scores": {
      "Edema (Present)": {
        "f1-score": 0.0,
        "precision": 0.0,
        "recall": 0.0,
        "support": 1.0
      },
      "Atelectasis (Present)": {
        "f1-score": 0.0,
        "precision": 0.0,
        "recall": 0.0,
        "support": 1.0
      },
      "Cardiomegaly (Uncertain)": {
        "f1-score": 0.0,
        "precision": 0.0,
        "recall": 0.0,
        "support": 1.0
      },
      "No Finding": {
        "f1-score": 0.6666666666666666,
        "precision": 0.5,
        "recall": 1.0,
        "support": 1.0
      }
    }
  }
}
Compare two systems statistically to validate improvements:
from RadEval import RadEval, compare_systems
# Define systems to compare
systems = {
    'baseline': [
        "No acute findings.",
        "Mild heart enlargement."
    ],
    'improved': [
        "No acute cardiopulmonary process.",
        "Mild cardiomegaly with clear lung fields."
    ]
}

# Reference ground truth
references = [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly with clear lung fields."
]

# Initialise evaluators only for selected metrics
bleu_evaluator = RadEval(do_bleu=True)
rouge_evaluator = RadEval(do_rouge=True)

# Wrap metrics into callable functions
metrics = {
    'bleu': lambda hyps, refs: bleu_evaluator(refs, hyps)['bleu'],
    'rouge1': lambda hyps, refs: rouge_evaluator(refs, hyps)['rouge1'],
}

# Run statistical test
signatures, scores = compare_systems(
    systems=systems,
    metrics=metrics,
    references=references,
    n_samples=50,        # Number of randomization samples
    print_results=True   # Print significance table
)
Output
================================================================================
PAIRED SIGNIFICANCE TEST RESULTS
================================================================================
System                      bleu            rouge1
----------------------------------------------------------------------
Baseline: baseline          0.0000          0.3968
----------------------------------------------------------------------
improved                    1.0000          1.0000
                            (p=0.4800)      (p=0.4600)
----------------------------------------------------------------------
- Significance level: 0.05
- '*' indicates significant difference (p < significance level)
- Null hypothesis: systems are essentially the same
- Significant results suggest systems are meaningfully different
METRIC SIGNATURES:
- bleu: bleu|ar:50|seed:12345
- rouge1: rouge1|ar:50|seed:12345
Recommended for batch evaluation of large sets of generated reports.
import json
from RadEval import RadEval
def evaluate_from_files():
    def read_reports(filepath):
        with open(filepath, 'r') as f:
            return [line.strip() for line in f if line.strip()]

    refs = read_reports('ground_truth.tok')
    hyps = read_reports('model_predictions.tok')

    evaluator = RadEval(
        do_radgraph=True,
        do_bleu=True,
        do_bertscore=True,
        do_chexbert=True
    )

    results = evaluator(refs=refs, hyps=hyps)

    with open('evaluation_results.json', 'w') as f:
        json.dump(results, f, indent=2)

    return results
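A possible entry point for the function above, assuming the same illustrative file names as in the sketch:
if __name__ == "__main__":
    results = evaluate_from_files()
    print(json.dumps(results, indent=2))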
RadEval currently supports the following evaluation metrics:
| Category | Metric | Description | Best For |
|---|---|---|---|
| Lexical | BLEU | N-gram overlap measurement | Surface-level similarity |
| | ROUGE | Recall-oriented evaluation | Content coverage |
| Semantic | BERTScore | BERT-based semantic similarity | Semantic meaning preservation |
| | RadEval BERTScore | Domain-adapted ModernBertModel evaluation | Medical text semantics |
| Clinical | CheXbert | Clinical finding classification | Medical accuracy |
| | RadGraph | Knowledge graph-based evaluation | Clinical relationship accuracy |
| | RaTEScore | Entity-level assessments | Medical synonyms |
| Specialized | RadCLIQ | Composite multiple metrics | Clinical relevance |
| | SRR-BERT | Structured report evaluation | Report structure quality |
| | Temporal F1 | Time-sensitive evaluation | Temporal consistency |
| | GREEN | LLM-based metric | Overall radiology report quality |
| Parameter | Type | Default | Description |
|---|---|---|---|
| do_radgraph | bool | False | Enable RadGraph evaluation |
| do_green | bool | False | Enable GREEN metric |
| do_bleu | bool | False | Enable BLEU evaluation |
| do_rouge | bool | False | Enable ROUGE metrics |
| do_bertscore | bool | False | Enable BERTScore |
| do_srr_bert | bool | False | Enable SRR-BERT |
| do_chexbert | bool | False | Enable CheXbert classification |
| do_temporal | bool | False | Enable temporal evaluation |
| do_ratescore | bool | False | Enable RaTEScore |
| do_radcliq | bool | False | Enable RadCLIQ |
| do_radeval_bertsore | bool | False | Enable RadEval BERTScore |
| do_details | bool | False | Include detailed metrics |
# Lightweight evaluation (fast)
light_evaluator = RadEval(
    do_bleu=True,
    do_rouge=True
)

# Medical focus (clinical accuracy)
medical_evaluator = RadEval(
    do_radgraph=True,
    do_chexbert=True,
    do_green=True
)

# Comprehensive evaluation (all metrics)
full_evaluator = RadEval(
    do_radgraph=True,
    do_green=True,
    do_bleu=True,
    do_rouge=True,
    do_bertscore=True,
    do_srr_bert=True,
    do_chexbert=True,
    do_temporal=True,
    do_ratescore=True,
    do_radcliq=True,
    do_radeval_bertsore=True,
    do_details=False  # Optional: return detailed metric breakdowns
)
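Each preset is then called the same way; for example, reusing the refs, hyps, and json import from the Quick Start above:
results = light_evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))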
To ensure efficient evaluation, we recommend formatting your data in one of the following ways:
Plain text (.tok or .txt), with one report per line:
No acute cardiopulmonary process.
Mild cardiomegaly noted.
Normal chest radiograph.
Use two separate files:
- ground_truth.tok — reference reports
- model_predictions.tok — generated reports
Alternatively, bundle both sides in a single JSON file:
{
  "references": [
    "No acute cardiopulmonary process.",
    "Mild cardiomegaly noted."
  ],
  "hypotheses": [
    "Normal chest X-ray.",
    "Enlarged heart observed."
  ]
}
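If you use the JSON layout above, a minimal loading sketch looks like this (the file name reports.json is illustrative; the keys match the layout shown):
import json
from RadEval import RadEval

# Load references and hypotheses from the JSON layout above
with open('reports.json', 'r') as f:  # illustrative file name
    data = json.load(f)

evaluator = RadEval(do_bleu=True, do_rouge=True)
results = evaluator(refs=data['references'], hyps=data['hypotheses'])
print(json.dumps(results, indent=2))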
refs = ["Report 1", "Report 2"]
hyps = ["Generated 1", "Generated 2"]
Tip
File-based input is recommended for batch evaluation and reproducibility in research workflows.
RadEval supports paired significance testing to statistically compare different radiology report generation systems using Approximate Randomization (AR).
This allows you to determine whether an observed improvement in metric scores is statistically significant, rather than due to chance.
- Paired comparison of any number of systems against a baseline
- Statistical rigor using Approximate Randomization (AR) testing
- All built-in metrics supported (BLEU, ROUGE, BERTScore, RadGraph, CheXbert, etc.)
- Custom metrics integration for domain-specific evaluation
- P-values and significance markers (*) for easy interpretation
The hypothesis testing uses Approximate Randomization to determine if observed metric differences are statistically significant:
- Null Hypothesis (H₀): The two systems perform equally well
- Test Statistic: Difference in metric scores between systems
- Randomization: Shuffle system assignments and recalculate differences
- P-value: Proportion of random shuffles with differences ≥ observed
- Significance: If p < 0.05, reject H₀ (systems are significantly different)
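To make these steps concrete, here is a minimal, illustrative sketch of a paired AR test for a single metric. This is not the compare_systems implementation, just the idea behind it; the metric callable follows the (hyps, refs) convention used later in this README.
import random

def ar_test(metric, hyps_a, hyps_b, refs, n_samples=50, seed=12345):
    # Paired Approximate Randomization test for one metric (illustrative sketch)
    rng = random.Random(seed)
    observed = abs(metric(hyps_a, refs) - metric(hyps_b, refs))
    exceed = 0
    for _ in range(n_samples):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(hyps_a, hyps_b):
            # Randomly swap which system each report is attributed to
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(metric(shuffled_a, refs) - metric(shuffled_b, refs))
        if diff >= observed:
            exceed += 1
    return exceed / n_samples  # empirical p-value
In practice, compare_systems runs this kind of test for every metric and system pair and reports the resulting p-values.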
Note
Why AR testing? Unlike parametric tests, AR makes no assumptions about score distributions, making it ideal for evaluation metrics that may not follow normal distributions.
Interpreting P-values:
- p < 0.05: Statistically significant difference (marked with *)
- p ≥ 0.05: No significant evidence of difference
- Lower p-values: Stronger evidence of real differences
Practical Significance:
- Look for consistent improvements across multiple metrics
- Consider domain relevance (e.g., RadGraph for clinical accuracy)
- Balance statistical and clinical significance
from RadEval import RadEval, compare_systems
# Reference ground truth reports
references = [
    "No acute cardiopulmonary process.",
    "No radiographic findings to suggest pneumonia.",
    "Mild cardiomegaly with clear lung fields.",
    "Small pleural effusion on the right side.",
    "Status post cardiac surgery with stable appearance.",
]

# Three systems: baseline, improved, and poor
systems = {
    'baseline': [
        "No acute findings.",
        "No pneumonia.",
        "Mild cardiomegaly, clear lungs.",
        "Small right pleural effusion.",
        "Post-cardiac surgery, stable."
    ],
    'improved': [
        "No acute cardiopulmonary process.",
        "No radiographic findings suggesting pneumonia.",
        "Mild cardiomegaly with clear lung fields bilaterally.",
        "Small pleural effusion present on the right side.",
        "Status post cardiac surgery with stable appearance."
    ],
    'poor': [
        "Normal.",
        "OK.",
        "Heart big.",
        "Some fluid.",
        "Surgery done."
    ]
}
We define each evaluation metric using a dedicated RadEval instance (configured to compute one specific score), and also include a simple custom metric — average word count. All metrics are wrapped into a unified metrics dictionary for flexible evaluation and comparison.
# Initialise each evaluator with the corresponding metric
bleu_evaluator = RadEval(do_bleu=True)
rouge_evaluator = RadEval(do_rouge=True)
bertscore_evaluator = RadEval(do_bertscore=True)
radgraph_evaluator = RadEval(do_radgraph=True)
chexbert_evaluator = RadEval(do_chexbert=True)
# Define a custom metric: average word count of generated reports
def word_count_metric(hyps, refs):
    return sum(len(report.split()) for report in hyps) / len(hyps)

# Wrap metrics into a unified dictionary of callables
metrics = {
    'bleu': lambda hyps, refs: bleu_evaluator(refs, hyps)['bleu'],
    'rouge1': lambda hyps, refs: rouge_evaluator(refs, hyps)['rouge1'],
    'rouge2': lambda hyps, refs: rouge_evaluator(refs, hyps)['rouge2'],
    'rougeL': lambda hyps, refs: rouge_evaluator(refs, hyps)['rougeL'],
    'bertscore': lambda hyps, refs: bertscore_evaluator(refs, hyps)['bertscore'],
    'radgraph': lambda hyps, refs: radgraph_evaluator(refs, hyps)['radgraph_partial'],
    'chexbert': lambda hyps, refs: chexbert_evaluator(refs, hyps)['chexbert-5_macro avg_f1-score'],
    'word_count': word_count_metric  # example of a simple custom-defined metric
}
Tip
- Each metric function takes (hyps, refs) as input and returns a single float score.
- This modular design allows you to flexibly plug in or remove metrics without changing the core logic of RadEval or compare_systems.
- For advanced use, you may define your own RadEval(do_xxx=True) variant or custom metric and include it seamlessly here, as shown in the sketch below.
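For instance, a hypothetical exact-match metric can be added alongside the built-in ones (the function name and dictionary key are illustrative):
def exact_match_metric(hyps, refs):
    # Fraction of generated reports that match their reference verbatim
    return sum(h.strip() == r.strip() for h, r in zip(hyps, refs)) / len(hyps)

metrics['exact_match'] = exact_match_metric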
Use compare_systems to evaluate all defined systems against the reference reports using the metrics specified above. This step performs randomization-based significance testing to assess whether differences between systems are statistically meaningful.
print("Running significance tests...")
signatures, scores = compare_systems(
    systems=systems,
    metrics=metrics,
    references=references,
    n_samples=50,              # Number of randomization samples
    significance_level=0.05,   # Alpha level for significance testing
    print_results=True         # Print formatted results table
)
Output
Running significance tests...
================================================================================
PAIRED SIGNIFICANCE TEST RESULTS
================================================================================
System                 bleu         rouge1       rouge2       rougeL       bertscore    radgraph     chexbert     word_count
-----------------------------------------------------------------------------------------------------------------------------
Baseline: baseline     0.0000       0.6652       0.3133       0.6288       0.6881       0.5538       1.0000       3.2000
-----------------------------------------------------------------------------------------------------------------------------
improved               0.6874       0.9531       0.8690       0.9531       0.9642       0.9818       1.0000       6.2000
                       (p=0.0000)*  (p=0.0800)   (p=0.1200)   (p=0.0600)   (p=0.0400)*  (p=0.1200)   (p=1.0000)   (p=0.0600)
-----------------------------------------------------------------------------------------------------------------------------
poor                   0.0000       0.0444       0.0000       0.0444       0.1276       0.0000       0.8000       1.6000
                       (p=0.4000)   (p=0.0400)*  (p=0.0600)   (p=0.1200)   (p=0.0400)*  (p=0.0200)*  (p=1.0000)   (p=0.0400)*
-----------------------------------------------------------------------------------------------------------------------------
- Significance level: 0.05
- '*' indicates significant difference (p < significance level)
- Null hypothesis: systems are essentially the same
- Significant results suggest systems are meaningfully different
METRIC SIGNATURES:
- bleu: bleu|ar:50|seed:12345
- rouge1: rouge1|ar:50|seed:12345
- rouge2: rouge2|ar:50|seed:12345
- rougeL: rougeL|ar:50|seed:12345
- bertscore: bertscore|ar:50|seed:12345
- radgraph: radgraph|ar:50|seed:12345
- chexbert: chexbert|ar:50|seed:12345
- word_count: word_count|ar:50|seed:12345
Tip
- The output includes mean scores for each metric and system, along with p-values comparing each system to the baseline.
- Statistically significant improvements (or declines) are marked with an asterisk (*) when p < 0.05.
- signatures stores each metric configuration (e.g. random seed, sample size), and scores contains raw score values per system for further analysis or plotting.
# Significance testing
print("\nSignificant differences (p < 0.05):")
baseline_name = list(systems.keys())[0]  # Assume the first system is the baseline

for system_name in systems.keys():
    if system_name == baseline_name:
        continue

    significant_metrics = []
    for metric_name in metrics.keys():
        pvalue_key = f"{metric_name}_pvalue"
        if pvalue_key in scores[system_name]:
            p_val = scores[system_name][pvalue_key]
            if p_val < 0.05:
                significant_metrics.append(metric_name)

    if significant_metrics:
        print(f"  {system_name} vs {baseline_name}: {', '.join(significant_metrics)}")
    else:
        print(f"  {system_name} vs {baseline_name}: No significant differences")
Output
Significant differences (p < 0.05):
improved vs baseline: bleu, bertscore
poor vs baseline: rouge1, bertscore, radgraph, word_count
Tip
This makes it easy to:
- Verify whether model improvements are meaningful
- Test new metrics or design your own
- Report statistically sound results in your paper
To support reliable benchmarking, we introduce the RadEval Expert Dataset, a carefully curated evaluation set annotated by board-certified radiologists. This dataset consists of realistic radiology reports and challenging model generations, enabling nuanced evaluation across clinical accuracy, temporal consistency, and language quality. It serves as a gold standard to validate automatic metrics and model performance under expert review.
- Start Small: Test with a few examples before full evaluation
- Select Metrics: Only enable metrics you actually need
- Batch Processing: Process large datasets in smaller chunks (see the sketch below)
- GPU Usage: Ensure CUDA is available for faster computation
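As a sketch of the batch-processing tip above (the chunk size is arbitrary, and corpus-level metrics such as BLEU can differ slightly when computed per chunk rather than over the full set):
from RadEval import RadEval

def evaluate_in_chunks(refs, hyps, chunk_size=500):
    # Evaluate large report sets chunk by chunk to limit memory pressure
    evaluator = RadEval(do_bleu=True, do_rouge=True)
    results_per_chunk = []
    for start in range(0, len(refs), chunk_size):
        results_per_chunk.append(
            evaluator(refs=refs[start:start + chunk_size],
                      hyps=hyps[start:start + chunk_size])
        )
    return results_per_chunk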
If you use RadEval in your research, please cite:
@misc{xu2025radevalframeworkradiologytext,
      title={RadEval: A framework for radiology text evaluation},
      author={Justin Xu and Xi Zhang and Javid Abderezaei and Julie Bauml and Roger Boodoo and Fatemeh Haghighi and Ali Ganjizadeh and Eric Brattain and Dave Van Veen and Zaiqiao Meng and David Eyre and Jean-Benoit Delbrouck},
      year={2025},
      eprint={2509.18030},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.18030},
}
Jean-Benoit Delbrouck | Justin Xu | Xi Zhang
This project would not be possible without the foundational work of the radiology AI community.
We extend our gratitude to the authors and maintainers of the following open-source projects and metrics:
- 🧠 CheXbert, RadGraph, and CheXpert from Stanford AIMI for their powerful labelers and benchmarks.
- 📐 BERTScore and BLEU/ROUGE for general-purpose NLP evaluation.
- 🏥 RadCliQ and RaTE Score for clinically grounded evaluation of radiology reports.
- 🧪 SRR-BERT for structured report understanding in radiology.
- 🔍 Researchers contributing to temporal and factual consistency metrics in medical imaging.
Special thanks to:
- All contributors to open datasets such as MIMIC-CXR, which make reproducible research possible.
- Our collaborators for their support and inspiration throughout development.
We aim to build on these contributions and promote accessible, fair, and robust evaluation of AI-generated radiology text.
⭐ If you find RadEval useful, please give us a star! ⭐
Made with ❤️ for the radiology AI research community