Official PyTorch implementation of the following paper:

**Towards Scalable Language-Image Pre-training for 3D Medical Imaging**
*University of Michigan*
Directly leveraging uncurated clinical studies enables scalable language-image pre-training in 3D medical imaging, as the scale is no longer constrained by the manual effort required from clinicians to select a single representative scan or slice from each study. This paradigm could be more effective when equipped with a hierarchical attention mechanism inspired by the natural structure of the data: slice, scan, and study. We name this framework Hierarchical attention for Language-Image Pre-training (HLIP). For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
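For intuition, below is a minimal sketch of the two-level "scan + study" variant of this idea as a toy PyTorch block. It is not the official HLIP implementation; the module names, shapes, and pre-norm layout are illustrative assumptions only.

```python
# Toy sketch (NOT the official HLIP code) of hierarchical attention over a study
# organized as [scans, tokens-per-scan]: attention is first restricted to tokens
# within each scan, then applied across all tokens of the study.
import torch
import torch.nn as nn

class HierarchicalBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.scan_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.study_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_scans, tokens_per_scan, dim)
        b, s, t, d = x.shape
        # scan-level attention: tokens attend only within their own scan
        h = x.reshape(b * s, t, d)
        n = self.norm1(h)
        h = h + self.scan_attn(n, n, n)[0]
        # study-level attention: all tokens of the study attend to each other
        h = h.reshape(b, s * t, d)
        n = self.norm2(h)
        h = h + self.study_attn(n, n, n)[0]
        return h.reshape(b, s, t, d)

# example: a 2-scan study with 1176 visual tokens per scan
study = torch.randn(1, 2, 1176, 768)
out = HierarchicalBlock()(study)  # (1, 2, 1176, 768)
```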
- (2025-06) Completed the initial setup of the HLIP repository.
- (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.
```bash
python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training
```

| Data | Objective | Patch Size | Attention | Model | Performance |
|---|---|---|---|---|---|
| CT-RATE (20K) | SigLIP | 8, 24, 24 | slice + scan | ViT Base | -/- |
| CT-RATE (20K) | CLIP | 8, 24, 24 | slice + scan | ViT Base | -/- |
| BrainMRI (220K) | CLIP | 16, 16, 16 | scan + study | ViT Base | -/- |
| BrainMRI (220K) | CLIP | 8, 16, 16 | scan + study | ViT Base | -/- |
| BrainMRI (220K) | CLIP | 8, 16, 16 | slice + scan + study | ViT Base | -/- |
| HeadCT (240K) | CLIP | 8, 16, 16 | scan + study | ViT Base | -/- |
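As a sanity check on the naming, the `tokenN` suffix in the model names appears to correspond to the number of visual tokens per scan. For the brain MRI setting above (patch size 8, 16, 16, with the 48, 224, 224 input size mentioned in the evaluation notes), this works out as follows.

```python
# Quick arithmetic check (assumption: "token1176" counts visual tokens per scan).
# A 48 x 224 x 224 volume with a patch size of 8 x 16 x 16 yields
# (48 / 8) * (224 / 16) * (224 / 16) = 6 * 14 * 14 = 1176 tokens,
# matching the suffix in clip_vit_base_multiscan_h2_token1176.
depth, height, width = 48, 224, 224
pd, ph, pw = 8, 16, 16
num_tokens = (depth // pd) * (height // ph) * (width // pw)
print(num_tokens)  # 1176
```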
Chest CT: an example from the external Rad-ChestCT dataset.
```bash
python inference_rad_chestct.py \
--model clip_vit_base_singlescan_h2_token1176 \
--use-cxr-bert \
--resume /path/to/chestct_clip_vit_base_singlescan_h2_token1176.pt \
    --data ../../docs/tst32751/tst32751.pt
```

Brain MRI: an example from the external BraTS23 dataset.

```bash
python inference_pub_brain_5.py \
--model clip_vit_base_multiscan_h2_token1176 \
--resume /path/to/brainmri_clip_vit_base_multiscan_h2_token1176.pt \
--patch-size 8 16 16 \
--num-slices 72 \
    --data ../../docs/BraTS-GLI-00459-000
```

Visualize the activation by adding `--interpret`.
CT-RATE
```bash
python zeroshot_ct_rate.py \
--model clip_vit_base_singlescan_h2_token2744 \
--use-cxr-bert \
--resume /path/to/chestct_clip_vit_base_singlescan_h2_token2744.pt \
--data-root /data/ct_rate/ \
    --zeroshot-template volume
```

Rad-ChestCT

```bash
python zeroshot_rad_chestct.py \
--model clip_vit_base_singlescan_h2_token2744 \
--use-cxr-bert \
--resume /path/to/chestct_clip_vit_base_singlescan_h2_token2744.pt \
--data-root /data/rad_chestct/ \
    --zeroshot-template volume
```

Brain MRI

```bash
python pub_brain_5_embed.py \
--model clip_vit_base_multiscan_h2_token1176 \
--resume /path/to/brainmri_clip_vit_base_multiscan_h2_token1176.pt \
    --data-root /path/to/pub_brain_5 \
--num-slices 144 \
    --embed-root /path/to/pub_brain_5_embed
```

```bash
python zeroshot_pub_brain_5.py \
--model clip_vit_base_multiscan_h2_token1176 \
--resume /path/to/brainmri_clip_vit_base_multiscan_h2_token1176.pt \
--embed-root /path/to/pub_brain_5_embed \
--num-slices 144 \
--zeroshot_prompt prompt \
    --zeroshot_template template
```

As the Pub-Brain-5 dataset contains ~18K studies, evaluation may take ~30 minutes. We first extract the embedding for each study, then perform zero-shot classification. This two-step procedure supports researchers interested in prompt engineering.
Note that `--num-slices` is set to 144 during evaluation, even though we use a fixed input size of 48, 224, 224 during training. We found that HLIP transfers directly to, and benefits from, higher-resolution inputs at test time.
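Because the study embeddings are cached in the first step, trying new prompts only requires re-running the lightweight second step. The sketch below shows the general CLIP-style zero-shot scoring that step performs; the shapes, stand-in tensors, and helper names are illustrative assumptions rather than the repository's actual interface.

```python
# Minimal sketch of zero-shot classification from cached study embeddings.
# The random tensors are stand-ins; the real scripts (pub_brain_5_embed.py /
# zeroshot_pub_brain_5.py) define their own prompts, templates, and file I/O.
import torch
import torch.nn.functional as F

def zeroshot_classify(study_embeds: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
    # study_embeds: (num_studies, dim); prompt_embeds: (num_classes, dim)
    study_embeds = F.normalize(study_embeds, dim=-1)
    prompt_embeds = F.normalize(prompt_embeds, dim=-1)
    logits = study_embeds @ prompt_embeds.t()  # cosine similarity per class
    return logits.argmax(dim=-1)               # predicted class index per study

# ~18K studies and 5 candidate classes, standing in for the cached embeddings
preds = zeroshot_classify(torch.randn(18_000, 512), torch.randn(5, 512))
```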
Our training implementation is closely aligned with open_clip, which allows us to leverage features such as patch dropout and SigLIP. Below is a training command for chest CT. Training on CT-RATE for 20 epochs takes ~6 hours on a node with 4 A40 GPUs.
```bash
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
--logs_dir /path/to/logs/ \
--json-root ../../data/ct_rate/files/ --data-root /path/to/data/ct_rate/ \
--train-data raw_annotation --input-info -1150 350 crop \
--zeroshot-ct-rate ../../data/ct_rate/metafiles/valid_labels.csv --zeroshot-template volume \
--zeroshot-frequency 1 \
--save-frequency 1 \
--report-to wandb \
--wandb-project-name chest_ct \
--warmup 377 \
--batch-size 16 \
--accum-batch 1 \
--lr=1e-5 \
--wd=0.2 \
--epochs=20 \
--precision amp \
--workers 4 \
--grad-checkpointing \
--model clip_vit_base_singlescan_h2_token2744 \
--use-cxr-bert \
    --lock-text
```

Add the following flags to enable patch dropout:

```bash
--force-patch-dropout 0.5 \
    --beta2 0.95
```

Add the following flags to enable SigLIP:

```bash
--model siglip_vit_base_singlescan_h2_token2744 \
--beta2 0.95 \
    --siglip
```

If you find this repository helpful, please consider citing:

```bibtex
@article{zhao2025towards,
title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
author={Zhao, Chenhui and Lyu, Yiwei and Chowdury, Asadur and Harake, Edward and Kondepudi, Akhil and Rao, Akshay and Hou, Xinhai and Lee, Honglak and Hollon, Todd},
journal={arXiv preprint arXiv:2505.21862},
year={2025}
}
```