HLIP

Official PyTorch implementation of the following paper:
Towards Scalable Language-Image Pre-training for 3D Medical Imaging
University of Michigan
arXiv: 2505.21862

Overview

[Figure: HLIP overview]

Directly leveraging uncurated clinical studies enables scalable language-image pre-training in 3D medical imaging, as the scale is no longer constrained by the manual effort required from clinicians to select a single representative scan or slice from each study. This paradigm could be more effective when equipped with a hierarchical attention mechanism inspired by the natural structure of the data: slice, scan, and study. We name this framework Hierarchical attention for Language-Image Pre-training (HLIP). For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
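For intuition, here is a minimal PyTorch sketch of the hierarchical idea: tokens first attend within a slice, then within a scan, then across all scans in a study. The tensor shapes, module names, and one-attention-per-level structure are assumptions for illustration only, not the implementation in this repository.

# Simplified sketch of hierarchical attention over slice -> scan -> study.
# Shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.slice_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scan_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.study_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, scans, slices, tokens_per_slice, dim)
        b, n_scan, n_slice, n_tok, d = x.shape
        # 1) slice level: tokens attend within their own slice
        t = x.reshape(b * n_scan * n_slice, n_tok, d)
        t = self.slice_attn(t, t, t)[0]
        # 2) scan level: all tokens of one scan attend to each other
        t = t.reshape(b * n_scan, n_slice * n_tok, d)
        t = self.scan_attn(t, t, t)[0]
        # 3) study level: tokens from all scans in a study attend to each other
        t = t.reshape(b, n_scan * n_slice * n_tok, d)
        t = self.study_attn(t, t, t)[0]
        return t.reshape(b, n_scan, n_slice, n_tok, d)

if __name__ == "__main__":
    tokens = torch.randn(1, 4, 8, 49, 768)  # 4 scans, 8 slices, 49 tokens per slice
    print(HierarchicalAttentionSketch()(tokens).shape)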

Updates

  • (2025-06) Completed the initial setup of the HLIP repository.
  • (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.

Getting Started

Install

open-clip

python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training
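
Optionally, a quick sanity check (not part of the repository) confirms that PyTorch sees the GPU and that open_clip is importable:

# Optional post-install sanity check.
import torch
import open_clip

print("torch:", torch.__version__)                        # expect 2.5.1
print("cuda available:", torch.cuda.is_available())
print("open_clip architectures:", len(open_clip.list_models()))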

Model Card

| Data | Objective | Patch Size | Attention | Model | Performance |
|---|---|---|---|---|---|
| CT-RATE (20K) | SigLIP | 8, 24, 24 | slice + scan | ViT Base | -/- |
| CT-RATE (20K) | CLIP | 8, 24, 24 | slice + scan | ViT Base | -/- |
| BrainMRI (220K) | CLIP | 16, 16, 16 | scan + study | ViT Base | -/- |
| BrainMRI (220K) | CLIP | 8, 16, 16 | scan + study | ViT Base | -/- |
| BrainMRI (220K) | CLIP | 8, 16, 16 | slice + scan + study | ViT Base | -/- |
| HeadCT (240K) | CLIP | 8, 16, 16 | scan + study | ViT Base | -/- |

Demo

Chest CT: an example from the external Rad-ChestCT dataset.

python inference_rad_chestct.py \
  --model clip_vit_base_singlescan_h2_token1176 \
  --use-cxr-bert \
  --resume /path/to/chestct_clip_vit_base_singlescan_h2_token1176.pt \
  --data ../../docs/tst32751/tst32751.pt

Brain MRI: an example from the external BraTS23 dataset.

python inference_pub_brain_5.py \
  --model clip_vit_base_multiscan_h2_token1176 \
  --resume /path/to/brainmri_clip_vit_base_multiscan_h2_token1176.pt \
  --patch-size 8 16 16 \
  --num-slices 72 \
  --data ../../docs/BraTS-GLI-00459-000

Visualize the activations by adding --interpret.

Evaluation

CT-RATE

python zeroshot_ct_rate.py \
  --model clip_vit_base_singlescan_h2_token2744 \
  --use-cxr-bert \
  --resume /path/to/chestct_clip_vit_base_singlescan_h2_token2744.pt \
  --data-root /data/ct_rate/ \
  --zeroshot-template volume

Rad-ChestCT

python zeroshot_rad_chestct.py \
  --model clip_vit_base_singlescan_h2_token2744 \
  --use-cxr-bert \
  --resume /path/to/chestct_clip_vit_base_singlescan_h2_token2744.pt \
  --data-root /data/rad_chestct/ \
  --zeroshot-template volume

Brain MRI

python pub_brain_5_embed.py \
  --model clip_vit_base_multiscan_h2_token1176 \
  --resume /path/to/brainmri_clip_vit_base_multiscan_h2_token1176.pt \
  --data-root /path/to/pub_brain_5 \
  --num-slices 144 \
  --embed-root /path/to/pub_brain_5_embed
python zeroshot_pub_brain_5.py \
  --model clip_vit_base_multiscan_h2_token1176 \
  --resume /path/to/brainmri_clip_vit_base_multiscan_h2_token1176.pt \
  --embed-root /path/to/pub_brain_5_embed \
  --num-slices 144 \
  --zeroshot_prompt prompt \
  --zeroshot_template template

Since the Pub-Brain-5 dataset contains ~18K studies, evaluation may take ~30 minutes. We first extract an embedding for each study and then run zero-shot classification on the cached embeddings. This two-step procedure is convenient for researchers interested in prompt engineering.
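
Because the study embeddings are cached, prompt engineering only requires re-running the text side. As a rough illustration of the zero-shot step (random tensors stand in for the cached study embeddings and encoded class prompts; this is not the interface of zeroshot_pub_brain_5.py):

# CLIP-style zero-shot classification from precomputed study embeddings.
# Placeholders only: in practice image_embeds are the cached study embeddings
# and text_embeds are the encoded class prompts/templates.
import torch
import torch.nn.functional as F

num_studies, num_classes, dim = 18000, 5, 512
image_embeds = torch.randn(num_studies, dim)   # stand-in for cached study embeddings
text_embeds = torch.randn(num_classes, dim)    # stand-in for encoded class prompts

image_embeds = F.normalize(image_embeds, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)
logits = image_embeds @ text_embeds.t()        # cosine similarity, (num_studies, num_classes)
preds = logits.argmax(dim=-1)                  # predicted class per study
print(preds.shape, preds[:10])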

--num-slices is set to 144 during evaluation, even though we use a fixed input size of 48, 224, 224 during training. We found that HLIP transfers directly to higher-resolution inputs at test time and benefits from them.
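
One simple way to feed a different number of slices is to resample each volume along the slice axis, e.g. with trilinear interpolation. The snippet below is a hypothetical preprocessing sketch, not necessarily the repository's exact pipeline.

# Resample a CT/MRI volume to a target slice count with trilinear interpolation.
import torch
import torch.nn.functional as F

volume = torch.randn(1, 1, 48, 224, 224)  # (batch, channel, slices, H, W)
resampled = F.interpolate(volume, size=(144, 224, 224),
                          mode="trilinear", align_corners=False)
print(resampled.shape)  # torch.Size([1, 1, 144, 224, 224])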

Training

Our training implementation closely follows open-clip, which lets us leverage features such as patch dropout and SigLIP. Below is a training demo for chest CT; training on CT-RATE for 20 epochs takes ~6 hours on a node with 4 A40 GPUs.

torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
  --logs_dir /path/to/logs/ \
  --json-root ../../data/ct_rate/files/ --data-root /path/to/data/ct_rate/ \
  --train-data raw_annotation --input-info -1150 350 crop \
  --zeroshot-ct-rate ../../data/ct_rate/metafiles/valid_labels.csv --zeroshot-template volume \
  --zeroshot-frequency 1 \
  --save-frequency 1 \
  --report-to wandb \
  --wandb-project-name chest_ct \
  --warmup 377 \
  --batch-size 16 \
  --accum-batch 1 \
  --lr=1e-5 \
  --wd=0.2 \
  --epochs=20 \
  --precision amp \
  --workers 4 \
  --grad-checkpointing \
  --model clip_vit_base_singlescan_h2_token2744 \
  --use-cxr-bert \
  --lock-text

Use the following flags to enable patch dropout:

  --force-patch-dropout 0.5 \
  --beta2 0.95
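
Patch dropout randomly discards a fraction of the visual tokens during training, which reduces the attention cost on these large token grids. A simplified illustration of the idea (not the open_clip implementation):

# Patch dropout sketch: randomly keep a subset of patch tokens per sample.
import torch

def patch_dropout(tokens: torch.Tensor, drop_rate: float = 0.5) -> torch.Tensor:
    """tokens: (batch, num_tokens, dim); returns a random subset of tokens."""
    b, n, d = tokens.shape
    keep = max(1, int(n * (1.0 - drop_rate)))
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

x = torch.randn(2, 2744, 768)        # e.g. 2744 visual tokens per study
print(patch_dropout(x, 0.5).shape)   # roughly half the tokens remain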

Use the following flags to enable SigLIP:

  --model siglip_vit_base_singlescan_h2_token2744 \
  --beta2 0.95 \
  --siglip
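
SigLIP replaces the softmax contrastive loss with a pairwise sigmoid loss over every image-text pair in the batch, using a temperature and bias. A simplified sketch of the loss (fixed scalars stand in for the learnable temperature and bias; not the open_clip implementation):

# SigLIP pairwise sigmoid loss, simplified for illustration.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """img_emb, txt_emb: (batch, dim), L2-normalized."""
    logits = img_emb @ txt_emb.t() * t + b                          # (batch, batch)
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 matched, -1 otherwise
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(siglip_loss(img, txt))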

Citation

If you find this repository helpful, please consider citing:

@article{zhao2025towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Zhao, Chenhui and Lyu, Yiwei and Chowdury, Asadur and Harake, Edward and Kondepudi, Akhil and Rao, Akshay and Hou, Xinhai and Lee, Honglak and Hollon, Todd},
  journal={arXiv preprint arXiv:2505.21862},
  year={2025}
}
