Official implementation of LLaVA-Rad, introduced in "A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings".
LLaVA-Rad takes as input a frontal chest X-ray and, optionally, a reason for exam, and outputs the corresponding findings.
Note: if you are interested in radiologist-aligned evaluation of generated reports, we recommend you use the CheXprompt codebase.
- Introduction
- Requirements
- Installation
- Train
- Inference
- Evaluation
- Citation
- License and Usage Notices
- Acknowledgements
We trained and tested LLaVA-Rad using Python 3.10. For optimal inference, we recommend using a GPU environment. LLaVA-Rad has been tested on NVIDIA V100 and A100 GPUs with CUDA 11.x (or newer) drivers, on recent versions of Ubuntu.
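As a quick sanity check of the GPU environment, you can run the following sketch (it assumes PyTorch is available, which the package installation below pulls in):

```python
# Minimal GPU/CUDA sanity check for the environment described above.
import sys
import torch

print("Python:", sys.version.split()[0])            # expect 3.10.x
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("CUDA (build):", torch.version.cuda)
```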
Follow these steps to set up LLaVA-Rad:
- Clone the repository and navigate to the project folder:

  ```bash
  git clone https://github.com/microsoft/LLaVA-Rad.git
  cd LLaVA-Rad
  ```

- Create and activate a virtual environment (Python 3.10):

  ```bash
  conda create -n llavarad python=3.10 -y
  conda activate llavarad
  ```

- Upgrade pip and install the package:

  ```bash
  pip install --upgrade pip  # enable PEP 660 support
  pip install -e .
  ```

- [Optional] Install additional dependencies for training:

  ```bash
  pip install ninja
  pip install flash-attn --no-build-isolation
  ```
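To confirm the editable install (and, if you installed the optional training extras, the flash-attn build), a quick check like the sketch below can help; it is not part of the official setup:

```python
# Sanity-check the editable install of the llava package and, optionally,
# the flash-attn build installed for training.
import importlib

llava = importlib.import_module("llava")
print("llava imported from:", llava.__file__)

try:
    flash_attn = importlib.import_module("flash_attn")
    print("flash-attn version:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (only needed for training)")
```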
When starting from scratch, the following checkpoints are needed:
- A pre-trained LM checkpoint, e.g., lmsys/vicuna-7b-v1.5
- By default, we use a customized domain-specific ViT, BiomedCLIP-CXR. See README.md for details.
Before running the commands below, you need to have the data, image folder, and the above checkpoints ready.
0.1 Data
To download the data, sign the data use agreement and follow the instructions for download at LLaVA-Rad MIMIC-CXR Annotations on PhysioNet. This will include reports with extracted sections in LLaVA format, split into train/dev/test.
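To get a feel for the annotations, the snippet below prints one record. It assumes the standard LLaVA conversation-JSON layout (a list of records with `image` and `conversations` fields), and the file path is a placeholder for wherever you stored the downloaded split:

```python
# Inspect one record of the downloaded annotations (path is a placeholder;
# the field names assume the standard LLaVA conversation-JSON layout).
import json

with open("path/to/llava_rad_mimic_cxr_train.json") as f:
    records = json.load(f)

sample = records[0]
print("image:", sample.get("image"))
for turn in sample.get("conversations", []):
    print(f"[{turn['from']}] {turn['value'][:120]}")
```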
0.2 Images
You need to download the MIMIC-CXR-JPG images from PhysioNet by signing the data use agreement and following the instructions.
0.3 Model weights
You can find the pretrained model weights for BiomedCLIP-CXR and LLaVA-Rad at https://huggingface.co/microsoft/llava-rad.
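One way to fetch the checkpoints programmatically is with `huggingface_hub`; this is only a sketch, with the repository id taken from the link above and a local directory of your choosing:

```python
# Download the LLaVA-Rad / BiomedCLIP-CXR checkpoints from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="microsoft/llava-rad",       # repository listed above
    local_dir="checkpoints/llava-rad",   # placeholder destination
)
print("Checkpoints downloaded to:", local_path)
```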
Notes before proceeding:
- Change the paths in the scripts below according to where you downloaded the data.
- Batch size is set for 4-GPU machines. If your machine has a different number of GPUs, adjust the batch size accordingly (see the sketch after this list for keeping the effective batch size constant). Training commands have been tested on a single 80GB A100 and 4x 80GB H100, using torch 2.4.1 and CUDA 11.8 with flash-attn 2.7.2.post1.
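When changing the GPU count, a common approach is to keep the effective batch size (per-device batch × number of GPUs × gradient-accumulation steps) unchanged by adjusting gradient accumulation. The helper below is only a sketch; the flag names it refers to (`per_device_train_batch_size`, `gradient_accumulation_steps`) follow the usual Hugging Face Trainer convention, and the actual values live in the training scripts:

```python
# Keep the effective batch size constant when the number of GPUs changes.
# Check scripts/pretrain.sh and scripts/finetune_lora.sh for the values
# actually used in training.
def gradient_accumulation_steps(effective_batch: int, per_device_batch: int, num_gpus: int) -> int:
    world_batch = per_device_batch * num_gpus
    if effective_batch % world_batch != 0:
        raise ValueError("effective batch size must be divisible by per_device_batch * num_gpus")
    return effective_batch // world_batch

# Example: preserving a 128-sample effective batch when moving from 4 GPUs to 1.
print(gradient_accumulation_steps(128, 16, 4))  # -> 2
print(gradient_accumulation_steps(128, 16, 1))  # -> 8
```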
At this stage, we only train the projection layer (which aligns the vision features with the text features). Both the vision encoder and the LLM are frozen.
```bash
bash scripts/pretrain.sh
```

We get a pretrained projector, `mm_projector.bin`, after pretraining.
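To confirm the pretraining output, you can inspect the saved projector weights. This is a sketch; the checkpoint path below is a placeholder for whatever output directory the pretraining script was configured with:

```python
# List the tensors stored in the pretrained projector checkpoint.
import torch

state_dict = torch.load("path/to/pretrain_output/mm_projector.bin", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```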
Once we have a pretrained projector, we can fine-tune. The command below fine-tunes the projector and the LoRA adapters of the LLM:
```bash
bash scripts/finetune_lora.sh
```

Before running the command below, you need to change the script accordingly.
```bash
bash scripts/eval.sh
```

Note: to reproduce the evaluation results from the manuscript on the MIMIC-CXR dataset, changing the script means uncommenting and updating the paths for `query_file` and `image_folder`.
In the manuscript, the Open-I and CheXpert chest X-ray images and reports are also used for evaluation. These datasets are available at their corresponding sources: Open-I | CheXpert.
If you have run inference using multiple GPUs and have a resulting set of prediction chunks, make sure you concatenate them into a single file, as in the sketch below, before running the evaluation command:
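A minimal concatenation sketch (the chunk file naming pattern and JSONL format are assumptions; adjust both to match what your inference run actually produced):

```python
# Merge per-GPU prediction chunks into a single file for evaluation.
# The glob pattern and output path are placeholders.
import glob

chunk_files = sorted(glob.glob("results/test_predictions_chunk*.jsonl"))
with open("results/test_predictions.jsonl", "w") as merged:
    for path in chunk_files:
        with open(path) as chunk:
            merged.write(chunk.read())
print(f"Merged {len(chunk_files)} chunks")
```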
```bash
cd llava/eval/rr_eval
python run.py ${YOUR_PREDICTION_FILE}
```

```bibtex
@Article{ZambranoChaves2025,
author={Zambrano Chaves, Juan Manuel and Huang, Shih-Cheng and Xu, Yanbo and Xu, Hanwen and Usuyama, Naoto and Zhang, Sheng and Wang, Fei and Xie, Yujia and Khademi, Mahmoud and Yang, Ziyi and Awadalla, Hany and Gong, Julia and Hu, Houdong and Yang, Jianwei and Li, Chunyuan and Gao, Jianfeng and Gu, Yu and Wong, Cliff and Wei, Mu and Naumann, Tristan and Chen, Muhao and Lungren, Matthew P. and Chaudhari, Akshay and Yeung-Levy, Serena and Langlotz, Curtis P. and Wang, Sheng and Poon, Hoifung},
title={A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings},
journal={Nature Communications},
year={2025},
month={Apr},
day={01},
volume={16},
number={1},
pages={3108},
issn={2041-1723},
doi={10.1038/s41467-025-58344-x},
url={https://doi.org/10.1038/s41467-025-58344-x}
}
```
The data, code, and model checkpoints are licensed and intended for research use only. The code and model checkpoints are subject to additional restrictions as determined by the Terms of Use of LLaMA, Vicuna, and GPT-4, respectively. Code and model checkpoints may be used for research purposes and should not be used in direct clinical care or for any clinical decision-making purpose.
Our codebase heavily relies on LLaVA v1.5. Please check out their repo for more information, and consider citing them in addition to our manuscript if you use this codebase.
```bibtex
@misc{liu2023improvedllava,
title={Improved Baselines with Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
publisher={arXiv:2310.03744},
year={2023},
}
```