GitHub | Model Download | Paper Link | Project Page
UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis
A unified medical foundation model enabling both understanding and generation capabilities within a single architecture
We introduce UniMedVL, a unified medical foundation model for seamless multimodal understanding and generation. Four key innovations distinguish UniMedVL:
- **Unified Observation-Knowledge-Analysis Architecture**: UniMedVL sets itself apart from prior medical AI models by following a clinically-inspired three-level framework that mirrors how physicians process medical information, enabling both understanding and generation within a single architecture.
- **Versatile Medical Multimodal Capabilities**: UniMedVL supports a broad spectrum of medical tasks, including visual question answering, medical report generation, text-to-medical-image synthesis, cross-modal translation, and virtual staining across 9 imaging modalities.
- **Large-Scale Medical Dataset**: We present UniMed-5M, a comprehensive medical multimodal dataset containing 5.6M+ high-quality samples with three-stage quality verification and expert validation, covering understanding, generation, and interleaved tasks.
- **Superior Performance**: UniMedVL achieves state-of-the-art performance on multiple evaluation datasets, with 75.4% accuracy on SLAKE VQA, 53.5% on PathVQA, and competitive generation quality (96.29 average gFID), setting a new standard in unified medical AI.
- Paper & Evaluations - Research documentation and evaluation results
- Visualizations - Result figures and model demonstrations
- Model Checkpoints - Pre-trained UniMedVL weights (14B parameters)
- Inference Code - Model loading and inference examples
- Training Code - Full training pipeline and configuration files
- UniMed-5M Dataset - Training dataset with quality control
UniMedVL follows a workflow-guided three-level framework that mirrors how physicians process medical information:
```mermaid
flowchart TD
    A[Observation Level] --> B[Knowledge Level] --> C[Analysis Level]
    A1[UniMed-5M Dataset<br/>5.6M samples<br/>8 imaging modalities] --> A
    A --> A2[Quality Control<br/>Three-stage verification<br/>Expert validation]
    B1[Progressive Curriculum<br/>Foundation → Instruction → Unified] --> B
    B --> B2[Cross-modal Knowledge Fusion<br/>Understanding ↔ Generation]
    C1[Unified Architecture<br/>Dual encoders + MOT] --> C
    C --> C2[Multimodal Outputs<br/>Reports + Images + Annotations]
```
Three-Stage Progressive Curriculum Learning:

- **Stage 1 - Foundation Training (85K steps)**
  - Basic medical pattern recognition
  - Visual-language alignment
  - Data ratio: 75% I2T (image-to-text), 25% T2I (text-to-image)
- **Stage 2 - Instruction Tuning (120K steps)**
  - Cross-modal understanding enhancement
  - Medical expertise development
  - Data ratio: 40% I2T, 45% T2I, 10% interleaved
- **Stage 3 - Unified Training (70K steps)**
  - Advanced multimodal synthesis
  - Interleaved task mastery
  - Data ratio: 37% I2T, 35% T2I, 25% interleaved
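For a concrete view of these mixtures, here is a minimal Python sketch that encodes the stage schedule listed above. The dictionary layout, key names, and normalization step are illustrative assumptions, not the released training configuration; the step counts and ratios come from the list above.

```python
# Illustrative sketch of the three-stage curriculum mixtures described above.
# Structure and key names are hypothetical; published percentages that do not
# sum exactly to 100% are simply renormalized here.
CURRICULUM = {
    "stage1_foundation":  {"steps": 85_000,  "mix": {"i2t": 0.75, "t2i": 0.25}},
    "stage2_instruction": {"steps": 120_000, "mix": {"i2t": 0.40, "t2i": 0.45, "interleaved": 0.10}},
    "stage3_unified":     {"steps": 70_000,  "mix": {"i2t": 0.37, "t2i": 0.35, "interleaved": 0.25}},
}

def sampling_weights(stage: str) -> dict:
    """Return per-task sampling probabilities for a given curriculum stage."""
    mix = CURRICULUM[stage]["mix"]
    total = sum(mix.values())
    return {task: ratio / total for task, ratio in mix.items()}

if __name__ == "__main__":
    for stage in CURRICULUM:
        print(stage, CURRICULUM[stage]["steps"], sampling_weights(stage))
```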
Below are representative visualization results demonstrating UniMedVL's capabilities. For additional visualizations and comparisons, please see our Project Page.
Performance Across Training Stages
Multimodal Tasks Demonstration
Medical Visual Question Answering
Medical Report Generation
Text-to-Medical-Image Generation
Medical Image Generation across 8 Modalities
Medical Visual Question Answering Performance
| Model | Params | Type | VQA-RAD | SLAKE | PathVQA | OmniMedVQA | GMAI-MMBench |
|---|---|---|---|---|---|---|---|
| GMAI-VL | 7B | Medical-specific | 66.3 | 72.9 | 39.8 | 88.5 | 61.74 |
| HuatuoGPT-Vision | 7B | Medical-specific | 53.0 | 49.1 | 32.0 | 50.0 | 50.22 |
| Bagel | 7B | Unified | 60.09 | 58.91 | 39.05 | 71.13 | 48.11 |
| HealthGPT-L14 | 14B | Unified | 58.3 | 64.5 | 44.4 | 74.4 | 43.1 |
| UniMedVL | 14B | Unified | 61.9 | 75.4 | 53.5 | 85.8 | 60.75 |
Medical Image Generation Performance
Text-to-image generation performance across 8 medical imaging modalities. Metrics: gFID ↓ (lower is better) / BioMedCLIP Score ↑ (higher is better)
| Model | CFP | CXR | CT | HIS | MRI | OCT | Ultrasound | Endoscopy | Average |
|---|---|---|---|---|---|---|---|---|---|
| Bagel (7B) | 217.19/0.650 | 182.80/0.662 | 163.78/0.652 | 206.18/0.643 | 175.74/0.639 | 307.80/0.719 | 255.78/0.672 | 214.61/0.668 | 215.49/0.660 |
| UniMedVL (14B) | 53.20/0.708 | 73.04/0.702 | 73.04/0.696 | 149.01/0.704 | 90.36/0.706 | 99.27/0.721 | 95.38/0.706 | 133.11/0.707 | 96.29/0.706 |
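The per-modality gFID numbers above are typically computed between sets of real and generated images. The sketch below shows one way to score each modality and average the results, using the third-party `clean-fid` package; this is an assumption about tooling and folder layout, not the paper's actual evaluation code.

```python
# Minimal per-modality gFID sketch, assuming real and generated images are
# stored in one folder per modality. Requires `pip install clean-fid`.
from cleanfid import fid

MODALITIES = ["CFP", "CXR", "CT", "HIS", "MRI", "OCT", "Ultrasound", "Endoscopy"]

def average_gfid(real_root: str, gen_root: str) -> float:
    """Compute gFID per modality folder and return the unweighted mean."""
    scores = {}
    for m in MODALITIES:
        # compute_fid compares Inception statistics of two image directories.
        scores[m] = fid.compute_fid(f"{real_root}/{m}", f"{gen_root}/{m}")
    avg = sum(scores.values()) / len(scores)
    print(scores, "average:", avg)
    return avg
```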
Interleaved Multimodal Tasks Performance
Virtual Immunohistochemistry Staining (H&E → IHC)
| Method | Type | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| Pyramid Pix2pix | Specialized | 21.16 | 0.477 |
| HealthGPT-M3 | Unified | 15.81 | 0.242 |
| UniMedVL | Unified | 20.27 | 0.456 |
MRI Super-Resolution (4× upsampling)
| Method | Type | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| AMIR | Specialized | 31.99 | 0.939 |
| HealthGPT-M3 | Unified | 18.37 | 0.580 |
| UniMedVL | Unified | 27.29 | 0.890 |
Cross-Modal Synthesis (T2 → FLAIR MRI)
| Method | Type | Average PSNR ↑ | Average SSIM ↑ |
|---|---|---|---|
| ResViT | Specialized | 25.38 | 0.889 |
| HealthGPT-M3 | Unified | 19.09 | 0.748 |
| UniMedVL | Unified | 25.07 | 0.882 |
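PSNR and SSIM in the three tables above are standard full-reference image-quality metrics. The following sketch computes both for a single reference/prediction pair with scikit-image; this is an assumption about tooling for illustration, not the evaluation scripts used in the paper.

```python
# PSNR/SSIM sketch with scikit-image (`pip install scikit-image`), shown for
# clarity only; not the paper's evaluation pipeline.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(reference: np.ndarray, prediction: np.ndarray) -> tuple[float, float]:
    """Return (PSNR in dB, SSIM) for one grayscale reference/prediction pair."""
    data_range = reference.max() - reference.min()
    psnr = peak_signal_noise_ratio(reference, prediction, data_range=data_range)
    ssim = structural_similarity(reference, prediction, data_range=data_range)
    return psnr, ssim

if __name__ == "__main__":
    # Toy example: a reference image and a slightly noisy prediction in [0, 1].
    rng = np.random.default_rng(0)
    ref = rng.random((256, 256))
    pred = np.clip(ref + 0.05 * rng.standard_normal((256, 256)), 0.0, 1.0)
    print(evaluate_pair(ref, pred))
```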
Counterfactual Medical Image Generation
Performance on counterfactual chest X-ray generation with explanatory text. † indicates the unified fine-tuning variant.
| Method | gFID ↓ | AUROC ↑ | F1 ↑ | BLEU-3 ↑ | METEOR ↑ | ROUGE-L ↑ |
|---|---|---|---|---|---|---|
| ProgEmu | 29.21 | 0.792 | 0.891 | 0.124 | 0.410 | 0.261 |
| UniMedVL† | 27.17 | 0.797 | 0.873 | 0.264 | 0.449 | 0.465 |
```bash
conda env create -f codes/environment.yaml
conda activate unimedvl
```

Two interactive inference scripts are provided in the `codes/` directory:
- Medical Visual Question Answering (`interactive_vqa_inferencer.py`)
- Medical Image Generation (`interactive_image_generator.py`)
- Download the UniMedVL checkpoint
- Set `model_path` and `ROOT` in the script configuration
- Run the script: `python codes/interactive_vqa_inferencer.py` or `python codes/interactive_image_generator.py`
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you use this project in your research or work, please cite it as:
```bibtex
@misc{ning2025unimedvlunifyingmedicalmultimodal,
      title={Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis},
      author={Junzhi Ning and Wei Li and Cheng Tang and Jiashi Lin and Chenglong Ma and Chaoyang Zhang and Jiyao Liu and Ying Chen and Shujian Gao and Lihao Liu and Yuandong Pu and Huihui Xu and Chenhui Gou and Ziyan Huang and Yi Xin and Qi Qin and Zhongying Deng and Diping Song and Bin Fu and Guang Yang and Yuanfeng Ji and Tianbin Li and Yanzhou Su and Jin Ye and Shixiang Tang and Ming Hu and Junjun He},
      year={2025},
      eprint={2510.15710},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.15710},
}
```

We sincerely thank the following projects and their contributors for their invaluable open-source contributions that made this research possible:
- Bagel - Foundation model architecture and training methodology inspiration
- HealthGPT - Medical domain adaptation and evaluation framework
- VLMEvalKit - Comprehensive evaluation toolkit for vision-language models