Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai†
Huazhong University of Science & Technology
We present MERGE, a simple unified diffusion model for image generation and depth estimation. Its core idea is to leverage streamlined converters together with the rich visual priors stored in generative image models. Our model, derived from fixed generative image models and pluggable converters fine-tuned on synthetic data, extends them with powerful zero-shot depth estimation capability.
- [21/Oct/2025] The training and inference code is now available!
- [18/Sep/2025] MERGE is accepted to NeurIPS 2025! 🥳🥳🥳
This installation was tested on: Ubuntu 20.04 LTS, Python 3.9.21, CUDA 11.8, NVIDIA H20-80GB.
- Clone the repository (requires git):
git clone https://github.com/HongkLin/MERGE
cd MERGE
- Install dependencies (requires conda):
conda create -n merge python=3.9.21 -y
conda activate merge
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
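After installation, a quick sanity check can confirm that the active environment resembles the tested one. The snippet below is a minimal sketch, not part of the repository: `check_environment` and the `TESTED_*` constants are hypothetical names, and the version values are simply the ones listed in the install commands above.

```python
import importlib.util
import sys

# Versions reported as tested in this README; other versions may also work.
TESTED_PYTHON = (3, 9)
TESTED_TORCH = "2.3.1"


def check_environment():
    """Collect a small report about the active Python/PyTorch setup."""
    report = {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "python_matches_tested": sys.version_info[:2] == TESTED_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check still runs without PyTorch

        report["torch"] = torch.__version__
        report["torch_matches_tested"] = torch.__version__.startswith(TESTED_TORCH)
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A mismatch against the tested versions is not necessarily an error; it only flags that the setup differs from the configuration listed above.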
- Follow Marigold to prepare the depth training data (Hypersim and Virtual KITTI 2). The default dataset structure is as follows:
datasets/
    hypersim/
        test/
        train/
            ai_001_001/
            ...
            ai_055_010/
        val/
    vkitti/
        depth/
            Scene01/
            ...
            Scene20/
        rgb/
- Download the pre-trained PixArt-α and FLUX.1 [dev] models, then set pretrained_model_name_or_path to point to the downloaded weights.
- Run the training command! 🚀
conda activate merge
# Training MERGE-B model
bash train_scripts/train_merge_b_depth.sh
# Training MERGE-L model
bash train_scripts/train_merge_l_depth.sh
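Before launching either training script, it can help to confirm that the dataset folders are where the default structure expects them. The following is a minimal sketch under stated assumptions: `missing_dataset_dirs` and `EXPECTED_DIRS` are hypothetical names, the root is assumed to be `datasets/` relative to the working directory, and `rgb/` is assumed to sit next to `depth/` under `vkitti/`. Individual scene folders are not checked.

```python
from pathlib import Path

# Sub-directories taken from the default dataset structure in this README.
# The nesting of rgb/ under vkitti/ is an assumption.
EXPECTED_DIRS = [
    "hypersim/test",
    "hypersim/train",
    "hypersim/val",
    "vkitti/depth",
    "vkitti/rgb",
]


def missing_dataset_dirs(root="datasets"):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

If the data lives elsewhere, pass that location as `root`; the check only tests for directory existence and says nothing about whether the contents were preprocessed correctly.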
- Place your images in a directory, for example, under /data (where we have prepared several examples).
- Run the inference command:
# for MERGE-B
python inference_merge_base_depth.py --pretrained_model_path PATH/PixArt-XL-2-512x512 --model_weights PATH/merge_base_depth --image_path ./data/demo_1.png
# for MERGE-L
python inference_merge_large_depth.py --pretrained_model_path PATH/FLUX.1-dev --model_weights PATH/merge_large_depth --image_path ./data/demo_1.png
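The commands above process one image each. To cover a whole directory, one option is a small wrapper that builds the same command for every image it finds. This is a sketch rather than a repository feature: `build_commands`, `run_all`, and `IMAGE_SUFFIXES` are hypothetical names, the accepted file extensions are an assumption, and only the three flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed set of input formats


def build_commands(script, pretrained_model_path, model_weights, image_dir):
    """Build one inference command per image found directly in image_dir."""
    images = sorted(
        path
        for path in Path(image_dir).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
    return [
        [
            sys.executable, script,
            "--pretrained_model_path", str(pretrained_model_path),
            "--model_weights", str(model_weights),
            "--image_path", str(image),
        ]
        for image in images
    ]


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For MERGE-B, for instance, `run_all(build_commands("inference_merge_base_depth.py", "PATH/PixArt-XL-2-512x512", "PATH/merge_base_depth", "./data"))` would issue the first command above once per image. Each invocation reloads the model, so this favors simplicity over speed.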
Below are the released models and their corresponding configurations:
| CHECKPOINT_DIR | PRETRAINED_MODEL | TASK_NAME |
|---|---|---|
| merge-base-depth-v1 | PixArt-XL-2-512x512 | depth |
| merge-large-depth-v1 | FLUX.1-dev | depth |
If you find this repository useful in your research, please consider giving it a star ⭐ and citing our paper:
@inproceedings{lin2025merge,
title={More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models},
author={Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
booktitle={Advances in Neural Information Processing Systems},
year={2025},
}
- Thanks to Diffusers for their wonderful technical support and awesome collaboration!
- Thanks to Hugging Face for sponsoring the nice demo!
- Thanks to DiT for their wonderful work and codebase!
- Thanks to PixArt-α for their wonderful work and codebase!
- Thanks to FLUX and Marigold for their wonderful work!