
[NeurIPS 2025] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models


More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai

Huazhong University of Science & Technology

(†) Corresponding author.


[MERGE teaser figure] We present MERGE, a simple unified diffusion model for image generation and depth estimation. Its core is a set of streamlined converters that exploit the rich visual priors stored in generative image models. Built from a frozen generative image model with pluggable converters fine-tuned on synthetic data, MERGE delivers strong zero-shot depth estimation.


📢 News

  • [21/Oct/2025] The training and inference code is now available!
  • [18/Sep/2025] MERGE is accepted to NeurIPS 2025! 🥳🥳🥳

🛠️ Setup

This installation was tested on: Ubuntu 20.04 LTS, Python 3.9.21, CUDA 11.8, NVIDIA H20-80GB.

  1. Clone the repository (requires git):
git clone https://github.com/HongkLin/MERGE
cd MERGE
  2. Install dependencies (requires conda):
conda create -n merge python=3.9.21 -y
conda activate merge
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt 
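After installing, it can be worth confirming that the pinned packages actually import before launching a long training run. The helper below is a hypothetical sanity check, not part of the repository; it reports the installed version of each listed package without raising on a missing one.

```python
# Hypothetical environment sanity check (not part of the MERGE repo):
# reports which of the pinned packages import, and their versions.
import importlib


def check_env(packages=("torch", "torchvision", "torchaudio")):
    """Return {package: version string, or None if the import fails}."""
    found = {}
    for name in packages:
        try:
            mod = importlib.import_module(name)
            # Fall back to "unknown" for modules without __version__.
            found[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    for pkg, ver in check_env().items():
        print(f"{pkg}: {ver if ver else 'MISSING'}")
```

If any entry prints `MISSING`, re-run the conda/pip steps above inside the `merge` environment.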

🔥 Training

  1. Follow Marigold to prepare the depth training data (Hypersim and Virtual KITTI 2); the default dataset structure is as follows:
datasets/
    hypersim/
        test/
        train/
            ai_001_001/
            ...
            ai_055_010/
        val/
    vkitti/
        depth/
            Scene01/
            ...
            Scene20/
        rgb/
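Before training, it can help to verify that the prepared data matches the layout above. The snippet below is a hypothetical helper, not part of the repository; it lists any expected sub-directories missing under the `datasets/` root shown above.

```python
# Hypothetical layout check (not part of the MERGE repo): verifies that the
# Hypersim and Virtual KITTI 2 directories match the structure expected above.
from pathlib import Path

# Top-level sub-directories from the documented layout.
EXPECTED = {
    "hypersim": ("train", "val", "test"),
    "vkitti": ("rgb", "depth"),
}


def missing_dirs(root="datasets"):
    """Return paths (as strings) of expected sub-directories absent under root."""
    base = Path(root)
    missing = []
    for dataset, subdirs in EXPECTED.items():
        for sub in subdirs:
            path = base / dataset / sub
            if not path.is_dir():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    gaps = missing_dirs()
    print("dataset layout OK" if not gaps else f"missing: {gaps}")
```

Scene-level contents (e.g. `ai_001_001/`, `Scene01/`) are not checked here; the Marigold preparation scripts determine those.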
  2. Download the pre-trained PixArt-α and FLUX.1 [dev], then set pretrained_model_name_or_path to the download location.
  3. Run the training command! 🚀
conda activate merge

# Training MERGE-B model
bash train_scripts/train_merge_b_depth.sh

# Training MERGE-L model
bash train_scripts/train_merge_l_depth.sh


🕹️ Inference

  1. Place your images in a directory, for example, under /data (where we have prepared several examples).
  2. Run the inference command:
# for MERGE-B
python inference_merge_base_depth.py --pretrained_model_path PATH/PixArt-XL-2-512x512 --model_weights PATH/merge_base_depth --image_path ./data/demo_1.png

# for MERGE-L
python inference_merge_large_depth.py --pretrained_model_path PATH/FLUX.1-dev --model_weights PATH/merge_large_depth --image_path ./data/demo_1.png
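To process a whole directory rather than one image, the two commands above can be generated per file. The wrapper below is a hypothetical sketch, not part of the repository; the script names and flags mirror the commands above, and the `PATH/...` defaults are placeholders you should replace with your own checkpoint locations.

```python
# Hypothetical batch wrapper (not part of the MERGE repo): builds one
# inference command per PNG in a directory, mirroring the commands above.
from pathlib import Path


def build_commands(image_dir, model="base",
                   pretrained="PATH/PixArt-XL-2-512x512",
                   weights="PATH/merge_base_depth"):
    """Return one inference command (as an argv list) per image in image_dir.

    Defaults correspond to MERGE-B; for MERGE-L pass model="large" and
    override pretrained/weights with the FLUX.1-dev checkpoint paths.
    """
    script = f"inference_merge_{'base' if model == 'base' else 'large'}_depth.py"
    commands = []
    for image in sorted(Path(image_dir).glob("*.png")):
        commands.append([
            "python", script,
            "--pretrained_model_path", pretrained,
            "--model_weights", weights,
            "--image_path", str(image),
        ])
    return commands


if __name__ == "__main__":
    import subprocess
    for cmd in build_commands("./data"):
        subprocess.run(cmd, check=True)
```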

Choose your model

Below are the released models and their corresponding configurations:

CHECKPOINT_DIR PRETRAINED_MODEL TASK_NAME
merge-base-depth-v1 PixArt-XL-2-512x512 depth
merge-large-depth-v1 FLUX.1-dev depth

⚖️ Main Results

Zero-shot Depth Estimation Results

Zero-shot Normal Estimation Results


📖BibTeX

If you find this repository useful in your research, please consider giving it a star ⭐ and citing:

@inproceedings{lin2025merge,
      title={More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models}, 
      author={Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
      booktitle={Advances in Neural Information Processing Systems},
      year={2025},
}

🤗Acknowledgements

  • Thanks to Diffusers for their wonderful technical support and awesome collaboration!
  • Thanks to Hugging Face for sponsoring the demo!
  • Thanks to DiT for their wonderful work and codebase!
  • Thanks to PixArt-α for their wonderful work and codebase!
  • Thanks to FLUX and Marigold for their wonderful work!
