
[NeurIPS 2025] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models


More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai

Huazhong University of Science & Technology

(†) Corresponding author.


[MERGE teaser figure] We present MERGE, a simple unified diffusion model for image generation and depth estimation. Its core is a set of streamlined converters that exploit the rich visual priors stored in generative image models. Built from a frozen generative image model with pluggable converters fine-tuned on synthetic data, MERGE delivers strong zero-shot depth estimation.


📢 News

  • [21/Oct/2025] The training and inference code is now available!
  • [18/Sep/2025] MERGE is accepted to NeurIPS 2025! 🥳🥳🥳

🛠️ Setup

This installation was tested on: Ubuntu 20.04 LTS, Python 3.9.21, CUDA 11.8, NVIDIA H20-80GB.

  1. Clone the repository (requires git):
git clone https://github.com/HongkLin/MERGE
cd MERGE
  2. Install dependencies (requires conda):
conda create -n merge python=3.9.21 -y
conda activate merge
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt 
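After installing, it can be worth confirming that the pinned packages actually import before launching a long training run. The helper below is a hypothetical sanity check, not part of the repository; it reports the installed version of each listed package without raising on a missing one.

```python
# Hypothetical environment sanity check (not part of the MERGE repo):
# reports which of the pinned packages import, and their versions.
import importlib


def check_env(packages=("torch", "torchvision", "torchaudio")):
    """Return {package: version string, or None if the import fails}."""
    found = {}
    for name in packages:
        try:
            mod = importlib.import_module(name)
            # Fall back to "unknown" for modules without __version__.
            found[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    for pkg, ver in check_env().items():
        print(f"{pkg}: {ver if ver else 'MISSING'}")
```

If any entry prints `MISSING`, re-run the conda/pip steps above inside the `merge` environment.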

🔥 Training

  1. Follow Marigold to prepare the depth training data (Hypersim and Virtual KITTI 2); the default dataset structure is as follows:
datasets/
    hypersim/
        test/
        train/
            ai_001_001/
            ...
            ai_055_010/
        val/
    vkitti/
        depth/
            Scene01/
            ...
            Scene20/
        rgb/
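Before training, it can help to verify that the prepared data matches the layout above. The snippet below is a hypothetical helper, not part of the repository; it lists any expected sub-directories missing under the `datasets/` root shown above.

```python
# Hypothetical layout check (not part of the MERGE repo): verifies that the
# Hypersim and Virtual KITTI 2 directories match the structure expected above.
from pathlib import Path

# Top-level sub-directories from the documented layout.
EXPECTED = {
    "hypersim": ("train", "val", "test"),
    "vkitti": ("rgb", "depth"),
}


def missing_dirs(root="datasets"):
    """Return paths (as strings) of expected sub-directories absent under root."""
    base = Path(root)
    missing = []
    for dataset, subdirs in EXPECTED.items():
        for sub in subdirs:
            path = base / dataset / sub
            if not path.is_dir():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    gaps = missing_dirs()
    print("dataset layout OK" if not gaps else f"missing: {gaps}")
```

Scene-level contents (e.g. `ai_001_001/`, `Scene01/`) are not checked here; the Marigold preparation scripts determine those.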
  2. Download the pre-trained PixArt-α and FLUX.1 [dev], then set pretrained_model_name_or_path to the download location.
  3. Run the training command! 🚀
conda activate merge

# Training MERGE-B model
bash train_scripts/train_merge_b_depth.sh

# Training MERGE-L model
bash train_scripts/train_merge_l_depth.sh


🕹️ Inference

  1. Place your images in a directory, for example, under /data (where we have prepared several examples).
  2. Run the inference command:
# for MERGE-B
python inference_merge_base_depth.py --pretrained_model_path PATH/PixArt-XL-2-512x512 --model_weights PATH/merge_base_depth --image_path ./data/demo_1.png

# for MERGE-L
python inference_merge_large_depth.py --pretrained_model_path PATH/FLUX.1-dev --model_weights PATH/merge_large_depth --image_path ./data/demo_1.png
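To process a whole directory rather than one image, the two commands above can be generated per file. The wrapper below is a hypothetical sketch, not part of the repository; the script names and flags mirror the commands above, and the `PATH/...` defaults are placeholders you should replace with your own checkpoint locations.

```python
# Hypothetical batch wrapper (not part of the MERGE repo): builds one
# inference command per PNG in a directory, mirroring the commands above.
from pathlib import Path


def build_commands(image_dir, model="base",
                   pretrained="PATH/PixArt-XL-2-512x512",
                   weights="PATH/merge_base_depth"):
    """Return one inference command (as an argv list) per image in image_dir.

    Defaults correspond to MERGE-B; for MERGE-L pass model="large" and
    override pretrained/weights with the FLUX.1-dev checkpoint paths.
    """
    script = f"inference_merge_{'base' if model == 'base' else 'large'}_depth.py"
    commands = []
    for image in sorted(Path(image_dir).glob("*.png")):
        commands.append([
            "python", script,
            "--pretrained_model_path", pretrained,
            "--model_weights", weights,
            "--image_path", str(image),
        ])
    return commands


if __name__ == "__main__":
    import subprocess
    for cmd in build_commands("./data"):
        subprocess.run(cmd, check=True)
```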

Choose your model

Below are the released models and their corresponding configurations:

CHECKPOINT_DIR PRETRAINED_MODEL TASK_NAME
merge-base-depth-v1 PixArt-XL-2-512x512 depth
merge-large-depth-v1 FLUX.1-dev depth

⚖️ Main Results

Zero-shot Depth Estimation Results

Zero-shot Normal Estimation Results


📖BibTeX

If you find this repository useful in your research, please consider giving it a star ⭐ and citing:

@inproceedings{lin2025merge,
      title={More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models}, 
      author={Lin, Hongkai and Liang, Dingkang and Du, Mingyang and Zhou, Xin and Bai, Xiang},
      booktitle={Advances in Neural Information Processing Systems},
      year={2025},
}

🤗Acknowledgements

  • Thanks to Diffusers for their wonderful technical support and awesome collaboration!
  • Thanks to Hugging Face for sponsoring the demo!
  • Thanks to DiT for their wonderful work and codebase!
  • Thanks to PixArt-α for their wonderful work and codebase!
  • Thanks to FLUX and Marigold for their wonderful work!
