[ICLR 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.


One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie1* Weijia Mao1* Zechen Bai1* David Junhao Zhang1*
Weihao Wang2 Kevin Qinghong Lin1 Yuchao Gu1 Zhijie Chen2 Zhenheng Yang2 Mike Zheng Shou1

1 Show Lab, National University of Singapore 2 Bytedance



Improved Native Unified Multimodal Models

Jinheng Xie1 Zhenheng Yang2 Mike Zheng Shou1

1 Show Lab, National University of Singapore 2 Bytedance


News

  • [2025-07-05] Fix some issues related to visualization of generated images during training.
  • [2025-07-05] We release the training and inference code for simple mixed-modality generation.
  • [2025-06-27] We release the training code for multimodal understanding and generation.
  • [2025-06-25] We thank team OneIG-Bench for evaluating Show-o2 models on their new benchmark, in which our models have achieved leading performance in terms of Alignment and Reasoning metrics. The leaderboard is maintained here.

  • [2025-06-20] We are including more concurrent works in our comparative analysis tables. Feel free to reach out to us if we have missed your work.
  • [2025-06-19] We release the Show-o2 models with 1.5B and 7B LLM parameters for multimodal understanding and generation. We perform the unified learning of multimodal understanding and generation on the text token and 3D Causal VAE space, which is scalable for text, image, and video modalities. A dual-path of spatial (-temporal) fusion is proposed to accommodate the distinct feature dependency of multimodal understanding and generation. We employ specific heads with autoregressive modeling and flow matching for the overall unified learning of multimodal understanding, image/video and mixed-modality generation.


  • [2025-01-23] Show-o has been accepted to ICLR 2025.

  • [2024-10-15] Update Arxiv paper to include new features and experimental results.

    • Support image generation in a resolution of 512x512.

    • Improve the multimodal understanding capabilities of purely discrete Show-o.

    • Improve the performance on the GenEval benchmark.

    • Explore the impact of dataset scale and image resolution on multimodal understanding capabilities of discrete image tokens. For more information, please refer to the paper.

    • We release the weights of Show-o before fine-tuning on LLaVA instructional tuning datasets. You can fine-tune them following the configurations in ./configs.
  • [2024-09-12] Arxiv paper updated to include preliminaries about discrete diffusion.

  • [2024-09-03] We deploy an online demo on Hugging Face Space. 🤗 Have fun!

  • [2024-09-02] We release the training code for pre-training and instruction tuning! 🔥🔥

  • [2024-09-01] Add FlexAttention implementation for acceleration. Thanks to @Horace for providing examples.

  • [2024-08-28] We maintain a repo of Awesome Unified Multimodal Models. If you are interested in unified models, star and watch it to get the latest updates!

  • [2024-08-27] Add integration to Hugging Face! Thanks to @NielsRogge.

  • [2024-08-26] We have set up two community platforms to facilitate discussion, feature requests, and collaboration! Reach us on Discord and WeChat!

  • [2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation including image captioning, visual question answering (VQA), text-to-image generation, text-guided inpainting and extrapolation.

What is new about Show-o?

Below is a characteristics comparison among understanding only, generation only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion represents both continuous and discrete diffusion.

Below is an overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.
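
The attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not the repository's implementation: the function name `build_mixed_attention_mask` and the `"text"`/`"image"` labels are hypothetical, and the sketch assumes that every token attends causally to earlier positions while tokens inside the same contiguous image span also attend to each other in both directions.

```python
def build_mixed_attention_mask(modalities):
    """Return an n x n boolean matrix; mask[i][j] is True if position i may attend to j.

    modalities: list with one "text" or "image" label per sequence position
    (hypothetical labels chosen for this illustration).
    """
    n = len(modalities)

    # Give each contiguous run of image tokens its own block id.
    block_id = [None] * n
    current = -1
    for i, modality in enumerate(modalities):
        if modality == "image":
            if i == 0 or modalities[i - 1] != "image":
                current += 1
            block_id[i] = current

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i  # every token sees itself and the past
            same_image = block_id[i] is not None and block_id[i] == block_id[j]
            mask[i][j] = causal or same_image  # image spans are fully connected
    return mask


sequence = ["text", "text", "image", "image", "image", "text"]
mask = build_mixed_attention_mask(sequence)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the three image rows share the same pattern (full attention inside the image span plus the preceding text), while the text rows stay lower-triangular.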


TODO

  • Release the inference code.
  • Release the training code.
  • Support image generation in a resolution of 512x512.
  • Scale up the model size (based on LLaMA3) and increase the number of training data.

Hugging Face models and annotations

The Show-o2 checkpoints can be found on Hugging Face:

The Show-o checkpoints can be found on Hugging Face:

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

Log in to your wandb account on your machine or server.

wandb login <your wandb keys>

Inference demo for Multimodal Understanding. You can view the results on wandb.

option (c)

python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'

or option (a)

python3 inference_mmu.py config=configs/showo_demo_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
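
Both commands pass two questions in a single `question` argument, separated by `***`. The sketch below shows how such a string could be split into individual questions. It is an illustration only: the helper name `split_questions` is hypothetical, and the assumption that the separator is exactly `" *** "` is inferred from the commands above rather than taken from the script.

```python
def split_questions(question_arg, separator=" *** "):
    """Split one command-line string into individual, non-empty questions."""
    return [part.strip() for part in question_arg.split(separator) if part.strip()]


questions = split_questions(
    "Please describe this image in detail. *** Do you think the image is unusual or not?"
)
for index, question in enumerate(questions, start=1):
    print(f"Q{index}: {question}")
```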

Inference demo for Text-to-Image Generation. You can view the results (at a resolution of 512x512) on wandb.

python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'

Inference demo for Text-guided Inpainting. You can view the results (at a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp

Inference demo for Text-guided Extrapolation. You can view the results (at a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
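
The extrapolation command repeats the same prompt once per direction, which is easy to get wrong by hand. The following sketch builds the two `***`-separated strings from a list of (direction, prompt) pairs. The helper name `build_extrapolation_args` is hypothetical, and the sketch assumes that each direction is paired with the prompt at the same position, as the equal-length lists in the command above suggest.

```python
def build_extrapolation_args(steps, separator=" *** "):
    """Turn (direction, prompt) pairs into the extra_direction and prompt strings."""
    if not steps:
        raise ValueError("at least one (direction, prompt) pair is required")
    directions = [direction for direction, _ in steps]
    prompts = [prompt for _, prompt in steps]
    for text in directions + prompts:
        if "***" in text:
            raise ValueError("directions and prompts must not contain the separator")
    return separator.join(directions), separator.join(prompts)


lake_prompt = (
    "a serene natural landscape featuring a clear, blue lake "
    "surrounded by lush green trees."
)
steps = [("left", lake_prompt)] * 3 + [("right", lake_prompt)] * 3
extra_direction, joined_prompts = build_extrapolation_args(steps)
print(f"extra_direction='{extra_direction}'")
print(f"prompt='{joined_prompts}'")
```

The two printed values can be pasted into the command in place of the hand-written `extra_direction` and `prompt` arguments.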

Training pipeline

Prepare your training data and change the data path in configs/xx.yaml.

Note that our training process is based on accelerate. Make sure to configure accelerate for distributed training. We provide config examples below for (distributed) training on a single GPU or multiple GPUs.

├── accelerate_configs/
|   ├── multi_nodes (6x8 GPUs)
|   |   ├── ...
|   ├── 1_gpu.yaml
|   └── 8_gpu_deepspeed_zero2.yaml

Stage 1 - Pre-training on ImageNet-1K dataset. Change the data path to ImageNet-1K in configs/showo_pretraining_stage1.yaml. Note that we use internal packages to process the RefinedWeb dataset, so you must manually comment out the language-modeling code in training/train.py or write a new dataloader.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage1.yaml

Once trained, the checkpoint folder is structured as follows:

├── show-o-training-stage1/
|   ├── ...
|   ├── checkpoint-500000
|   └── config.yaml

Moving to the next stage is a bit cumbersome: create a new output folder (set in the yaml config) for stage 2, copy the latest checkpoint of stage 1 into this folder, and rename it to checkpoint-0. Training will then resume from it automatically in the next stage. Apply the same procedure when resuming training in the following stages.

├── show-o-training-stage2/
|   └── checkpoint-0
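
The copy-and-rename step can be scripted. The sketch below is one possible way to do it with the Python standard library. The function name `carry_over_checkpoint` is hypothetical, and it assumes that checkpoints are directories named `checkpoint-<step>` and that the highest step number is the latest one.

```python
import re
import shutil
from pathlib import Path


def carry_over_checkpoint(prev_output_dir, next_output_dir):
    """Copy the highest-numbered checkpoint-<step> folder to <next>/checkpoint-0."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in Path(prev_output_dir).iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        raise FileNotFoundError(f"no checkpoint-<step> folder in {prev_output_dir}")

    # Pick the folder with the largest step number.
    _, latest = max(candidates, key=lambda item: item[0])

    target = Path(next_output_dir) / "checkpoint-0"
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite it")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(latest, target)
    return target
```

For example, `carry_over_checkpoint("show-o-training-stage1", "show-o-training-stage2")` would produce the layout shown above. The folder names must match the output directories set in your yaml configs.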

Stage 2 - Pre-training on Image-Text dataset. The default dataloader is based on WebDataset. Change the data path in configs/showo_pretraining_stage2.yaml.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage2.yaml

Stage 3 - Pre-training on High-quality Image-Text dataset. Change the data path in configs/showo_pretraining_stage3.yaml.

Copy the pre-trained weights to the output_dir (specified in the config)

├── show-o-training-stage3/
|   └── checkpoint-0

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage3.yaml

[Option a] Stage 3 - Instruction tuning on LLaVA dataset (llava-pretrain). Change the data path in llava/llava_data_vq_unified.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_instruction_tuning_1.yaml

[Option a] Stage 3 - Instruction tuning on LLaVA dataset (llava-tuning). Change the data path in llava/llava_data_vq_unified.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_instruction_tuning_2.yaml

[Option c] Stage 3 - Instruction tuning on LLaVA dataset (llava-pretrain) with CLIP-ViT. Change the data path in llava/llava_pretrain_data.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_w_clip_vit.py config=configs/showo_instruction_tuning_1_w_clip_vit.yaml

[Option c] Stage 3 - Instruction tuning on LLaVA dataset (llava-tuning) with CLIP-ViT. Change the data path in llava/llava_instuct_data.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_w_clip_vit.py config=configs/showo_instruction_tuning_2_w_clip_vit.yaml
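
Since all stages share the same launch pattern, the commands above can be generated from a small table. This is a convenience sketch, not part of the repository: the stage keys and the helper name `build_launch_command` are hypothetical, while the script and config paths are copied from the commands in this section.

```python
import shlex

# (training script, config file) per stage, copied from the commands above.
# The dictionary keys are hypothetical labels chosen for this sketch.
STAGES = {
    "pretraining_stage1": ("training/train.py", "configs/showo_pretraining_stage1.yaml"),
    "pretraining_stage2": ("training/train.py", "configs/showo_pretraining_stage2.yaml"),
    "pretraining_stage3": ("training/train.py", "configs/showo_pretraining_stage3.yaml"),
    "instruction_tuning_1": ("training/train.py", "configs/showo_instruction_tuning_1.yaml"),
    "instruction_tuning_2": ("training/train.py", "configs/showo_instruction_tuning_2.yaml"),
    "instruction_tuning_1_w_clip_vit": (
        "training/train_w_clip_vit.py",
        "configs/showo_instruction_tuning_1_w_clip_vit.yaml",
    ),
    "instruction_tuning_2_w_clip_vit": (
        "training/train_w_clip_vit.py",
        "configs/showo_instruction_tuning_2_w_clip_vit.yaml",
    ),
}


def build_launch_command(stage, accelerate_config, port=8888):
    """Return the accelerate launch command for one stage as an argument list."""
    script, config = STAGES[stage]
    return [
        "accelerate", "launch",
        "--config_file", accelerate_config,
        f"--main_process_port={port}",
        script,
        f"config={config}",
    ]


command = build_launch_command("pretraining_stage2", "path/to/your/accelerate_config")
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` once the checkpoint from the previous stage is in place.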

Request new features? Willing to contribute?

We welcome your brave new ideas and contributions! If you would like to see new features in Show-o, or want to contribute to this project, please fill in this form!

Pending Requested Features

  • Mixed-modal generation
  • Support training on more datasets
  • Visual tokenizer training

Find more at Contributing and Roadmap.

Join Discussion

You are welcome to join the discussion and help us keep improving the user experience of Show-o. Reach us through this Discord channel or the WeChat QR code below!

Citation

To cite the papers and models, please use the following entries:

@article{xie2024showo,
  title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
  author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2408.12528},
  year={2024}
}

@article{xie2025showo2,
  title={Show-o2: Improved Native Unified Multimodal Models},
  author={Xie, Jinheng and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2506.15564},
  year={2025}
}

Acknowledgments

This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.
