Jinheng Xie1* 
Weijia Mao1* 
Zechen Bai1* 
David Junhao Zhang1* 
Weihao Wang2 
Kevin Qinghong Lin1 
Yuchao Gu1
Zhijie Chen2 
Zhenheng Yang2 
Mike Zheng Shou1
1 Show Lab, National University of Singapore 2 Bytedance
News
- [2024-08-26] We have set up two community platforms to facilitate discussion, requests, and collaboration. Reach us on Discord and WeChat!
- [2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation, including image captioning, visual question answering (VQA), text-to-image generation, and text-guided inpainting and extrapolation.
Below is an overview of Show-o. The input data, regardless of modality, is tokenized and then arranged into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens through (discrete) denoising diffusion modeling with full attention, and then generates the desired output. Specifically, Show-o can handle image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation.
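To make the two attention patterns concrete, here is a minimal sketch of a mixed attention mask in plain Python. It assumes that a text token may attend only to earlier positions, while an image token may attend to earlier positions plus every token in its own contiguous image block. The function name `build_mixed_attention_mask` and the rule for combining the two patterns are illustrative assumptions, not the Show-o implementation.

```python
def build_mixed_attention_mask(modalities):
    """Return mask[i][j] == True when position i may attend to position j.

    `modalities` holds one label per token, either "text" or "image".
    Assumption: text is causal; image tokens also see their whole image block.
    """
    n = len(modalities)

    # Give every contiguous run of image tokens its own block id.
    block_id = [None] * n
    current = -1
    for i, label in enumerate(modalities):
        if label == "image":
            if i == 0 or modalities[i - 1] != "image":
                current += 1
            block_id[i] = current

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_image = block_id[i] is not None and block_id[i] == block_id[j]
            mask[i][j] = causal or same_image
    return mask


if __name__ == "__main__":
    sequence = ["text", "text", "image", "image", "image"]
    for row in build_mixed_attention_mask(sequence):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the text rows form a lower triangle, while the image rows are filled across the whole image block.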
- Release the inference code.
- Release the training code (in the coming weeks).
- Scale up the model size (based on LLaMA3) and increase the amount of training data.
First, set up the environment:
pip3 install -r requirements.txt
Download the weights of the pre-trained LLM (Phi-1.5):
git lfs install
git clone https://huggingface.co/microsoft/phi-1_5
Download the model weights of Show-o and place them in a directory with the structure below:
checkpoints/
├── magvitv2.pth
├── showo.bin
├── showo_w_clip_vit.bin
└── phi-1_5/
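Before running the demos, it can help to confirm that the layout matches. The following is a small sketch, not part of the repository: the helper name `missing_checkpoint_entries` is hypothetical, and the default root `checkpoints` is taken from the structure above.

```python
from pathlib import Path

# Entries listed in the directory structure above.
EXPECTED_ENTRIES = ["magvitv2.pth", "showo.bin", "showo_w_clip_vit.bin", "phi-1_5"]


def missing_checkpoint_entries(root="checkpoints"):
    """Return the expected files or folders that are absent under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root_path / name).exists()]


if __name__ == "__main__":
    missing = missing_checkpoint_entries()
    if missing:
        print("Missing from checkpoints/:", ", ".join(missing))
    else:
        print("All expected checkpoint entries are present.")
```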
Log in to your wandb account on your machine or server:
wandb login <your wandb keys>
Run the inference demo for multimodal understanding; you can view the results on wandb.
python3 inference_mmu.py config=configs/showo_demo_w_clip_vit.yaml \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?' \
pretrained_model_path=./checkpoints/showo_w_clip_vit.bin
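The `question` value above packs two questions into one string separated by `***`. How `inference_mmu.py` parses this is not stated here, so the sketch below only illustrates the apparent convention; `split_multi_value` is a hypothetical helper, not a function from the repository.

```python
def split_multi_value(arg, separator="***"):
    """Split a '***'-separated argument into trimmed, non-empty parts."""
    return [part.strip() for part in arg.split(separator) if part.strip()]


if __name__ == "__main__":
    question = (
        "Please describe this image in detail. *** "
        "Do you think the image is unusual or not?"
    )
    for index, item in enumerate(split_multi_value(question), start=1):
        print(f"Question {index}: {item}")
```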
Run the inference demo for text-to-image generation; you can view the results on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=1.75 generation_timesteps=18 \
mode='t2i' pretrained_model_path=./checkpoints/showo.bin
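To give a feel for what `generation_timesteps` controls, the sketch below computes how many image tokens would remain masked after each step under a MaskGIT-style cosine schedule. This is an assumption for illustration: the exact schedule Show-o uses is not stated here, the token count of 1024 is a hypothetical value, and `cosine_mask_schedule` is not a function from the repository.

```python
import math


def cosine_mask_schedule(num_tokens, num_steps):
    """Tokens still masked after each step under a cosine schedule.

    Assumption: the masked fraction after step t of T is cos(pi/2 * t/T),
    so few tokens are revealed early and the last step reveals the rest.
    """
    remaining = []
    for step in range(1, num_steps + 1):
        ratio = math.cos(0.5 * math.pi * step / num_steps)
        remaining.append(max(0, math.floor(ratio * num_tokens)))
    return remaining


if __name__ == "__main__":
    # 1024 image tokens is a hypothetical sequence length for illustration.
    for step, masked in enumerate(cosine_mask_schedule(1024, 18), start=1):
        print(f"step {step:2d}: {masked} tokens still masked")
```

Under this assumed schedule, fewer steps mean more tokens are committed per step, which trades quality for speed.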
Run the inference demo for text-guided inpainting; you can view the results on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp
Run the inference demo for text-guided extrapolation; you can view the results on wandb.
python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
pretrained_model_path=./checkpoints/showo.bin \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
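The extrapolation command pairs six directions with six prompts, each list joined by `***`. Writing those long strings by hand is error-prone, so here is a small sketch that assembles them. The one-to-one pairing of directions and prompts is inferred from the example rather than documented, and `build_extrapolation_overrides` is a hypothetical helper.

```python
import shlex

SEPARATOR = " *** "


def build_extrapolation_overrides(directions, prompts):
    """Build the extrapolation overrides; expects one prompt per direction."""
    if len(directions) != len(prompts):
        raise ValueError("expected one prompt per extrapolation direction")
    return [
        "mode=extrapolation",
        "extra_direction=" + SEPARATOR.join(directions),
        "prompt=" + SEPARATOR.join(prompts),
    ]


if __name__ == "__main__":
    directions = ["left"] * 3 + ["right"] * 3
    prompt = (
        "a serene natural landscape featuring a clear, blue lake "
        "surrounded by lush green trees."
    )
    overrides = build_extrapolation_overrides(directions, [prompt] * len(directions))
    # shlex.join quotes each override so it can be pasted into a shell.
    print(shlex.join(overrides))
```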
You are welcome to join the discussion and help us continuously improve the user experience of Show-o. Reach us through the Discord channel or the WeChat QR code below!
To cite the paper and model, please use the BibTeX entry below:
@article{xie2024showo,
  title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
  author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2408.12528},
  year={2024}
}
This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.