Correr-Zhou/OmniShow

OmniShow logo

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Donghao Zhou1,*, Guisheng Liu2,*, Hao Yang2, Jiatong Li2,†, Jingyu Lin3, Xiaohu Huang4,
Yichen Liu2, Xin Gao2, Cunjian Chen3, Shilei Wen2,§, Chi-Wing Fu1, Pheng-Ann Heng1,§

1The Chinese University of Hong Kong, 2ByteDance, 3Monash University, 4The University of Hong Kong

*Equal contribution, †Project lead, §Corresponding author

     

🔥 Updates

  • 2026.05: Training and inference code for Wan-based models is released!
  • 2026.05: OmniShow is accepted by ICML 2026! 🎉
  • 2026.04: The data of HOIVG-Bench is available on HuggingFace! 🤗
  • 2026.04: The technical report of OmniShow is released!

🌟 Highlights

  • Multimodal Controllable Model: OmniShow is the first all-in-one model for Human-Object Interaction Video Generation (HOIVG) with text, reference image, audio, and pose conditioning.
  • Flexible Task Coverage: A single model supports R2V, RA2V, RP2V, and RAP2V generation within one coherent framework.
  • Enabling Broader Applications: OmniShow exhibits remarkable versatility in broader applications, such as audio-driven avatars, object swapping, and video remixing.
  • New Benchmark: HOIVG-Bench provides a dedicated and comprehensive benchmark for evaluating HOIVG under diverse multimodal conditions.
OmniShow Overview

🚀 Introducing OmniShow

We propose OmniShow, a video generation model that unifies text, reference image, audio, and pose conditions for HOIVG. It consists of three components:

  1. Unified Channel-wise Conditioning effectively injects reference image and pose cues via unified channel concatenation. It augments noisy video tokens with pseudo-frames, which are supervised by a reference reconstruction loss to preserve semantic details.
  2. Gated Local-Context Attention ensures precise audio-visual synchronization. It packs audio features with sufficient contextual information and injects them via masked attention to align video frames with corresponding audio segments, followed by adaptive gating to stabilize early training.
  3. Decoupled-Then-Joint Training enables efficient use of heterogeneous datasets. We first train specialized R2V and A2V models on separate sub-task datasets, then fuse them via weight interpolation, and finally fine-tune the fused model jointly to unify multimodal capabilities.
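
The weight-interpolation step in item 3 can be pictured with a minimal sketch. It assumes the simplest possible scheme, a per-parameter linear blend with a single coefficient `alpha`. The README does not state the actual interpolation rule or coefficient, and `interpolate_weights` is a hypothetical helper that is not part of the released code. Plain floats stand in for tensors.

```python
from typing import Dict


def interpolate_weights(
    r2v: Dict[str, float], a2v: Dict[str, float], alpha: float = 0.5
) -> Dict[str, float]:
    """Blend two checkpoints with matching keys: (1 - alpha) * r2v + alpha * a2v.

    Hypothetical helper. Values may be any type that supports + and *
    (floats here, tensors in practice).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if r2v.keys() != a2v.keys():
        raise KeyError("checkpoints must share the same parameter names")
    return {name: (1.0 - alpha) * r2v[name] + alpha * a2v[name] for name in r2v}


# Toy "checkpoints" with illustrative parameter names (assumed, not from the repo).
r2v_weights = {"blocks.0.attn.q": 1.0, "blocks.0.attn.k": -2.0}
a2v_weights = {"blocks.0.attn.q": 3.0, "blocks.0.attn.k": 2.0}
merged = interpolate_weights(r2v_weights, a2v_weights, alpha=0.5)
print(merged)
```

In a real merge, parameters that exist in only one of the two models (for example, audio-specific layers) would need separate handling. This sketch simply rejects mismatched key sets.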
OmniShow Pipeline

Learn more details

📊 HOIVG-Bench

To systematically evaluate HOIVG under diverse multimodal conditions, we construct HOIVG-Bench, a dedicated benchmark with 135 carefully curated samples and task-specific metrics. Each sample contains a detailed text caption, a human reference image, an object reference image, semantically aligned audio, and a coherent pose sequence.

HOIVG-Bench

🎬 Demo

Across varied tasks, OmniShow exhibits high-fidelity reference preservation, natural motion dynamics, and precise audio-visual synchronization. Please visit the OmniShow project page for more immersive and diverse video demonstrations.

OmniShow Qualitative Results

🏆 Benchmark Evaluation

OmniShow achieves overall state-of-the-art performance across various multimodal generation tasks, and it is the only model that supports the full RAP2V setting.

Reference-to-Video Generation (R2V)

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HunyuanCustom | 7.523 | 0.440 | 0.359 | 0.452 | 0.697 | 10.11 | 5.286 |
| HuMo-1.7B | 7.087 | 0.647 | 0.333 | 0.441 | 0.723 | 9.76 | 3.406 |
| HuMo-17B | 7.949 | 0.843 | 0.346 | 0.448 | 0.726 | 9.97 | 3.685 |
| VACE | 8.413 | 0.759 | 0.368 | 0.457 | 0.722 | 10.72 | 5.442 |
| Phantom-1.3B | 8.342 | 0.708 | 0.351 | 0.459 | 0.722 | 10.90 | 5.637 |
| Phantom-14B | 8.609 | 0.876 | 0.366 | 0.449 | 0.741 | 10.93 | 5.517 |
| OmniShow (Ours) | 7.746 | 0.874 | 0.389 | 0.468 | 0.740 | 11.12 | 5.885 |

Reference+Audio-to-Video Generation (RA2V)

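| Method | TA↑ | FaceSim↑ | NexusScore↑ | Sync-C↑ | Sync-D↓ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HunyuanCustom | 7.289 | 0.457 | 0.350 | 6.072 | 10.08 | 0.439 | 0.715 | 9.15 | 3.658 |
| HuMo-1.7B | 7.489 | 0.575 | 0.329 | 7.234 | 9.117 | 0.428 | 0.731 | 9.97 | 4.182 |
| HuMo-17B | 8.146 | 0.805 | 0.344 | 8.013 | 8.316 | 0.439 | 0.739 | 10.27 | 4.269 |
| OmniShow (Ours) | 8.093 | 0.810 | 0.369 | 8.612 | 7.608 | 0.465 | 0.742 | 10.86 | 5.554 |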

Reference+Pose-to-Video Generation (RP2V)

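| Method | TA↑ | FaceSim↑ | NexusScore↑ | AKD↓ | PCK↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AnchorCrafter | 2.669 | 0.404 | 0.215 | 0.229 | 0.176 | 0.499 | 0.673 | 8.95 | 4.241 |
| VACE | 7.690 | 0.600 | 0.352 | 0.206 | 0.336 | 0.450 | 0.712 | 10.14 | 5.393 |
| OmniShow (Ours) | 6.526 | 0.474 | 0.418 | 0.174 | 0.460 | 0.447 | 0.722 | 10.28 | 4.937 |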

✅ Todo List

  • Training Code (Wan-Based)
  • Inference Code (Wan-Based)
  • Data of HOIVG-Bench
  • Evaluation Code of HOIVG-Bench

🛠️ Environment Setup

We recommend using a clean Conda environment with Python 3.11:

git clone https://github.com/Correr-Zhou/OmniShow.git
cd OmniShow

conda create -n omnishow python=3.11 -y
conda activate omnishow

pip install -e .
pip install -r requirements.txt

If the default PyTorch installation does not match your CUDA version, reinstall PyTorch manually. For example, for CUDA 12.4:

pip install --index-url https://download.pytorch.org/whl/cu124 \
  torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
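
To confirm which PyTorch build ended up in the environment, a short check such as the one below can help. It is a minimal sketch: `describe_torch_install` is a hypothetical helper name, while `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()` are standard PyTorch attributes.

```python
import importlib
from typing import Any, Dict


def describe_torch_install() -> Dict[str, Any]:
    """Report the installed PyTorch version and the CUDA version it was built for."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(describe_torch_install())
```

If `cuda_build` does not match the CUDA toolkit supported by your driver, reinstalling PyTorch from the matching wheel index as described above is the usual remedy.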

📦 Data and Model Preparation

Download the required Wan backbones, tokenizer assets, Wav2Vec2 audio encoder, and the OmniShow example dataset:

bash download_weights.sh
bash download_data.sh

By default, the scripts organize files as:

OmniShow/
├── models/
│   ├── Wan-AI/Wan2.1-I2V-14B-480P/
│   ├── Wan-AI/Wan2.1-I2V-14B-720P/
│   ├── Wan-AI/Wan2.1-T2V-1.3B/
│   └── facebook/wav2vec2-base-960h/
└── data/
    └── donghao-zhou/OmniShow_example_dataset/
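
After the download scripts finish, the layout above can be checked with a few lines of Python. This is a minimal sketch that only tests whether the listed directories exist. `missing_dirs` is a hypothetical helper, and the check says nothing about whether the files inside are complete.

```python
from pathlib import Path
from typing import List

# Directories taken from the default layout shown above.
EXPECTED_DIRS = [
    "models/Wan-AI/Wan2.1-I2V-14B-480P",
    "models/Wan-AI/Wan2.1-I2V-14B-720P",
    "models/Wan-AI/Wan2.1-T2V-1.3B",
    "models/facebook/wav2vec2-base-960h",
    "data/donghao-zhou/OmniShow_example_dataset",
]


def missing_dirs(repo_root: str) -> List[str]:
    """Return the expected directories that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Run from the OmniShow repository root.
    for rel in missing_dirs("."):
        print(f"missing: {rel}")
```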

The example dataset follows the metadata format below:

| Field | Description |
| --- | --- |
| text_prompt | Text description of the target video. |
| ref_image_human | Relative path to the human reference image. |
| ref_image_object | Relative path to the object reference image. |
| audio | Relative path to the audio file. |
| audio_caption | Textual description of the audio content. |
| pose_video | Relative path to the pose video. |
| target_video | Relative path to the training target video. Used for training metadata. |

This release focuses on reproducing our method on Wan-based models. Checkpoints are not included due to internal policy constraints. The target_video files of the example dataset are generated by OmniShow and intended for checking that the code runs correctly. For HOIVG-Bench, please download the benchmark from HuggingFace.

⚡ Quick Start

Training

Run the default Wan-based OmniShow training script:

bash run_train/train_omnishow_wan.sh

The default script uses the example dataset and trains the r2v setting at 480p. You can edit the following variables in run_train/train_omnishow_wan.sh to switch task, resolution, or data path:

GEN_TASK="r2v"      # r2v / a2v / ra2v / rp2v / rap2v
RESOLUTION="480p"  # 480p / 720p
DATA_FILE="data/donghao-zhou/OmniShow_example_dataset/meta_data_train.csv"
DATASET_BASE_PATH="data/donghao-zhou/OmniShow_example_dataset"

The training entry also supports direct command-line usage:

accelerate launch --config_file run_train/accelerate_config_14B_zero3.yaml \
  --num_processes 8 \
  run_train/train_omnishow_wan.py \
  --dataset_base_path data/donghao-zhou/OmniShow_example_dataset \
  --dataset_metadata_path data/donghao-zhou/OmniShow_example_dataset/meta_data_train.csv \
  --height 832 \
  --width 480 \
  --num_frames 49 \
  --gen_task r2v \
  --model_id_with_origin_paths "Wan-AI/Wan2.1-I2V-14B-480P:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.1-I2V-14B-480P:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-I2V-14B-480P:Wan2.1_VAE.pth,Wan-AI/Wan2.1-I2V-14B-480P:models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
  --trainable_models "dit" \
  --learning_rate 1e-5 \
  --num_epochs 1000 \
  --save_steps 500 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path outputs/train_omnishow_wan_r2v_480p \
  --use_gradient_checkpointing_offload \
  --initialize_model_on_cpu

Inference

Run the default Wan-based OmniShow inference script:

bash run_infer/infer_omnishow_wan.sh

The default script reads meta_data_infer.csv and saves generated videos to outputs/. To use a fine-tuned checkpoint, set DIT_CHECKPOINT in run_infer/infer_omnishow_wan.sh:

DIT_CHECKPOINT="path/to/your/checkpoint.safetensors"

The inference entry also supports direct command-line usage:

python run_infer/infer_omnishow_wan.py \
  --model_id Wan-AI/Wan2.1-I2V-14B-480P \
  --csv data/donghao-zhou/OmniShow_example_dataset/meta_data_infer.csv \
  --base_path data/donghao-zhou/OmniShow_example_dataset \
  --output_dir outputs/infer_omnishow_wan_r2v_480p \
  --gen_task r2v \
  --dit_checkpoint path/to/your/checkpoint.safetensors \
  --height 832 \
  --width 480 \
  --num_frames 49 \
  --num_inference_steps 50 \
  --cfg_scale 6 \
  --seed 42
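
When sweeping over tasks or seeds, it can be convenient to assemble the command above programmatically. The sketch below uses only flags shown in this README. `build_infer_command` is a hypothetical helper, and it assumes that omitting `--dit_checkpoint` is accepted by the Python entry point, mirroring the launcher script's behavior when `DIT_CHECKPOINT` is empty. Please verify this against the script before relying on it.

```python
import shlex
from typing import List, Optional


def build_infer_command(
    gen_task: str,
    dataset_root: str,
    output_dir: str,
    dit_checkpoint: Optional[str] = None,
    seed: int = 42,
) -> List[str]:
    """Assemble the argv list for run_infer/infer_omnishow_wan.py at 480p."""
    cmd = [
        "python", "run_infer/infer_omnishow_wan.py",
        "--model_id", "Wan-AI/Wan2.1-I2V-14B-480P",
        "--csv", f"{dataset_root}/meta_data_infer.csv",
        "--base_path", dataset_root,
        "--output_dir", output_dir,
        "--gen_task", gen_task,
        "--height", "832",
        "--width", "480",
        "--num_frames", "49",
        "--num_inference_steps", "50",
        "--cfg_scale", "6",
        "--seed", str(seed),
    ]
    if dit_checkpoint:
        # Assumption: the flag is optional and the base Wan DiT weights are used without it.
        cmd += ["--dit_checkpoint", dit_checkpoint]
    return cmd


dataset = "data/donghao-zhou/OmniShow_example_dataset"
for seed in (42, 43):
    argv = build_infer_command(
        "r2v", dataset, f"outputs/infer_omnishow_wan_r2v_480p_seed{seed}", seed=seed
    )
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run(argv, check=True)`.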

🧭 Advanced Usage

OmniShow supports the following tasks:

| Task | Conditions | Typical inference CSV fields |
| --- | --- | --- |
| r2v | text + reference images | text_prompt, ref_image_human, ref_image_object |
| a2v | text + first frame + audio | text_prompt, input_image, audio |
| ra2v | text + reference images + audio | text_prompt, ref_image_human, ref_image_object, audio |
| rp2v | text + reference images + pose | text_prompt, ref_image_human, ref_image_object, pose_video |
| rap2v | text + reference images + audio + pose | text_prompt, ref_image_human, ref_image_object, audio, pose_video |
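
The task table can also be read in the other direction: given the columns present in an inference CSV, which tasks could be run? The sketch below encodes the table as a dictionary. `TASK_FIELDS` and `tasks_supported_by` are hypothetical names for illustration and are not part of the released code.

```python
from typing import Dict, Iterable, List

# Typical inference CSV fields per task, copied from the table above.
TASK_FIELDS: Dict[str, List[str]] = {
    "r2v": ["text_prompt", "ref_image_human", "ref_image_object"],
    "a2v": ["text_prompt", "input_image", "audio"],
    "ra2v": ["text_prompt", "ref_image_human", "ref_image_object", "audio"],
    "rp2v": ["text_prompt", "ref_image_human", "ref_image_object", "pose_video"],
    "rap2v": ["text_prompt", "ref_image_human", "ref_image_object", "audio", "pose_video"],
}


def tasks_supported_by(columns: Iterable[str]) -> List[str]:
    """List the tasks whose typical inference fields are all present in columns."""
    available = set(columns)
    return [task for task, fields in TASK_FIELDS.items() if set(fields) <= available]


columns = ["text_prompt", "ref_image_human", "ref_image_object", "audio"]
print(tasks_supported_by(columns))  # → ['r2v', 'ra2v']
```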

Both training and inference scripts expose the same task switch:

GEN_TASK="r2v"  # r2v / a2v / ra2v / rp2v / rap2v

To switch resolution, you can also edit RESOLUTION in the corresponding script:

RESOLUTION="480p"  # 480p / 720p

The scripts automatically select the matching base model.

If you use a custom aspect ratio or resolution, also check the HEIGHT and WIDTH values in the script.

For training, the most commonly edited variables in run_train/train_omnishow_wan.sh are:

GEN_TASK="r2v"
RESOLUTION="480p"
DATA_FILE="path/to/your/meta_data_train.csv"
DATASET_BASE_PATH="path/to/your/dataset_root"
OUTPUT_DIR="outputs/train_omnishow_wan_${GEN_TASK}_${RESOLUTION}"
LAUNCH_NUM_PROCESSES=8
LEARNING_RATE="1e-5"
NUM_EPOCHS=1000
SAVE_STEPS=500
NUM_FRAMES=49

For inference, the most commonly edited variables in run_infer/infer_omnishow_wan.sh are:

GEN_TASK="r2v"
RESOLUTION="480p"
DATA_FILE="path/to/your/meta_data_infer.csv"
DATASET_BASE_PATH="path/to/your/dataset_root"
OUTPUT_DIR="outputs/infer_omnishow_wan_${GEN_TASK}_${RESOLUTION}"
DIT_CHECKPOINT="path/to/your/checkpoint.safetensors"
NUM_INFERENCE_STEPS=50
CFG_SCALE=6
SEED=42

If DIT_CHECKPOINT is left empty, inference uses the base Wan DiT weights. Set it when evaluating a fine-tuned checkpoint.

🧾 Preparing Your Own Dataset

To use your own data, follow the same CSV-driven format as the example dataset. All media paths in the CSV should be relative to DATASET_BASE_PATH.

A typical dataset can be organized as:

your_dataset/
├── meta_data_train.csv
├── meta_data_infer.csv
├── ref_image_human/
├── ref_image_object/
├── input_image/
├── audio/
├── pose_video/
└── target_video/       # training only
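
For a dataset laid out this way, the inference metadata file can be generated rather than written by hand. The sketch below handles the r2v case only and rests on two assumptions that are not requirements of OmniShow: reference images are PNG files, and the human and object images of one sample share the same file stem. `build_r2v_rows` and `write_metadata` are hypothetical helpers.

```python
import csv
from pathlib import Path
from typing import Dict, List

FIELDS = ["text_prompt", "ref_image_human", "ref_image_object", "output_name"]


def build_r2v_rows(dataset_root: str, prompts: Dict[str, str]) -> List[Dict[str, str]]:
    """Pair human/object reference images by file stem and attach a prompt."""
    root = Path(dataset_root)
    rows = []
    for human in sorted((root / "ref_image_human").glob("*.png")):
        stem = human.stem
        obj = root / "ref_image_object" / f"{stem}.png"
        if not obj.is_file() or stem not in prompts:
            continue  # skip samples without an object image or a prompt
        rows.append({
            "text_prompt": prompts[stem],
            # Paths are stored relative to the dataset root.
            "ref_image_human": human.relative_to(root).as_posix(),
            "ref_image_object": obj.relative_to(root).as_posix(),
            "output_name": f"sample_{stem}",
        })
    return rows


def write_metadata(rows: List[Dict[str, str]], csv_path: str) -> None:
    """Write rows with the csv module so that commas in prompts are quoted correctly."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A typical call would be `write_metadata(build_r2v_rows("your_dataset", prompts), "your_dataset/meta_data_infer.csv")`, where `prompts` maps each file stem to its text prompt.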

Training metadata should include target_video, while inference metadata does not need it. For a2v training, the first frame is taken from target_video; for a2v inference, provide input_image. The detailed requirements for CSV fields are as follows:

| Field | Required for | Description |
| --- | --- | --- |
| text_prompt | all tasks | Text description of the target video. |
| ref_image_human | r2v, ra2v, rp2v, rap2v | Relative path to the human reference image. |
| ref_image_object | r2v, ra2v, rp2v, rap2v | Relative path to the object reference image. |
| input_image | a2v inference | Relative path to the first-frame image. |
| audio | a2v, ra2v, rap2v | Relative path to the audio file. |
| audio_caption | optional | Textual description of the audio content. |
| pose_video | rp2v, rap2v | Relative path to the pose video. |
| target_video | training only | Relative path to the target video used for supervision. Also provides the first frame for a2v training. |
| output_name | inference optional | Output filename stem for the generated video. |
| negative_prompt | inference optional | Per-sample negative prompt. If omitted, the default negative prompt is used. |
| seed | inference optional | Per-sample random seed. If omitted, the script-level seed is used. |
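
Before launching a long run, it may be worth checking a metadata file against these requirements. The sketch below is one way to do so under the rules in the table. `required_fields` and `validate_metadata` are hypothetical helpers and not part of the released code, and reported row numbers assume one CSV record per line.

```python
import csv
from pathlib import Path
from typing import List

# Task-specific fields from the table above (text_prompt is required for all tasks).
TASK_REQUIRED = {
    "r2v": ["ref_image_human", "ref_image_object"],
    "a2v": ["audio"],
    "ra2v": ["ref_image_human", "ref_image_object", "audio"],
    "rp2v": ["ref_image_human", "ref_image_object", "pose_video"],
    "rap2v": ["ref_image_human", "ref_image_object", "audio", "pose_video"],
}
PATH_FIELDS = {
    "ref_image_human", "ref_image_object", "input_image",
    "audio", "pose_video", "target_video",
}


def required_fields(gen_task: str, training: bool) -> List[str]:
    """Fields that must be filled for a task in training or inference mode."""
    fields = ["text_prompt"] + TASK_REQUIRED[gen_task]
    if training:
        fields.append("target_video")  # also supplies the first frame for a2v
    elif gen_task == "a2v":
        fields.append("input_image")
    return fields


def validate_metadata(csv_path: str, base_path: str, gen_task: str, training: bool) -> List[str]:
    """Return human-readable problems; an empty list means the file looks usable."""
    problems = []
    needed = required_fields(gen_task, training)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # The header is line 1, so data rows start at line 2.
        for row_number, row in enumerate(csv.DictReader(handle), start=2):
            for field in needed:
                value = (row.get(field) or "").strip()
                if not value:
                    problems.append(f"row {row_number}: missing {field}")
                elif field in PATH_FIELDS and not (Path(base_path) / value).is_file():
                    problems.append(f"row {row_number}: {field} not found: {value}")
    return problems
```

For example, `validate_metadata("your_dataset/meta_data_train.csv", "your_dataset", "ra2v", training=True)` returns a list with one entry per missing value or unresolved path.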

Example training row:

text_prompt,ref_image_human,ref_image_object,audio,audio_caption,pose_video,target_video
"A person presents an object to the camera.",ref_image_human/0001.png,ref_image_object/0001.png,audio/0001.wav,"Object introduction speech.",pose_video/0001.mp4,target_video/0001.mp4

Example inference row:

text_prompt,ref_image_human,ref_image_object,audio,audio_caption,pose_video,output_name
"A person presents an object to the camera.",ref_image_human/0001.png,ref_image_object/0001.png,audio/0001.wav,"Object introduction speech.",pose_video/0001.mp4,sample_0001

🗂️ File Structure

The released code is organized around the OmniShow training and inference workflow:

OmniShow/
├── assets/                         # Figures used in this README.
├── diffsynth/                      # Core framework and OmniShow implementation.
│   ├── configs/
│   ├── core/
│   ├── diffusion/
│   ├── models/
│   ├── modules/
│   ├── pipelines/
│   ├── utils/
│   ├── __init__.py
│   └── version.py
├── run_train/                      # Training entrypoint, launcher script, and Accelerate config.
│   ├── accelerate_config_14B_zero3.yaml
│   ├── train_omnishow_wan.py
│   └── train_omnishow_wan.sh
├── run_infer/                      # Inference entrypoint and example launcher script.
│   ├── infer_omnishow_wan.py
│   └── infer_omnishow_wan.sh
├── download_weights.sh             # Downloads Wan, tokenizer, and audio encoder weights.
├── download_data.sh                # Downloads the OmniShow example dataset.
├── requirements.txt                # Python dependencies used by the release.
└── README.md                       # Project overview and usage instructions.

⚖️ Ethics

OmniShow is released for research purposes. The code and data are intended to support responsible study of video generation. Please adhere to the following guidelines:

  • Do not use the model for identity misuse, impersonation, harassment, deception, or other harmful content generation.
  • Respect the licenses and usage restrictions of the underlying Wan models, Wav2Vec2, datasets, and any input media.
  • When using personal images, voices, or videos, obtain proper consent and follow applicable laws and platform policies.
  • Generated content should be clearly disclosed when used in public-facing scenarios.

🤝 Acknowledgements

This codebase was built upon DiffSynth-Studio. We sincerely thank the contributors of this project for their excellent code.

🔗 Citation

If you find OmniShow useful or inspiring, please consider giving us a ⭐ on GitHub. Your support helps more people discover the project!

If OmniShow is helpful for your research or projects, please consider citing our work:

@article{zhou2026omnishow,
  title={OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation},
  author={Zhou, Donghao and Liu, Guisheng and Yang, Hao and Li, Jiatong and Lin, Jingyu and Huang, Xiaohu and Liu, Yichen and Gao, Xin and Chen, Cunjian and Wen, Shilei and Fu, Chi-Wing and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2604.11804},
  year={2026}
}

📬 Contact

For questions about OmniShow, please contact Donghao Zhou at [email protected].
