Yichen Liu2, Xin Gao2, Cunjian Chen3, Shilei Wen2,§, Chi-Wing Fu1, Pheng-Ann Heng1,§
- 2026.05: Training and inference code for Wan-based models is released!
- 2026.05: OmniShow is accepted by ICML 2026!
- 2026.04: The data of HOIVG-Bench is available on HuggingFace!
- 2026.04: The technical report of OmniShow is released!
- Multimodal Controllable Model: OmniShow is the first all-in-one model for Human-Object Interaction Video Generation (HOIVG) with text, reference image, audio, and pose conditioning.
- Flexible Task Coverage: A single model supports R2V, RA2V, RP2V, and RAP2V generation within one coherent framework.
- Enabling Broader Applications: OmniShow also extends to related applications such as audio-driven avatars, object swapping, and video remixing.
- New Benchmark: HOIVG-Bench provides a dedicated and comprehensive benchmark for evaluating HOIVG under diverse multimodal conditions.
We propose OmniShow, a video generation model that unifies text, reference image, audio, and pose conditions for HOIVG. It consists of three components:
- Unified Channel-wise Conditioning effectively injects reference image and pose cues via unified channel concatenation. It augments noisy video tokens with pseudo-frames, which are supervised by a reference reconstruction loss to preserve semantic details.
- Gated Local-Context Attention ensures precise audio-visual synchronization. It packs audio features with sufficient contextual information and injects them via masked attention to align video frames with corresponding audio segments, followed by adaptive gating to stabilize early training.
- Decoupled-Then-Joint Training enables efficient use of heterogeneous datasets. We first train specialized R2V and A2V models on separate sub-task datasets, then fuse them via weight interpolation, followed by joint fine-tuning to unify multimodal capabilities.
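The weight-interpolation step of Decoupled-Then-Joint Training can be illustrated with a minimal sketch. It is not the released implementation: the function name `interpolate_weights`, the coefficient `alpha`, and the toy parameter names are hypothetical, and the sketch assumes both specialized models expose exactly the same parameter names and shapes (parameters that exist in only one model, such as audio-specific layers, would need separate handling).

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def interpolate_weights(r2v: StateDict, a2v: StateDict, alpha: float = 0.5) -> StateDict:
    """Return alpha * r2v + (1 - alpha) * a2v for every shared parameter.

    `alpha` is a hypothetical mixing coefficient; the source does not state
    which value (or schedule) is used.
    """
    if r2v.keys() != a2v.keys():
        raise ValueError("both state dicts must contain the same parameter names")
    fused: StateDict = {}
    for name, w_r in r2v.items():
        w_a = a2v[name]
        if len(w_r) != len(w_a):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        fused[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(w_r, w_a)]
    return fused


# Hypothetical parameters from two specialized models.
r2v_weights = {"blocks.0.weight": [1.0, 2.0], "blocks.0.bias": [0.0]}
a2v_weights = {"blocks.0.weight": [3.0, 4.0], "blocks.0.bias": [1.0]}
fused_weights = interpolate_weights(r2v_weights, a2v_weights, alpha=0.5)
```

The fused weights would then serve as the initialization for joint fine-tuning, as described above.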
To systematically evaluate HOIVG under diverse multimodal conditions, we construct HOIVG-Bench, a dedicated benchmark with 135 carefully curated samples and task-specific metrics. Each sample contains a detailed text caption, a human reference image, an object reference image, semantically aligned audio, and a coherent pose sequence.
Across varied tasks, OmniShow exhibits high-fidelity reference preservation, natural motion dynamics, and precise audio-visual synchronization. Please visit the OmniShow project page for more immersive and diverse video demonstrations.
OmniShow achieves overall state-of-the-art performance across various multimodal generation tasks, and it is the only model that supports the full RAP2V setting.
| Method | TA↑ | FaceSim↑ | NexusScore↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.523 | 0.440 | 0.359 | 0.452 | 0.697 | 10.11 | 5.286 |
| HuMo-1.7B | 7.087 | 0.647 | 0.333 | 0.441 | 0.723 | 9.76 | 3.406 |
| HuMo-17B | 7.949 | 0.843 | 0.346 | 0.448 | 0.726 | 9.97 | 3.685 |
| VACE | 8.413 | 0.759 | 0.368 | 0.457 | 0.722 | 10.72 | 5.442 |
| Phantom-1.3B | 8.342 | 0.708 | 0.351 | 0.459 | 0.722 | 10.90 | 5.637 |
| Phantom-14B | 8.609 | 0.876 | 0.366 | 0.449 | 0.741 | 10.93 | 5.517 |
| OmniShow (Ours) | 7.746 | 0.874 | 0.389 | 0.468 | 0.740 | 11.12 | 5.885 |
| Method | TA↑ | FaceSim↑ | NexusScore↑ | Sync-C↑ | Sync-D↓ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.289 | 0.457 | 0.350 | 6.072 | 10.08 | 0.439 | 0.715 | 9.15 | 3.658 |
| HuMo-1.7B | 7.489 | 0.575 | 0.329 | 7.234 | 9.117 | 0.428 | 0.731 | 9.97 | 4.182 |
| HuMo-17B | 8.146 | 0.805 | 0.344 | 8.013 | 8.316 | 0.439 | 0.739 | 10.27 | 4.269 |
| OmniShow (Ours) | 8.093 | 0.810 | 0.369 | 8.612 | 7.608 | 0.465 | 0.742 | 10.86 | 5.554 |
| Method | TA↑ | FaceSim↑ | NexusScore↑ | AKD↓ | PCK↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| AnchorCrafter | 2.669 | 0.404 | 0.215 | 0.229 | 0.176 | 0.499 | 0.673 | 8.95 | 4.241 |
| VACE | 7.690 | 0.600 | 0.352 | 0.206 | 0.336 | 0.450 | 0.712 | 10.14 | 5.393 |
| OmniShow (Ours) | 6.526 | 0.474 | 0.418 | 0.174 | 0.460 | 0.447 | 0.722 | 10.28 | 4.937 |
- Training Code (Wan-Based)
- Inference Code (Wan-Based)
- Data of HOIVG-Bench
- Evaluation Code of HOIVG-Bench
We recommend using a clean Conda environment with Python 3.11:
git clone https://github.com/Correr-Zhou/OmniShow.git
cd OmniShow
conda create -n omnishow python=3.11 -y
conda activate omnishow
pip install -e .
pip install -r requirements.txt
If the default PyTorch installation does not match your CUDA version, reinstall PyTorch manually. For example, for CUDA 12.4:
pip install --index-url https://download.pytorch.org/whl/cu124 \
    torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
Download the required Wan backbones, tokenizer assets, Wav2Vec2 audio encoder, and the OmniShow example dataset:
bash download_weights.sh
bash download_data.sh
By default, the scripts organize files as:
OmniShow/
├── models/
│   ├── Wan-AI/Wan2.1-I2V-14B-480P/
│   ├── Wan-AI/Wan2.1-I2V-14B-720P/
│   ├── Wan-AI/Wan2.1-T2V-1.3B/
│   └── facebook/wav2vec2-base-960h/
└── data/
    └── donghao-zhou/OmniShow_example_dataset/
The example dataset follows the metadata format below:
| Field | Description |
|---|---|
| text_prompt | Text description of the target video. |
| ref_image_human | Relative path to the human reference image. |
| ref_image_object | Relative path to the object reference image. |
| audio | Relative path to the audio file. |
| audio_caption | Textual description of the audio content. |
| pose_video | Relative path to the pose video. |
| target_video | Relative path to the training target video. Used for training metadata. |
This release focuses on reproducing our method on Wan-based models. Checkpoints are not included due to internal policy constraints.
The target_video files of the example dataset are generated by OmniShow and intended for checking that the code runs correctly.
For HOIVG-Bench, please download the benchmark from HuggingFace.
Run the default Wan-based OmniShow training script:
bash run_train/train_omnishow_wan.sh
The default script uses the example dataset and trains the r2v setting at 480p. You can edit the following variables in run_train/train_omnishow_wan.sh to switch task, resolution, or data path:
GEN_TASK="r2v" # r2v / a2v / ra2v / rp2v / rap2v
RESOLUTION="480p" # 480p / 720p
DATA_FILE="data/donghao-zhou/OmniShow_example_dataset/meta_data_train.csv"
DATASET_BASE_PATH="data/donghao-zhou/OmniShow_example_dataset"
The training entry also supports direct command-line usage:
accelerate launch --config_file run_train/accelerate_config_14B_zero3.yaml \
--num_processes 8 \
run_train/train_omnishow_wan.py \
--dataset_base_path data/donghao-zhou/OmniShow_example_dataset \
--dataset_metadata_path data/donghao-zhou/OmniShow_example_dataset/meta_data_train.csv \
--height 832 \
--width 480 \
--num_frames 49 \
--gen_task r2v \
--model_id_with_origin_paths "Wan-AI/Wan2.1-I2V-14B-480P:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.1-I2V-14B-480P:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-I2V-14B-480P:Wan2.1_VAE.pth,Wan-AI/Wan2.1-I2V-14B-480P:models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
--trainable_models "dit" \
--learning_rate 1e-5 \
--num_epochs 1000 \
--save_steps 500 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path outputs/train_omnishow_wan_r2v_480p \
--use_gradient_checkpointing_offload \
--initialize_model_on_cpu
Run the default Wan-based OmniShow inference script:
bash run_infer/infer_omnishow_wan.sh
The default script reads meta_data_infer.csv and saves generated videos to outputs/. To use a fine-tuned checkpoint, set DIT_CHECKPOINT in run_infer/infer_omnishow_wan.sh:
DIT_CHECKPOINT="path/to/your/checkpoint.safetensors"
The inference entry also supports direct command-line usage:
python run_infer/infer_omnishow_wan.py \
--model_id Wan-AI/Wan2.1-I2V-14B-480P \
--csv data/donghao-zhou/OmniShow_example_dataset/meta_data_infer.csv \
--base_path data/donghao-zhou/OmniShow_example_dataset \
--output_dir outputs/infer_omnishow_wan_r2v_480p \
--gen_task r2v \
--dit_checkpoint path/to/your/checkpoint.safetensors \
--height 832 \
--width 480 \
--num_frames 49 \
--num_inference_steps 50 \
--cfg_scale 6 \
--seed 42
OmniShow supports the following tasks:
| Task | Conditions | Typical inference CSV fields |
|---|---|---|
| r2v | text + reference images | text_prompt, ref_image_human, ref_image_object |
| a2v | text + first frame + audio | text_prompt, input_image, audio |
| ra2v | text + reference images + audio | text_prompt, ref_image_human, ref_image_object, audio |
| rp2v | text + reference images + pose | text_prompt, ref_image_human, ref_image_object, pose_video |
| rap2v | text + reference images + audio + pose | text_prompt, ref_image_human, ref_image_object, audio, pose_video |
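The task table above can be expressed as a small lookup, which is handy for checking an inference row before launching a run. This is an illustrative sketch rather than part of the released code: `TASK_FIELDS` and `missing_fields` are hypothetical names, and the field lists are copied from the table.

```python
from typing import Dict, List, Mapping, Optional

BASE_FIELDS = ["text_prompt"]
REFERENCE_FIELDS = ["ref_image_human", "ref_image_object"]

# Typical inference CSV fields per task, transcribed from the table above.
TASK_FIELDS: Dict[str, List[str]] = {
    "r2v": BASE_FIELDS + REFERENCE_FIELDS,
    "a2v": BASE_FIELDS + ["input_image", "audio"],
    "ra2v": BASE_FIELDS + REFERENCE_FIELDS + ["audio"],
    "rp2v": BASE_FIELDS + REFERENCE_FIELDS + ["pose_video"],
    "rap2v": BASE_FIELDS + REFERENCE_FIELDS + ["audio", "pose_video"],
}


def missing_fields(task: str, row: Mapping[str, Optional[str]]) -> List[str]:
    """Return the fields the given task needs that are absent or empty in `row`."""
    if task not in TASK_FIELDS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(TASK_FIELDS)}")
    return [name for name in TASK_FIELDS[task] if not (row.get(name) or "").strip()]


# A row that is complete for r2v but lacks audio for ra2v.
example_row = {
    "text_prompt": "A person presents an object to the camera.",
    "ref_image_human": "ref_image_human/0001.png",
    "ref_image_object": "ref_image_object/0001.png",
}
print(missing_fields("r2v", example_row))   # nothing missing
print(missing_fields("ra2v", example_row))  # audio is missing
```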
Both training and inference scripts expose the same task switch:
GEN_TASK="r2v" # r2v / a2v / ra2v / rp2v / rap2v
To switch resolution, you can also edit RESOLUTION in the corresponding script:
RESOLUTION="480p" # 480p / 720p
The scripts automatically select the matching base model.
If you use a custom aspect ratio or resolution, also check the HEIGHT and WIDTH values in the script.
For training, the most commonly edited variables in run_train/train_omnishow_wan.sh are:
GEN_TASK="r2v"
RESOLUTION="480p"
DATA_FILE="path/to/your/meta_data_train.csv"
DATASET_BASE_PATH="path/to/your/dataset_root"
OUTPUT_DIR="outputs/train_omnishow_wan_${GEN_TASK}_${RESOLUTION}"
LAUNCH_NUM_PROCESSES=8
LEARNING_RATE="1e-5"
NUM_EPOCHS=1000
SAVE_STEPS=500
NUM_FRAMES=49
For inference, the most commonly edited variables in run_infer/infer_omnishow_wan.sh are:
GEN_TASK="r2v"
RESOLUTION="480p"
DATA_FILE="path/to/your/meta_data_infer.csv"
DATASET_BASE_PATH="path/to/your/dataset_root"
OUTPUT_DIR="outputs/infer_omnishow_wan_${GEN_TASK}_${RESOLUTION}"
DIT_CHECKPOINT="path/to/your/checkpoint.safetensors"
NUM_INFERENCE_STEPS=50
CFG_SCALE=6
SEED=42
If DIT_CHECKPOINT is left empty, inference uses the base Wan DiT weights. Set it when evaluating a fine-tuned checkpoint.
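For scripted sweeps over tasks or checkpoints, the inference command can be assembled programmatically. The sketch below is a hypothetical helper, not part of the release. It uses only the flags that appear in the direct command-line example and makes two assumptions: that omitting --dit_checkpoint on the command line behaves like an empty DIT_CHECKPOINT in the shell script, and that the 720p setting maps to the Wan-AI/Wan2.1-I2V-14B-720P backbone. HEIGHT and WIDTH default to the 480p example values and should be set explicitly for other resolutions.

```python
import shlex
from typing import List, Optional


def build_infer_command(
    gen_task: str,
    dataset_root: str,
    resolution: str = "480p",
    dit_checkpoint: Optional[str] = None,
    height: int = 832,
    width: int = 480,
    num_frames: int = 49,
    num_inference_steps: int = 50,
    cfg_scale: float = 6,
    seed: int = 42,
) -> List[str]:
    """Assemble the argument list for run_infer/infer_omnishow_wan.py."""
    cmd = [
        "python", "run_infer/infer_omnishow_wan.py",
        # Assumption: the backbone name follows the resolution (480p -> ...-480P).
        "--model_id", f"Wan-AI/Wan2.1-I2V-14B-{resolution.upper()}",
        "--csv", f"{dataset_root}/meta_data_infer.csv",
        "--base_path", dataset_root,
        "--output_dir", f"outputs/infer_omnishow_wan_{gen_task}_{resolution}",
        "--gen_task", gen_task,
        "--height", str(height),
        "--width", str(width),
        "--num_frames", str(num_frames),
        "--num_inference_steps", str(num_inference_steps),
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
    ]
    # Assumption: leaving the flag out falls back to the base Wan DiT weights.
    if dit_checkpoint:
        cmd += ["--dit_checkpoint", dit_checkpoint]
    return cmd


base_cmd = build_infer_command("r2v", "data/donghao-zhou/OmniShow_example_dataset")
tuned_cmd = build_infer_command(
    "r2v",
    "data/donghao-zhou/OmniShow_example_dataset",
    dit_checkpoint="path/to/your/checkpoint.safetensors",
)
print(shlex.join(tuned_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.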
To use your own data, follow the same CSV-driven format as the example dataset. All media paths in the CSV should be relative to DATASET_BASE_PATH.
A typical dataset can be organized as:
your_dataset/
├── meta_data_train.csv
├── meta_data_infer.csv
├── ref_image_human/
├── ref_image_object/
├── input_image/
├── audio/
├── pose_video/
└── target_video/ # training only
Training metadata should include target_video, while inference metadata does not need it. For a2v training, the first frame is taken from target_video; for a2v inference, provide input_image.
The detailed requirements for CSV fields are as follows:
| Field | Required for | Description |
|---|---|---|
| text_prompt | all tasks | Text description of the target video. |
| ref_image_human | r2v, ra2v, rp2v, rap2v | Relative path to the human reference image. |
| ref_image_object | r2v, ra2v, rp2v, rap2v | Relative path to the object reference image. |
| input_image | a2v inference | Relative path to the first-frame image. |
| audio | a2v, ra2v, rap2v | Relative path to the audio file. |
| audio_caption | optional | Textual description of the audio content. |
| pose_video | rp2v, rap2v | Relative path to the pose video. |
| target_video | training only | Relative path to the target video used for supervision. Also provides the first frame for a2v training. |
| output_name | inference optional | Output filename stem for the generated video. |
| negative_prompt | inference optional | Per-sample negative prompt. If omitted, the default negative prompt is used. |
| seed | inference optional | Per-sample random seed. If omitted, the script-level seed is used. |
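Before training or inference on custom data, it can help to check a metadata CSV against the requirements listed above. The following is a minimal sketch under stated assumptions: `validate_metadata` is a hypothetical helper (the release does not document such a function), it only checks that required fields are present and that media paths are relative, and it does not verify that the files exist.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import List

# Media fields and the tasks that require them, transcribed from the table above.
MEDIA_FIELDS = {
    "ref_image_human": {"r2v", "ra2v", "rp2v", "rap2v"},
    "ref_image_object": {"r2v", "ra2v", "rp2v", "rap2v"},
    "audio": {"a2v", "ra2v", "rap2v"},
    "pose_video": {"rp2v", "rap2v"},
}


def validate_metadata(csv_text: str, task: str, training: bool) -> List[str]:
    """Return human-readable problems found in the metadata CSV text."""
    required = ["text_prompt"] + [f for f, tasks in MEDIA_FIELDS.items() if task in tasks]
    if training:
        required.append("target_video")  # also supplies the first frame for a2v
    elif task == "a2v":
        required.append("input_image")   # first frame must be given at inference

    errors: List[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is line 1, so the first data row is reported as line 2.
    for line_no, row in enumerate(reader, start=2):
        for field in required:
            value = (row.get(field) or "").strip()
            if not value:
                errors.append(f"line {line_no}: missing {field}")
            elif field != "text_prompt" and PurePosixPath(value).is_absolute():
                errors.append(f"line {line_no}: {field} must be relative to DATASET_BASE_PATH")
    return errors


sample_csv = (
    "text_prompt,ref_image_human,ref_image_object,target_video\n"
    '"A person presents an object.",ref_image_human/0001.png,/abs/0001.png,target_video/0001.mp4\n'
)
for problem in validate_metadata(sample_csv, task="r2v", training=True):
    print(problem)
```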
Example training row:
text_prompt,ref_image_human,ref_image_object,audio,audio_caption,pose_video,target_video
"A person presents an object to the camera.",ref_image_human/0001.png,ref_image_object/0001.png,audio/0001.wav,"Object introduction speech.",pose_video/0001.mp4,target_video/0001.mp4
Example inference row:
text_prompt,ref_image_human,ref_image_object,audio,audio_caption,pose_video,output_name
"A person presents an object to the camera.",ref_image_human/0001.png,ref_image_object/0001.png,audio/0001.wav,"Object introduction speech.",pose_video/0001.mp4,sample_0001
The released code is organized around the OmniShow training and inference workflow:
OmniShow/
├── assets/                 # Figures used in this README.
├── diffsynth/              # Core framework and OmniShow implementation.
│   ├── configs/
│   ├── core/
│   ├── diffusion/
│   ├── models/
│   ├── modules/
│   ├── pipelines/
│   ├── utils/
│   ├── __init__.py
│   └── version.py
├── run_train/              # Training entrypoint, launcher script, and Accelerate config.
│   ├── accelerate_config_14B_zero3.yaml
│   ├── train_omnishow_wan.py
│   └── train_omnishow_wan.sh
├── run_infer/              # Inference entrypoint and example launcher script.
│   ├── infer_omnishow_wan.py
│   └── infer_omnishow_wan.sh
├── download_weights.sh     # Downloads Wan, tokenizer, and audio encoder weights.
├── download_data.sh        # Downloads the OmniShow example dataset.
├── requirements.txt        # Python dependencies used by the release.
└── README.md               # Project overview and usage instructions.
OmniShow is released for research purposes. The code and data are intended to support responsible study of video generation. Please observe the following guidelines:
- Do not use the model for identity misuse, impersonation, harassment, deception, or other harmful content generation.
- Respect the licenses and usage restrictions of the underlying Wan models, Wav2Vec2, datasets, and any input media.
- When using personal images, voices, or videos, obtain proper consent and follow applicable laws and platform policies.
- Generated content should be clearly disclosed when used in public-facing scenarios.
This codebase was built upon DiffSynth-Studio. We sincerely thank the contributors of this project for their excellent code.
If you find OmniShow useful or inspiring, please consider giving us a star on GitHub. Your support helps more people discover the project!
If OmniShow is helpful for your research or projects, please consider citing our work:
@article{zhou2026omnishow,
  title={OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation},
  author={Zhou, Donghao and Liu, Guisheng and Yang, Hao and Li, Jiatong and Lin, Jingyu and Huang, Xiaohu and Liu, Yichen and Gao, Xin and Chen, Cunjian and Wen, Shilei and Fu, Chi-Wing and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2604.11804},
  year={2026}
}
For questions about OmniShow, please contact Donghao Zhou at [email protected].




