LongVie 2 is a multimodal controllable world model that generates ultra-long videos guided by depth and pointmap control signals.
Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu†, Chenyang Si†, Ziwei Liu†
conda create -n longvie python=3.10 -y
conda activate longvie
conda install psutil
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/[email protected]
cd LongVie
pip install -e .
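After installing, a quick sanity check (a minimal sketch, not part of the repo) can confirm that the CUDA build of PyTorch sees the GPU and that FlashAttention built correctly:

```python
# Minimal environment sanity check (not part of the repo): verifies the
# CUDA-enabled PyTorch build and the FlashAttention wheel from the steps above.
import torch

assert torch.__version__.startswith("2.5.1"), torch.__version__
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"

import flash_attn  # raises ImportError if the build failed

print(f"torch {torch.__version__}, flash-attn {flash_attn.__version__}, "
      f"device: {torch.cuda.get_device_name(0)}")
```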
- Download the base model Wan2.1-I2V-14B-480P:
python download_wan2.1.py

- Download the LongVie2 weights and place them in ./model/LongVie/.
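If you prefer to fetch the base model manually, something like the following should be equivalent to the download script (the local directory is an assumption; check download_wan2.1.py for the layout the repo actually expects):

```python
# Hypothetical manual download of the Wan2.1 base model via huggingface_hub.
# The local_dir below is an assumption; download_wan2.1.py defines the real layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-I2V-14B-480P",     # official Wan2.1 I2V 480P checkpoint
    local_dir="./model/Wan2.1-I2V-14B-480P",  # assumed target directory
)
```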
Generate a 5s video clip (~8-9 mins on a single A100 GPU):
bash sample_longvideo.sh
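Ultra-long generation works clip by clip: each ~5s segment is conditioned on the last frame of the previous one, plus the depth and pointmap controls for that time window. Below is a schematic, self-contained sketch of that loop with a dummy stand-in for the model call (every name here is hypothetical, not the repo's API; sample_longvideo.sh is the real entry point):

```python
# Schematic clip-by-clip long-video loop; `generate_clip` is a hypothetical
# stand-in for the diffusion model (see sample_longvideo.sh for the real entry).
import numpy as np

def generate_clip(first_frame, depth, pointmap):
    """Placeholder for the model's per-clip generation."""
    t = depth.shape[0]
    return np.repeat(first_frame[None], t, axis=0)  # dummy: repeat the frame

clip_len, total = 81, 81 * 6                  # e.g. six ~5s clips at 16 fps
first = np.zeros((480, 832, 3), np.uint8)     # stand-in for the input image
depth = np.zeros((total, 480, 832), np.float32)
points = np.zeros((total, 480, 832, 3), np.float32)

frames = []
for s in range(0, total, clip_len):
    clip = generate_clip(first, depth[s:s + clip_len], points[s:s + clip_len])
    frames.extend(clip)
    first = clip[-1]  # next clip continues from the last generated frame
```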
To train LongVie 2:

bash train.sh

We provide utilities for extracting control signals in ./utils:

# Extract depth maps
bash get_depth.sh
# Convert depth to .mp4 format
python depth_npy2mp4.py
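The conversion is essentially per-frame normalization of the depth maps to 8-bit followed by video encoding. A minimal sketch (file names are assumptions, and imageio with the ffmpeg plugin is assumed; see depth_npy2mp4.py for the repo's actual I/O conventions):

```python
# Minimal sketch of depth .npy -> .mp4 conversion (file names are assumptions;
# see depth_npy2mp4.py for the repo's actual conventions).
import numpy as np
import imageio  # requires imageio-ffmpeg for .mp4 output

depth = np.load("depth.npy")                 # (T, H, W) float depth maps
lo, hi = depth.min(), depth.max()
gray = ((depth - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
rgb = np.stack([gray] * 3, axis=-1)          # grayscale -> 3-channel frames
imageio.mimwrite("depth.mp4", rgb, fps=16)
```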
# Extract trajectory
python get_track.py
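For intuition, dense point trajectories of this kind can be extracted with an off-the-shelf tracker such as CoTracker via torch.hub. The sketch below is an illustration under that assumption, not necessarily what get_track.py does internally:

```python
# Illustrative point-trajectory extraction with CoTracker via torch.hub;
# not necessarily what get_track.py uses internally.
import numpy as np
import torch
import imageio

frames = np.stack(imageio.mimread("input.mp4", memtest=False))       # (T, H, W, 3)
video = torch.from_numpy(frames).permute(0, 3, 1, 2)[None].float()   # (1, T, 3, H, W)

tracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline")
tracks, visibility = tracker(video, grid_size=10)  # (1, T, N, 2), (1, T, N)
np.save("tracks.npy", tracks[0].cpu().numpy())
```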
To refine prompts after editing the first frame:

python qwen_caption_refine.py
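The refinement step presumably re-captions the edited first frame so the prompt matches it. A minimal sketch of that idea with Qwen2-VL through transformers (the model id and prompt are assumptions; see qwen_caption_refine.py for what the repo actually does):

```python
# Illustrative prompt refinement with Qwen2-VL via transformers; the model id
# and prompt are assumptions, see qwen_caption_refine.py for the real logic.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed captioning model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("edited_first_frame.png")  # assumed file name
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Rewrite this video prompt so it matches "
                                 "the edited frame: <original prompt here>"},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```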
If you find this work useful, please consider citing:

@misc{gao2025longvie,
      title={LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation},
      author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Jianfeng Feng and Chenyang Si and Yanwei Fu and Yu Qiao and Ziwei Liu},
      year={2025},
      eprint={2508.03694},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.03694},
}
@misc{gao2025longvie2,
      title={LongVie 2: Multimodal Controllable Ultra-Long Video World Model},
      author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Junhao Zhuang and Chengming Xu and Jianfeng Feng and Yu Qiao and Yanwei Fu and Chenyang Si and Ziwei Liu},
      year={2025},
      eprint={2512.13604},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.13604},
}