Can you imagine playing many different games through a single model, like Black Myth: Wukong? 🤩 DeepVerse can "fantasize" the entire world behind an image and enable free exploration through interaction 🎮️. Please follow the instructions below to experience DeepVerse!
- 2025-8: The weights and code of DeepVerse are released! See Here!
- 2025-6: The paper of DeepVerse is released! Also, check out our previous 4D diffusion world model Aether!
- Set up the virtual environment
```bash
conda create -n deepverse python=3.10
conda activate deepverse
pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
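As an optional sanity check (not part of the original setup steps), you can confirm that the CUDA-enabled PyTorch build is importable before downloading the weights. The snippet below is a generic check, not repository code:

```python
# Optional sanity check: verify the CUDA build of PyTorch installed correctly.
import torch

print(torch.__version__)           # expected: 2.4.0+cu121
print(torch.cuda.is_available())   # should print True on a CUDA-capable machine
```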
- Model weight download
```python
from huggingface_hub import snapshot_download

repo_id = "SOTAMak1r/DeepVerse1.1"
ak = "your ak"

snapshot_download(
    local_dir="/path/to/your/folder",
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    use_auth_token=ak,
)
```
- Let's start with a simple example. Use `--input_image` to specify the initial image, and `--model_path` as the directory for model weights.

```bash
python run.py \
    --model_path /path/to/model \
    --input_image ./assets/demo1.png \
    --prompt_type text \
    --prompt 'The character rides a horse and walks on the street'
```

The inference process runs on a single NVIDIA A800 with a speed of 4 FPS, while the video is saved at 20 FPS. The maximum GPU memory usage during inference is 17 GB. All result files will be saved in the `output` folder by default. We present some sampling results.

| demo1.mp4 | output.mp4 |
| --- | --- |
| The character rides a horse and walks on the street | The character walked along the snowy path |

To save depth images simultaneously, use `--add_depth`. To save point clouds simultaneously, use `--add_ply`. When saving point clouds, we perform temporal sampling with a default interval of 8 frames. Additionally, we randomly downsample the point cloud to 1/10 of its original point count to further reduce the PLY file size. If adjustments are needed, modify the configuration in the `save_ply` function in `run.py` (a minimal sketch of this sampling is shown after the results below).
Here’s an example command:

```bash
python run.py \
    --model_path /path/to/model \
    --input_image ./assets/demo3.png \
    --prompt_type text \
    --prompt 'The car is driving slowly in the direction of the road' --add_depth --add_ply
```

The results will be saved as:

```
output
├── generated_video.mp4           # rgb (+depth)
├── generated_video_frame0.ply    # frame 0's ply
├── generated_video_frame8.ply    # frame 8's ply
├── ...
├── generated_video_frame64.ply   # frame 64's ply
├── ...
```

You will obtain the following results:
demo3.mp4

RGB & Depth | PLY files (visualized in Meshlab)
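The exact point-cloud saving logic lives in the `save_ply` function in `run.py`. The snippet below is only an illustrative sketch of the behavior described above (an 8-frame temporal stride plus a random 1/10 point subsample); names such as `point_clouds` and `subsample_point_clouds` are hypothetical and do not mirror the repository's actual code.

```python
import numpy as np

FRAME_INTERVAL = 8   # default temporal sampling interval (frames)
KEEP_RATIO = 0.1     # keep roughly 1/10 of the points per saved frame

def subsample_point_clouds(point_clouds, interval=FRAME_INTERVAL, keep_ratio=KEEP_RATIO):
    """Pick every `interval`-th frame and randomly keep `keep_ratio` of its points."""
    kept = {}
    for frame_idx in range(0, len(point_clouds), interval):
        points = point_clouds[frame_idx]                       # (N, 3) array for one frame
        n_keep = max(1, int(len(points) * keep_ratio))
        choice = np.random.choice(len(points), n_keep, replace=False)
        kept[frame_idx] = points[choice]                       # would be written to frame{frame_idx}.ply
    return kept
```

Raising the interval or lowering the keep ratio in `save_ply` yields smaller PLY files at the cost of temporal and spatial density.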
- DeepVerse supports control using actions, which are divided into two aspects: translation and steering, as detailed below:

- translation:

```
fL    F    fR
   \  |  /
    \ | /
L ----+---- R
    / | \
   /  |  \
rL    B    rR

'S' : 'Stay where you are.'
'L' : 'Move to the left.'
'rL': 'Move to the rear left.'
'B' : 'Move backward.'
'rR': 'Move to the rear right.'
'R' : 'Move to the right.'
'fR': 'Move to the front right.'
'F' : 'Move forward.'
'fL': 'Move to the front left.'
```

- steering:

```
'N': 'The perspective hasn\'t changed.'
'L': 'Rotate the perspective counterclockwise.'
'R': 'Rotate the perspective clockwise.'
```

Each step must include both translation and steering signals. The translation signal comes first (which can be one or two characters), followed by the steering signal (a single character). The information for the same moment should be enclosed in `()`. Below is the format for inputting actions (a minimal parsing sketch appears after the demo below):

- 😄 valid: `(rLN)(fRL)(BN)(LN)(RN) ...`
- 😨 invalid: `(rL)(fR_L)(B)(N)(FRB) ...`

We provide an example command as follows, using `--prompt_type action` to specify the use of action control:

```bash
python run.py \
    --model_path /path/to/model \
    --input_image ./assets/demo2.png \
    --prompt_type action \
    --prompt '(FN)(FN)(fLN)(fLN)(fRN)(fRN)(SN)(FR)(FR)(FR)(FN)(FN)(FN)' \
    --add_controler --add_depth --add_ply
```

Use the `--add_controler` flag to include controller information in the saved video.

demo4.mp4

(FN)(FN)(fLN)(fLN)(fRN)(fRN)(SN)(FR)(FR)(FR)(FN)(FN)(FN) | PLY files
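To catch malformed prompts before launching a generation, you can validate them against the grammar above. The helper below is not part of the repository; it is a minimal sketch that assumes only the translation and steering codes listed earlier.

```python
import re

TRANSLATIONS = {"S", "L", "rL", "B", "rR", "R", "fR", "F", "fL"}
STEERINGS = {"N", "L", "R"}

# One step per group: a 1-2 character translation code followed by a single steering character.
STEP_PATTERN = re.compile(r"\(([A-Za-z]{1,2})([A-Za-z])\)")

def parse_actions(prompt: str):
    """Split an action prompt like '(FN)(fLN)' into (translation, steering) pairs, or raise."""
    steps = STEP_PATTERN.findall(prompt)
    # Reject prompts containing anything outside well-formed (..) groups.
    if "".join(f"({t}{s})" for t, s in steps) != prompt:
        raise ValueError(f"malformed action prompt: {prompt!r}")
    for translation, steering in steps:
        if translation not in TRANSLATIONS or steering not in STEERINGS:
            raise ValueError(f"unknown action step: ({translation}{steering})")
    return steps

print(parse_actions("(FN)(fLN)(fRL)(SN)"))  # [('F', 'N'), ('fL', 'N'), ('fR', 'L'), ('S', 'N')]
```

For example, `parse_actions('(fR_L)')` and `parse_actions('(FRB)')` both raise an error, matching the invalid cases listed above.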
NOTE: If you want to use action control on non-AAA game images (i.e., out-of-distribution inputs), we recommend using `--no_need_depth` for better visual results. This is because DeepVerse1.1's training set includes some real-world videos (without geometry labels) in the mix.
| demo5.mp4 | demo6.mp4 |
| --- | --- |
| (BN)(BN)(BN)(BN)(BN)(BN)(SN)(SN)(BN)(BN)(BN)(BN)(BN) | (FN)(FN)(FN)(FN)(FN)(SN)(fRL)(fRL)(fRL)(fLR)(fLR)(fLR)(FN)(FN)(FN) |
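Conversely, long action prompts like these are tedious to write by hand. The helper below is a hypothetical convenience, not part of `run.py`; it simply assembles the `(..)` prompt string expected by `--prompt_type action` from a list of (translation, steering) pairs.

```python
def build_action_prompt(steps):
    """Join (translation, steering) pairs into the '(..)(..)' format used by --prompt_type action."""
    return "".join(f"({translation}{steering})" for translation, steering in steps)

# Move forward for five steps, then keep moving forward while rotating the view clockwise.
steps = [("F", "N")] * 5 + [("F", "R")] * 3
print(build_action_prompt(steps))  # (FN)(FN)(FN)(FN)(FN)(FR)(FR)(FR)
```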
We would like to express our gratitude to the contributors to the open-source community, as the following papers and code repositories form the foundation of our work: (1) Pyramid-Flow and SD3: Provided open-source base models and code; (2) GameNGen: Offered valuable insights that significantly influenced our research direction; (3) Aether, GST, and Dust3R: Supplied open-source code and key functions. These contributions have enriched our understanding and inspired our efforts.
If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@article{chen2025deepverse,
  title={DeepVerse: 4D Autoregressive Video Generation as a World Model},
  author={Chen, Junyi and Zhu, Haoyi and He, Xianglong and Wang, Yifan and Zhou, Jianjun and Chang, Wenzheng and Zhou, Yang and Li, Zizun and Fu, Zhoujie and Pang, Jiangmiao and others},
  journal={arXiv preprint arXiv:2506.01103},
  year={2025}
}
```