Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan†, Ziwei Liu†
TL;DR: 4DNeX is a feed-forward framework for generating 4D scene representations from a single image by fine-tuning a video diffusion model. It produces high-quality dynamic point clouds and enables downstream tasks such as novel-view video synthesis with strong generalizability.
We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) To alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) We introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) We propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for the 4D generation task. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX achieves competitive performance compared to existing 4D generation approaches, offering a scalable and generalizable solution for single-image-based 4D scene generation.
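To make the unified 6D video representation concrete, the sketch below shows one plausible way to pack an RGB clip and its per-pixel XYZ point maps into a single tensor. The channel-wise concatenation and the tensor shapes are illustrative assumptions only, not necessarily the exact layout used by the model:
# Sketch: pack an RGB clip and its per-pixel XYZ point maps into one "6D" video.
# The channel-wise concatenation and these shapes are illustrative assumptions.
import torch

T, H, W = 49, 480, 832                   # assumed clip length and resolution
rgb = torch.rand(T, 3, H, W)             # appearance: RGB frames in [0, 1]
xyz = torch.randn(T, 3, H, W)            # geometry: per-pixel 3D coordinates
video_6d = torch.cat([xyz, rgb], dim=1)  # (T, 6, H, W) joint appearance + geometry
print(video_6d.shape)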
- Data Preprocessing Scripts
- Training Scripts
- Inference Scripts
- Pointmap Registration Scripts
- Visualization Scripts
We use Anaconda or Miniconda to manage the Python environment:
conda create -n "4dnex" python=3.10 -y
conda activate 4dnex
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
# git lfs and rerun
conda install -c conda-forge git-lfs
conda install -c conda-forge rerun-sdk
Our model is built on top of Wan2.1 I2V 14B. Please download the pretrained model from Hugging Face and place it in the pretrained directory with the following structure:
4DNeX/
└── pretrained/
└── Wan2.1-I2V-14B-480P-Diffusers/
├── model_index.json
├── scheduler/
├── transformer/
├── vae/
├── text_encoder/
├── tokenizer/
└── ...
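If you prefer to fetch the base model programmatically, here is a minimal sketch using huggingface_hub. The repo id Wan-AI/Wan2.1-I2V-14B-480P-Diffusers is an assumption; substitute the official Diffusers checkpoint you actually use:
# Sketch: download the Wan2.1 I2V base model into ./pretrained.
# The repo id below is an assumption; replace it with the checkpoint you use.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",  # assumed repo id
    local_dir="./pretrained/Wan2.1-I2V-14B-480P-Diffusers",
)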
Then, you may download our pretrained LoRA weights from Hugging Face here and place them in the ./pretrained directory:
cd pretrained
mkdir 4dnex-lora
cd 4dnex-lora
huggingface-cli download FrozenBurning/4DNex-Lora --local-dir .
cd ../..
export PRETRAINED_LORA_PATH=./pretrained/4dnex-lora
After setting up the environment and the pretrained models, you can run the following command to generate 4D scene representations from a single image; the output video and point map will be saved in the OUTPUT_DIR directory. Assuming we want to save the results in the ./results directory:
export OUTPUT_DIR=./results
python inference.py --prompt ./example/prompt.txt --image ./example/image.txt --out $OUTPUT_DIR --sft_path ./pretrained/Wan2.1-I2V-14B-480P-Diffusers/transformer --type i2vwbw-demb-samerope --mode xyzrgb --lora_path $PRETRAINED_LORA_PATH --lora_rank 64
The path to the input image is stored in ./example/image.txt and the prompt in ./example/prompt.txt. Feel free to modify the prompt and image path to generate your own 4D scene representations.
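For reference, the snippet below writes a one-sample prompt/image list. The one-entry-per-line format and the example paths are assumptions for illustration:
# Sketch: write a one-sample prompt/image list for inference.
# The one-entry-per-line format and these paths are assumptions.
with open("./example/image.txt", "w") as f:
    f.write("./example/images/cat.png\n")  # hypothetical input image path
with open("./example/prompt.txt", "w") as f:
    f.write("A cat turns its head in a sunlit living room.\n")  # hypothetical caption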
To visualize the generated 4D scene representations, you may first perform pointmap registration using the following command:
python pm_registration.py --pkl_dir $OUTPUT_DIR
Then, you can visualize the registration results with Rerun as follows:
python rerun_vis.py --rr_recording test_log.rrd --pkl_dir $OUTPUT_DIR
rerun test_log.rrd --web-viewer
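If you want to inspect the registered point maps without the provided script, the sketch below logs them to Rerun directly. It assumes the registration step writes one pickle per frame containing xyz and rgb arrays under those keys; adapt the key names and paths to the actual output files:
# Sketch: log registered point maps to Rerun.
# Assumes one .pkl per frame with "xyz" (N, 3) and "rgb" (N, 3) arrays; adjust as needed.
import glob, pickle
import rerun as rr

rr.init("4dnex_pointmaps")
rr.save("pointmaps.rrd")  # view later with: rerun pointmaps.rrd --web-viewer

for frame_idx, pkl_path in enumerate(sorted(glob.glob("./results/*.pkl"))):
    with open(pkl_path, "rb") as f:
        data = pickle.load(f)
    rr.set_time_sequence("frame", frame_idx)
    rr.log("world/points", rr.Points3D(data["xyz"], colors=data["rgb"]))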
Please check out our 4DNeX-10M dataset from here and place it in the ./data directory. The data can be organized in the following structure:
data/
├── dynamic/
│ ├── dynamic_1/
│ ├── dynamic_2/
│ └── dynamic_3/
├── static/
│ ├── static_1/
│ └── static_2/
├── caption/
│ ├── dynamic_1_with_caption_upload.csv
│ ├── dynamic_2_with_caption_upload.csv
│ ├── dynamic_3_with_caption_upload.csv
│ ├── static_1_with_caption_upload.csv
│ └── static_2_with_caption_upload.csv
└── raw/
├── dynamic/
│ ├── dynamic_1/
│ ├── dynamic_2/
│ └── dynamic_3/
└── static/
├── static_1/
└── static_2/
Run the command below to preprocess it:
python build_wan_dataset.py \
--data_dir ./data \
--out ./data/wan21
Once preprocessing is finished, the output directory will be organized as follows:
wan21/
├── cache/
├── videos/
├── first_frames/
├── pointmap/
├── pointmap_latents/
├── prompts.txt
├── videos.txt
└── generated_datalist.txt
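As a quick sanity check on the preprocessed output, the snippet below verifies that prompts.txt and videos.txt are line-aligned; the one-entry-per-line format is an assumption about these files:
# Sketch: check that the preprocessed prompt/video lists are line-aligned.
# Assumes prompts.txt and videos.txt contain one entry per line.
from pathlib import Path

root = Path("./data/wan21")
prompts = root.joinpath("prompts.txt").read_text().splitlines()
videos = root.joinpath("videos.txt").read_text().splitlines()
assert len(prompts) == len(videos), f"{len(prompts)} prompts vs {len(videos)} videos"
print(f"{len(videos)} training samples found")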
To launch training, assuming all preprocessed data are in the ./data/wan21 directory, run the following command:
bash scripts/finetune.sh
After training, you may convert the ZeRO checkpoint to an fp32 checkpoint for inference. For example, the following command saves the converted weights to the ./training/4dnex/5000-out directory:
python scripts/zero_to_fp32.py ./training/4dnex/checkpoint-5000 ./training/4dnex/5000-out --safe_serialization
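To confirm the conversion succeeded, you can load the exported weights with safetensors, as sketched below. The model.safetensors filename is an assumption (the converter may shard the output), so adjust the path to whatever appears in 5000-out:
# Sketch: inspect the converted fp32 weights.
# "model.safetensors" is an assumed filename; the converter may shard the output.
from safetensors.torch import load_file

state_dict = load_file("./training/4dnex/5000-out/model.safetensors")
print(f"{len(state_dict)} tensors, e.g. {next(iter(state_dict))}")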
If you find our work useful for your research, please consider citing our paper:
@article{chen20254dnex,
title={4DNeX: Feed-Forward 4D Generative Modeling Made Easy},
author={Chen, Zhaoxi and Liu, Tianqi and Zhuo, Long and Ren, Jiawei and Tao, Zeng and Zhu, He and Hong, Fangzhou and Pan, Liang and Liu, Ziwei},
journal={arXiv preprint arXiv:2508.13154},
year={2025}
}