📄 Paper | 🎥 Homepage | 💻 Code
- 🎯 Extreme Viewpoint Synthesis: Generate high-quality 4D videos with camera movements ranging from -90° to 90°
- 🔧 Depth Watertight Mesh: Novel geometric representation that models both visible and occluded regions
- ⚡ Lightweight Architecture: Only 140M trainable parameters, about 1% of the 14B video diffusion backbone
- 🎭 No Multi-view Training: Innovative masking strategy eliminates the need for expensive multi-view datasets
- 🏆 State-of-the-art Performance: Outperforms existing methods, especially on extreme camera angles
EX-4D transforms monocular videos into camera-controllable 4D experiences with physically consistent results under extreme viewpoints.
Our framework consists of three key components:
- 🔺 Depth Watertight Mesh Construction: Creates a robust geometric prior that explicitly models both visible and occluded regions
- 🎭 Simulated Masking Strategy: Generates effective training data from monocular videos without multi-view datasets
- ⚙️ Lightweight LoRA Adapter: Efficiently integrates geometric information with pre-trained video diffusion models
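For readers curious how such an adapter is wired up, the sketch below shows one common way to attach low-rank (LoRA) layers to a frozen backbone so that only the adapter weights train. The class and layer names (`LoRALinear`, `to_q`/`to_k`/`to_v`) are illustrative assumptions, not EX-4D's actual implementation.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: r -> d_out
        nn.init.zeros_(self.up.weight)        # zero-init so training starts from the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def add_lora(module: nn.Module, rank: int = 16, targets=("to_q", "to_k", "to_v")):
    """Recursively replace attention projections with LoRA-wrapped versions.

    The target layer names are an assumption about the backbone, not taken from EX-4D's code.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in targets:
            setattr(module, name, LoRALinear(child, rank))
        else:
            add_lora(child, rank, targets)
    return module
```

Because only the rank-r factors are trainable while the backbone stays frozen, the adapter's parameter count remains a small fraction of the backbone's, which is how a 14B model can be adapted with on the order of 140M trainable weights.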
# Clone the repository
git clone https://github.com/tau-yihouxiang/EX-4D.git
cd EX-4D
# Create conda environment
conda create -n ex4d python=3.10
conda activate ex4d
# Install PyTorch (2.x recommended)
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
# Install Nvdiffrast
pip install git+https://github.com/NVlabs/nvdiffrast.git
# Install dependencies and diffsynth
pip install -e .
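Before downloading the models, a quick environment check such as the following (not shipped with this repository) confirms that CUDA-enabled PyTorch and nvdiffrast are working:

```python
# Quick environment check (illustrative helper, not part of EX-4D)
import torch
import nvdiffrast.torch as dr

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Creating a rasterization context fails early if the CUDA toolchain is misconfigured.
ctx = dr.RasterizeCudaContext()
print("nvdiffrast CUDA rasterizer initialized:", ctx is not None)
```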
# Install DepthCrafter for depth estimation (follow DepthCrafter's installation instructions to prepare its checkpoints)
git clone https://github.com/Tencent/DepthCrafter.git
# Download the pretrained base model and the EX-4D checkpoint
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./models/Wan-AI
huggingface-cli download yihouxiang/EX-4D --local-dir ./models/EX-4D
# Step 1: Reconstruct the depth watertight mesh and render color/mask videos
# --cam 180 selects the camera trajectory (alternatives: 30 / 60 / 90 / zoom_in / zoom_out)
python recon.py --input_video examples/flower/input.mp4 --cam 180 --output_dir outputs/flower --save_mesh
# Step 2: Generate the final video with the diffusion model
python generate.py --color_video outputs/flower/color_180.mp4 --mask_video outputs/flower/mask_180.mp4 --output_video outputs/flower/output.mp4
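To render the same clip under every supported trajectory, the two commands above can be looped from a small driver script. The helper below is a hypothetical sketch; it assumes the `color_<cam>.mp4` / `mask_<cam>.mp4` naming shown in the 180° example.

```python
# sweep_cams.py -- render multiple camera trajectories for one clip (illustrative, not part of the repo)
import subprocess
from pathlib import Path

INPUT = "examples/flower/input.mp4"
OUT_ROOT = Path("outputs/flower")

for cam in ["30", "60", "90", "180", "zoom_in", "zoom_out"]:
    out_dir = OUT_ROOT / cam
    out_dir.mkdir(parents=True, exist_ok=True)
    # Step 1: build the depth watertight mesh and render color/mask videos for this trajectory.
    subprocess.run(["python", "recon.py", "--input_video", INPUT,
                    "--cam", cam, "--output_dir", str(out_dir), "--save_mesh"], check=True)
    # Step 2: synthesize the final video with the diffusion model.
    subprocess.run(["python", "generate.py",
                    "--color_video", str(out_dir / f"color_{cam}.mp4"),
                    "--mask_video", str(out_dir / f"mask_{cam}.mp4"),
                    "--output_video", str(out_dir / "output.mp4")], check=True)
```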
Input Video ➜ Output Video (example comparison clips)
- 70.7% of participants preferred EX-4D over baseline methods
- Superior performance in physical consistency and extreme viewpoint quality
- Significant improvement as camera angles become more extreme
- 🎮 Gaming: Create immersive 3D game cinematics from 2D footage
- 🎬 Film Production: Generate novel camera angles for post-production
- 🥽 VR/AR: Create free-viewpoint video experiences
- 📱 Social Media: Generate dynamic camera movements for content creation
- 🏢 Architecture: Visualize spaces from multiple viewpoints
- Depth Dependency: Performance relies on monocular depth estimation quality
- Computational Cost: Requires significant computation for high-resolution videos
- Reflective Surfaces: Challenges with reflective or transparent materials
- Real-time inference optimization (3DGS / 4DGS)
- Support for higher resolutions (1K, 2K)
- Neural mesh refinement techniques
We would like to thank the DiffSynth-Studio v1.1.1 project for providing the foundational diffusion framework.
If you find our work useful, please consider citing:
@misc{hu2025ex4dextremeviewpoint4d,
title={EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh},
author={Hu, Tao and Peng, Haoyang and Liu, Xiao and Ma, Yuewen},
year={2025},
eprint={2506.05554},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.05554}
}