A diffusion-based unified framework capable of repositioning, inserting, replacing, and deleting objects in driving scenario videos.
```bash
git clone git@github.com:yvanliang/DriveEditor.git
conda create -n DriveEditor python=3.10 -y
conda activate DriveEditor
pip install torch==2.1.1 torchvision==0.16.1 xformers==0.0.23 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install .
pip install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata  # install sdata for training
```
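As a quick sanity check after installation, a short snippet like the following (not part of the repository, purely illustrative) can confirm that the pinned PyTorch and xformers builds imported correctly and that a CUDA device is visible:

```python
# Illustrative environment check (not shipped with DriveEditor): verifies the
# pinned torch/xformers versions and CUDA visibility before running the demo.
import torch
import xformers

print("torch:", torch.__version__)        # expected 2.1.1
print("xformers:", xformers.__version__)  # expected 0.0.23
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU 0: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
```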
- Download the demo data from Google Drive and place it in the `checkpoints` directory.
- Download the pretrained models from Google Drive and place them in the `checkpoints` directory.
- Download `sv3d_p.safetensors` and `svd.safetensors` from the Hugging Face model hub and place them in the `checkpoints` directory.
- Execute the following command to combine the two models (a rough sketch of what such a step might do is shown after this list):

  ```bash
  python scripts/combine_ckpts.py
  ```
- Download the toy training data from Google Drive and extract it into the `checkpoints` directory to obtain `train_data.pkl`.
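The internals of `scripts/combine_ckpts.py` are not shown here; as a purely hypothetical sketch of what combining the two downloaded checkpoints could look like, one might merge the SVD and SV3D weight dictionaries into a single file with `safetensors`. The key prefixes and output filename below are assumptions, not the script's actual behavior:

```python
# Hypothetical sketch only -- the real logic lives in scripts/combine_ckpts.py.
# Merges the SVD and SV3D state dicts into one checkpoint; the key prefixes and
# output filename are illustrative assumptions.
from safetensors.torch import load_file, save_file

svd = load_file("checkpoints/svd.safetensors")
sv3d = load_file("checkpoints/sv3d_p.safetensors")

combined = {}
combined.update({f"svd.{k}": v for k, v in svd.items()})
combined.update({f"sv3d.{k}": v for k, v in sv3d.items()})

save_file(combined, "checkpoints/combined.safetensors")
```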
We provide a Gradio demo for editing driving scenario videos. To run the demo, a GPU with more than 32 GB of VRAM is required. Execute the following command:

```bash
python interactive_gui.py
```
If you don't have a GPU with more than 32 GB of VRAM but have two 24 GB GPUs, you can split inference across both GPUs, although it will take more time. First, modify `sgm/modules/diffusionmodules/video_model.py`:

- At line 684, add:

  ```python
  h_out_3d = h_out_3d.to(x.device)
  hs_3d_all = [t.to(x.device) for t in hs_3d_all]
  ```

- At line 794, add:

  ```python
  x = x.to("cuda:1")
  timesteps = timesteps.to("cuda:1")
  context = context.to("cuda:1")
  y = y.to("cuda:1")
  if time_context is not None:
      time_context = time_context.to("cuda:1")
  image_only_indicator = image_only_indicator.to("cuda:1")
  ```

Then run the following command:

```bash
python interactive_gui_2gpu.py
```
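The edits above follow the standard PyTorch pattern for splitting a model across two GPUs: inputs and activations are moved to the device of the layers that consume them, and results are moved back before they are combined. A minimal, generic sketch of that pattern (the module names are illustrative and do not correspond to DriveEditor's actual architecture) looks like this:

```python
# Generic two-GPU model-parallel sketch (requires two visible CUDA devices);
# the modules here are placeholders, not DriveEditor's layers.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(128, 128).to("cuda:0")  # first half lives on GPU 0
        self.stage2 = nn.Linear(128, 128).to("cuda:1")  # second half lives on GPU 1

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        h = self.stage2(h.to("cuda:1"))  # hand activations to GPU 1 (like the line-794 edit)
        return h.to("cuda:0")            # bring the result back (like the line-684 edit)

if torch.cuda.device_count() >= 2:
    out = TwoGPUModel()(torch.randn(4, 128))
    print(out.device)  # cuda:0
```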
To train the model, execute the following command:

```bash
python main.py -b configs/train.yaml --wandb --enable_tf32 True --no-test
```
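The `--enable_tf32 True` flag presumably toggles PyTorch's TF32 mode, which trades a small amount of matmul precision for speed on Ampere and newer GPUs; the equivalent manual switches in PyTorch are:

```python
# Manual TF32 switches in PyTorch; --enable_tf32 True presumably maps to these.
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # allow TF32 for cuDNN convolutions
```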
We appreciate the released code of Stable Video Diffusion and ChatSim.