- [2025.2.27] The code is released! If you have any questions, please feel free to open an issue.
- [2025.1.10] Our MikuDance has recently been launched on Lipu, an AI creation community designed for animation enthusiasts. We invite everyone to download and try it out.
- [2024.11.15] The paper and project page are released! Please see the demo videos on the project page. Due to company policy, the code release will be delayed; we will do our best to open-source it as soon as possible.
We recommend Python >= 3.10 and CUDA 11.7. Then build the environment as follows:
# [Optional] Create a virtual env
conda create -n MikuDance python=3.10
conda activate MikuDance
# Install with pip:
pip install -r requirements.txt
Automatic downloading: You can run the following command to download the weights automatically:
python tools/download_weights.py
Weights will be placed under the ./pretrained_weights directory. The whole downloading process may take a long time.
Manual downloading: You can also download the weights manually, which involves the following steps:
- Download the MikuDance weights, which include three parts: denoising_unet.pth, reference_unet.pth, and motion_module.pth.
- Download the pretrained weights of the base models and other components (see the sketch below for one way to fetch them).
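For the base models, a minimal download sketch using huggingface_hub is shown below. The repository IDs are assumptions/placeholders, not confirmed sources; replace them with the repositories you actually use, keeping the target layout from the tree below.

```python
# Sketch: download base components into ./pretrained_weights with huggingface_hub.
# NOTE: the repo IDs below are assumptions -- substitute the actual sources you use.
from huggingface_hub import snapshot_download

targets = {
    "stabilityai/sd-vae-ft-mse": "./pretrained_weights/sd-vae-ft-mse",  # VAE (assumed repo)
    "stable-diffusion-v1-5/stable-diffusion-v1-5": "./pretrained_weights/stable-diffusion-v1-5",  # SD 1.5 (assumed repo)
}

for repo_id, local_dir in targets.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

The image_encoder and vae_temporal_decoder components come from their own sources; download them the same way once you know the corresponding repositories.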
Finally, these weights should be organized as follows:
./pretrained_weights/
|-- image_encoder
| |-- config.json
| `-- pytorch_model.bin
|-- sd-vae-ft-mse
| |-- config.json
| |-- diffusion_pytorch_model.bin
| `-- diffusion_pytorch_model.safetensors
|-- stable-diffusion-v1-5
| |-- feature_extractor
| | `-- preprocessor_config.json
| |-- model_index.json
| |-- unet
| | |-- config.json
| | `-- diffusion_pytorch_model.bin
| `-- v1-inference.yaml
|-- vae_temporal_decoder
| |-- config.json
| `-- diffusion_pytorch_model.safetensors
|-- denoising_unet.pth
|-- motion_module.pth
|-- reference_unet.pth
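As a quick sanity check (a sketch based only on the tree above, not an official tool), you can verify that the layout matches before running inference:

```python
# Sketch: verify the ./pretrained_weights layout described above.
from pathlib import Path

REQUIRED = [
    "image_encoder/config.json",
    "image_encoder/pytorch_model.bin",
    "sd-vae-ft-mse/config.json",
    "stable-diffusion-v1-5/unet/config.json",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
    "vae_temporal_decoder/config.json",
    "vae_temporal_decoder/diffusion_pytorch_model.safetensors",
    "denoising_unet.pth",
    "motion_module.pth",
    "reference_unet.pth",
]

root = Path("./pretrained_weights")
missing = [p for p in REQUIRED if not (root / p).exists()]
print("All weights found." if not missing else f"Missing files: {missing}")
```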
Note: If you have already downloaded some of the pretrained models, such as Stable Diffusion V1.5, you can specify their paths in the config file.
Run the inference script:
python -m scripts.inference_video \
--config ./configs/inference/inference_video.yaml \
-W 768 -H 768 --fps 30 --steps 20
You can refer to the format of inference_video.yaml to animate your own reference images and pose videos.
Note: The target face, hand, w2c, c2w, and the reference depth are optional. If you don't have them, you can set them to null in the config file.
Note: -W and -H are the width and height of the output video, respectively; both must be integer multiples of 8. --fps is the frame rate of the output video, and --steps is the number of denoising steps.
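To illustrate the two notes above, here is a small sketch that snaps the output resolution to a multiple of 8 and writes a minimal config with the optional inputs set to null. The field names (ref_image_path, pose_video_path, tgt_face_path, and so on) are assumptions for illustration only; use the keys from configs/inference/inference_video.yaml as the ground truth.

```python
# Sketch: build a minimal inference config. Field names are assumptions --
# check configs/inference/inference_video.yaml for the exact keys.
import yaml

def snap_to_multiple_of_8(x: int) -> int:
    """Output width/height must be integer multiples of 8."""
    return max(8, (x // 8) * 8)

config = {
    "ref_image_path": "./assets/ref_image.png",    # hypothetical path
    "pose_video_path": "./assets/pose_video.mp4",  # hypothetical path
    # Optional inputs: set to null (None in Python) if you do not have them.
    "tgt_face_path": None,
    "tgt_hand_path": None,
    "w2c_path": None,
    "c2w_path": None,
    "ref_depth_path": None,
}

print("W x H:", snap_to_multiple_of_8(770), "x", snap_to_multiple_of_8(768))  # -> 768 x 768

with open("./configs/inference/my_inference_video.yaml", "w") as f:
    yaml.safe_dump(config, f)  # None values are written as "null"
```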
You can refer to src/dataset/anime_image_dataset.py and src/dataset/anime_video_dataset.py to prepare your own dataset for the two training stages, respectively.
Our dataset is organized as follows:
./data/
|-- video_1/
| |-- frame_0001.jpg
| |-- pose_0001.jpg
| |-- face_0001.jpg
| |-- hand_0001.jpg
| |-- depth_0001.npy
| |-- w2c_0001.npy
| |-- c2w_0001.npy
| |-- frame_0002.jpg
| |-- ...
|-- video_2/
| |-- ...
Note: w2c and c2w are the camera parameters (world-to-camera and camera-to-world matrices) of the frame, and depth is the depth map of the frame. You can organize your own dataset format according to your needs.
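For illustration, the sketch below loads one frame's annotations following the layout above and checks that w2c and c2w are mutual inverses. The 4x4 matrix shape is an assumption; adapt the code to your own format.

```python
# Sketch: load the per-frame annotations for one clip, following the layout above.
# Assumptions: depth_XXXX.npy is an HxW array and w2c/c2w are 4x4 matrices.
from pathlib import Path
import numpy as np
from PIL import Image

clip_dir = Path("./data/video_1")
idx = 1  # frame index

frame = Image.open(clip_dir / f"frame_{idx:04d}.jpg")
pose  = Image.open(clip_dir / f"pose_{idx:04d}.jpg")
depth = np.load(clip_dir / f"depth_{idx:04d}.npy")
w2c   = np.load(clip_dir / f"w2c_{idx:04d}.npy")
c2w   = np.load(clip_dir / f"c2w_{idx:04d}.npy")

# world-to-camera and camera-to-world should be inverses of each other
assert np.allclose(w2c @ c2w, np.eye(4), atol=1e-4)
print(frame.size, pose.size, depth.shape, w2c.shape)
```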
accelerate launch scripts/train_stage1.py --config configs/train/train_stage1.yaml
Put the pretrained motion module weights mm_sd_v15_v2.ckpt (download link) under ./pretrained_weights.
accelerate launch scripts/train_stage2.py --config configs/train/train_stage2.yaml
We utilize XPose to estimate the pose of the character. You can download the pretrained weights of XPose from here and put them under ./src/XPose/weights.
Pose estimation for driving videos:
cd ./src/XPose
python inference_on_video.py \
-c config_model/UniPose_SwinT.py \
-p weights/unipose_swint.pth \
-i /input_video_path \
-o /output_video_path \
-t "person" -k "person" \ # change to "face" or "hand" for face and hands keypoints
# -- real_human # If the driving video is a real human video, we recommend to add this flag to adjust the head-body scale of the keypoints.
Pose estimation for reference images:
cd ./src/XPose
python inference_on_image.py \
-c config_model/UniPose_SwinT.py \
-p weights/unipose_swint.pth \
-i /input_image_path \
-o /output_image_path \
-t "person" -k "person"
Note: We predefined the color map for the character keypoints. It is necessary to use the same color map and visualization settings as ours during inference.
Note: If the driving video features a real human and there is a significant difference in face scale compared to anime characters, we recommend setting tgt_face_path to null in the config file.
We utilize DROID-SLAM to estimate the camera parameters of the driving video. You can follow the instructions in the DROID-SLAM repository to install it in the ./src/DROID-SLAM directory. Then you can run the following command to estimate the camera parameters:
cd ./src/DROID-SLAM
python get_camera_from_video.py -i /input_video_path -o /output_path
Note: DROID-SLAM uses a different environment from MikuDance; you may need to set it up separately by following the instructions in the DROID-SLAM repository.
Note: The camera parameters are optional for the inference of MikuDance. If you don't have them, you can set them to null in the config file.
Note: During inference, the camera parameters are saved at the video level, whereas in our training dataset they are saved at the frame level.
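If you need to convert a video-level camera file into the per-frame files used by the training dataset, a conversion along these lines may help. It assumes the video-level .npy stores one 4x4 world-to-camera matrix per frame and uses a hypothetical file name; check the actual output of get_camera_from_video.py and adjust.

```python
# Sketch: split video-level camera parameters into per-frame files for training.
# ASSUMPTION: cameras.npy holds an (N, 4, 4) array of world-to-camera matrices,
# one per frame -- verify against the actual output of get_camera_from_video.py.
from pathlib import Path
import numpy as np

w2c_all = np.load("./output_path/cameras.npy")  # hypothetical file name
out_dir = Path("./data/video_1")
out_dir.mkdir(parents=True, exist_ok=True)

for i, w2c in enumerate(w2c_all, start=1):
    np.save(out_dir / f"w2c_{i:04d}.npy", w2c)
    np.save(out_dir / f"c2w_{i:04d}.npy", np.linalg.inv(w2c))  # camera-to-world
```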
We utilize Intel/dpt-hybrid-midas for depth estimation.
python tools/depth_from_image.py --image_path /input_image_path --save_dir /output_path
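As an alternative to the provided script, the sketch below calls Intel/dpt-hybrid-midas directly through the Hugging Face transformers depth-estimation pipeline; the output file name and .npy format are assumptions, so match whatever your config expects.

```python
# Sketch: reference-image depth estimation with Intel/dpt-hybrid-midas via
# the transformers depth-estimation pipeline (alternative to tools/depth_from_image.py).
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-hybrid-midas")

image = Image.open("/input_image_path").convert("RGB")
result = depth_estimator(image)

# "predicted_depth" is a torch tensor; save it as a numpy array for the config/dataset.
depth = result["predicted_depth"].squeeze().cpu().numpy()
np.save("/output_path/ref_depth.npy", depth)  # output file name is an assumption
```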
If MikuDance is useful for your research, welcome to star this repo and cite our work using the following BibTeX:
@misc{zhang2024mikudance,
  title={MikuDance: Animating Character Art with Mixed Motion Dynamics},
  author={Jiaxu Zhang and Xianfang Zeng and Xin Chen and Wei Zuo and Gang Yu and Zhigang Tu},
  year={2024},
  eprint={2411.08656},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}