VideoTuna is a useful codebase for text-to-video applications.
VideoTuna is the first repo that integrates multiple AI video generation models, including text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), and video-to-video (V2V) generation, for model inference and fine-tuning (to the best of our knowledge).
VideoTuna is the first repo that provides comprehensive pipelines in video generation, from fine-tuning to pre-training, continuous training, and post-training (alignment) (to the best of our knowledge).
- All-in-one framework: run inference on and fine-tune various up-to-date pre-trained video generation models.
- Continuous training: keep improving your model with new data.
- Fine-tuning: adapt pre-trained models to specific domains.
- Human preference alignment: leverage RLHF to align models with human preferences.
- Post-processing: enhance and rectify generated videos with a video-to-video enhancement model.
- [2025-04-22] Supported inference for `Wan2.1` and `Step Video`, and fine-tuning for `HunyuanVideo T2V`, with a unified codebase architecture.
- [2025-02-03] Supported automatic code formatting via PR#27. Thanks @samidarko!
- [2025-02-01] Migrated to Poetry for streamlined dependency and script management (PR#25). Thanks @samidarko!
- [2025-01-20] Supported fine-tuning for `Flux-T2I`.
- [2025-01-01] Released training for `VideoVAE+` in the VideoVAEPlus repo.
- [2025-01-01] Supported inference for `Hunyuan Video` and `Mochi`.
- [2024-12-24] Released `VideoVAE+`: a SOTA video VAE model, now available in this repo! It achieves better video reconstruction than NVIDIA's `Cosmos-Tokenizer`.
- [2024-12-01] Supported inference for `CogVideoX-1.5-T2V & I2V` and `Video-to-Video Enhancement` from ModelScope.
- [2024-12-01] Supported fine-tuning for `CogVideoX`.
- [2024-11-01] Released VideoTuna v0.1.0! Initial support includes inference for `VideoCrafter1-T2V&I2V`, `VideoCrafter2-T2V`, `DynamiCrafter-I2V`, `OpenSora-T2V`, `CogVideoX-1-2B-T2V`, `CogVideoX-1-T2V`, `Flux-T2I`, and training/fine-tuning of `VideoCrafter`, `DynamiCrafter`, and `Open-Sora`.
conda create -n videotuna python=3.10 -y
conda activate videotuna
pip install poetry
poetry install
- Note: `poetry install` takes around 3 minutes.
Optional: Flash-attn installation
The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode. Install flash-attn via:
poetry run install-flash-attn
- Note: this takes about 1 minute.
Optional: Video-to-video enhancement
poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
- If this command gets stuck, kill the process and re-run it; that usually resolves the issue.
Alternative: set up the environment with Poetry only (without conda):
Install Poetry: https://python-poetry.org/docs/#installation
Then:
poetry config virtualenvs.in-project true # optional but recommended, will ensure the virtual env is created in the project root
poetry config virtualenvs.create true # ensure Poetry creates a virtual env if one does not already exist
poetry env use python3.10 # will create the virtual env, check with `ls -l .venv`.
poetry env activate # optional because Poetry commands (e.g. `poetry install` or `poetry run <command>`) will always automatically load the virtual env.
poetry install

Optional: Flash-attn installation
The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode. Install flash-attn via:
poetry run install-flash-attn

Optional: Video-to-video enhancement
poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
- If this command gets stuck, kill the process and re-run it; that usually resolves the issue.
Alternative: set up the environment with Docker Compose:
On macOS with an Apple Silicon chip, use Docker Compose, because some dependencies do not support arm64 (e.g. bitsandbytes, decord, xformers).
First build:
docker compose build videotuna

To preserve the project's file permissions, set these environment variables:
export HOST_UID=$(id -u)
export HOST_GID=$(id -g)

Install dependencies:
docker compose run --remove-orphans videotuna poetry env use /usr/local/bin/python
docker compose run --remove-orphans videotuna poetry run python -m pip install --upgrade pip setuptools wheel
docker compose run --remove-orphans videotuna poetry install
docker compose run --remove-orphans videotuna poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Note: installing swissarmytransformer might hang. Just try again and it should work.
Add a dependency:
docker compose run --remove-orphans videotuna poetry add wheel

Check dependencies:
docker compose run --remove-orphans videotuna poetry run pip freeze

Run Poetry commands:
docker compose run --remove-orphans videotuna poetry run format

Start a terminal:
docker compose run -it --remove-orphans videotuna bash

- Please follow docs/checkpoints.md to download model checkpoints.
- After downloading, place the model checkpoints according to the Checkpoint Structure described there.
Use the following commands to run model inference.
T2V/T2I generation uses the prompts in inputs/t2v/prompts.txt,
and I2V generation uses the images and prompts in inputs/i2v/576x1024.
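For example, a minimal T2V run could look like the sketch below (the prompt text is a placeholder; the `inference-*` commands and their resource requirements are listed in the tables that follow):

```shell
# Add one prompt per line to the T2V prompt file.
echo "A cat surfing on an ocean wave at sunset" >> inputs/t2v/prompts.txt

# Run one of the inference commands from the tables below,
# e.g. VideoCrafter2 at 320x512 (checkpoints must be downloaded first, see docs/checkpoints.md).
poetry run inference-vc2-t2v-320x512
```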
T2V
| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|---|---|---|---|---|---|---|
| T2V | HunyuanVideo | `poetry run inference-hunyuan-t2v` | 129 | 720x1280 | 32min | 60G |
| T2V | WanVideo | `poetry run inference-wanvideo-t2v-720p` | 81 | 720x1280 | 32min | 70G |
| T2V | StepVideo | `poetry run inference-stepvideo-t2v-544x992` | 51 | 544x992 | 8min | 61G |
| T2V | Mochi | `poetry run inference-mochi` | 84 | 480x848 | 2min | 26G |
| T2V | CogVideoX-5b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2min | 3G |
| T2V | CogVideoX-2b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2min | 3G |
| T2V | Open Sora V1.0 | `poetry run inference-opensora-v10-16x256x256` | 16 | 256x256 | 11s | 24G |
| T2V | VideoCrafter-V2-320x512 | `poetry run inference-vc2-t2v-320x512` | 16 | 320x512 | 26s | 11G |
| T2V | VideoCrafter-V1-576x1024 | `poetry run inference-vc1-t2v-576x1024` | 16 | 576x1024 | 2min | 15G |
I2V
| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|---|---|---|---|---|---|---|
| I2V | WanVideo | `poetry run inference-wanvideo-i2v-720p` | 81 | 720x1280 | 28min | 77G |
| I2V | HunyuanVideo | `poetry run inference-hunyuan-i2v-720p` | 129 | 720x1280 | 29min | 43G |
| I2V | CogVideoX-5b-I2V | `poetry run inference-cogvideox-15-5b-i2v` | 49 | 480x720 | 5min | 5G |
| I2V | DynamiCrafter | `poetry run inference-dc-i2v-576x1024` | 16 | 576x1024 | 2min | 53G |
| I2V | VideoCrafter-V1 | `poetry run inference-vc1-i2v-320x512` | 16 | 320x512 | 26s | 11G |
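As with T2V, I2V inference reads its conditioning images and prompts from inputs/i2v/576x1024 (a sketch under that assumption; follow the example files shipped in that directory for the exact layout):

```shell
# Place conditioning images and their prompts under inputs/i2v/576x1024,
# then run one of the I2V commands from the table above, e.g. DynamiCrafter:
poetry run inference-dc-i2v-576x1024
```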
T2I
| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|---|---|---|---|---|---|---|
| T2I | Flux-dev | `poetry run inference-flux-dev` | 1 | 768x1360 | 4s | 37G |
| T2I | Flux-dev | `poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 4.2min | 2G |
| T2I | Flux-schnell | `poetry run inference-flux-schnell` | 1 | 768x1360 | 1s | 37G |
| T2I | Flux-schnell | `poetry run inference-flux-schnell --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 24s | 2G |
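On GPUs with limited memory, the Flux commands can trade speed for a smaller footprint using the flags shown in the table; for example:

```shell
# Low-memory Flux-dev inference: VAE tiling plus sequential CPU offload
# cuts GPU memory from ~37G to ~2G at the cost of much slower generation (~4.2 min vs ~4 s).
poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload
```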
Please follow docs/datasets.md to try the provided toy dataset or build your own datasets.
All training commands were tested on H800 80G GPUs.
T2V
| Task | Model | Mode | Command | More Details | #GPUs |
|---|---|---|---|---|---|
| T2V | Wan Video | LoRA Fine-tune | `poetry run train-wan2-1-t2v-lora` | docs/finetune_wan.md | 1 |
| T2V | Wan Video | Full Fine-tune | `poetry run train-wan2-1-t2v-fullft` | docs/finetune_wan.md | 1 |
| T2V | Hunyuan Video | LoRA Fine-tune | `poetry run train-hunyuan-t2v-lora` | docs/finetune_hunyuanvideo.md | 2 |
| T2V | CogVideoX | LoRA Fine-tune | `poetry run train-cogvideox-t2v-lora` | docs/finetune_cogvideox.md | 1 |
| T2V | CogVideoX | Full Fine-tune | `poetry run train-cogvideox-t2v-fullft` | docs/finetune_cogvideox.md | 4 |
| T2V | Open-Sora v1.0 | Full Fine-tune | `poetry run train-opensorav10` | - | 1 |
| T2V | VideoCrafter | LoRA Fine-tune | `poetry run train-videocrafter-lora` | docs/finetune_videocrafter.md | 1 |
| T2V | VideoCrafter | Full Fine-tune | `poetry run train-videocrafter-v2` | docs/finetune_videocrafter.md | 1 |
I2V
| Task | Model | Mode | Command | More Details | #GPUs |
|---|---|---|---|---|---|
| I2V | Wan Video | LoRA Fine-tune | `poetry run train-wan2-1-i2v-lora` | docs/finetune_wan.md | 1 |
| I2V | Wan Video | Full Fine-tune | `poetry run train-wan2-1-i2v-fullft` | docs/finetune_wan.md | 1 |
| I2V | CogVideoX | LoRA Fine-tune | `poetry run train-cogvideox-i2v-lora` | docs/finetune_cogvideox.md | 1 |
| I2V | CogVideoX | Full Fine-tune | `poetry run train-cogvideox-i2v-fullft` | docs/finetune_cogvideox.md | 4 |
T2I
| Task | Model | Mode | Command | More Details | #GPUs |
|---|---|---|---|---|---|
| T2I | Flux | LoRA Fine-tune | `poetry run train-flux-lora` | docs/finetune_flux.md | 1 |
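A typical fine-tuning workflow, sketched below, assumes the dataset has been prepared following docs/datasets.md and the relevant checkpoints are already downloaded (see docs/checkpoints.md); GPU counts are listed in the tables above:

```shell
# 1. Prepare the toy dataset or your own data, following docs/datasets.md.
# 2. Launch one of the training commands from the tables above,
#    e.g. a single-GPU CogVideoX T2V LoRA fine-tune:
poetry run train-cogvideox-t2v-lora
# Model-specific configuration is documented in docs/finetune_cogvideox.md.
```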
We support VBench evaluation of T2V generation performance. Please check eval/README.md for details.
Git hooks are handled with the pre-commit library.
Run the following commands to install hooks that run on commit; they check formatting, linting, and types.
poetry run pre-commit install
poetry run pre-commit install --hook-type commit-msg

To run all hooks manually against all files:

poetry run pre-commit run --all-files

We thank the following repos for sharing their awesome models and codes!
- Wan2.1: Wan: Open and Advanced Large-Scale Video Generative Models.
- HunyuanVideo: A Systematic Framework For Large Video Generation Model.
- Step-Video: A text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
- Mochi: A new SOTA in open-source video generation models
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Open-Sora: Democratizing Efficient Video Production for All
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- VADER: Video Diffusion Alignment via Reward Gradients
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- Flux: Text-to-image models from Black Forest Labs.
- SimpleTuner: A fine-tuning kit for text-to-image generation.
- LLMs-Meet-MM-Generation: A paper collection of utilizing LLMs for multimodal generation (image, video, 3D and audio).
- MMTrail: A multimodal trailer video dataset with language and music descriptions.
- Seeing-and-Hearing: A versatile framework for Joint VA generation, V2A, A2V, and I2A.
- Self-Cascade: A Self-Cascade model for higher-resolution image and video generation.
- ScaleCrafter and HiPrompt: Free method for higher-resolution image and video generation.
- FreeTraj and FreeNoise: Free method for video trajectory control and longer-video generation.
- Follow-Your-Emoji, Follow-Your-Click, and Follow-Your-Pose: Follow family for controllable video generation.
- Animate-A-Story: A framework for storytelling video generation.
- LVDM: Latent Video Diffusion Model for long video generation and text-to-video generation.
Please follow CC-BY-NC-ND. If you want a license authorization, please contact the project leads Yingqing He ([email protected]) and Yazhou Xing ([email protected]).
@software{videotuna,
author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
month = {Nov},
year = {2024},
url = {https://github.com/VideoVerses/VideoTuna}
}