Description
System Info
Training runs out of memory (OOM) on 8 A800 40G GPUs. I am using train_ddp_i2v.sh with the following configuration:
`DATA_ARGS=(
--data_root "videos_perpromptunion_0718"
--caption_column "prompt.txt"
--video_column "videos.txt"
# --image_column "images.txt"  # comment this line will use first frame of video as image conditioning
--train_resolution "81x768x1360"  # (frames x height x width), frames should be 8N+1
)
# Training Configuration
TRAIN_ARGS=(
--train_epochs 2 # number of training epochs
--seed 42 # random seed
--batch_size 1
--gradient_accumulation_steps 1
--mixed_precision "bf16"  # ["no", "fp16"] # Only CogVideoX-2B supports fp16 training
)
# System Configuration
SYSTEM_ARGS=(
--num_workers 8
--pin_memory True
--nccl_timeout 1800
)
# Checkpointing Configuration
CHECKPOINT_ARGS=(
--checkpointing_steps 50 # save checkpoint every x steps
--checkpointing_limit 2 # maximum number of checkpoints to keep, after which the oldest one is deleted
# --resume_from_checkpoint "/absolute/path/to/checkpoint_dir"  # if you want to resume from a checkpoint, otherwise, comment this line
)
# Validation Configuration
VALIDATION_ARGS=(
--do_validation false  # ["true", "false"]
--validation_dir "/absolute/path/to/your/validation_set"
--validation_steps 20  # should be multiple of checkpointing_steps
--validation_prompts "prompts.txt"
--validation_images "images.txt"
--gen_fps 16
)`
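The config comment says the frame count in `--train_resolution` should be 8N+1. A minimal sketch of a check for that rule follows; `check_resolution` is a hypothetical helper written for illustration, not part of train_ddp_i2v.sh. It only validates the string format and the frame rule, and says nothing about whether the resolution fits in GPU memory.

```shell
#!/bin/sh
# Hypothetical helper (not part of train_ddp_i2v.sh): validate a
# "frames x height x width" string against the 8N+1 frame rule.
# Note: plain POSIX sh, so the variables below are global.
check_resolution() {
  res="$1"
  # Require at least two "x" separators.
  case "$res" in
    *x*x*) ;;
    *) echo "invalid"; return 1 ;;
  esac
  # Split on "x" with parameter expansion.
  frames="${res%%x*}"
  rest="${res#*x}"
  height="${rest%%x*}"
  width="${rest#*x}"
  # Each field must be a positive integer without leading zeros.
  for v in "$frames" "$height" "$width"; do
    case "$v" in
      ''|0*|*[!0-9]*) echo "invalid"; return 1 ;;
    esac
  done
  # Frame count must have the form 8N+1.
  if [ $(( (frames - 1) % 8 )) -eq 0 ]; then
    echo "ok"
  else
    echo "frames must be 8N+1"
    return 1
  fi
}

check_resolution "81x768x1360"
```

With the value used here, 81 = 8 x 10 + 1, so the resolution string itself satisfies the rule and is unlikely to be the cause of the failure.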
Information
- The official example scripts
- My own modified scripts and tasks
Reproduction
Expected behavior
Would training with ZeRO solve this problem?