OOM when training LoRA on cogvideox1.5-i2v-5b #787

### System Info / 系統信息

Training runs out of memory (OOM) on 8 A800 40G GPUs. I am using `train_ddp_i2v.sh` with the following configuration:

```shell
DATA_ARGS=(
    --data_root "videos_perpromptunion_0718"
    --caption_column "prompt.txt"
    --video_column "videos.txt"
    # --image_column "images.txt"  # if this line is commented out, the first frame of the video is used as image conditioning
    --train_resolution "81x768x1360"  # (frames x height x width), frames should be 8N+1
)

# Training Configuration
TRAIN_ARGS=(
    --train_epochs 2 # number of training epochs
    --seed 42 # random seed
    --batch_size 1
    --gradient_accumulation_steps 1
    --mixed_precision "bf16"  # ["no", "fp16"] # Only CogVideoX-2B supports fp16 training
)

# System Configuration
SYSTEM_ARGS=(
    --num_workers 8
    --pin_memory True
    --nccl_timeout 1800
)

# Checkpointing Configuration
CHECKPOINT_ARGS=(
    --checkpointing_steps 50 # save a checkpoint every x steps
    --checkpointing_limit 2 # maximum number of checkpoints to keep; beyond this, the oldest one is deleted
    # --resume_from_checkpoint "/absolute/path/to/checkpoint_dir"  # uncomment to resume from a checkpoint; otherwise leave commented out
)

# Validation Configuration
VALIDATION_ARGS=(
    --do_validation false  # ["true", "false"]
    --validation_dir "/absolute/path/to/your/validation_set"
    --validation_steps 20  # should be a multiple of checkpointing_steps
    --validation_prompts "prompts.txt"
    --validation_images "images.txt"
    --gen_fps 16
)
```
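
As a sanity check on the `--train_resolution` value above, here is a minimal sketch that parses the `frames x height x width` string, verifies the 8N+1 frame rule mentioned in the script comment, and prints the raw pixel count per clip. The function name `check_resolution` is hypothetical and not part of the training scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the CogVideo scripts): parse a
# "frames x height x width" string and check the 8N+1 frame rule.
check_resolution() {
    local res="$1" frames height width
    # Split "81x768x1360" on the letter x into three fields.
    IFS='x' read -r frames height width <<< "$res"
    if (( (frames - 1) % 8 != 0 )); then
        echo "invalid: frames=$frames is not of the form 8N+1"
        return 1
    fi
    # Raw pixel count per clip, as a rough proxy for how large each sample is.
    echo "ok: frames=$frames height=$height width=$width pixels_per_clip=$(( frames * height * width ))"
}

check_resolution "81x768x1360"
```

Running the same helper on a smaller value gives a quick relative comparison of clip sizes. Assuming activation memory grows with the number of frames and pixels, lowering `--train_resolution` may be one lever to try against the OOM.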

### Information / 问题信息

- [x] The official example scripts / 官方的示例脚本
- [ ] My own modified scripts / 我自己修改的脚本和任务

### Reproduction / 复现过程

<img width="1623" height="483" alt="Image" src="https://github.com/user-attachments/assets/649bfa31-d3d0-4f4e-99bd-5c9c6ef90dcc" />

### Expected behavior / 期待表现

Would training with ZeRO solve this problem?
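
For reference, this is roughly what I mean by ZeRO: a minimal sketch of a DeepSpeed ZeRO stage 2 config with optimizer offload to CPU, written out from the shell. The file name `ds_zero2.json` is hypothetical, the batch settings mirror my `TRAIN_ARGS` above, and how (or whether) the training script would pick up such a file is an assumption. I have not confirmed that this resolves the OOM; since LoRA trains only a small set of parameters, sharding optimizer state may save little if activations dominate.

```shell
#!/usr/bin/env bash
# Sketch only: write a DeepSpeed ZeRO stage 2 config to a hypothetical file.
# Values mirror batch_size=1, gradient_accumulation_steps=1, bf16 from TRAIN_ARGS.
cat > ds_zero2.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
EOF
echo "wrote ds_zero2.json"
```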
