A practical guide for training LTX-2 audio-video LoRAs on consumer GPUs (24-32GB VRAM).
This fork adds:
- RamTorch CPU offloading for training on consumer GPUs
- Memory-efficient configurations tested on RTX 5090
- Practical guidance from real-world training experience

Contents:

- System Requirements
- Installation
- Download Models
- Dataset Preparation
- Captioning with Qwen + Whisper
- Preprocessing (Latent Computation)
- Training Configuration
- Running Training
- Testing Checkpoints
- Parameter Reference
- Troubleshooting
| Component | Minimum | Recommended |
|---|---|---|
| GPU | 24GB VRAM | 32GB VRAM |
| System RAM | 32GB | 64GB+ |
| Storage | 100GB | 200GB+ |
- OS: Linux (required - triton is Linux-only)
- Python: 3.10+
- CUDA: 12.0+
| GPU | VRAM | RAM | Config | Speed |
|---|---|---|---|---|
| RTX 5090 | 32GB | 64GB | BF16 + 60% ramtorch | ~7s/step |
| A100 | 80GB | 128GB | BF16, no offload | ~4s/step |
git clone https://github.com/relaxis/LTX-2.git
cd LTX-2
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate

For loss tracking and training visualization:

uv run wandb login

Create a models/ directory:

mkdir -p models
cd models

Download the BF16 model for training (FP8 doesn't support autograd):
wget https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-dev.safetensors

Next, download the Gemma text encoder.

Option A: Standard Gemma 3

git lfs install
git clone https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized gemma-3

Option B: Abliterated Gemma (for uncensored content)

For NSFW or unrestricted content generation, use an abliterated Gemma model with safety refusals removed. Search HuggingFace for "gemma abliterated" models.
| Requirement | Specification |
|---|---|
| Frame count | Must satisfy frames % 8 == 1 (valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121) |
| Resolution | Width and height must be divisible by 32 |
| Duration | Minimum 5 seconds recommended |
| Format | MP4, AVI, MOV, or other ffmpeg-compatible formats |
| Audio | Required for audio-video training (will be converted to stereo 44.1kHz) |
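To vet clips before preprocessing, the frame-count and resolution rules above can be checked with a few lines of Python. This is a minimal sketch (the check_clip helper is not part of the trainer); how you obtain the frame count and dimensions, e.g. via ffprobe or OpenCV, is up to you.

```python
def check_clip(frames: int, width: int, height: int) -> list[str]:
    """Return a list of problems; an empty list means the clip satisfies the rules above."""
    problems = []
    if frames % 8 != 1:
        # Nearest valid frame count at or below the current length (trim rather than pad)
        problems.append(f"frames={frames} violates frames % 8 == 1 (trim to {(frames - 1) // 8 * 8 + 1})")
    if width % 32 != 0 or height % 32 != 0:
        problems.append(f"{width}x{height} is not divisible by 32 "
                        f"(e.g. crop to {width // 32 * 32}x{height // 32 * 32})")
    return problems

print(check_clip(100, 704, 384))  # ['frames=100 violates frames % 8 == 1 (trim to 97)']
print(check_clip(97, 704, 384))   # []
```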
| VRAM | Resolution | Frames | Notes |
|---|---|---|---|
| 24GB | 512x512 | 49-89 | Testing, simple concepts |
| 32GB | 704x384 | 97 | Standard training |
| 48GB | 768x512 | 97-121 | Higher quality |
| 80GB | 1024x768 | 121+ | Production quality |
| Dataset Size | Recommended Steps | Effective Epochs | Notes |
|---|---|---|---|
| 10-25 videos | 500-1000 | 20-100 | Small concept/style |
| 25-50 videos | 1000-2000 | 20-80 | Standard concept |
| 50-100 videos | 2000-3000 | 20-60 | Complex concept |
| 100+ videos | 2000-4000 | 20-40 | Large-scale training |
Formula: effective_epochs = (steps * gradient_accumulation) / dataset_size
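As a sanity check, the formula is trivial to script (the helper below is just for illustration):

```python
def effective_epochs(steps: int, gradient_accumulation: int, dataset_size: int) -> float:
    # effective_epochs = (steps * gradient_accumulation) / dataset_size
    return steps * gradient_accumulation / dataset_size

# 3000 steps at gradient_accumulation_steps=2 over a 100-video dataset:
print(effective_epochs(3000, 2, 100))  # 60.0 effective epochs
```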
Dataset layout:

datasets/my_dataset/
├── videos/
│   ├── video1.mp4
│   ├── video2.mp4
│   └── ...
└── dataset.jsonl
Each line is a JSON object:

{"media_path": "videos/video1.mp4", "prompt": "[VISUAL]: trigger. Description... [SPEECH]: None [SOUNDS]: ambient [TEXT]: None"}

For NSFW or uncensored content, use an abliterated/uncensored VLM model.

- Visual captioning: huihui-ai/Qwen2.5-VL-7B-Instruct-abliterated (or similar uncensored model)
- Audio transcription: openai/whisper-large-v3
LTX-2 expects captions with explicit section markers:
[VISUAL]: trigger_token. Detailed visual description of the scene...
[SPEECH]: Transcribed speech or "None"
[SOUNDS]: Description of ambient/non-speech sounds
[TEXT]: Any on-screen text or "None"
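Captions that drop or reorder a marker are easy to miss in a large dataset.jsonl. A small check like the sketch below (not part of the trainer; the file path is a placeholder) catches them early:

```python
import json

MARKERS = ["[VISUAL]:", "[SPEECH]:", "[SOUNDS]:", "[TEXT]:"]

def valid_prompt(prompt: str) -> bool:
    """True if all four section markers appear exactly once and in order."""
    positions = [prompt.find(m) for m in MARKERS]
    return (all(p >= 0 for p in positions)
            and positions == sorted(positions)
            and all(prompt.count(m) == 1 for m in MARKERS))

with open("dataset.jsonl") as f:
    for lineno, line in enumerate(f, 1):
        if not valid_prompt(json.loads(line)["prompt"]):
            print(f"line {lineno}: missing, duplicated, or out-of-order section markers")
```

The captioning script below produces prompts in this format automatically.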
#!/usr/bin/env python3
"""
Caption videos using Qwen2.5-VL + Whisper for LTX-2 training
"""
import os
import torch
import json
import subprocess
import tempfile
from pathlib import Path
from tqdm import tqdm
# Qwen2.5-VL imports
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# Whisper imports
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# === CONFIGURATION ===
VIDEO_DIR = "/path/to/your/videos"
OUTPUT_JSONL = "/path/to/output/dataset.jsonl"
TRIGGER_TOKEN = "your_trigger" # e.g., "mystyle", "mychar"
QWEN_MODEL = "huihui-ai/Qwen2.5-VL-7B-Instruct-abliterated"
WHISPER_MODEL = "openai/whisper-large-v3"
CAPTION_PROMPT = """Describe this video in detail for AI training.
Focus on: actions, movements, camera angles, lighting, subjects.
Be descriptive and specific. Just provide the description."""
def extract_audio(video_path: Path, output_path: Path) -> bool:
"""Extract audio from video using ffmpeg"""
cmd = [
"ffmpeg", "-y", "-i", str(video_path),
"-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
str(output_path)
]
try:
result = subprocess.run(cmd, capture_output=True, timeout=60)
return result.returncode == 0 and output_path.exists()
except Exception as e:
print(f" Audio extraction failed: {e}")
return False
def transcribe_audio(whisper_model, whisper_processor, audio_path: Path, device: str) -> str:
"""Transcribe audio using Whisper"""
try:
audio, sr = librosa.load(str(audio_path), sr=16000)
inputs = whisper_processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = {k: v.to(device=device, dtype=torch.float16) for k, v in inputs.items()}
with torch.inference_mode():
generated_ids = whisper_model.generate(
inputs["input_features"],
max_length=448,
language="en",
task="transcribe"
)
transcription = whisper_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
return transcription.strip()
except Exception as e:
print(f" Transcription failed: {e}")
return ""
def caption_video(qwen_model, qwen_processor, video_path: Path, device: str) -> str:
"""Caption a video using Qwen2.5-VL"""
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": str(video_path),
"max_pixels": 360 * 420,
"nframes": 16,
},
{"type": "text", "text": CAPTION_PROMPT},
],
}
]
text = qwen_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = qwen_processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to(device)
with torch.inference_mode():
generated_ids = qwen_model.generate(
**inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = qwen_processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
return output_text[0].strip()
def main():
device = "cuda" if torch.cuda.is_available() else "cpu"
video_dir = Path(VIDEO_DIR)
video_files = sorted(list(video_dir.glob("**/*.mp4")))
print(f"Found {len(video_files)} videos")
# Load models
print("Loading Whisper...")
whisper_processor = WhisperProcessor.from_pretrained(WHISPER_MODEL)
whisper_model = WhisperForConditionalGeneration.from_pretrained(
WHISPER_MODEL, torch_dtype=torch.float16, device_map="auto"
)
print("Loading Qwen2.5-VL...")
qwen_processor = AutoProcessor.from_pretrained(QWEN_MODEL)
qwen_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
QWEN_MODEL, torch_dtype=torch.bfloat16, device_map="auto",
attn_implementation="flash_attention_2",
)
results = []
for video_path in tqdm(video_files, desc="Captioning"):
# Transcribe audio
speech_text = ""
with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as tmp:
tmp_path = Path(tmp.name)
if extract_audio(video_path, tmp_path):
speech_text = transcribe_audio(whisper_model, whisper_processor, tmp_path, device)
# Caption video
try:
visual_caption = caption_video(qwen_model, qwen_processor, video_path, device)
except Exception as e:
print(f" Caption failed: {e}")
visual_caption = "video content"
# Build LTX-2 format caption
speech_line = speech_text if speech_text else "None"
full_caption = f"[VISUAL]: {TRIGGER_TOKEN}. {visual_caption} [SPEECH]: {speech_line} [SOUNDS]: ambient sounds [TEXT]: None"
results.append({
"prompt": full_caption,
"media_path": str(video_path.relative_to(video_dir))
})
# Save JSONL
output_path = Path(OUTPUT_JSONL)
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w") as f:
for result in results:
f.write(json.dumps(result) + "\n")
print(f"Saved {len(results)} captions to {output_path}")
if __name__ == "__main__":
    main()

Dependencies for the captioning script:

pip install transformers accelerate qwen-vl-utils librosa torch
pip install flash-attn --no-build-isolation  # Optional, faster inference

Preprocessing converts videos and captions into latent representations for efficient training.
cd packages/ltx-trainer
uv run python scripts/process_videos.py \
--input-dir /path/to/datasets/my_dataset \
--output-dir /path/to/datasets/my_dataset/.precomputed \
--model-path /path/to/models/ltx-2-19b-dev.safetensors

uv run python scripts/process_captions.py \
--input-dir /path/to/datasets/my_dataset \
--output-dir /path/to/datasets/my_dataset/.precomputed \
--text-encoder-path /path/to/models/gemma-3

uv run python scripts/process_dataset.py /path/to/dataset.jsonl \
--resolution-buckets "704x384x97" \
--model-path /path/to/models/ltx-2-19b-dev.safetensors \
--text-encoder-path /path/to/models/gemma-3 \
--with-audio

After preprocessing, the dataset folder contains:

dataset_folder/.precomputed/
├── conditions/ # Text embeddings (.pt files)
│ └── videos/
├── latents/ # Video latents (.pt files)
│ └── videos/
└── audio_latents/ # Audio latents (.pt files)
└── videos/
Important: All three directories must have matching files. If any video fails to process (e.g., too short or missing audio), remove its files from all three directories.
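One way to find the offending samples is to compare relative filenames across the three folders. This is a sketch under the assumption that each sample keeps the same relative .pt path in conditions/, latents/, and audio_latents/; adjust the root to your dataset.

```python
from pathlib import Path

root = Path("/path/to/datasets/my_dataset/.precomputed")
subdirs = ["conditions", "latents", "audio_latents"]

# Relative .pt paths found under each of the three directories
names = {d: {p.relative_to(root / d) for p in (root / d).rglob("*.pt")} for d in subdirs}
common = set.intersection(*names.values())

for d in subdirs:
    orphans = sorted(names[d] - common)
    print(f"{d}: {len(names[d])} files, {len(orphans)} without a match in the other directories")
    for p in orphans:
        print(f"  remove the sample behind {d}/{p} from all three directories")
```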
# configs/my_lora.yaml
model:
model_path: "/path/to/models/ltx-2-19b-dev.safetensors"
text_encoder_path: "/path/to/models/gemma-3"
training_mode: "lora"
load_checkpoint: null
lora:
rank: 64
alpha: 64
dropout: 0.0
target_modules:
- "to_k"
- "to_q"
- "to_v"
- "to_out.0"
- "ff.net.0.proj"
- "ff.net.2"
- "audio_ff.net.0.proj"
- "audio_ff.net.2"
training_strategy:
name: "text_to_video"
first_frame_conditioning_p: 0.3
with_audio: true
optimization:
learning_rate: 5e-5
steps: 3000
batch_size: 1
gradient_accumulation_steps: 2
max_grad_norm: 1.0
optimizer_type: "adamw8bit"
scheduler_type: "constant"
scheduler_params: {}
enable_gradient_checkpointing: true
acceleration:
mixed_precision_mode: "bf16"
quantization: null # Don't use - slower than ramtorch
hqq_quantization: null
load_text_encoder_in_8bit: true
ramtorch_offload: true
ramtorch_offload_percent: 0.60 # Increase to 0.70 for 24GB GPU
load_fp8_model: false
data:
preprocessed_data_root: "/path/to/datasets/my_dataset/.precomputed"
num_dataloader_workers: 2
validation:
prompts:
- "[VISUAL]: trigger. Your validation prompt here... [SPEECH]: None [SOUNDS]: ambient [TEXT]: None"
negative_prompt: "worst quality, blurry, distorted"
video_dims: [768, 512, 97]
frame_rate: 30.0
seed: 42
inference_steps: 30
interval: 999999 # Disabled - validation OOMs with ramtorch
videos_per_prompt: 1
guidance_scale: 4.0
stg_scale: 1.0
stg_blocks: [29]
stg_mode: "stg_v"
generate_audio: true
skip_initial_validation: true
checkpoints:
interval: 100
keep_last_n: 30
flow_matching:
timestep_sampling_mode: "shifted_logit_normal"
timestep_sampling_params: {}
wandb:
enabled: true
project: "ltx-2-training"
entity: null
tags: ["ltx2", "lora"]
log_validation_videos: false
seed: 42
output_dir: "/path/to/outputs/my_lora"

For audio-video training (recommended):
target_modules:
- "to_k"
- "to_q"
- "to_v"
- "to_out.0"
- "ff.net.0.proj"
- "ff.net.2"
- "audio_ff.net.0.proj"
- "audio_ff.net.2"

For video-only training:
target_modules:
- "attn1.to_k"
- "attn1.to_q"
- "attn1.to_v"
- "attn1.to_out.0"

cd /path/to/LTX-2/packages/ltx-trainer
PYTORCH_ALLOC_CONF=expandable_segments:True uv run python scripts/train.py configs/my_lora.yaml

Monitor training:

- W&B Dashboard: Visit https://wandb.ai and find your project
- Terminal: Watch loss values in the progress bar
- Checkpoints: Saved to output_dir/checkpoint-{step}/
| Steps | Time (32GB GPU) |
|---|---|
| 1000 | ~2 hours |
| 3000 | ~6 hours |
| 6000 | ~12 hours |
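These figures assume roughly 7 s/step (the BF16 + 60% ramtorch configuration above). For a different step time, the estimate is simple arithmetic:

```python
def estimated_hours(steps: int, seconds_per_step: float = 7.0) -> float:
    """Rough wall-clock estimate; ignores startup and checkpoint-saving time."""
    return steps * seconds_per_step / 3600

for steps in (1000, 3000, 6000):
    print(f"{steps} steps -> ~{estimated_hours(steps):.1f} h")  # ~1.9 h, ~5.8 h, ~11.7 h
```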
Since validation is disabled during training (OOMs with ramtorch), test checkpoints manually:
cd packages/ltx-pipelines
uv run python scripts/inference.py \
--model-path /path/to/models/ltx-2-19b-dev.safetensors \
--text-encoder-path /path/to/models/gemma-3 \
--lora-path /path/to/outputs/my_lora/checkpoint-1000/lora_weights.safetensors \
--prompt "[VISUAL]: trigger. Your test prompt... [SPEECH]: None [SOUNDS]: ambient [TEXT]: None" \
--output-path test_output.mp4

Test multiple checkpoints (e.g., 500, 1000, 1500, 2000) to find the best one.
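To sweep several checkpoints with the same prompt, a small wrapper around the inference command above works. This is a sketch: the model/LoRA paths and checkpoint steps are placeholders, and it assumes the checkpoint-{step}/lora_weights.safetensors layout shown above.

```python
import subprocess

PROMPT = "[VISUAL]: trigger. Your test prompt... [SPEECH]: None [SOUNDS]: ambient [TEXT]: None"
OUTPUT_DIR = "/path/to/outputs/my_lora"

# Run from packages/ltx-pipelines, as with the single-checkpoint command above
for step in (500, 1000, 1500, 2000):
    subprocess.run([
        "uv", "run", "python", "scripts/inference.py",
        "--model-path", "/path/to/models/ltx-2-19b-dev.safetensors",
        "--text-encoder-path", "/path/to/models/gemma-3",
        "--lora-path", f"{OUTPUT_DIR}/checkpoint-{step}/lora_weights.safetensors",
        "--prompt", PROMPT,
        "--output-path", f"test_step{step}.mp4",
    ], check=True)
```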
| Training Type | LR Range | Recommended |
|---|---|---|
| Style/aesthetic | 5e-5 to 1e-4 | 5e-5 |
| Character/subject | 1e-5 to 5e-5 | 3e-5 |
| Motion/action | 5e-5 to 2e-4 | 1e-4 |
| Complex concept | 1e-5 to 5e-5 | 5e-5 |
| Rank | Trainable Params | Use Case |
|---|---|---|
| 8-16 | ~25-50M | Simple style transfer |
| 32 | ~100M | Standard (recommended) |
| 64-128 | ~200-400M | Complex concepts, high fidelity |
| GPU VRAM | Offload % | Notes |
|---|---|---|
| 24GB | 0.70-0.75 | May be slow |
| 32GB | 0.60 | Recommended |
| 40GB | 0.40-0.50 | Faster |
| 48GB+ | 0.20-0.30 | Much faster |
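If you prefer to derive a starting value from the detected GPU rather than reading the table, a helper along these lines works; the thresholds simply mirror the rows above, and the function is a convenience sketch, not part of the trainer config.

```python
import torch

def suggested_offload_percent(vram_gb: float) -> float:
    """Map available VRAM to a starting ramtorch_offload_percent (mirrors the table above)."""
    # Thresholds sit a little below the nominal sizes because drivers report slightly less memory
    if vram_gb >= 46:
        return 0.25
    if vram_gb >= 38:
        return 0.45
    if vram_gb >= 30:
        return 0.60
    return 0.75  # 24GB-class GPUs

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{vram_gb:.0f}GB VRAM -> ramtorch_offload_percent: {suggested_offload_percent(vram_gb)}")
```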
If training runs out of VRAM:

- Increase ramtorch_offload_percent (e.g., 0.70)
- Reduce lora.rank (e.g., 32)
- Ensure enable_gradient_checkpointing: true
- Use PYTORCH_ALLOC_CONF=expandable_segments:True
The validation sampler doesn't support ramtorch offloading. Set validation.interval: 999999 to disable and test checkpoints manually.
| Approach | Speed | Notes |
|---|---|---|
| BF16 + ramtorch 60% | ~7s/step | Recommended |
| INT8-quanto + ramtorch | ~8s/step | Slower - dequant overhead |
| FP8-quanto + ramtorch | ~10s/step | Even slower |
Don't use quantization: the dequantization overhead makes it slower than plain BF16 with ramtorch CPU offloading.
NotImplementedError: "ufunc_add_CUDA" not implemented for 'Float8_e4m3fn'
FP8 models don't support autograd. Use the BF16 model (ltx-2-19b-dev.safetensors).
If audio isn't being learned:

- Ensure with_audio: true in training_strategy
- Use short target_modules patterns: "to_k", not "attn1.to_k"
- Verify audio_latents/ exists and has matching files
If preprocessing fails for some videos, you may have mismatched counts:
latents/videos/: 50 files
audio_latents/videos/: 47 files
conditions/videos/: 50 files
Remove videos that failed from all three directories to ensure matching counts.
If the loss is unstable:

- Reduce learning rate
- Enable gradient clipping: max_grad_norm: 1.0
- Check for corrupted training samples
Quick checklist:

- Install dependencies: uv sync && source .venv/bin/activate
- Download BF16 model: ltx-2-19b-dev.safetensors
- Download Gemma text encoder
- Prepare videos (frame count % 8 == 1, resolution % 32 == 0)
- Create dataset.jsonl with LTX-2 format captions
- Run preprocessing (videos + captions)
- Verify matching file counts in .precomputed/
- Create training config with ramtorch offload
- Start training with wandb enabled
- Test checkpoints manually to find best one