Scale Unsloth to multiple GPUs with just torchrun. No configuration files, no custom frameworks - pure PyTorch DDP.
- 🚀 2-4x faster than single GPU
- 🎯 Zero configuration - works out of the box
- 💾 Same VRAM per GPU as single GPU Unsloth
- 🔧 Any Unsloth model - Qwen, Llama, Gemma, etc.
```bash
# Install dependencies
uv add torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
uv add unsloth datasets transformers trl
uv add git+https://github.com/anhvth/opensloth.git
```
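Before launching, you can optionally confirm how many GPUs PyTorch sees; the count should match the `--nproc_per_node` value you plan to pass:

```python
# Optional sanity check: list the GPUs visible to PyTorch before launching torchrun.
import torch

print(torch.cuda.device_count())          # should match --nproc_per_node
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

To train on only a subset of the machine's GPUs, set `CUDA_VISIBLE_DEVICES` (e.g. `CUDA_VISIBLE_DEVICES=0,1`) before the `torchrun` command.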
Replace `python` with `torchrun`:

```bash
# Single GPU
python train_scripts/train_ddp.py

# Multi-GPU
torchrun --nproc_per_node=2 train_scripts/train_ddp.py  # 2 GPUs
torchrun --nproc_per_node=4 train_scripts/train_ddp.py  # 4 GPUs
```

OpenSloth automatically handles GPU distribution, gradient sync, and batch sizing.
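As a rough mental model for the batch sizing (illustrative numbers, not OpenSloth defaults): under DDP the effective global batch is the per-device batch times gradient-accumulation steps times the number of processes.

```python
# Effective global batch size under DDP (hypothetical values for illustration).
per_device_batch = 8     # per_device_train_batch_size in your SFTConfig
grad_accum = 4           # gradient_accumulation_steps in your SFTConfig
world_size = 2           # GPUs launched via --nproc_per_node

print(per_device_batch * grad_accum * world_size)  # 64; doubles again on 4 GPUs
```

With per-device settings held fixed, adding GPUs processes more samples per optimizer step, which is where the wall-clock savings in the table below come from.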
| Setup | Time | Speedup |
|---|---|---|
| 1 GPU | 19m 34s | 1.0x |
| 2 GPUs | 8m 28s | 2.3x |
Expected scaling: 2 GPUs = ~2.3x, 4 GPUs = ~4.5x, 8 GPUs = ~9x
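The speedup column is simply single-GPU wall-clock time divided by multi-GPU time; reproducing the 2-GPU figure from the table:

```python
# Speedup = single-GPU time / multi-GPU time, using the benchmark above.
one_gpu = 19 * 60 + 34   # 19m 34s -> 1174 s
two_gpus = 8 * 60 + 28   # 8m 28s  ->  508 s

print(round(one_gpu / two_gpus, 1))  # ~2.3x
```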
```python
import os

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()  # Enable DDP compatibility

# Standard Unsloth setup
local_rank = int(os.environ.get("LOCAL_RANK", 0))

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",
    device_map={"": local_rank},
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16)

trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()
```

Run: `torchrun --nproc_per_node=4 your_script.py`
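For a script you can launch end to end, a minimal sketch could look like the following. The dataset (`imdb`), the split slice, the script name, and all `SFTConfig` values are illustrative assumptions, not OpenSloth defaults; swap in your own data and hyperparameters.

```python
# train_ddp_sketch.py -- minimal end-to-end sketch (illustrative values throughout).
import os

from unsloth import FastLanguageModel   # keep unsloth imported first, as in the snippet above
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()  # Enable DDP compatibility

local_rank = int(os.environ.get("LOCAL_RANK", 0))

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-0.6B",     # any Unsloth model works
    device_map={"": local_rank},         # pin each process to its own GPU
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16)

# Placeholder dataset with a plain "text" column; replace with your own data.
dataset = load_dataset("imdb", split="train[:1%]")

args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    max_steps=60,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,   # newer TRL versions may call this processing_class
    train_dataset=dataset,
    args=args,
)
trainer.train()
```

Launch it the same way: `torchrun --nproc_per_node=4 train_ddp_sketch.py`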
Current (Recommended): Simple torchrun + DDP patch
```python
from opensloth.patching.ddp_patch import ddp_patch

ddp_patch()
# ... standard Unsloth code
```

Old Approach (v0.1.8): if you need the configuration-file based workflow, check out the v0.1.8 release:
```bash
git checkout v0.1.8  # https://github.com/anhvth/opensloth/releases/tag/v0.1.8
```

- Unsloth - 2x faster training library
- TRL - Transformer Reinforcement Learning
- PyTorch DDP - Distributed training
```bash
git clone https://github.com/anhvth/opensloth.git
cd opensloth
torchrun --nproc_per_node=4 train_scripts/train_ddp.py
```

Happy training! 🦥⚡