Source code of the Middleware'24 paper: Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters
Docker image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel
NumPy version: 1.22.4
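As a quick sanity check that your container matches these pins, a minimal sketch (the assertions below are illustrative and not part of the repo):

```python
import numpy
import torch

# Verify the environment matches the versions pinned above.
assert torch.__version__.startswith("2.1.0"), torch.__version__
assert numpy.__version__ == "1.22.4", numpy.__version__
assert torch.cuda.is_available()  # the image is a CUDA 12.1 devel build
```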
Cannikin introduces the HeteroDataLoader for adaptive batch size training over heterogeneous clusters. For the other APIs, refer to the AdaptDL Documentation.
BEFORE:

```python
torch.distributed.init_process_group("nccl")
model = torch.nn.parallel.DistributedDataParallel(model)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128)

for epoch in range(100):
    ...
```

AFTER:
```python
adaptdl.torch.init_process_group("nccl")
model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
dataloader = adaptdl.torch.HeteroDataLoader(dataset, batch_size=128)

for epoch in adaptdl.torch.remaining_epochs_until(100):
    ...
```

Cannikin is built on top of AdaptDL's adaptive training library. It can be used for:
- Adapting the batch size and learning rate for a single training job (Standalone Training), as sketched below.
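To make the API change above concrete, here is a minimal standalone training sketch. It assumes AdaptDL's documented APIs (`init_process_group`, `AdaptiveDataParallel`, `remaining_epochs_until`) and that `HeteroDataLoader` accepts the same arguments as AdaptDL's `AdaptiveDataLoader`; the model, dataset, and SGD hyperparameters are placeholders, not part of this repo.

```python
import torch
import torch.nn.functional as F
import adaptdl.torch

def train(model, dataset):
    # Join (or rejoin, after a cluster rescale) the collective training job.
    adaptdl.torch.init_process_group("nccl")

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # AdaptiveDataParallel wraps DistributedDataParallel and tracks gradient
    # statistics so the batch size and learning rate can be adapted at runtime.
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)

    # batch_size=128 is the initial global batch size, as in the AFTER snippet;
    # HeteroDataLoader is assumed to take the same arguments as AdaptiveDataLoader.
    dataloader = adaptdl.torch.HeteroDataLoader(dataset, batch_size=128)

    for epoch in adaptdl.torch.remaining_epochs_until(100):
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()
```

`remaining_epochs_until(100)` is what makes the loop restart-safe: when the job is checkpointed and rescaled, training resumes from the last completed epoch instead of epoch 0, which a plain `range(100)` cannot do.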