hetero_adaptdl
Introduction

Source code for the Middleware '24 paper Cannikin: Optimal Adaptive Distributed DNN Training over Heterogeneous Clusters.

Environment setup

Docker image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

NumPy version: 1.22.4
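
To confirm a container matches this setup, a minimal sanity check (a hypothetical helper script; the versions asserted are simply the ones listed above):

import numpy
import torch

# Expect the versions pinned above; adjust if you use a different image.
assert torch.__version__.startswith("2.1.0")
assert numpy.__version__ == "1.22.4"
assert torch.cuda.is_available()   # CUDA 12.1 runtime from the Docker image
print("torch", torch.__version__, "| cuda", torch.version.cuda)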

Easy-to-use Elastic API

Cannikin introduces the HeteroDataLoader for adaptive batch size training over heterogeneous clusters. For the other APIs, refer to the AdaptDL documentation.

BEFORE:

import torch

# model and dataset are assumed to be defined elsewhere.
torch.distributed.init_process_group("nccl")
model = torch.nn.parallel.DistributedDataParallel(model)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128)  # fixed batch size
for epoch in range(100):
    ...

AFTER:

import adaptdl.torch

# model, optimizer, and dataset are assumed to be defined elsewhere.
adaptdl.torch.init_process_group("nccl")
model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)
dataloader = adaptdl.torch.HeteroDataLoader(dataset, batch_size=128)  # batch size adapts at runtime
for epoch in adaptdl.torch.remaining_epochs_until(100):  # restart-safe epoch loop
    ...
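
Putting the pieces together, a minimal end-to-end sketch. The model, dataset, and hyperparameters below are illustrative placeholders, and HeteroDataLoader is assumed to accept the same arguments as AdaptDL's AdaptiveDataLoader:

import adaptdl.torch
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

adaptdl.torch.init_process_group("nccl")

model = torch.nn.Linear(28 * 28, 10).cuda()                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)

dataset = datasets.MNIST("data", train=True, download=True,
                         transform=transforms.ToTensor())
# batch_size is the base global batch size; Cannikin re-partitions it
# across the heterogeneous GPUs at runtime.
dataloader = adaptdl.torch.HeteroDataLoader(dataset, batch_size=128)

for epoch in adaptdl.torch.remaining_epochs_until(100):
    for inputs, targets in dataloader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs.flatten(1)), targets)
        loss.backward()
        optimizer.step()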

Getting Started

Cannikin is built on top of AdaptDL's adaptive training library. It can be used as follows:

Adapting the batch size and learning rate for a single training job (Standalone Training); a standalone-mode sketch follows below.
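
For standalone runs without a cluster scheduler, AdaptDL reads its replica configuration from environment variables. A minimal sketch, assuming AdaptDL's documented ADAPTDL_* variables; the values here are illustrative, and the rank must be set per process:

import os

# Illustrative standalone configuration (one process per GPU).
os.environ.setdefault("ADAPTDL_MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("ADAPTDL_MASTER_PORT", "29500")
os.environ.setdefault("ADAPTDL_NUM_REPLICAS", "2")
os.environ.setdefault("ADAPTDL_REPLICA_RANK", "0")   # 0..NUM_REPLICAS-1, per process

import adaptdl.torch

adaptdl.torch.init_process_group("nccl")
# ... build the model and HeteroDataLoader as in the AFTER snippet above ...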
