Add DeepSpeed trainer for large-scale training (#5856)
Conversation
@wanchichen, can you review this PR?
Codecov Report

```
@@            Coverage Diff            @@
##           master    #5856      +/-  ##
==========================================
+ Coverage        0   43.06%   +43.06%
==========================================
  Files           0      819      +819
  Lines           0    75193    +75193
==========================================
+ Hits            0    32384    +32384
- Misses          0    42809    +42809
```
Great work! The implementation is very clean, so I do not have many comments. Only two questions:
```python
with reporter.measure_time("step_time"):
    # (0) ensure all ranks have not finished.
    dist.all_reduce(iterator_stop, ReduceOp.SUM)
```
This all-reduce is not necessary (same as the one in valid_one_epoch). We can remove lines 185-187.
I think that forced synchronization is necessary, as it functions like a join operation, allowing all ranks to start forwarding simultaneously and avoiding timeouts. Torch_DDP does not require this synchronization operation, partly because DDP has its own join function, and partly because, for distributed training like DeepSpeed, the communication between machines is more complex and timeouts are more likely to occur. Therefore, sacrificing some waiting time (in fact, I believe the proportion of waiting time will not be very high) in exchange for training stability is necessary. cc @jctian98
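To illustrate the join-like behavior described above, here is a hypothetical single-process sketch (not the actual ESPnet code): `synced_steps` and its flag list stand in for `iterator_stop` and `dist.all_reduce`. Each rank raises a stop flag once its data iterator is exhausted, and summing the flags at the top of every step lets all ranks leave the loop together, so no rank is left blocking alone in a collective.

```python
def synced_steps(rank_lengths):
    """Simulate the iterator_stop / all_reduce(SUM) pattern on one process.

    rank_lengths: number of batches each simulated rank can produce.
    Returns the number of synchronized loop iterations before every
    rank exits together.
    """
    iters = [iter(range(n)) for n in rank_lengths]
    stop = [0] * len(iters)  # per-rank iterator_stop flags
    steps = 0
    while True:
        # stands in for dist.all_reduce(iterator_stop, ReduceOp.SUM):
        # every rank sees the same summed flag value
        if sum(stop) > 0:
            break  # at least one rank is out of data -> all ranks stop
        for i, it in enumerate(iters):
            if next(it, None) is None:  # this rank's iterator is exhausted
                stop[i] = 1             # stops everyone at the next step
        steps += 1
    return steps

# the shortest rank (3 batches) ends the epoch for all ranks
assert synced_steps([3, 5]) == 4
```

The simulation shows why the collective acts as a join: without it, the rank with the shortest iterator would exit while the others wait on it in a communication call and eventually time out.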
Thanks for both comments :)
Since ZeRO will always have extensive communication, this forced sync is probably OK to use. I'll post more observations if I find that these lines are a performance bottleneck.
@wanchichen Thanks for the review!
Do we need this? You removed this in the other part of the latest commit.
Ditto: do we need this? You removed this in the other part of the latest commit.
Sounds good.
We may still discuss with @wanchichen about whether we should remove the
```json
"zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
},
```
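For reference, in a full DeepSpeed JSON a ZeRO section like the one above typically sits next to batch-size and precision settings. A minimal hypothetical config follows; the key names are standard DeepSpeed options, but the values are illustrative and not taken from this PR:

```json
{
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8
    }
}
```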
The new config file begins:

```json
{
    "train_batch_size": 32,
```
train_batch_size (32 in this case) might conflict with train.yaml::batch_size (64 in this case).
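As background (not part of this PR's diff): DeepSpeed requires the global batch size to factor as micro-batch size times gradient-accumulation steps times world size, which is why a value fixed in the JSON can disagree with a batch size set elsewhere. A minimal sketch of that consistency check, with illustrative numbers:

```python
def batch_sizes_consistent(train_batch_size: int,
                           micro_batch_per_gpu: int,
                           grad_accum_steps: int,
                           world_size: int) -> bool:
    """DeepSpeed's constraint:
    train_batch_size ==
        train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
    """
    return train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size

# e.g. 32 == 4 * 2 * 4, while a separately configured batch size of 64 would not match
assert batch_sizes_consistent(32, 4, 2, 4)
assert not batch_sizes_consistent(64, 4, 2, 4)
```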
@xingchensong Many thanks for the review!!! I have updated the example config. The current example is simply a toy; later, when we run some experiments at large scale, we can share our DeepSpeed config together with the recipes. @sw005320 if CI is OK, I think this PR is ready to merge.
Thanks a lot, @jctian98!
What?
This PR adds another trainer object that wraps DeepSpeed so that it automatically handles many trainer-related things, especially some advanced features:
(1) This DeepSpeed trainer is based on data parallelism and will allow us to train models as large as ~13B parameters.
(2) It can be switched to smoothly from the previous ESPnet trainer.
(3) Unlike model parallelism (which will be needed beyond ~13B), this trainer doesn't impose any requirements on the model architecture.
To use it, simply add these lines to the training config:

```yaml
use_deepspeed: true
deepspeed_config: <path-to-config>.json
```

Most trainer-related options will be moved to deepspeed_config, so the training config will only need to define things like the model architecture and the data loader.

Discussion:
What is a good place to add a README.md file? Or is it needed at all?
Will request @wanchichen to take a look.
Thanks