DDP: coalescing many little broadcasts to improve performance #4978
I'm curious how much time this coalesced broadcast saves, and what the overhead of packing and unpacking the tensors is. Should they be moved into new contiguous memory? I'm testing DDP on a single node because of the GIL in DP, but I haven't been able to get a reasonable speedup. Could you please share your system setup for running DDP?
@Stonesjtu I did a rough perf evaluation with about 100MB of broadcast data across two nodes, comparing a single coalesced broadcast against the original many small broadcasts. The coalesced version took 0.0268 sec versus 0.0436 sec with the original logic, so it is about 1.6x faster.
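As a quick sanity check on the two timings quoted above, the arithmetic can be worked through as follows. This is only a sketch: the 100 MB payload is the approximate figure from the comment, and the variable names are made up for illustration.

```python
# Timings quoted in the comment above, in seconds.
coalesced_s = 0.0268   # one coalesced broadcast
original_s = 0.0436    # original per-tensor broadcasts
payload_mb = 100       # approximate payload size, per the comment

# Ratio of the original time to the coalesced time.
speedup = original_s / coalesced_s
# Fraction of the original wall-clock time that was saved.
saved_fraction = 1 - coalesced_s / original_s

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {saved_fraction:.0%}")
print(f"throughput: {payload_mb / original_s:.0f} MB/s -> "
      f"{payload_mb / coalesced_s:.0f} MB/s")
```

On these numbers the coalesced path is roughly 1.6x faster, which corresponds to a bit under 40% less time spent in the initial broadcast.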
apaszke left a comment:
LGTM, but the code could be simpler and more robust if we used _take_tensors.
# Here we will coalesce all the parameters and buffers in several big
# flat tensors and broadcast them out to reduce the number of broadcasts
# as well as improve performance, even though this function is
# only called one time.
        tensors_bucket.append([])
        cur_bucket_size = 0
    tensors_bucket[-1].append(tensor)
    cur_bucket_size += tensor.numel() * tensor.element_size()
@teng-li Is it possible to simply flatten the storage of all parameters (assuming they are static for the duration of the training)? As long as you do in-place operations on parameters, which I think PyTorch does, it seems like you'd be able to copy the entire model with just one broadcast.
We can't do this transformation in general, because there are cases where users want to flatten the parameters as they like, and we shouldn't interfere.
In the constructor, when we broadcast the entire state of module 0 on node 0 to all other nodes, we broadcast each state tensor one by one. This is very inefficient and has also previously caused NCCL/Gloo deadlocks (although that is a separate issue).
This PR coalesces the entire module state into a series of big flat tensors and broadcasts each of them in a single call, just as _sync_buffers() does on each forward() call.
The extra memory needed for coalescing is bounded by the bucket size, which is 10MB by default.
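The size-capped bucketing described above can be sketched in plain Python. This is a minimal illustration and not the PR's code: `bucket_by_size`, its parameters, and the example sizes are hypothetical, tensors are represented only by their byte sizes, and the choice to start a new bucket before the cap would be exceeded is an assumption.

```python
from typing import List

MB = 1024 * 1024


def bucket_by_size(sizes_in_bytes: List[int],
                   bucket_cap: int = 10 * MB) -> List[List[int]]:
    """Group tensor indices into buckets of at most bucket_cap bytes.

    A tensor larger than the cap ends up alone in its own bucket.
    """
    buckets: List[List[int]] = []
    cur_size = 0
    for idx, nbytes in enumerate(sizes_in_bytes):
        # Open a new bucket when there is none yet, or when adding this
        # tensor would push the current bucket past the cap.
        if not buckets or cur_size + nbytes > bucket_cap:
            buckets.append([])
            cur_size = 0
        buckets[-1].append(idx)
        cur_size += nbytes
    return buckets


# Hypothetical per-tensor sizes in bytes.
sizes = [4 * MB, 4 * MB, 4 * MB, 12 * MB, 1 * MB]
buckets = bucket_by_size(sizes)
print(buckets)
```

Each bucket would then be flattened into one contiguous buffer and sent with a single broadcast, so the temporary buffer never needs to be much larger than the cap (apart from a single oversized tensor).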
This also reduces the chance of multi-process, cross-node NCCL deadlocks, which happen only after a long series of broadcasts. So far, I haven't seen any deadlocks after this change.
Tested on ResNet50 training: I printed out the entire module's state on both nodes before and after the broadcast, and verified that after the coalesced broadcast node 0 and node 1 have exactly the same state, and that node 0's state is unchanged. Thus, this PR should be safe to land.
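The kind of check described above can be mimicked without any distributed setup. The sketch below is an illustration under stated assumptions: plain Python lists stand in for tensors, `fake_broadcast` is a hypothetical stand-in for a real collective, and `flatten` and `unflatten` are illustrative helpers rather than PyTorch functions.

```python
from typing import List


def flatten(tensors: List[List[float]]) -> List[float]:
    """Concatenate all tensors into one flat buffer."""
    return [x for t in tensors for x in t]


def unflatten(flat: List[float], like: List[List[float]]) -> List[List[float]]:
    """Split a flat buffer back into pieces shaped like `like`."""
    out, offset = [], 0
    for t in like:
        out.append(flat[offset:offset + len(t)])
        offset += len(t)
    return out


def fake_broadcast(buf: List[float]) -> List[float]:
    """Hypothetical stand-in for a collective: the receiver gets a copy."""
    return list(buf)


node0_state = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
node1_state = [[0.0, 0.0], [0.0], [0.0, 0.0, 0.0]]
snapshot = [list(t) for t in node0_state]  # node 0 state before broadcast

# One coalesced transfer instead of one transfer per tensor.
received = fake_broadcast(flatten(node0_state))
node1_state = unflatten(received, node1_state)

print(node1_state == node0_state)  # receiver matches the source
print(node0_state == snapshot)     # source was not modified
```

The two comparisons at the end correspond to the two properties checked in the test: the receiving node matches the source, and the source state is left unchanged by the coalescing.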