Updates NCCL to 2.17.1 #97843
|
The reported memory leak (https://github.com/pytorch/pytorch/actions/runs/4549833473/jobs/8040348962?pr=97843) in |
|
Should this get the 2.0.1 milestone tag so that it can be added to 2.0.1 (#97272)? |
Point-releases are for bug-fixes only. This PR does not mention anything about fixing a regression. |
|
It is a regression fix. The included NCCL version since pt-1.13 hangs with Not being able to use Context: |
|
@pytorchbot rebase |
|
@pytorchbot successfully started a rebase job. Check the current status here |
|
Successfully rebased |
d41eece to 2f4214a
0d61395 to 000f3b8
eea4d09 to b7ab2e4
|
@pytorchbot rebase |
|
@pytorchbot successfully started a rebase job. Check the current status here |
|
Successfully rebased |
b7ab2e4 to b44e88d
|
@pytorchbot merge -f "The current failures don't seem related to this PR" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). |
|
@stas00 I think we should not include this commit in 2.0.1 because there are more changes (improved fault tolerance, socket code refactor etc.) in NCCL 2.17.1 than just the fix for |
|
That works, @syed-ahmed. I appreciate you checking in and explaining why it's not safe to push insufficiently tested code at the last moment. Thank you for trying to get it into 2.0.1 in the first place. |
|
Hey all! Can we make sure this PR is included in PyTorch 2.1.0? Thanks! |
|
@apoorvkh It looks like 2.1 will use NCCL 2.18.5: https://github.com/pytorch/pytorch/tree/release/2.1/third_party/nccl |
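For anyone who wants to check whether a given build already carries the fix, here is a minimal sketch. It assumes, based only on this thread, that the socket shutdown patch is present in NCCL 2.17.1 and later. The helper names (`decode_nccl_version`, `has_socket_shutdown_patch`, `bundled_nccl_version`) are hypothetical and not part of PyTorch; the only PyTorch call used is `torch.cuda.nccl.version()`, which may return either a tuple or an integer code depending on the PyTorch release.

```python
from typing import Optional, Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Decode an integer NCCL version code into (major, minor, patch)."""
    # NCCL 2.9 and later encode as major*10000 + minor*100 + patch (2.17.1 -> 21701).
    # NCCL 2.8 and earlier encode as major*1000 + minor*100 + patch (2.7.8 -> 2708).
    if code >= 10000:
        return code // 10000, (code % 10000) // 100, code % 100
    return code // 1000, (code % 1000) // 100, code % 100


def has_socket_shutdown_patch(version: Tuple[int, int, int]) -> bool:
    """Assumption taken from this thread: the patch ships in NCCL >= 2.17.1."""
    return tuple(version[:3]) >= (2, 17, 1)


def bundled_nccl_version() -> Optional[Tuple[int, int, int]]:
    """Return the NCCL version PyTorch was built with, or None if unavailable."""
    try:
        import torch.cuda.nccl as nccl  # fails on machines without PyTorch
        raw = nccl.version()            # fails on builds without NCCL
    except Exception:
        return None
    if isinstance(raw, int):            # older PyTorch releases return an int code
        return decode_nccl_version(raw)
    return tuple(raw[:3])               # newer releases return a tuple


if __name__ == "__main__":
    version = bundled_nccl_version()
    if version is None:
        print("No NCCL-enabled PyTorch build found")
    else:
        print(version, "patched:", has_socket_shutdown_patch(version))
```

Under that assumption, a build reporting 2.18.5 would count as patched, while one reporting a version below 2.17.1 would not.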
Re-open of #97407. The NCCL version bundled since PyTorch 1.13 sometimes fails to send a FIN packet, which causes hangs. This PR updates NCCL to 2.17.1, which includes a patch for socket shutdown. NCCL 2.18 will also include this patch.
Pull Request resolved: pytorch#97843
Approved by: https://github.com/kwen2501
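To illustrate why a missing FIN can cause a hang, here is a minimal sketch using plain Python sockets over loopback. It is not NCCL's code and does not reproduce the actual patch; the function name `peer_sees_eof_after_shutdown` is hypothetical. It only shows the general TCP behaviour: a reader blocked in `recv` learns that the peer is done when the peer's FIN arrives, which `shutdown(SHUT_WR)` triggers.

```python
import socket


def peer_sees_eof_after_shutdown() -> bytes:
    """Send a message, half-close the sender, and read until EOF on the peer."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    server.settimeout(5.0)               # fail fast instead of hanging forever
    try:
        client.sendall(b"done")
        # Half-close the write side: this is what makes TCP send a FIN.
        client.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = server.recv(1024)
            if not data:                 # empty read means the FIN was received
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        client.close()
        server.close()
        listener.close()


if __name__ == "__main__":
    print(peer_sees_eof_after_shutdown())
```

If the sending side never delivered its FIN, the `recv` loop on the other end would keep waiting (here, until the 5-second timeout), which is the kind of hang described above.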