Updates NCCL to 2.17.1 #97407
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97407
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 Failures
As of commit 22141bc: NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ptrblck to review and green light the PR.
Thanks much for creating the PR!
@weiwangmeta should we also update internally?
Thanks for tagging! I will check the internal version and align it with this PR via the 3rd party update process.
@ngimel FYI, the other internal team will take care of NCCL submodule updates according to their roadmap. Merging of this PR can be independent of the internal NCCL submodule update.
@pytorchbot merge
This PR updates submodules third_party/nccl/nccl. If those updates are intentional, please add "submodule" keyword to PR title/description.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot merge --help
❌ 🤖 pytorchbot command failed: Try
@pytorchbot merge -r
@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased
380ec0e to 22141bc
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@syed-ahmed your PR has been successfully reverted.
This reverts commit b113a09. Reverted #97407 on behalf of https://github.com/clee2000 due to looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/b113a09ef90decbc703722bfdc2064fc5eb54a19#12344853677
@clee2000 I don't think no signal is right description here: I've issued Also, why there is
The newly opened PR uses the same commit and got tagged with ciflow/periodic, so the periodic jobs show up on hud for this PR as well. The periodic jobs I see on hud finished at 9pm, which is after I reverted this PR |
Re-open of #97407. NCCL 2.17.1 sometimes fails to send a FIN packet and causes hangs. This PR updates NCCL to 2.17.1 that includes a patch for socket shutdown. NCCL 2.18 will also include this patch. Pull Request resolved: #97843 Approved by: https://github.com/kwen2501
This PR updates NCCL submodule to 2.17.1. Closes NVIDIA/nccl#750 Pull Request resolved: pytorch#97407 Approved by: https://github.com/ngimel, https://github.com/ptrblck, https://github.com/malfet
This reverts commit e4bb83a. Reverted pytorch#97407 on behalf of https://github.com/clee2000 due to looks like it broke inductor distributed tests https://hud.pytorch.org/pytorch/pytorch/commit/e4bb83a4f581e97563d961dace4f0e31b0843600#12344853677
Re-open of pytorch#97407. NCCL 2.17.1 sometimes fails to send a FIN packet and causes hangs. This PR updates NCCL to 2.17.1 that includes a patch for socket shutdown. NCCL 2.18 will also include this patch. Pull Request resolved: pytorch#97843 Approved by: https://github.com/kwen2501
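The re-open commit message above attributes the hangs to a FIN packet that is sometimes not sent on socket shutdown. As a rough illustration of why a missing FIN matters, here is a minimal Python sketch of a TCP half-close on loopback. It is not NCCL's code (NCCL is written in C++), and the function name `peer_sees_eof_after_shutdown` is hypothetical; it only shows that a receiver learns the stream has ended when `shutdown` triggers a FIN, and that without it a blocking read would wait indefinitely (here bounded by an assumed 5 second timeout).

```python
import socket


def peer_sees_eof_after_shutdown() -> bytes:
    """Return everything the receiver reads before it observes end-of-stream."""
    # Loopback TCP listener on an OS-assigned port (illustrative setup only).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()
    # Assumed timeout: raise instead of hanging forever if no FIN arrives.
    receiver.settimeout(5.0)

    try:
        sender.sendall(b"done")
        # Half-close the write side; the kernel sends a FIN after queued data.
        sender.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            data = receiver.recv(1024)
            if not data:  # an empty read means the sender's FIN was received
                break
            chunks.append(data)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        listener.close()


if __name__ == "__main__":
    print(peer_sees_eof_after_shutdown())  # prints b'done'
```

If the `shutdown` call were skipped and the sender socket stayed open, the receive loop would block until the timeout, which is the general shape of the hang described in the commit message.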
This PR updates NCCL submodule to 2.17.1.
Closes NVIDIA/nccl#750
cc: @ptrblck @crcrpar