Updates NCCL to 2.17.1#97843

Closed
syed-ahmed wants to merge 3 commits into
pytorch:mainfrom
syed-ahmed:bump-nccl

@syed-ahmed
Collaborator

@syed-ahmed syed-ahmed commented Mar 29, 2023

Re-open of #97407. NCCL 2.17.1 sometimes fails to send a FIN packet, which causes hangs. This PR updates NCCL to a 2.17.1 build that includes a patch for socket shutdown. NCCL 2.18 will also include this patch.

@pytorch-bot pytorch-bot Bot added the "topic: not user facing" label Mar 29, 2023
@pytorch-bot

pytorch-bot Bot commented Mar 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97843

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 Failures

As of commit b44e88d:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base be0b12e:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Skylion007 Skylion007 added the ciflow/periodic and ciflow/inductor labels Mar 29, 2023
@ptrblck
Collaborator

ptrblck commented Mar 30, 2023

The reported memory leak (https://github.com/pytorch/pytorch/actions/runs/4549833473/jobs/8040348962?pr=97843) in `test_function_returns_input_inner_requires_grad_True_save_for_vjp_save_tensors_output_mark_dirty_True_cuda` seems to be related to #97799.

@stas00
Contributor

stas00 commented Mar 30, 2023

Should it have the 2.0.1 milestone tag so it can be added to 2.0.1 (#97272)?

@syed-ahmed syed-ahmed requested a review from a team as a code owner March 30, 2023 21:06
@ezyang ezyang added the triaged label Apr 5, 2023
@malfet
Contributor

malfet commented Apr 5, 2023

> Should it have the 2.0.1 milestone tag so it can be added to 2.0.1 (#97272)?

Point-releases are for bug-fixes only. This PR does not mention anything about fixing a regression.

@stas00
Contributor

stas00 commented Apr 5, 2023

It is a regression fix. The NCCL version bundled since pt-1.13 hangs with `CUDA_LAUNCH_BLOCKING=1`. This has been fixed in nccl-2.17, which was released a month ago.

Not being able to use `CUDA_LAUNCH_BLOCKING=1` is a huge problem for some PyTorch users. At the moment the only solution is to use pt-1.12.
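
To make the affected combination concrete, here is a minimal sketch of a pre-flight check. It assumes only what this comment states: builds from pt-1.13 onward are affected until their NCCL reaches 2.17. The function name `launch_blocking_may_hang` and both constants are hypothetical helpers for illustration, not part of PyTorch or NCCL, and the boundaries have not been verified against every release.

```python
import os
from typing import Mapping, Tuple

Version = Tuple[int, ...]

# Boundaries taken from the comment above (assumptions, not verified):
FIRST_AFFECTED_TORCH: Version = (1, 13)  # "since pt-1.13"
FIXED_NCCL: Version = (2, 17)            # "fixed in nccl-2.17"


def launch_blocking_may_hang(
    torch_version: Version,
    nccl_version: Version,
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Return True if this setup matches the hang described in the thread.

    The check is purely a version comparison; it does not probe NCCL itself.
    """
    blocking = env.get("CUDA_LAUNCH_BLOCKING") == "1"
    affected = (
        torch_version >= FIRST_AFFECTED_TORCH and nccl_version < FIXED_NCCL
    )
    return blocking and affected


if __name__ == "__main__":
    # Illustrative inputs only; substitute the versions of your own build.
    env = {"CUDA_LAUNCH_BLOCKING": "1"}
    print(launch_blocking_may_hang((2, 0), (2, 14, 3), env))
    print(launch_blocking_may_hang((2, 0), (2, 17, 1), env))
```

In a real process the NCCL tuple could be taken from `torch.cuda.nccl.version()`, which returns a version tuple in recent PyTorch builds. It is passed in as an argument here so the sketch runs without a GPU or a PyTorch install.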

Context:

@syed-ahmed
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased `bump-nccl` onto `refs/remotes/origin/viable/strict`, please pull locally before adding more changes (for example, via `git checkout bump-nccl && git pull --rebase`)

@syed-ahmed syed-ahmed force-pushed the bump-nccl branch 4 times, most recently from 0d61395 to 000f3b8 Compare April 12, 2023 02:02
@syed-ahmed syed-ahmed force-pushed the bump-nccl branch 2 times, most recently from eea4d09 to b7ab2e4 Compare April 16, 2023 01:53
@syed-ahmed
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased `bump-nccl` onto `refs/remotes/origin/viable/strict`, please pull locally before adding more changes (for example, via `git checkout bump-nccl && git pull --rebase`)

@syed-ahmed
Collaborator Author

@malfet @ngimel this is ready to be merged. The original torchdynamo hang is fixed by shutting down sockets properly in NCCL. The current failures don't seem related to this PR.

@ngimel
Collaborator

ngimel commented Apr 17, 2023

cc @kwen2501, @awgu, distributed team is in charge of nccl updates.

@kwen2501
Collaborator

@pytorchbot merge -f "The current failures don't seem related to this PR"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@syed-ahmed
Collaborator Author

@stas00 I think we should not include this commit in 2.0.1 because there are more changes (improved fault tolerance, socket code refactor etc.) in NCCL 2.17.1 than just the fix for CUDA_LAUNCH_BLOCKING=1. Given that it's more than just one change affecting different parts of the execution, I believe it doesn't fit the criteria for this release. I would suggest using the nightlies until the next release.

@syed-ahmed syed-ahmed deleted the bump-nccl branch April 17, 2023 23:19
@stas00
Contributor

stas00 commented Apr 17, 2023

That works, @syed-ahmed. I appreciate you checking in and explaining why it's not safe to push insufficiently tested code at the last moment.

Thank you for trying to get it into 2.0.1 in the first place.

@Skylion007
Collaborator

NCCL 2.18.1 is released now as well. @kwen2501 @awgu Anything preventing us from updating to that version?

@apoorvkh

apoorvkh commented Sep 6, 2023

Hey all! Can we make sure this PR is included in PyTorch 2.1.0? Thanks!

@awgu
Collaborator

awgu commented Sep 6, 2023

@apoorvkh It looks like 2.1 will use NCCL 2.18.5: https://github.com/pytorch/pytorch/tree/release/2.1/third_party/nccl

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Re-open of pytorch#97407. NCCL 2.17.1 sometimes fails to send a FIN packet and causes hangs. This PR updates NCCL to 2.17.1 that includes a patch for socket shutdown. NCCL 2.18 will also include this patch.
Pull Request resolved: pytorch#97843
Approved by: https://github.com/kwen2501

Labels

ciflow/inductor, ciflow/periodic, Merged, merging, open source, topic: bug fixes, topic: not user facing, triaged