[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26#154055
kwen2501 wants to merge 1 commit into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154055
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit e6b5bbe with merge base fa85434.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
// hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
// bug. See https://github.com/pytorch/pytorch/issues/153960
// else if (getBoundDeviceId()) {
//   useNonblocking_ = true;
Is there any existing unit test that would need to be adjusted when this option is toggled on or off?
The flag is more of an internal choice than a contract.
There are several tests that pass `device_id`, so hopefully they don't break.
I was hoping that this would fix test_non_blocking_with_eager_init,
but with v2.7.1RC (docker pull ghcr.io/pytorch/pytorch-test:2.7.1-cuda12.6-cudnn9-runtime), I can still reproduce the timeout/hang:
root@d70999cd4c34:/my_workspace/wei-pytorch/test/distributed# python test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ...
I am switching to a platform that has a better OS (I was previously using an Ubuntu 20.04 based system, and there could be known issues).
But now I am encountering the following with v2.7.1RC:
ModuleNotFoundError: No module named 'torch.distributed._spmd'
Update: I used the wrong (runtime) container and should have used the devel container. Never mind on this command.
cc @atalman
The test didn't hang for me on an H100 machine.
Yes, I can confirm this test does not hang for me either on H100. I will follow up internally on the potential issues with the Ubuntu 20.04 stack.
Below is the output on H100:
`python3 test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ... ok
Ran 1 test in 10.152s
OK
`
Though this potentially means that even if your change lands on main, the upstream CI may still hang due to the potential OS-related issue. I will double-check on this front for distributed (Ubuntu 20.04 or Amazon Linux 2023 + SM75).
Below is what I get from an Ubuntu 20.04 based host + ghcr.io/pytorch/pytorch-test:2.7.1-cuda12.6-cudnn9-devel on T4x2:
root@d6abe3d5c3dd:/workspace/pytorch/test/distributed# time timeout 30 python3 test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ...
real 0m30.040s (i.e. hang)
user 0m4.055s
sys 0m2.664s
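The `timeout 30` wrapper above treats a run that exceeds 30 seconds as a hang. A minimal Python sketch of the same check follows, for cases where the shell `timeout` utility is unavailable. The helper name `run_with_timeout` and its return strings are hypothetical and are not part of the PyTorch test suite.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome as "ok", "failed", or "hang".

    Hypothetical helper: "hang" only means the command did not finish
    within timeout_s seconds and was killed.
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; processes the
        # child spawned itself may need separate cleanup.
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


# Example invocation mirroring the command above (run from test/distributed):
# run_with_timeout(
#     [sys.executable, "test_c10d_nccl.py", "-v", "-k",
#      "test_non_blocking_with_eager_init"],
#     timeout_s=30,
# )
```

A timeout can only suggest a hang; a slow machine could exceed the limit too, so the threshold is a judgment call.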
@pytorchbot merge -f "Unblocking an urgent issue"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
@pytorchbot cherry-pick --onto release/2.7 -c critical
…NCCL 2.26 (#154055)
Work around issues like #153960, #152623
NCCL 2.26 seems to introduce random hang in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously torch turned it on by default in eager init (i.e. `device_id` passed) to avoid init overhead.
Pull Request resolved: #154055
Approved by: https://github.com/atalman
(cherry picked from commit 87fc5af)
Cherry picking #154055
The cherry pick PR is at #154085 and it is recommended to link a critical cherry pick PR with an issue.
…NCCL 2.26 (#154085)
[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)
Work around issues like #153960, #152623
NCCL 2.26 seems to introduce random hang in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously torch turned it on by default in eager init (i.e. `device_id` passed) to avoid init overhead.
Pull Request resolved: #154055
Approved by: https://github.com/atalman
(cherry picked from commit 87fc5af)
Co-authored-by: Ke Wen
Stack from ghstack (oldest at bottom):
Work around issues like #153960, #152623
NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around them. Previously torch turned it on by default in eager init (i.e. `device_id` passed) to avoid init overhead.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
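To illustrate the behavior change described above, here is a small Python model of how the non-blocking flag might be resolved before and after this PR. It is a sketch under assumptions, not the actual C++ implementation: the function name, parameter names, and the priority order (explicit option first, then an environment override, then the eager-init default) are hypothetical. Only the disabled `getBoundDeviceId()` branch is taken from the diff in this PR.

```python
from typing import Optional


def use_nonblocking(
    config_blocking: Optional[int],
    env_value: Optional[bool],
    has_bound_device_id: bool,
    eager_init_default: bool = False,
) -> bool:
    """Hypothetical model of resolving the non-blocking flag.

    eager_init_default=True models the old behavior; False models the
    behavior after this PR.
    """
    # Assumed first priority: an explicit user option (0 means non-blocking).
    if config_blocking is not None:
        return config_blocking == 0
    # Assumed second priority: an environment-variable override.
    if env_value is not None:
        return env_value
    # Branch disabled by this PR: eager init (device_id passed) used to
    # turn non-blocking mode on by default.
    if eager_init_default and has_bound_device_id:
        return True
    # Otherwise stay in blocking mode.
    return False


# Old default with device_id passed: non-blocking.
print(use_nonblocking(None, None, True, eager_init_default=True))
# New default with device_id passed: blocking.
print(use_nonblocking(None, None, True))
```

Under this model, an explicit opt-in would still enable non-blocking mode; only the automatic default for eager init changes.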