Closed
9 changes: 6 additions & 3 deletions torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
@@ -1102,9 +1102,12 @@ bool ProcessGroupNCCL::useNonblocking() {
     useNonblocking_ = nbEnv;
   }
   // 3rd priority: automatically use nonblocking if we are in eager init mode
-  else if (getBoundDeviceId()) {
-    useNonblocking_ = true;
-  }
+  // Note: this automatic selection is disabled in torch 2.7.1 to work around a
+  // hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
+  // bug. See https://github.com/pytorch/pytorch/issues/153960
+  // else if (getBoundDeviceId()) {
+  // useNonblocking_ = true;
Collaborator

@nWEIdia May 21, 2025

Is there any existing unit test that would need to be adjusted when this option is toggled on or off?

Collaborator Author

@kwen2501 May 21, 2025

The flag is more of an internal choice than a contract.
There are several tests that pass device_id, so hopefully they don't break.

Collaborator

@nWEIdia May 23, 2025

I was hoping that this would fix test_non_blocking_with_eager_init,
but with v2.7.1RC (docker pull ghcr.io/pytorch/pytorch-test:2.7.1-cuda12.6-cudnn9-runtime), I am still reproducing the timeout/hang:

root@d70999cd4c34:/my_workspace/wei-pytorch/test/distributed# python test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ...

Collaborator

@nWEIdia May 23, 2025

I am switching to a platform with a better OS (I was previously using an Ubuntu 20.04-based system, which may have known issues).
But now I am encountering this with v2.7.1RC:
ModuleNotFoundError: No module named 'torch.distributed._spmd'
Update: I used the wrong (runtime) container and should have used the devel container. Never mind about this command.
cc @atalman

Collaborator Author

The test didn't hang for me on an H100 machine.

Collaborator

Yes, I can confirm that this test does not hang for me on H100 either. I will follow up internally on the potential issues with the Ubuntu 20.04 stack.
Below is the result on H100:
`python3 test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ... ok


Ran 1 test in 10.152s

OK
`

However, this potentially means that even if your change lands on main, the upstream CI may still hang because of the potential OS-related issue. I will double-check the distributed tests on this front (Ubuntu 20.04 or Amazon Linux 2023 + SM75).
Below is what I get from an Ubuntu 20.04-based host + ghcr.io/pytorch/pytorch-test:2.7.1-cuda12.6-cudnn9-devel on T4x2:

root@d6abe3d5c3dd:/workspace/pytorch/test/distributed# time timeout 30 python3 test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ...
real 0m30.040s (i.e. hang)
user 0m4.055s
sys 0m2.664s

+  // }
   // 4th priority: otherwise, nonblocking = false to preserve old behavior
   else {
     useNonblocking_ = false;
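
For readers following the hunk, here is a minimal standalone sketch of the priority chain that the comments in `useNonblocking()` describe. It is not the PyTorch implementation. `NonblockingInputs`, `resolveNonblocking`, `autoSelectOnEagerInit`, and `demoNonblockingSelection` are hypothetical names introduced only for illustration, and the "explicit option" first priority is an assumption, since that branch sits outside the hunk shown here.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for the inputs the selection logic consults.
struct NonblockingInputs {
  std::optional<bool> optionValue;  // assumed 1st priority: explicit option
  std::optional<bool> envValue;     // 2nd priority: environment override
  bool hasBoundDevice = false;      // true when a device is bound (eager init)
};

// Resolves the nonblocking choice in priority order. Passing
// autoSelectOnEagerInit = false models the patched behavior, where the
// 3rd-priority branch is commented out.
bool resolveNonblocking(const NonblockingInputs& in,
                        bool autoSelectOnEagerInit) {
  if (in.optionValue.has_value()) return *in.optionValue;
  if (in.envValue.has_value()) return *in.envValue;
  if (autoSelectOnEagerInit && in.hasBoundDevice) return true;
  return false;  // 4th priority: preserve the old (blocking) behavior
}

// Small self-check showing the effect of disabling the 3rd-priority branch.
void demoNonblockingSelection() {
  NonblockingInputs eager;
  eager.hasBoundDevice = true;
  assert(resolveNonblocking(eager, true) == true);    // before the patch
  assert(resolveNonblocking(eager, false) == false);  // after the patch
  eager.envValue = true;
  assert(resolveNonblocking(eager, false) == true);   // explicit opt-in still wins
}
```

Under these assumptions, the patch only changes the outcome for callers that bind a device and set neither of the higher-priority inputs. An explicit opt-in through the first two priorities is unaffected.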