[c10d] Fix extra CUDA context created by barrier #149144
Fixes #149119. In ProcessGroup.hpp, we create a dummy tensor for dispatching, which requires a correct device index. This PR uses the `device_id` given by the user when calling `init_process_group`. It also uses `torch._C._get_accelerator()` to determine the device type.
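To make the failure mode concrete, here is a toy model in plain Python, not PyTorch code. It assumes, as is usual for CUDA, that the first allocation on a device creates a context there. `FakeRuntime`, `barrier_dummy_tensor`, and the fallback to index 0 are hypothetical names and behavior chosen for illustration; the real logic lives in ProcessGroup.hpp and `barrier`.

```python
from typing import Optional, Set


class FakeRuntime:
    """Toy stand-in for a GPU runtime: records which devices hold a context."""

    def __init__(self) -> None:
        self.contexts: Set[int] = set()

    def allocate(self, device_index: int) -> None:
        # Assumption: the first allocation on a device creates a context there.
        self.contexts.add(device_index)


def barrier_dummy_tensor(runtime: FakeRuntime, bound_device_id: Optional[int]) -> int:
    """Allocate the dummy tensor used for dispatch and return the index used.

    Hypothetical behavior: with no bound device, fall back to index 0.
    """
    index = bound_device_id if bound_device_id is not None else 0
    runtime.allocate(index)
    return index


# Rank 1 does its real work on device 1.
without_hint = FakeRuntime()
without_hint.allocate(1)                  # the rank's own work
barrier_dummy_tensor(without_hint, None)  # dummy tensor lands on device 0

with_hint = FakeRuntime()
with_hint.allocate(1)                     # the rank's own work
barrier_dummy_tensor(with_hint, 1)        # dummy tensor stays on device 1

print(sorted(without_hint.contexts))  # [0, 1]
print(sorted(with_hint.contexts))     # [1]
```

In this model, the rank without a bound device ends up holding a second context on device 0, which is the kind of extra context the PR title refers to.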
    elif group.bound_device_id is not None:
        # Use device id from `init_process_group(device_id=...)`
        opts.device = group.bound_device_id
    elif device.type == "cpu" or get_backend(group) == Backend.GLOO:
Is there a way to avoid depending on specific backend names/types? This makes it hard to add new ones that are compatible with core PT -- I've been trying to clean these up for torchft
Yeah, I hope there is a way. This specific code is for the case where the user is on a GPU machine but only wants to use the CPU to do some stuff...
          the default process group will be used.
      async_op (bool, optional): Whether this op should be an async op
-     device_ids ([int], optional): List of device/GPU ids.
+     device_ids ([int], optional): List of device/GPU ids. Only one id is expected.
Can we change this to "Only the first ID is used."?
I do mean only one is expected, because now we are expecting one device per thread. Some of the API signatures came from the old days.
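As a rough illustration of the "only one id is expected" contract, a caller-side check might look like the sketch below. `single_device_id` is a hypothetical helper, and the strict length check is an assumption made for illustration; the excerpt above shows only the docstring change, not how `barrier` treats longer lists.

```python
from typing import List, Optional


def single_device_id(device_ids: Optional[List[int]]) -> Optional[int]:
    """Return the one device id the caller passed, or None if none was given.

    Hypothetical helper: rejects lists that do not hold exactly one id.
    """
    if device_ids is None:
        return None
    if not isinstance(device_ids, list):
        raise TypeError("device_ids should be a list of ints")
    if len(device_ids) != 1:
        raise ValueError(f"expected exactly one device id, got {len(device_ids)}")
    return device_ids[0]


print(single_device_id([3]))   # 3
print(single_device_id(None))  # None
```

With one device per thread, a list longer than one element has no obvious meaning, which is why this sketch rejects it instead of silently picking an entry.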
        # Use device id from `init_process_group(device_id=...)`
        opts.device = group.bound_device_id
    elif device.type == "cpu" or get_backend(group) == Backend.GLOO:
        opts.device = torch.device("cpu")
Will Gloo fail if it's not a CPU device?
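The branch order discussed in this thread can be sketched in plain Python, with strings standing in for `torch.device` objects. `pick_barrier_device` is a hypothetical function; branches 2 and 3 mirror the diff excerpt, while branch 1 (explicit `device_ids`) and the final fallback are assumptions, since they are not visible in the excerpt.

```python
from typing import List, Optional


def pick_barrier_device(
    device_ids: Optional[List[int]],
    bound_device_id: Optional[str],
    default_device_type: str,
    backend: str,
) -> Optional[str]:
    """Pick the device for barrier's dummy tensor, in priority order."""
    # 1. Assumed first branch (not shown in the excerpt): explicit device_ids.
    if device_ids is not None:
        return f"{default_device_type}:{device_ids[0]}"
    # 2. From the diff: device bound via init_process_group(device_id=...).
    if bound_device_id is not None:
        return bound_device_id
    # 3. From the diff: a CPU device type or the Gloo backend selects the CPU.
    if default_device_type == "cpu" or backend == "gloo":
        return "cpu"
    # 4. Assumed: no decision here; a later fallback would have to choose.
    return None


print(pick_barrier_device(None, "cuda:1", "cuda", "nccl"))  # cuda:1
print(pick_barrier_device(None, None, "cuda", "gloo"))      # cpu
```

The second call models the case raised earlier in the review: a GPU machine where the user only wants the CPU, which branch 3 detects through the backend name.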
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours).
Merge failed. Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test
The failure seems to be an issue with the CI instance and is unrelated.
Merge started: Your change will be merged while ignoring the following 1 checks: linux-binary-manywheel / manywheel-py3_9-cuda12_8-test / test
Merge failed. Reason: Command
@pytorchbot merge -f "Unrelated failures"
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours).
@pytorchbot revert -m 'Internal failure looks legit' -c ghfirst
@pytorchbot successfully started a revert job.
This reverts commit 457fa82. Reverted #149144 on behalf of https://github.com/huydhn due to: Internal failure looks legit.
@kwen2501 your PR has been successfully reverted.
@pytorchbot merge -f "Internal test was wrong; OSS version of barrier tests passed"
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Refer to pytorch/pytorch#149144. Currently, `dist.barrier` accepts a `device_ids` parameter that doesn't have to be a list. When `device_ids` is not provided or another value is passed, `barrier` uses the device associated with the process group at initialization to perform the synchronization.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o