[device_mesh] improve device selection logic #150897

wanchaol · 2025-04-09T03:13:36Z

Stack from ghstack (oldest at bottom):

As titled, this PR improves the device selection logic for the case where
the user did not set the device before calling the DeviceMesh constructor.
As a device manager, DeviceMesh should try to set the device for users in a
sensible way.

The behavior of set_device before this PR:

- If the user calls init_process_group to initialize a world process group, we assume the user has already called set_device, and we don't set the device for them.
- If the user does not initialize a world process group themselves, we initialize one for them and follow a heuristic to set the device.

This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user sets CUDA_VISIBLE_DEVICES).

So this PR changes the device selection logic to:

1. If the default CUDA context is already initialized by the time we initialize DeviceMesh, we assume the user must have called some CUDA operation beforehand and therefore must have selected the device themselves.
2. Otherwise, we check whether the environment variables "LOCAL_RANK" and "WORLD_SIZE" are set by the launcher (e.g. torchrun). If so, we use "LOCAL_RANK" to set the device for the current process, which is standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
3. Otherwise, we warn the user about the situation and fall back to the old heuristic.
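
For reviewers, the selection order above can be sketched in plain Python. This is only an illustration, not the code in torch/distributed/device_mesh.py: `pick_device_index` and its parameters are hypothetical names, the context check is passed in as a boolean rather than queried from CUDA, and the fallback (`global_rank % device_count`) is an assumed stand-in for the old heuristic, which this description does not spell out.

```python
import warnings
from typing import Mapping, Optional


def pick_device_index(
    context_initialized: bool,
    env: Mapping[str, str],
    global_rank: int,
    device_count: int,
) -> Optional[int]:
    """Return the device index to set, or None to leave the device alone.

    Hypothetical helper: all names here are illustrative assumptions.
    """
    # 1. A live device context means the user already ran a device
    #    operation, so their device choice is respected.
    if context_initialized:
        return None

    # 2. Launchers such as torchrun export LOCAL_RANK and WORLD_SIZE;
    #    when both are present, LOCAL_RANK picks the device.
    if "LOCAL_RANK" in env and "WORLD_SIZE" in env:
        return int(env["LOCAL_RANK"])

    # 3. Otherwise warn and fall back. The modulo rule below is an
    #    ASSUMED placeholder for the old heuristic.
    warnings.warn(
        "Device was not set and no launcher environment was found; "
        "falling back to a rank-based heuristic."
    )
    return global_rank % device_count
```

Returning None signals that the current device should be left untouched; a caller would pass any other result to its backend's set-device call.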

pytorch-bot · 2025-04-09T03:13:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150897

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 9d5b0ca with merge base 6f6fac6:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for torch/distributed/device_mesh.py:
pull / linux-focal-py3_9-clang9-xla / build (gh)
ninja: build stopped: subcommand failed

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge) (gh) (#144480)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

tianyu-l

Sorry, I don't have enough context on DeviceMesh, so I'm asking some questions before I can review. Meanwhile, @fegin may be able to unblock.

torch/distributed/device_mesh.py

Update

a7fb0e9

pytorch-bot bot added the "oncall: distributed" label Apr 9, 2025

This was referenced Apr 9, 2025

Fix DTensorTestBase to barrier with device ids #150896

Closed

[device_mesh] replace dim_group_info with group_name #150898

Open

wanchaol added the "release notes: distributed (dtensor)" label Apr 9, 2025

pytorchbot added the open source label Apr 9, 2025

wanchaol added 2 commits April 9, 2025 18:13

Update

24f2ab1

Update

b392624

pytorch-bot bot added the "module: cpu" label Apr 21, 2025

Update

979cd45

Update

0f1daaa

wanchaol requested review from fegin, wconstab, wz337 and tianyu-l April 21, 2025 20:59

wanchaol added the "ciflow/trunk" label Apr 21, 2025

tianyu-l reviewed Apr 29, 2025

torch/distributed/device_mesh.py (resolved)

torch/distributed/device_mesh.py (resolved)

fegin reviewed Apr 30, 2025

torch/distributed/device_mesh.py (outdated, resolved)

fegin reviewed Apr 30, 2025

torch/distributed/device_mesh.py (outdated, resolved)

Update

29374eb

wanchaol requested review from fegin and tianyu-l May 10, 2025 23:20

Update

0667d5e

Update

9d5b0ca
