[device_mesh] improve device selection logic #150897

Open
wanchaol wants to merge 8 commits into base: gh/wanchaol/370/base

Conversation

wanchaol
Collaborator

@wanchaol wanchaol commented Apr 9, 2025

Stack from ghstack (oldest at bottom):

As titled, this PR improves the device selection logic for the case where
the user did not set the device before calling the DeviceMesh constructor.
As a device manager, DeviceMesh should try to pick a sensible device on the
user's behalf.

The behavior of set_device before this PR:

  • If the user calls init_process_group to initialize a world process group, we assume the user has already called set_device, and we do not set the device for them.
  • If the user does not initialize a world process group themselves, we initialize one for them and follow a heuristic to set the device.
    This is OK, but the set_device heuristic sometimes does not work well (e.g. if the user sets CUDA_VISIBLE_DEVICES).

So this PR changes the device selection logic to:

  • If the default CUDA context is already initialized by the time we initialize DeviceMesh, we assume the user has already run some CUDA operation and therefore must have selected the device themselves.
  • Otherwise, we check whether the environment variables "LOCAL_RANK" and "WORLD_SIZE" have been set by the launcher (e.g. torchrun). If so, we use "LOCAL_RANK" to set the device for the current process, which is standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
  • Otherwise, we warn the user about the situation and fall back to the old heuristic.
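
The selection order above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the code in this PR: `select_device_index` and its parameters are hypothetical names, and the final fallback (`rank % device_count`) is an assumed stand-in for the old heuristic, which the description does not spell out.

```python
import os
import warnings
from typing import Mapping, Optional


def select_device_index(
    context_initialized: bool,
    rank: int,
    device_count: int,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[int]:
    """Return the device index to set, or None to leave the device untouched.

    Hypothetical helper: mirrors the three-step order in the PR description.
    """
    if env is None:
        env = os.environ

    # Step 1: a device context already exists, so assume the user chose
    # the device themselves and do not override it.
    if context_initialized:
        return None

    # Step 2: a launcher such as torchrun exports LOCAL_RANK and WORLD_SIZE;
    # use LOCAL_RANK as the device index for this process.
    if "LOCAL_RANK" in env and "WORLD_SIZE" in env:
        return int(env["LOCAL_RANK"])

    # Step 3: warn, then fall back to a heuristic.
    # ASSUMPTION: global rank modulo local device count stands in for
    # the "old heuristic" mentioned in the description.
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    warnings.warn(
        "No initialized device context and no LOCAL_RANK/WORLD_SIZE found; "
        "falling back to rank % device_count."
    )
    return rank % device_count
```

In a real process, `context_initialized` could come from `torch.cuda.is_initialized()`, and a non-None result could be passed to `torch.cuda.set_device`.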

[ghstack-poisoned]

pytorch-bot bot commented Apr 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150897

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 9d5b0ca with merge base 6f6fac6:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 9, 2025
@wanchaol wanchaol added the release notes: distributed (dtensor) release notes category label Apr 9, 2025
wanchaol added 2 commits April 9, 2025 18:13
[ghstack-poisoned]
[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 21, 2025
wanchaol added a commit that referenced this pull request Apr 21, 2025
ghstack-source-id: 8d27c0d
Pull Request resolved: #150897
[ghstack-poisoned]
wanchaol added a commit that referenced this pull request Apr 21, 2025
ghstack-source-id: 7967a39
Pull Request resolved: #150897
[ghstack-poisoned]
wanchaol added a commit that referenced this pull request Apr 21, 2025
ghstack-source-id: 55e85d1
Pull Request resolved: #150897
@wanchaol wanchaol added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 21, 2025
Contributor

@tianyu-l tianyu-l left a comment

Sorry, I don't have enough context on DeviceMesh, so I'm asking some questions before I can review. Meanwhile, @fegin may be able to unblock.

[ghstack-poisoned]
wanchaol added a commit that referenced this pull request May 10, 2025
As titled, this PR improves the device selection logic for the case where
the user did not set the device before calling the DeviceMesh constructor.
As a device manager, DeviceMesh should try to pick a sensible device on the
user's behalf.

The behavior of set_device before this PR:

  • If the user calls init_process_group to initialize a world process group, we assume the user has already called set_device, and we do not set the device for them.
  • If the user does not initialize a world process group themselves, we initialize one for them and follow a heuristic to set the device.
    This is OK, but the set_device heuristic sometimes does not work well (e.g. if the user sets CUDA_VISIBLE_DEVICES).

So this PR changes the device selection logic to:

  • If the default CUDA context is already initialized by the time we initialize DeviceMesh, we assume the user has already run some CUDA operation and therefore must have selected the device themselves.
  • Otherwise, we check whether the environment variables "LOCAL_RANK" and "WORLD_SIZE" have been set by the launcher (e.g. torchrun). If so, we use "LOCAL_RANK" to set the device for the current process, which is standard practice. (This solves the CUDA_VISIBLE_DEVICES issue.)
  • Otherwise, we fall back to the old heuristic.

ghstack-source-id: a96dc0b
Pull Request resolved: #150897
@wanchaol wanchaol requested review from fegin and tianyu-l May 10, 2025 23:20
[ghstack-poisoned]
wanchaol added a commit that referenced this pull request May 10, 2025
ghstack-source-id: 2baca3c
Pull Request resolved: #150897
[ghstack-poisoned]
wanchaol added a commit that referenced this pull request May 11, 2025
ghstack-source-id: 3f555ea
Pull Request resolved: #150897
Labels
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • module: cpu (CPU specific problem, e.g. perf, algorithm)
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • open source
  • release notes: distributed (dtensor) (release notes category)
4 participants