TorchRun: Option to specify which GPUs to run on #152822

Open
bjourne opened this issue May 5, 2025 · 2 comments
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue

Comments

@bjourne

bjourne commented May 5, 2025

🚀 The feature, motivation and pitch

TorchRun has an --nproc-per-node option to specify how many processes/GPUs to use, but it has no option for specifying which GPUs to use. So if you run torchrun multiple times on the same machine, the same GPUs will be used each time. You can get around that as follows:

CUDA_VISIBLE_DEVICES="2,4,7" torchrun --nnodes=1 --nproc-per-node=3

This works if you have a single-node setup (perhaps not if you have multiple nodes?), but it is unintuitive and error-prone because some of the configuration is passed in an environment variable and some in options. I think it would be better if torchrun had an option such as --bind-devices=2,4,7 for this, supplanting/replacing --nproc-per-node.
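
Until such an option exists, one way to keep the two pieces of configuration in sync is to derive both from a single list of GPU ids. The following is only a minimal sketch of the workaround above: `build_torchrun_launch` is a hypothetical helper, and `train.py` is a placeholder script name that is not part of the original command.

```python
import os
import shlex


def build_torchrun_launch(devices, script, base_env=None):
    """Return (env, argv) for a single-node torchrun launch pinned to `devices`.

    Hypothetical helper: it only mirrors the CUDA_VISIBLE_DEVICES workaround,
    so the process count can never disagree with the GPU list.
    """
    if not devices:
        raise ValueError("at least one GPU id is required")
    if len(set(devices)) != len(devices):
        raise ValueError("duplicate GPU ids")

    # Copy the environment so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)

    # One worker process per visible GPU.
    argv = ["torchrun", "--nnodes=1", f"--nproc-per-node={len(devices)}", script]
    return env, argv


# "train.py" is a placeholder for the actual training script.
env, argv = build_torchrun_launch([2, 4, 7], "train.py")
print(f'CUDA_VISIBLE_DEVICES={env["CUDA_VISIBLE_DEVICES"]}', shlex.join(argv))
```

The returned pair could be passed to `subprocess.run(argv, env=env)`. Note that CUDA renumbers the visible devices from 0, so inside the workers local ranks 0, 1, and 2 would see the physical GPUs 2, 4, and 7 as `cuda:0`, `cuda:1`, and `cuda:2`.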

Alternatives

No response

Additional context

No response

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@jinyouzhi
Contributor

Is it possible to consider this option in conjunction with #115305?

@drisspg drisspg added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 6, 2025
@bjourne
Author

bjourne commented May 7, 2025

Would be nice if --nproc-per-node=gpu:1,3,5 could work for spawning three processes tied to GPUs with ids #1, #3, and #5.
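
To make the proposed syntax concrete, here is a rough sketch of how such a value might be parsed. This is not how torchrun works today: the `gpu:<ids>` form is only the proposal above, `parse_nproc_per_node` is a hypothetical name, and the sketch handles just a plain integer and the proposed form.

```python
def parse_nproc_per_node(value):
    """Parse a plain integer or the *proposed* "gpu:1,3,5" form.

    Returns (nproc, device_ids), where device_ids is None when no explicit
    GPU list was given. Hypothetical sketch, not torchrun's actual parser.
    """
    value = value.strip()
    prefix = "gpu:"
    if value.startswith(prefix):
        # int() raises ValueError on empty or non-numeric tokens.
        ids = [int(tok) for tok in value[len(prefix):].split(",")]
        if any(i < 0 for i in ids):
            raise ValueError(f"negative GPU id in {value!r}")
        if len(set(ids)) != len(ids):
            raise ValueError(f"duplicate GPU id in {value!r}")
        # One process per listed GPU.
        return len(ids), ids
    return int(value), None


nproc, device_ids = parse_nproc_per_node("gpu:1,3,5")
print(nproc, device_ids)
```

With this shape, the process count always follows from the device list, which is the consistency the environment-variable workaround cannot guarantee.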


3 participants