Thanks to visit codestin.com
Credit goes to github.com

Skip to content

[hotfix] disable problematic cuda tests on rocm builds#45435

Closed
walterddr wants to merge 1 commit into
pytorch:masterfrom
walterddr:hotfix_rocm_test2
Closed

[hotfix] disable problematic cuda tests on rocm builds#45435
walterddr wants to merge 1 commit into
pytorch:masterfrom
walterddr:hotfix_rocm_test2

Conversation

@walterddr
Copy link
Copy Markdown
Contributor

@walterddr walterddr commented Sep 28, 2020

Disable the recent 3 problematic cuda tests with constant timeout issues on amd rocm build/tests

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@walterddr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@walterddr merged this pull request in 48d29c8.

jaglinux added a commit to jaglinux/pytorch that referenced this pull request Nov 13, 2020
If world_size is lesser than or equal to number of GPU's available
then the rank can be directly mapped to corresponding GPU.
This fixes the issue referenced in pytorch#45435 and pytorch#47629

For world_size = 3 and number of GPU's = 8, the rank to GPU mapping
will be 0,2,4. This is due to the introduction of barrier,
(refer pytorch#45181)
the tensors in barrier is mapped to cuda0,1,2 and the tensors in the
actual test cases are mapped to cuda0,2,4 resulting in different streams and
leading to timeout. This issue is specific to default process group.
Issue is not observed in new process group since the streams are created again
after the initial barrier call.

This patch maps the rank to corresponding GPU's when the world_size is
less than or equal to the number of GPU's, in this case 0,1,2

Note: The barrier function in distributed_c10d.py should include new parameter
to specify the tensor or rank to GPU mapping. In that case, this patch will be
redundant but harmless since the tests can specify the tensors with appropriate
GPU rankings.
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2020
Summary:
If world_size is lesser than or equal to number of GPU's available
then the rank can be directly mapped to corresponding GPU.
This fixes the issue referenced in #45435 and #47629

For world_size = 3 and number of GPU's = 8, the rank to GPU mapping
will be 0,2,4. This is due to the introduction of barrier,
(refer PR #45181)
the tensors in barrier is mapped to cuda0,1,2 and the tensors in the
actual test cases are mapped to cuda0,2,4 resulting in different streams and
leading to timeout. This issue is specific to default process group.
Issue is not observed in new process group since the streams are created again
after the initial barrier call.

This patch maps the rank to corresponding GPU's when the world_size is
less than or equal to the number of GPU's, in this case 0,1,2

Note: The barrier function in distributed_c10d.py should include new parameter
to specify the tensor or rank to GPU mapping. In that case, this patch will be
redundant but harmless since the tests can specify the tensors with appropriate
GPU rankings.

Fixes #47629

Pull Request resolved: #47898

Reviewed By: smessmer

Differential Revision: D24956021

Pulled By: rohan-varma

fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Disable the recent 3 cuda tests on amd rocm build/tests

Pull Request resolved: pytorch#45435

Reviewed By: malfet

Differential Revision: D23962881

Pulled By: walterddr

fbshipit-source-id: ad4ea1f835b4722cdbdce685806cfd64376cc16f
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
If world_size is lesser than or equal to number of GPU's available
then the rank can be directly mapped to corresponding GPU.
This fixes the issue referenced in pytorch#45435 and pytorch#47629

For world_size = 3 and number of GPU's = 8, the rank to GPU mapping
will be 0,2,4. This is due to the introduction of barrier,
(refer PR pytorch#45181)
the tensors in barrier is mapped to cuda0,1,2 and the tensors in the
actual test cases are mapped to cuda0,2,4 resulting in different streams and
leading to timeout. This issue is specific to default process group.
Issue is not observed in new process group since the streams are created again
after the initial barrier call.

This patch maps the rank to corresponding GPU's when the world_size is
less than or equal to the number of GPU's, in this case 0,1,2

Note: The barrier function in distributed_c10d.py should include new parameter
to specify the tensor or rank to GPU mapping. In that case, this patch will be
redundant but harmless since the tests can specify the tensors with appropriate
GPU rankings.

Fixes pytorch#47629

Pull Request resolved: pytorch#47898

Reviewed By: smessmer

Differential Revision: D24956021

Pulled By: rohan-varma

fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants