Enable XPU distributed test for PT2.8 #149916

daisyden · 2025-03-25T03:27:23Z

Fixes #ISSUE_NUMBER

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @kwen2501 @c-p-i-o


pytorch-bot · 2025-03-25T03:27:27Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149916

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

CI workflows being skipped on PR

✅ No Failures

As of commit 0e7a7b6 with merge base a09a3f4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


guangyey · 2025-03-26T03:48:31Z

test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py

@@ -19,6 +19,7 @@
    TransformerBlock,
 )

+device_type = torch.accelerator.current_accelerator().type


This line will raise an error if `torch.accelerator.current_accelerator()` returns None, because `None` has no `.type` attribute.
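One way to address this concern is to guard the lookup before reading `.type`. The following is only a sketch, not the change made in this PR: the helper name `resolve_device_type` and the `"cpu"` fallback are assumptions made for illustration.

```python
def resolve_device_type(accelerator, default="cpu"):
    """Return accelerator.type, or `default` when no accelerator was detected."""
    # `default="cpu"` is an assumed fallback, not something the PR specifies.
    return default if accelerator is None else accelerator.type


try:
    import torch

    # current_accelerator() may return None, so pass it through the guard
    # instead of reading `.type` directly.
    device_type = resolve_device_type(torch.accelerator.current_accelerator())
except (ImportError, AttributeError):
    # torch is not installed, or is too old to provide torch.accelerator.
    device_type = "cpu"
```

With this guard, importing the test module on a machine without an accelerator yields a plain string instead of failing at import time.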

guangyey · 2025-03-26T03:49:47Z

test/distributed/_composable/fsdp/test_fully_shard_comm.py

-        default_stream = torch.cuda.current_stream()
-        stream = torch.cuda.Stream()
+        default_stream = torch.accelerator.current_stream()
+        stream = torch.xpu.Stream() if device_type == "xpu" else torch.cuda.Stream()


Suggested change

-        stream = torch.xpu.Stream() if device_type == "xpu" else torch.cuda.Stream()
+        stream = torch.Stream()
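The suggestion replaces the per-backend branch with the generic `torch.Stream` class. A minimal sketch of how a test helper might build such a stream without naming a backend is shown below. The helper name `make_side_stream` and the choice to return `None` when no accelerator is available are assumptions for illustration; the torch module is passed in as a parameter only so the helper can be exercised without hardware.

```python
def make_side_stream(torch_mod):
    """Create a stream on the current accelerator, or return None if there is none."""
    if not torch_mod.accelerator.is_available():
        # Assumed behaviour: callers skip stream-specific checks when this is None.
        return None
    acc = torch_mod.accelerator.current_accelerator()
    # torch.Stream is backend-agnostic, so no "xpu"/"cuda" branch is needed here.
    return torch_mod.Stream(device=acc)


try:
    import torch

    side_stream = make_side_stream(torch)
except (ImportError, AttributeError):
    # torch is not installed, or is too old to provide torch.accelerator.
    side_stream = None
```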

guangyey · 2025-03-26T04:00:57Z

@daisyden This PR involves a significant code change, so I prefer to mark it as a draft until it's CI-ready and the internal review comments have been addressed. The main goal of this PR is to generalize the unit tests to be device-agnostic and enable XPU support on top of it. Would it be possible to split this PR into two separate ones: one focused on the generalization, and the other aimed at enabling XPU?


pytorch-bot · 2025-04-25T02:18:12Z

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.


daisyden and others added 17 commits May 10, 2024 19:43

make skipXPU work

d0d8271

enabled torch-xpu ops in op_db

c791db9

clean up code

f5cbd50

Revert "clean up code"

4d94417

This reverts commit f5cbd50.

Revert "enabled torch-xpu ops in op_db"

6844101

This reverts commit c791db9.

Revert "make skipXPU work"

5051e3c

This reverts commit d0d8271.

merge common code update from https://github.com/Chao1Han/pytorch/pull/12/files#diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98

e2aa92a

Merge branch 'main' of https://github.com/daisyden/pytorch into distributed_2.8

9e83095

merge common code update from https://github.com/Chao1Han/pytorch/pull/12/files#diff-5bb26bd3a23ee3bb540aeadcc0385df2a4e48de39f87ed9ea76b21990738fe98

06dd2aa

Add XPU support for distributed

a4a732b

Signed-off-by: Cheng Penghui <[email protected]>

Merge branch 'distributed_2.8' of https://github.com/daisyden/pytorch into distributed_2.8

6e3f6b8

Merge remote-tracking branch 'upstream/main' into distributed_2.8

5f47367

ported fsdp and _composable/fsdp cases

345d7e6

Support XPU device for DDP test cases

4a5a522

Signed-off-by: Cheng Penghui <[email protected]>

Support XPU device for pipeline cases

20a4456

Signed-off-by: Cheng Penghui <[email protected]>

ported fsdp tests

a90a603

Merge branch 'distributed_2.8' of https://github.com/daisyden/pytorch into distributed_2.8

5b1aff7

daisyden requested review from mruberry and a team as code owners March 25, 2025 03:27

pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels Mar 25, 2025

pytorchbot added the open source label Mar 25, 2025

fixed backend mapping error for register_backend function

44d55b9

Signed-off-by: Cheng Penghui <[email protected]>

guangyey added this to PyTorch Intel Mar 26, 2025

guangyey requested a review from EikanWang March 26, 2025 03:44

guangyey moved this to Pre-Review Required in PyTorch Intel Mar 26, 2025

guangyey reviewed Mar 26, 2025


guangyey marked this pull request as draft March 26, 2025 03:57

Update distributed UT cases

7dade1f

Signed-off-by: Cheng Penghui <[email protected]>

pytorch-bot bot added the module: dynamo label Apr 1, 2025

daisyden and others added 11 commits March 31, 2025 22:38

remove fsdp_kwargs in test_fsdp_memory.py to align with cuda, added requires_nccl_or and requires_nccl_version_or to replace requires_nccl and requires_nccl_version when xccl test is enabled on a test

580aaee

Merge branch 'upstream_main4' into distributed_2.8

c0f5713

Add test_dynamo_distributed cases

6dedbe3

Signed-off-by: Cheng Penghui <[email protected]>

Merge remote-tracking branch 'upstream/distributed_2.8' into distributed_2.8

20d074c

update test_tp_random_state.py

124ff16

Signed-off-by: Cheng Penghui <[email protected]>

Merge from main branch

0bea112

Signed-off-by: Cheng Penghui <[email protected]>

support xccl in with_comms

7409ade

Merge branch 'distributed_2.8' of https://github.com/daisyden/pytorch into distributed_2.8

636cbff

Merge branch 'upstream_main3' into distributed_2.8

0d5a86b

Enabled UT in test/distributed/tensor

3826e30

Signed-off-by: Cheng, Penghui <[email protected]>

Merge branch 'distributed_2.8' of https://github.com/daisyden/pytorch into distributed_2.8

cb711b7

pytorch-bot bot added the module: inductor label Apr 9, 2025

daisyden and others added 9 commits April 9, 2025 05:09

refine fsdp2 test case for xpu

d6cd1b3

Merge branch 'distributed_2.8' of https://github.com/daisyden/pytorch into distributed_2.8

624be3a

fix some issues in test case, cuda specific code, world_size 8, etc.

8d8c5fe

merge from main branch

1cf7887

Signed-off-by: Cheng Penghui <[email protected]>

Merge remote-tracking branch 'upstream/distributed_2.8' into distributed_2.8

41475ac

Change world size in test_device_mesh.py

0628c76

Signed-off-by: Cheng, Penghui <[email protected]>

Merge remote-tracking branch 'origin/distributed_2.8' into distributed_2.8

b0d935d

Merge remote-tracking branch 'upstream/main' into distributed_2.8

58eb87e

Enabled some UT cases of distributed

e558eaa

Signed-off-by: Cheng, Penghui <[email protected]>

etaf added the ciflow/xpu label Apr 25, 2025

pytorch-bot bot removed the ciflow/xpu label Apr 25, 2025

PenghuiCheng added 2 commits April 29, 2025 03:07

enable UT case in _shard and _tool folder

83ac56e

Signed-off-by: Cheng, Penghui <[email protected]>

Fixed hard code error for world_size 8

0e7a7b6

Signed-off-by: Cheng, Penghui <[email protected]>
