[FSDP1] fix _same_storage check for DTensor #123617

weifengpy · 2024-04-09T01:45:36Z

For FSDP (SHARD_GRAD_OP + use_orig_params) + TP, the parameters seen in backward are DTensors. However, DTensor.untyped_storage().data_ptr() does not work in _same_storage, so this PR desugars the check to DTensor._local_tensor.untyped_storage().data_ptr(). See #123272.

Credit to @bigning for the original fix. Once this lands, the patch in MosaicML Composer is no longer needed: https://github.com/mosaicml/composer/pull/3175/files
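For readers following along, the unwrapping described above can be sketched without a process group. This is only an illustration, not the code in `_flat_param.py`: `FakeStorage`, `FakeTensor`, `FakeDTensor`, `unwrap_local`, and `same_storage_sketch` are hypothetical names, and the stand-in raising on `untyped_storage()` is an assumption that mimics "does not work" rather than the exact failure a real DTensor produces.

```python
class FakeStorage:
    """Hypothetical stand-in for an untyped storage: it only remembers an address."""

    def __init__(self, ptr):
        self._ptr = ptr

    def data_ptr(self):
        return self._ptr


class FakeTensor:
    """Hypothetical stand-in for a plain tensor backed by one storage."""

    def __init__(self, storage):
        self._storage = storage

    def untyped_storage(self):
        return self._storage


class FakeDTensor:
    """Hypothetical stand-in for DTensor: the local shard lives in `_local_tensor`."""

    def __init__(self, local_tensor):
        self._local_tensor = local_tensor

    def untyped_storage(self):
        # Assumption: model "does not work" as an error raised by the wrapper.
        raise RuntimeError("storage is not accessible on the wrapper")


def unwrap_local(t):
    # Use the local shard when the object carries one; otherwise use it as-is.
    return getattr(t, "_local_tensor", t)


def same_storage_sketch(a, b):
    # Compare storage addresses of the local tensors, never of the wrappers.
    a, b = unwrap_local(a), unwrap_local(b)
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()


shared = FakeStorage(ptr=0x1000)
flat_param_view = FakeTensor(shared)
dtensor_param = FakeDTensor(FakeTensor(shared))
unrelated = FakeTensor(FakeStorage(ptr=0x2000))

print(same_storage_sketch(flat_param_view, dtensor_param))
print(same_storage_sketch(dtensor_param, unrelated))
```

The sketch unwraps by attribute lookup so that it runs with the stand-ins; an `isinstance` check against DTensor would be the more explicit alternative.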

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

pytorch-bot · 2024-04-09T01:45:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123617

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e9861b0 with merge base 61be884:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu

LGTM! We may need to move the DTensor import into _same_storage() to avoid breaking internal.

awgu · 2024-04-09T17:06:14Z

torch/distributed/fsdp/_flat_param.py

 import torch.nn as nn
 import torch.nn.functional as F
 from torch import Tensor
+from torch.distributed._tensor import DTensor


This is not a great state to be in, but I always remember that we cannot import DTensor at the top-level of this file, or else we may break some internal torch package or torch deploy thing.

I am not too familiar with the issue though :/

Is the breakage caused by circular dependency, if you can recall?

I honestly cannot remember :(

Got it. I will import DTensor inside the function.
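A minimal sketch of the function-level import pattern discussed in this thread. The helper name `is_dtensor` is hypothetical, and the `ImportError` fallback is an assumption added only so the snippet runs where torch or its distributed package is not installed; it is not claimed to be part of the actual change.

```python
def is_dtensor(obj):
    """Return True if obj is a DTensor, importing DTensor only when called."""
    try:
        # Deferred import: nothing DTensor-related is loaded when the
        # enclosing module is imported, only when this function runs.
        from torch.distributed._tensor import DTensor
    except ImportError:
        # Assumption for this sketch: without torch.distributed, nothing
        # can be a DTensor, so report False instead of failing.
        return False
    return isinstance(obj, DTensor)


print(is_dtensor(object()))
```

The trade-off is a small lookup cost on each call in exchange for keeping the module importable in environments where a top-level DTensor import causes problems.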

awgu · 2024-04-09T17:25:08Z

test/distributed/fsdp/test_fsdp_tp_integration.py

        fsdp_world_size = self.world_size // tp_world_size
        assert (
-            type(tp_fsdp_model) is FSDP and len(list(tp_fsdp_model.parameters())) == 1
+            type(tp_fsdp_model) is FSDP


IIUC, this change is to make the check stricter to more accurately reflect our assumptions?

This is to make it work for use_orig_params=True, where len(list(tp_fsdp_model.parameters())) > 1.

awgu · 2024-04-09T17:26:05Z

test/distributed/fsdp/test_fsdp_tp_integration.py

+            torch.cat(
+                [
+                    torch.flatten(param.grad)
+                    if param.grad is not None


Is this change needed for use_orig_params=True specifically?

yes, for use_orig_params=True specifically

awgu · 2024-04-09T17:26:37Z

test/distributed/fsdp/test_fsdp_tp_integration.py

-        flat_param.grad[~sharded_mask] = grad[~sharded_mask]
-        # Average *all* gradient elements to match the FSDP only semantics
-        flat_param.grad /= tp_world_size
+        for flat_param in tp_fsdp_model.params:


Is len(tp_fsdp_model.params) > 1 iff use_orig_params=True?

that's right

weifengpy · 2024-04-10T07:22:49Z

@pytorchmergebot merge

pytorchmergebot · 2024-04-10T07:24:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

@bigning

for FSDP (SHARD_GRAD_OP + use_orig_params) + TP, params in the backward are DTensors. However, ``DTensor.untyped_storage().data_ptr()`` does not work in ``_same_storage``. Thus desugar to ``DTensor._local_tensor.untyped_storage().data_ptr()`` pytorch#123272 credit to @bigning for the original fix. after landing, we would not need patching in mosaic composer https://github.com/mosaicml/composer/pull/3175/files Pull Request resolved: pytorch#123617 Approved by: https://github.com/awgu

mvpatel2000 · 2024-05-02T21:04:23Z

@weifengpy do you think we can include in torch 2.3.1?
#125425

weifengpy · 2024-05-02T21:23:16Z

@mvpatel2000 I just checked: I need to cherry-pick this commit, otherwise torch 2.3.1 won't include the fix. I will file a PR to see if we can make it.

[FSDP1] fix _same_storage check for DTensor

845cfe9

weifengpy requested a review from a team as a code owner April 9, 2024 01:45

pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Apr 9, 2024

weifengpy marked this pull request as draft April 9, 2024 01:45

remove world_size overwrite

b6d81f8

weifengpy marked this pull request as ready for review April 9, 2024 16:52

weifengpy requested a review from awgu April 9, 2024 16:52

awgu approved these changes Apr 9, 2024

import DTensor inside function

e9861b0

pytorch-bot bot added the ciflow/trunk label Apr 10, 2024

pytorchmergebot added the merging label Apr 10, 2024

pytorchmergebot added the Merged label Apr 10, 2024

pytorchmergebot closed this in d60135e Apr 10, 2024

pytorchmergebot removed the merging label Apr 10, 2024

weifengpy mentioned this pull request Apr 11, 2024

FSDP + DTensor is not working with SHARD_GRAD_OP + use_orig_params #123272

Closed

This was referenced May 16, 2024

[FSDP1] fix _same_storage check for DTensor (#123617) #126464

Closed

[v2.3.1] Release Tracker #125425

Closed

weifengpy mentioned this pull request May 23, 2024

[FSDP1] fix _same_storage check for DTensor (#123617) #126955

Closed

weifengpy mentioned this pull request May 23, 2024

[FSDP1] fix _same_storage check for DTensor (#123617) #126957

Merged

5 participants
