[FSDP2] Accumulated in reduce_dtype if not syncing grads #125191
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125191
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 4 Unrelated Failures as of commit 0505ac7 with merge base 935a946.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
I was confused about CPU offloading + `fsdp_param.unsharded_accumulated_grad_data`, but it turns out not to matter: CPU offloading is only applied to sharded params.
For microbatching use cases (e.g. PP), we may use fp32 reduce-scatter (i.e. `MixedPrecisionPolicy(reduce_dtype=torch.float32)`), where we want to accumulate the unsharded gradients in fp32 across microbatches until reduce-scattering in fp32 upon the last microbatch.

Note that the `unsharded_param` is in bf16, so we must save the fp32 accumulated gradient to an attribute different from `.grad`. Moreover, saving a new attribute on the `torch.Tensor` leads to some annoying type checking issues (where the attribute may not be defined), so this PR prefers to save the attribute on the `FSDPParam` class instead.

One could argue that this behavior should be configurable, but since I think for large-scale training, everyone is leaning toward fp32 accumulation across microbatches, let us avoid adding another argument for now.

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k
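For context, a minimal sketch of the microbatching pattern this targets, assuming the FSDP2 `fully_shard` API with a fp32 `reduce_dtype` policy and `set_requires_gradient_sync` to skip the reduce-scatter on non-final microbatches (the setup names here are illustrative, not taken from this PR's tests):

```python
import torch
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

# Assumed setup: `model` is an nn.Module and the default process group is
# already initialized (e.g. torchrun + dist.init_process_group).
policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,   # unsharded params (and their grads) in bf16
    reduce_dtype=torch.float32,   # reduce-scatter (and accumulation) in fp32
)
fully_shard(model, mp_policy=policy)

def train_step(model, microbatches, optimizer):
    for i, batch in enumerate(microbatches):
        is_last = i == len(microbatches) - 1
        # Skip the reduce-scatter for all but the last microbatch; with this PR,
        # the unsharded gradients accumulate in reduce_dtype (fp32) on FSDPParam
        # until the last microbatch triggers the fp32 reduce-scatter.
        model.set_requires_gradient_sync(is_last)
        loss = model(batch).sum()
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```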
_sharded_post_forward_param_data: Optional[torch.Tensor]  # 1D
_sharded_post_forward_param: Optional[nn.Parameter]  # ND
_unsharded_param: nn.Parameter  # ND
unsharded_accumulated_grad: Optional[torch.Tensor]  # ND
nit: keep this as a private variable to align with all the other fields?
Sorry, I think the notion of private and public is non-obvious for the `FSDPParam` class. The way I have it is that if `FSDPParamGroup` should access an attribute on `FSDPParam`, then that attribute should be public. (Note that `FSDPParam` will never be accessed publicly by the user, as there is no way to do that today and there should not be one in the future.)

For this attribute and the current implementation, `FSDPParamGroup` needs to check whether it is `None` to know if there is an unsharded accumulated gradient to reduce-scatter, so I have it as public.
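To make the public-attribute usage concrete, here is a simplified sketch of the accumulate-then-check pattern being described (illustrative only, not the actual `FSDPParam`/`FSDPParamGroup` code):

```python
# Illustrative sketch only; names mirror the discussion, not the FSDP2 source.
def accumulate_unsharded_grad(fsdp_param, new_grad, reduce_dtype):
    # Keep the running accumulation in reduce_dtype (e.g. fp32), separate from
    # .grad, since the unsharded parameter itself may be bf16.
    if fsdp_param.unsharded_accumulated_grad is None:
        fsdp_param.unsharded_accumulated_grad = new_grad.to(reduce_dtype)
    else:
        fsdp_param.unsharded_accumulated_grad += new_grad.to(reduce_dtype)

def grads_to_reduce_scatter(fsdp_params):
    # FSDPParamGroup-side check: a None attribute means that parameter has no
    # accumulated gradient to contribute to the fp32 reduce-scatter.
    return [
        p.unsharded_accumulated_grad
        for p in fsdp_params
        if p.unsharded_accumulated_grad is not None
    ]
```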
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 6 checks: pull / linux-focal-py3.12-clang10 / test (default, 1, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (default, 1, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (crossref, 1, 2, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (default, 1, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (crossref, 1, 2, linux.2xlarge), trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Failures were all from the open registration test.
…125269)

1. This PR removes the logic for saving and removing the pre-backward hook handles (which are registered via `register_multi_grad_hook(mode="any")`).
2. This PR removes the logic for _trying_ to guard against mistargeted prefetches that relies on querying if the engine will execute the module output tensors' `grad_fn`s. (See #118118 for the original motivation.)

For 1, the logic was error prone since it relied on `set_is_last_backward(False)` being set correctly, or else pre-backward hooks could be de-registered too early. We would prefer to match the hook lifetimes with that of the autograd graph. This solves a bug with a 1f1b interleaved schedule.

If we directly remove the manual saving/removing hook handle logic, then we have a ref cycle where the tensors' `grad_fn`s are passed to the hook function. We decide to simply remove this `grad_fn` logic since (1) it cannot perfectly prevent mistargeted prefetches and (2) it introduces undesired complexity. In the future, we may prefer a different mechanism to override the prefetching for more complex/dynamic use cases.

Pull Request resolved: #125269
Approved by: https://github.com/weifengpy
ghstack dependencies: #125190, #125191
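As a standalone illustration of the hook mechanism referenced above (not the FSDP2 code itself), `torch.autograd.graph.register_multi_grad_hook` with `mode="any"` fires a callback the first time any of the given tensors receives a gradient; the returned handle supports manual `.remove()`, which is the bookkeeping this PR stops doing in favor of letting hook lifetimes follow the autograd graph:

```python
import torch
from torch.autograd.graph import register_multi_grad_hook

# Stand-ins for a module's forward outputs.
x = torch.randn(4, requires_grad=True)
outs = (x * 2, x + 1)

def pre_backward_hook(grad: torch.Tensor) -> None:
    # With mode="any", this runs once, when the first of `outs` gets a gradient.
    print("pre-backward fired; grad shape:", tuple(grad.shape))

handle = register_multi_grad_hook(outs, pre_backward_hook, mode="any")
(outs[0].sum() + outs[1].sum()).backward()
handle.remove()  # the kind of manual handle management #125269 removes
```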
…25191)

Pull Request resolved: pytorch#125191
Approved by: https://github.com/weifengpy
ghstack dependencies: pytorch#125190
Stack from ghstack (oldest at bottom):

#125191 [FSDP2] Accumulated in reduce_dtype if not syncing grads

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k