
@wz337 wz337 commented Mar 13, 2024

Stack from ghstack (oldest at bottom):

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @tianyu-l @wconstab @yf225 @chauhang @LucasLLC

Thanks @fegin for removing the fsdp root module check in DCP to unblock test updates. #121544

This PR adds "optimizer_class" as a kwarg to the subtests of the following tests, so that AdamW is covered as an option.

  • test_fsdp
  • test_compiled_fsdp
  • test_fsdp2
  • test_ddp
  • test_fsdp_ddp
  • test_cpu_offload_full_state_dict

In addition, we temporarily remove the two _verify_osd_by_load calls in _test_save_load, as state dict loading seems to affect parameters. Issue #121186 keeps track of this.
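As a rough illustration of what parameterizing subtests by "optimizer_class" means, here is a minimal, torch-free sketch. The `run_subtests` helper below is a hypothetical stand-in written for this example, not PyTorch's own test utility, and the `Adam`/`AdamW` classes are placeholders for `torch.optim.Adam` and `torch.optim.AdamW`. The `use_orig_params` key is likewise an assumed second knob, included only to show how combinations expand.

```python
import itertools
from typing import Any, Callable, Dict, List, Tuple


def run_subtests(subtest_config: Dict[str, List[Any]], test_fn: Callable[..., None]) -> int:
    """Hypothetical helper: call test_fn once per combination of config values."""
    keys = list(subtest_config)
    count = 0
    for values in itertools.product(*(subtest_config[k] for k in keys)):
        test_fn(**dict(zip(keys, values)))
        count += 1
    return count


# Placeholders for torch.optim.Adam / torch.optim.AdamW (assumption for this sketch).
class Adam:
    def __init__(self, params: List[float], lr: float) -> None:
        self.params, self.lr = params, lr


class AdamW(Adam):
    pass


seen: List[Tuple[bool, str]] = []


def _test_fsdp(use_orig_params: bool, optimizer_class: type) -> None:
    # The subtest builds its optimizer from the class it was handed,
    # instead of hard-coding a single optimizer.
    optim = optimizer_class([0.0, 1.0], lr=1e-3)
    seen.append((use_orig_params, type(optim).__name__))


n = run_subtests(
    {"use_orig_params": [True, False], "optimizer_class": [Adam, AdamW]},
    _test_fsdp,
)
print(n, seen)
```

Each existing subtest configuration is run once per optimizer class, so adding AdamW doubles the number of combinations without duplicating test bodies.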

[ghstack-poisoned]

pytorch-bot bot commented Mar 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121774

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0e57f15 with merge base 522d972:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label Mar 13, 2024
wz337 added a commit that referenced this pull request Mar 13, 2024
ghstack-source-id: 3d62e82
Pull Request resolved: #121774
@wz337 wz337 changed the title from "add AdamW to test_state_dict" to "[DCP][DSD] Add AdamW to distributed state dict unit tests" Mar 13, 2024
@github-actions github-actions bot added the oncall: distributed and module: distributed_checkpoint labels Mar 13, 2024
@wz337 wz337 requested review from awgu and fegin March 13, 2024 00:24
@wz337 wz337 added the ciflow/trunk label Mar 13, 2024
     if compile_model:
         dist_model = torch.compile(dist_model)
-    dist_optim = torch.optim.Adam(dist_model.parameters(), lr=1e-3)
+    dist_optim = optimizer_class(dist_model.parameters(), lr=1e-3)

@awgu awgu Mar 14, 2024


In these unit tests, are we comparing FSDP2 against a non-FSDP2 (e.g. DDP) model?

If so, maybe we need to pass foreach=True to the FSDP2 optimizer for now to get closer numeric parity, since otherwise the FSDP2 optimizer would use the foreach=False path, which is a different implementation. (Just mentioning this in case it might be affecting numerics here.)
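A minimal sketch of that suggestion, assuming plain CPU modules rather than FSDP2-sharded parameters: both optimizers receive the same explicit `foreach=True`, so neither falls back to a different code path by default. The model, input shape, and the `optim_kwargs` name are illustrative choices, not taken from the test file.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two identical models standing in for the reference and the distributed copy.
ref_model = nn.Linear(4, 4)
dist_model = copy.deepcopy(ref_model)

optimizer_class = torch.optim.AdamW
# Pin the implementation explicitly instead of relying on the default selection.
optim_kwargs = {"lr": 1e-3, "foreach": True}
ref_optim = optimizer_class(ref_model.parameters(), **optim_kwargs)
dist_optim = optimizer_class(dist_model.parameters(), **optim_kwargs)

# One identical training step on each model.
inp = torch.randn(8, 4)
for model, optim in ((ref_model, ref_optim), (dist_model, dist_optim)):
    optim.zero_grad()
    model(inp).sum().backward()
    optim.step()

# Largest elementwise gap between the two sets of parameters after the step.
with torch.no_grad():
    max_diff = max(
        (p_ref - p_dist).abs().max().item()
        for p_ref, p_dist in zip(ref_model.parameters(), dist_model.parameters())
    )
print(f"max parameter difference after one step: {max_diff:.3e}")
```

Whether this closes any numeric gap in the actual FSDP2 tests is not established by this sketch; it only shows how the kwarg would be threaded through alongside `optimizer_class`.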



wz337 added a commit that referenced this pull request Mar 14, 2024
ghstack-source-id: ed42d1b
Pull Request resolved: #121774

wz337 commented Mar 15, 2024

@pytorchmergebot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions github-actions bot deleted the gh/wz337/6/head branch April 15, 2024 02:53

Labels

ciflow/trunk, Merged, oncall: distributed, topic: not user facing
