
[fsdp] add an experimental allocator hook for buffers that participate in collective communication #147146


Draft: wants to merge 2 commits into base gh/yifuwang/196/base

Conversation

@yifuwang (Collaborator) commented Feb 13, 2025:

[ghstack-poisoned]

pytorch-bot bot commented Feb 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147146

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 16bd23e with merge base 899066e:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Feb 13, 2025
yifuwang pushed a commit that referenced this pull request Feb 13, 2025
ghstack-source-id: 542f11b
Pull Request resolved: #147146
@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) label Feb 13, 2025
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Feb 13, 2025
…e in collective communication

ghstack-source-id: 7e65d98
Pull Request resolved: #147146
@yifuwang yifuwang changed the title from temp to [fsdp] add an experimental allocator hook for buffers that participate in collective communication Feb 13, 2025
@yifuwang yifuwang requested a review from awgu February 13, 2025 23:22
@yifuwang (Collaborator, Author) commented:

@awgu can you please give suggestions on the API name and how it should be exposed?

@yifuwang yifuwang requested review from weifengpy and minsii February 13, 2025 23:26
Comment on lines +194 to +203
```python
def _set_fsdp_comm_allocator(allocator: _Allocator):
    global _fsdp_comm_allocator
    _fsdp_comm_allocator = allocator


def _get_fsdp_comm_allocator() -> _Allocator:
    if _fsdp_comm_allocator is not None:
        return _fsdp_comm_allocator
    else:
        return torch.empty
```
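The module-level hook pattern in the diff above can be sketched as a self-contained toy. All names here are hypothetical stand-ins for the API under discussion: `bytearray` plays the role of a tensor and `_default_alloc` plays the role of `torch.empty`.

```python
from typing import Callable, Optional

# Hypothetical allocator type: takes a byte count, returns a buffer.
_Allocator = Callable[[int], bytearray]

def _default_alloc(nbytes: int) -> bytearray:
    # Default allocation path when no hook is installed
    # (stand-in for torch.empty).
    return bytearray(nbytes)

_comm_allocator: Optional[_Allocator] = None

def set_comm_allocator(allocator: _Allocator) -> None:
    # Install a process-wide allocator for communication buffers.
    global _comm_allocator
    _comm_allocator = allocator

def get_comm_allocator() -> _Allocator:
    # Fall back to the default when no hook has been set.
    return _comm_allocator if _comm_allocator is not None else _default_alloc

# Usage: route one allocation through a tracking hook.
requests = []
def tracking_alloc(nbytes: int) -> bytearray:
    requests.append(nbytes)  # record each allocation request
    return bytearray(nbytes)

set_comm_allocator(tracking_alloc)
buf = get_comm_allocator()(64)
print(len(buf), requests)  # 64 [64]
```

Making the hook a process-wide global, as in the diff, means callers that allocate inside custom ops pick it up automatically without threading an argument through every call site.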
@awgu (Collaborator) commented Feb 13, 2025:

naming is hard 😅

if you make these global public, I am okay with that

that might be better than passing them as args into fully_shard, since then the user would be expected to pass the same allocator to all calls of fully_shard (otherwise I think it does not make much sense)

@yifuwang (Collaborator, Author) replied:

> that might be better than passing as args into fully_shard since then user is expected to pass the same allocator to all calls of fully_shard (or I think it does not make too much sense)

Makes sense. I made it global mainly because some allocations are performed in custom ops created for tracing, and I don't want to mess them up.

Hmm I think we also need to expose the process group on which the collective will be performed 🤔

@xunnanxu (Contributor) commented:

Hmm, right now it seems like the allocator doesn't know when the memory is no longer needed? My understanding is that a tensor doesn't have a callback mechanism to tell the allocator, so I wonder how GC is supposed to work?

@yifuwang (Collaborator, Author) replied:

@xunnanxu you can pass your memory deallocation logic to at::from_blob (example). Do you think this would work?

@xunnanxu (Contributor) replied:

@yifuwang Got it. I was slightly confused by the example since the allocation is done in Python, but this makes sense.
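The deleter mechanism of at::from_blob mentioned above (a C++ API) can be illustrated with a pure-Python analogue built on `weakref.finalize`. This is only an illustration of the callback idea, not PyTorch's actual implementation; `Blob` and `wrap_with_deleter` are invented names.

```python
import weakref

freed = []

class Blob:
    # Stand-in for an externally allocated buffer wrapped as a tensor.
    def __init__(self, nbytes: int):
        self.nbytes = nbytes
        self.data = bytearray(nbytes)

def wrap_with_deleter(blob: Blob, deleter) -> Blob:
    # Run the deleter when the wrapper is garbage collected, so the
    # custom allocator learns the memory is no longer needed -- the
    # role the deleter argument plays for at::from_blob.
    weakref.finalize(blob, deleter, blob.nbytes)
    return blob

t = wrap_with_deleter(Blob(128), freed.append)
del t  # last reference dropped; the finalizer fires (CPython refcounting)
print(freed)  # [128]
```

Note that the finalizer must not capture a strong reference to the blob itself (here it captures only the integer `blob.nbytes`), or the object would never be collected.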

jiayulu added a commit to jiayulu/pytorch that referenced this pull request Mar 13, 2025
…e in collective communication

Summary: pytorch#147146

Test Plan: unit test

Differential Revision: D69694585
jiayulu added a commit to jiayulu/pytorch that referenced this pull request Apr 1, 2025
…e in collective communication (pytorch#149150)

Summary:

pytorch#147146

Test Plan: unit test

Differential Revision: D69694585
jiayulu added a commit to jiayulu/pytorch that referenced this pull request Apr 1, 2025
…e in collective communication (pytorch#149150)

Summary:
Pull Request resolved: pytorch#149150

pytorch#147146

Test Plan: unit test

Differential Revision: D69694585
jiayulu added a commit to jiayulu/pytorch that referenced this pull request Apr 1, 2025
…e in collective communication (pytorch#149150)

Summary:

pytorch#147146

Test Plan: unit test

Differential Revision: D69694585
github-actions bot (Contributor) commented May 5, 2025:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 5, 2025
Labels: ciflow/inductor, oncall: distributed, open source, release notes: distributed (fsdp), Stale
4 participants