[fsdp] add an experimental allocator hook for buffers that participate in collective communication #147146
base: gh/yifuwang/196/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147146
Note: links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure.) As of commit 16bd23e with merge base 899066e, one job failed but was likely due to flakiness present on trunk (FLAKY).
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
@awgu can you please give suggestions on the API name and how it should be exposed?
# Module-level state (defaults to None; set only via the setter below).

def _set_fsdp_comm_allocator(allocator: _Allocator):
    global _fsdp_comm_allocator
    _fsdp_comm_allocator = allocator


def _get_fsdp_comm_allocator() -> _Allocator:
    # Fall back to torch.empty when no custom allocator has been registered.
    if _fsdp_comm_allocator is not None:
        return _fsdp_comm_allocator
    else:
        return torch.empty
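For illustration, here is a minimal sketch of how a caller could register a custom allocator through this hook. The logging allocator and the sizes below are hypothetical; the only assumption is that the hook mirrors the signature of torch.empty, which is the fallback above.

```python
import torch

# Hypothetical allocator: log each request, then defer to torch.empty.
# It mirrors torch.empty's signature because that is the fallback above.
def logging_comm_allocator(*size, **kwargs):
    print(f"[fsdp comm alloc] size={size} kwargs={kwargs}")
    return torch.empty(*size, **kwargs)

_set_fsdp_comm_allocator(logging_comm_allocator)

# FSDP internals would then obtain communication buffers via the hook:
buf = _get_fsdp_comm_allocator()(1024, dtype=torch.bfloat16, device="cuda")
```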
naming is hard 😅
if you make these globals public, I am okay with that
that might be better than passing the allocator as an arg into fully_shard, since then the user is expected to pass the same allocator to all calls of fully_shard (or else I think it does not make too much sense)
> that might be better than passing the allocator as an arg into fully_shard, since then the user is expected to pass the same allocator to all calls of fully_shard (or else I think it does not make too much sense)
Makes sense. I made it global mainly because some allocations are performed in custom ops created for tracing, and I don't want to mess them up.
Hmm, I think we also need to expose the process group on which the collective will be performed 🤔
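One possible shape for that (purely a sketch; this PR does not define such a signature) would be for the hook to also receive the process group the buffer will be used with:

```python
from typing import Optional

import torch
import torch.distributed as dist

# Hypothetical extended signature: the process group is passed as a keyword
# argument so the allocator can, e.g., pick a pool associated with that group.
def comm_allocator(*size, pg: Optional[dist.ProcessGroup] = None, **kwargs):
    # A real implementation might look up group-specific memory here;
    # this sketch just falls back to a plain allocation.
    return torch.empty(*size, **kwargs)
```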
Hmm, right now it seems like the allocator doesn't know when the memory is no longer needed? My understanding is that a tensor doesn't have a callback mechanism to tell the allocator, so I wonder how GC is supposed to work?
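A minimal sketch of one way this could work (an assumption, not something the PR implements): the allocator attaches a finalizer to the tensor it returns, so it is told when the Python tensor object is collected. Note this tracks the tensor object, not the underlying storage, which may outlive it if other tensors alias it.

```python
import weakref

import torch

# Hypothetical allocator that is notified when the returned tensor object is
# garbage collected. The callback must not hold a reference to the tensor.
def tracking_comm_allocator(*size, **kwargs):
    buf = torch.empty(*size, **kwargs)
    weakref.finalize(buf, lambda: print("[fsdp comm alloc] buffer object collected"))
    return buf
```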
@yifuwang Got it. I was slightly confused by the example because the allocation is done in Python, but this makes sense.
…e in collective communication Summary: pytorch#147146 Test Plan: unit test Differential Revision: D69694585
…e in collective communication (pytorch#149150) Summary: pytorch#147146 Test Plan: unit test Differential Revision: D69694585
…e in collective communication (pytorch#149150) Summary: Pull Request resolved: pytorch#149150 pytorch#147146 Test Plan: unit test Differential Revision: D69694585
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Stack from ghstack (oldest at bottom):
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o