[Inductor] Fix CUDA memory usage for CPU only compile #150669
base: gh/leslie-fang-intel/190/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150669
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (7 unrelated failures) As of commit d7bb824 with merge base f252f9d. FLAKY - the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @eellison, since this PR might register the same pattern twice for both CPU and GPU, I added the
For most/all of these patterns, the pattern is the same regardless of device. We shouldn't be initializing them twice.
torch/_dynamo/utils.py
Outdated
    inputs: collections.abc.Sequence[object],
) -> list[Optional[torch.device]]:
    return list(
        OrderedSet(
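For reference, here is a minimal, self-contained sketch of what such a device-collection helper might look like; the helper name, the OrderedSet import path, and the tensor filtering are assumptions, not the PR's actual code:

```python
import collections.abc
from typing import Optional

import torch
from torch.utils._ordered_set import OrderedSet  # assumed import path


def get_unique_input_devices(
    inputs: collections.abc.Sequence[object],
) -> list[Optional[torch.device]]:
    # Deduplicate the devices of all tensor inputs while preserving the
    # order in which they are first seen; non-tensor inputs are ignored.
    return list(
        OrderedSet(
            x.device for x in inputs if isinstance(x, torch.Tensor)
        )
    )
```

For example, calling this helper on a list of two CPU tensors would return a single-element list containing the CPU device.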
Shouldn't we use all devices in the graph, not just the inputs?
Thanks for the comment. Updated to take the FX GraphModule into account as well.
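A sketch of how devices might also be collected from the graph itself, assuming the usual "val" meta populated by fake-tensor tracing (hypothetical helper, not the PR's exact code):

```python
from typing import Optional

import torch
from torch.utils._ordered_set import OrderedSet  # assumed import path


def get_graph_devices(gm: torch.fx.GraphModule) -> list[Optional[torch.device]]:
    # Walk every node in the FX graph and record the device of its tensor
    # value (stored in node.meta["val"] during fake-tensor tracing), so
    # devices that only appear on intermediates or buffers are counted,
    # not just the ones on the example inputs.
    devices = OrderedSet()
    for node in gm.graph.nodes:
        val = node.meta.get("val", None)
        if isinstance(val, torch.Tensor):
            devices.add(val.device)
    return list(devices)
```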
@@ -89,7 +90,7 @@ def freezing_passes(gm: torch.fx.GraphModule, aot_example_inputs):

 @init_once_fakemode
-def lazy_init():
+def lazy_init(input_device: Optional[torch.device] = None):
     if torch._C._has_mkldnn and config.cpp.weight_prepack:
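To illustrate the behavior being discussed, here is a simplified, hypothetical stand-in for the init-once guard that memoizes per device; the real init_once_fakemode decorator and the pass registration it wraps live in Inductor and differ in detail:

```python
import functools
from typing import Callable, Optional

import torch


def init_once_per_device(fn: Callable[..., None]) -> Callable[..., None]:
    # Hypothetical guard: remember which devices have already been
    # initialized so a later compile for a different device only adds
    # the patterns that are still missing instead of redoing everything.
    seen: set[Optional[torch.device]] = set()

    @functools.wraps(fn)
    def wrapper(input_device: Optional[torch.device] = None) -> None:
        if input_device in seen:
            return
        seen.add(input_device)
        fn(input_device)

    return wrapper


@init_once_per_device
def lazy_init(input_device: Optional[torch.device] = None) -> None:
    # Pattern registration would happen here; with the guard above it
    # runs at most once per distinct input_device value.
    print(f"registering patterns for input_device={input_device}")
```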
All of these patterns are the same regardless of device.
For this one, I think it will not be initialized twice, since we only use `input_device` as `None`.
Hi @eellison, thanks a lot for your comments. I think for most of the patterns the initialization will only happen once, with the default `input_device` of `None`.
As for these 3 passes, they originally use the device to generate the patterns (a hedged sketch of that style is included below). So I guess the patterns will be slightly different between CPU and CUDA for these 3 passes, and we need to initialize again to register the additional patterns when the input device changes (even though some of them are duplicated). Looking forward to your further suggestions. Thanks.
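As a rough illustration of the device dependence being described (all names and shapes here are made up, not the code of those passes): pattern tracing typically builds example inputs on a concrete device, so whichever device is picked as the default leaks into pattern generation.

```python
import functools

import torch


def make_example_inputs(device: torch.device):
    # Example-input factories like this tie pattern tracing to a device;
    # if device defaults to "cuda" whenever CUDA is available, tracing
    # the pattern allocates GPU memory even for a CPU-only compile.
    t = functools.partial(torch.randn, (2, 4, 8, 16), device=device)
    return (t(), t())


def search_fn(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # The op sequence a pass would look for; matmul/softmax dispatch the
    # same way on CPU and CUDA, which is why the traced pattern can be
    # made device-independent.
    return torch.softmax(a @ b.transpose(-2, -1), dim=-1)


cpu_inputs = make_example_inputs(torch.device("cpu"))
out = search_fn(*cpu_inputs)
```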
The patterns are the same here, actually. We remove the device after tracing the pattern. And mm/etc do not dispatch differently based on device.
Stack from ghstack (oldest at bottom):
Summary
Fixes #150622. The root cause is that the CUDA device is used by default to generate patterns when CUDA is available, even for a CPU-only compilation. The original fix comes from @vfdev-5 in #124722 and combines the comments from @lezcano in #129131 (comment).
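A minimal sketch of the kind of reproducer described in #150622 (the model and the exact check are assumptions; the linked issue has the actual one):

```python
import torch


def fn(x: torch.Tensor) -> torch.Tensor:
    # A purely CPU computation; compiling it should not need the GPU.
    return torch.nn.functional.relu(x @ x.t())


x = torch.randn(128, 128)  # CPU tensor
compiled = torch.compile(fn)
compiled(x)

if torch.cuda.is_available():
    # Before this fix, pattern generation during the CPU-only compile
    # could initialize the CUDA context and allocate GPU memory.
    print("CUDA memory allocated:", torch.cuda.memory_allocated())
```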
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov