
[Inductor] Fix CUDA memory usage for CPU only compile #150669


Open

wants to merge 4 commits into base: gh/leslie-fang-intel/190/base

Conversation

leslie-fang-intel
Collaborator

@leslie-fang-intel leslie-fang-intel commented Apr 4, 2025

Stack from ghstack (oldest at bottom):

Summary
Fix #150622. The root cause is that the CUDA device is used by default, when CUDA is available, to generate patterns for a CPU-only compilation. The original PR comes from @vfdev-5 in #124722, combined with the comments from @lezcano in #129131 (comment)
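For illustration, a minimal sketch of the intended behavior (the helper name pick_pattern_device is hypothetical; the actual change lives in Inductor's pattern-registration passes). The idea is that pattern tracing should follow the device of the compilation instead of defaulting to CUDA whenever it happens to be available:

    from typing import Optional

    import torch


    def pick_pattern_device(input_device: Optional[torch.device] = None) -> str:
        """Hypothetical helper: prefer the device of the model being compiled,
        and only fall back to CUDA when no device hint is available."""
        if input_device is not None:
            return input_device.type
        if torch.cuda.is_available():
            # workaround https://github.com/pytorch/pytorch/issues/97894
            return "cuda"
        return "cpu"


    # A CPU-only compile now traces its patterns on CPU even when CUDA exists,
    # so no CUDA memory is allocated just to build the patterns.
    assert pick_pattern_device(torch.device("cpu")) == "cpu"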

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]

pytorch-bot bot commented Apr 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150669

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit d7bb824 with merge base f252f9d:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
leslie-fang-intel added a commit that referenced this pull request Apr 4, 2025
ghstack-source-id: 34a3561
Pull Request resolved: #150669
@leslie-fang-intel leslie-fang-intel marked this pull request as ready for review April 4, 2025 09:36
@leslie-fang-intel leslie-fang-intel requested a review from jansel April 5, 2025 00:20
@leslie-fang-intel
Collaborator Author

Hi @eellison, since this PR might register the same pattern twice for both CPU and GPU, I added the skip_duplicates=True flag when registering the pattern. Could you kindly take a look and check whether this change makes sense to you?

Contributor

@eellison eellison left a comment


For most/all of these patterns, the pattern is the same regardless of device. We shouldn't be initializing them twice.

    inputs: collections.abc.Sequence[object],
) -> list[Optional[torch.device]]:
    return list(
        OrderedSet(
Contributor


Shouldn't we use all devices in the graph, not just the inputs?

Collaborator Author


Thanks for the comment. Updated to also take the FX GraphModule into account.
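For reference, a rough sketch of what such a device-collection helper could look like once the graph is consulted as well (the function name and the plain-dict stand-in for OrderedSet are assumptions, not the PR's actual code):

    import collections.abc
    from typing import Optional

    import torch


    def get_candidate_devices(
        gm: torch.fx.GraphModule,
        inputs: collections.abc.Sequence[object],
    ) -> list[Optional[torch.device]]:
        # A plain dict keeps insertion order, emulating OrderedSet.
        devices: dict[Optional[torch.device], None] = {}
        # Devices of the example inputs.
        for inp in inputs:
            if isinstance(inp, torch.Tensor):
                devices.setdefault(inp.device)
        # Devices of tensors carried by the graph itself (constants, buffers,
        # fake-tensor values attached to nodes), not just the placeholders.
        for node in gm.graph.nodes:
            val = node.meta.get("val", None)
            if isinstance(val, torch.Tensor):
                devices.setdefault(val.device)
        return list(devices)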

@@ -89,7 +90,7 @@ def freezing_passes(gm: torch.fx.GraphModule, aot_example_inputs):


 @init_once_fakemode
-def lazy_init():
+def lazy_init(input_device: Optional[torch.device] = None):
     if torch._C._has_mkldnn and config.cpp.weight_prepack:
Contributor


All of these patterns are the same regardless of device.

Collaborator Author

@leslie-fang-intel leslie-fang-intel Apr 17, 2025


For this one, I think it will not be initialized twice, since we only ever call it with input_device as None.
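To make that init-once behavior concrete, a minimal illustrative sketch (not the real init_once_fakemode implementation): since the freezing path only ever calls lazy_init() without a device argument, the body runs exactly once.

    import functools
    from typing import Callable, Optional

    import torch


    def init_once(fn: Callable) -> Callable:
        ran = False

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            nonlocal ran
            if ran:
                return None
            ran = True
            return fn(*args, **kwargs)

        return wrapper


    @init_once
    def lazy_init(input_device: Optional[torch.device] = None) -> None:
        print(f"registering mkldnn patterns (device hint: {input_device})")


    lazy_init()  # runs the body once
    lazy_init()  # no-op: already initialized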

[ghstack-poisoned]
leslie-fang-intel added a commit that referenced this pull request Apr 17, 2025
ghstack-source-id: 9371cd4
Pull Request resolved: #150669
@leslie-fang-intel
Collaborator Author

For most/all of these patterns, the pattern is the same regardless of device. We shouldn't be initializing them twice.

Hi @eellison, thanks a lot for your comments. I think most of the patterns will only be initialized once, with the default input_device of None. There are 3 exceptional pattern passes, registered through joint_graph_passes:

  • _pad_mm_init
  • _sfdp_init
  • _misc_patterns_init

As for these 3 passes, they originally use the device to generate patterns as:

    if torch.cuda.is_available():
        # workaround https://github.com/pytorch/pytorch/issues/97894
        device = "cuda"
    else:
        device = "cpu"

So I guess the patterns for CPU and CUDA will be slightly different for these 3 passes, and we need to initialize again to register more patterns when the input device changes (even though some of them are duplicated). Looking forward to your further suggestions. Thanks.
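To make the intent concrete, here is a self-contained toy sketch of the registration scheme described above (register_pattern and init_device_patterns are illustrative names, not Inductor's real API): the three passes are re-run when a new input device shows up, and already-registered patterns are skipped instead of raising.

    import torch

    _registered_patterns: set[str] = set()
    _initialized_devices: set[str] = set()


    def register_pattern(key: str, skip_duplicates: bool = False) -> None:
        # Registering the same pattern twice is an error unless duplicates
        # are explicitly allowed to be skipped.
        if key in _registered_patterns:
            if skip_duplicates:
                return
            raise RuntimeError(f"pattern {key!r} registered twice")
        _registered_patterns.add(key)


    def init_device_patterns(device: str) -> None:
        if device in _initialized_devices:
            return
        _initialized_devices.add(device)
        # Stand-ins for _pad_mm_init / _sfdp_init / _misc_patterns_init,
        # which trace their patterns on `device` and register the results.
        register_pattern(f"sdpa_pattern_{device}", skip_duplicates=True)
        register_pattern("pad_mm_pattern", skip_duplicates=True)  # device-independent


    init_device_patterns("cpu")        # CPU-only compile: CUDA is never touched
    if torch.cuda.is_available():
        init_device_patterns("cuda")   # extra CUDA patterns registered lazily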

[ghstack-poisoned]
leslie-fang-intel added a commit that referenced this pull request Apr 17, 2025
ghstack-source-id: 26722fb
Pull Request resolved: #150669
@eellison
Contributor

So I guess the patterns for CPU and CUDA will be slightly different for these 3 passes, and we need to initialize again to register more patterns when the input device changes (even though some of them are duplicated).

The patterns are the same here, actually. We remove the device after tracing the pattern. And mm/etc do not dispatch differently based on device.
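As a quick sanity check of that point (not part of the PR), one can trace the same pattern on CPU and, if available, CUDA and compare the resulting op sequence, which is all the pattern matcher cares about once device information is dropped:

    import torch
    from torch.fx.experimental.proxy_tensor import make_fx


    def pattern(a, b):
        return torch.mm(a, b)


    def op_targets(device: str) -> list:
        a = torch.randn(4, 4, device=device)
        b = torch.randn(4, 4, device=device)
        gm = make_fx(pattern)(a, b)
        return [node.target for node in gm.graph.nodes if node.op == "call_function"]


    cpu_ops = op_targets("cpu")
    if torch.cuda.is_available():
        # mm traces to the same aten op on both devices.
        assert cpu_ops == op_targets("cuda")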

@eellison eellison removed their request for review April 28, 2025 20:40