[Inductor] Fix CUDA memory usage for CPU only compile #150669
base: gh/leslie-fang-intel/190/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150669
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (7 unrelated failures) As of commit d7bb824 with merge base f252f9d. FLAKY - the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @eellison, since this PR might register the same pattern twice for both CPU and GPU, I added the
For most/all of these patterns, the pattern is the same regardless of device. We shouldn't be initializing them twice.
torch/_dynamo/utils.py
Outdated
    inputs: collections.abc.Sequence[object],
) -> list[Optional[torch.device]]:
    return list(
        OrderedSet(
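For reference, here is a minimal, self-contained sketch of what such a device-collection helper might look like; the helper name, the OrderedSet import path, and the tensor filtering are assumptions, not the PR's actual code:

```python
import collections.abc
from typing import Optional

import torch
from torch.utils._ordered_set import OrderedSet  # assumed import path


def get_unique_input_devices(
    inputs: collections.abc.Sequence[object],
) -> list[Optional[torch.device]]:
    # Deduplicate the devices of all tensor inputs while preserving the
    # order in which they are first seen; non-tensor inputs are ignored.
    return list(
        OrderedSet(
            x.device for x in inputs if isinstance(x, torch.Tensor)
        )
    )
```

For example, calling this helper on a list of two CPU tensors would return a single-element list containing the CPU device.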
Shouldn't we use all devices in the graph, not just the inputs?
Thanks for the comment. Updated to take the FX GraphModule into account as well.
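A sketch of how devices might also be collected from the graph itself, assuming the usual "val" meta populated by fake-tensor tracing (hypothetical helper, not the PR's exact code):

```python
from typing import Optional

import torch
from torch.utils._ordered_set import OrderedSet  # assumed import path


def get_graph_devices(gm: torch.fx.GraphModule) -> list[Optional[torch.device]]:
    # Walk every node in the FX graph and record the device of its tensor
    # value (stored in node.meta["val"] during fake-tensor tracing), so
    # devices that only appear on intermediates or buffers are counted,
    # not just the ones on the example inputs.
    devices = OrderedSet()
    for node in gm.graph.nodes:
        val = node.meta.get("val", None)
        if isinstance(val, torch.Tensor):
            devices.add(val.device)
    return list(devices)
```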
@@ -89,7 +90,7 @@ def freezing_passes(gm: torch.fx.GraphModule, aot_example_inputs):

 @init_once_fakemode
-def lazy_init():
+def lazy_init(input_device: Optional[torch.device] = None):
     if torch._C._has_mkldnn and config.cpp.weight_prepack:
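To illustrate the behavior being discussed, here is a simplified, hypothetical stand-in for the init-once guard that memoizes per device; the real init_once_fakemode decorator and the pass registration it wraps live in Inductor and differ in detail:

```python
import functools
from typing import Callable, Optional

import torch


def init_once_per_device(fn: Callable[..., None]) -> Callable[..., None]:
    # Hypothetical guard: remember which devices have already been
    # initialized so a later compile for a different device only adds
    # the patterns that are still missing instead of redoing everything.
    seen: set[Optional[torch.device]] = set()

    @functools.wraps(fn)
    def wrapper(input_device: Optional[torch.device] = None) -> None:
        if input_device in seen:
            return
        seen.add(input_device)
        fn(input_device)

    return wrapper


@init_once_per_device
def lazy_init(input_device: Optional[torch.device] = None) -> None:
    # Pattern registration would happen here; with the guard above it
    # runs at most once per distinct input_device value.
    print(f"registering patterns for input_device={input_device}")
```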
All of these patterns are the same regardless of device.
For this one, I think it will not be initialized twice, since we only use `input_device` as `None`.
Hi @eellison, thanks a lot for your comments. I think for most of the patterns the initialization will only happen once, with the default `input_device` of `None`.
As for these 3 passes, they originally use the device to generate the patterns (a hedged sketch of that style is included below). So I guess the patterns will be slightly different between CPU and CUDA for these 3 passes, and we need to initialize again to register the additional patterns when the input device changes (even though some of them are duplicated). Looking forward to your further suggestions. Thanks.
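As a rough illustration of the device dependence being described (all names and shapes here are made up, not the code of those passes): pattern tracing typically builds example inputs on a concrete device, so whichever device is picked as the default leaks into pattern generation.

```python
import functools

import torch


def make_example_inputs(device: torch.device):
    # Example-input factories like this tie pattern tracing to a device;
    # if device defaults to "cuda" whenever CUDA is available, tracing
    # the pattern allocates GPU memory even for a CPU-only compile.
    t = functools.partial(torch.randn, (2, 4, 8, 16), device=device)
    return (t(), t())


def search_fn(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # The op sequence a pass would look for; matmul/softmax dispatch the
    # same way on CPU and CUDA, which is why the traced pattern can be
    # made device-independent.
    return torch.softmax(a @ b.transpose(-2, -1), dim=-1)


cpu_inputs = make_example_inputs(torch.device("cpu"))
out = search_fn(*cpu_inputs)
```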
The patterns are the same here, actually. We remove the device after tracing the pattern. And mm/etc do not dispatch differently based on device.
Stack from ghstack (oldest at bottom):
Summary
Fixes #150622. The root cause is that the CUDA device is used by default to generate patterns when CUDA is available, even for a CPU-only compilation. The original fix comes from @vfdev-5 in #124722 and combines the comments from @lezcano in #129131 (comment).
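A minimal sketch of the kind of reproducer described in #150622 (the model and the exact check are assumptions; the linked issue has the actual one):

```python
import torch


def fn(x: torch.Tensor) -> torch.Tensor:
    # A purely CPU computation; compiling it should not need the GPU.
    return torch.nn.functional.relu(x @ x.t())


x = torch.randn(128, 128)  # CPU tensor
compiled = torch.compile(fn)
compiled(x)

if torch.cuda.is_available():
    # Before this fix, pattern generation during the CPU-only compile
    # could initialize the CUDA context and allocate GPU memory.
    print("CUDA memory allocated:", torch.cuda.memory_allocated())
```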
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov