
cpp_wrapper: persist autotune example tensors until last use #146706


Closed

Conversation

benjaminglass1
Collaborator

@benjaminglass1 benjaminglass1 commented Feb 7, 2025

Stack from ghstack (oldest at bottom):

Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning (`test_comprehensive_nanquantile_cuda_float64`).

For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the boolean tensor it outputs, so that they match and pass the device assertion in `BCDE`.
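To illustrate the failure mode, here is a hypothetical, heavily simplified Triton sketch (not the actual Inductor-generated code; the kernel name, tensor names, and the NaN-mask invariant are made up for illustration) of a fused kernel guarded by a data-dependent `tl.device_assert`:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_bcde_kernel(data_ptr, mask_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    in_bounds = offs < n
    x = tl.load(data_ptr + offs, mask=in_bounds, other=0.0)
    is_nan = tl.load(mask_ptr + offs, mask=in_bounds, other=0)
    # The upstream kernel ("A") is assumed to have produced `mask_ptr` as
    # torch.isnan(data); this fused kernel relies on that invariant.
    tl.device_assert((is_nan != 0) == (x != x), "mask inconsistent with data")
    tl.store(out_ptr + offs, tl.where(is_nan != 0, 0.0, x), mask=in_bounds)


n = 1024
x = torch.randn(n, device="cuda")
x[::7] = float("nan")
out = torch.empty_like(x)

# Consistent inputs, as the upstream kernel would actually produce them:
good_mask = torch.isnan(x).to(torch.uint8)
fused_bcde_kernel[(n // 256,)](x, good_mask, out, n, BLOCK=256)

# Independently regenerated random example inputs, as before this PR:
bad_mask = (torch.rand(n, device="cuda") > 0.5).to(torch.uint8)
# May trip the assertion (when device-side assertions are enabled):
# fused_bcde_kernel[(n // 256,)](x, bad_mask, out, n, BLOCK=256)
```

When the mask is regenerated at random instead of being taken from the upstream kernel's output, it no longer describes the data, and the assertion can fire during benchmarking even though the real model would never produce such inputs.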

Fixes #147799.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov


pytorch-bot bot commented Feb 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146706

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 1 Pending, 3 Unrelated Failures

As of commit 266b699 with merge base a89bdc0:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@benjaminglass1
Collaborator Author

Notes to reviewers:

  1. It is currently unclear to me whether this raises the average memory usage of autotuning, and what effects that may have on the output kernels.
  2. This effectively makes autotuning order-dependent, so tuning could not be done in parallel in the general case. We don't currently tune in parallel, but before this PR we theoretically could have.
  3. This doesn't handle situations where the inputs to the first kernel are a priori invalid, or where fallback kernels running between the compiled kernels are responsible for making the inputs to `BCDE` valid.

This solution feels a bit hacky to me, but it solves the test breakage prompting this PR and seems strictly more correct than what we had before, although it doesn't solve every case.
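For readers skimming the diff, here is a minimal sketch of the persistence idea, assuming a simple last-use bookkeeping scheme. The class and method names below are hypothetical illustrations, not Inductor's actual API:

```python
from typing import Callable, Dict

import torch


class ExampleTensorCache:
    """Keep autotuning example tensors alive until their last consumer runs.

    Kernels are benchmarked in program order, so an output of kernel A can be
    reused as an input to a later fused kernel instead of being regenerated
    randomly (which could violate data-dependent assertions).
    """

    def __init__(self, last_use: Dict[str, int]):
        # Buffer name -> index of the last kernel that reads it.
        self.last_use = last_use
        self.tensors: Dict[str, torch.Tensor] = {}

    def get(self, name: str, kernel_idx: int,
            make_random: Callable[[], torch.Tensor]) -> torch.Tensor:
        # Prefer a tensor produced by an earlier kernel; otherwise fall back
        # to a freshly generated random example.
        t = self.tensors.get(name)
        if t is None:
            t = make_random()
        if kernel_idx >= self.last_use.get(name, -1):
            self.tensors.pop(name, None)  # free memory after the last use
        return t

    def put(self, name: str, tensor: torch.Tensor, kernel_idx: int) -> None:
        # Persist a kernel's output only if a later kernel still needs it.
        if kernel_idx < self.last_use.get(name, -1):
            self.tensors[name] = tensor
```

Evicting each tensor after its last consumer bounds the extra memory held during tuning by roughly what the compiled model itself would keep live, at the cost of the order dependence noted in point 2.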

@benjaminglass1 benjaminglass1 requested review from desertfire and eellison and removed request for amjames February 7, 2025 16:44
@benjaminglass1 benjaminglass1 marked this pull request as ready for review February 7, 2025 16:45
@benjaminglass1 benjaminglass1 marked this pull request as ready for review March 15, 2025 18:30
@benjaminglass1
Collaborator Author

@eellison @desertfire I finally had time to revisit this, and used the delayed line resolution idea.

Regarding the concerns about memory usage when keeping tensors: I've spent some time thinking about it and concluded that we should be fine. Since the tensors we're persisting are the same size as the tensors that would be kept alive when running the final model, autotuning shouldn't fail for lack of memory if the model itself is capable of running. I think this is ready to merge, pending approval.

@benjaminglass1 benjaminglass1 added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 15, 2025
NaderAlAwar pushed a commit to NaderAlAwar/PyTorch that referenced this pull request Mar 21, 2025
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #149350

1 similar comment
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #149350

pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2025
Pull Request resolved: #147225
Approved by: https://github.com/desertfire
ghstack dependencies: #146706
pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2025
… RAIIPyObject interface (#149350)

Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject.

Closes #142005.

Pull Request resolved: #149350
Approved by: https://github.com/desertfire
ghstack dependencies: #146706, #147225
@desertfire
Contributor

xref #150522

@benjaminglass1, I remember you said this one fixed a unit test. Is that test included in your next PR in this stack?

@benjaminglass1
Collaborator Author

@desertfire No, but I will add that. More generally, we need to find a way to finish enabling all the cpp_wrapper tests anyway. We're close to being able to, and possibly all it would take now is a longer timeout.

amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025

Pull Request resolved: pytorch#146706
Approved by: https://github.com/desertfire
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025
@github-actions github-actions bot deleted the gh/benjaminglass1/67/head branch May 12, 2025 02:19