cpp_wrapper: persist autotune example tensors until last use #146706
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146706
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 1 Pending, 3 Unrelated Failures as of commit 266b699 with merge base a89bdc0.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Notes to reviewers:
This solution feels a bit hacky to me, but it solves the test breakage prompting this PR and seems strictly more correct than what we had before, although it doesn't solve every case.
@eellison @desertfire I finally had time to revisit this, and used the delayed line resolution idea. WRT the concerns about memory usage when keeping tensors, I've spent some time thinking about it, and I've reached the conclusion that we should be fine. Since the tensors we're persisting are the same size as tensors that would be persisted when running the final model, we shouldn't fail to run based on that issue if the model itself is capable of running. I think this is ready to merge, with approval.
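For readers following the discussion, here is a minimal, hypothetical sketch of the "keep example tensors alive until their last use" idea referenced above. The class and method names (`ExampleTensorPool`, `register`, `release_after`) are illustrative only and do not correspond to the actual Inductor implementation:

```python
import torch

class ExampleTensorPool:
    """Hypothetical sketch: keep autotuning example tensors alive only until
    the last kernel that consumes them has been tuned. Names are illustrative
    and do not match the real Inductor code."""

    def __init__(self):
        self._tensors = {}   # buffer name -> example tensor
        self._last_use = {}  # buffer name -> index of the last kernel using it

    def register(self, name, tensor, last_use_idx):
        self._tensors[name] = tensor
        self._last_use[name] = last_use_idx

    def get(self, name):
        # Reuse the earlier kernel's example tensor instead of regenerating
        # random data that may violate cross-kernel invariants.
        return self._tensors[name]

    def release_after(self, kernel_idx):
        # Free tensors whose last consumer has already been autotuned, so peak
        # memory stays comparable to what the compiled model would hold anyway.
        for name, last in list(self._last_use.items()):
            if last <= kernel_idx:
                del self._tensors[name]
                del self._last_use[name]

pool = ExampleTensorPool()
pool.register("buf0", torch.randn(1024), last_use_idx=2)
# ... autotune kernels 0..2, fetching pool.get("buf0") where it is an input ...
pool.release_after(kernel_idx=2)  # buf0 is freed once kernel 2 has been tuned
```

The `release_after` step captures the memory argument made above: an example tensor only lives as long as the corresponding buffer would live during an actual run of the compiled model.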
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py`, `test_comprehensive_nanquantile_cuda_float64`. Note to reviewers: it is currently unclear to me whether this raises the average memory usage of autotuning, and what effects that might have on the output kernels. ghstack-source-id: fb56d6c Pull Request resolved: pytorch/pytorch#146706
Starting merge as part of PR stack under #149350
Pull Request resolved: #147225 Approved by: https://github.com/desertfire ghstack dependencies: #146706
… RAIIPyObject interface (#149350) Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject. Closes #142005. Pull Request resolved: #149350 Approved by: https://github.com/desertfire ghstack dependencies: #146706, #147225
xref #150522 @benjaminglass1, I remember you said this one fixed some unit test. Is the test included in your next PR in this stack?
@desertfire No, but I will add that. Generally, we need to find a way to finish enabling all the
…#146706) Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`. For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`. Fixes pytorch#147799. Pull Request resolved: pytorch#146706 Approved by: https://github.com/desertfire
Pull Request resolved: pytorch#147225 Approved by: https://github.com/desertfire ghstack dependencies: pytorch#146706
… RAIIPyObject interface (pytorch#149350) Add includes for torch.device, torch.dtype, torch.layout, and torch.memory_format to the cpp_wrapper common header, so that they get precompiled. Additionally, add move constructors and operator bool to RAIIPyObject. Closes pytorch#142005. Pull Request resolved: pytorch#149350 Approved by: https://github.com/desertfire ghstack dependencies: pytorch#146706, pytorch#147225
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py`, `test_comprehensive_nanquantile_cuda_float64`. Note to reviewers: it is currently unclear to me whether this raises the average memory usage of autotuning, and what effects that might have on the output kernels. ghstack-source-id: 2e7c452 Pull Request resolved: pytorch/pytorch#146706
Stack from ghstack (oldest at bottom):
Patches over an issue where randomly generated example tensors can cause kernel autotuning to fail, when those tensors would not be possible outputs from previous kernels in the sequence. This fixes a failure in `test_torchinductor_opinfo.py` when run with compile-time autotuning, `test_comprehensive_nanquantile_cuda_float64`.

For clarity, the situation triggering this PR looks like kernels `A -> BCDE -> F` (`BCDE` is fused), where one of the outputs from `A` is a boolean tensor describing some of the input data. Previously, we randomly regenerated that boolean tensor and the input data before passing them to `BCDE`, so that they no longer matched. This caused a `tl.device_assert` call in `BCDE` to fail. With this PR, we reuse the random data input to `A` and the output Boolean tensor, such that they match and pass the device assertion in `BCDE`.

Fixes #147799.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov
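To make the scenario above concrete, here is a small, self-contained illustration of the failure mode. `kernel_A` and `kernel_BCDE` are stand-ins for the generated kernels, and a plain Python `assert` models the `tl.device_assert` in the real fused Triton kernel:

```python
import torch

def kernel_A(x):
    # Stand-in for kernel A: produces data plus a boolean tensor describing it
    # (here a NaN mask, as in the nanquantile decomposition).
    return x, torch.isnan(x)

def kernel_BCDE(data, nan_mask):
    # Stand-in for the fused kernel BCDE: the generated Triton code guards the
    # data/mask relationship with tl.device_assert; a plain assert models it.
    assert torch.equal(torch.isnan(data), nan_mask), "device assert would fire"
    return torch.where(nan_mask, torch.zeros_like(data), data)

x = torch.randn(8)
x[2] = float("nan")
data, mask = kernel_A(x)

# Before this PR: autotuning generated fresh random example inputs for BCDE,
# so the mask no longer described the data and the device assertion failed.
fresh_random_example = torch.randn(8)
# kernel_BCDE(fresh_random_example, mask)  # -> AssertionError during tuning

# With this PR: the example tensors from autotuning A are persisted and
# reused, so the pair stays consistent and the assertion passes.
out = kernel_BCDE(data, mask)
```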