[MAGMA][CUDA] cholesky_solve: deprecate MAGMA and dispatch to cuSolver unconditionally #174769
gderossi wants to merge 3 commits into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174769
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 409a19f with merge base 596dbc5:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…batched inputs (#175898)

cuSOLVER status: `potrsBatched` only supports `nrhs=1`, while looped `potrs` is slower than two batched calls to `solve_triangular`. This PR should unblock #174769, since it makes the cuSOLVER backend faster than MAGMA (cuSOLVER's triangular solve is faster).

Benchmarks:
```
[------ torch.cholesky_solve -----]
                         |    new    |    old
1 threads: ---------------------------
      (16, 16, 16)       |     29.7  |     157.6
      (16, 128, 128)     |    120.8  |     713.2
      (16, 512, 512)     |    730.9  |    3718.6
      (16, 2048, 2048)   |  11594.7  |   25689.6
      (64, 16, 16)       |     29.7  |     608.8
      (64, 128, 128)     |    139.3  |    2831.9
      (64, 512, 512)     |   1828.7  |   15173.4
      (64, 2048, 2048)   |  41961.1  |  102721.2

Times are in microseconds (us).
```

Thanks to @gderossi for the benchmarks comparing cuSOLVER kernels vs MAGMA, and his analysis of cuSOLVER's limitations. All this ultimately led to creating this PR.

Pull Request resolved: #175898
Approved by: https://github.com/eqy, https://github.com/Aidyn-A
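For readers less familiar with the approach named in the commit message above: given the Cholesky factor L with A = L Lᵀ, solving A X = B reduces to two triangular solves, first L Y = B and then Lᵀ X = Y. The following is a minimal pure-Python sketch of that math only. It is not the PyTorch or cuSOLVER implementation, and every function name in it is a hypothetical helper chosen for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T (a must be symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_lower(l, b):
    """Forward substitution: solve L Y = B, where B may have several columns."""
    n, m = len(l), len(b[0])
    y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            s = sum(l[i][k] * y[k][c] for k in range(i))
            y[i][c] = (b[i][c] - s) / l[i][i]
    return y


def solve_upper_from_lower(l, y):
    """Back substitution: solve L^T X = Y, reading entries of L^T as l[k][i]."""
    n, m = len(l), len(y[0])
    x = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):
        for c in range(m):
            s = sum(l[k][i] * x[k][c] for k in range(i + 1, n))
            x[i][c] = (y[i][c] - s) / l[i][i]
    return x


def cholesky_solve(b, l):
    """Solve (L L^T) X = B with two triangular solves (hypothetical helper)."""
    return solve_upper_from_lower(l, solve_lower(l, b))


# Small example with two right-hand-side columns (the nrhs > 1 case).
A = [[4.0, 2.0], [2.0, 3.0]]
B = [[2.0, 6.0], [1.0, 5.0]]
X = cholesky_solve(B, cholesky(A))
```

Because each triangular solve handles all right-hand-side columns at once, this formulation does not depend on a single-column restriction such as the `nrhs=1` limit the commit message attributes to `potrsBatched`.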
7b14777 to 8ce4135
@pytorchmergebot merge
eqy left a comment
looks like rocm build is broken?
Merge failed
Reason: Approvers from one of the following sets are needed:
@pytorchmergebot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 2 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud
The build failure is real. Here it is attempting to call `apply_cholesky_solve`, which you have removed.
Okay, that makes sense, since PR #174681 should land first.
Yep, that failure is not a surprise. #174681 has now landed so hopefully everything should work.
@pytorchmergebot label ciflow/trunk ciflow/rocm-mi300
@pytorchmergebot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
…r unconditionally (pytorch#174769)

Both cuSolver and hipSolver implement potrs, so this just removes the MAGMA path entirely and adds a deprecation warning, and updates the tests to skip only if missing cuSolver. Thanks @nikitaved for the performance improvements in pytorch#175898!

Note that cholesky_inverse depends on cholesky_solve functions and so pytorch#174681 should be merged before this.

Benchmarking script:
```python
import torch
import torch.utils.benchmark as benchmark
from itertools import product

results = []
batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_solve"
    sub_label = f"{shape}"

    # Build a symmetric positive-definite A and factor it once per shape.
    A = torch.rand(*shape, device="cuda")
    A = A @ A.mT + torch.eye(n, device="cuda")
    B = torch.rand(*shape, device="cuda")
    L = torch.linalg.cholesky(A)

    stmt = "torch.cholesky_solve(B, L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)
        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L, 'B': B},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:
```
[------------ torch.cholesky_solve -----------]
                         |   magma   |  cusolver  |  speedup
1 threads: ------------------------------------
      (16, 16)           |   1579.0  |     21.3   |   74.1
      (128, 128)         |   1593.5  |     50.1   |   31.8
      (512, 512)         |   1731.0  |    226.3   |    7.6
      (2048, 2048)       |   2956.9  |   1587.6   |    1.9
      (16, 16, 16)       |     28.7  |     28.3   |    1.0
      (16, 128, 128)     |    601.8  |    121.8   |    4.9
      (16, 512, 512)     |   3246.6  |    735.5   |    4.4
      (16, 2048, 2048)   |  23108.0  |  11620.9   |    2.0
      (64, 16, 16)       |     29.5  |     28.2   |    1.0
      (64, 128, 128)     |    644.3  |    140.5   |    4.6
      (64, 512, 512)     |   4571.2  |   1840.4   |    2.5
      (64, 2048, 2048)   |  65104.7  |  42007.7   |    1.5

Times are in microseconds (us).
```

Pull Request resolved: pytorch#174769
Approved by: https://github.com/nikitaved, https://github.com/eqy
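As a rough illustration of the behavior the commit message describes (the MAGMA path removed, a deprecation warning added, and every call routed to cuSolver), here is a small Python sketch of that dispatch pattern. It is an assumption-laden model, not the actual PyTorch code: the function `pick_cholesky_solve_backend`, the backend strings, and the warning text are all hypothetical.

```python
import warnings

# Hypothetical set of values a user-facing backend preference might take.
_KNOWN_PREFERENCES = ("default", "cusolver", "magma")


def pick_cholesky_solve_backend(preferred="default"):
    """Return the backend a cholesky_solve call would use (hypothetical helper).

    Models "dispatch to cuSolver unconditionally": the preference is validated,
    a MAGMA request triggers a deprecation warning, and the result is always
    "cusolver".
    """
    if preferred not in _KNOWN_PREFERENCES:
        raise ValueError(f"unknown linalg backend preference: {preferred!r}")
    if preferred == "magma":
        # Hypothetical message; the real wording in PyTorch may differ.
        warnings.warn(
            "MAGMA is deprecated for cholesky_solve; using cuSolver instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return "cusolver"


# A default preference resolves to cuSolver without any warning.
backend = pick_cholesky_solve_backend("default")
```

Warning instead of raising keeps existing scripts that still request MAGMA running, which is one plausible reason to deprecate before removing the option outright.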
cc @nikitaved @eqy