[MAGMA][CUDA] cholesky_inverse: deprecate MAGMA and dispatch to cuSolver unconditionally#174681
gderossi wants to merge 3 commits into
Dr. CI (automated comment): artifacts and rendered test results are at hud.pytorch.org/pr/174681. As of commit 9388ad4 with merge base 2f0a6bd: 1 new failure, 1 unrelated failure (the latter was likely due to flakiness present on trunk).
c0e57ea to 5bbd895
@nikitaved I just duplicated the |
@pytorchmergebot merge -i
Merge started. Your change will be merged while ignoring the following 2 checks: Lint / lintrunner-noclang-all / linux-job, trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)
…ver unconditionally (pytorch#174681)

Both cuSolver and hipSolver implement potrs, so this removes the MAGMA path entirely, adds a deprecation warning, and updates the tests to skip only if cuSolver is missing. Thanks @nikitaved for implementing relevant performance improvements in pytorch#175898; this PR takes advantage of the triangular solve path as well, and benchmarking results are much better than before.

Benchmarking script:
```python
import torch
import torch.utils.benchmark as benchmark
from itertools import product

results = []
batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_inverse"
    sub_label = f"{shape}"
    X = torch.rand(*shape, device="cuda")
    X = X @ X.mT + torch.eye(n, device="cuda")
    L = torch.linalg.cholesky(X)
    stmt = "torch.cholesky_inverse(L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)
        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:
```
[----------- torch.cholesky_inverse ----------]
                       |   magma   |  cusolver  |  speedup
1 threads: ------------------------------------
      (16, 16)         |     37.2  |      31.4  |   1.2
      (128, 128)       |    463.7  |      66.9  |   6.9
      (512, 512)       |   1632.4  |     283.8  |   5.8
      (2048, 2048)     |   7354.0  |    1779.6  |   4.1
      (16, 16, 16)     |     39.1  |      57.8  |   0.7
      (16, 128, 128)   |    597.9  |     159.7  |   3.7
      (16, 512, 512)   |   2947.6  |     852.0  |   3.5
      (16, 2048, 2048) |  21657.2  |   12394.5  |   1.7
      (64, 16, 16)     |     39.9  |      58.1  |   0.7
      (64, 128, 128)   |    629.5  |     179.2  |   3.5
      (64, 512, 512)   |   4267.9  |    2095.4  |   2.0
      (64, 2048, 2048) |  63178.9  |   44213.6  |   1.4

Times are in microseconds (us).
```

Pull Request resolved: pytorch#174681
Approved by: https://github.com/eqy
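The speedup column in the cholesky_inverse results appears to be the MAGMA time divided by the cuSolver time, rounded to one decimal. The sketch below recomputes it from the timings quoted above and lists the shapes where cuSolver came out slower. The names `timings_us`, `speedup`, and `slower` are illustrative and not part of the PR.

```python
# Timings in microseconds copied from the cholesky_inverse table above.
# Mapping: input shape -> (magma, cusolver)
timings_us = {
    (16, 16): (37.2, 31.4),
    (128, 128): (463.7, 66.9),
    (512, 512): (1632.4, 283.8),
    (2048, 2048): (7354.0, 1779.6),
    (16, 16, 16): (39.1, 57.8),
    (16, 128, 128): (597.9, 159.7),
    (16, 512, 512): (2947.6, 852.0),
    (16, 2048, 2048): (21657.2, 12394.5),
    (64, 16, 16): (39.9, 58.1),
    (64, 128, 128): (629.5, 179.2),
    (64, 512, 512): (4267.9, 2095.4),
    (64, 2048, 2048): (63178.9, 44213.6),
}


def speedup(magma_us, cusolver_us):
    """Ratio above 1 means cuSolver was faster than MAGMA."""
    return magma_us / cusolver_us


# Round to one decimal, matching the precision used in the table.
speedups = {
    shape: round(speedup(magma, cusolver), 1)
    for shape, (magma, cusolver) in timings_us.items()
}

# Shapes where the cuSolver path measured slower than MAGMA.
slower = [shape for shape, ratio in speedups.items() if ratio < 1.0]

for shape, ratio in speedups.items():
    print(f"{str(shape):>18}  {ratio:4.1f}x")
print("cuSolver slower than MAGMA for:", slower)
```

In these measurements the only regressions are the batched 16x16 inputs; every other shape is at least 1.2x faster with cuSolver.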
…r unconditionally (#174769)

Both cuSolver and hipSolver implement potrs, so this just removes the MAGMA path entirely, adds a deprecation warning, and updates the tests to skip only if cuSolver is missing. Thanks @nikitaved for the performance improvements in #175898! Note that cholesky_inverse depends on cholesky_solve functions and so #174681 should be merged before this.

Benchmarking script:
```python
import torch
import torch.utils.benchmark as benchmark
from itertools import product

results = []
batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_solve"
    sub_label = f"{shape}"
    A = torch.rand(*shape, device="cuda")
    A = A @ A.mT + torch.eye(n, device="cuda")
    B = torch.rand(*shape, device="cuda")
    L = torch.linalg.cholesky(A)
    stmt = "torch.cholesky_solve(B, L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)
        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L, 'B': B},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:
```
[------------ torch.cholesky_solve -----------]
                       |   magma   |  cusolver  |  speedup
1 threads: ------------------------------------
      (16, 16)         |   1579.0  |      21.3  |  74.1
      (128, 128)       |   1593.5  |      50.1  |  31.8
      (512, 512)       |   1731.0  |     226.3  |   7.6
      (2048, 2048)     |   2956.9  |    1587.6  |   1.9
      (16, 16, 16)     |     28.7  |      28.3  |   1.0
      (16, 128, 128)   |    601.8  |     121.8  |   4.9
      (16, 512, 512)   |   3246.6  |     735.5  |   4.4
      (16, 2048, 2048) |  23108.0  |   11620.9  |   2.0
      (64, 16, 16)     |     29.5  |      28.2  |   1.0
      (64, 128, 128)   |    644.3  |     140.5  |   4.6
      (64, 512, 512)   |   4571.2  |    1840.4  |   2.5
      (64, 2048, 2048) |  65104.7  |   42007.7  |   1.5

Times are in microseconds (us).
```

Pull Request resolved: #174769
Approved by: https://github.com/nikitaved, https://github.com/eqy
Both cuSolver and hipSolver implement potrs, so this PR removes the MAGMA path entirely, adds a deprecation warning, and updates the tests to skip only if cuSolver is missing. Thanks @nikitaved for implementing the relevant performance improvements in #175898; this PR takes advantage of the triangular solve path as well, and benchmarking results are much better than before.
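For readers unfamiliar with potrs: it is the LAPACK-style routine that solves A X = B given a Cholesky factor of A, so an inverse can be obtained by solving against the identity. The following is a minimal pure-Python sketch of that idea for a small dense matrix, assuming a lower-triangular factor. It is not PyTorch's or cuSolver's implementation, and the helper names (`cholesky_lower`, `cholesky_solve_vec`, `cholesky_inverse`) are hypothetical.

```python
import math


def cholesky_lower(a):
    """Return lower-triangular L with a = L @ L^T (a symmetric positive-definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def cholesky_solve_vec(l, b):
    """Solve (L L^T) x = b with two triangular solves (a potrs-style solve)."""
    n = len(l)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


def cholesky_inverse(l):
    """Invert A = L L^T by solving A X = I one identity column at a time."""
    n = len(l)
    cols = [
        cholesky_solve_vec(l, [1.0 if r == c else 0.0 for r in range(n)])
        for c in range(n)
    ]
    # cols[c] holds column c of the inverse; transpose into row-major form.
    return [[cols[c][r] for c in range(n)] for r in range(n)]


A = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky_lower(A)
A_inv = cholesky_inverse(L)
print(A_inv)
```

The two triangular solves inside `cholesky_solve_vec` correspond to the "triangular solve path" mentioned above only in the mathematical sense; how the GPU backends batch or fuse these steps is outside what this sketch shows.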
Benchmarking script:
Benchmarking results on RTX Pro 6000:
cc @nikitaved @eqy