
gderossi · 2026-02-10T15:43:30Z

Both cuSolver and hipSolver implement potrs, so this PR removes the MAGMA path entirely, adds a deprecation warning, and updates the tests to skip only when cuSolver is missing. Thanks @nikitaved for implementing the relevant performance improvements in #175898. This PR also takes advantage of the triangular solve path, and the benchmarking results are much better than before.
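For context on the "triangular solve path": given a lower Cholesky factor L with A = L Lᴴ, the inverse of A can be obtained from two triangular solves against the identity. The snippet below is only a CPU sketch of that identity using public PyTorch APIs. It is not the kernel code from this PR, and the helper name `inverse_via_triangular_solves` is hypothetical.

```python
import torch


def inverse_via_triangular_solves(L: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: compute (L @ L^H)^{-1} from a lower Cholesky factor L."""
    n = L.shape[-1]
    eye = torch.eye(n, dtype=L.dtype, device=L.device)
    # Forward substitution: solve L @ Y = I (eye broadcasts over batch dims).
    Y = torch.linalg.solve_triangular(L, eye, upper=False)
    # Back substitution: solve L^H @ X = Y, which gives X = A^{-1}.
    return torch.linalg.solve_triangular(L.mH, Y, upper=True)


torch.manual_seed(0)
X = torch.rand(4, 8, 8, dtype=torch.float64)
A = X @ X.mT + torch.eye(8, dtype=torch.float64)  # symmetric positive definite
L = torch.linalg.cholesky(A)
A_inv = inverse_via_triangular_solves(L)
```

The result should agree with `torch.cholesky_inverse(L)` up to floating-point error.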

Benchmarking script:

```python
import torch
import torch.utils.benchmark as benchmark

from itertools import product

results = []

batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_inverse"
    sub_label = f"{shape}"
    X = torch.rand(*shape, device="cuda")
    X = X @ X.mT + torch.eye(n, device="cuda")
    L = torch.linalg.cholesky(X)
    stmt = "torch.cholesky_inverse(L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)

        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:

```
[----------- torch.cholesky_inverse ----------]
                        |   magma   |  cusolver | speedup
1 threads: ------------------------------------ |
      (16, 16)          |     37.2  |     31.4  | 1.2
      (128, 128)        |    463.7  |     66.9  | 6.9
      (512, 512)        |   1632.4  |    283.8  | 5.8
      (2048, 2048)      |   7354.0  |   1779.6  | 4.1
      (16, 16, 16)      |     39.1  |     57.8  | 0.7
      (16, 128, 128)    |    597.9  |    159.7  | 3.7
      (16, 512, 512)    |   2947.6  |    852.0  | 3.5
      (16, 2048, 2048)  |  21657.2  |  12394.5  | 1.7
      (64, 16, 16)      |     39.9  |     58.1  | 0.7
      (64, 128, 128)    |    629.5  |    179.2  | 3.5
      (64, 512, 512)    |   4267.9  |   2095.4  | 2.0
      (64, 2048, 2048)  |  63178.9  |  44213.6  | 1.4

Times are in microseconds (us).
```
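
The speedup column is the MAGMA time divided by the cuSolver time, so values below 1.0 mark shapes where cuSolver was slower. A small sanity-check sketch that recomputes the column from the numbers reported above (it is not part of the original benchmarking script):

```python
# (shape, magma_us, cusolver_us, reported_speedup), copied from the results table
rows = [
    ((16, 16), 37.2, 31.4, 1.2),
    ((128, 128), 463.7, 66.9, 6.9),
    ((512, 512), 1632.4, 283.8, 5.8),
    ((2048, 2048), 7354.0, 1779.6, 4.1),
    ((16, 16, 16), 39.1, 57.8, 0.7),
    ((16, 128, 128), 597.9, 159.7, 3.7),
    ((16, 512, 512), 2947.6, 852.0, 3.5),
    ((16, 2048, 2048), 21657.2, 12394.5, 1.7),
    ((64, 16, 16), 39.9, 58.1, 0.7),
    ((64, 128, 128), 629.5, 179.2, 3.5),
    ((64, 512, 512), 4267.9, 2095.4, 2.0),
    ((64, 2048, 2048), 63178.9, 44213.6, 1.4),
]

# Recompute speedup = magma / cusolver, rounded to one decimal like the table.
recomputed = {shape: round(magma / cusolver, 1) for shape, magma, cusolver, _ in rows}

# Shapes where the cuSolver path is slower than MAGMA (speedup < 1.0).
slower_on_cusolver = [shape for shape, magma, cusolver, _ in rows if magma / cusolver < 1.0]
print(slower_on_cusolver)
```

In this table the only shapes below 1.0 are the two batched 16×16 cases.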

cc @nikitaved @eqy

pytorch-bot · 2026-02-10T15:43:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174681

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 9388ad4 with merge base 2f0a6bd:

NEW FAILURE - The following job has failed:

Lint / lintrunner-noclang-all / linux-job (gh)
>>> Lint for torch/distributed/tensor/_ops/_math_ops.py:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable) (gh) (disabled by #176123 but the issue was closed recently and a rebase is needed to make it pass)
test/test_indexing.py::TestIndexingMPS::test_index_reduce_reduce_mean_mps_float32

This comment was automatically generated by Dr. CI and updates every 15 minutes.

gderossi · 2026-03-02T22:14:39Z

@nikitaved I just duplicated the cholesky_solve code from #175898 and made the necessary tweaks for cholesky_inverse. It's mostly redundant and could probably be combined into that existing function (or the shared logic could be extracted into a helper). Do you have any preference as to the implementation?
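
The overlap between the two code paths follows from the math: `torch.cholesky_inverse(L)` returns the same matrix as `torch.cholesky_solve(B, L)` when `B` is the identity. A small CPU check of that relationship, as an illustration only (the shared code in BatchLinearAlgebra.cpp is not shown in this thread, and the variable names here are made up):

```python
import torch

torch.manual_seed(0)
n = 6
X = torch.rand(3, n, n, dtype=torch.float64)
A = X @ X.mT + torch.eye(n, dtype=torch.float64)  # symmetric positive definite
L = torch.linalg.cholesky(A)

# Solving A @ Z = I for Z yields A^{-1}, which is what cholesky_inverse computes.
identity = torch.eye(n, dtype=torch.float64).repeat(3, 1, 1)
via_solve = torch.cholesky_solve(identity, L)
via_inverse = torch.cholesky_inverse(L)

max_diff = (via_solve - via_inverse).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```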

eqy · 2026-03-04T15:59:47Z

@pytorchmergebot merge -i

pytorchmergebot · 2026-03-04T16:01:58Z

Merge started

Your change will be merged while ignoring the following 2 checks: Lint / lintrunner-noclang-all / linux-job, trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

@nikitaved

…ver unconditionally (pytorch#174681)

Pull Request resolved: pytorch#174681
Approved by: https://github.com/eqy

@nikitaved

…r unconditionally (#174769)

Both cuSolver and hipSolver implement potrs, so this just removes the MAGMA path entirely and adds a deprecation warning, and updates the tests to skip only if missing cuSolver. Thanks @nikitaved for the performance improvements in #175898! Note that cholesky_inverse depends on cholesky_solve functions and so #174681 should be merged before this.

Benchmarking script:

```python
import torch
import torch.utils.benchmark as benchmark

from itertools import product

results = []

batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_solve"
    sub_label = f"{shape}"
    A = torch.rand(*shape, device="cuda")
    A = A @ A.mT + torch.eye(n, device="cuda")
    B = torch.rand(*shape, device="cuda")
    L = torch.linalg.cholesky(A)
    stmt = "torch.cholesky_solve(B, L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)

        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L, 'B': B},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:

```
[------------ torch.cholesky_solve -----------]
                        |   magma   |  cusolver | speedup
1 threads: ------------------------------------ |
      (16, 16)          |   1579.0  |     21.3  | 74.1
      (128, 128)        |   1593.5  |     50.1  | 31.8
      (512, 512)        |   1731.0  |    226.3  | 7.6
      (2048, 2048)      |   2956.9  |   1587.6  | 1.9
      (16, 16, 16)      |     28.7  |     28.3  | 1.0
      (16, 128, 128)    |    601.8  |    121.8  | 4.9
      (16, 512, 512)    |   3246.6  |    735.5  | 4.4
      (16, 2048, 2048)  |  23108.0  |  11620.9  | 2.0
      (64, 16, 16)      |     29.5  |     28.2  | 1.0
      (64, 128, 128)    |    644.3  |    140.5  | 4.6
      (64, 512, 512)    |   4571.2  |   1840.4  | 2.5
      (64, 2048, 2048)  |  65104.7  |  42007.7  | 1.5

Times are in microseconds (us).
```

Pull Request resolved: #174769
Approved by: https://github.com/nikitaved, https://github.com/eqy


pytorch-bot added the "release notes: linalg_frontend" (release notes category) label Feb 10, 2026

pytorchbot added the open source label Feb 10, 2026

seemethere mentioned this pull request Feb 15, 2026

Consolidate or retire pytorch/almalinux-builder Docker images after MAGMA deprecation #175045

Open

Remove MAGMA path and add triangular solve fallback

5bbd895

gderossi force-pushed the deprecate-magma-cholesky-inverse branch from c0e57ea to 5bbd895 March 2, 2026 22:04

gderossi marked this pull request as ready for review March 2, 2026 22:14

gderossi requested review from Aidyn-A, IvanYashchuk, eqy, lezcano, nikitaved and syed-ahmed as code owners March 2, 2026 22:14

eqy approved these changes Mar 2, 2026


gderossi mentioned this pull request Mar 2, 2026

[MAGMA][CUDA] cholesky_solve: deprecate MAGMA and dispatch to cuSolver unconditionally #174769

Closed

Fix lint

6035d48

Aidyn-A added the ciflow/trunk (Trigger trunk jobs on your pull request) and ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300) labels Mar 3, 2026

nikitaved reviewed Mar 3, 2026


Comment thread aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp Outdated

nikitaved reviewed Mar 3, 2026


Comment thread aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp Outdated

Simplify shared code for cholesky solve and inverse

9388ad4

pytorch-bot removed the ciflow/trunk (Trigger trunk jobs on your pull request) and ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300) labels Mar 3, 2026

nikitaved added the ciflow/trunk (Trigger trunk jobs on your pull request) label Mar 4, 2026

pytorchmergebot added the merging label Mar 4, 2026

pytorchmergebot added the Merged label Mar 4, 2026

pytorchmergebot closed this in 596dbc5 Mar 4, 2026

pytorchmergebot removed the merging label Mar 4, 2026

gderossi deleted the deprecate-magma-cholesky-inverse branch March 9, 2026 14:25


[MAGMA][CUDA] cholesky_inverse: deprecate MAGMA and dispatch to cuSolver unconditionally #174681

gderossi wants to merge 3 commits into pytorch:main from gderossi:deprecate-magma-cholesky-inverse


6 participants
