[MAGMA][CUDA] cholesky_inverse: deprecate MAGMA and dispatch to cuSolver unconditionally #174681

Closed
gderossi wants to merge 3 commits into pytorch:main from gderossi:deprecate-magma-cholesky-inverse

Conversation

@gderossi
Contributor

@gderossi gderossi commented Feb 10, 2026

Both cuSolver and hipSolver implement potrs, so this PR removes the MAGMA path entirely, adds a deprecation warning, and updates the tests to skip only when cuSolver is missing. Thanks @nikitaved for implementing the relevant performance improvements in #175898; this PR also takes advantage of the triangular solve path, and the benchmarking results are much better than before.

Benchmarking script:

```python
import torch
import torch.utils.benchmark as benchmark

from itertools import product

results = []

batches = [(), (16,), (64,)]
sizes = [16, 128, 512, 2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_inverse"
    sub_label = f"{shape}"
    X = torch.rand(*shape, device="cuda")
    X = X @ X.mT + torch.eye(n, device="cuda")
    L = torch.linalg.cholesky(X)
    stmt = "torch.cholesky_inverse(L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)

        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:

```
[----------- torch.cholesky_inverse ----------]
                        |   magma   |  cusolver | speedup
1 threads: ------------------------------------ |
      (16, 16)          |     37.2  |     31.4  | 1.2
      (128, 128)        |    463.7  |     66.9  | 6.9
      (512, 512)        |   1632.4  |    283.8  | 5.8
      (2048, 2048)      |   7354.0  |   1779.6  | 4.1
      (16, 16, 16)      |     39.1  |     57.8  | 0.7
      (16, 128, 128)    |    597.9  |    159.7  | 3.7
      (16, 512, 512)    |   2947.6  |    852.0  | 3.5
      (16, 2048, 2048)  |  21657.2  |  12394.5  | 1.7
      (64, 16, 16)      |     39.9  |     58.1  | 0.7
      (64, 128, 128)    |    629.5  |    179.2  | 3.5
      (64, 512, 512)    |   4267.9  |   2095.4  | 2.0
      (64, 2048, 2048)  |  63178.9  |  44213.6  | 1.4

Times are in microseconds (us).
```
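
The speedup column appears to be the MAGMA time divided by the cuSolver time, so values below 1 mean cuSolver is slower. A small sketch, with the timings copied from the cholesky_inverse table above, recomputes that ratio and lists the shapes where cuSolver loses. The names `timings_us`, `speedup`, and `slower` are illustrative only and do not come from the PR.

```python
# Timings in microseconds as (magma, cusolver), copied from the table above.
timings_us = {
    (16, 16): (37.2, 31.4),
    (128, 128): (463.7, 66.9),
    (512, 512): (1632.4, 283.8),
    (2048, 2048): (7354.0, 1779.6),
    (16, 16, 16): (39.1, 57.8),
    (16, 128, 128): (597.9, 159.7),
    (16, 512, 512): (2947.6, 852.0),
    (16, 2048, 2048): (21657.2, 12394.5),
    (64, 16, 16): (39.9, 58.1),
    (64, 128, 128): (629.5, 179.2),
    (64, 512, 512): (4267.9, 2095.4),
    (64, 2048, 2048): (63178.9, 44213.6),
}


def speedup(magma_us, cusolver_us):
    """Assumed definition: a ratio above 1 means cuSolver is faster than MAGMA."""
    return magma_us / cusolver_us


speedups = {shape: speedup(*pair) for shape, pair in timings_us.items()}

# Shapes where dispatching to cuSolver is a regression relative to MAGMA.
slower = [shape for shape, ratio in speedups.items() if ratio < 1.0]

for shape, ratio in speedups.items():
    print(f"{str(shape):>18}  {ratio:4.1f}x")
print("cuSolver slower than MAGMA for:", slower)
```

Under that definition, only the two batched 16x16 cases fall below 1; every other measured shape favors cuSolver.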

cc @nikitaved @eqy

@pytorch-bot

pytorch-bot Bot commented Feb 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174681

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 9388ad4 with merge base 2f0a6bd (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@gderossi gderossi force-pushed the deprecate-magma-cholesky-inverse branch from c0e57ea to 5bbd895 Compare March 2, 2026 22:04
@gderossi
Contributor Author

gderossi commented Mar 2, 2026

@nikitaved I just duplicated the cholesky_solve code from #175898 and made the necessary tweaks for cholesky_inverse. It's mostly redundant and can probably be combined into that existing function (or the shared logic can be extracted into a helper). Do you have any preference as to the implementation?
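
For readers following the refactoring question above, here is a minimal Python-level sketch of what "shared logic extracted into a helper" could look like. It is not the ATen C++ code from this PR: `solve_with_cholesky_factor`, `cholesky_solve_sketch`, and `cholesky_inverse_sketch` are hypothetical names, and two calls to `torch.linalg.solve_triangular` stand in for the potrs-style solve. The sketch assumes a lower-triangular factor and treats the inverse as a solve against the identity.

```python
import torch


def solve_with_cholesky_factor(L, B):
    """Hypothetical shared helper: solve (L @ L^H) X = B via two triangular solves."""
    Y = torch.linalg.solve_triangular(L, B, upper=False)        # forward solve: L Y = B
    return torch.linalg.solve_triangular(L.mH, Y, upper=True)   # backward solve: L^H X = Y


def cholesky_solve_sketch(B, L):
    # Same argument order as torch.cholesky_solve(B, L).
    return solve_with_cholesky_factor(L, B)


def cholesky_inverse_sketch(L):
    # The inverse is the solution of A X = I, so reuse the helper with an identity RHS.
    n = L.shape[-1]
    eye = torch.eye(n, dtype=L.dtype, device=L.device).expand_as(L)
    return solve_with_cholesky_factor(L, eye)


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Small batched symmetric positive-definite input, built like the benchmark inputs.
A = torch.rand(4, 8, 8, dtype=torch.float64, device=device)
A = A @ A.mT + torch.eye(8, dtype=torch.float64, device=device)
L = torch.linalg.cholesky(A)
B = torch.rand(4, 8, 3, dtype=torch.float64, device=device)

# Compare the sketches against the public PyTorch functions.
inverse_matches = torch.allclose(cholesky_inverse_sketch(L), torch.cholesky_inverse(L))
solve_matches = torch.allclose(cholesky_solve_sketch(B, L), torch.cholesky_solve(B, L))
print(inverse_matches, solve_matches)
```

Keeping the solve in one helper means the two entry points differ only in how they build the right-hand side, which is the kind of redundancy the comment above describes.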

@gderossi gderossi marked this pull request as ready for review March 2, 2026 22:14
@Aidyn-A Aidyn-A added ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Mar 3, 2026
Comment thread aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp Outdated
Comment thread aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp Outdated
@pytorch-bot pytorch-bot Bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Mar 3, 2026
@nikitaved nikitaved added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 4, 2026
@eqy
Collaborator

eqy commented Mar 4, 2026

@pytorchmergebot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: Lint / lintrunner-noclang-all / linux-job, trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)


pytorchmergebot pushed a commit to anatoliylitv/pytorch that referenced this pull request Mar 4, 2026
…ver unconditionally (pytorch#174681)

Pull Request resolved: pytorch#174681
Approved by: https://github.com/eqy
pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2026
…r unconditionally (#174769)

Both cuSolver and hipSolver implement potrs, so this removes the MAGMA path entirely, adds a deprecation warning, and updates the tests to skip only when cuSolver is missing. Thanks @nikitaved for the performance improvements in #175898! Note that cholesky_inverse depends on the cholesky_solve functions, so #174681 should be merged before this.

Benchmarking script:
```python
import torch
import torch.utils.benchmark as benchmark

from itertools import product

results = []

batches = [(), (16,), (64,)]
sizes = [16, 128, 512,  2048]

for b, n in product(batches, sizes):
    shape = b + (n, n)
    print(f"Testing shape={shape}")
    label = "torch.cholesky_solve"
    sub_label = f"{shape}"
    A = torch.rand(*shape, device="cuda")
    A = A @ A.mT + torch.eye(n, device="cuda")
    B = torch.rand(*shape, device="cuda")
    L = torch.linalg.cholesky(A)
    stmt = "torch.cholesky_solve(B, L)"
    for backend in ("magma", "cusolver"):
        torch.backends.cuda.preferred_linalg_library(backend)
        # warm-up
        for _ in range(5):
            exec(stmt)

        results.append(benchmark.Timer(
            stmt=stmt,
            globals={'L': L, 'B': B},
            label=label,
            sub_label=sub_label,
            description=backend,
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results)
compare.print()
```

Benchmarking results on RTX Pro 6000:
```
[------------ torch.cholesky_solve -----------]
                        |   magma   |  cusolver | speedup
1 threads: ------------------------------------ |
      (16, 16)          |   1579.0  |     21.3  | 74.1
      (128, 128)        |   1593.5  |     50.1  | 31.8
      (512, 512)        |   1731.0  |    226.3  | 7.6
      (2048, 2048)      |   2956.9  |   1587.6  | 1.9
      (16, 16, 16)      |     28.7  |     28.3  | 1.0
      (16, 128, 128)    |    601.8  |    121.8  | 4.9
      (16, 512, 512)    |   3246.6  |    735.5  | 4.4
      (16, 2048, 2048)  |  23108.0  |  11620.9  | 2.0
      (64, 16, 16)      |     29.5  |     28.2  | 1.0
      (64, 128, 128)    |    644.3  |    140.5  | 4.6
      (64, 512, 512)    |   4571.2  |   1840.4  | 2.5
      (64, 2048, 2048)  |  65104.7  |  42007.7  | 1.5

Times are in microseconds (us).
```

Pull Request resolved: #174769
Approved by: https://github.com/nikitaved, https://github.com/eqy
@gderossi gderossi deleted the deprecate-magma-cholesky-inverse branch March 9, 2026 14:25
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…ver unconditionally (pytorch#174681)

EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…r unconditionally (pytorch#174769)


Labels

ciflow/trunk, Merged, open source, release notes: linalg_frontend
