[TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_120+ #152814

Open · wants to merge 4 commits into main

Conversation

Aidyn-A
Collaborator

@Aidyn-A Aidyn-A commented May 5, 2025

The float8 row-wise scaled matmuls are not supported on Blackwell yet. This PR adds skips to those tests to decrease the noise on sm_120+ machines.
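For context, a minimal sketch of what such a test skip can look like (illustrative only, not the exact diff in this PR; the `SM120OrLater` flag and the test class name are hypothetical stand-ins):

```python
import unittest

import torch

# Hypothetical helper (assumption): true on GPUs with compute capability >= 12.0.
SM120OrLater = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (12, 0)

class TestFP8Matmul(unittest.TestCase):
    @unittest.skipIf(SM120OrLater, "Row-wise scaled matmul is not supported on sm_120+ yet")
    def test_float8_rowwise_scaling_sanity(self):
        ...  # the actual test body lives in test/test_matmul_cuda.py
```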

cc @ptrblck @msaroufim @eqy @jerryzh168


pytorch-bot bot commented May 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152814

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures

As of commit 3a18ffa with merge base 5796212:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label May 5, 2025
@Aidyn-A Aidyn-A requested a review from eqy May 5, 2025 09:28
Collaborator

@eqy eqy left a comment


Is this still needed after #148421?
I'm seeing that on a fresh source build
test_float8_rowwise_scaling_sanity_use_fast_accum_True_cuda
test_float8_rowwise_scaling_sanity_use_fast_accum_False_cuda
test_scaled_mm_vs_emulated_row_wise
all pass

@Aidyn-A
Collaborator Author

Aidyn-A commented May 6, 2025

Is this still needed after #148421? I'm seeing that on a fresh source build test_float8_rowwise_scaling_sanity_use_fast_accum_True_cuda, test_float8_rowwise_scaling_sanity_use_fast_accum_False_cuda, and test_scaled_mm_vs_emulated_row_wise all pass.

My bad, it is needed for sm_120, not sm_100.
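To tell the two apart at runtime, a short sketch (assumption: sm_100 is datacenter Blackwell such as B200, while sm_120 covers the newer consumer Blackwell parts):

```python
import torch

major, minor = torch.cuda.get_device_capability()
is_sm100 = major == 10       # row-wise scaled matmul works here after #148421
is_sm120_plus = major >= 12  # still unsupported, hence the test skips in this PR
```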

@Aidyn-A Aidyn-A changed the title [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on Blackwell [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_12x May 6, 2025
@Aidyn-A Aidyn-A changed the title [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_12x [TEST][ATen][CUDA] Skip row-wise scaled matrix multiplication tests on sm_120+ May 6, 2025
@Aidyn-A Aidyn-A requested a review from eqy May 6, 2025 10:08
@drisspg drisspg added the module: cuda (Related to torch.cuda, and CUDA support in general) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels May 6, 2025
@Aidyn-A
Collaborator Author

Aidyn-A commented May 7, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased test_matmul_cuda_skip_rowwise_scaled_mm_on_blackwell onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout test_matmul_cuda_skip_rowwise_scaled_mm_on_blackwell && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the test_matmul_cuda_skip_rowwise_scaled_mm_on_blackwell branch from dec9eab to c02fe40 May 7, 2025 09:00
@Skylion007
Collaborator

Skylion007 commented May 7, 2025

Hmm, is this because CUTLASS is missing those specializations? Or because we are missing something minor on our side that would unblock support, like a missing template specialization, overly restrictive dispatch logic, or a missing CMake arch flag to build those kernels for SM120?

@Skylion007
Collaborator

Doesn't it support it here? Are we missing dispatch logic?

@Skylion007
Collaborator

Does just adding SM120 here fix it, or is SM100 not compatible with SM120?

const bool sm10x = properties != nullptr && properties->major == 10;

@Aidyn-A
Collaborator Author

Aidyn-A commented May 7, 2025

cc @eqy @malfet to review

@Aidyn-A
Collaborator Author

Aidyn-A commented May 7, 2025

Hmm, is this because CUTLASS is missing those specializations? Or because we are missing something minor on our side that would unblock support, like a missing template specialization, overly restrictive dispatch logic, or a missing CMake arch flag to build those kernels for SM120?

No, it is not as trivial as adding sm_120 to those places. I have tried that; CUTLASS just fails with an "uninitialized" error (whatever that means).
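
For anyone reproducing this, a minimal probe of the failing path (a sketch under assumptions: a float8-capable CUDA build, and torch._scaled_mm being the private op these tests exercise; shapes and scales are arbitrary):

```python
import torch

# Row-wise scaling: one scale per row of A and one per column of B.
M, K, N = 64, 128, 32
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # _scaled_mm expects a column-major mat2
scale_a = torch.ones(M, 1, device="cuda", dtype=torch.float32)
scale_b = torch.ones(1, N, device="cuda", dtype=torch.float32)

try:
    out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
    print("row-wise scaled mm OK:", out.shape)
except RuntimeError as err:
    print("row-wise scaled mm failed:", err)  # e.g. the CUTLASS "uninitialized" error on sm_120
```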

Labels
- module: cuda (Related to torch.cuda, and CUDA support in general)
- open source
- topic: not user facing (topic category)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
6 participants