
API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ #150536


Open
wants to merge 7 commits into main from cusparselt-blackwell

Conversation

Collaborator

@tinglvv tinglvv commented Apr 2, 2025

Changing the bool to an int to express split_k_mode. Before 0.7.0 there were only two cusparseLtSplitKMode_t enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.

For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103): new values were introduced to the enum cusparseLtSplitKMode_t, and a bool can no longer represent them, so it has to be replaced with an integer. See https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t
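To illustrate why the bool no longer suffices, here is a minimal Python sketch; the names mirror the cusparseLtSplitKMode_t values discussed in this thread, but the numeric encodings are placeholders, not the real cuSPARSELt ones:

```python
# Placeholder encodings standing in for cusparseLtSplitKMode_t; the real
# numeric values come from the cuSPARSELt headers, not from this sketch.
ONE_KERNEL, TWO_KERNELS, STREAMK = 0, 1, 2  # 0.7.0+ adds modes beyond the first two

def mode_from_bool(split_k_one_kernel: bool) -> int:
    # The legacy bool can only select between the first two modes...
    return ONE_KERNEL if split_k_one_kernel else TWO_KERNELS

# ...so a mode like STREAMK, which the 0.7.0+ tuning search may return,
# cannot be round-tripped through a bool parameter; an int can carry any
# enum value.
```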

Error we see without the change:

RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8

cc @ezyang @gchanan @eqy @ptrblck @malfet @atalman @nWEIdia

@tinglvv tinglvv requested review from eqy and syed-ahmed as code owners April 2, 2025 13:28
@pytorch-bot pytorch-bot bot added the release notes: sparse release notes category label Apr 2, 2025

pytorch-bot bot commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150536

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f4531f0 with merge base 6e8602b:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

github-actions bot commented Apr 2, 2025

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.



@drisspg drisspg added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 3, 2025
Collaborator Author

tinglvv commented Apr 7, 2025

Splitting this PR to land the C++ functionality first, per the comment: "Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart."

@tinglvv tinglvv changed the title Blackwell support for cusparseLT [backend] Blackwell support for cusparseLT Apr 7, 2025

j4yan commented Apr 7, 2025

add myself


j4yan commented Apr 10, 2025

Hi @tinglvv @eqy @nWEIdia,
After the split, do you expect all the tests to pass?
I doubt the unchanged Python wrappers will still work with the new C++ APIs.

@tinglvv tinglvv changed the title [backend] Blackwell support for cusparseLT blackwell support for cusparseLT Apr 11, 2025
@malfet malfet added the module: bc-breaking Related to a BC-breaking change label Apr 11, 2025
@pytorch-bot pytorch-bot bot added the topic: bc breaking topic category label Apr 11, 2025
Collaborator Author

tinglvv commented Apr 11, 2025

Hi @j4yan, upon discussion with @malfet, could you provide more justification for the change? I was also wondering if we could add someone from the cusparseLt team for review.

Would the change be compatible with older versions < 0.7.0 (e.g. cusparseLt 0.6.3)?


j4yan commented Apr 15, 2025

@tinglvv Yes, the change is compatible with older versions.
I am not sure how to better justify the change. Without it, there is no way to denote the newly added cusparseLtSplitKMode_t values such as CUSPARSELT_STREAMK, which is a perf parameter returned by the tuning routine _cslt_sparse_mm_search.


j4yan commented Apr 15, 2025

@tinglvv you can add me as a reviewer.

Skylion007 previously approved these changes Apr 18, 2025
@malfet malfet requested review from supriyar and jcaip April 18, 2025 19:26
Contributor

@malfet malfet left a comment


This sounds like a BC breaking change, asking Jesse to have a look at it

@malfet malfet dismissed Skylion007’s stale review April 18, 2025 19:31

This is a BC-breaking change, let's understand.

Contributor

@malfet malfet left a comment


I've found the usage in:

sparse_result = torch._cslt_sparse_mm(

@tinglvv tinglvv changed the title blackwell support for cusparseLT API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.1 Apr 18, 2025
@tinglvv tinglvv changed the title API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.1 API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ Apr 18, 2025
Collaborator

nWEIdia commented Apr 18, 2025

I am noticing that our CI (and binaries) are still using cusparseLt 0.6.3.2.
https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_cuda.sh#L241

Would this PR depend on a PR to bump cusparseLt to v0.7.0+ (e.g. v0.7.1)?

Contributor

@jcaip jcaip left a comment


This is a BC-breaking change. However, I believe it's relatively low risk: _cslt_sparse_mm is a private API and split_k_one_kernel is not a commonly used param.

I think the better way to make this change would be to add a new kwarg split_k_mode and throw a deprecation warning when split_k_one_kernel is used. That way we can maintain BC for any use cases I am not aware of that do use this flag. We can deprecate it in a subsequent release (breaking BC).

This also has the added benefit that it makes the upgrade to cuSPARSELt 0.7.0 safer: as currently written, the version bump would depend on a BC-breaking change. If we add a new kwarg instead, then the BC-breaking change happens after the version bump.
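A minimal Python sketch of that approach (hypothetical names and defaults, not the actual PR code), assuming split_k_mode uses the placeholder integer encodings from the sketch earlier in this thread:

```python
import warnings

def _cslt_sparse_mm_compat(compressed_A, dense_B, *, split_k=1,
                           split_k_one_kernel=None, split_k_mode=None):
    """Hypothetical BC shim: accept both kwargs, warn on the deprecated one."""
    if split_k_one_kernel is not None:
        warnings.warn(
            "split_k_one_kernel is deprecated; pass split_k_mode instead",
            DeprecationWarning,
        )
        if split_k_mode is None:
            # Map the legacy bool onto the first two enum values
            # (0/1 are placeholder encodings, as in the earlier sketch).
            split_k_mode = 0 if split_k_one_kernel else 1
    if split_k_mode is None:
        split_k_mode = 0  # default: ONE_KERNEL
    # A real implementation would dispatch to the underlying op here;
    # this sketch just returns the resolved mode.
    return split_k_mode
```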

@@ -3371,7 +3371,7 @@
dispatch:
CUDA: _cslt_compress

- func: _cslt_sparse_mm(Tensor compressed_A, Tensor dense_B, Tensor? bias=None, Tensor? alpha=None, ScalarType? out_dtype=None, bool transpose_result=False, int alg_id=0, int split_k=1, bool split_k_one_kernel=True) -> Tensor
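The added line of this hunk is not captured in the excerpt above. As a loudly hypothetical reconstruction (the exact parameter name and default belong to the final PR, not this sketch), the replacement presumably swaps the trailing bool for an integer:

```
+ func: _cslt_sparse_mm(Tensor compressed_A, Tensor dense_B, Tensor? bias=None, Tensor? alpha=None, ScalarType? out_dtype=None, bool transpose_result=False, int alg_id=0, int split_k=1, int split_k_mode=0) -> Tensor
```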
Contributor


Can we add this as a new kwarg instead of renaming the existing one? Then we can throw a warning when split_k_one_kernel is used. That way we can maintain BC and deprecate split_k_one_kernel in a subsequent version.



Hi @jcaip, could you elaborate on the BC change? Is _cslt_sparse_mm a public API?
If new kwargs have to be added, I'd suggest also adding split_k_buffers (see https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltmatmulalgattribute-t), because it's also part of the split-k parameters tuned by the search routine.

Contributor

@jcaip jcaip Apr 25, 2025


@j4yan It's a private API; I'm just suggesting a way that this change could be made safer.

Anything that calls torch._cslt_sparse_mm(... , split_k_one_kernel=True) will break with this change.

EDIT: I actually read our BC policy a bit more closely and realized that private ops are specifically excluded, so this isn't technically BC-breaking, in the sense that we don't have any guarantees for private ops.

I still have a preference for making the change in a manner that doesn't break existing code, but as I said above, I feel like this is pretty low risk anyway, so I'm approving to unblock.
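For illustration, a sketch of the breakage (A_compressed and B are stand-in tensors, and split_k_mode is the kwarg name proposed above, not necessarily the final API):

```python
import torch  # requires a CUDA build with cuSPARSELt support

# Before this PR, per the current schema:
out = torch._cslt_sparse_mm(A_compressed, B, split_k=2, split_k_one_kernel=True)

# After the rename, the call above fails with an unknown-kwarg error;
# callers would pass an integer mode instead, e.g.:
out = torch._cslt_sparse_mm(A_compressed, B, split_k=2, split_k_mode=1)
```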

Collaborator Author

tinglvv commented Apr 25, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cusparselt-blackwell onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cusparselt-blackwell && git pull --rebase)

Collaborator Author

tinglvv commented Apr 30, 2025

Hi @jcaip, thanks for reviewing. The change is currently failing one test with the error below. Does this mean we need to adjust the changes, or can we ignore this test warning? Thanks again.

[WARNING 2025-04-25 20:49:21,601 check_forward_backward_compatibility.py:301] The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 

Broken ops: [
	aten::_cslt_sparse_mm(Tensor compressed_A, Tensor dense_B, Tensor? bias=None, Tensor? alpha=None, ScalarType? out_dtype=None, bool transpose_result=False, int alg_id=0, int split_k=1, bool split_k_one_kernel=True) -> Tensor
]

Collaborator

Aidyn-A commented May 5, 2025

@malfet @jcaip if I am not mistaken, we can add aten::_cslt_sparse_mm to the ALLOW_LIST to make CI pass:

ALLOW_LIST = [
("c10_experimental", datetime.date(9999, 1, 1)),
# Internal
("static", datetime.date(9999, 1, 1)),
("prim::ModuleDictIndex", datetime.date(9999, 1, 1)),
("prim::MKLDNNRelu6", datetime.date(9999, 1, 1)),
("prim::MKLDNNRelu6_", datetime.date(9999, 1, 1)),
("prim::is_ort", datetime.date(9999, 1, 1)),
("prim::Concat", datetime.date(9999, 1, 1)),
("aten::_NestedTensor_GeneralizedBMM", datetime.date(9999, 1, 1)),
# Internal, profiler-specific ops
("profiler::_call_end_callbacks_on_jit_fut*", datetime.date(9999, 1, 1)),
("profiler::_record_function_enter", datetime.date(9999, 1, 1)),
("aten::_cholesky_helper", datetime.date(9999, 1, 1)),
("aten::_lstsq_helper", datetime.date(9999, 1, 1)),
("aten::_syevd_helper", datetime.date(9999, 1, 1)),
("aten::_linalg_solve_out_helper_", datetime.date(9999, 1, 1)),
("aten::select_backward", datetime.date(9999, 1, 1)),
("aten::lstsq", datetime.date(9999, 1, 1)),
("aten::lstsq.X", datetime.date(9999, 1, 1)),
("aten::slice_backward", datetime.date(9999, 1, 1)),
("aten::diagonal_backward", datetime.date(9999, 1, 1)),
("aten::rowwise_prune", datetime.date(9999, 1, 1)),
("aten::eig", datetime.date(9999, 1, 1)),
("aten::eig.e", datetime.date(9999, 1, 1)),
("aten::adaptive_avg_pool3d_backward", datetime.date(9999, 1, 1)),
("aten::_embedding_bag_dense_backward", datetime.date(9999, 1, 1)),
("aten::matrix_rank", datetime.date(9999, 1, 1)),
("aten::matrix_rank.tol", datetime.date(9999, 1, 1)),
("aten::randperm", datetime.date(9999, 1, 1)),
("aten::solve", datetime.date(9999, 1, 1)),
("aten::solve.solution", datetime.date(9999, 1, 1)),
("aten::_solve_helper", datetime.date(9999, 1, 1)),
("aten::_convolution_nogroup", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_backward", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_backward_bias", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_backward_input", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_backward_weight", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_transpose_backward", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_transpose_backward_input", datetime.date(9999, 1, 1)),
("aten::miopen_convolution_transpose_backward_weight", datetime.date(9999, 1, 1)),
("aten::miopen_depthwise_convolution_backward", datetime.date(9999, 1, 1)),
("aten::miopen_depthwise_convolution_backward_input", datetime.date(9999, 1, 1)),
("aten::miopen_depthwise_convolution_backward_weight", datetime.date(9999, 1, 1)),
("aten::_nested_tensor", datetime.date(9999, 1, 1)),
("prepacked::unpack_prepacked_sizes_conv2d", datetime.date(9999, 1, 1)),
("prepacked::unpack_prepacked_sizes_linear", datetime.date(9999, 1, 1)),
("aten::_symeig_helper", datetime.date(9999, 1, 1)),
("aten::symeig", datetime.date(9999, 1, 1)),
("aten::symeig.e", datetime.date(9999, 1, 1)),
("aten::native_multi_head_self_attention", datetime.date(9999, 1, 1)),
("aten::_native_multi_head_self_attention", datetime.date(9999, 1, 1)),
("aten::grid_sampler_3d_backward", datetime.date(9999, 1, 1)),
("aten::_transform_bias_rescale_qkv", datetime.date(9999, 1, 1)),
("prim::infer_squeeze_size.dim", datetime.date(9999, 1, 1)),
("prim::infer_squeeze_size", datetime.date(9999, 1, 1)),
("aten::_weight_norm_cuda_interface", datetime.date(9999, 1, 1)),
("aten::_weight_norm_cuda_interface_backward", datetime.date(9999, 1, 1)),
("aten::empty.SymInt", datetime.date(9999, 1, 1)),
# nested tensor temporary auxiliary ops
("aten::_reshape_nested", datetime.date(9999, 1, 1)),
("aten::_reshape_nested_backward", datetime.date(9999, 1, 1)),
("aten::mps_linear", datetime.date(9999, 1, 1)),
("aten::_mps_linear", datetime.date(9999, 1, 1)),
("aten::_mps_max_pool2d", datetime.date(9999, 1, 1)),
("aten::_mps_max_pool2d.out", datetime.date(9999, 1, 1)),
("aten::mps_max_pool2d_backward", datetime.date(9999, 1, 1)),
("aten::mps_max_pool2d_backward.out", datetime.date(9999, 1, 1)),
# TODO: FIXME: prims shouldn't be checked
("prims::.*", datetime.date(9999, 1, 1)),
("aten::_scaled_dot_product_cudnn_attention", datetime.date(9999, 1, 1)),
# BetterTransformer 1.0 internal operators
("aten::_transformer_decoder_only_layer_fwd", datetime.date(9999, 1, 1)),
("aten::_native_decoder_only_multi_head_attention", datetime.date(9999, 1, 1)),
# These ops were moved to python under the c10d_functional namespace
("aten::wait_tensor", datetime.date(9999, 1, 30)),
("aten::reduce_scatter_tensor", datetime.date(9999, 1, 30)),
("aten::all_gather_into_tensor", datetime.date(9999, 1, 30)),
("aten::all_reduce", datetime.date(9999, 1, 30)),
# These ops are defined in torch/csrc/distributed/c10d/Ops.cpp
# TODO: add back restriction when c10d ops can be exported
("c10d::.*", datetime.date(9999, 1, 1)),
]
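For concreteness, the new entry would presumably follow the same pattern as the other internal-op entries (the far-future date is this list's convention for never-expiring exceptions):

```python
("aten::_cslt_sparse_mm", datetime.date(9999, 1, 1)),
```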

Would that be enough to settle the bc-breaking change?
