API change for new enum in cusparseLtSplitKMode_t for cuSPARSELt 0.7.0+ #150536
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150536
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit f4531f0 with merge base 6e8602b. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Splitting this PR into the C++ functionality first, per the comment: "Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart."
add myself
@tinglvv Yes, the change is compatible with older versions.
@tinglvv you can add me as a reviewer. |
This sounds like a BC-breaking change; asking Jesse to have a look at it.
This is a BC-breaking change, let's understand
I've found the usage in pytorch/torch/sparse/_semi_structured_ops.py, line 191 at f20a266: sparse_result = torch._cslt_sparse_mm(
I am noticing that our CI (and binaries) are still using cuSPARSELt 0.6.3.2. Would this PR depend on a PR to bump cuSPARSELt to v0.7.0+ (e.g. v0.7.1)?
This is a BC-breaking change. However, I believe it's relatively low risk: _cslt_sparse_mm is a private API and split_k_one_kernel is not a commonly used param.
I think the better way to make this change would be to add a new kwarg split_k_mode and throw a deprecation warning when split_k_one_kernel is used. That way we can maintain BC for any use cases I am not aware of that do use this flag. We can deprecate in a subsequent release (breaking BC). A rough sketch of this approach follows.
This also has the added benefit of making the upgrade to cuSPARSELt 0.7.0 safer: as currently written, the version bump would depend on a BC-breaking change. If we add a new kwarg instead, then the BC-breaking change happens after the version bump.
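As a sketch of that suggestion only: the wrapper below, the split_k_mode kwarg, and the integer encodings are hypothetical illustrations of the proposed deprecation path, not what this PR actually implements.

```python
import warnings

import torch

# Hypothetical integer encodings mirroring cusparseLtSplitKMode_t; cuSPARSELt
# 0.7.0+ adds further modes beyond these two, which is why a bool no longer fits.
SPLIT_K_MODE_ONE_KERNEL = 0
SPLIT_K_MODE_TWO_KERNELS = 1


def cslt_sparse_mm_compat(compressed_A, dense_B, *, split_k=1,
                          split_k_one_kernel=None, split_k_mode=None, **kwargs):
    """BC-shim sketch: accept the old bool flag and the proposed int kwarg."""
    if split_k_one_kernel is not None:
        warnings.warn(
            "split_k_one_kernel is deprecated; pass split_k_mode instead",
            FutureWarning,
        )
        if split_k_mode is None:
            # Translate the legacy bool into the equivalent mode value.
            split_k_mode = (SPLIT_K_MODE_ONE_KERNEL if split_k_one_kernel
                            else SPLIT_K_MODE_TWO_KERNELS)
    if split_k_mode is None:
        split_k_mode = SPLIT_K_MODE_ONE_KERNEL
    # Assumes the underlying private op ends up accepting an int split_k_mode,
    # which is exactly the schema question being discussed in this PR.
    return torch._cslt_sparse_mm(compressed_A, dense_B, split_k=split_k,
                                 split_k_mode=split_k_mode, **kwargs)
```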
@@ -3371,7 +3371,7 @@
  dispatch:
    CUDA: _cslt_compress

- func: _cslt_sparse_mm(Tensor compressed_A, Tensor dense_B, Tensor? bias=None, Tensor? alpha=None, ScalarType? out_dtype=None, bool transpose_result=False, int alg_id=0, int split_k=1, bool split_k_one_kernel=True) -> Tensor
Can we add this as a new kwarg instead of renaming the existing one? Then we can throw a warning when split_k_one_kernel is used. That way we can maintain BC and deprecate split_k_one_kernel in the subsequent version.
Hi @jcaip, could you elaborate on the BC change? Is _cslt_sparse_mm a public API?
If new kwargs have to be added, I'd suggest also adding split_k_buffers (see https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltmatmulalgattribute-t) because it's also part of the split-k parameters tuned by the search routine.
@j4yan It's a private API; I'm just suggesting a way that this change could be made safer. Anything that calls torch._cslt_sparse_mm(..., split_k_one_kernel=True) will break with this change.
EDIT: I actually read our BC policy a bit more closely and realized that private ops are specifically excluded, so this isn't technically BC-breaking, in the sense that we don't have any guarantees for private ops. I still have a preference for making the change in a manner that doesn't break existing code, but as I said above I feel this is pretty low risk anyway, so I'm approving to unblock.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased: 01535ef to f4531f0.
Hi @jcaip, thanks for reviewing. The change is currently failing one test with the error below; does this mean we need to adjust the changes, or can we ignore this test warning? Thanks again.
@malfet @jcaip If I am not mistaken, we can add an entry to pytorch/test/forward_backward_compatibility/check_forward_backward_compatibility.py (lines 50 to 132 at 7e637de). Would that be enough to settle the BC-breaking change?
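If that route is taken, the addition would presumably be an allow-list entry roughly along these lines; the tuple format, the exact op-name pattern, and the expiry date below are assumptions about how that file is used, not taken from this PR.

```python
import datetime

# Hypothetical entry for the ALLOW_LIST in
# test/forward_backward_compatibility/check_forward_backward_compatibility.py,
# asking the BC check to skip schema comparison for the changed op until the
# given date. The real pattern should follow the neighboring entries in that file.
ALLOW_LIST_ENTRY = ("_cslt_sparse_mm", datetime.date(2025, 7, 1))
```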
Changing the bool to an int to express split_k_mode. Before 0.7.0 there were only two cusparseLtSplitKMode_t enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.
For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum cusparseLtSplitKMode_t and a bool type is no longer sufficient for it (it has to be replaced with an integer): https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t
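For illustration, the call-site difference would look roughly like the sketch below. The split_k_mode name follows this PR's description, but the concrete integer value is a placeholder for a cusparseLtSplitKMode_t constant, not a verified encoding.

```python
import torch


def sparse_mm_old(compressed_A, dense_B):
    # Pre-0.7.0 schema: a bool is enough because the enum only has
    # ONE_KERNEL and TWO_KERNELS.
    return torch._cslt_sparse_mm(compressed_A, dense_B, split_k=4,
                                 split_k_one_kernel=True)


def sparse_mm_new(compressed_A, dense_B):
    # Proposed 0.7.0+ schema: an int carries the cusparseLtSplitKMode_t value,
    # leaving room for the additional modes introduced in 0.7.0.
    return torch._cslt_sparse_mm(compressed_A, dense_B, split_k=4,
                                 split_k_mode=0)  # placeholder for ONE_KERNEL
```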
Error we see without the change
cc @ezyang @gchanan @eqy @ptrblck @malfet @atalman @nWEIdia