API change for new enum in cusparseLtSplitKMode_t for cuSPARSELt 0.7.0+ #150536
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150536
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit 5e71e2a with merge base a13c8f2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Splitting this PR into C++ functionality first due to comment: "Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart."
add myself
@tinglvv Yes, the change is compatible with older versions.
@tinglvv you can add me as a reviewer. |
This sounds like a BC-breaking change; asking Jesse to have a look at it.
This is a BC-breaking change, let's understand it.
I've found the usage in pytorch/torch/sparse/_semi_structured_ops.py (line 191 at f20a266): sparse_result = torch._cslt_sparse_mm(
I am noticing that our CI (and binary) are still using cuSPARSELt 0.6.3.2. Would this PR depend on a PR to bump cuSPARSELt to v0.7.0+ (e.g. v0.7.1)?
This is a BC-breaking change. However, I believe it's relatively low risk: _cslt_sparse_mm is a private API and split_k_one_kernel is not a commonly used param.
I think the better way to make this change would be to add a new kwarg split_k_mode and throw a deprecation warning when split_k_one_kernel is used. That way we can maintain BC for any use cases I am not aware of that do use this flag. We can deprecate in a subsequent release (breaking BC).
This also has the added benefit of making the upgrade to cuSPARSELt 0.7.0 safer: as currently written, the version bump would depend on a BC-breaking change. If we add a new kwarg instead, the BC-breaking change happens after the version bump.
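For illustration only, here is a minimal Python-level sketch of the suggested pattern. The real op lives in C++/native_functions.yaml; the function name, the integer encoding, and the default below are assumptions rather than the final API.

```python
import warnings

def cslt_sparse_mm_compat(A, B, *, split_k=1, split_k_mode=None, split_k_one_kernel=None):
    # Hypothetical shim: prefer the new integer split_k_mode, but keep accepting
    # the legacy bool split_k_one_kernel for one release with a deprecation warning.
    if split_k_one_kernel is not None:
        warnings.warn(
            "split_k_one_kernel is deprecated; pass split_k_mode instead",
            FutureWarning,
        )
        if split_k_mode is None:
            # Illustrative mapping of the legacy bool onto the two pre-0.7.0
            # modes (ONE_KERNEL vs. TWO_KERNELS); the real numeric encoding is
            # whatever the C++ op defines.
            split_k_mode = 0 if split_k_one_kernel else 1
    if split_k_mode is None:
        split_k_mode = 0  # assumed default, mirroring the old split_k_one_kernel=True
    # ... forward A, B, split_k, and split_k_mode to the underlying native op ...
    return split_k, split_k_mode
```

With a shim like this, existing callers that still pass split_k_one_kernel keep working (with a warning) until the flag is removed in a later release.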
@pytorchbot rebase
@malfet @jcaip if I am not mistaken, we can add an entry to pytorch/test/forward_backward_compatibility/check_forward_backward_compatibility.py (lines 50 to 132 at 7e637de). Would that be enough to settle the BC-breaking change?
Sorry for the late response, I was traveling for ICLR and then on PTO the last week. Yes, I think it should be fine to add it here. From the comment above, I think you can just put: datetime.date(9999, 1, 1).
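As a hedged illustration of what such an entry might look like (the actual list name and tuple layout should be taken from check_forward_backward_compatibility.py itself):

```python
import datetime

# Sketch only: the real allowlist lives in
# test/forward_backward_compatibility/check_forward_backward_compatibility.py.
ALLOW_LIST = [
    # Schema changed by this PR (bool split_k_one_kernel -> int split_k_mode);
    # the far-future date keeps the exception in place indefinitely.
    ("aten::_cslt_sparse_mm", datetime.date(9999, 1, 1)),
]
```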
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased: df3cadd to 5e71e2a.
@pytorchbot merge
This PR has pending changes requested. Please address the comments and update the PR before merging.
Resetting the review per Jesse's approval. Merging for now.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
lgtm
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Changing the bool to an int to express split_k_mode. Before 0.7.0 there were only two cusparseLtSplitKMode_t enum values, ONE_KERNEL and TWO_KERNELS, so a boolean was enough, but since 0.7.0 there are more.
For Blackwell, there has to be a minor change to the parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since new values were introduced to the enum cusparseLtSplitKMode_t and a bool type is no longer enough for it (it has to be replaced with an integer): https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t
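To illustrate why the bool no longer suffices, here is a small Python sketch mirroring the pre-0.7.0 modes. The numeric values are illustrative only; the authoritative values and the new 0.7.0+ entries come from the cuSPARSELt header and the docs linked above.

```python
from enum import IntEnum

class SplitKMode(IntEnum):
    # The two cusparseLtSplitKMode_t values that existed before cuSPARSELt 0.7.0;
    # a bool could only ever distinguish these two. 0.7.0+ adds further modes
    # (see the linked docs), so the parameter has to become an integer.
    ONE_KERNEL = 0   # illustrative value
    TWO_KERNELS = 1  # illustrative value

def legacy_bool_to_mode(split_k_one_kernel: bool) -> SplitKMode:
    # How the old bool parameter maps onto the enum.
    return SplitKMode.ONE_KERNEL if split_k_one_kernel else SplitKMode.TWO_KERNELS
```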
Error we see without the change
cc @ezyang @gchanan @eqy @ptrblck @malfet @atalman @nWEIdia