[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size#120925
[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size#120925Aidyn-A wants to merge 6 commits into
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120925
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit b39bc70 with merge base 7128504 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot merge |
Merge failedReason: Approvers from one of the following sets are needed:
|
malfet
left a comment
There was a problem hiding this comment.
LGTM, but do you know if this might push memory utilization for some smaller GPUs beyond the limit?
|
We can add an additional qualifier that sm < 80 stick with the original 1 MiB, what do you think @Aidyn-A |
|
A dilemma of perf vs memory. I guess it makes sense to decrease it for old/small GPU. Lemme commit it 👍 |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot revert -m "broke inductor models and caused accuracy regression on nightly dashboard https://hud.pytorch.org/pytorch/pytorch/commit/0a38a6ac8046e4d3f9cfaba86b7ec6517038646f https://github.com/pytorch/pytorch/actions/runs/8118465367/job/22193590228" -c nosignal |
|
@pytorchbot successfully started a revert job. Check the current status here. |
…120925)" This reverts commit 0a38a6a. Reverted #120925 on behalf of https://github.com/clee2000 due to broke inductor models and caused accuracy regression on nightly dashboard https://hud.pytorch.org/pytorch/pytorch/commit/0a38a6ac8046e4d3f9cfaba86b7ec6517038646f https://github.com/pytorch/pytorch/actions/runs/8118465367/job/22193590228 ([comment](#120925 (comment)))
|
@Aidyn-A your PR has been successfully reverted. |
|
Aidyn-A A workflow that sets: |
|
@mortzur Could you describe what exactly is failing? This PR changes the cublasLt workspace and does not touch cublas workspace handling. |
|
ptrblck you're right, I can set CUBLASLT_WORKSPACE_SIZE to fix the workspace size. no issue here. |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
8287fa8 to
e55fa56
Compare
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
e55fa56 to
b39bc70
Compare
…es (#145130) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925 + fixes behavior broken behavior with the memtracker; #139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: #145130 Approved by: https://github.com/ngimel
…es (#145130) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925 + fixes behavior broken behavior with the memtracker; #139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: #145130 Approved by: https://github.com/ngimel
…es (#145130) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925 + fixes behavior broken behavior with the memtracker; #139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: #145130 Approved by: https://github.com/ngimel
…es (#145130) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925 + fixes behavior broken behavior with the memtracker; #139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: #145130 Approved by: https://github.com/ngimel
…es (pytorch#145130) As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels. This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits: + caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`) + "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also pytorch#120925 + fixes behavior broken behavior with the memtracker; pytorch#139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it + one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider Pull Request resolved: pytorch#145130 Approved by: https://github.com/ngimel
According to the cuBLAS API Reference the recommended workspace size for Hopper is 32 MiB and for the rest architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, that is why I am keeping it unchanged.
cc @csarofeen @ptrblck @xwang233