
leofang
Member

@leofang leofang commented Jul 25, 2025

Description

closes #739

Before this PR:

In [4]: %timeit dev.set_current()
1.4 μs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With this PR:

In [4]: %timeit dev.set_current()
374 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We want to minimize the number of CUDA API calls, each of which adds several hundred nanoseconds of overhead. This is achieved by combining two changes:

  1. We lazily retain a reference to the primary context of each device
    • The only scenario where this could fail is if someone calls cudaDeviceReset(), which would invalidate the pointers to the primary contexts and leave them dangling. However, the CUDA team strongly discourages using this API (it does not solve any real problem, such as recovering from sticky errors), and once it is called it is impossible to get things right, especially in Python land, where too many CUDA-related objects are floating around.
  2. We unconditionally set the needed primary context to current without extra checks
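
The two changes above can be sketched without a GPU. This is a rough illustration, not the code in this PR: `FakeDriver` is a hypothetical stand-in whose methods are loosely modeled on the driver calls cuDevicePrimaryCtxRetain, cuCtxGetCurrent, and cuCtxSetCurrent, and `set_current_checked` is only an assumed example of what a path with extra checks could look like. The point is the call count per invocation.

```python
class FakeDriver:
    """Hypothetical stand-in for the CUDA driver; counts calls, touches no GPU."""

    def __init__(self):
        self.calls = 0
        self.current = None

    def primary_ctx_retain(self, device_id):
        self.calls += 1
        return ("primary_ctx", device_id)

    def ctx_get_current(self):
        self.calls += 1
        return self.current

    def ctx_set_current(self, ctx):
        self.calls += 1
        self.current = ctx


class Device:
    """Hypothetical device wrapper (not the cuda.core class)."""

    def __init__(self, driver, device_id):
        self._driver = driver
        self._id = device_id
        self._primary_ctx = None  # retained lazily on first use

    def set_current_checked(self):
        # Assumed "checked" path: retain and query on every call.
        ctx = self._driver.primary_ctx_retain(self._id)
        if self._driver.ctx_get_current() != ctx:
            self._driver.ctx_set_current(ctx)

    def set_current(self):
        # Change 1: retain the primary context once and cache it.
        if self._primary_ctx is None:
            self._primary_ctx = self._driver.primary_ctx_retain(self._id)
        # Change 2: set it current unconditionally, with no extra queries.
        self._driver.ctx_set_current(self._primary_ctx)


def calls_for(method_name, repeats=3):
    """Return the number of fake driver calls made by each of `repeats` invocations."""
    driver = FakeDriver()
    device = Device(driver, 0)
    counts = []
    for _ in range(repeats):
        before = driver.calls
        getattr(device, method_name)()
        counts.append(driver.calls - before)
    return counts
```

In this sketch, the cached path settles at one driver call per invocation after the first, while the checked path keeps making two or more. A cached handle like `_primary_ctx` is exactly what would dangle if the device were reset, which is the failure mode noted above.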

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Contributor

copy-pr-bot bot commented Jul 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang self-assigned this Jul 25, 2025
@leofang leofang added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Jul 25, 2025
@leofang leofang added this to the cuda.core beta 6 milestone Jul 25, 2025
@leofang leofang requested a review from shwina July 25, 2025 04:09
@leofang leofang marked this pull request as ready for review July 25, 2025 04:09
Member Author

leofang commented Jul 25, 2025

/ok to test 92d24ce

@github-project-automation github-project-automation bot moved this from Todo to In Review in CCCL Aug 6, 2025
Member Author

leofang commented Aug 6, 2025

/ok to test b9b3c87

@leofang leofang enabled auto-merge (squash) August 6, 2025 17:07
@leofang leofang merged commit fec95b8 into NVIDIA:main Aug 6, 2025
49 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Aug 6, 2025
github-actions bot commented Aug 6, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

Development

Successfully merging this pull request may close these issues.

Device.set_current() is slow
2 participants