Make `Device.set_current()` faster #781

leofang · 2025-07-25T03:33:05Z

Description

closes #739

Before this PR:

In [4]: %timeit dev.set_current()
1.4 μs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With this PR:

In [4]: %timeit dev.set_current()
374 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

We want to minimize the number of calls to CUDA APIs, each of which adds several hundred nanoseconds of overhead. This is achieved by combining two changes:

1. We lazily retain a reference to the primary context of each device.
   - The only scenario where this could fail is if someone calls cudaDeviceReset() somewhere, in which case the pointers to the primary contexts would be invalidated and become dangling. However, the CUDA team strongly discourages using this API (it does not solve any real issue, e.g. recovering from sticky errors), and it is impossible to get things right if it is called, especially in Python land, where too many CUDA-related objects are floating around.
2. We unconditionally set the needed primary context to current, without extra checks.
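
To illustrate the two changes together, here is a toy model in Python. It is only a sketch, not the actual cuda.core implementation: `FakeDriver`, its methods, and this `Device` class are hypothetical stand-ins that count calls instead of talking to a GPU. The two fake methods are meant to mimic the driver calls `cuDevicePrimaryCtxRetain` and `cuCtxSetCurrent`.

```python
class FakeDriver:
    """Hypothetical stand-in for the CUDA driver; it only counts calls."""

    def __init__(self):
        self.calls = 0
        self.current = None

    def primary_ctx_retain(self, device_id):
        # Mimics cuDevicePrimaryCtxRetain: hand back a context handle.
        self.calls += 1
        return ("primary_ctx", device_id)

    def ctx_set_current(self, ctx):
        # Mimics cuCtxSetCurrent: make ctx current on the calling thread.
        self.calls += 1
        self.current = ctx


class Device:
    """Hypothetical device wrapper, assumed for illustration only."""

    def __init__(self, driver, device_id):
        self._driver = driver
        self._id = device_id
        self._primary_ctx = None  # retained lazily on first use

    def set_current(self):
        # Change 1: retain the primary context once and cache the handle.
        if self._primary_ctx is None:
            self._primary_ctx = self._driver.primary_ctx_retain(self._id)
        # Change 2: set it current unconditionally, without querying
        # the current context first.
        self._driver.ctx_set_current(self._primary_ctx)


driver = FakeDriver()
dev = Device(driver, 0)

dev.set_current()
calls_first = driver.calls                  # retain + set

dev.set_current()
calls_second = driver.calls - calls_first   # set only
```

In this model the first call costs two driver calls and every later call costs one, which is the effect the PR is after. The cached handle is also what would dangle if the context were destroyed behind the wrapper's back, as in the cudaDeviceReset() caveat above.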

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-07-25T03:33:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang · 2025-07-25T04:09:21Z

/ok to test 92d24ce

leofang · 2025-08-06T17:06:54Z

/ok to test b9b3c87

github-actions · 2025-08-06T17:46:57Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

leofang added 2 commits July 23, 2025 17:54

cache primary context

4840061

avoid increasing stack size

c4afd33

github-project-automation bot added this to CCCL Jul 25, 2025

github-project-automation bot moved this to Todo in CCCL Jul 25, 2025

leofang self-assigned this Jul 25, 2025

unconditionally set primary context to current

562340c

leofang added the enhancement (Any code-related improvements), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) labels Jul 25, 2025

leofang added this to the cuda.core beta 6 milestone Jul 25, 2025

Merge branch 'main' into faster_set_current

92d24ce

leofang requested a review from shwina July 25, 2025 04:09

leofang marked this pull request as ready for review July 25, 2025 04:09

shwina approved these changes Aug 6, 2025

github-project-automation bot moved this from Todo to In Review in CCCL Aug 6, 2025

Merge branch 'main' into faster_set_current

b9b3c87

leofang enabled auto-merge (squash) August 6, 2025 17:07

leofang merged commit fec95b8 into NVIDIA:main Aug 6, 2025
49 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Aug 6, 2025
