Conversation

@Andy-Jost
Contributor

Description

Implements GraphMemoryResource, a memory resource backed by the graph memory allocator. Allocations from this object succeed only while graph capturing is active. Conversely, allocations from DeviceMemoryResource now raise an exception when graph capturing is active.

A new test module is added.

This change also simplifies and extends the logic for accepting arbitrary stream parameters as objects implementing __cuda_stream__. Support for that protocol was added in several places, allowing GraphBuilder to be used anywhere a stream is expected, including memory resource and buffer methods.
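
A rough usage sketch (the import path and constructor calls below are assumptions for illustration; passing the GraphBuilder as the stream argument is what this change enables):

# Names follow this PR; exact constructor signatures are assumptions.
from cuda.core.experimental import Device, DeviceMemoryResource, GraphMemoryResource

device = Device()
device.set_current()

gmr = GraphMemoryResource(device)    # assumed constructor
dmr = DeviceMemoryResource(device)   # assumed constructor

gb = device.create_graph_builder().begin_building()
buf = gmr.allocate(1024, stream=gb)  # succeeds: graph capture is active on gb
# dmr.allocate(1024, stream=gb)      # would raise: DeviceMemoryResource rejects capturing streams
buf.close(stream=gb)
graph = gb.end_building().complete()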

closes #963

@Andy-Jost Andy-Jost added this to the cuda.core beta 9 milestone Nov 12, 2025
@Andy-Jost Andy-Jost self-assigned this Nov 12, 2025
@Andy-Jost Andy-Jost added the enhancement (Any code-related improvements), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) labels Nov 12, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


from cuda.bindings cimport cydriver
from cuda.core.experimental._memory._buffer cimport MemoryResource
from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
Contributor Author

I'll remove the cruft here.

if _settable:
    def fset(GraphMemoryResourceAttributes self, uint64_t value):
        if value != 0:
            raise AttributeError(f"Attribute {stub.__name__!r} may only be set to zero (got {value}).")
Contributor Author

The driver checks for this condition in cuDeviceSetGraphMemAttribute and issues a log message: "High watermark can only be reset to 0"

It's a shame we cannot access that message programmatically for use in the Python error.

Contributor Author

Good news: CUDA 13 adds functions for error log management. It looks like cuLogsRegisterCallback might help here.

raise ValueError(
    f"stream must either be a Stream object or support __cuda_stream__ (got {type(stream)})"
) from None
stream = Stream._init(stream)
Contributor Author

The canonical way to invoke the __cuda_stream__ protocol is now to call Stream._init, which either succeeds in creating a Stream object or raises an exception.
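
For illustration, a minimal object that satisfies the protocol (this wrapper class is hypothetical; the (version, handle) tuple shape is my reading of __cuda_stream__):

class StreamWrapper:
    """Any object exposing __cuda_stream__ can be passed wherever a stream is expected."""

    def __init__(self, handle: int):
        self._handle = handle

    def __cuda_stream__(self):
        # Protocol: return (protocol version, native CUstream address).
        return (0, self._handle)

# Stream._init either wraps the foreign handle in a Stream or raises the ValueError
# shown above; callers no longer need their own isinstance/hasattr checks.
s = Stream._init(StreamWrapper(raw_handle))  # raw_handle: a CUstream address obtained elsewhere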

Comment on lines +247 to +254
gb = device.create_graph_builder().begin_building(mode=mode)
with pytest.raises(
    RuntimeError,
    match=r"DeviceMemoryResource cannot perform memory operations on a capturing "
    r"stream \(consider using GraphMemoryResource\)\.",
):
    dmr.allocate(1, stream=gb)
gb.end_building().complete()
Contributor Author

This section illustrates a drawback of not using with contexts. Ignore the fact that the error is caught here (that's just for testing). If an exception is thrown during graph capture, control can easily escape without making a call to gb.end_building. That leaves the surrounding code in an unexpected state (capturing on).

Member

General policy aside, I don't think a with context makes sense for this particular case. Once an error is raised during capturing (see the crafted example below), we cannot successfully end the capture.

>>> from cuda.bindings import runtime
>>> runtime.cudaSetDevice(0)
(<cudaError_t.cudaSuccess: 0>,)
>>> _, s = runtime.cudaStreamCreate()
>>> s
<CUstream 0x559fca4376f0>
>>> # capturing in the global mode to make a silly but concise example 
>>> runtime.cudaStreamBeginCapture(s, runtime.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
(<cudaError_t.cudaSuccess: 0>,)
>>> runtime.cudaMalloc(1024)  # not allowed in the global mode
(<cudaError_t.cudaErrorStreamCaptureUnsupported: 900>, None)
>>> runtime.cudaStreamEndCapture(s)  # cannot end capture
(<cudaError_t.cudaErrorStreamCaptureInvalidated: 901>, None)

Member

In other words, once an error is raised, the driver switches from capturing on to capturing off, so this statement

That leaves the surrounding code in an unexpected state (capturing on).

does not hold.

Contributor Author

@Andy-Jost Andy-Jost Nov 20, 2025

The error could be any Python error, not necessarily a CUDA driver error. What if you replaced runtime.cudaMalloc with something like len(1)? Then the stream capture would never be turned off.

It seems to me that graphs ought to have a context manager and the exit should try to end capture and silently tolerate CUDA_ERROR_STREAM_CAPTURE_INVALIDATED.

Example:

gb = device.create_graph_builder().begin_building()
len(1) # raises, as an example
gb.end_building().complete()

versus

with device.create_graph_builder().begin_building() as gb:
    len(1)

Only the second one guarantees that capture is ended.
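
For reference, the context manager I'm imagining is small (a sketch only, not part of this PR; swallowing the end-of-capture error is the behavior under discussion):

from contextlib import contextmanager

@contextmanager
def building(graph_builder):
    gb = graph_builder.begin_building()
    try:
        yield gb
    finally:
        try:
            gb.end_building()
        except Exception:
            # If an earlier error already invalidated the capture
            # (CUDA_ERROR_STREAM_CAPTURE_INVALIDATED), there is nothing left to end.
            pass

# with building(device.create_graph_builder()) as gb:
#     ...  # capture is always terminated on exit, even if this body raises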

Member

No, it still doesn't matter: when any error happens (CUDA or Python), a Python exception bubbles up and gb and its internal stream get deallocated nicely.

This with example can be re-written as

gb = device.create_graph_builder().begin_building()
try:
    ...  # do work with gb
except Exception:
    try:
        # I do not believe this is needed given the earlier cudart example. This is
        # only to illustrate that the with statement can be expressed equivalently.
        #
        # Regardless of whether it is a Python or CUDA error, the gb is left in an
        # undefined state that we should not even attempt to build a graph from. Just
        # let it go out of scope and let the destructors kick in.
        gb.end_building().complete()
    except Exception:
        pass
else:
    ...  # gb is ready for use

so there is always a way 🙂 It is only syntactic sugar.

Forgive me for being repetitive. As a design guideline, there are no circumstances in which context managers are a must-have in the design. They can be easily implemented on top of methods like .begin() and .end(), but not the other way around, so we need to implement the latter in any case, meaning we can always add context managers later as syntactic sugar for them. The Python language and standard library have shown us a clear path. So let us punt on this for a little longer.

For the particular case of stream capturing, the stream can be destroyed just fine even before capturing ends, so the destructor will work as intended.

>>> from cuda.bindings import runtime
>>> runtime.cudaSetDevice(0)
(<cudaError_t.cudaSuccess: 0>,)
>>> _, s = runtime.cudaStreamCreate()
>>> runtime.cudaStreamBeginCapture(s, runtime.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
(<cudaError_t.cudaSuccess: 0>,)
>>> runtime.cudaStreamDestroy(s)  # destroy stream without ending the capturing state
(<cudaError_t.cudaSuccess: 0>,)

Contributor Author

I agree that we should punt: let's take this discussion offline and follow up at our leisure.

Let me add one observation, which is what motivated the original comment. If I insert a Python error right before line 254, all subsequent tests fail with ValueError: device_id must be within [0, 0), got 0. Here's a sampling:

test_graph_mem.py .............FEE                                           [ 64%]
test_hashable.py EEEEEE                                                      [ 88%]
test_helpers.py FFF                                                          [100%]

This is apparently due to GraphBuilder having no __del__ method.

Member

This is apparently due to GraphBuilder having no __del__ method.

Yeah we should fix this asap. Hopefully once we have your proposal implemented this class of errors will go away systematically 🙂

Before that happens, I wonder if there is a way we can enforce the linter to check if __del__ is missing 🤔
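
(A pytest guard might be simpler than a linter rule; a rough, untested sketch where the class-name filter is only a guess:)

import inspect
import cuda.core.experimental as exp

def test_resource_classes_define_del():
    # Any class that owns a driver resource should define __del__ (or a close()).
    for name, cls in inspect.getmembers(exp, inspect.isclass):
        if name.endswith(("Resource", "Builder", "Stream", "Event")):
            assert "__del__" in dir(cls) or hasattr(cls, "close"), (
                f"{name} defines neither __del__ nor close()"
            )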

@Andy-Jost
Contributor Author

/ok to test 13e3dfb


return self._ipc_data._alloc_handle

def allocate(self, size_t size, stream: Stream = None) -> Buffer:
def allocate(self, size_t size, stream: Optional[IsStreamT] = None) -> Buffer:
Collaborator

Does this allocate member function need to take an alignment parameter?

Member

Great question, @rparolin why do you ask?

The answer is not yet: #634.

is used.
"""
DMR_deallocate(self, <uintptr_t>ptr, size, <Stream>stream)
stream = Stream._init(stream) if stream is not None else default_stream()
Collaborator

@rparolin rparolin Nov 13, 2025

Is there a way to catch the user error of neglecting to pass the stream that the memory allocation came from? I'm thinking of a debug assert that verifies the allocated memory address came from the provided stream. Another potential option: if the user neglects to provide a stream, we do a lookup to determine where the address came from. Not sure if that is possible given the current lower-level API.

Contributor Author

Since the stream only defines an ordering, it is legal to allocate on one stream and deallocate on another.

That said, I think the overall memory management scheme could be clarified and potentially improved.
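
For example (a sketch; the stream-ordering call follows cuda.core conventions as I understand them):

s1 = device.create_stream()
s2 = device.create_stream()

buf = dmr.allocate(1024, stream=s1)   # allocation ordered on s1
# ... launch kernels using buf on s1 ...

s2.wait(s1)                           # order s2 after the work already enqueued on s1
buf.close(stream=s2)                  # deallocating on a different stream is legal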

@rparolin
Collaborator

Comment: I'd recommend chunking large PRs into smaller, reviewable pieces in the future. IMHO, it makes them more approachable and consumable for reviewers.

@Andy-Jost
Contributor Author

/ok to test 7c3ad5f

@Andy-Jost
Contributor Author

/ok to test af22c81

@Andy-Jost
Contributor Author

/ok to test 0f2057a

@Andy-Jost Andy-Jost force-pushed the dmr-graph-support-2 branch 2 times, most recently from 21ac32f to 78c4261 Compare November 21, 2025 20:47
@Andy-Jost
Contributor Author

/ok to test 78c4261

@Andy-Jost
Contributor Author

/ok to test a9188c2

with nogil:
    HANDLE_RETURN(cydriver.cuMemFreeAsync(devptr, s))
    r = cydriver.cuMemFreeAsync(devptr, s)
    if r != cydriver.CUDA_ERROR_INVALID_CONTEXT:
Member

Q: mempools are not tied to a CUDA context, so when will this happen?

Contributor Author

This is a reactive fix as I was seeing many of these errors at the end of a full pytest run. I think this issue exists prior to this PR, so I could separate this if you prefer.

Member

@Andy-Jost Let's save this to a separate PR. I think this one is clean, and I'd like to merge it asap (after I read it one more time).

Member

@leofang leofang left a comment

LGTM with a few (quick, I think!) comments!

…orm-dependent errors. Add dependence on mempool_device where needed for certain tests. Touch-ups.
@Andy-Jost
Contributor Author

/ok to test 556c6bf

@Andy-Jost Andy-Jost enabled auto-merge (squash) November 22, 2025 00:02
@Andy-Jost
Contributor Author

/ok to test 1d07da1

@Andy-Jost Andy-Jost requested a review from leofang November 24, 2025 16:54
Member

@leofang leofang left a comment

LGTM! Thanks, Andy!

__all__ = ['GraphMemoryResource']


cdef class GraphMemoryResourceAttributes:
Member

@pciolkosz this is yet another reason why CUDA graphs should support mempools... We have to use another class to wrap the attribute access, instead of reusing DeviceMemoryResourceAttributes, just because from the driver perspective they are different... 😢

@Andy-Jost apart from making the code base dirty, do you think the two attribute classes can be merged into one, dispatching to the right driver APIs internally?
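
Something along these lines is what I have in mind (only a sketch of the dispatch; the class name and the single attribute shown are illustrative, not a concrete proposal):

from cuda.bindings import driver

class MergedMemAttributes:
    """One attributes class that dispatches to either the mempool or the
    graph-mem driver APIs, instead of two near-identical wrappers."""

    def __init__(self, device_id, pool_handle, *, graph_scoped: bool):
        self._device_id = device_id
        self._pool = pool_handle
        self._graph_scoped = graph_scoped

    @property
    def used_mem_current(self):
        if self._graph_scoped:
            err, val = driver.cuDeviceGetGraphMemAttribute(
                self._device_id,
                driver.CUgraphMem_attribute.CU_GRAPH_MEM_ATTR_USED_MEM_CURRENT,
            )
        else:
            err, val = driver.cuMemPoolGetAttribute(
                self._pool,
                driver.CUmemPool_attribute.CU_MEMPOOL_ATTR_USED_MEM_CURRENT,
            )
        return val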

Comment on lines +149 to +150
with pytest.warns(DeprecationWarning, match=msg):
    launch(StreamWrapper(stream), config, ker)
Member

I missed this test! Thanks! 🙏

Comment on lines +288 to +293
device = Device()

if not device.properties.memory_pools_supported:
    pytest.skip("Device does not support mempool operations")

device.set_current()
Member

Q: Any reason we don't wanna move this logic (check attribute first, and then either skip or set current) in the mempool_device fixture, and then use the fixture here?
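
i.e. roughly (a sketch of what I mean; the fixture body just mirrors the lines quoted above):

import pytest
from cuda.core.experimental import Device

@pytest.fixture
def mempool_device():
    device = Device()
    if not device.properties.memory_pools_supported:
        pytest.skip("Device does not support mempool operations")
    device.set_current()
    return device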

if not device.properties.memory_pools_supported:
    pytest.skip("Device does not support mempool operations")

device.set_current()
Member

ditto

@Andy-Jost Andy-Jost merged commit b9c76b3 into NVIDIA:main Nov 24, 2025
118 of 119 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@Andy-Jost Andy-Jost deleted the dmr-graph-support-2 branch November 24, 2025 19:50

Labels

cuda.core (Everything related to the cuda.core module)
enhancement (Any code-related improvements)
P0 (High priority - Must do!)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CUDA graph phase 2 - memory nodes

3 participants