Conversation

@Andy-Jost
Contributor

Description

Implements GraphMemoryResource, a memory resource backed by the graph memory allocator. Allocations from this object succeed only while graph capturing is active. Conversely, allocations from DeviceMemoryResource now raise an exception when graph capturing is active.

A new test module is added.

This change also simplifies and extends the logic for accepting arbitrary stream parameters as objects implementing __cuda_stream__. Support for that protocol was added in several places, allowing GraphBuilder to be used anywhere a stream is expected, including memory resource and buffer methods.
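
A rough usage sketch (the import path and constructor calls below are assumptions for illustration; passing the GraphBuilder as the stream argument is what this change enables):

# Names follow this PR; exact constructor signatures are assumptions.
from cuda.core.experimental import Device, DeviceMemoryResource, GraphMemoryResource

device = Device()
device.set_current()

gmr = GraphMemoryResource(device)    # assumed constructor
dmr = DeviceMemoryResource(device)   # assumed constructor

gb = device.create_graph_builder().begin_building()
buf = gmr.allocate(1024, stream=gb)  # succeeds: graph capture is active on gb
# dmr.allocate(1024, stream=gb)      # would raise: DeviceMemoryResource rejects capturing streams
buf.close(stream=gb)
graph = gb.end_building().complete()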

closes #963

@Andy-Jost Andy-Jost added this to the cuda.core beta 9 milestone Nov 12, 2025
@Andy-Jost Andy-Jost self-assigned this Nov 12, 2025
@Andy-Jost Andy-Jost added the enhancement (Any code-related improvements), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) labels Nov 12, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Nov 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


from cuda.bindings cimport cydriver
from cuda.core.experimental._memory._buffer cimport MemoryResource
from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
Contributor Author

I'll remove the cruft here.

if _settable:
    def fset(GraphMemoryResourceAttributes self, uint64_t value):
        if value != 0:
            raise AttributeError(f"Attribute {stub.__name__!r} may only be set to zero (got {value}).")
Contributor Author

The driver checks for this condition in cuDeviceSetGraphMemAttribute and issues a log message: "High watermark can only be reset to 0"

It's a shame we cannot access that message programmatically for use in the Python error.

Contributor Author

Good news: CUDA 13 adds functions for error log management. It looks like cuLogsRegisterCallback might help here.

raise ValueError(
    f"stream must either be a Stream object or support __cuda_stream__ (got {type(stream)})"
) from None
stream = Stream._init(stream)
Contributor Author

The canonical way to invoke the __cuda_stream__ protocol is now to call Stream._init, which either succeeds in creating a Stream object or raises an exception.
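
For illustration, a minimal object that satisfies the protocol (this wrapper class is hypothetical; the (version, handle) tuple shape is my reading of __cuda_stream__):

class StreamWrapper:
    """Any object exposing __cuda_stream__ can be passed wherever a stream is expected."""

    def __init__(self, handle: int):
        self._handle = handle

    def __cuda_stream__(self):
        # Protocol: return (protocol version, native CUstream address).
        return (0, self._handle)

# Stream._init either wraps the foreign handle in a Stream or raises the ValueError
# shown above; callers no longer need their own isinstance/hasattr checks.
s = Stream._init(StreamWrapper(raw_handle))  # raw_handle: a CUstream address obtained elsewhere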

Comment on lines +247 to +254
gb = device.create_graph_builder().begin_building(mode=mode)
with pytest.raises(
    RuntimeError,
    match=r"DeviceMemoryResource cannot perform memory operations on a capturing "
    r"stream \(consider using GraphMemoryResource\)\.",
):
    dmr.allocate(1, stream=gb)
gb.end_building().complete()
Contributor Author

This section illustrates a drawback of not using with contexts. Ignore the fact that the error is caught here (that's just for testing). If an exception is thrown during graph capture, control can easily escape without making a call to gb.end_building. That leaves the surrounding code in an unexpected state (capturing on).

Member

General policy aside, I don't think a with context makes sense for this particular case. Once an error is raised during capturing (see the crafted example below), we cannot successfully end the capture.

>>> from cuda.bindings import runtime
>>> runtime.cudaSetDevice(0)
(<cudaError_t.cudaSuccess: 0>,)
>>> _, s = runtime.cudaStreamCreate()
>>> s
<CUstream 0x559fca4376f0>
>>> # capturing in the global mode to make a silly but concise example 
>>> runtime.cudaStreamBeginCapture(s, runtime.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
(<cudaError_t.cudaSuccess: 0>,)
>>> runtime.cudaMalloc(1024)  # not allowed in the global mode
(<cudaError_t.cudaErrorStreamCaptureUnsupported: 900>, None)
>>> runtime.cudaStreamEndCapture(s)  # cannot end capture
(<cudaError_t.cudaErrorStreamCaptureInvalidated: 901>, None)

Member

In other words, once an error is raised, the driver switches from capturing on to capturing off, so this statement

That leaves the surrounding code in an unexpected state (capturing on).

does not hold.

Contributor Author

@Andy-Jost Andy-Jost Nov 20, 2025

The error could be any Python error, not necessarily a CUDA driver error. What if you replaced runtime.cudaMalloc with something like len(1)? Then the stream capture would never be turned off.

It seems to me that graphs ought to have a context manager and the exit should try to end capture and silently tolerate CUDA_ERROR_STREAM_CAPTURE_INVALIDATED.

Example:

gb = device.create_graph_builder().begin_building()
len(1) # raises, as an example
gb.end_building().complete()

versus

with device.create_graph_builder().begin_building() as gb:
    len(1)

Only the second one guarantees that capture is ended.
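
For reference, the context manager I'm imagining is small (a sketch only, not part of this PR; swallowing the end-of-capture error is the behavior under discussion):

from contextlib import contextmanager

@contextmanager
def building(graph_builder):
    gb = graph_builder.begin_building()
    try:
        yield gb
    finally:
        try:
            gb.end_building()
        except Exception:
            # If an earlier error already invalidated the capture
            # (CUDA_ERROR_STREAM_CAPTURE_INVALIDATED), there is nothing left to end.
            pass

# with building(device.create_graph_builder()) as gb:
#     ...  # capture is always terminated on exit, even if this body raises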

Member

No, it still doesn't matter: when any error happens (CUDA or Python), a Python exception bubbles up and gb and its internal stream get deallocated nicely.

This with example can be re-written as

gb = device.create_graph_builder().begin_building()
try:
    ...  # do work with gb
except Exception:
    try:
        # I do not believe this is needed given the earlier cudart example. This is
        # only to illustrate that the with statement can be expressed equivalently.
        #
        # Regardless of whether it is a Python or CUDA error, the gb is left in an
        # undefined state that we should not even attempt to build a graph from. Just
        # let it go out of scope and let the destructors kick in.
        gb.end_building().complete()
    except Exception:
        pass
else:
    ...  # gb is ready for use

so there is always a way 🙂 It is only syntactic sugar.

Forgive me for being repetitive. As a design guideline, there are no circumstances in which context managers are a must-have in the design. They can be easily implemented on top of methods like .begin() and .end(), but not the other way around, so we need to implement the latter in any case, meaning we can always add context managers later as syntactic sugar for them. The Python language and standard library have shown us a clear path. So let us punt on this for a little longer.

For the particular case of stream capturing, the stream can be destroyed just fine even before capturing ends, so the destructor will work as intended.

>>> from cuda.bindings import runtime
>>> runtime.cudaSetDevice(0)
(<cudaError_t.cudaSuccess: 0>,)
>>> _, s = runtime.cudaStreamCreate()
>>> runtime.cudaStreamBeginCapture(s, runtime.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
(<cudaError_t.cudaSuccess: 0>,)
>>> runtime.cudaStreamDestroy(s)  # destroy stream without ending the capturing state
(<cudaError_t.cudaSuccess: 0>,)

Contributor Author

I agree that we should punt: let's take this discussion offline and follow up at our leisure.

Let me add one observation, which is what motivated the original comment. If I insert a Python error right before line 254, all subsequent tests fail with ValueError: device_id must be within [0, 0), got 0. Here's a sampling:

test_graph_mem.py .............FEE                                           [ 64%]
test_hashable.py EEEEEE                                                      [ 88%]
test_helpers.py FFF                                                          [100%]

This is apparently due to GraphBuilder having no __del__ method.

Member

This is apparently due to GraphBuilder having no __del__ method.

Yeah we should fix this asap. Hopefully once we have your proposal implemented this class of errors will go away systematically 🙂

Before that happens, I wonder if there is a way we can enforce the linter to check if __del__ is missing 🤔
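
(A pytest guard might be simpler than a linter rule; a rough, untested sketch where the class-name filter is only a guess:)

import inspect
import cuda.core.experimental as exp

def test_resource_classes_define_del():
    # Any class that owns a driver resource should define __del__ (or a close()).
    for name, cls in inspect.getmembers(exp, inspect.isclass):
        if name.endswith(("Resource", "Builder", "Stream", "Event")):
            assert "__del__" in dir(cls) or hasattr(cls, "close"), (
                f"{name} defines neither __del__ nor close()"
            )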

@Andy-Jost
Contributor Author

/ok to test 13e3dfb


return self._ipc_data._alloc_handle

def allocate(self, size_t size, stream: Stream = None) -> Buffer:
def allocate(self, size_t size, stream: Optional[IsStreamT] = None) -> Buffer:
Collaborator

Does this allocate member function need to take an alignment parameter?

Member

Great question, @rparolin why do you ask?

The answer is not yet: #634.

is used.
"""
DMR_deallocate(self, <uintptr_t>ptr, size, <Stream>stream)
stream = Stream._init(stream) if stream is not None else default_stream()
Collaborator

@rparolin rparolin Nov 13, 2025

Is there a way to catch the user error of neglecting to pass the stream that the memory allocation came from? I'm thinking of a debug assert that verifies the allocated memory address came from the provided stream. Another potential option: if the user neglects to provide a stream, we do a lookup to determine where the address came from. Not sure if that is possible given the current lower-level API.

Contributor Author

Since the stream only defines an ordering, it is legal to allocate on one stream and deallocate on another.

That said, I think the overall memory management scheme could be clarified and potentially improved.
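
For example (a sketch; the stream-ordering call follows cuda.core conventions as I understand them):

s1 = device.create_stream()
s2 = device.create_stream()

buf = dmr.allocate(1024, stream=s1)   # allocation ordered on s1
# ... launch kernels using buf on s1 ...

s2.wait(s1)                           # order s2 after the work already enqueued on s1
buf.close(stream=s2)                  # deallocating on a different stream is legal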

@rparolin
Collaborator

Comment: I'd recommend chunking large PRs into smaller, reviewable pieces in the future. IMHO, it makes them more approachable and consumable for reviewers.

@Andy-Jost
Contributor Author

/ok to test 7c3ad5f

@Andy-Jost
Contributor Author

/ok to test af22c81

@Andy-Jost
Contributor Author

/ok to test 0f2057a

@Andy-Jost Andy-Jost force-pushed the dmr-graph-support-2 branch 2 times, most recently from 21ac32f to 78c4261 Compare November 21, 2025 20:47
@Andy-Jost
Contributor Author

/ok to test 78c4261

@Andy-Jost
Contributor Author

/ok to test a9188c2

with nogil:
    HANDLE_RETURN(cydriver.cuMemFreeAsync(devptr, s))
    r = cydriver.cuMemFreeAsync(devptr, s)
    if r != cydriver.CUDA_ERROR_INVALID_CONTEXT:
Member

Q: mempools are not tied to a CUDA context, so when will this happen?

Contributor Author

This is a reactive fix as I was seeing many of these errors at the end of a full pytest run. I think this issue exists prior to this PR, so I could separate this if you prefer.

Member

@Andy-Jost Let's save this to a separate PR. I think this one is clean, and I'd like to merge it asap (after I read it one more time).

Member

@leofang leofang left a comment

LGTM with a few (quick, I think!) comments!

…orm-dependent errors. Add dependence on mempool_device where needed for certain tests. Touch-ups.
@Andy-Jost
Contributor Author

/ok to test 556c6bf

@Andy-Jost Andy-Jost enabled auto-merge (squash) November 22, 2025 00:02
@Andy-Jost
Contributor Author

/ok to test 1d07da1

@Andy-Jost Andy-Jost requested a review from leofang November 24, 2025 16:54
Member

@leofang leofang left a comment

LGTM! Thanks, Andy!

__all__ = ['GraphMemoryResource']


cdef class GraphMemoryResourceAttributes:
Member

@pciolkosz this is yet another reason why CUDA graphs should support mempools... We have to use another class to wrap the attribute access, instead of reusing DeviceMemoryResourceAttributes, just because from the driver perspective they are different... 😢

@Andy-Jost apart from making the code base dirty, do you think the two attribute classes can be merged into one, dispatching to the right driver APIs internally?
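
Something along these lines is what I have in mind (only a sketch of the dispatch; the class name and the single attribute shown are illustrative, not a concrete proposal):

from cuda.bindings import driver

class MergedMemAttributes:
    """One attributes class that dispatches to either the mempool or the
    graph-mem driver APIs, instead of two near-identical wrappers."""

    def __init__(self, device_id, pool_handle, *, graph_scoped: bool):
        self._device_id = device_id
        self._pool = pool_handle
        self._graph_scoped = graph_scoped

    @property
    def used_mem_current(self):
        if self._graph_scoped:
            err, val = driver.cuDeviceGetGraphMemAttribute(
                self._device_id,
                driver.CUgraphMem_attribute.CU_GRAPH_MEM_ATTR_USED_MEM_CURRENT,
            )
        else:
            err, val = driver.cuMemPoolGetAttribute(
                self._pool,
                driver.CUmemPool_attribute.CU_MEMPOOL_ATTR_USED_MEM_CURRENT,
            )
        return val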

Comment on lines +149 to +150
with pytest.warns(DeprecationWarning, match=msg):
    launch(StreamWrapper(stream), config, ker)
Member

I missed this test! Thanks! 🙏

Comment on lines +288 to +293
device = Device()

if not device.properties.memory_pools_supported:
    pytest.skip("Device does not support mempool operations")

device.set_current()
Member

Q: Any reason we don't wanna move this logic (check attribute first, and then either skip or set current) in the mempool_device fixture, and then use the fixture here?
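
i.e. roughly (a sketch of what I mean; the fixture body just mirrors the lines quoted above):

import pytest
from cuda.core.experimental import Device

@pytest.fixture
def mempool_device():
    device = Device()
    if not device.properties.memory_pools_supported:
        pytest.skip("Device does not support mempool operations")
    device.set_current()
    return device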

if not device.properties.memory_pools_supported:
    pytest.skip("Device does not support mempool operations")

device.set_current()
Member

ditto

@Andy-Jost Andy-Jost merged commit b9c76b3 into NVIDIA:main Nov 24, 2025
118 of 119 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@Andy-Jost Andy-Jost deleted the dmr-graph-support-2 branch November 24, 2025 19:50

Labels

cuda.core (Everything related to the cuda.core module)
enhancement (Any code-related improvements)
P0 (High priority - Must do!)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CUDA graph phase 2 - memory nodes

3 participants