Implements GraphMemoryResource #1235
Conversation
…h capture state is not as expected.
…source methods to take any kind of stream-providing object. Update graph allocation tests.
…rapper (testing only)
from cuda.bindings cimport cydriver
from cuda.core.experimental._memory._buffer cimport MemoryResource
from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
I'll remove the cruft here.
if _settable:
    def fset(GraphMemoryResourceAttributes self, uint64_t value):
        if value != 0:
            raise AttributeError(f"Attribute {stub.__name__!r} may only be set to zero (got {value}).")
The driver checks for this condition in cuDeviceSetGraphMemAttribute and issues a log message: "High watermark can only be reset to 0"
It's a shame we cannot access that message programmatically for use in the Python error.
Good news: CUDA 13 adds functions for error log management. It looks like cuLogsRegisterCallback might help here.
raise ValueError(
    f"stream must either be a Stream object or support __cuda_stream__ (got {type(stream)})"
) from None
stream = Stream._init(stream)
The canonical way to invoke the __cuda_stream__ protocol now is to call Stream._init. It will either succeed in creating a Stream object or raise an exception.
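For concreteness, a minimal sketch of a third-party object opting into that protocol (the (version, address) return shape and the cuda.core names Device, create_stream, memory_resource.allocate, and Buffer.close are my assumptions here; the wrapper class is hypothetical):

from cuda.core.experimental import Device

class ForeignStream:
    """Hypothetical wrapper around an existing CUDA stream handle."""
    def __init__(self, handle: int):
        self._handle = handle

    def __cuda_stream__(self):
        # Protocol: return (version, stream handle address).
        return (0, self._handle)

dev = Device()
dev.set_current()
native = dev.create_stream()
wrapped = ForeignStream(int(native.handle))
# Stream._init(wrapped) is what runs under the hood; any stream-taking API accepts it.
buf = dev.memory_resource.allocate(256, stream=wrapped)
buf.close(stream=native)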
gb = device.create_graph_builder().begin_building(mode=mode)
with pytest.raises(
    RuntimeError,
    match=r"DeviceMemoryResource cannot perform memory operations on a capturing "
    r"stream \(consider using GraphMemoryResource\)\.",
):
    dmr.allocate(1, stream=gb)
gb.end_building().complete()
This section illustrates a drawback of not using with contexts. Ignore the fact that the error is caught here (that's just for testing). If an exception is thrown during graph capture, control can easily escape without making a call to gb.end_building. That leaves the surrounding code in an unexpected state (capturing on).
General policy aside, I don't think a with context makes sense for this particular case. Once an error is raised during capturing (see the crafted example below), we cannot successfully end the capture.
>>> from cuda.bindings import runtime
>>> runtime.cudaSetDevice(0)
(<cudaError_t.cudaSuccess: 0>,)
>>> _, s = runtime.cudaStreamCreate()
>>> s
<CUstream 0x559fca4376f0>
>>> # capturing in the global mode to make a silly but concise example
>>> runtime.cudaStreamBeginCapture(s, runtime.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
(<cudaError_t.cudaSuccess: 0>,)
>>> runtime.cudaMalloc(1024) # not allowed in the global mode
(<cudaError_t.cudaErrorStreamCaptureUnsupported: 900>, None)
>>> runtime.cudaStreamEndCapture(s) # cannot end capture
(<cudaError_t.cudaErrorStreamCaptureInvalidated: 901>, None)
In other words, once an error is raised, the driver switches from capturing on to capturing off, so this statement

> That leaves the surrounding code in an unexpected state (capturing on).

does not hold.
The error could be any Python error, not necessarily a CUDA driver error. What if you replaced runtime.cudaMalloc with something like len(1)? Then the stream capture would never be turned off.
It seems to me that graphs ought to have a context manager and the exit should try to end capture and silently tolerate CUDA_ERROR_STREAM_CAPTURE_INVALIDATED.
Example:
gb = device.create_graph_builder().begin_building()
len(1) # raises, as an example
gb.end_building().complete()
versus
with device.create_graph_builder().begin_building() as gb:
len(1)
Only the second one guarantees that capture is ended.
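For concreteness, a rough sketch of such a helper (this is not the existing cuda.core API; the begin_building/end_building names are taken from the examples above, and a real implementation would check specifically for CUDA_ERROR_STREAM_CAPTURE_INVALIDATED rather than swallowing every error):

from contextlib import contextmanager

@contextmanager
def building(graph_builder):
    gb = graph_builder.begin_building()
    try:
        yield gb
    finally:
        try:
            gb.end_building()
        except Exception:
            # An earlier error invalidated the capture; there is nothing left to end.
            pass

with building(device.create_graph_builder()) as gb:
    len(1)  # raises, but capture is still ended (or abandoned) on exit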
No, it still doesn't matter: when any error happens (CUDA or Python), a Python exception bubbles up, and gb and its internal stream get deallocated nicely.
This with example can be rewritten as
gb = device.create_graph_builder().begin_building()
try:
    ...  # do work with gb
except Exception:
    try:
        # I do not believe this is needed given the earlier cudart example. This is
        # only to illustrate that the with statement has an equivalent way to express it.
        #
        # Regardless of whether it is a Python or CUDA error, gb is left in an undefined
        # state that we should not even attempt to build a graph from. Just let it go
        # out of scope and let destructors kick in.
        gb.end_building().complete()
    except Exception:
        pass
else:
    ...  # gb is ready for use

so there is always a way 🙂 It is only syntactic sugar.
Forgive me for being repetitive. As a design guideline, there are no circumstances in which context managers are a must-have. They can easily be implemented on top of methods like .begin() and .end(), but not the other way around, so we need to implement those methods in any case, which means we can always add context managers later as syntactic sugar. The Python language and standard library have shown us a clear path. So let us punt on this for a little longer.
For the particular case of stream capturing, the stream can be destroyed just fine even before capturing ends, so the destructor will work as intended.
>>> from cuda.bindings import runtime
>>> runtime.cudaSetDevice(0)
(<cudaError_t.cudaSuccess: 0>,)
>>> _, s = runtime.cudaStreamCreate()
>>> runtime.cudaStreamBeginCapture(s, runtime.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
(<cudaError_t.cudaSuccess: 0>,)
>>> runtime.cudaStreamDestroy(s) # destroy stream without ending the capturing state
(<cudaError_t.cudaSuccess: 0>,)
I agree that we should punt: let's take this discussion offline and follow up at our leisure.
Let me add one observation, which is what motivated the original comment. If I insert a Python error right before line 254, all subsequent tests fail with ValueError: device_id must be within [0, 0), got 0. Here's a sampling:
test_graph_mem.py .............FEE [ 64%]
test_hashable.py EEEEEE [ 88%]
test_helpers.py FFF [100%]
This is apparently due to GraphBuilder having no __del__ method.
> This is apparently due to GraphBuilder having no __del__ method.
Yeah we should fix this asap. Hopefully once we have your proposal implemented this class of errors will go away systematically 🙂
Before that happens, I wonder if there is a way we can enforce the linter to check if __del__ is missing 🤔
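One ad-hoc guard that comes to mind (a sketch only; the close()-method heuristic and the idea that these classes should define __del__ at the Python level are assumptions, not an existing lint rule):

import inspect
import cuda.core.experimental as ccx

def test_resource_classes_define_del():
    # Heuristic: any public class exposing close() presumably owns a native
    # resource and should also define __del__ as a safety net.
    for name, cls in inspect.getmembers(ccx, inspect.isclass):
        if hasattr(cls, "close"):
            has_del = any("__del__" in vars(c) for c in cls.__mro__)
            assert has_del, f"{name} has close() but defines no __del__"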
/ok to test 13e3dfb
return self._ipc_data._alloc_handle

def allocate(self, size_t size, stream: Stream = None) -> Buffer:
def allocate(self, size_t size, stream: Optional[IsStreamT] = None) -> Buffer:
Does this allocate member function need to take an alignment parameter?
    is used.
    """
    DMR_deallocate(self, <uintptr_t>ptr, size, <Stream>stream)
    stream = Stream._init(stream) if stream is not None else default_stream()
Is there a way to capture the user error of neglecting to pass the stream that the memory allocation came from? I'm thinking of a debug assert that verifies the allocated memory address came from the provided stream. Another potential option: if the user neglects to provide a stream, we do a lookup to determine where the address came from. Not sure if that is possible given the current lower-level API.
Since the stream only defines an ordering, it is legal to allocate on one stream and deallocate on another.
That said, I think the overall memory management scheme could be clarified and potentially improved.
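For illustration, a hedged sketch of the cross-stream pattern (assuming the cuda.core names Device, create_stream, Stream.wait, memory_resource.allocate, and Buffer.close; the ordering established by s_free.wait(s_alloc) is the key point):

from cuda.core.experimental import Device

dev = Device()
dev.set_current()
s_alloc = dev.create_stream()
s_free = dev.create_stream()

buf = dev.memory_resource.allocate(1 << 20, stream=s_alloc)  # allocation ordered on s_alloc
# ... enqueue work that uses buf on s_alloc ...
s_free.wait(s_alloc)        # order s_free after everything already queued on s_alloc
buf.close(stream=s_free)    # deallocation ordered on s_free, still after the dependent work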
Comment: I'd recommend chunking large PRs into smaller reviewable pieces in the future. IMHO, it makes them more approachable and consumable for reviewers.
Force-pushed 41d7268 to 7c3ad5f
/ok to test 7c3ad5f
/ok to test af22c81
/ok to test 0f2057a
Force-pushed 21ac32f to 78c4261
/ok to test 78c4261
Force-pushed 78c4261 to a9188c2
/ok to test a9188c2
with nogil:
    HANDLE_RETURN(cydriver.cuMemFreeAsync(devptr, s))
    r = cydriver.cuMemFreeAsync(devptr, s)
    if r != cydriver.CUDA_ERROR_INVALID_CONTEXT:
Q: mempools are not tied to a CUDA context, so when will this happen?
This is a reactive fix as I was seeing many of these errors at the end of a full pytest run. I think this issue exists prior to this PR, so I could separate this if you prefer.
@Andy-Jost Let's save this to a separate PR. I think this one is clean, and I'd like to merge it asap (after I read it one more time).
Two resolved review threads on cuda_core/cuda/core/experimental/_memory/_graph_memory_resource.pyx (outdated).
leofang left a comment
LGTM with a few (quick, I think!) comments!
Two resolved review threads on cuda_core/cuda/core/experimental/_memory/_virtual_memory_resource.py (outdated).
…orm-dependent errors. Add dependence on mempool_device where needed for certain tests. Touch-ups.
Force-pushed a9188c2 to 556c6bf
/ok to test 556c6bf
Force-pushed 05f3195 to d7a67b7
Force-pushed d7a67b7 to 1d07da1
/ok to test 1d07da1
leofang left a comment
LGTM! Thanks, Andy!
__all__ = ['GraphMemoryResource']

cdef class GraphMemoryResourceAttributes:
@pciolkosz this is yet another reason why CUDA graphs should support mempools... We have to use another class to wrap the attribute access, instead of reusing DeviceMemoryResourceAttributes, just because from the driver perspective they are different... 😢
@Andy-Jost apart from making the code base dirty, do you think the two attribute classes can be merged into one, with dispatch to the right driver APIs internally?
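For the sake of discussion, a rough sketch of what such a merge could look like (purely illustrative; the class, method, and attribute names below are hypothetical, with the two injected getters standing in for cuMemPoolGetAttribute and cuDeviceGetGraphMemAttribute respectively):

class _MemAttributes:
    """Hypothetical unified attribute view that dispatches each query to either
    the mempool getter or the graph-mem getter supplied by the caller."""

    def __init__(self, mempool_getter, graphmem_getter, graph_scope: bool):
        self._get_pool = mempool_getter      # e.g. wraps cuMemPoolGetAttribute
        self._get_graph = graphmem_getter    # e.g. wraps cuDeviceGetGraphMemAttribute
        self._graph_scope = graph_scope

    @property
    def used_mem_current(self):
        # Same Python-facing attribute; different driver call underneath.
        getter = self._get_graph if self._graph_scope else self._get_pool
        return getter("USED_MEM_CURRENT")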
with pytest.warns(DeprecationWarning, match=msg):
    launch(StreamWrapper(stream), config, ker)
I missed this test! Thanks! 🙏
device = Device()

if not device.properties.memory_pools_supported:
    pytest.skip("Device does not support mempool operations")

device.set_current()
Q: Any reason we don't want to move this logic (check the attribute first, and then either skip or set current) into the mempool_device fixture, and then use the fixture here?
if not device.properties.memory_pools_supported:
    pytest.skip("Device does not support mempool operations")

device.set_current()
ditto
Description
Implements GraphMemoryResource for memory interactions with the graph memory allocator. Allocations from this object succeed only when graph capturing is active. Conversely, allocations from DeviceMemoryResource now raise an exception when graph capturing is active. A new test module is added.

This change also simplifies and extends the logic for accepting arbitrary stream parameters as objects implementing __cuda_stream__. Support for that protocol was added in several places, allowing GraphBuilder to be used anywhere a stream is expected, including memory resource and buffer methods.

closes #963
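A hedged usage sketch of the workflow this description implies (the package-level export of GraphMemoryResource, its constructor signature, and Buffer.close are assumptions inferred from the diffs and tests quoted above):

from cuda.core.experimental import Device, GraphMemoryResource

dev = Device()
dev.set_current()
gmr = GraphMemoryResource(dev)  # assumed constructor

gb = dev.create_graph_builder().begin_building()
buf = gmr.allocate(1024, stream=gb)   # a GraphBuilder is accepted wherever a stream is expected
buf.close(stream=gb)
graph = gb.end_building().complete()  # allocations above succeed only while capturing is active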