[CUDA] Recycle CUDA events #2604

Merged
zcbenz merged 7 commits into ml-explore:main from zcbenz:cuda-event-cache on Sep 23, 2025
zcbenz (Collaborator) commented Sep 17, 2025

Although a cudaEventCreate call is very cheap (about 1 to 2 µs) and eliminating it yields almost no performance gain during inference, the calls are distracting when profiling, because measuring kernel/graph time requires a pair of events for each kernel/graph.

This PR makes the CudaEvent class cache the underlying handle so that native CUDA events are reused instead of being recreated every time. For the command mlx_lm.generate --model meta-llama/Llama-3.2-1B-Instruct --prompt 'Write a story about Einstein' -m 64, the number of cudaEventCreate calls drops from 69 to 7, with no observable change in performance.
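To illustrate the recycling idea, here is a minimal sketch of a handle cache, not the code in this PR. It assumes a stand-in FakeEvent type in place of cudaEvent_t so that it builds without the CUDA toolkit; EventCache, PooledEvent and native_create_count are hypothetical names, and event flags are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for cudaEvent_t so the sketch builds without the CUDA toolkit.
using FakeEvent = int;

// Counts "native" creations, mimicking the number of cudaEventCreate calls.
inline int& native_create_count() {
  static int count = 0;
  return count;
}

// Hypothetical cache: hands out a recycled handle when one is available.
class EventCache {
 public:
  FakeEvent acquire() {
    if (!free_.empty()) {
      FakeEvent e = free_.back();
      free_.pop_back();
      return e;
    }
    // Real code would create a native event here (e.g. cudaEventCreateWithFlags).
    return ++native_create_count();
  }

  void release(FakeEvent e) {
    assert(e != 0);  // 0 marks a moved-from wrapper and is never cached
    free_.push_back(e);
  }

  std::size_t cached() const { return free_.size(); }

 private:
  std::vector<FakeEvent> free_;
};

// Move-only RAII wrapper: gives its handle back to the cache instead of
// destroying it.
class PooledEvent {
 public:
  explicit PooledEvent(EventCache& cache)
      : cache_(&cache), event_(cache.acquire()) {}
  PooledEvent(PooledEvent&& other) noexcept
      : cache_(other.cache_), event_(std::exchange(other.event_, 0)) {}
  PooledEvent(const PooledEvent&) = delete;
  PooledEvent& operator=(const PooledEvent&) = delete;
  PooledEvent& operator=(PooledEvent&&) = delete;
  ~PooledEvent() {
    if (event_ != 0) cache_->release(event_);
  }

  FakeEvent handle() const { return event_; }

 private:
  EventCache* cache_;
  FakeEvent event_;
};
```

With a cache like this, a profiling loop that takes a start/stop pair per kernel creates two native handles on the first iteration and reuses them afterwards.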

awni (Member) commented Sep 18, 2025

PR looks good!

I just have some questions about the event classes, which are hard to keep track of and whose names are not very descriptive.

RawCudaEvent (header): Used by CudaEvent
CudaEvent (header): Used by Worker and CudaEventWrapper
CudaEventWrapper (hidden): Used by EventImpl
SharedEvent (header): Used by EventImpl and Fence
EventImpl (hidden): Used by Event
Event (header): Used at high level by array etc.

  • Why do we need RawCudaEvent, CudaEvent and CudaEventWrapper? Can we merge 2 of these or maybe all 3?
  • Is it possible to merge CudaEventWrapper and CudaEvent?
  • Can Fence just use Event or CudaEvent? I'm guessing no, but I'm just curious why not.
  • Maybe we should rename SharedEvent to AtomicEvent .. unless you think that is an implementation detail?

zcbenz (Collaborator, Author) commented Sep 19, 2025

  • Why do we need RawCudaEvent, CudaEvent and CudaEventWrapper? Can we merge 2 of these or maybe all 3?
  • Is it possible to merge CudaEventWrapper and CudaEvent?

The RawCudaEvent can be removed by storing cudaEvent_t and int flags directly in CudaEvent, but then it would be hard to clean up on exit the cached events, which are stored in a static vector. Things would become simpler if we just leaked the events on exit (PyTorch does that, so it should be safe).
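For reference, the "leak on exit" option usually looks like the sketch below: the cache is heap-allocated once and never deleted, so no destructor runs during static teardown and destruction order stops mattering. LeakyEventCache and leaky_event_cache are hypothetical names, and FakeEvent stands in for cudaEvent_t.

```cpp
#include <cassert>
#include <vector>

using FakeEvent = int;  // stand-in for cudaEvent_t

// Hypothetical cache whose handles are deliberately never destroyed.
struct LeakyEventCache {
  std::vector<FakeEvent> free_events;
};

// Allocate once and never delete: the pointer is initialized on first use
// and the object it points to outlives every other static.
inline LeakyEventCache& leaky_event_cache() {
  static LeakyEventCache* cache = new LeakyEventCache();
  assert(cache != nullptr);
  return *cache;
}
```

The trade-off is that the native events are left to the driver to reclaim at process exit.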

The difference between CudaEvent and CudaEventWrapper is that the latter is copyable, because it needs to be copied to the CPU stream to implement the wait(stream) API. It should be possible to merge CudaEvent and CudaEventWrapper into one class by using std::shared_ptr with a custom deleter, but ref-counting has a small overhead (a heap allocation and an atomic counter), and I want to keep overhead as low as possible when using CudaEvent for profiling.
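A rough sketch of the std::shared_ptr variant, to show where the extra cost comes from. This is only an illustration under assumptions: FakeEvent stands in for cudaEvent_t, and SharedHandleEvent and recycled_events are hypothetical names rather than classes from this PR.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using FakeEvent = int;  // stand-in for cudaEvent_t

// Hypothetical free list that receives a handle when its last copy goes away.
inline std::vector<FakeEvent>& recycled_events() {
  static std::vector<FakeEvent> events;
  return events;
}

// Copyable event: all copies share one handle through a ref-counted pointer.
class SharedHandleEvent {
 public:
  explicit SharedHandleEvent(FakeEvent e)
      : handle_(new FakeEvent(e), [](FakeEvent* p) {
          assert(p != nullptr);
          recycled_events().push_back(*p);  // recycle instead of destroying
          delete p;
        }) {}

  FakeEvent handle() const { return *handle_; }
  long use_count() const { return handle_.use_count(); }

 private:
  // The control block is a heap allocation holding an atomic reference
  // count, which is the overhead mentioned above.
  std::shared_ptr<FakeEvent> handle_;
};
```

Copies can be captured by a task on another stream, and the handle is recycled only after the last copy is destroyed.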

Possible better names for the classes might be:

  • RawCudaEvent => CudaEventHandle
  • CudaEventWrapper => RefCountedCudaEvent or CopiableCudaEvent

  • Can Fence just use Event or CudaEvent? I'm guessing no, but I'm just curious why not.

Fence needs to be able to wait on a CPU stream from a GPU stream, and it uses the count parameter of wait/signal; CudaEvent supports neither. Using Event directly should be possible, but in most cases I think it would just redirect to SharedEvent.
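To make the count semantics concrete, here is a host-only sketch of a counter-based event. It is not the SharedEvent implementation; CounterEvent is a hypothetical name, and a real version would also have to be visible to GPU streams, which this sketch does not attempt.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical host-only counter event.
class CounterEvent {
 public:
  // Publish a new count; waiters for any value <= count are released.
  void signal(std::uint64_t count) {
    assert(count >= value_.load(std::memory_order_relaxed));
    value_.store(count, std::memory_order_release);
  }

  // Block the calling thread until the stored count reaches `count`.
  void wait(std::uint64_t count) const {
    while (value_.load(std::memory_order_acquire) < count) {
      std::this_thread::yield();
    }
  }

  std::uint64_t value() const {
    return value_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<std::uint64_t> value_{0};
};
```

A plain native event has no such counter, which is why a fence that relies on wait/signal with a count cannot be built on it alone.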

  • Maybe we should rename SharedEvent to AtomicEvent .. unless you think that is an implementation detail?

AtomicEvent is a better name.

awni (Member) left a comment

Looks great, thanks!

@zcbenz zcbenz merged commit ae438d0 into ml-explore:main Sep 23, 2025
7 checks passed
@zcbenz zcbenz deleted the cuda-event-cache branch September 23, 2025 01:42
faisalmemon pushed a commit to faisalmemon/mlx that referenced this pull request Oct 30, 2025
* Make CudaEvent a CudaHandle

* Add caching for CudaEvent

* Make sure cuda events are destroyed at last

* Fix headers

* SharedEvent => AtomicEvent

* RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent

* Remove unneeded asserts