
zcbenz · 2025-09-17T11:16:36Z

The cudaEventCreate call is very cheap (about 1-2 µs), so eliminating it gives almost no performance gain during inference. It does get in the way when profiling, though, because measuring kernel/graph time requires a pair of events for each kernel/graph.

This PR makes the CudaEvent class cache the underlying handle, so native CUDA events are reused instead of being recreated every time. For the command mlx_lm.generate --model meta-llama/Llama-3.2-1B-Instruct --prompt 'Write a story about Einstein' -m 64, the number of cudaEventCreate calls dropped from 69 to 7, with no observable change in performance.
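For readers who want to see the recycling idea in isolation, here is a minimal sketch, not the code from this PR. It assumes stand-in functions (fake_event_create, fake_event_destroy) in place of the real CUDA calls so it compiles without the CUDA toolkit, and the EventPool and PooledEvent names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-ins for cudaEvent_t, cudaEventCreateWithFlags and cudaEventDestroy.
// They only count calls; every name in this sketch is hypothetical.
using FakeEvent = int;
static int g_create_calls = 0;
static int g_destroy_calls = 0;

inline FakeEvent fake_event_create(int /*flags*/) { return ++g_create_calls; }
inline void fake_event_destroy(FakeEvent /*event*/) { ++g_destroy_calls; }

// Keeps idle handles around so they can be handed out again.
class EventPool {
 public:
  struct Handle {
    FakeEvent event;
    int flags;
  };

  // Reuse an idle handle created with the same flags, else create a new one.
  Handle acquire(int flags) {
    for (std::size_t i = 0; i < idle_.size(); ++i) {
      if (idle_[i].flags == flags) {
        Handle h = idle_[i];
        idle_[i] = idle_.back();
        idle_.pop_back();
        return h;
      }
    }
    return Handle{fake_event_create(flags), flags};
  }

  // Return a handle to the pool instead of destroying it.
  void release(Handle h) { idle_.push_back(h); }

  // Native events are destroyed only when the pool itself goes away.
  ~EventPool() {
    for (const Handle& h : idle_) fake_event_destroy(h.event);
  }

 private:
  std::vector<Handle> idle_;
};

// RAII wrapper: takes a handle on construction, gives it back on destruction.
class PooledEvent {
 public:
  PooledEvent(EventPool& pool, int flags)
      : pool_(pool), handle_(pool.acquire(flags)) {}
  PooledEvent(const PooledEvent&) = delete;
  PooledEvent& operator=(const PooledEvent&) = delete;
  ~PooledEvent() { pool_.release(handle_); }

  FakeEvent get() const { return handle_.event; }

 private:
  EventPool& pool_;
  EventPool::Handle handle_;
};

// Simulates profiling `n` kernels with one start/stop event pair each and
// returns how many native create calls were needed.
inline int creates_for_profiled_kernels(int n) {
  g_create_calls = 0;
  EventPool pool;
  for (int i = 0; i < n; ++i) {
    PooledEvent start(pool, /*flags=*/0);
    PooledEvent stop(pool, /*flags=*/0);
    assert(start.get() != stop.get());  // a pair never shares one handle
  }
  return g_create_calls;
}
```

In this sketch the number of create calls depends on how many events are alive at the same time, not on how many kernels are profiled: after the first pair, every later iteration reuses the two pooled handles.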

mlx/backend/cuda/eval.cpp

awni · 2025-09-18T13:35:21Z

PR looks good!

I just have some questions about all the event classes, which are hard to keep track of and whose names are not very descriptive.

RawCudaEvent (header): Used by CudaEvent
CudaEvent (header): Used by Worker and CudaEventWrapper
CudaEventWrapper (hidden): Used by EventImpl
SharedEvent (header): Used by EventImpl and Fence
EventImpl (hidden): Used by Event
Event (header): Used at high level by array etc.

Why do we need RawCudaEvent, CudaEvent and CudaEventWrapper? Can we merge 2 of these or maybe all 3?
Is it possible to merge CudaEventWrapper and CudaEvent?
Can Fence just use Event or CudaEvent? I'm guessing no, but I'm just curious why not.
Maybe we should rename SharedEvent to AtomicEvent .. unless you think that is an implementation detail?

zcbenz · 2025-09-19T00:06:47Z

> Why do we need RawCudaEvent, CudaEvent and CudaEventWrapper? Can we merge 2 of these or maybe all 3?
>
> Is it possible to merge CudaEventWrapper and CudaEvent?

RawCudaEvent could be removed by storing the cudaEvent_t and the int flags directly in CudaEvent, but then it would be hard to clean up the cached events on exit, since they are stored in a static vector. Things would be simpler if we just leaked the events on exit (PyTorch does that, so it should be safe).

The difference between CudaEvent and CudaEventWrapper is that the latter is copyable, because it needs to be copied to the CPU stream to implement the wait(stream) API. It should be possible to merge CudaEvent and CudaEventWrapper into one class by using std::shared_ptr with a custom deleter, but ref-counting has a small overhead (a heap allocation and an atomic counter), and I want to keep overhead as low as possible when using CudaEvent for profiling.
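To make the trade-off concrete, here is a rough sketch of the std::shared_ptr plus custom deleter approach mentioned above. It is an illustration only, not what the PR does: FakeEvent, fake_event_create, fake_event_destroy and SharedHandleEvent are hypothetical stand-ins so the example builds without CUDA.

```cpp
#include <memory>

// Stand-in for a native event object; all names here are hypothetical.
struct FakeEvent {
  int id;
};
static int g_live_events = 0;

inline FakeEvent* fake_event_create() {
  ++g_live_events;
  return new FakeEvent{g_live_events};
}

inline void fake_event_destroy(FakeEvent* event) {
  --g_live_events;
  delete event;
}

// Copyable event: every copy shares the same native handle, and the custom
// deleter runs once, when the last copy goes away. Constructing the
// shared_ptr allocates a control block on the heap and copies update an
// atomic reference count, which is the overhead discussed above.
class SharedHandleEvent {
 public:
  SharedHandleEvent() : handle_(fake_event_create(), &fake_event_destroy) {}

  FakeEvent* get() const { return handle_.get(); }
  long owners() const { return handle_.use_count(); }

 private:
  std::shared_ptr<FakeEvent> handle_;
};

// Copies an event (as a task queued on another stream might) and returns the
// number of native events still alive after both copies are gone, or -1 if
// the copies did not share one handle.
inline int live_events_after_copy() {
  {
    SharedHandleEvent a;
    SharedHandleEvent b = a;
    if (a.get() != b.get() || a.owners() != 2 || g_live_events != 1) {
      return -1;
    }
  }
  return g_live_events;
}
```

A non-copyable class that owns the handle directly avoids both the control block and the atomic counter, which is presumably why a separate lightweight class is attractive for the profiling path.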

Possible better names for the classes might be:

RawCudaEvent => CudaEventHandle
CudaEventWrapper => RefCountedCudaEvent or CopiableCudaEvent

> Can Fence just use Event or CudaEvent? I'm guessing no, but I'm just curious why not.

Fence needs to be able to wait on a CPU stream in a GPU stream, and it uses the count parameter of wait/signal; CudaEvent supports neither. Using Event directly should be possible, but in most cases I think it would just redirect to SharedEvent.
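As a rough illustration of what a count parameter on wait/signal means, here is a host-only sketch of a counter-based event. It is an assumption-laden stand-in, not MLX's SharedEvent: the CounterEvent name and its mutex/condition-variable implementation are hypothetical, and it only models CPU-side waiting.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical counter-based event: signal(v) publishes a value and wait(v)
// blocks until the published value has reached at least v.
class CounterEvent {
 public:
  void signal(std::uint64_t value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (value > value_) value_ = value;  // the counter only moves forward
    }
    cv_.notify_all();
  }

  void wait(std::uint64_t value) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return value_ >= value; });
  }

  bool is_signaled(std::uint64_t value) {
    std::lock_guard<std::mutex> lock(mutex_);
    return value_ >= value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::uint64_t value_ = 0;
};

// Single-threaded walk-through of the semantics; returns true if the counter
// behaves as described. In real use the waiter would be on another thread.
inline bool counter_event_demo() {
  CounterEvent event;
  if (event.is_signaled(1)) return false;  // nothing signaled yet
  event.signal(2);
  event.wait(1);  // returns immediately: 2 >= 1
  event.wait(2);  // returns immediately: 2 >= 2
  return event.is_signaled(2) && !event.is_signaled(3);
}
```

A plain recorded/not-recorded event has no such counter, so one object cannot express "wait until step N has been signaled"; that is the gap the count parameter fills.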

> Maybe we should rename SharedEvent to AtomicEvent .. unless you think that is an implementation detail?

AtomicEvent is a better name.

awni

Looks great, thanks!

* Make CudaEvent a CudaHandle
* Add caching for CudaEvent
* Make sure cuda events are destroyed at last
* Fix headers
* SharedEvent => AtomicEvent
* RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent
* Remove unneeded asserts

awni reviewed Sep 18, 2025


zcbenz force-pushed the cuda-event-cache branch from f4e60d4 to 9c55fc5 Compare September 19, 2025 02:07

awni approved these changes Sep 22, 2025


zcbenz added 5 commits September 22, 2025 16:25

Make CudaEvent a CudaHandle

c523641

Add caching for CudaEvent

36e66a9

Make sure cuda events are destroyed at last

afbea5d

Fix headers

ad5ebf4

SharedEvent => AtomicEvent

a50fb5c

zcbenz force-pushed the cuda-event-cache branch from 17bc454 to 100e8ef Compare September 22, 2025 23:34

RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent

a20fb07

zcbenz force-pushed the cuda-event-cache branch from 100e8ef to a20fb07 Compare September 22, 2025 23:37

Remove unneeded asserts

9c3be31

zcbenz force-pushed the cuda-event-cache branch from 8cba644 to 9c3be31 Compare September 23, 2025 00:13

zcbenz merged commit ae438d0 into ml-explore:main Sep 23, 2025
7 checks passed

zcbenz deleted the cuda-event-cache branch September 23, 2025 01:42

BrewTestBot mentioned this pull request Nov 20, 2025

mlx 0.30.0 Homebrew/homebrew-core#255173

Merged

1 task


[CUDA] Recycle CUDA events #2604

zcbenz merged 7 commits into ml-explore:main from zcbenz:cuda-event-cache
