Conversation
PR looks good! I just have some questions about all the event classes: they are hard to keep track of, and the names are not very descriptive.
Possible better names for the classes might be:
|
* Make CudaEvent a CudaHandle
* Add caching for CudaEvent
* Make sure cuda events are destroyed at last
* Fix headers
* SharedEvent => AtomicEvent
* RawCudaEvent => CudaEventHandle, CudaEventWrapper => CopyableCudaEvent
* Remove unneeded asserts
While the `cudaEventCreate` call is very cheap (it takes about 1~2µs) and eliminating it yields almost no performance gain during inference, it is disruptive when profiling, because measuring kernel/graph time requires a pair of events for each kernel/graph.

This PR makes the `CudaEvent` class cache the underlying handle, so the native CUDA events are reused instead of being recreated every time. For the command `mlx_lm.generate --model meta-llama/Llama-3.2-1B-Instruct --prompt 'Write a story about Einstein' -m 64`, the number of `cudaEventCreate` calls dropped from 69 to 7, with no observable change in performance.