ggml-backend : fix async copy from CPU #8897
@matteoserva please let me know if this fixes the issue on your system. I already tested this on @JohannesGaessler's machine, so I expect it works there.
Why does the destination backend need to be synchronized in ggml_backend_tensor_copy_async but not in ggml_backend_sched_compute_splits?
cf49428 to a5eae7a
The idea is that the scheduler makes multiple copies of every input and synchronizes access to them with events. Instead of having to synchronize the entire backend, it is enough to synchronize with the event. However, there was a missing
if (sched->events[split_backend_id][sched->cur_copy] != NULL) {
    ggml_backend_event_synchronize(sched->events[split_backend_id][sched->cur_copy]);
} else {
    ggml_backend_synchronize(split_backend);
I think this synchronization call can be optimized out since with a null event the backend has already been synchronized. But if there is no measurable performance difference it may be better to just keep it in to make the code easier to understand.
Yes, I left it there for clarity. For backends that don't support events, ggml_backend_synchronize should be a no-op anyway.
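For readers following the thread, here is a minimal sketch of the difference between waiting on an event and synchronizing a whole backend. It is a toy model in plain C, not the ggml API: the `toy_*` types and functions are hypothetical names introduced only for illustration, and a backend is assumed to behave like a single in-order queue.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not the ggml API: a backend is an in-order queue of operations. */
typedef struct {
    int submitted; /* operations queued so far */
    int completed; /* operations finished so far */
} toy_backend;

/* An event remembers the queue position at the moment it was recorded. */
typedef struct {
    toy_backend *backend;
    int position;
} toy_event;

static void toy_submit(toy_backend *b) {
    assert(b != NULL);
    b->submitted++;
}

static void toy_event_record(toy_event *e, toy_backend *b) {
    assert(e != NULL && b != NULL);
    e->backend = b;
    e->position = b->submitted;
}

/* Waits only for the work queued before the event was recorded. */
static void toy_event_synchronize(toy_event *e) {
    if (e->backend->completed < e->position) {
        e->backend->completed = e->position;
    }
}

/* Waits for everything queued on the backend. */
static void toy_backend_synchronize(toy_backend *b) {
    b->completed = b->submitted;
}

/* Same shape as the branch under review: prefer the event, else full sync. */
static void toy_wait_for_input(toy_backend *b, toy_event *e) {
    if (e != NULL) {
        toy_event_synchronize(e);
    } else {
        toy_backend_synchronize(b);
    }
}

/* Returns how many operations are still pending after the wait. */
static int toy_pending_after_wait(int use_event) {
    toy_backend gpu = {0, 0};
    toy_event ev;
    toy_submit(&gpu);            /* operation that reads the input copy */
    toy_event_record(&ev, &gpu); /* event marks the end of that operation */
    toy_submit(&gpu);            /* later, unrelated work */
    toy_submit(&gpu);
    toy_wait_for_input(&gpu, use_event ? &ev : NULL);
    return gpu.submitted - gpu.completed;
}
```

In this model, `toy_pending_after_wait(1)` leaves the two later operations in flight, while `toy_pending_after_wait(0)` drains the whole queue. That is the cost the event path is meant to avoid.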
Prior to the latest commit the fix was working on my second machine with 3x P40. I'll review the new changes tomorrow.
The changes to
@slaren The patch fixed the issue on my system. Thank you!
if (backend_src != backend_dst) {
How is it ensured that there are no race conditions between backend_src and backend_dst for this code branch?
What race conditions are you thinking about? It uses an event to synchronize the two streams.
I think I misinterpreted the code. If my understanding is correct the synchronization happens outside this function.
Part of the synchronization is done in this function, but the most complicated parts happen in ggml_backend_sched. Ultimately, the only responsibility of this function is to implement the semantics of the copy_async interface of ggml-backend, as defined in ggml-backend.h:
// asynchronous copy
// the copy is performed after all the currently queued operations in backend_src
// backend_dst will wait for the copy to complete before performing other operations
// automatic fallback to sync copy if async is not supported
GGML_API void ggml_backend_tensor_copy_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, struct ggml_tensor * src, struct ggml_tensor * dst);
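The ordering described in that header comment can be sketched with logical timestamps. This is a toy model under stated assumptions, not how ggml or CUDA implement it: `toy_queue`, `toy_enqueue`, and `toy_copy_async` are hypothetical names, each backend is assumed to be one in-order queue, and the copy is assumed to occupy the source queue.

```c
#include <assert.h>

/* Toy model, not the ggml API: each queue tracks the logical time at which
   its already-queued work will finish. */
typedef struct {
    long busy_until;
} toy_queue;

/* Appends an operation to an in-order queue and returns its finish time. */
static long toy_enqueue(toy_queue *q, long duration) {
    assert(duration >= 0);
    q->busy_until += duration;
    return q->busy_until;
}

/* Sketch of the documented semantics:
   1. the copy runs after everything currently queued in src;
   2. dst does not start further work until the copy is done. */
static long toy_copy_async(toy_queue *src, toy_queue *dst, long copy_duration) {
    long copy_done = toy_enqueue(src, copy_duration); /* after queued src work */
    if (dst->busy_until < copy_done) {
        dst->busy_until = copy_done; /* dst waits for the copy (event wait) */
    }
    return copy_done;
}

/* Returns the earliest time the next dst operation may start. */
static long toy_next_dst_start(long src_queued, long dst_queued, long copy_duration) {
    toy_queue src = {0};
    toy_queue dst = {0};
    toy_enqueue(&src, src_queued);
    toy_enqueue(&dst, dst_queued);
    toy_copy_async(&src, &dst, copy_duration);
    return dst.busy_until;
}
```

With 5 units queued on the source, 2 on the destination, and a 3-unit copy, the next destination operation cannot start before time 8. If the destination is already busy until time 10, the wait adds nothing and the copy overlaps with the destination's own work.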
* ggml-backend : fix async copy from CPU
* cuda : more reliable async copy, fix stream used when the devices are the same
Fixes #8685
The problem was that some copies from the CPU backend to the CUDA backend were not correctly synchronized, which in some cases could allow the CPU backend to overwrite the data in the next batch, before it was copied to the GPU.
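The failure mode can be reproduced in a deterministic toy model. The sketch below is not the ggml code path: `toy_copy`, `toy_copy_enqueue`, and `toy_copy_flush` are hypothetical names, and the "async copy" is simply a deferred `memcpy` that runs when flushed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model, not the ggml API: a deferred copy that only reads its source
   when it is flushed, like an async copy that has not executed yet. */
typedef struct {
    const int *src;
    int *dst;
    size_t n;
    int pending;
} toy_copy;

static void toy_copy_enqueue(toy_copy *c, const int *src, int *dst, size_t n) {
    assert(c != NULL && src != NULL && dst != NULL);
    c->src = src;
    c->dst = dst;
    c->n = n;
    c->pending = 1;
}

/* Executes the copy if it is still pending (stands in for a synchronize). */
static void toy_copy_flush(toy_copy *c) {
    if (c->pending) {
        memcpy(c->dst, c->src, c->n * sizeof(int));
        c->pending = 0;
    }
}

/* Returns the value the "device" ends up with for batch 1. */
static int toy_run_batches(int synchronize_before_overwrite) {
    int host[1] = {1};   /* batch 1 prepared by the CPU */
    int device[1] = {0};
    toy_copy copy = {0};

    toy_copy_enqueue(&copy, host, device, 1);
    if (synchronize_before_overwrite) {
        toy_copy_flush(&copy); /* wait until the copy has read the buffer */
    }
    host[0] = 2;               /* CPU writes the next batch into the same buffer */
    toy_copy_flush(&copy);     /* the copy eventually executes */
    return device[0];
}
```

Without the wait, `toy_run_batches(0)` delivers the value of the next batch (2) instead of the batch that was queued (1). With the wait, `toy_run_batches(1)` delivers 1.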