Multi-queue synchronization #1169
Thanks for putting this in writing!
```
undefined submit(
    sequence<GPUCommandBuffer> commandBuffers,
    optional sequence<GPUResourceTransfer> transfers = []);
```
> Getting the signal value and waiting on it are separate operations on GPUQueue, as opposed to being associated with submissions specifically, because they also need to take into account writeBuffer and writeTexture.

What does the PR description mean when it says "because they also need to take into account writeBuffer and writeTexture"?

I would have expected that after submit returns control to the caller, the developer is free to use the resources in the transfer list on the receiving queue, and the WebGPU implementation handles setting up the fences as needed to avoid undefined behavior. What is the advantage of having the developer do this themselves?
I think what's meant here is: "Getting signal values and waiting on signals are separate queue operations, rather than rolled into the submit() call, because writeBuffer and writeTexture also submit work to the queue. This avoids having to extend all three functions (submit, writeBuffer, and writeTexture) to return signal values and be able to wait on signals."
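The distinction above can be sketched with a toy model (hypothetical names, not the real WebGPU API): every kind of queue operation — submit(), writeBuffer(), writeTexture() — advances the same queue timeline, so a single queue-level signal() covers all of them, which is why signaling isn't tied to submit() alone.

```javascript
// Toy model of a queue timeline (invented class, not the WebGPU API).
class QueueTimelineModel {
  constructor() {
    this.pendingOps = 0;        // operations enqueued so far
    this.lastSignaledValue = 0; // monotonically increasing signal value
  }
  // All three kinds of operations enqueue work on the same timeline.
  submit(commandBuffers) { this.pendingOps += 1; }
  writeBuffer(buffer, offset, data) { this.pendingOps += 1; }
  writeTexture(dest, data) { this.pendingOps += 1; }
  // A queue-level signal covers everything enqueued before it,
  // regardless of which operation produced the work.
  signal() { this.lastSignaledValue += 1; return this.lastSignaledValue; }
}

const q = new QueueTimelineModel();
q.submit(["A1"]);
q.writeBuffer("buf", 0, new Uint8Array(4));
const v = q.signal(); // covers both the submit and the writeBuffer
```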
> I would have expected that after submit returns control to the caller, the developer is free to use the resources in the transfer list on the receiving queue and the WebGPU implementation handles setting up the fences as needed to avoid undefined behavior. What is the advantage of having the developer do this themselves?
We shouldn't just implicitly insert a wait() on the receiving queue at whatever arbitrary time the app happens to issue the transferring submit on the sending queue. The app needs to have control over what point in the receiving queue the resources are received, otherwise the receiving queue can stall too early and go idle when it shouldn't have to.
> We shouldn't just implicitly insert a wait() on the receiving queue at whatever arbitrary time the app happens to issue the transferring submit on the sending queue. The app needs to have control over what point in the receiving queue the resources are received, otherwise the receiving queue can stall too early and go idle when it shouldn't have to.
I agree it would be sub-optimal to insert a wait on the receiving queue at the time of the transferring submit. Why can't we insert a wait the first time you submit a command list with the object to the receiving queue? Wouldn't we need to run the same logic to validate the developer hasn't missed a wait? Missing a wait would lead to undefined behavior.
> Why can't we insert a wait the first time you submit a command list with the object to the receiving queue?
Because that would make the graph of synchronization implicit. Missing a wait would cause a validation error, not UB. As said in the meeting, I'll detail the thought process behind this proposal further.
spec/index.bs
Outdated
```
GPUSignalValue signal();
readonly attribute GPUSignalValue lastSignaledValue;

Promise<undefined> onCompletion(optional GPUSignalValue value);

GPUFence createFence(optional GPUFenceDescriptor descriptor = {});
undefined signal(GPUFence fence, GPUFenceValue signalValue);
undefined wait(GPUQueue queue, optional GPUSignalValue value);
```
I understand how the transfers parameter works, as well as how signal, wait, and onCompletion work on their own. However, I do not understand how these are meant to be used with one another, in particular the onCompletion method. I think a small bit of sample code would help clarify things.
onCompletion works just like GPUFence.onCompletion: it returns a promise that is resolved when the content timeline knows that a fence/queue reached the given signal value. It does not interact with transfers or anything like that.
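As a rough illustration of that behavior, here is a mock (not the real API; the `completeUpTo` helper is invented to stand in for the GPU finishing work) where onCompletion(value) returns a promise that resolves once the queue's completed value reaches the requested signal value:

```javascript
// Mock queue with signal values and completion promises (invented names).
class MockQueue {
  constructor() {
    this.lastSignaledValue = 0;
    this.completedValue = 0;
    this.waiters = []; // pending { value, resolve } pairs
  }
  signal() { return ++this.lastSignaledValue; }
  // Resolves when the completed value reaches `value`
  // (defaulting to the latest signaled value).
  onCompletion(value = this.lastSignaledValue) {
    if (value <= this.completedValue) return Promise.resolve(value);
    return new Promise(resolve => this.waiters.push({ value, resolve }));
  }
  // Stand-in for the GPU catching up to a signal value.
  completeUpTo(value) {
    this.completedValue = value;
    this.waiters = this.waiters.filter(w => {
      if (w.value <= value) { w.resolve(w.value); return false; }
      return true;
    });
  }
}

const q = new MockQueue();
const v = q.signal();
const done = q.onCompletion(v); // pending until the mock "GPU" catches up
q.completeUpTo(v);              // now `done` resolves
```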
As promised in the last call, here's an explanation of the thought process behind this proposal. This explanation has some overlap with the Alternatives section of #1073.

**Explicit synchronization.** The basis of this proposal is that it is important for the synchronization to be visible to developers. The multi-queue feature's goal is to expose the coarse-grained parallelism of the hardware to more advanced WebGPU users. These users will want to guarantee that, if the hardware supports multiple queues, then (for example) the memory-bound workload on one queue runs in parallel with the fixed-function-bound workload on the other queue, because that maximizes utilization. The DAG of execution of work on multiple queues needs to be under the direct control of the developer and easily visible in the code. This is why the approach starts from the assumption that transfers of resources have to be explicit and will produce validation errors if done wrong (instead of UB or additional implicit synchronization). Developers need the guarantee that the DAG of execution that's used is exactly the one they specified.

**Explicit synchronization via "fences".** One possibility, like @RafaelCintron commented above, or @kvark suggested in #1066, is that the operation of doing a transfer on a queue is in itself "explicit synchronization", in that it is shown in the code. The receiving queue's following operations (or following operations that use the transferred resource) would be made to wait until the transferring queue has completed the commands before the transfer. This semantic is sound, but it has several drawbacks.

The first one is that it adds constraints on the order in which the DAG is defined. All proposals need to require that the DAG be specified in a topological order, but having the resource transfers themselves define the synchronization reduces the application's flexibility in the order in which it defines the DAG. Basically, if a developer wants to produce the following DAG:

Then they need to do the following calls:

```
queueB.submit(B1);
queueA.submit(A1, transferToB);
queueA.submit(A2);
queueB.submit(B2);
```

But it would be much more convenient for them to do the following, because it means the code for submitting to A and B can be more independent:

```
queueA.submit(A1, transferToB);
queueA.signal(42);
queueA.submit(A2);
queueB.submit(B1);
queueB.wait(queueA, 42);
queueB.submit(B2);
```

For composability of code, it would be even better to be able to wait on a queue before signaling on another one, but that brings additional issues, so it isn't suggested in the proposal above.

Transfers being synchronization also makes the synchronization less visible. It hides one half of the synchronization, the one for the receiving queue, inside the transfer command. This makes the synchronization invisible in the code that uses the receiving queue. It also buries the synchronization inside dictionaries in a list that's passed as one of the arguments of a function call. The list of transfers could be built inside the deep rendering logic and bubbled up; then, in the code that does the submit, you'd see only the submit call and have no idea what synchronization it implies.

Finally, some synchronization is useful even without transfers. As discussed above, the application may want to synchronize queues just to make sure that the workloads being executed in parallel have different bottlenecks. This could be done with fake transfers, but really the synchronization and the transfer parts should be more orthogonal.

So overall it's not enough to have transfers given in a call to submit do the synchronization. Synchronization should be done with explicit signal and wait operations on the queues.
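To make the "validation error instead of UB" point concrete, here is a toy model (assumed semantics, invented class names — not spec text) where waiting on a value the source queue hasn't yet signaled is recorded as a validation error rather than producing undefined behavior:

```javascript
// Toy model of the proposal's validation rule (invented class).
class ValidatingQueue {
  constructor(name) {
    this.name = name;
    this.lastSignaledValue = 0;
    this.errors = []; // validation errors, never UB
  }
  signal(value) { this.lastSignaledValue = value; }
  wait(sourceQueue, value) {
    // The DAG stays explicit: a wait is only valid if the matching
    // signal was already recorded on the source queue.
    if (value > sourceQueue.lastSignaledValue) {
      this.errors.push(`wait(${sourceQueue.name}, ${value}) before matching signal`);
    }
  }
}

const queueA = new ValidatingQueue("A");
const queueB = new ValidatingQueue("B");
queueA.signal(42);
queueB.wait(queueA, 42); // valid: 42 was already signaled
queueB.wait(queueA, 43); // validation error: 43 was never signaled
```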
For the example you gave, since the API is required to put the wait for A1 right before the submission of B2, you can submit the work as you have in the second section (labeled as the much more convenient one) even with implicit waits. How do explicit waits make things more flexible? It's already the case that you can't wait for a particular number before you signal that number, so I must be missing something.

WebGPU already hides synchronization for buffer mapping and all memory barriers. What is it about multi-queue synchronization that you feel needs the synchronization to be explicit?

Can you give an example of a scenario or workload that would benefit from explicit transfers, where implicit transfers would otherwise lead to sub-optimal execution or incorrect results?
The point that wasn't very clear is that if the transfers are the synchronization, then the following code will schedule B1 after A1, so there are more constraints on your code's structure if you want to control the exact shape of the DAG. Or the WebGPU implementation knows that the resource is only used in B2, but then it decides on the DAG in a manner that's not very predictable and explicit.

Actually, synchronization for buffer mapping is explicit; this is why we have explicit map operations that return promises.
I agree that native APIs let you do powerful things, but that's because the developer knows more about their data dependencies than what they can express to the API. However, WebGPU is (by design) more limiting, because any graph you build must be a subset of what the validation layer is willing to let you create. In a sense, the validation layer's graph is the upper limit of the graph the web developer can build. For portability, that graph should be specced. The WebGPU API knows which resources are owned by each queue and which ones are used in each command list. Does there exist a set of WebGPU instructions such that the developer can build a better graph than the API, but still pass the API's validation and be free of data races? If so, that would be a good example to talk about.

Pedantic: having buffer mapping use promises is one form of being explicit, yes. My point was that we didn't spec buffer mapping by having the web developer signal and wait on fence values. One can say that passing a list of resources to submit, with the destination queue clearly spelled out, is also a form of being explicit.
Yes, by making workloads that maximize different resources of the GPU run at the same time. See the linked presentation, which says they specifically overlap shadow map generation with post-processing because one stresses the fixed-function units while the other stresses the ALUs. Explicit control helps achieve this level of detail; otherwise, developers who are interested in it will have to fake signal-wait synchronization with dummy resource transfers.

I think the analogy is interesting but not too relevant, because buffer mapping is CPU–GPU synchronization, with the interesting part being how it is exposed in JavaScript, while multi-queue is GPU–GPU synchronization that's pipelined on an asynchronous coprocessor.
As an aside, it occurs to me that if we made flushing buffered-up queue work (submits, writes, waits) explicit, with a flush() call (as was briefly discussed in the past), then we could allow out-of-order waits and validate them in flush().
I see that @RafaelCintron raised exactly the concerns I had after reading @Kangz's long comment 🥇 I don't think explicit sync gives you any more composability or flexibility. You still have to submit things in order on different queues, and in addition you have to care about fence values. What explicit sync improves, however, is that now you can see the DAG by reading the wait calls.
If we drop the more-or-less unrelated Fence deletions here, I think the GPUSignalValue approach would have Approval from me!
spec/index.bs
Outdated
```
GPUFence createFence(optional GPUFenceDescriptor descriptor = {});
undefined signal(GPUFence fence, GPUFenceValue signalValue);
undefined wait(GPUQueue queue, optional GPUSignalValue value);
```
Since q1.wait(q2) is just shorthand for q1.wait(q2, q2.lastSignaledValue), isn't that easy enough to require writing out, given that we expect a generally higher bar for multiqueue users? I think explicit is good here, and it's barely even extra code to require the explicit version.
I think it's also totally harmless to allow the shorter q1.wait(q2) to be possible.
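For illustration, the shorthand under discussion can be modeled with a default parameter (mock code with invented names, not the real API): wait(q) behaves exactly like wait(q, q.lastSignaledValue).

```javascript
// Mock queue showing the wait() default-value shorthand.
class Q {
  constructor() { this.lastSignaledValue = 0; this.waits = []; }
  signal() { return ++this.lastSignaledValue; }
  // Default parameter evaluated at call time, so the shorthand
  // picks up whatever the source queue last signaled.
  wait(queue, value = queue.lastSignaledValue) {
    this.waits.push([queue, value]);
  }
}

const q1 = new Q(), q2 = new Q();
q2.signal();
q2.signal();                        // lastSignaledValue is now 2
q1.wait(q2);                        // shorthand
q1.wait(q2, q2.lastSignaledValue);  // explicit, equivalent
```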
```
: <dfn>wait(queue, value)</dfn>
::
    Creates a {{GPUFence}}.
    Waits for the |queue| reaching |value| on the [=Queue timeline=]. If |value| is undefined,
```
"Wait" sounds like it might CPU-block. Maybe:

> Synchronizes subsequent queue operations on |this| (e.g. submit, writeBuffer) to happen-after |queue|'s queue operations, up to and including the point where |queue.signal| returned |value| (or a smaller value).
Actually, we should consider renaming wait to synchronizeAfter for clarity.
I agree that the name wait may be a little misleading, even if it's specified to happen on the queue timeline. In native APIs, vkQueueWaitIdle is more of "me wait for the queue" than "queue wait for something else". Therefore, it seems to me that waitFor would make this clear.
+1 for waitFor
spec/index.bs
Outdated
```
</div>

: <dfn>onCompletion(completionValue)</dfn>
: <dfn>onCompletion(signalValue)</dfn>
```
onSignalComplete?
Alternatively maybe we should try to make this not sound like an event handler (with the word "on" at the front). I'll repost this comment on the relevant PR
spec/index.bs
Outdated
```
::
    Returns a {{Promise}} that resolves once the fence's completion value ≥ `completionValue`.
    Returns a {{Promise}} that resolves once the queue reaches `signalValue`
    (or |this|.{{GPUQueue/lastSignaledValue}} if undefined) on the [=content timeline=].
```
I think this is better as requiring onCompletion(this.lastSignaledValue). The implicit version onCompletion() looks like it'll resolve when all pending queue operations have completed.

```
q.submit(A)
const v = q.signal()
q.submit(B)
await q.onCompletion()
```

Users would naively expect this to resolve after q.submit(B) is complete. The explicit version of what those users want is simple and clear: q.onCompletion(q.signal())
Makes sense!
I brought back the fence code this PR was originally removing, as requested.

I believe this PR is sufficiently simple for both users and implementors to start prototyping. I don't think we should rush into trying to smooth things into being more automated/implicit until we get more feedback. After all, the current group has little-to-no experience with multiple GPU queues, so we shouldn't try too hard to make it "ideal" for users until we know more.
@kvark, I am definitely in favor of web developers explicitly transferring resources between queues via the transfers parameter of submit. However, I question the need to also require web developers to place waits and signals. Since the WebGPU implementation will be fully validating waits and signals, and prevents developers from waiting on a number that hasn't been signaled, I think it can do a good enough job for developers. If we need more explicit control, we can add it as a future change when we consider advanced features like (more) explicit memory barriers and bindless.
@RafaelCintron thank you for the feedback. I think having

So when we consider the
I am referring to option 3: explicit transfers, with errors (and therefore dropped submissions) when you use a resource that doesn't belong to a queue. The developer knows the "sending" queue they specify is signaled and the "receiving" queue waits.
To clarify, the developer doesn't know when that wait is going to happen. Imagine that we are the developers, and we just made a submission to queue 'A' with a resource transfer into queue 'B'. Then suppose we are making a submission to queue 'B'. Question: will there be a queue wait for 'A'? In order to answer that question, we'll need to know if that resource is used in the submission we are making. If it's not used, then the wait will happen some time later (in one of the following submissions). But figuring out if it's used is very difficult:

So in order to trace a usage of a texture, we as developers (or reviewers) would have to chase it through up to 5 levels of indirection (!). In other words, we can't expect anyone to reason about this with a high level of confidence. Explicit synchronization avoids this problem.
You are correct. I was assuming the reason for the transfer is that the developer actually wants to use the resource on queue B.

I must be missing something, but I don't understand why developers need to do all of this "chasing" and "guessing". If you transfer the resource to another queue and use it on that queue, the API will do the right thing to prevent data races. The same is true for memory barriers. Yes, you can chase and guess where the API puts the memory barriers by doing a similar analysis to the one you did for signals and waits. Or you can trust that the API is doing it right. The only persuasive argument I've seen for requiring waits and signals is when the developer can do a better job than WebGPU at putting them in the right place because (perhaps) they know more about the hardware than WebGPU does. But having developers optimize for one piece of hardware risks the "optimization" not working as well on all hardware, and we want good portability. This is why I suggest we consider adding explicit waits/signals as a follow-on feature when we come across those use cases in the wild.
I think users can implement graph-checking code themselves if they have a need for it. I don't think we absolutely need to offer that in the core API here.
@jdashg the concerns I expressed yesterday weren't about validation. The problem is that multi-queue is an expert-level feature, and expert developers want to use it to control which operations are done in parallel on the multiple queues. For example, the Doom 2016 presentation describes using it to run shadow map generation and post-processing in parallel for "bottleneck maximization" (concept name courtesy of @kainino0x). If we have implicit synchronization on the receiving end, then it is hard for developers to know that shadow maps, and only shadow maps, will run in parallel with the post-processing. If some other part of their render graph runs in parallel with the post-processing, it could kill the "bottleneck maximization" and lose the performance benefits. For example, if the browser synchronizes the first time a resource is used on the receiving queue, the developers could make a change that happens to use a transferred resource in the previous submit, causing the synchronization to move earlier and silently breaking the bottleneck maximization. Another example: if the browser synchronizes on the next submit on the receiving queue (which is technically valid; nothing in the spec would say what exact execution graph you'd get), then it likewise silently produces an execution graph that breaks their assumptions (and the perf gains). That's why I strongly believe we need explicit synchronization on the receiving queue; otherwise the feature will have much reduced benefits, because it will silently lose the perf gains super easily.

I think we had good discussions and a general direction for prototyping, but I would like to reiterate that I don't think multi-queue should be a required feature, or even specced as optional, for v1 because:
Closing stale PR; we will revisit this as a proposal when we get back to looking at multi-queue.
Closes #1066
Closes #1073

This PR describes a merged proposal between the two, by the collective authorship of @Kangz, @austinEng, @kainino0x, and myself. It's incomplete (more spec text and validation is needed), but it clearly shows the direction.

The proposal removes `GPUFence` objects and instead assigns a monotonically increasing number to each queue directly, which simplifies the explicit synchronization and removes one level of indirection. We believe that explicit synchronization is still desired here because it has little impact on API usability, since using multiple queues is optional.

Getting the signal value and waiting on it are separate operations on `GPUQueue`, as opposed to being associated with submissions specifically, because they also need to take into account `writeBuffer` and `writeTexture`.

The transfer of ownership is still a part of the submission (like with `GPUTextureHandover` in #1066), but it's provided on the queue (like in #1073). This allows the Vulkan backend to avoid an extra dummy submission, all while still specifying the resource ownership fully in the `GPUQueue` API.

Texture ownership is done at the subresource level, which matches WebGPU synchronization today, as well as the native APIs we target. This allows, for example, a streaming queue to upload individual mipmap levels and pass their ownership to the main queue.

Note that the PR doesn't include simultaneous access (`VK_SHARING_MODE_CONCURRENT`) between queues yet. I believe enabling it for buffers would only require minor changes to the spec, and possibly an extra flag in `GPUBufferDescriptor`. We will follow up with this.
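As a hypothetical illustration of subresource-level ownership (the `OwnershipTracker` class and queue names are invented for this sketch, not part of the proposal's IDL), per-mip-level ownership and the "validation error, not UB" check could be modeled as:

```javascript
// Toy model: each mip level of a texture can be owned by a different queue.
class OwnershipTracker {
  constructor() {
    this.owner = new Map(); // "texture:mipLevel" -> owning queue name
  }
  key(texture, mipLevel) { return `${texture}:${mipLevel}`; }
  setOwner(texture, mipLevel, queue) {
    this.owner.set(this.key(texture, mipLevel), queue);
  }
  // Called when a submission on `queue` uses the subresource; a mismatch
  // would be a validation error (dropped submission), never UB.
  validateUse(texture, mipLevel, queue) {
    return this.owner.get(this.key(texture, mipLevel)) === queue;
  }
}

const tracker = new OwnershipTracker();
// A streaming queue uploads mip level 3, then hands it to the main queue.
tracker.setOwner("tex", 3, "streamingQueue");
tracker.setOwner("tex", 3, "mainQueue"); // ownership transfer at submit()
const ok = tracker.validateUse("tex", 3, "mainQueue");
const bad = tracker.validateUse("tex", 3, "streamingQueue");
```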