
@kvark commented Oct 23, 2020

Closes #1066
Closes #1073
This PR describes a proposal merged from the two issues above, by the collective authorship of @Kangz, @austinEng, @kainino0x, and myself. It's incomplete (more spec text and validation is needed), but it clearly shows the direction.

The proposal removes GPUFence objects and instead assigns a monotonically increasing number to each queue directly, which simplifies the explicit synchronization and removes one level of indirection. We believe that explicit synchronization is still desired here because it has little impact on API usability, since using multiple queues is optional.

Getting the signal value and waiting on it are separate operations on GPUQueue, as opposed to being associated with submissions specifically, because they also need to take into account writeBuffer and writeTexture.

The transfer of ownership is still a part of the submission (like with GPUTextureHandover in #1066), but it's provided on the queue (like in #1073). This allows the Vulkan backend to avoid an extra dummy submission, all while still specifying the resource ownership fully in the GPUQueue API.

Texture ownership is done at the subresource level, which matches WebGPU synchronization today, as well as native APIs we target. This allows, for example, a streaming queue to upload individual mipmap levels and pass their ownership to the main queue.

Note that the PR doesn't include simultaneous access (VK_SHARING_MODE_CONCURRENT) between queues yet. I believe enabling it for buffers would only require minor changes to the spec, and possibly an extra flag in GPUBufferDescriptor. We will follow up on this.
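For illustration, here is a rough sketch of the intended usage from JavaScript. The way the second queue is obtained and the GPUResourceTransfer fields are hypothetical placeholders; neither is finalized in this PR:

// Hypothetical setup: queue discovery is still an open question.
const mainQueue = device.queue;
const streamQueue = device.streamingQueue; // placeholder name

// Upload a mip level on the streaming queue, handing the subresource
// over to the main queue as part of the same submission.
streamQueue.submit([uploadCommands], [
  { texture, mipLevel: 2, destination: mainQueue }, // hypothetical fields
]);
const uploadDone = streamQueue.signal();

// On the main queue: wait for the upload, then consume the texture.
mainQueue.wait(streamQueue, uploadDone);
mainQueue.submit([renderCommands]);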



@Kangz left a comment

Thanks for putting this in writing!

Comment on lines +6354 to +6356
undefined submit(
sequence<GPUCommandBuffer> commandBuffers,
optional sequence<GPUResourceTransfer> transfers = []);

Getting the signal value and waiting on it are separate operations on GPUQueue, as opposed to being associated with submissions specifically, because they also need to take into account writeBuffer and writeTexture.

What does the PR description mean when it says "because they also need to take into account writeBuffer and writeTexture?"

I would have expected that after submit returns control to the caller, the developer is free to use the resources in the transfer list on the receiving queue and the WebGPU implementation handles setting up the fences as needed to avoid undefined behavior. What is the advantage of having the developer do this themselves?


I think what's meant here is: "Getting signal values and waiting on signals are separate queue operations, rather than rolled into the submit() call, because writeBuffer and writeTexture also submit work to the queue. This avoids having to extend all three functions (submit, writeBuffer, and writeTexture) to return signal values and be able to wait on signals."
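In other words (a small sketch; variable names are illustrative):

queue.submit([commandBuffer]);
queue.writeBuffer(buffer, 0, data);  // also enqueues work on the queue timeline
const value = queue.signal();        // covers both the submit and the writeBuffer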

I would have expected that after submit returns control to the caller, the developer is free to use the resources in the transfer list on the receiving queue and the WebGPU implementation handles setting up the fences as needed to avoid undefined behavior. What is the advantage of having the developer do this themselves?

We shouldn't just implicitly insert a wait() on the receiving queue at whatever arbitrary time the app happens to issue the transferring submit on the sending queue. The app needs to have control over what point in the receiving queue the resources are received, otherwise the receiving queue can stall too early and go idle when it shouldn't have to.
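A sketch of the concern (workload names are placeholders):

// If the wait were implicit at transfer time, the effect would be as if
// the app had written this, stalling B's work too early:
queueB.wait(queueA, v);
queueB.submit(independentWork);              // now ordered after A's work
queueB.submit(workUsingTransferredResource);

// With an explicit wait, B stays busy and syncs only where needed:
queueB.submit(independentWork);              // overlaps with A's work
queueB.wait(queueA, v);
queueB.submit(workUsingTransferredResource);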


We shouldn't just implicitly insert a wait() on the receiving queue at whatever arbitrary time the app happens to issue the transferring submit on the sending queue. The app needs to have control over what point in the receiving queue the resources are received, otherwise the receiving queue can stall too early and go idle when it shouldn't have to.

I agree it would be sub-optimal to insert a wait on the receiving queue at the time of the transferring submit. Why can't we insert a wait the first time you submit a command list with the object to the receiving queue? Wouldn't we need to run the same logic to validate the developer hasn't missed a wait? Missing a wait would lead to undefined behavior.


Why can't we insert a wait the first time you submit a command list with the object to the receiving queue?

Because that would make the graph of synchronization implicit. Missing a wait would cause a validation error, not UB. As said in the meeting, I'll explain in more detail what the thought process was for this proposal.

spec/index.bs Outdated
Comment on lines 6347 to 6352
GPUSignalValue signal();
readonly attribute GPUSignalValue lastSignaledValue;

Promise<undefined> onCompletion(optional GPUSignalValue value);

GPUFence createFence(optional GPUFenceDescriptor descriptor = {});
undefined signal(GPUFence fence, GPUFenceValue signalValue);
undefined wait(GPUQueue queue, optional GPUSignalValue value);

I understand how the transfers parameter works as well as how signal, wait and onCompletion work on their own. However, I do not understand how these are meant to be used with one another, in particular the onCompletion method. I think a small bit of sample code would help clarify things.


onCompletion works just like GPUFence.onCompletion: it returns a promise that is resolved when the content timeline knows that a fence/queue reached the given signal value. It does not interact with transfers or anything like that.
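For instance, a minimal sketch of how signal, wait, and onCompletion might combine (queue and workload names are placeholders):

queueA.submit(frameCommands);
const v = queueA.signal();    // capture a point on queueA's timeline

queueB.wait(queueA, v);       // GPU-side: queueB's later work happens-after v
queueB.submit(dependentWork);

await queueA.onCompletion(v); // CPU-side: resolves once queueA reaches v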

@Kangz commented Nov 10, 2020

As promised in the last call, here's an explanation of the thought process behind this proposal. This explanation has some overlap with the Alternatives section of #1073.

Explicit synchronization.

The basis of this proposal is that it is important for the synchronization to be visible to developers. The multi-queue feature's goal is to expose the coarse-grained parallelism of the hardware to more advanced WebGPU users. These users will want to ensure that, if the hardware supports multiple queues, then (for example) the memory-bound workload on one queue runs in parallel with the fixed-function-bound workload on the other queue, because that maximizes utilization.

The DAG of execution of work on multiple queues needs to be under the direct control of the developer and easily visible in the code. This is why the approach starts from the assumption that transfers of resources have to be explicit and will produce validation errors if done wrong (instead of UB or additional implicit synchronization). Developers need the guarantee that the DAG of execution that's used is exactly the one they specified.

Explicit synchronization via "fences".

One possibility, like @RafaelCintron commented above or @kvark suggested in #1066, is that the operation of doing a transfer on a queue is in itself an "explicit synchronization", in that it is shown in the code. The receiving queue's following operations (or following operations that use the transferred resource) would be made to wait until the transferring queue has completed the commands preceding the transfer.

This semantic is sound, but it has several drawbacks.

The first one is that it adds constraints on the order in which the DAG is defined. All proposals need to require that the DAG be specified in a topological order, but having the resource transfers themselves define the synchronization further reduces the application's flexibility in the order in which it defines the DAG. Basically, if a developer wants to produce the following DAG:

queueA: -A1------A2--->
             \
              \
queueB: -B1------B2--->

Then they need to make the following calls:

queueB.submit(B1);
queueA.submit(A1, transferToB);

queueA.submit(A2);
queueB.submit(B2);

But it would be much more convenient for them to do the following, because it means the code for submitting to A and B can be more independent:

queueA.submit(A1, transferToB);
queueA.signal(42);
queueA.submit(A2);

queueB.submit(B1);
queueB.wait(queueA, 42);
queueB.submit(B2);

For composability it would be even better to be able to wait on a queue before signaling on another one, but that brings additional issues, so it isn't suggested in the proposal above.

Transfers being synchronization also makes the synchronization less visible. It hides one half of the synchronization, the receiving queue's half, inside the transfer command, making the synchronization invisible in the code that uses the receiving queue. It also buries the synchronization inside dictionaries in a list that's passed as one argument of a function call. The list of transfers could be built inside the deep rendering logic and bubbled up; then, in the code that does the submit, you'd see queue.submit(commands, transfers). Good luck figuring out what synchronization is implied!

Finally, some synchronization is useful even without transfers. As discussed above, the application may want to synchronize queues just to make sure that the workloads being executed in parallel have different bottlenecks. This could be done with fake transfers, but really the synchronization and the transfer parts should be more orthogonal.

So overall, having the transfers passed to submit do the synchronization is not enough. Synchronization should be done with GPUFence-like signal and wait primitives that build the DAG very explicitly, and transfers should build on that for validation.
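For example, the DAG from earlier can be built with no transfers at all, purely as scheduling edges (a sketch):

queueA.submit(A1);
const v = queueA.signal();  // point on queueA's timeline after A1
queueA.submit(A2);

queueB.submit(B1);          // runs in parallel with A1
queueB.wait(queueA, v);     // edge: everything after this happens-after A1
queueB.submit(B2);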

Merging GPUFence into GPUQueue.

The validation for transfers needs to ensure that there's a path in the execution DAG from the transfer to the use of the transferred resources. I have thought on and off for the past year about how to do this perfectly when using GPUFence. It's hard. The complexity comes from the fact that for each fence signal value we need to know what it implies for other fences (so that we can correctly handle chains of DAG edges). It might look simple if you decide to decay each (fence, value) pair to the underlying implementation's (queue, value) pair, but then how do you correctly handle situations where a queue signals multiple fences without operations in between: is that the same (queue, value) pair or different ones? Each of these paths leads to a lot of spec and implementation complexity.

In #1073 I gave up and proposed to make the transfers require synchronization through the same GPUFence, giving up on chains of DAG edges. It works and is "easily" implementable, but it is kind of surprising. The other option, which came up in a discussion with @kvark, was to have a single fence per queue, because then there isn't the complexity of propagating DAG edges from/to the same queue via multiple different GPUFence objects. Once you decide to have a single GPUFence per GPUQueue, it makes sense to merge them.

Merging GPUFence into GPUQueue has other benefits in addition to enabling this multi-queue proposal: it makes the API simpler (fewer concepts, less indirection), and it helps composability because there is a single execution serial per GPUQueue (instead of each middleware creating its own GPUFence). It also encodes at the type level the previous constraints that a GPUFence always had to be signaled by the same queue and with increasing values.

I don't see a drawback today in merging GPUFence into GPUQueue, and my intuition is that we aren't going to have issues with native API changes adding semantics to their native fence objects that we'd have trouble replicating.

@kvark mentioned this pull request Nov 10, 2020
@RafaelCintron

The first one is that it adds constraints on the order in which the DAG is defined.

For the example you gave, since the API is required to put the wait for A1 right before the submission of B2, you can submit the work as you have in the second section (labeled as the much more convenient one) even with implicit waits. How do explicit waits make things more flexible? It's already the case that you can't wait for a particular number before you signal that number, so I must be missing something.

Transfers being synchronization also makes the synchronization less visible ... good luck figuring out what synchronization is implied!

WebGPU already hides synchronization for buffer mapping and all memory barriers. What is it about multi-queue synchronization that you feel needs the synchronization to be explicit?

Finally, some synchronization is useful even without transfers. As discussed above, the application may want to synchronize queues just to make sure that the workloads being executed in parallel have different bottlenecks.

Can you give an example of a scenario or workload that would benefit from explicit transfers where implicit transfers would otherwise lead to sub-optimal execution or incorrect results?

@Kangz commented Nov 10, 2020

For the example you gave, since the API is required to put the wait for A1 right before the submission of B2, you can submit the work as you have in the second section (labeled as the much more convenient one) even with implicit waits. How do explicit waits make things more flexible? It's already the case that you can't wait for a particular number before you signal that number, so I must be missing something.

The point that wasn't very clear is that if the transfers are the synchronization, then the following code will schedule B1 after A1, so there are more constraints on your code's structure if you want to control the exact shape of the DAG. Alternatively, the WebGPU implementation could notice that the resource is only used in B2, but then it decides on the DAG in a manner that's not very predictable or explicit.

queueA.submit(A1, transferToB);
queueA.submit(A2);

queueB.submit(B1);
queueB.submit(B2);

WebGPU already hides synchronization for buffer mapping and all memory barriers. What is it about multi-queue synchronization that you feel needs the synchronization to be explicit?

Can you give an example of a scenario or workload that would benefit from explicit transfers where implicit transfers would otherwise lead to sub-optimal execution or incorrect results?

Actually, synchronization for buffer mapping is explicit; this is why we have mapAsync. Multi-queue needs explicit synchronization because developers need to control which parts run in parallel, to maximize utilization of the various parts of the GPU. See, for example, the "Async Post Processing" part of Doom 2016's Advances in Realtime Rendering deck.

@RafaelCintron

Multiqueue needs explicit synchronization because the developers need to control which parts are run in parallel to maximize utilization of the various parts of the GPU.

I agree that native APIs let you do powerful things, but that's because the developer knows more about their data dependencies than what they can express to the API. However, WebGPU is (by design) more limiting, because any graph you build must be a subset of what the validation layer is willing to let you create. In a sense, the validation layer's graph is the upper limit of the graph the web developer can build. For portability, that graph should be specced.

The WebGPU API knows what resources are owned by each queue and which ones are used in each command list. Does there exist a set of WebGPU instructions such that the developer can build a better graph than the API but still pass the API's validation and be free of data races? If so, that would be a good example to talk about.

Actually synchronization for buffer mapping is explicit, this is why we have mapAsync

Pedantic: Having buffer mapping use promises is one form of being explicit, yes. My point was that we didn't spec buffer mapping by having the web developer signal and wait on fence values. One can say that passing a list of resources to submit with the destination queue clearly spelled out is also a form of being explicit.

@Kangz commented Nov 10, 2020

The WebGPU API knows what resources are owned by each queue and which ones are used in each command list. Does there exist a set of WebGPU instructions such that the developer can build a better graph than the API but still pass the API's validation and be free of data races? If so, that would be a good example to talk about.

Yes, by making workloads that maximize different resources of the GPU at the same time. See the linked presentation, which says they specifically overlap shadow map generation with post-processing because one stresses the fixed-function hardware while the other stresses the ALUs. Explicit control helps achieve this level of detail; otherwise, developers who are interested in it will have to fake signal-wait synchronization with dummy resource transfers.

Pedantic: Having buffer mapping use promises is one form of being explicit, yes. My point was that we didn't spec buffer mapping by having the web developer signal and wait on fence values. One can say that passing a list of resources to submit with the destination queue clearly spelled out is also a form of being explicit.

I think the analogy is interesting but not too relevant, as buffer mapping is CPU-GPU synchronization (with the interesting part being how it is exposed in JavaScript), while multi-queue is GPU-GPU synchronization that's pipelined on an asynchronous coprocessor.

@kainino0x commented Nov 10, 2020

For composability it would be even better to be able to wait on a queue before signaling on another one, but that brings additional issues, so it isn't suggested in the proposal above.

As an aside, it occurs to me that if we made flushing buffered-up queue work (submits, writes, waits) explicit, with a flush() call (as was briefly discussed in the past), then we could allow out-of-order waits and validate in flush().

@kvark commented Nov 13, 2020

I see that @RafaelCintron raised exactly the concerns I had after reading @Kangz's long comment 🥇

I don't think explicit sync gives you any more composability or flexibility. You still have to submit things in order on the different queues, and in addition you have to care about fence values. What explicitness improves, however, is that you can now see the DAG by reading just the GPUQueue calls. You can reason about dependencies, and consequently it becomes easier both to read and to maintain multi-queue code. So it's a solid win for API quality.

@kdashg left a comment

If we drop the more-or-less unrelated Fence deletions here, I think the GPUSignalValue approach would have Approval from me!

spec/index.bs Outdated

GPUFence createFence(optional GPUFenceDescriptor descriptor = {});
undefined signal(GPUFence fence, GPUFenceValue signalValue);
undefined wait(GPUQueue queue, optional GPUSignalValue value);

Since q1.wait(q2) is just shorthand for q1.wait(q2, q2.lastSignaledValue), isn't that easy enough to require writing out, given that we expect a generally higher bar for multiqueue users? I think explicit is good here and it's barely even extra code to require the explicit version.
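That is, per this PR's IDL (where the value argument is optional), these two calls would be equivalent:

q1.wait(q2);                        // shorthand
q1.wait(q2, q2.lastSignaledValue);  // explicit equivalent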

@kvark (author)

I think it's also totally harmless to allow the shorter q1.wait(q2) to be possible.

: <dfn>wait(queue, value)</dfn>
::
Creates a {{GPUFence}}.
Waits for the |queue| reaching |value| on the [=Queue timeline=]. If |value| is undefined,

"Wait" sounds like it might cpu-block.
Maybe:

Synchronizes subsequent |this| queue operations (e.g. submit, writeBuffer) to happen-after |queue|'s queue operations, up to and including when |queue|.signal returned |value| (or smaller).


Actually we should consider renaming wait to synchronizeAfter for clarity.

@kvark (author)

I agree that the name wait may be a little misleading, even if it's specified to happen on the queue timeline.
In native APIs, vkQueueWaitIdle is more of "me waiting for the queue" than "the queue waiting for something else". Therefore, it seems to me that waitFor would make this clear.


+1 for waitFor

spec/index.bs Outdated
</div>

: <dfn>onCompletion(completionValue)</dfn>
: <dfn>onCompletion(signalValue)</dfn>

onSignalComplete?

@kainino0x Nov 14, 2020

Alternatively, maybe we should try to make this not sound like an event handler (with the word "on" at the front). I'll repost this comment on the relevant PR.

spec/index.bs Outdated
::
Returns a {{Promise}} that resolves once the fence's completion value &ge; `completionValue`.
Returns a {{Promise}} that resolves once the queue reaches `signalValue`
(or |this|.{{GPUQueue/lastSignaledValue}} if undefined) on the [=content timeline=].

I think this is better as requiring onCompletion(this.lastSignaledValue). The implicit version onCompletion() looks like it'll resolve when all pending queue operations have completed.

q.submit(A)
const v = q.signal()
q.submit(B)
await q.onCompletion()

Users would naively expect this to resolve after q.submit(B) is complete.
The explicit version of what those users want is simple and clear: q.onCompletion(q.signal())


Makes sense!

@kvark commented Nov 13, 2020

I brought back the fence code this PR was originally removing, as requested.

@kvark commented Nov 18, 2020

I believe this PR is sufficiently simple for both users and implementors to start prototyping. I don't think we should rush into trying to smooth things into being more automated/implicit until we get more feedback. After all, the current group has little-to-no experience with multiple GPU queues, so we shouldn't try too hard to make it "ideal" for users until we know more.
If there are no blockers by the next call, I suggest we proceed.

@RafaelCintron commented Nov 20, 2020

@kvark , I am definitely in favor of web developers explicitly transferring resources between queues via the transfers dictionary passed to submit.

However, I question the need to also require web developers to insert waits and signals. Since the WebGPU implementation will be fully validating waits and signals, preventing developers from waiting on a number that hasn't been signaled, I think it can do a good enough job for developers. The transfers dictionary already provides a level of explicitness that I feel is adequate.

If we need more explicit control, we can add it as a future change when we consider advanced features like (more) explicit memory barriers and bindless.

@kvark commented Nov 23, 2020

@RafaelCintron, thank you for the feedback. I think having transfers with implicit waits takes an interesting position in the solution space that may seem like a good compromise, but there are caveats. Here is the "explicitness ladder" in this space:

  1. Fully implicit transfers and waits. Suffers from the need to do a dummy submission on Vulkan, which reportedly is slow at least on mobile GPUs. Also, the DAG is not visible at all, making it hard to reason about the behavior of a program.
  2. Implicit waits with transfers that are hints to the implementation. If the actual usage diverges from the hint, e.g. a resource was hinted to transfer to queue A but ends up being used on queue B, then the implementation would have to create a dummy submission to A with the only purpose of receiving that resource and sending it to B. That would be a very unfortunate code path to have; I'm sure no users would want to hit it. Also, the DAG is only partially visible: we see what resources are getting transferred, but we don't see where the semaphore/fence waits are going to happen upon submissions (i.e. where the queues are synced up).
  3. Implicit waits with explicit transfers that are required. If the actual usage diverges, an error is generated, and the submission is dropped. This may or may not break the whole application. Again, the DAG is only partially visible.
  4. Fully explicit waits and transfers. If the actual usage diverges, an error is generated, and the submission is dropped. This may or may not break the whole application. The DAG is fully visible at the GPUQueue level.

So when we consider transfers without waits, we are talking about (2) or (3), both of which have several issues. Were you suggesting the transfers be hints (as in (2)) or strong requirements (as in (3))?
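For concreteness, here's a sketch of how (3) and (4) differ in code (the transfer dictionary fields are hypothetical, and the workloads are placeholders):

// (3) implicit waits: the implementation inserts the wait somewhere
// between the transfer and the first use; exactly where is not visible.
queueA.submit(produce, [{ resource: tex, destination: queueB }]);
queueB.submit(unrelated);  // may or may not already wait on queueA here
queueB.submit(consume);    // uses tex

// (4) fully explicit: the synchronization point is stated in the code.
queueA.submit(produce, [{ resource: tex, destination: queueB }]);
const v = queueA.signal();
queueB.submit(unrelated);  // overlaps with queueA's work
queueB.wait(queueA, v);
queueB.submit(consume);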

@RafaelCintron

So when we consider transfers without waits, we are talking about (2) or (3), both of which have several issues.

I am referring to (3): explicit transfers with errors (and therefore a dropped submission) when you use a resource that doesn't belong to the queue. The developer knows the 'sending' queue they specify is signaled and the 'receiving' queue waits.

@kvark commented Nov 23, 2020

@RafaelCintron

The developer knows the 'sending' queue they specify is signaled and the 'receiving' queue waits.

To clarify, the developer doesn't know when that wait is going to happen. Imagine that we are the developers, and we just made a submission to queue 'A' with resource transfer into queue 'B'. Then suppose we are making a submission to queue 'B'. Question: will there be a queue wait for 'A'?

In order to answer that question, we'll need to know if that resource is used in the submission we are making. If it's not used, then the wait will happen some time later (in one of the following submissions). But figuring out if it's used is very difficult:

GPUTexture <- GPUTextureView <- GPUBindGroup <- GPURenderBundleEncoder <- GPURenderPassEncoder <- GPUCommandBuffer

So in order to trace a usage of a texture, we as developers (or reviewers) would have to chase it through up to 5 levels of indirection (!). In other words, we can't expect anyone to reason about this with a high level of confidence. An explicit wait allows users to state their assumption and avoid guessing.
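A sketch of the chase (setup details elided): the texture is consumed several objects away from the submit call that actually uses it.

const view = texture.createView();
const bindGroup = device.createBindGroup({
  layout,
  entries: [{ binding: 0, resource: view }],
});
// ...the bind group is set in a render bundle, the bundle is executed in a
// render pass, and the pass is recorded into a command buffer...
queueB.submit([commandBuffer]); // does this use `texture`? Not visible here.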

@RafaelCintron

To clarify, the developer doesn't know when that wait is going to happen. Imagine that we are the developers, and we just made a submission to queue 'A' with resource transfer into queue 'B'. Then suppose we are making a submission to queue 'B'. Question: will there be a queue wait for 'A'?

You're correct. I was assuming the reason for the transfer is that the developer actually wants to use the resource on queue B.

So in order to trace a usage of a texture, we as developers (or reviewers) would have to chase it through up to 5 levels of indirection (!). In other words, we can't expect anyone to reason about this with a high level of confidence. An explicit wait allows users to state their assumption and avoid guessing.

I must be missing something, but I don't understand why developers need to do all of this 'chasing' and 'guessing'. If you transfer the resource to another queue and use it on that queue, the API will do the right thing to prevent data races. The same is true for memory barriers. Yes, you can chase and guess where the API puts memory barriers by doing an analysis similar to the one you did for signals and waits. Or you can trust that the API is doing it right.

The only persuasive argument I've seen for requiring waits and signals is when the developer can do a better job than WebGPU at putting them in the right place because (perhaps) they know more about the hardware than WebGPU does. But having developers optimize for one piece of hardware risks the "optimization" not working as well on all hardware, and we want good portability. This is why I suggest we consider adding explicit waits/signals as a follow-on feature when we come across those use cases in the wild.

@kdashg commented Nov 23, 2020

I think users can implement graph-checking code themselves if they have a need for it. I don't think we absolutely need to offer that in the core API here.

@Kangz commented Nov 24, 2020

@jdashg, the concerns I expressed yesterday weren't about validation. The problem is that multi-queue is an expert-level feature, and expert developers want to use it to control which operations are done in parallel on the multiple queues. For example, the Doom 2016 presentation describes using it to run shadow map generation and post-processing in parallel for "bottleneck maximization" (concept name courtesy of @kainino0x).

If we have implicit synchronization on the receiving end, then it is hard for developers to know that shadow maps, and only shadow maps, will run in parallel with the post-processing. If some other part of their render graph runs in parallel with the post-processing, it could kill the "bottleneck maximization" and lose the performance benefits.

For example, if the browser synchronizes the first time a resource is used on the receiving queue, the developers could make a change that happens to use a transferred resource in the previous submit, causing the synchronization to move earlier and silently breaking the bottleneck maximization. Another example: if the browser synchronizes on the next submit to the receiving queue (which is technically valid; nothing in the spec would say what exact execution graph you get), then it likewise silently produces an execution graph that breaks their assumptions (and the perf gains). That's why I strongly believe we need explicit synchronization on the receiving queue; otherwise the feature will have much reduced benefits, because it will silently lose the perf gains far too easily.
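A sketch of the first failure mode, under a "sync at first use" rule (workload names are placeholders):

queueA.submit(shadowMapPass, [{ resource: shadowTex, destination: queueB }]);

// Before: post-processing does not touch shadowTex, so the implicit wait
// lands just before the main pass and the intended overlap happens.
queueB.submit(postProcess);
queueB.submit(mainPass);    // first use of shadowTex: implicit wait here

// After an edit that samples shadowTex during post-processing, the implicit
// wait silently moves earlier and the overlap (and the perf win) is gone.
queueB.submit(postProcessUsingShadowTex);  // implicit wait now here
queueB.submit(mainPass);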

I think we had good discussions and a general direction for prototyping, but I would like to reiterate that I don't think multi-queue should be a required feature, or even specced as optional, for v1, because:

  1. it's clear now that what we have can be extended for multi-queue
  2. multi-queue is incredibly complicated to get right and implement, when other extensions would give much more bang for the buck (FP16, subgroups, even bindless)
  3. there are still a ton of things to figure out (queue discovery, concurrent access, ...) and it would delay the release of WebGPU v1 for marginal gains

@kainino0x added the multi-queue label Dec 16, 2020
@kvark added this to the post-MVP milestone Feb 1, 2021
@kainino0x

Closing stale PR; we will revisit this as a proposal when we get to looking back at the multi-queue label.

@kainino0x closed this Aug 25, 2022