Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388

@litherum

Background

WebGPU currently has two buffer upload facilities: GPUBuffer.mapAsync() and GPUQueue.writeBuffer().

For GPUBuffer.mapAsync(), there is currently the restriction that a mappable buffer cannot have any usage other than COPY (COPY_SRC for MAP_WRITE buffers, COPY_DST for MAP_READ buffers). This means that, in order to be useful, an application has to allocate two buffers: one for mapping and one for using. And if an application wants to round-trip data through a shader, it has to allocate three buffers: one for the upload, one for the download, and one for the shader. Therefore, in order to use mapAsync(), an application needs to double (or triple) its memory use and add one or two extra copy operations. On a UMA system, neither the extra allocation nor the copy is necessary, which means there's both a performance and a memory cost to using mapAsync() on those systems. What's more, because the application is explicitly writing this code, there's not really anything the implementation can do to optimize out the extra buffer allocation / copy operation.
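
For concreteness, here's a minimal sketch of the upload path this forces today (assuming a `device`, its `queue`, and a `data` Uint8Array whose length is a multiple of 4 already exist):

```ts
// The buffer the shader actually uses; today it cannot also be mappable.
const storageBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

// The extra buffer that exists only so we can map it.
const stagingBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});

await stagingBuffer.mapAsync(GPUMapMode.WRITE);
new Uint8Array(stagingBuffer.getMappedRange()).set(data);
stagingBuffer.unmap();

// The extra copy that a UMA system wouldn't need.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(stagingBuffer, 0, storageBuffer, 0, data.byteLength);
queue.submit([encoder.finish()]);
```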

On the other hand, GPUQueue.writeBuffer() is associated with a particular point in the queue's timeline, and therefore can be called even when the destination buffer is in use by the GPU. This means that the implementation of writeBuffer() is required to copy the data to an intermediate invisible buffer under the hood, even on UMA systems, and then schedule a copy operation on the queue to copy the data from the intermediate buffer to the final destination. This extra allocation and extra copy operation don't necessarily need to exist on UMA systems. (GPUQueue.writeBuffer() is a good API in general because of its simple semantics and ease of use, but it does have this drawback.)
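
In code, the convenience is that a single call suffices, even while the destination may be busy (reusing `queue`, `storageBuffer`, and `data` from the sketch above):

```ts
// One call; the implementation stages `data` in a hidden internal buffer and
// copies it to storageBuffer at this point in the queue's timeline, even if
// the GPU is still using the buffer right now.
queue.writeBuffer(storageBuffer, /* bufferOffset */ 0, data);
```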

It would be valuable if we could combine the best parts of GPUBuffer.mapAsync() and GPUQueue.writeBuffer() into something that doesn't require an extra allocation or copy on UMA systems. Such a combination couldn't be UMA-specific; it would have to work on both UMA and non-UMA systems, with UMA systems able to avoid the extra allocations/copies under the hood.

Goals

  1. The "async" part of GPUBuffer.mapAsync() would be valuable, because that allows the implementation to not have to stash any data due to the destination buffer being busy.
  2. The "map" part of GPUBuffer.mapAsync() would be valuable because it allows the array buffer to be backed directly by GPU memory, thereby potentially avoiding another copy on UMA systems.
  3. The "queue" part of GPUQueue.writeBuffer() would be valuable, because non-UMA systems would need to schedule an internal copy to the destination, and specifying the queue gives them a place to do that.

Proposal

I think the most natural solution to this would be:

  1. Give mapAsync() an extra GPUQueue argument. (getMappedRange() and unmap() would implicitly use this queue.) We could also say that the queue is optional, and that if it's unspecified, the device's default queue is used instead.
  2. Relax the requirement that the only other usage a mappable buffer can have is COPY.

That's it!
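
To make the shape of the proposal concrete, here's a hypothetical sketch of what usage could look like. The trailing queue argument and the MAP_WRITE | STORAGE usage combination are the proposal itself, not today's WebGPU:

```ts
// Hypothetical: usage flags beyond COPY are allowed on a mappable buffer (point 2).
const buffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
});

// Hypothetical signature: mapAsync() takes an extra GPUQueue argument (point 1);
// getMappedRange() and unmap() would implicitly use the same queue.
await buffer.mapAsync(GPUMapMode.WRITE, 0, data.byteLength, queue);
new Uint8Array(buffer.getMappedRange()).set(data);
buffer.unmap(); // on non-UMA, this schedules the internal copy on `queue`
```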

  • On a UMA system, you'd be able to map the destination (storage) buffer directly - No copies, no extra allocations, it's living the UMA dream.
    • For reading, mapAsync() would just ignore its GPUQueue argument.
    • For writing, mapAsync() would use its GPUQueue argument to schedule a clearBuffer() command of the relevant region of the buffer. After the clear operation is complete, the map promise would be resolved.
  • On a non-UMA system:
    • For reading, mapAsync() would schedule a copy from the source (storage) buffer to a temporary buffer using the specified GPUQueue, and the map operation would then proceed as normal on the temporary buffer. This is exactly what an author would have had to do themselves today.
    • For writing, mapAsync() would just stash the queue, map a temporary buffer, and wait for unmap() to be called. When unmap() is called, it would schedule a copy on the stashed queue from the temporary buffer to the destination buffer. Again, this is exactly what an author would have had to do themselves (see the sketch below).
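
As a rough sketch, a non-UMA implementation could back the write path along these lines. `allocateHostVisibleStaging`, `pendingWrites`, and `encodeCopy` are hypothetical internals used purely for illustration:

```ts
// Pseudocode sketch of a non-UMA implementation of the proposed write path.
async function mapAsyncForWrite(dst, offset, size, queue) {
  const staging = allocateHostVisibleStaging(size); // hypothetical internal
  pendingWrites.set(dst, { staging, offset, size, queue });
  // Nothing to wait for: the temporary buffer is free to map immediately.
  return staging.hostPointer;
}

function unmapForWrite(dst) {
  const { staging, offset, size, queue } = pendingWrites.get(dst);
  // The copy the author would otherwise have written by hand, scheduled
  // on the queue that was stashed at mapAsync() time.
  encodeCopy(queue, staging, 0, dst, offset, size);
}
```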

It's important to note that this proposal doesn't restrict the amount of control a WebGPU author has. If an author wants to allocate their own map/copy buffer and explicitly copy the data to/from it on its way to its final destination (as they would do today), they can still do that, and no invisible under-the-hood temporary buffers would be allocated.

This proposal also has a natural path forward for read/write mapping.
