Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388

@litherum

Background

WebGPU currently has two buffer upload facilities: GPUBuffer.mapAsync() and GPUQueue.writeBuffer().

For GPUBuffer.mapAsync(), there is currently the restriction that a mappable buffer cannot have any usage other than COPY (COPY_SRC for MAP_WRITE buffers, COPY_DST for MAP_READ buffers). This means that, in order to be useful, an application has to allocate two buffers: one for mapping and one for using. And if an application wants to round-trip data through a shader, it has to allocate three buffers: one for the upload, one for the download, and one for the shader. Therefore, in order to use mapAsync(), an application needs to double (or triple) its memory use and add one or two extra copy operations. On a UMA system, neither the extra allocation nor the copy is necessary, which means there's both a performance and a memory cost to using mapAsync() on those systems. What's more, because the application is explicitly writing this code, there's not really anything the implementation can do to optimize out the extra buffer allocation / copy operation.
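
For concreteness, here's a minimal sketch of the upload path this forces today (assuming a `device`, its `queue`, and a `data` Uint8Array whose length is a multiple of 4 already exist):

```ts
// The buffer the shader actually uses; today it cannot also be mappable.
const storageBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

// The extra buffer that exists only so we can map it.
const stagingBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});

await stagingBuffer.mapAsync(GPUMapMode.WRITE);
new Uint8Array(stagingBuffer.getMappedRange()).set(data);
stagingBuffer.unmap();

// The extra copy that a UMA system wouldn't need.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(stagingBuffer, 0, storageBuffer, 0, data.byteLength);
queue.submit([encoder.finish()]);
```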

On the other hand, GPUQueue.writeBuffer() is associated with a particular point in the queue's timeline, and therefore can be called even when the destination buffer is in use by the GPU. This means that the implementation of writeBuffer() is required to copy the data to an intermediate invisible buffer under the hood, even on UMA systems, and then schedule a copy operation on the queue to copy the data from the intermediate buffer to the final destination. This extra allocation and extra copy operation don't necessarily need to exist on UMA systems. (GPUQueue.writeBuffer() is a good API in general because of its simple semantics and ease of use, but it does have this drawback.)
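
In code, the convenience is that a single call suffices, even while the destination may be busy (reusing `queue`, `storageBuffer`, and `data` from the sketch above):

```ts
// One call; the implementation stages `data` in a hidden internal buffer and
// copies it to storageBuffer at this point in the queue's timeline, even if
// the GPU is still using the buffer right now.
queue.writeBuffer(storageBuffer, /* bufferOffset */ 0, data);
```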

It would be valuable if we could combine the best parts of GPUBuffer.mapAsync() and GPUQueue.writeBuffer() into something that doesn't require an extra allocation or copy on UMA systems. Such a combination couldn't be UMA-specific; it would have to work on both UMA and non-UMA systems, with UMA systems able to avoid the extra allocations/copies under the hood.

Goals

  1. The "async" part of GPUBuffer.mapAsync() would be valuable, because that allows the implementation to not have to stash any data due to the destination buffer being busy.
  2. The "map" part of GPUBuffer.mapAsync() would be valuable because it allows the array buffer to be backed directly by GPU memory, thereby potentially avoiding another copy on UMA systems.
  3. The "queue" part of GPUQueue.writeBuffer() would be valuable, because non-UMA systems would need to schedule an internal copy to the destination, and specifying the queue gives them a place to do that.

Proposal

I think the most natural solution to this would be:

  1. Give mapAsync() an extra GPUQueue argument. (getMappedRange() and unmap() would implicitly use this queue.) We could also say that the queue is optional, and that if it's unspecified, the device's default queue is used instead.
  2. Relax the requirement that the only other usage a mappable buffer can have is COPY.

That's it!
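
To make the shape of the proposal concrete, here's a hypothetical sketch of what usage could look like. The trailing queue argument and the MAP_WRITE | STORAGE usage combination are the proposal itself, not today's WebGPU:

```ts
// Hypothetical: usage flags beyond COPY are allowed on a mappable buffer (point 2).
const buffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.STORAGE,
});

// Hypothetical signature: mapAsync() takes an extra GPUQueue argument (point 1);
// getMappedRange() and unmap() would implicitly use the same queue.
await buffer.mapAsync(GPUMapMode.WRITE, 0, data.byteLength, queue);
new Uint8Array(buffer.getMappedRange()).set(data);
buffer.unmap(); // on non-UMA, this schedules the internal copy on `queue`
```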

  • On a UMA system, you'd be able to map the destination (storage) buffer directly - No copies, no extra allocations, it's living the UMA dream.
    • For reading, mapAsync() would just ignore its GPUQueue argument.
    • For writing, mapAsync() would use its GPUQueue argument to schedule a clearBuffer() command of the relevant region of the buffer. After the clear operation is complete, the map promise would be resolved.
  • On a non-UMA system:
    • For reading, mapAsync() would schedule a copy from the source (storage) buffer to a temporary buffer using the specified GPUQueue, and the map operation would then proceed as normal on the temporary buffer. This is exactly what an author would have had to do themselves today.
    • For writing, mapAsync() would just stash the queue, map a temporary buffer, and wait for unmap() to be called. When unmap() is called, it would schedule a copy on the stashed queue from the temporary buffer to the destination buffer. Again, this is exactly what an author would have had to do themselves (see the sketch below).
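
As a rough sketch, a non-UMA implementation could back the write path along these lines. `allocateHostVisibleStaging`, `pendingWrites`, and `encodeCopy` are hypothetical internals used purely for illustration:

```ts
// Pseudocode sketch of a non-UMA implementation of the proposed write path.
async function mapAsyncForWrite(dst, offset, size, queue) {
  const staging = allocateHostVisibleStaging(size); // hypothetical internal
  pendingWrites.set(dst, { staging, offset, size, queue });
  // Nothing to wait for: the temporary buffer is free to map immediately.
  return staging.hostPointer;
}

function unmapForWrite(dst) {
  const { staging, offset, size, queue } = pendingWrites.get(dst);
  // The copy the author would otherwise have written by hand, scheduled
  // on the queue that was stashed at mapAsync() time.
  encodeCopy(queue, staging, 0, dst, offset, size);
}
```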

It's important to note that this proposal doesn't restrict the amount of control a WebGPU author has. If an author wants to allocate their own map/copy buffer and explicitly copy the data to/from it on its way to its final destination (as they would do today), they can still do that, and no invisible under-the-hood temporary buffers would be allocated.

This proposal also has a natural path forward for read/write mapping.
