Behold another data upload proposal! It comes from lengthy internal discussions with @austinEng (basically a simplification of his idea) and @kenrussell, with input from @kvark and @kainino0x (it's actually close to one of his very old proposals).
Assumptions
- All browsers are or will be multiprocess.
- To have a pit of success, buffers are either upload buffers, readback buffers, or device-local buffers. This means that `MAP_WRITE` can only be used with `COPY_SRC` and `MAP_READ` only with `COPY_DST`. Allowing mappable usages together with usages like `VERTEX` or `UNIFORM` means developers are likely to think "great, I can have all the usages at once", have it work OK on their Intel laptop but be much slower on discrete GPUs. Having all buffers mapped or mappable also adds more pressure to the OS, even on mobile. I think it's better to allow known universal fast paths (even if that prevents some last % of optimizations) and have an optional feature for UMA, or a more expert feature for reduced restrictions on mappable buffers. (See the sketch after this list for what the allowed usage combinations look like.)
- On a large number of systems it will be possible to synchronously create a shmem in the content process and send it to the GPU process, where it will be wrapped in a GPU resource. This is possible:
  - on D3D12 with `ID3D12Device3::OpenExistingHeapFromFileMapping`
  - on Metal with `MTLDevice newBufferWithBytesNoCopy:length:options:deallocator:`
  - on Vulkan with `VK_EXT_external_memory_host`
  - on Vulkan on Android with `AHardwareBuffer` magic when the extension is not supported.
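
To make the split concrete, here is a minimal sketch of the three kinds of buffers at creation time. The descriptor shape and `GPUBufferUsage` flag names are assumed from the WebGPU API sketches of the time, not something this proposal defines.

```ts
// Assumed WebGPU surface area (not part of this proposal):
declare const device: any;          // GPUDevice
declare const GPUBufferUsage: any;  // usage flag namespace

// Upload buffer: CPU writes into it, GPU copies out of it.
const upload = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});
// Readback buffer: GPU copies into it, CPU reads from it.
const readback = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
// Device-local buffer: never mapped, carries the "real" usages.
const vertices = device.createBuffer({
  size: 1024,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.VERTEX,
});
// Under this proposal, something like MAP_WRITE | VERTEX is a validation error.
```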
Proposal
`mapReadAsync` stays the same. `MAP_WRITE` (resp. `MAP_READ`) is only allowed with `COPY_SRC` (resp. `COPY_DST`). `GPUBuffer` now includes the following methods for mapping-related things:
```webidl
partial interface GPUBuffer {
    ArrayBuffer mapWrite(unsigned long offset = 0, unsigned long size = 0);
    Promise<ArrayBuffer> mapReadAsync();
    void unmap();
};
```
Calling `mapWrite` puts the buffer in the "mapped for writing" state. It's a validation error (and JS exception?) to `mapWrite` overlapping ranges of the buffer (until the next `unmap`). As usual it's an error to call `GPUQueue.submit` with a buffer in the mapped state referenced in the `commands` argument.
`mapWrite` returns a new `ArrayBuffer` that will replace the content of that range of the buffer when `unmap` is called. There are also some restrictions: a buffer can only be mapped on one thread and has to be unmapped on that same thread (that's eww compared to native :/, maybe we can find something better).
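
To illustrate the intended flow, here is a minimal sketch of an upload done with the proposed `mapWrite`. Everything except `mapWrite`/`unmap` (buffer creation, usage flag names, command encoding, the `queue` object) is assumed WebGPU surface area from around that time, not part of this proposal.

```ts
// Assumed to exist from regular WebGPU setup (not part of this proposal):
declare const device: any;          // GPUDevice
declare const queue: any;           // GPUQueue
declare const GPUBufferUsage: any;  // usage flag namespace
declare const vertexData: Float32Array;
declare const vertexBuffer: any;    // device-local buffer with COPY_DST | VERTEX

// Upload buffer: MAP_WRITE is only allowed together with COPY_SRC.
const staging = device.createBuffer({
  size: vertexData.byteLength,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});

// Synchronously map the whole range, fill it, and unmap.
const mapped = staging.mapWrite(0, vertexData.byteLength);
new Float32Array(mapped).set(vertexData);
staging.unmap(); // the written range now becomes the buffer's content

// Copy from the upload buffer into the device-local vertex buffer.
const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(staging, 0, vertexBuffer, 0, vertexData.byteLength);
queue.submit([encoder.finish()]);
```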
Possible implementation
An example implementation of this proposal could work as follows:
- On the first `mapWrite` after creation or after an `unmap` call, a shmem of the same size as the buffer is created and returned to JS. At the same time it is sent to the GPU process and wrapped in a GPU resource there that replaces the previous GPU resource associated with that `GPUBuffer`. (Replacing the resource is mostly OK because of the usage restrictions.)
- On unmap, a signal is sent to the GPU process that the data has been unmapped (for submit validation).

If the OS / driver doesn't support wrapping shmem in GPU resources, `unmap()` sends a list of regions from the shmem to update in the mapped buffer.
Imagine we are a really smart user agent that knows which sub-ranges of a buffer are in use on the client side; then we can skip creating a new shmem and reuse the existing one.
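
A very rough sketch of the content-process bookkeeping this implies is below. The `SharedMemory` type, `createSharedMemory`, and `sendToGpuProcess` are made-up placeholders for whatever shmem/IPC primitives a browser actually has; real code would also validate overlapping ranges and reuse shmems as discussed above.

```ts
// Hypothetical sketch of the content-process side of the implementation
// strategy above. All shmem/IPC primitives here are made-up placeholders.
interface SharedMemory {
  // Returns an ArrayBuffer view over [offset, offset + length) of the shmem.
  view(offset: number, length: number): ArrayBuffer;
}
declare function createSharedMemory(byteLength: number): SharedMemory;
declare function sendToGpuProcess(message: object): void;

class ClientGPUBuffer {
  private shmem: SharedMemory | null = null;

  constructor(readonly id: number, readonly byteLength: number) {}

  mapWrite(offset = 0, size = 0): ArrayBuffer {
    if (this.shmem === null) {
      // First mapWrite after creation or unmap: allocate a fresh shmem and
      // ask the GPU process to wrap it in a GPU resource that replaces the
      // previous resource backing this GPUBuffer.
      this.shmem = createSharedMemory(this.byteLength);
      sendToGpuProcess({ op: "replaceBacking", buffer: this.id, shmem: this.shmem });
    }
    const length = size === 0 ? this.byteLength - offset : size;
    // (Validation of overlapping mapped ranges is omitted here.)
    return this.shmem.view(offset, length);
  }

  unmap(): void {
    // Tell the GPU process the data is final so submits referencing this
    // buffer pass validation again. On drivers that can't wrap shmem, this
    // message would instead carry the list of regions to copy into the buffer.
    sendToGpuProcess({ op: "unmap", buffer: this.id });
    this.shmem = null; // the next mapWrite allocates a new shmem
  }
}
```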
Choices
- Does `mapWrite` allow specifying the full range, or only a subrange?
- Is the content of the buffer cleared after the first `mapWrite` call after creation or `unmap`, or is the buffer content preserved?
- More stuff I missed?
Comparison with writeBuffer
In the case where we can wrap shmem in a GPU resource, the number of copies for `writeBuffer` / `mapWrite` for JS and WASM is the following:
- JS `mapWrite`: `shmem/staging -> device-local`
- WASM `mapWrite`: `wasm -> shmem/staging -> device-local`
- JS `writeBuffer`: `data -> shmem/staging -> device-local`
- WASM `writeBuffer`: `data -> shmem/staging -> device-local`
When it's not possible to wrap shmem, the `shmem/staging` step becomes `shmem -> staging`, with one extra copy happening.
`mapWrite` in JS is the fastest path and it incurs a single copy after initialization of the data. All other paths need to copy from some already-initialized data somewhere into shmem and incur 2 copies. At some point I thought WASM `writeBuffer` would be better than WASM `mapWrite` but that turned out to not be the case.
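
Concretely, the single `wasm -> shmem/staging` copy on the WASM `mapWrite` path is just a copy from the WASM heap into the mapped `ArrayBuffer`; a short sketch, where `staging`, `wasmMemory`, `srcPtr`, and `byteLength` are assumed inputs:

```ts
// Assumed inputs for the sketch (not part of the proposal):
declare const staging: { mapWrite(offset?: number, size?: number): ArrayBuffer; unmap(): void };
declare const wasmMemory: WebAssembly.Memory;
declare const srcPtr: number;
declare const byteLength: number;

const mapped = staging.mapWrite(0, byteLength);
const src = new Uint8Array(wasmMemory.buffer, srcPtr, byteLength);
new Uint8Array(mapped).set(src); // wasm -> shmem/staging (the extra copy vs. JS)
staging.unmap();
// shmem/staging -> device-local happens in the subsequent copyBufferToBuffer.
```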
A reason for doing `writeBuffer` is simplicity, but I think that `mapWrite` is fairly understandable and can easily shim `writeBuffer` (more easily than `mapWriteAsync` for example).
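
For example, a naive shim of `writeBuffer` on top of the proposed `mapWrite` could look roughly like the sketch below. It allocates a throwaway staging buffer and does its own submit per call; a real shim would pool staging buffers and batch the copies with the app's submits. The destination buffer is assumed to have `COPY_DST`, and everything other than `mapWrite`/`unmap` is assumed WebGPU surface area.

```ts
// Naive writeBuffer shim on top of the proposed mapWrite.
declare const GPUBufferUsage: any; // usage flag namespace (assumed)

function writeBufferShim(device: any, queue: any, dst: any, dstOffset: number, data: ArrayBuffer): void {
  // Throwaway upload buffer; a real shim would pool and reuse these.
  const staging = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
  });
  new Uint8Array(staging.mapWrite(0, data.byteLength)).set(new Uint8Array(data));
  staging.unmap();

  // Record and submit the staging -> device-local copy.
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, dst, dstOffset, data.byteLength);
  queue.submit([encoder.finish()]);
}
```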