Motivation
Tile-based deferred renderers (TBDR) are popular in mobile devices. These types of renderers split up the framebuffer into tiles, and for each tile, run the fragment shading stage for all triangles that touch that tile. The benefit is that a tile can fit inside on-chip memory, meaning that all pixel operations operate on the on-chip memory as a cache. This cache only has to be transferred to/from main memory once per tile, rather than all pixel operations operating directly on main memory. This matters because mobile devices have much less memory bandwidth relative to their compute resources than desktop devices do.
Consider an application that has two distinct fragment shaders which need to communicate on a per-pixel basis (i.e. the rendering of a particular pixel relies on the previously shaded value of that same pixel). This kind of architecture is common; the deferred shading technique uses a first pass to populate a buffer with geometry information, and a second pass to compute lighting for each pixel from the geometry computed in the previous pass. The most natural way to encode such an architecture would be to use two render passes, one for each fragment shader / destination texture pair. However, between these two render passes, the rendered data must be flushed out to main memory and then read back in again for the second shader. This data transfer is costly on TBDRs, and unnecessary, because the data is simply round-tripping through main memory.
If this data never had to touch main memory, there would be a performance benefit. However, there's a memory benefit, too: if no data ever ends up actually hitting main memory for a resource, that resource doesn't actually need any memory backing it. We can simply avoid allocating a backing store for that resource in device memory. This can be significant, as a large & fat g-buffer can be tens or even hundreds of megabytes.
I made a benchmark ChipMemorySpeed.zip to try to measure the performance effects. It compares the speed of three realizations of the above architecture:
- Direct Resource Writes: Two passes: The first pass's fragment shader explicitly writes into a texture (as in, the shader itself executes a `write()` call) and the second one reads it
- Render Targets: Two passes: The first pass renders into a texture as a render target, and the second one reads it
- Imageblocks: One pass: The first fragment shader writes into Metal Imageblock storage which lives on-chip, and the second fragment shader reads from it. This avoids the round-trip through main memory.
The test times how fast every pixel of a 4k × 4k texture gets filled using each of the above data flows on an iPhone XS. I found that, for a thin intermediate buffer (1 byte per pixel), there was no difference between the Render Targets and Imageblocks approaches. However, for a fatter intermediate buffer (16 bytes per pixel), here are the results:
The results show that keeping data on-chip can result in a 40% performance improvement. Most g-buffers are fatter than 16 bytes per pixel, so we can expect the performance improvement to be even greater with real-world renderers.
These results match intuition. Render targets get access to specialized hardware that dumps an entire imageblock to memory at once, so that approach is expected to be faster than directly writing pixel-by-pixel into the destination image from the shader. Imageblock data doesn't have to go to main memory at all, so it is expected to be faster than either of the other two options.
D3D12
D3D12 recently added support for render passes in Windows 10 build 1809. The docs list two main benefits:
- Being able to specify load and store operations, which can avoid render target loads and stores
- Being able to make two distinct render passes behave as if they were a single render pass, even if they are in different command lists. (The two command lists don't have to both call `OMSetRenderTargets()` anymore.)
The first item is something we already have in WebGPU (`GPULoadOp` and `GPUStoreOp`). I'm not sure that the second item is really relevant to this investigation. I couldn't find any other controls for on-chip memory.
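For concreteness, here is a minimal sketch (C++) of how load and store behavior is declared with the D3D12 render pass API on an `ID3D12GraphicsCommandList4`; the helper function and names are illustrative, not from any particular codebase:

```cpp
#include <d3d12.h>

// Hypothetical helper: records one render pass over a single render target, declaring
// its load behavior ("clear") and store behavior ("discard") up front.
// Requires an ID3D12GraphicsCommandList4 (Windows 10 1809+).
void RecordPass(ID3D12GraphicsCommandList4* cmdList, D3D12_CPU_DESCRIPTOR_HANDLE rtv)
{
    D3D12_RENDER_PASS_BEGINNING_ACCESS begin = {};
    begin.Type = D3D12_RENDER_PASS_BEGINNING_ACCESS_TYPE_CLEAR;   // load behavior: clear on-chip, no load from memory
    begin.Clear.ClearValue.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    begin.Clear.ClearValue.Color[3] = 1.0f;                       // clear to opaque black

    D3D12_RENDER_PASS_ENDING_ACCESS end = {};
    end.Type = D3D12_RENDER_PASS_ENDING_ACCESS_TYPE_DISCARD;      // store behavior: contents need not reach memory

    D3D12_RENDER_PASS_RENDER_TARGET_DESC rt = {};
    rt.cpuDescriptor   = rtv;
    rt.BeginningAccess = begin;
    rt.EndingAccess    = end;

    // D3D12_RENDER_PASS_FLAG_SUSPENDING_PASS / _RESUMING_PASS are how two command lists
    // can stitch their passes together into one logical pass.
    cmdList->BeginRenderPass(1, &rt, nullptr, D3D12_RENDER_PASS_FLAG_NONE);
    // ... draw calls ...
    cmdList->EndRenderPass();
}
```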
Vulkan
Vulkan has the concept of "subpasses" within a single render pass. Each subpass can render to a different render target, but all the render targets within a render pass must have the same dimensions. A fragment shader inside one of the subpasses is allowed to read, at its own pixel's location, the output of a previously-executed subpass by using a special function in the shader: `subpassLoad()`. This function doesn't take an `(x, y)` location, because it's only allowed to read from its own pixel's location.
At the time the render pass is created, the author has to specify a plan for how many subpasses there will be, which passes have a dependency on which other passes, and which render targets each pass will write to and read from. Given this information, a TBDR can decide how best to use the limited memory available on the chip. For example, if all the render targets fit on-chip, the implementation is free to keep them there, but if they don't all fit, the implementation can choose the best time within the computation graph to "spill" to device memory. The implementation is free to spill as often or as rarely as it wants, and is even free to cause `subpassLoad()` to read from memory rather than from on-chip memory. This means the implementation is essentially performing a register allocation analysis over the render pass, just as a compiler would.
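As a sketch of what that up-front plan looks like (C++ with the Vulkan C API; the function name, and the assumption that the two attachment descriptions are filled in elsewhere, are mine), here is a render pass with two subpasses where the second reads the first's output as an input attachment:

```cpp
#include <vulkan/vulkan.h>

// Attachment 0 is the intermediate g-buffer, attachment 1 is the final color target.
// Subpass 1 reads subpass 0's per-pixel output via an input attachment (subpassLoad()).
VkRenderPass MakeTwoSubpassRenderPass(VkDevice device,
                                      const VkAttachmentDescription attachments[2])
{
    VkAttachmentReference gbufferWrite{0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};
    VkAttachmentReference gbufferRead {0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL};
    VkAttachmentReference colorWrite  {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

    VkSubpassDescription subpasses[2] = {};
    subpasses[0].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[0].colorAttachmentCount = 1;
    subpasses[0].pColorAttachments    = &gbufferWrite;

    subpasses[1].pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS;
    subpasses[1].inputAttachmentCount = 1;
    subpasses[1].pInputAttachments    = &gbufferRead;
    subpasses[1].colorAttachmentCount = 1;
    subpasses[1].pColorAttachments    = &colorWrite;

    // The dependency is BY_REGION: subpass 1 only needs subpass 0's result for the same
    // pixel, which is what lets a TBDR keep the intermediate data on-chip.
    VkSubpassDependency dep = {};
    dep.srcSubpass      = 0;
    dep.dstSubpass      = 1;
    dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
    dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;

    VkRenderPassCreateInfo info = {VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO};
    info.attachmentCount = 2;
    info.pAttachments    = attachments;
    info.subpassCount    = 2;
    info.pSubpasses      = subpasses;
    info.dependencyCount = 1;
    info.pDependencies   = &dep;

    VkRenderPass renderPass = VK_NULL_HANDLE;
    vkCreateRenderPass(device, &info, nullptr, &renderPass);
    return renderPass;
}
```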
Aside: Making the Vulkan runtime perform register selection is surprisingly un-Vulkan-y. I would have expected them to just expose every last detail of the hardware and force applications to figure out how best to use it.
It's also possible to get the memory benefits by specifying `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT` at resource creation time. Because it's not guaranteed when the implementation will or will not spill to memory, this flag isn't a guarantee; sometimes it will result in a real allocation and sometimes it won't.
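A minimal sketch of such a transient attachment (again C++; the dimensions and format are placeholders):

```cpp
#include <vulkan/vulkan.h>

// A transient intermediate attachment: it is only ever written in one subpass and read
// (as an input attachment) in the next, so it is a candidate for never being backed by
// real memory.
VkImageCreateInfo MakeTransientGBufferDesc(uint32_t width, uint32_t height)
{
    VkImageCreateInfo info = {VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO};
    info.imageType     = VK_IMAGE_TYPE_2D;
    info.format        = VK_FORMAT_R8G8B8A8_UNORM;
    info.extent        = {width, height, 1};
    info.mipLevels     = 1;
    info.arrayLayers   = 1;
    info.samples       = VK_SAMPLE_COUNT_1_BIT;
    info.tiling        = VK_IMAGE_TILING_OPTIMAL;
    info.usage         = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
                         VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT |
                         VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT;
    info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    // Bind its memory from a heap advertising VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT;
    // the implementation may then defer, shrink, or entirely skip the allocation.
    return info;
}
```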
Subpasses exist on all devices, not just TBDR devices. Presumably this is because the details of how they are executed are abstracted away from the machine itself.
Metal
Metal gives application authors fine-grained control over exactly how to use on-chip memory. It exposes this memory with imageblocks. Imageblocks are only available on some iOS and tvOS hardware. A particular fragment shader invocation may only access the imageblock data associated with that particular fragment location.
Imageblocks come in two flavors: implicit and explicit imageblocks. Metal application developers were already using implicit imageblocks without knowing it; when you output to multiple render targets (MRT) and annotate the outputs with `[[color(0)]]`, those fields live in an imageblock.
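As a sketch of the shader side of this (Metal Shading Language, a C++-based dialect; the g-buffer layout and function names are illustrative, not from the benchmark above), a second fragment shader in the same render pass can declare `[[color(n)]]` inputs and read what the first shader wrote to the same pixel without the data ever leaving the tile:

```cpp
#include <metal_stdlib>
using namespace metal;

// First shader's MRT output: each [[color(n)]] field lives in the (implicit) imageblock.
struct GBufferOut {
    half4 albedo        [[color(0)]];
    half4 packed_normal [[color(1)]];
};

fragment GBufferOut gbuffer_fragment(/* vertex inputs omitted for brevity */)
{
    GBufferOut out;
    out.albedo        = half4(1.0h, 0.0h, 0.0h, 1.0h);  // placeholder material
    out.packed_normal = half4(0.5h, 1.0h, 0.5h, 0.0h);  // normal encoded into [0, 1]
    return out;
}

// Second shader, later in the same render pass: reads this pixel's g-buffer values
// straight out of tile memory via [[color(n)]] inputs (programmable blending).
fragment half4 lighting_fragment(half4 albedo        [[color(0)]],
                                 half4 packed_normal [[color(1)]])
{
    half3 n     = normalize(packed_normal.xyz * 2.0h - 1.0h);
    half  ndotl = max(dot(n, half3(0.0h, 1.0h, 0.0h)), 0.0h);
    return half4(albedo.rgb * ndotl, 1.0h);
}
```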
Explicit imageblocks work by annotating an input and output struct with `[[imageblock_data]]`. This indicates that the memory backing those fields will live on the chip. This data isn't associated with any render target or any resource; it just lives on-chip. The imageblock lives for the lifetime of the render pass. If two fragment shaders within the same render pass both include structs marked as `[[imageblock_data]]`, then there's a process which matches the fields from one to the other. The format of an imageblock can change throughout the render pass: different structs annotated with `[[imageblock_data]]` can be present as the input and the output of a fragment shader, and the different structs are allowed to have different fields. This way, the author is in complete control over what on-chip memory gets used for during every draw call and whether or not to spill to main memory at any particular time.
A render pass has to know how much imageblock memory will be consumed by any fragment shader. Luckily, the pipeline state object produced when compiling the shader includes a getter for how much imageblock memory that particular pipeline requires, which can be supplied to the render pass. Apple publishes tables of exactly how much on-chip memory is available for each device that supports imageblocks. Currently, this is 32KB for every device that supports them.
Explicit imageblocks don't require any backing resource, so they get the memory savings automatically. Implicit imageblocks support the memory savings by specifying `MTLResourceOptions.storageModeMemoryless` at resource creation time. This isn't a hint; it means the resource will actually not have any payload memory. If you use one of these, you're not allowed to load from or store to it.
OpenGL ES (just for fun)
EXT_shader_pixel_local_storage works almost exactly like Metal's imageblocks.