Survey of Nvidia RTX Technology
Russian Text © The Author(s), 2020, published in Programmirovanie, 2020, Vol. 46, No. 4.
*e-mail: [email protected]
**e-mail: [email protected]
***e-mail: [email protected]
Received December 25, 2019; revised January 9, 2020; accepted January 13, 2020
DOI: 10.1134/S0361768820030068
SANZHAROV et al.
Fig. 1. Performance ratio between primary and secondary rays, i.e., the factor by which performance drops when switching from primary to secondary rays.
Finally, there is a group of solutions aimed at developing hardware extensions for graphics processors or at developing similar massively parallel programmable systems. One of the first programmable solutions of this type was presented in [9]. Tree traversal and intersection calculations were implemented in a special block with fixed functionality, while user programs (shaders) were executed on the so-called Shader Processing Unit (SPU), which was very similar in architecture to early GPU processor cores. Like SaarCOR, the work [9] used packet tracing, which caused significant performance drops for rays diverging in different directions – also called incoherent rays. The same problem is observed in many GPU ray tracing implementations [10, 11].

One of the solutions to the random memory access problem was proposed in [12]. It involves dividing the memory request stream into at least two streams: the data stream for the rays (ray stream) and the data stream for the scene (the BVH tree, scene stream). It can be said that [12] expands on the traditional approach of hiding memory latency with deep pipelining, widely used in GPUs: once a treelet (a fragment of the BVH tree) is loaded into the cache, it is traversed by all the rays that are currently being processed on the GPU. The authors of [12] claim that in this way they manage to avoid random access.

For GPUs, in addition to incoherent rays, there is also the problem of an irregular distribution of work. When there are few active threads/rays in a SIMD thread group (warp), the efficiency of the SIMD GPU processor is reduced. To solve this problem, thread compaction and path regeneration were used in [10, 11], and a block regeneration technique was proposed in [13]. In [8, 10, 14], the authors used the idea of grouping the BVH tree into so-called treelets – small fragments of the BVH tree. The main difference of [14] is that treelets can store bounding volume data with a reduced precision of 5 bits per plane instead of the 32 bits of the standard floating-point type. This reduces the memory load and improves the performance of the GPU cache. In addition, the solution proposed in [14] is relatively cheap in terms of the occupied die area, i.e., the number of transistors used.

There are also solutions aimed at hardware implementation of ray tracing for mobile systems, where power consumption is an important parameter [15, 16]. These works target an implementation of classical ray tracing [17] and, unlike many of the works discussed above, use a MIMD architecture with VLIW processors to reduce energy and efficiency losses during calculations for diverging rays.

To summarize, many hardware implementations of ray tracing have been developed so far; a more complete review can be found in [18]. In addition, some commercial companies have also presented their solutions [19], although at present they are not publicly available. Thus, RTX is the first such technology available to the general public. But since this technology is closed, it is unclear what particular acceleration methods RTX uses. To understand this, we examined Nvidia RTX as a black box, conducting various experiments and measuring performance. For this purpose, we implemented a basic path tracing algorithm using the Vulkan interface for RTX.

2.1. Path Tracing on GPU

GPU ray tracing by itself is a concise and independent task that can be solved effectively in a variety of ways. However, the problem changes radically when it is necessary to build an extensible ray-tracing-based
Fig. 2. Test scenes. Sponza and Crysponza are low-detail scenes with predominantly rectangular geometry. San Miguel contains a lot of non-rectangular forms and is the closest to scenes found in practical applications. The Hairballs scene uses instancing intensively; its base mesh consists of geometric forms that are difficult for a BVH tree – thin hairs.
software system with a large number of different features while at least approximately maintaining the initial level of performance. This task is largely non-trivial even for CPU implementations, but on the GPU it requires special approaches. Currently, there are three general approaches:

1) "Uber-kernel" – an approach in which the code is organized, manually or automatically (usually the latter), in the form of a finite state machine inside one computational kernel. The state machine is used to reduce register pressure, since each state in the top-level switch operator gets all the registers available to the program (kernel) at its disposal. The main disadvantages of this approach are significant performance losses on branching (when different threads execute different states) and the influence of different states on each other's performance, because the kernel requires as many registers as its heaviest state [20, 21].

2) "Separate kernel" – an approach in which the code is organized (usually manually) in the form of several kernels communicating with each other explicitly through data buffers in memory [20]. This approach overcomes the main disadvantages of the uber-kernel and, thanks to the explicit division into kernels, allows maintaining the performance of critical sections of code. However, it has an increased development complexity due to the need for explicit data transfer, which is especially noticeable in the presence of sorting or compaction of threads [13]. In addition, there are increased overhead costs for launching kernels and waiting for them to finish execution, as well as for the data transfer itself. Therefore, this approach can slow
Table 1. Millions of rays per second at 1024 × 1024 resolution and 1 sample per pixel, GTX1070 GPU (software implementation of RTX by Nvidia)

Scene               Primary, MRays/sec.   Secondary, MRays/sec.   Tertiary, MRays/sec.
down the program in simple scenarios where the overhead becomes comparable to the useful work of the kernels.

3) "Wavefront path tracing" – a complex approach based on grouping work and data for rays into separate queues [21]. Queues are executed in different kernels, and the results are stored in memory, also by calling specific kernels. Thanks to the grouping by conditional shaders, wavefront path tracing has lower losses on branching than the previous approaches. However, sorting and compaction of threads are strictly required in this approach, so its overhead is even higher than in the case of the previous approach.

3. KNOWN DETAILS

Currently, RTX technology is available through hardware-software interfaces (Application Programming Interfaces, or APIs) such as DirectX12, Vulkan, and OptiX. The Vulkan API is of most interest to our study, as it was designed specifically to provide developers with the most transparent access to the functionality of GPUs at a low level. This approach differs, for example, from OptiX, in which Nvidia seeks to hide implementation details to make life easier for the application developer. As for DirectX12, a careful analysis reveals that Microsoft is adding some of its own functionality implemented in its HLSL compiler. Among the additional features available in DirectX12, it is worth mentioning the so-called "inline" ray tracing feature that appeared in the new version (called DXR Tier 1.1). This feature allows calling ray tracing functions in an arbitrary shader (pixel, compute, etc.) without creating a special ray tracing pipeline [22]. In this case, the calling code performs all the work necessary to use the results of ray tracing: calculations for ray intersections with one or another kind of primitive, ray misses, etc.

Based on this, we chose Vulkan as the main API for our experiments. Ray tracing in Vulkan is exposed as a separate type of pipeline along with the traditional graphics and compute pipelines. To use this pipeline, it is first necessary to build an acceleration structure in the form of a two-level tree. The lower level of the tree (Bottom Level Acceleration Structure, or BLAS) is built for individual objects (RTX supports user-defined geometric primitives) or meshes. The top level of the tree (Top Level Acceleration Structure, or TLAS) is built for instances of these objects/meshes. Regarding the construction of acceleration structures in RTX, the latest information was presented at the SIGGRAPH conference in 2019 [23].
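To make the two-level scheme concrete, here is a minimal CPU-side sketch (in Python, for illustration only) of the BLAS/TLAS layout described above: one bottom-level structure per mesh and a top level over its placed instances. The class and function names (`Blas`, `Instance`, `Tlas`) are ours, not Vulkan's; real acceleration structures are opaque, driver-built GPU objects. The sketch reduces each BLAS to its bounding box and performs only the top-level step of traversal, a ray/AABB slab test per instance:

```python
# Illustrative two-level acceleration structure: BLAS per mesh geometry,
# TLAS over placed instances of those BLASes. Not the Vulkan API.
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Blas:
    """Bottom level: geometry of one mesh, reduced here to its AABB."""
    aabb_min: Vec3
    aabb_max: Vec3

@dataclass
class Instance:
    """A placement of a BLAS in the scene: BLAS index plus a translation."""
    blas: int
    offset: Vec3

@dataclass
class Tlas:
    """Top level: all instances; traversal tests world-space AABBs."""
    blases: List[Blas]
    instances: List[Instance]

    def world_aabb(self, inst: Instance) -> Tuple[Vec3, Vec3]:
        b = self.blases[inst.blas]
        lo = tuple(a + o for a, o in zip(b.aabb_min, inst.offset))
        hi = tuple(a + o for a, o in zip(b.aabb_max, inst.offset))
        return lo, hi

    def hit_instances(self, origin: Vec3, direction: Vec3) -> List[int]:
        """Indices of instances whose world AABB the ray hits (slab test)."""
        hits = []
        for i, inst in enumerate(self.instances):
            lo, hi = self.world_aabb(inst)
            tmin, tmax = 0.0, float("inf")
            ok = True
            for o, d, l, h in zip(origin, direction, lo, hi):
                if abs(d) < 1e-12:
                    if o < l or o > h:  # ray parallel to slab and outside it
                        ok = False
                        break
                else:
                    t0, t1 = (l - o) / d, (h - o) / d
                    if t0 > t1:
                        t0, t1 = t1, t0
                    tmin, tmax = max(tmin, t0), min(tmax, t1)
                    if tmin > tmax:
                        ok = False
                        break
            if ok:
                hits.append(i)
        return hits

# One unit-cube BLAS instanced twice, in the spirit of the heavily
# instanced Hairballs scene (one base mesh, many placements).
tlas = Tlas(
    blases=[Blas((0, 0, 0), (1, 1, 1))],
    instances=[Instance(0, (0, 0, 0)), Instance(0, (5, 0, 0))],
)
hits = tlas.hit_instances(origin=(-1, 0.5, 0.5), direction=(1, 0, 0))
```

In a real renderer, each top-level hit would then descend into the corresponding BLAS to intersect the actual triangles; the point of the two levels is that moving or duplicating an object only touches the cheap TLAS, not the expensive per-mesh BLAS builds.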
Table 2. Millions of rays per second at 1024 × 1024 resolution and 1 sample per pixel, RTX2070 GPU (hardware-accelerated RTX implementation)

Scene               Primary, MRays/sec.   Secondary, MRays/sec.   Tertiary, MRays/sec.
Sponza, RTX         970                   534                     490
Sponza, Hydra       480                   122                     130
Crysponza, RTX      788                   386                     337
Crysponza, Hydra    276                   92                      80
San Miguel, RTX     286                   180                     151
San Miguel, Hydra   127                   48                      42
Hairballs, RTX      282                   238                     289
Hairballs, Hydra    61                    50                      56
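The rows above convert directly into the primary-to-secondary drop factor plotted in Fig. 1 (drop = primary rate / secondary rate); a quick check in Python:

```python
# Primary-to-secondary performance drop (the Fig. 1 metric) computed
# from the Table 2 rows: drop = primary MRays/sec / secondary MRays/sec.
table2 = {
    "Sponza, RTX": (970, 534, 490),
    "Sponza, Hydra": (480, 122, 130),
    "Crysponza, RTX": (788, 386, 337),
    "Crysponza, Hydra": (276, 92, 80),
    "San Miguel, RTX": (286, 180, 151),
    "San Miguel, Hydra": (127, 48, 42),
    "Hairballs, RTX": (282, 238, 289),
    "Hairballs, Hydra": (61, 50, 56),
}

drop = {name: primary / secondary
        for name, (primary, secondary, _) in table2.items()}
# Hardware RTX stays around a factor of 2 or less on every scene
# (e.g. Sponza: 970/534 ~ 1.82), while the Hydra software renderer
# drops by up to ~4x (Sponza: 480/122 ~ 3.9).
```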
Fig. 3. Comparison of the open-source path tracing implementation in HydraRenderer and the RTX path tracer on GTX1070 (left) and RTX2070 (right). In both images, the part to the left of the dashed lines shows performance for primary (coherent) rays; the part to the right of the dashed lines shows performance for secondary divergent rays.
generation shader, we spawned a random number of rays (10 to 40) and measured the drop in performance. Next, we conducted a similar experiment with ordinary computations, where some heavy computation (for example, Perlin noise evaluation) was likewise performed a random number of times, from 10 to 40 (Fig. 4a).

Experiment 3. In this experiment, our goal was to check for the presence of internal queues in RTX that transmit data between the various stages of the ray tracing pipeline. To do this, we sequentially increased the ray payload and measured the percentage drop in performance to understand at what point the data transfer becomes a bottleneck (Fig. 4b).

5. RESULTS

Result 1. Nvidia RTX is primarily aimed at accelerating random access to memory when tracing a large number of diverging rays. This conclusion follows from Fig. 3, on the right. On a small scene (Sponza), for primary (coherent) rays the hardware implementation of Nvidia RTX outperforms the open-source software implementation from [26] by no more than 2 times. However, for secondary rays this ratio reaches 5–6 times. In addition, on a heavy scene (Hairballs) RTX achieves the same 4–5 times advantage, and the fact that the acceleration is preserved on scenes where memory is a bottleneck confirms our assumption.

Result 2. RTX implements some sort of mechanism for ray grouping. This is confirmed by the analysis of ray tracing performance degradation presented in Fig. 1. One can notice the following: first, the hardware implementation (rtx2070, the first column in Fig. 1) has a significant lead over all software implementations and does not slow down by more than about 2 times on any test scene. Second, on the Hairballs scene, where ray grouping cannot help in principle due to the high complexity of the geometry, the hardware implementation and the open-source software implementations (the hydra2070 and hydra1070 columns), which do not perform ray grouping, behave identically and do not significantly lose performance. At the same time, the Nvidia RTX software implementation (gtx1070) demonstrates unexpected behavior: on the simple Sponza scene it is in the lead, but on the other, more complex scenes it is significantly outperformed by the open-source implementation. That is, it has a significantly higher percentage performance drop in the transition from primary to secondary rays.

This can be caused by one of two main reasons:

1) If the software implementation of RTX (on GTX1070) is made in the form of a monolithic kernel, then the performance advantage on a simple scene is a result of reduced overhead, since there is no need to transfer data between different kernels. At the same time, the losses on complex scenes are a direct consequence of the known shortcomings of uber-kernels [20, 21].

2) If the software implementation of RTX (on GTX1070) has the form of "wavefront path tracing", then without proper hardware support for work distribution this approach is apparently not effective enough.

We consider the first scenario more probable; however, since RTX is a closed-source technology, the second option cannot be completely ruled out.

Result 3. RTX implements some sort of internal mechanism for irregular work distribution. This
mechanism apparently works on a principle similar to "wavefront path tracing" [21]. This conclusion is confirmed by the following observation made during experiment 2: when we generated a random number of rays (from 10 to 40) for each pixel, we measured a 2-times performance drop compared to the scenario of consistently generating 10 rays per pixel. On the other hand, when we repeated the same experiment with Perlin noise computation, the performance drop was exactly 4 times, as it should be on the GPU, since all threads in a SIMD warp must wait for the slowest one to complete execution (Fig. 4a).

Result 4. RTX implements data transfer between different stages of the pipeline (i.e., different user programs) through queues on the chip. This is confirmed by the nature of the performance drop with increasing ray payload (Fig. 4b). It is further confirmed by the introduction of "Mesh shaders" in RTX hardware.

Result 5. Nvidia RTX is an extremely complex technology that is difficult to implement effectively in software. This conclusion is confirmed by the low efficiency of the RTX software implementation from Nvidia itself on the GTX1070 graphics card, which loses 2–3 times in performance to a simple open-source ray tracing software implementation (Table 1). The low performance in this case is probably the result of the high flexibility of the technology and the desire to make it as general as possible; without proper hardware support, such a complex implementation is slow.

6. CONCLUSIONS

Nvidia RTX technology is a fairly general mechanism combining various hardware functionalities, which can be used not only in ray tracing but also in other applications (see [24] as an example of such use). The main mechanisms used by RTX include: (1) arranging random memory access during the tracing of diverging rays and (2) a mechanism for GPU work creation, which includes (3) data transfer between different kernels through an on-chip cache. For the user, RTX greatly simplifies development and provides high flexibility. On the other hand, this technology significantly limits portability, since RTX is implemented as a separate type of pipeline in Vulkan, and code developed for RTX can hardly be used in any other way. This problem is partially solved in DirectX12 (DXR Tier 1.1) by the introduction of "inline" ray tracing, which allows the use of RTX in the "traditional" graphics/compute pipelines. However, the use of DirectX12 itself reduces portability even more.

REFERENCES

1. Meißner, M., et al., VIZARD II: a reconfigurable interactive volume rendering system, Proc. High-Performance Graphics on Graphics Hardware, ACM Eurographics, Eurographics Assoc., 2002, pp. 137–146.
2. Pfister, H., et al., The VolumePro real-time ray-casting system, in Computer Graphics and Interactive Techniques, New York: Association for Computing Machinery, 1999, pp. 251–260.
3. Schmittler, J., Wald, I., and Slusallek, P., SaarCOR: a hardware architecture for ray tracing, Proc. ACM Special Interest Group on Computer Graphics Conf. on Graphics Hardware, Eurographics Assoc., 2002, pp. 27–36.
4. Schmittler, J., et al., Realtime ray tracing of dynamic scenes on an FPGA chip, Proc. ACM SIGGRAPH/EUROGRAPHICS Conf. on Graphics Hardware, ACM, 2004, pp. 95–106.
5. Hall, D., The AR350: today's ray trace rendering processor, Proc. Eurographics/SIGGRAPH Workshop on Graphics Hardware – Hot 3D Session 1, Los Angeles, 2001.
6. Seiler, L., et al., Larrabee: a many-core x86 architecture for visual computing, ACM Trans. Graph., 2008, vol. 27, no. 3, art. 18.
7. Spjut, J., et al., TRaX: a multi-threaded architecture for real-time ray tracing, Proc. Symp. on Application Specific Processors, Institute of Electrical and Electronics Engineers, 2008, pp. 108–114.
8. Kopta, D., et al., An energy and bandwidth efficient ray tracing architecture, in High-Performance Graphics, ACM, 2013, pp. 121–128.
9. Woop, S., Schmittler, J., and Slusallek, P., RPU: a programmable ray processing unit for realtime ray tracing, ACM Trans. Graph., 2005, vol. 24, no. 3, pp. 434–444.
10. Aila, T. and Karras, T., Architecture considerations for tracing incoherent rays, in High-Performance Graphics, Eurographics Assoc., 2010, pp. 113–122.
11. Novák, J., Havran, V., and Dachsbacher, C., Path regeneration for interactive path tracing, Proc. EUROGRAPHICS, Norrköping, 2010, pp. 61–64.
12. Shkurko, K., et al., Dual streaming for hardware-accelerated ray tracing, in High Performance Graphics, ACM, 2017, p. 12.
13. Frolov, V.A. and Galaktionov, V.A., Low overhead path regeneration, Progr. Comput. Software, 2016, vol. 42, no. 6, pp. 382–387.
14. Keely, S., Reduced precision hardware for ray tracing, Proc. High-Performance Graphics, Los Angeles, 2014, pp. 29–40.
15. Nah, J.H., et al., RayCore: a ray-tracing hardware architecture for mobile devices, ACM Trans. Graph., 2014, vol. 33, no. 5, p. 162.
16. Lee, W.J., et al., SGRT: a mobile GPU architecture for real-time ray tracing, Proc. High-Performance Graphics, ACM, 2013, pp. 109–119.
17. Whitted, T., An improved illumination model for shaded display, ACM Spec. Interest Group Comput. Graph. Interact. Techn., 1979, vol. 13, no. 2, p. 14.
18. Deng, Y., et al., Toward real-time ray tracing: a survey on hardware acceleration and microarchitecture techniques, ACM Comput. Surv., 2017, vol. 50, no. 4, p. 58.
19. Imagination Technologies. PowerVR Ray Tracing, 2019. https://www.imgtec.com/graphics-processors/architecture/powervr-ray-tracing/
20. Frolov, V.A., Kharlamov, A.A., and Ignatenko, A.V., Biased solution of integral illumination equation via irradiance caching and path tracing on GPUs, Progr. Comput. Software, 2011, vol. 37, no. 5, pp. 252–259.
21. Laine, S., Karras, T., and Aila, T., Megakernels considered harmful: wavefront path tracing on GPUs, Proc. 5th High-Performance Graphics Conf. (HPG'13), New York: ACM, 2013, pp. 137–143.
22. Microsoft. DirectX DXR ray tracing specification, 2019. https://microsoft.github.io/DirectX-Specs/d3d/Raytracing.html
23. Viitanen, T., Acceleration data structure hardware (and software), Proc. SIGGRAPH 2019, Los Angeles, 2019.
24. Wald, I., et al., RTX beyond ray tracing: exploring the use of hardware ray tracing cores for tet-mesh point location, Proc. High-Performance Graphics 2019, Strasbourg, July 8–10, 2019.
25. Nvidia. Ray tracing developer resources, 2019. https://developer.nvidia.com/rtx/raytracing
26. Ray Tracing Systems, Keldysh Institute of Applied Mathematics, Moscow State University. HydraRenderer: open-source rendering system, 2019. https://github.com/Ray-Tracing-Systems/HydraAPI