Graphics Hardware
Tomas Akenine-Möller
Department of Computer Engineering
Chalmers University of Technology
Graphics hardware – why?
About 100x faster!
Another reason: about 100x faster!
Simple to pipeline and parallelize
There is currently only hardware for triangle
rasterization with texturing (e.g., OpenGL
acceleration)
Ray tracing: there are research architectures,
and one commercial product
– More to come!
Tomas Akenine-Möller © 2003
Today’s topics
The basics of ”perspective correct
texturing”
Background on graphics hardware
The architecture of the XBOX
The architecture of the KYRO
There is very little documentation on
graphics architectures…
Perspective-correct texturing
How are texture coordinates interpolated over a triangle?
Linearly?
[Figure: linear interpolation vs. perspective-correct interpolation]
Perspective-correct interpolation gives foreshortening effect!
Hardware does this for you, but you need to understand this anyway!
Recall the following, and then we
change notation a bit
Before projection, v, and after p (p=Mv)
After projection p_w is not 1!
Homogenization: (p_x/p_w, p_y/p_w, p_z/p_w, 1)
Rewrite and change notation to:
– w=p_w
– And instead use: (p_x w, p_y w, p_z w, w)
– After homogenization: (p_x, p_y, p_z, 1)
Also, remember that visible points, (p_x, p_y, p_z, 1),
are inside a unit-cube with corners (-1,-1,-1) and (1,1,1)
Texture coordinate interpolation
Linear interpolation does not work
Rational linear interpolation does:
– u(x)=(ax+b) / (cx+d)
– a,b,c,d are computed from triangle’s vertices (x,y,z,w,u,v)
Not really efficient
Smarter:
– Compute (u/w,v/w,1/w) per vertex
– These quantities can be linearly interpolated!
– Then at each pixel, compute 1/(1/w)=w
– And obtain: (w*u/w,w*v/w)=(u,v)
– The (u,v) are now interpolated perspective-correctly
Need to interpolate shading this way too
– Though, not as annoying as textures
Since linear interpolation now is OK, compute, e.g.,
Δ(u/w)/Δx, and use this to update u/w when stepping in
the x-direction (similarly for other parameters)
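The stepping scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scanline_u` and the one-dimensional scanline setup are made up for this example, and real hardware does this in fixed-function logic for all interpolated parameters.

```python
def scanline_u(u0, w0, u1, w1, x0, x1):
    """Perspective-correct u for every pixel from x0 to x1 (inclusive).

    u/w and 1/w are linear in screen space, so both are stepped with a
    constant per-pixel delta; the division happens once per pixel.
    """
    n = x1 - x0
    d_u_over_w = (u1 / w1 - u0 / w0) / n        # delta of u/w per step in x
    d_one_over_w = (1.0 / w1 - 1.0 / w0) / n    # delta of 1/w per step in x
    u_over_w = u0 / w0
    one_over_w = 1.0 / w0
    result = []
    for _ in range(n + 1):
        w = 1.0 / one_over_w                    # recover w at this pixel
        result.append(w * u_over_w)             # w * (u/w) = u
        u_over_w += d_u_over_w
        one_over_w += d_one_over_w
    return result


# Edge from u=0 (near, w=1) to u=1 (far, w=3), five pixels wide.
us = scanline_u(0.0, 1.0, 1.0, 3.0, 0, 4)
```

At the screen-space midpoint this gives u = 0.25 rather than the 0.5 that plain linear interpolation would produce, which is the foreshortening effect.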
Background:
Graphics hardware architectures
Evolution of graphics hardware has
started from the end of the pipeline
– Rasterizer was put into hardware first (most
performance to gain from this)
– Then the geometry stage
– Application will not be put into hardware!
Two major ways of getting better
performance:
– Pipelining
– Parallelization
– Combinations of these are often used
Briefly about pipelining
In GeForce3: 600-800 pipeline stages!
– 57 million transistors
– Pentium IV: 20 stages, 42 million transistors
Newer cards:
– Radeon 9700: 110M transistors
– GeForce FX 5800: 125 M transistors, 500 MHz
Ideally: n stages → n times throughput
– But latency increases!
– However, not a problem here
Chip runs at about 200 MHz (5 ns per clock)
5 ns * 700 = 3.5 μs
We have about 20 ms per frame (50 frames per second)
Graphics hardware is simpler to pipeline because:
– Pixels are (most often) independent of each other
– Few branches and much fixed functionality
– Don’t need high clock freq: bandwidth to memory is bottleneck
This is changing with increased programmability
– Simpler to predict memory access pattern (do prefetching!)
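The latency argument from this slide is plain arithmetic, sketched here in Python. The variable names are illustrative; the 700-stage figure is the midpoint of the 600-800 range quoted above.

```python
clock_hz = 200e6                 # chip clock: about 200 MHz
stages = 700                     # assumed pipeline depth (600-800 on the slide)
frames_per_second = 50

clock_period_s = 1.0 / clock_hz              # 5 ns per clock
latency_s = stages * clock_period_s          # time for one pixel to pass through
frame_time_s = 1.0 / frames_per_second       # 20 ms budget per frame
latency_fraction = latency_s / frame_time_s  # share of a frame spent as latency
```

The pipeline latency is about 3.5 μs, well under a thousandth of the frame time, which is why the extra latency of a deep pipeline is not a problem here.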
Parallelism
”Simple” idea: compute n results in parallel,
then combine results
GeForce FX 5800: 8 pixels/clock, 16 textures/clock
– With a pipeline of several 100 stages, there are many
pixels being processed simultaneously
Not always simple!
– Try to parallelize a sorting algorithm…
– But pixels are independent of each other, so simpler for
graphics hardware
Can parallelize both geometry and rasterizer:
Taxonomy of hardware
Need to sort from model space to screen
space
Gives four major architectures:
– Sort-first
– Sort-middle
– Sort-Last Fragment
– Sort-Last Image
Will describe these briefly, and then
focus on sort-middle and sort-last
fragment (used in commercial hardware)
Sort-First
Sorts primitives before geometry
stage
– Screen is divided into large regions
– A separate pipeline is responsible for each
region (or many)
G is geometry, FG & FM are parts of the rasterizer
– A fragment is all the generated information for a pixel on a
triangle
– FG is Fragment Generation (finds which pixels are inside
triangle)
– FM is Fragment Merge (merges the created fragments with
various buffers (Z, color))
Not explored much at all
Sort-Middle
Sorts between G and R
Pretty natural, since after G, we know the
screen-space positions of the triangles
Most hardware uses this!
– Examples include InfiniteReality (from SGI) and the
KYRO architecture (from Imagination)
Spread work arbitrarily among G’s
Then depending on screen-space position, sort to different R’s
– Screen can be split into ”tiles”. For example:
Rectangular blocks (8x8 pixels)
Every n scanlines
The R is responsible for rendering inside tile
A triangle can be sent to many FG’s depending on overlap
(over tiles)
Sort-Last Fragment
Sorts between FG and FM
XBOX uses this!
Again spread work among G’s
The generated work is sent to FG’s
Then sort fragments to FM’s
– An FM is responsible for a tile of pixels
A triangle is only sent to one FG, so this avoids
doing the same work twice
– Sort-Middle: If a triangle overlaps several tiles, then the
triangle is sent to all FG’s responsible for these tiles
– Results in extra work
Sort-Last Image
Sorts after entire pipeline
So each FG & FM has a separate
frame buffer for entire screen (Z and
color)
After all primitives have been sent to
pipeline, the z-buffers and color buffers are
merged into one color buffer
Can be seen as a set of independent pipelines
Huge memory requirements!
Used in research, but probably not
commercially
Memory bandwidth usage is huge!!
R is read, W is write, T is texture, Z is Z-buffer,
C is color buffer
Assuming 2 textures per pixel, and TR costs 24
bytes (trilinear MIP-mapping), the rest costs 32
bits (4 bytes)
A ”normal” pixel costs:
– ZR+ZW+CW+2*TR=60 bytes per pixel
At 60 fps, 1280x1024: about 4.5 GB/s
But a pixel is overwritten many times!
Overdraw=4 gives: about 18 GB/s !
Then assume DDR RAM at 300 MHz, 256 bits
per access: 9.6 GB/s
18>9.6 !!
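The estimate above can be reproduced with a short Python sketch. The constant names are illustrative. Note that the exact product comes out at roughly 4.7 GB/s in decimal units (about 4.4 GiB/s), which the slide rounds to 4.5; the conclusion does not change.

```python
Z_READ = 4            # bytes: one 32-bit Z-buffer read (ZR)
Z_WRITE = 4           # bytes: one 32-bit Z-buffer write (ZW)
COLOR_WRITE = 4       # bytes: one 32-bit color write (CW)
TEXTURE_READ = 24     # bytes: one trilinearly filtered texture read (TR)
TEXTURES_PER_PIXEL = 2

bytes_per_pixel = Z_READ + Z_WRITE + COLOR_WRITE + TEXTURES_PER_PIXEL * TEXTURE_READ

width, height, fps = 1280, 1024, 60
overdraw = 4

needed = bytes_per_pixel * width * height * fps   # bytes per second, no overdraw
needed_with_overdraw = needed * overdraw

# Memory: 300 MHz transfers, 256 bits (32 bytes) per access.
available = 300_000_000 * 256 // 8
```

With overdraw the demand is about twice what this memory configuration can deliver, even before accounting for less-than-perfect bus utilization.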
Memory bandwidth, cont’d
18>9.6
On top of that, bandwidth utilization is never
100%, and more textures, anti-aliasing,
etc. use up even more bandwidth
However, there are many techniques to
reduce bandwidth usage:
– Texture caching with prefetching
– Texture compression
– Z-compression
– Z-occlusion testing (HyperZ)
Z-occlusion testing and Z-compression
One way of reducing bandwidth
– ATI Inc. pioneered this with their HyperZ technology
Very simple, and very effective
Divide screen into tiles of 8x8 pixels
Keep a status memory on-chip
– Very fast access
– Stores additional information that this algorithm
uses
Enables occlusion culling on triangle
basis, z-compression, and fast Z-clears
Architecture of Z-cull and Z-compress
Store zmax per tile, and a flag (whether cleared,
compressed/uncompressed)
Rasterize one tile at a time
Test if zmin on triangle is farther away than tile’s zmax
– If so, don’t do any work for that tile!!!
– Saves texturing and z-read for entire tile – huge savings!
Otherwise, read compressed Z-buffer, & unpack
Write to unpacked Z-buffer, and when finished compress
and send back to memory, and also: update zmax
For fast Z-clears: just set a flag to ”clear” for each tile
– Then we don’t need to read from Z-buffer, just send cleared Z for
that tile
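The per-tile logic described on this slide can be sketched in Python as follows. This is a simplified model, not a description of HyperZ itself: the `TileStatus` layout, the function name, and the convention that larger z means farther away are all assumptions, and Z-compression is left out.

```python
from dataclasses import dataclass

TILE_PIXELS = 8 * 8   # 8x8 tiles, as on the slide
FAR = 1.0             # assumed depth of a cleared Z-buffer (larger = farther)


@dataclass
class TileStatus:
    """On-chip status for one tile (hypothetical layout)."""
    zmax: float = FAR
    cleared: bool = True


def rasterize_tile(status, tile_z, tri_z):
    """Depth-test one triangle against one tile.

    tile_z: the tile's 64 stored depths (stale while status.cleared is set).
    tri_z:  the triangle's 64 depths, None where it misses the pixel.
    Returns (depths, did_work); did_work is False when the tile was skipped.
    """
    covered = [z for z in tri_z if z is not None]
    if not covered or min(covered) > status.zmax:
        return tile_z, False              # culled: no Z read, no texturing
    if status.cleared:
        tile_z = [FAR] * TILE_PIXELS      # fast clear: nothing read from memory
        status.cleared = False
    new_z = [old if t is None or t >= old else t
             for old, t in zip(tile_z, tri_z)]
    status.zmax = max(new_z)              # update zmax when writing back
    return new_z, True


status = TileStatus()
depths = [FAR] * TILE_PIXELS
depths, drew_near = rasterize_tile(status, depths, [0.3] * TILE_PIXELS)
depths, drew_far = rasterize_tile(status, depths, [0.5] * TILE_PIXELS)
```

In this usage the first triangle (depth 0.3) fills the tile and lowers its zmax to 0.3, so the second triangle (depth 0.5) is rejected by a single comparison without touching the Z-buffer.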
The Xbox game console
Built by Microsoft
and NVIDIA
Is almost a PC:
– Pentium III, 733 MHz
– An extended GeForce3
Why a console then?
– It stays constant…
– You don’t have to care
about 20 different
graphics cards, and
CPUs from 100 MHz to
2GHz
Xbox is a UMA machine
UMA = unified memory architecture
– Every component in the system accesses the same
memory
We focus on the
GPU
Xbox Graphics Processing Unit
(GPU)
Supports programmable
vertex shaders
– No fixed-function geometry
stage
Is sort-last fragment
architecture
Rasterizer: handles four
pixels per clock
Runs at 250 MHz
Xbox geometry
stage
Dual vertex shaders
– Same vertex program
is executed on two
vertices in parallel
Vertex shader unit is a SIMD machine that operates
on 4 components at a time
– The point is that instead of a fixed-function geometry
stage, we now have full control over animation of vertices
and lighting etc.
Uses DMA (direct memory access), so that the
GPU fetches vertices directly from memory by itself!
Three different caches – for better performance!
Xbox geometry stage:
caches
Pre T&L (transform & lighting)
– Stores vertices fetched from memory
– The idea is to avoid redundant memory fetches
– A vertex is, on average, shared by 6 triangles
– Has 4 kbytes of storage
Post T&L cache:
– Avoid running vertex shader more than once for same vertex
– So it has storage for 16 transformed vertices
Primitive Assembly cache:
– A transformed vertex requires a lot of memory, and so it takes a
while to fetch a vertex from the Post T&L cache
– Can store 3 fully shaded vertices
– Is there to avoid fetches from Post T&L
Task of PA cache is to feed rasterizer with triangles
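To illustrate why a small Post T&L cache pays off, here is a minimal Python sketch that counts vertex shader runs for an indexed triangle list. The FIFO replacement policy and the function name are assumptions for illustration; the slides only state that the cache holds 16 transformed vertices.

```python
from collections import deque


def count_shader_runs(indices, cache_size=16):
    """Count vertex shader invocations with a FIFO cache of transformed vertices."""
    cache = deque(maxlen=cache_size)   # oldest entry is dropped when full
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1                  # cache miss: run the vertex shader
            cache.append(index)
    return runs


# Two triangles sharing the edge (1, 2).
quad = [0, 1, 2, 2, 1, 3]
runs_with_cache = count_shader_runs(quad)
runs_tiny_cache = count_shader_runs(quad, cache_size=1)
```

With the 16-entry cache the shared vertices are shaded once each (4 runs instead of 6); with a single-entry cache almost every reference misses.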
Xbox rasterizer
First block: triangle setup
(TS) and FG
TS computes various
deltas (see ”Texture coordinate interpolation”) and
other startup info
This block also does Z-occlusion testing
FG generates fragments inside triangles
– Tests 2x2 pixels at a time, and forwards these to the four pipelines that follow
– Note: near edges, not all pixels are inside triangles, and therefore 0-3
pipelines may be idle
– There are many strategies on how to find which fragments are inside triangle,
but exactly how this is done on the XBOX is not known
Xbox rasterizer
Sorting is done after FG
– Sort-last fragment arch.
First: 2 texture units
– Can be run twice → 4 texture lookups
RC (register combiners) operate on the filtered texel
values from TX and on the interpolated shading over the
triangle (programmable too)
– Can be used for bump mapping, for example
Finally, the results from the TXs, RCs, shading interpolation, and fog
interpolation are merged into a final color for that pixel
Xbox rasterizer:
Fragment merge
The combiner produced a final color for the
pixel on a triangle
FM merges this with:
– Color in color buffer (alpha blending)
– Z-buffer (depth test)
– Stencil testing
– Alpha testing
Z-compression and decompression is handled
here as well
Writes final color over the system memory bus
Xbox texture swizzling
A technique to improve usage of locality
in textures
– Not likely that we will access texels in a linear
fashion (i.e., one scanline at a time)
– Use swizzling instead
Assume (u,v)=(u_{n-1}…u_1 u_0, v_{n-1}…v_1 v_0)
– u_i and v_i are bits i of u and v
Linear (normal): (width*v+u)*bytes_per_color
Instead: (u_{n-1} v_{n-1} … u_1 v_1 u_0 v_0)*bytes_per_color
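The two addressing schemes can be compared with a small Python sketch. The function names and the `bits` parameter (number of bits per coordinate) are illustrative; the bit order follows the formula above, with each u bit placed just above the matching v bit.

```python
def linear_address(u, v, width, bytes_per_color=4):
    """Scanline-order address of texel (u, v)."""
    return (width * v + u) * bytes_per_color


def swizzled_address(u, v, bits, bytes_per_color=4):
    """Address with the bits of u and v interleaved: u_{n-1} v_{n-1} ... u_0 v_0."""
    index = 0
    for i in range(bits):
        index |= ((v >> i) & 1) << (2 * i)        # v_i goes to bit 2i
        index |= ((u >> i) & 1) << (2 * i + 1)    # u_i goes to bit 2i+1
    return index * bytes_per_color


# The 2x2 texel block touched by one bilinear lookup at the texture origin,
# in a 4x4 texture (2 bits per coordinate).
block = [(0, 0), (1, 0), (0, 1), (1, 1)]
linear = sorted(linear_address(u, v, width=4) for u, v in block)
swizzled = sorted(swizzled_address(u, v, bits=2) for u, v in block)
```

Here the swizzled addresses of the block are four consecutive texels (0, 4, 8, 12), while the linear addresses are split across two scanlines (0, 4, 16, 20), and that gap grows with texture width.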
Xbox texture swizzling
This access technique gives the following pattern (4 bytes/color)
[Figure: swizzled access pattern, captioned ”Almost a Hilbert curve”]
This is a space-filling curve, and those are often designed so that
coherency usage is improved
Example: bilinear filtering
Xbox conclusion
(Almost) a PC with great graphics
hardware
Sort-last fragment architecture
2 vertex shaders
4 pixel pipelines @ 250 MHz
Programmable per pixel as well
One of the best consoles right now…
– Not for long though
KYRO – a different architecture
Based on cost-effective PowerVR architecture
Tile-based
– For KYRO II: 32x16 pixels
Fundamental difference
– For entire scene, do this:
– Find all triangles inside each tile
– Render all triangles inside tile
Advantage: can implement temporary color,
stencil, and Z-buffer in fast on-chip memory
Saves memory and memory bandwidth!
– Claims to save 2/3 of bandwidth compared to traditional
architecture (without Z-occlusion testing)
KYRO architecture overview
CPU sends triangle data to KYRO II
Tile Accelerator (TA)
– Need an entire scene before ISP and TSP blocks can start
– So TA works on the next image, while ISP and TSP work
on the current image (i.e., they work in a pipelined fashion)
– TA sorts triangles, and creates a list of triangle pointers for
each tile (for tris inside tile)
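The sorting step of the Tile Accelerator can be sketched in Python as below. This is a minimal model under assumptions: overlap is decided by the triangle's bounding box (conservative, so a tile may list a triangle that only comes close to it), triangles are given as screen-space (x, y) tuples, and the function name is made up. How the real TA tests overlap is not stated in the slides.

```python
import math

TILE_W, TILE_H = 32, 16   # KYRO II tile size from the slides


def bin_triangles(triangles, screen_w, screen_h):
    """Return {(tile_x, tile_y): [triangle indices]} for screen-space triangles."""
    tiles_x = math.ceil(screen_w / TILE_W)
    tiles_y = math.ceil(screen_h / TILE_H)
    tile_lists = {}
    for tri_id, verts in enumerate(triangles):
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Range of tiles covered by the bounding box, clamped to the screen.
        tx0 = max(math.floor(min(xs)) // TILE_W, 0)
        tx1 = min(math.floor(max(xs)) // TILE_W, tiles_x - 1)
        ty0 = max(math.floor(min(ys)) // TILE_H, 0)
        ty1 = min(math.floor(max(ys)) // TILE_H, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                tile_lists.setdefault((tx, ty), []).append(tri_id)
    return tile_lists


scene = [
    [(1, 1), (10, 1), (1, 10)],       # fits inside tile (0, 0)
    [(20, 5), (40, 5), (30, 20)],     # bounding box spans four tiles
]
tile_lists = bin_triangles(scene, screen_w=64, screen_h=32)
```

Each per-tile list stores only triangle indices (standing in for the triangle pointers mentioned above), so a triangle that overlaps several tiles is referenced several times but stored once.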
KYRO
Tile accelerator:
– When all triangles for the entire scene are sorted into tiles, the
TA can send tile data to the next block, the ISP
– And the TA then continues on the next frame’s sorting in
parallel
Image synthesis processor (ISP):
– Implements Z-buffer, color buffer, stencil buffer for tile
– And occlusion culling (similar to Z-occlusion testing)
Test 32 pixels at a time against Z-buffer
Records which pixels are visible
– Groups pixels with same texture and sends to TSP
These are guaranteed to be visible, so we only texture each pixel
once
KYRO: TSP
Texture and Shading Processor (TSP):
– Handles texturing and shading interpolation
Has two pipelines that run in parallel
– 2 pixels per clock
Can use 8 textures at most
– Is implemented by ”looping” in TSP
– I.e., not full speed
Texture data is fetched from local memory
Supersampling: 2x1, 1x2, and 2x2
– Renders a larger image and filters and scales down
– For 2x2: Need only 4x the size of tile (or rather, render 4x
as many tiles, i.e., need not 4x memory)
KYRO: pros and cons
Uses a small amount of very fast memory
– Reduces bandwidth greatly
– Reduces frame buffer memory greatly
But more local memory is needed
– For tile sorting
– Amount of local memory places a limit on how many
triangles can be rendered
– 3 MB can handle a little over 30,000 triangles
Design is parallel
– Add more pipelines that can handle the rest of the
architecture that follows the Tile Accelerator
– But bottleneck will (likely) move, and so not sure how
much can be gained
Challenges for the future
Continue to push the frontier of ”normal”
graphics hardware
– How long can the ”2x performance per 6
months” keep up?
– Keep adding new features…
– Next generation is expected to be massively
programmable, both at vertices and at pixels
– Another goal is to make rendering more realistic
Do this by developing new algorithms for the
programmable hardware
Challenges for the future
Design a new architecture targeted for global
illumination
Very few have focused on ”ray tracing”- based
algorithms so far
It is time now…
Would be nice with:
– Rapid intersection testing of curved surfaces in hardware
– Rapid traversal of spatial data structure
– Handling of very large scenes
Standard graphics hardware handles this quite well because a
triangle can be discarded once it has been rendered
Ray tracing-based algorithms cannot do this, because they render
shadows and reflections and therefore need to know about nearby
geometry
– Photon mapping…
Challenges for the future
Design really small architectures with
really scarce resources
– Little chip area
– Little memory
– Little bandwidth
So that it can be used in mobile devices,
e.g., PalmPilots, phones, etc.
Graphics hardware conclusion
Possible to build great hardware for standard
triangle rendering
– Reasons: pixel independence, parallelism, pipelining,
etc.
Ray tracing-based hardware will come
– It has been shown that commodity graphics hardware
can be used for ray tracing
– See paper by Tim Purcell et al., SIGGRAPH 2002
Not sure what will happen in the future, but it
will happen pretty fast
– ”it will be utterly fantastic”