Graphics Hardware

Tomas Akenine-Möller
Department of Computer Engineering
Chalmers University of Technology
Graphics hardware – why?
• About 100x faster!
• Another reason: about 100x faster!
• Simple to pipeline and parallelize
• There is currently only hardware for triangle rasterization with texturing (e.g., OpenGL acceleration)
• Ray tracing: there are research architectures, and one commercial product
  – More to come!



Today’s topics
• The basics of ”perspective-correct texturing”
• Background on graphics hardware
• The architecture of the XBOX
• The architecture of the KYRO
• There is very little documentation on graphics architectures…



Perspective-correct texturing
• How are texture coordinates interpolated over a triangle?
• Linearly?
  (Figure: linear interpolation vs. perspective-correct interpolation)
• Perspective-correct interpolation gives the foreshortening effect!
• Hardware does this for you, but you need to understand it anyway!



Recall the following, and then we change notation a bit
• Before projection, v, and after projection, p (p = Mv)
• After projection, pw is not 1!
• Homogenization: (px/pw, py/pw, pz/pw, 1)
• Rewrite and change notation to:
  – w = pw
  – And instead use: (px·w, py·w, pz·w, w)
  – After homogenization: (px, py, pz, 1)
• Also, remember that visible points, (px, py, pz, 1), are inside a unit cube: (-1,-1,-1) → (1,1,1)
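
To make the notation concrete, here is a minimal C sketch (not from the slides; the matrix layout and all names are assumptions for illustration) of projecting a vertex with a 4x4 matrix and then homogenizing it, keeping w around because it is needed for the interpolation discussed next.

    #include <stdio.h>

    typedef struct { float x, y, z, w; } Vec4;

    /* p = M*v: multiply a column vector by a row-major 4x4 matrix */
    static Vec4 project(const float M[4][4], Vec4 v) {
        Vec4 p;
        p.x = M[0][0]*v.x + M[0][1]*v.y + M[0][2]*v.z + M[0][3]*v.w;
        p.y = M[1][0]*v.x + M[1][1]*v.y + M[1][2]*v.z + M[1][3]*v.w;
        p.z = M[2][0]*v.x + M[2][1]*v.y + M[2][2]*v.z + M[2][3]*v.w;
        p.w = M[3][0]*v.x + M[3][1]*v.y + M[3][2]*v.z + M[3][3]*v.w;
        return p;
    }

    /* Homogenization: (px/pw, py/pw, pz/pw, 1) */
    static Vec4 homogenize(Vec4 p) {
        Vec4 h = { p.x / p.w, p.y / p.w, p.z / p.w, 1.0f };
        return h;
    }
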
Texture coordinate interpolation
• Linear interpolation does not work
• Rational linear interpolation does:
  – u(x) = (ax+b) / (cx+d)
  – a, b, c, d are computed from the triangle’s vertices (x,y,z,w,u,v)
• Not really efficient
• Smarter:
  – Compute (u/w, v/w, 1/w) per vertex
  – These quantities can be linearly interpolated!
  – Then at each pixel, compute 1/(1/w) = w
  – And obtain: (w*u/w, w*v/w) = (u,v)
  – The (u,v) are then perspective-correct interpolated
• Need to interpolate shading this way too
  – Though, not as annoying as textures
• Since linear interpolation now is OK, compute, e.g., ∆(u/w)/∆x, and use this to update u/w when stepping in the x-direction (similarly for the other parameters)
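
Below is a minimal C sketch of the ”smarter” scheme above: (u/w, v/w, 1/w) are linearly interpolated, and (u,v) are recovered per pixel with one divide. The struct and function names are assumptions for illustration, not the hardware’s actual implementation.

    /* Per-vertex quantities that may be linearly interpolated in screen space */
    typedef struct { float u_over_w, v_over_w, one_over_w; } PerspAttr;

    static PerspAttr lerp_attr(PerspAttr a, PerspAttr b, float t) {
        PerspAttr r;
        r.u_over_w   = a.u_over_w   + t * (b.u_over_w   - a.u_over_w);
        r.v_over_w   = a.v_over_w   + t * (b.v_over_w   - a.v_over_w);
        r.one_over_w = a.one_over_w + t * (b.one_over_w - a.one_over_w);
        return r;
    }

    /* Per pixel: w = 1/(1/w), then (u,v) = (w*u/w, w*v/w) */
    static void recover_uv(PerspAttr p, float *u, float *v) {
        float w = 1.0f / p.one_over_w;
        *u = w * p.u_over_w;
        *v = w * p.v_over_w;
    }
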
Background: Graphics hardware architectures
• The evolution of graphics hardware has started from the end of the pipeline
  – The rasterizer was put into hardware first (most performance to gain from this)
  – Then the geometry stage
  – The application will not be put into hardware!
• Two major ways of getting better performance:
  – Pipelining
  – Parallelization
  – Combinations of these are often used
Briefly about pipelining
• In GeForce3: 600–800 pipeline stages!
  – 57 million transistors
  – Pentium IV: 20 stages, 42 million transistors
• Newer cards:
  – Radeon 9700: 110M transistors
  – GeForce FX 5800: 125M transistors, 500 MHz
• Ideally: n stages → n times the throughput
  – But latency increases!
  – However, not a problem here:
    – The chip runs at about 200 MHz (5 ns per clock)
    – 5 ns * 700 stages = 3.5 µs of latency
    – We have about 20 ms per frame (50 frames per second)
• Graphics hardware is simpler to pipeline because:
  – Pixels are (most often) independent of each other
  – Few branches and much fixed functionality
  – High clock frequencies are not needed: bandwidth to memory is the bottleneck
    – This is changing with increased programmability
  – It is simpler to predict the memory access pattern (do prefetching!)
Parallelism
• ”Simple” idea: compute n results in parallel, then combine the results
• GeForce FX 5800: 8 pixels/clock, 16 textures/clock
  – With a pipeline of several hundred stages, there are many pixels being processed simultaneously
• Not always simple!
  – Try to parallelize a sorting algorithm…
  – But pixels are independent of each other, so it is simpler for graphics hardware
• Can parallelize both the geometry stage and the rasterizer:



Taxonomy of hardware
• Need to sort from model space to screen space
• Gives four major architectures:
  – Sort-first
  – Sort-middle
  – Sort-last fragment
  – Sort-last image
• Will describe these briefly, and then focus on sort-middle and sort-last fragment (used in commercial hardware)
Sort-First
• Sorts primitives before the geometry stage
  – The screen is divided into large regions
  – A separate pipeline is responsible for each region (or several)
• G is geometry; FG and FM are parts of the rasterizer
  – A fragment is all the generated information for a pixel on a triangle
  – FG is Fragment Generation (finds which pixels are inside the triangle)
  – FM is Fragment Merge (merges the created fragments with the various buffers (Z, color))
• Not explored much at all
Sort-Middle
• Sorts between G and R
• Pretty natural, since after G we know the screen-space positions of the triangles
• Most hardware uses this!
  – Examples include InfiniteReality (from SGI) and the KYRO architecture (from Imagination)
• Spread work arbitrarily among the G’s
• Then, depending on screen-space position, sort to different R’s
  – The screen can be split into ”tiles”. For example:
    – Rectangular blocks (8x8 pixels)
    – Every n scanlines
  – Each R is responsible for rendering inside its tile
  – A triangle can be sent to many FG’s depending on how many tiles it overlaps

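A minimal C sketch of the tile overlap just described: a conservative way for a sort-middle stage to decide which FG’s a triangle must be sent to is to enumerate the tiles overlapped by its screen-space bounding box. The names, the 8x8 tile size, and the bounding-box test are assumptions for illustration.

    #include <stdio.h>

    #define TILE_SIZE 8   /* e.g., 8x8-pixel rectangular blocks */

    typedef struct { float x, y; } Vec2;

    static float min3(float a, float b, float c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }
    static float max3(float a, float b, float c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

    /* Assumes non-negative screen coordinates; real hardware also clips to the screen. */
    static void bin_triangle(Vec2 v0, Vec2 v1, Vec2 v2) {
        int tx0 = (int)min3(v0.x, v1.x, v2.x) / TILE_SIZE;
        int ty0 = (int)min3(v0.y, v1.y, v2.y) / TILE_SIZE;
        int tx1 = (int)max3(v0.x, v1.x, v2.x) / TILE_SIZE;
        int ty1 = (int)max3(v0.y, v1.y, v2.y) / TILE_SIZE;
        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                printf("send triangle to the FG responsible for tile (%d,%d)\n", tx, ty);
    }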


Sort-Last Fragment
• Sorts between FG and FM
• The XBOX uses this!
• Again, spread work among the G’s
• The generated work is sent to the FG’s
• Then sort fragments to the FM’s
  – An FM is responsible for a tile of pixels
• A triangle is only sent to one FG, so this avoids doing the same work twice
  – Sort-middle: if a triangle overlaps several tiles, the triangle is sent to all FG’s responsible for these tiles
  – This results in extra work



Sort-Last Image
• Sorts after the entire pipeline
• So each FG & FM pair has a separate frame buffer for the entire screen (Z and color)
• After all primitives have been sent through the pipeline, the Z-buffers and color buffers are merged into one color buffer
• Can be seen as a set of independent pipelines
• Huge memory requirements!
• Used in research, but probably not commercially
Memory bandwidth usage is huge!!
• Notation: R is read, W is write, T is texture, Z is Z-buffer, C is color buffer
• Assume 2 textures per pixel; a texture read (TR) costs 24 bytes (trilinear mipmapping), and the other accesses cost 32 bits (4 bytes) each
• A ”normal” pixel costs:
  – ZR + ZW + CW + 2*TR = 4 + 4 + 4 + 2*24 = 60 bytes per pixel
• At 60 fps, 1280x1024: about 4.5 GB/s
• But a pixel is overwritten many times!
• Overdraw = 4 gives: 18 GB/s!
• Then assume DDRAM at 300 MHz, 256 bits per access: 9.6 GB/s
• 18 > 9.6 !!
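
The slide’s numbers can be reproduced with a quick back-of-the-envelope calculation; the C sketch below simply mirrors the assumptions above (60 bytes per pixel, 1280x1024 at 60 fps, overdraw of 4, DDRAM at 300 MHz with 256-bit accesses). Small differences from the slide come from rounding.

    #include <stdio.h>

    int main(void) {
        const double bytes_per_pixel   = 4 + 4 + 4 + 2 * 24;  /* ZR+ZW+CW+2*TR = 60 */
        const double pixels_per_frame  = 1280.0 * 1024.0;
        const double frames_per_second = 60.0;
        const double overdraw          = 4.0;

        double needed          = bytes_per_pixel * pixels_per_frame * frames_per_second;
        double needed_overdraw = needed * overdraw;
        double available       = 300e6 * (256.0 / 8.0);       /* 300 MHz, 32 bytes per access */

        printf("needed, no overdraw : %.1f GB/s\n", needed / 1e9);
        printf("needed, overdraw=4  : %.1f GB/s\n", needed_overdraw / 1e9);
        printf("available (DDRAM)   : %.1f GB/s\n", available / 1e9);
        return 0;
    }
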
Memory bandwidth, cont’d
• 18 > 9.6
• On top of that, bandwidth usage is never 100%, and we may also use more textures, anti-aliasing, etc., which eats even more bandwidth
• However, there are many techniques to reduce bandwidth usage:
  – Texture caching with prefetching
  – Texture compression
  – Z-compression
  – Z-occlusion testing (HyperZ)
Z-occlusion testing and Z-compression
• One way of reducing bandwidth
  – ATI pioneered this with their HyperZ technology
• Very simple, and very effective
• Divide the screen into tiles of 8x8 pixels
• Keep a status memory on-chip
  – Very fast access
  – Stores the additional information that this algorithm uses
• Enables occlusion culling on a triangle basis, Z-compression, and fast Z-clears
Architecture of Z-cull and Z-compress
• Store zmax per tile, and a flag (whether cleared, compressed/uncompressed)
• Rasterize one tile at a time
• Test if the triangle’s zmin is farther away than the tile’s zmax
  – If so, don’t do any work for that tile!!!
  – Saves texturing and Z-reads for the entire tile – huge savings!
• Otherwise read the compressed Z-buffer and unpack it
• Write to the unpacked Z-buffer; when finished, compress it and send it back to memory, and also update zmax
• For fast Z-clears: just set a flag to ”clear” for each tile
  – Then we don’t need to read from the Z-buffer, just send cleared Z for that tile
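
A minimal C sketch of the per-tile zmax test described above (an illustration of the idea, not ATI’s actual HyperZ implementation; the names and the ”larger z is farther away” convention are assumptions).

    #include <stdbool.h>

    #define TILE_W 8
    #define TILE_H 8

    typedef struct {
        float zmax;        /* farthest depth currently stored in the tile        */
        bool  cleared;     /* fast-clear flag: tile's Z has not been written yet  */
        bool  compressed;  /* whether the tile's Z data is stored compressed      */
    } TileStatus;

    /* Returns true if the whole tile can be skipped for this triangle:
     * the triangle's nearest depth is behind everything stored in the tile. */
    static bool zcull_tile(const TileStatus *tile, float tri_zmin) {
        if (tile->cleared)
            return false;               /* nothing stored yet: must rasterize */
        return tri_zmin > tile->zmax;
    }
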
The Xbox game console
• Built by Microsoft and NVIDIA
• Is almost a PC:
  – Pentium III, 733 MHz
  – An extended GeForce3
• Why a console then?
  – It stays constant…
  – You don’t have to care about 20 different graphics cards, and CPUs from 100 MHz to 2 GHz



Xbox is a UMA machine
• UMA = unified memory architecture
  – Every component in the system accesses the same memory
• We focus on the GPU


Xbox Graphics Processing Unit (GPU)
• Supports programmable vertex shaders
  – No fixed-function geometry stage
• Is a sort-last fragment architecture
• Rasterizer: handles four pixels per clock
• Runs at 250 MHz



Xbox geometry stage
• Dual vertex shaders
  – The same vertex program is executed on two vertices in parallel
• The vertex shader unit is a SIMD machine that operates on 4 components at a time
  – The point is that instead of a fixed-function geometry stage, we now have full control over animation of vertices, lighting, etc.
• Uses DMA (direct memory access), so the GPU fetches vertices directly from memory by itself!
• Three different caches – for better performance!



Xbox geometry stage: caches
• Pre-T&L (transform & lighting) cache:
  – Stores vertices fetched from memory
  – The idea is to avoid redundant memory fetches
  – A vertex is, on average, shared by 6 triangles
  – Has 4 kbytes of storage
• Post-T&L cache:
  – Avoids running the vertex shader more than once for the same vertex
  – So it has storage for 16 transformed vertices
• Primitive Assembly cache:
  – A transformed vertex requires a lot of memory, so it takes a while to fetch a vertex from the post-T&L cache
  – Can store 3 fully shaded vertices
  – Is there to avoid fetches from the post-T&L cache
• The task of the PA cache is to feed the rasterizer with triangles
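
To illustrate what the post-T&L cache buys, here is a small C sketch (an assumed model, not the Xbox’s actual design) that tracks the 16 most recently shaded vertex indices in a FIFO and counts how often the vertex shader could be skipped for an indexed triangle list.

    #include <stdio.h>

    #define CACHE_SIZE 16

    static int cache[CACHE_SIZE];
    static int cache_head = 0;

    /* Returns 1 on a hit (shaded vertex reused), 0 on a miss (shader must run). */
    static int cache_lookup(int vertex_index) {
        for (int i = 0; i < CACHE_SIZE; i++)
            if (cache[i] == vertex_index)
                return 1;
        cache[cache_head] = vertex_index;            /* miss: insert in FIFO order */
        cache_head = (cache_head + 1) % CACHE_SIZE;
        return 0;
    }

    int main(void) {
        for (int i = 0; i < CACHE_SIZE; i++) cache[i] = -1;
        int indices[] = { 0, 1, 2,  2, 1, 3,  3, 1, 4 };  /* small strip-like index list */
        int n = (int)(sizeof indices / sizeof indices[0]), hits = 0;
        for (int i = 0; i < n; i++) hits += cache_lookup(indices[i]);
        printf("vertex shader runs: %d of %d index fetches\n", n - hits, n);
        return 0;
    }
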
Xbox rasterizer
• First block: triangle setup (TS) and FG
• TS computes various deltas (see the texture coordinate interpolation slide) and other setup info
• This block also does Z-occlusion testing
• FG generates fragments inside triangles
  – Tests 2x2 pixels at a time, and forwards these to the four pipelines that follow
  – Note: near edges, not all pixels are inside the triangle, so 0–3 pipelines may be idle
  – There are many strategies for finding which fragments are inside a triangle, but exactly how this is done on the XBOX is not known

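One common strategy for the 2x2 test (only an illustration; as noted above, the Xbox’s exact method is not known) is to evaluate the triangle’s three edge functions at each pixel center of the quad. The C sketch below returns a 4-bit coverage mask; uncovered pixels correspond to idle pixel pipelines.

    typedef struct { float x, y; } Pt;

    /* Signed area-style edge function: sign tells which side of edge a-b the point is on. */
    static float edge(Pt a, Pt b, float px, float py) {
        return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
    }

    /* Coverage mask for the 2x2 quad whose top-left pixel center is (px, py);
     * bit i set means pixel i of the quad is inside (assumes one consistent winding order). */
    static int quad_coverage(Pt v0, Pt v1, Pt v2, float px, float py) {
        int mask = 0;
        for (int i = 0; i < 4; i++) {
            float x = px + (float)(i & 1), y = py + (float)(i >> 1);
            if (edge(v0, v1, x, y) >= 0.0f &&
                edge(v1, v2, x, y) >= 0.0f &&
                edge(v2, v0, x, y) >= 0.0f)
                mask |= 1 << i;
        }
        return mask;
    }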


Xbox rasterizer
• Sorting is done after FG
  – Sort-last fragment architecture
• First: 2 texture units
  – Can be run twice → 4 texture lookups
• The RCs (register combiners) operate on the filtered texel values from the TCs and on the shading interpolated over the triangle (programmable too)
  – Can be used for bump mapping, for example
• Finally, the results from the TXs, RCs, shading interpolation, and fog interpolation are merged into a final color for that pixel



Xbox rasterizer: Fragment merge
• The combiner produced a final color for the pixel on a triangle
• FM merges this with:
  – The color in the color buffer (alpha blending)
  – Respect to the Z-buffer
  – Stencil testing
  – Alpha testing
• Z-compression and decompression are handled here as well
• Writes the final color over the system memory bus



Xbox texture swizzling
• A technique to improve the use of locality in textures
  – It is not likely that we will access texels in a linear fashion (i.e., one scanline at a time)
  – Use swizzling instead
• Assume (u,v) = (un-1…u1u0, vn-1…v1v0)
  – ui and vi are bit i of u and v
• Linear (normal): (width*v + u) * bytes_per_color
• Instead: (un-1vn-1…u1v1u0v0) * bytes_per_color

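A minimal C sketch of the swizzled address computation above: the bits of u and v are interleaved (a Morton-style ordering) so that texels that are close in 2D end up close in memory. This is an illustration for square power-of-two textures, not the exact Xbox hardware logic.

    #include <stdint.h>

    static uint32_t interleave_bits(uint32_t u, uint32_t v) {
        uint32_t addr = 0;
        for (int i = 0; i < 16; i++) {
            addr |= ((v >> i) & 1u) << (2 * i);      /* v_i -> even bit positions */
            addr |= ((u >> i) & 1u) << (2 * i + 1);  /* u_i -> odd bit positions  */
        }
        return addr;                                 /* ...u1 v1 u0 v0, as on the slide */
    }

    /* Linear:   (width*v + u) * bytes_per_color
     * Swizzled: interleave_bits(u, v) * bytes_per_color */
    static uint32_t swizzled_offset(uint32_t u, uint32_t v, uint32_t bytes_per_color) {
        return interleave_bits(u, v) * bytes_per_color;
    }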


Xbox texture swizzling
• This access technique gives the following pattern (4 bytes/color)
  (Figure: the swizzled access pattern – almost a Hilbert curve)
• This is a space-filling curve, and those are often designed so that coherency is improved
• Example: bilinear filtering
Xbox conclusion
• (Almost) a PC with great graphics hardware
• Sort-last fragment architecture
• 2 vertex shaders
• 4 pixel pipelines @ 250 MHz
• Programmable per pixel as well
• One of the best consoles right now…
  – Not for long though



KYRO – a different architecture
• Based on the cost-effective PowerVR architecture
• Tile-based
  – For KYRO II: 32x16 pixels
• Fundamental difference – for the entire scene, do this:
  – Find all triangles inside each tile
  – Render all triangles inside the tile
• Advantage: can implement temporary color, stencil, and Z-buffers in fast on-chip memory
• Saves memory and memory bandwidth!
  – Claims to save 2/3 of the bandwidth compared to a traditional architecture (without Z-occlusion testing)
KYRO architecture overview
• The CPU sends triangle data to the KYRO II
• Tile Accelerator (TA)
  – An entire scene is needed before the ISP and TSP blocks can start
  – So the TA works on the next image while the ISP and TSP work on the current image (i.e., they work in a pipelined fashion)
  – The TA sorts triangles and creates a list of triangle pointers for each tile (for the triangles inside that tile)

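A minimal C sketch (assumed data layout, not the actual KYRO design) of what the Tile Accelerator’s output could look like: one list of triangle pointers per tile, which the ISP/TSP later walk one tile at a time.

    #include <stdlib.h>

    typedef struct Triangle Triangle;   /* vertex/parameter data lives elsewhere */

    typedef struct TriNode {
        const Triangle *tri;
        struct TriNode *next;
    } TriNode;

    typedef struct {
        TriNode *head;                  /* triangle-pointer list for this tile */
    } TileBin;

    /* Append one triangle pointer to one tile's list; the TA would call this
     * for every tile that the triangle's screen-space extent overlaps. */
    static void tile_add_triangle(TileBin *tile, const Triangle *tri) {
        TriNode *node = malloc(sizeof *node);
        if (!node) return;              /* sketch: real hardware uses a fixed buffer */
        node->tri = tri;
        node->next = tile->head;        /* prepend; per-tile order is not important here */
        tile->head = node;
    }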


KYRO
• Tile Accelerator:
  – When all triangles for the entire scene have been sorted into tiles, the TA can send tile data to the next block, the ISP
  – The TA then continues with the next frame’s sorting in parallel
• Image Synthesis Processor (ISP):
  – Implements the Z-buffer, color buffer, and stencil buffer for a tile
  – And occlusion culling (similar to Z-occlusion testing)
    – Tests 32 pixels at a time against the Z-buffer
    – Records which pixels are visible
  – Groups pixels with the same texture and sends them to the TSP
    – These are guaranteed to be visible, so we only texture each pixel once
KYRO: TSP
• Texture and Shading Processor (TSP):
  – Handles texturing and shading interpolation
• Has two pipelines that run in parallel
  – 2 pixels per clock
• Can use at most 8 textures
  – Implemented by ”looping” in the TSP
  – I.e., not at full speed
• Texture data is fetched from local memory
• Supersampling: 2x1, 1x2, and 2x2
  – Renders a larger image, then filters and scales it down
  – For 2x2: need only 4x the size of a tile (or rather, render 4x as many tiles, i.e., 4x the memory is not needed)
KYRO: pros and cons
• Uses a small amount of very fast memory
  – Reduces bandwidth greatly
  – Reduces frame buffer memory greatly
• But more local memory is needed
  – For tile sorting
  – The amount of local memory places a limit on how many triangles can be rendered
  – 3 MB can handle a little over 30,000 triangles
• The design is parallel
  – Add more pipelines that handle the rest of the architecture that follows the Tile Accelerator
  – But the bottleneck will (likely) move, so it is not certain how much can be gained
Challenges for the future
• Continue to push the frontier of ”normal” graphics hardware
  – How long can the ”2x performance per 6 months” pace keep up?
  – Keep adding new features…
  – The next generation is expected to be massively programmable, both at vertices and at pixels
  – Another goal is to make rendering more realistic
• Do this by developing new algorithms for the programmable hardware



Challenges for the future
• Design a new architecture targeted at global illumination
• Very few have focused on ”ray tracing”-based algorithms so far
• It is time now…
• It would be nice to have:
  – Rapid intersection testing of curved surfaces in hardware
  – Rapid traversal of spatial data structures
  – Handling of very large scenes
    – Standard graphics hardware handles this quite well because a triangle can be discarded once it has been rendered
    – Ray tracing-based algorithms cannot do this, because they render shadows and reflections and therefore need to know about nearby geometry
  – Photon mapping…
Challenges for the future
• Design really small architectures with really scarce resources
  – Little chip area
  – Little memory
  – Little bandwidth
• So that they can be used in mobile devices, e.g., PalmPilots, phones, etc.



Graphics hardware conclusion
• It is possible to build great hardware for standard triangle rendering
  – Reasons: pixel independence, parallelism, pipelining, etc.
• Ray tracing-based hardware will come
  – It has been shown that commodity graphics hardware can be used for ray tracing
  – See the paper by Tim Purcell et al., SIGGRAPH 2002
• Not sure what will happen in the future, but it will happen pretty fast
  – ”It will be utterly fantastic”
