CSC 391/691: GPU Programming Fall 2011
Introduction to GPUs for HPC
Copyright © 2011 Samuel S. Cho
High Performance Computing
• Speed: Many problems that are interesting to scientists and engineers would take a long time to execute on a PC or laptop: months, years, “never”.
• Size: Many problems that are interesting to scientists and engineers can’t fit on a PC or laptop with a few GB of RAM or a few hundred GB of disk space.
• Supercomputers or clusters of computers can make these problems practical to solve numerically.
Scientific and Engineering Problems
• Simulations of physical phenomena such as:
• Weather forecasting
• Earthquake forecasting
• Galaxy formation
• Oil reservoir management
• Molecular dynamics
• Data Mining: Finding needles of critical information in a haystack of data, such as:
• Bioinformatics
• Signal processing
• Detecting storms that might turn into hurricanes
• Visualization: turning a vast sea of data into pictures that scientists can understand.
• At its most basic level, all of these problems involve many, many floating point operations.
Hardware Accelerators
• In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload.
• In the olden days (1980s), supercomputers sometimes had array processors, which did vector operations on arrays.
• PCs sometimes had floating point accelerators: little chips that did the floating point calculations in hardware rather than software.
[Slide images: an array processor and a floating point coprocessor, each captioned “Not an accelerator*” -- *Okay, I lied.]
To Accelerate Or Not To Accelerate
• Pro:
• They make your code run faster.
• Cons:
• They’re expensive.
• They’re hard to program.
• Your code may not be cross-platform.
Why GPU for HPC?
• Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering.
• They became very, very popular with video gamers, because they’ve produced better and better images, lightning fast.
• And, prices have been extremely good, ranging from three figures at the low end to four figures at the high end.
• Chips are expensive to design (hundreds of millions of $$$) and expensive to build factories for (billions of $$$), but cheap to produce.
• For example, in 2006–2007, GPUs sold at a rate of about 80 million cards per year, generating about $20 billion per year in revenue.
• This means that the GPU companies have been able to recoup their huge fixed costs.
• Remember: GPUs mostly do stuff like rendering images. This is done mostly through floating point arithmetic -- the same stuff people use supercomputing for!
What are GPUs?
• GPUs have developed from graphics cards into a platform for high performance computing (HPC) -- perhaps the most important development in HPC for many years.
• Co-processors -- a very old idea that appeared in the 1970s and 1980s with floating point co-processors attached to microprocessors that did not then have floating point capability.
• These coprocessors simply executed floating point instructions that were fetched from memory.
• Around the same time, there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games.
• This led to graphics processing units (GPUs) attached to the CPU to create the video display.
[Figure: early design -- CPU and memory, with a graphics card driving the display]
Modern GPU Design
• By the late 1990s, graphics chips needed to support 3-D graphics, especially for games and graphics APIs such as DirectX and OpenGL.
• Graphics chips generally had a pipeline structure, with individual stages performing specialized operations, finally leading to loading the frame buffer for display.
• Individual stages may have access to graphics memory for storing intermediate computed data.
[Figure: graphics pipeline -- input stage, vertex shader stage, geometry shader stage, rasterizer stage, pixel shading stage; stages access graphics memory, and output goes to the frame buffer]
General Purpose GPU (GPGPU) Designs
• High performance pipelines call for high-speed (IEEE) floating point operations.
• Known as GPGPU (General-Purpose computing on Graphics Processing Units) -- difficult to do with specialized graphics pipelines, but possible.
• By the mid-2000s, it was recognized that individual stages of the graphics pipeline could be implemented by a more general purpose processor core (although with a data-parallel paradigm).
• 2006 -- first GPU for general high performance computing as well as graphics processing: the NVIDIA G80 chip/GeForce 8800 card.
• Unified processors that could perform vertex, geometry, pixel, and general computing operations.
• Could now write programs in C rather than graphics APIs.
• Single-instruction multiple-thread (SIMT) programming model (sketched below).
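As a rough illustration (not from the original slides), here is a minimal CUDA C vector-addition sketch of the SIMT model. The kernel name vecAdd, the array size, and the block size of 256 are arbitrary choices for the example:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// SIMT: every thread executes the same kernel code; each thread
// computes its own global index and handles one array element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may have extra threads
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;          // 1M elements (arbitrary for the example)
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    // Inputs cross the PCIe bus -- the ~8 GB/sec link from the spec slide.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[12345] = %f\n", h_c[12345]);     // expect 3 * 12345 = 37035

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Every thread runs the same instruction stream; only blockIdx and threadIdx differ, so each thread selects a different element -- that is the SIMT idea in miniature.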
NVIDIA Tesla Platform
• The NVIDIA Tesla series was their first platform for the high performance computing market.
• Named for Nikola Tesla, a pioneering mechanical and electrical engineer and inventor.
[Images: GeForce GTX 480 and Tesla C2070 cards]
NVIDIA GTX 480 Specs
• 3 billion transistors
• 480 compute cores
• 1.401 GHz clock speed
• Single precision floating point performance: 1.35 TFLOPs (2 single precision flops per clock per core)
• Double precision floating point performance: 168 GFLOPs (GeForce cards are capped at 1/8 of the single precision rate)
• Internal RAM: 1.5 GB GDDR5 VRAM
• Internal RAM speed: 177.4 GB/sec (compared to 21-25 GB/sec for regular RAM)
• PCIe slot (at most 8 GB/sec per GPU card)
• 250 W thermal design power
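A quick sanity check on those peak figures (this breakdown is mine, not on the original slide; the 1/8 double precision rate is the GeForce-line cap):

\[
480~\text{cores} \times 2~\text{flops/clock} \times 1.401~\text{GHz} \approx 1345~\text{GFLOPs} \approx 1.35~\text{TFLOPs (single precision)}
\]
\[
1345~\text{GFLOPs} \times \tfrac{1}{8} \approx 168~\text{GFLOPs (double precision)}
\]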
Coming: Kepler and Maxwell
• NVIDIA’s 20-series is also known by the codename “Fermi.” It runs at about 0.5 TFLOPs double precision per GPU card (peak).
• The next generation, to be released in 2011, is codenamed “Kepler” and will be capable of something like 1.4 TFLOPs double precision per GPU card.
• After “Kepler” will come “Maxwell” in 2013, capable of something like 4 TFLOPs double precision per GPU card.
• So, the increase in performance is likely to be roughly 2.5x – 3x per generation, roughly every two years.
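Checking the projected generation-to-generation ratios against the slide’s numbers (my arithmetic, not on the original slide):

\[
\frac{1.4~\text{TFLOPs (Kepler)}}{0.5~\text{TFLOPs (Fermi)}} = 2.8,
\qquad
\frac{4~\text{TFLOPs (Maxwell)}}{1.4~\text{TFLOPs (Kepler)}} \approx 2.9
\]

both consistent with the quoted 2.5x – 3x per generation.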
Maryland CPU/GPU Cluster Infrastructure
http://www.umiacs.umd.edu/research/GPU/facilities.html
Intel’s Response to NVIDIA GPUs
Does it work?
Example Applications             | URL                                            | Speedup
Seismic Database                 | http://www.headwave.com                        | 66x – 100x
Mobile Phone Antenna Simulation  | http://www.accelware.com                       | 45x
Molecular Dynamics               | http://www.ks.uiuc.edu/Research/vmd            | 21x – 100x
Neuron Simulation                | http://www.evolvedmachines.com                 | 100x
MRI Processing                   | http://bic-test.beckman.uiuc.edu               | 245x – 415x
Atmospheric Cloud Simulation     | http://www.cs.clemson.edu/~jesteel/clouds.html | 50x
• These look like remarkable speedups compared to traditional CPU-based HPC approaches.