GPU Fundamentals
Who Am I?
2002 – B.S. Computer Science – Furman University
2005 – M.S. Computer Science – UT Knoxville
2002 – Graduate Teaching Assistant
2005 – Graduate Research Assistant (ICL)
2005 – 2013 – Cray, Inc
Worked on porting & optimizing HPC apps @ ORNL, User Training
2013 – Present – NVIDIA Corp.
Porting & optimizing HPC apps @ ORNL, User Training,
Representative to OpenACC & OpenMP
AGENDA
GPU Architecture
Speed v. Throughput
Latency Hiding
Memory Coalescing
SIMD v. SIMT
GPU Architecture
Two Main Components
Global memory
Analogous to RAM in a CPU server
Accessible by both GPU and CPU
Currently up to 16 GB in Tesla products
Streaming Multiprocessors (SM)
Perform the actual computation
Each SM has its own: Control units, registers, execution pipelines, caches
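As a concrete illustration (not from the slides), the global memory size and per-SM resources described above can be queried at runtime with cudaGetDeviceProperties. A minimal sketch, built with nvcc:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    printf("Device:             %s\n", prop.name);
    printf("Global memory:      %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("SM count:           %d\n", prop.multiProcessorCount);
    printf("Shared mem per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("32-bit regs per SM: %d\n", prop.regsPerMultiprocessor);
    return 0;
}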
GPU Architecture
Streaming Multiprocessor (SM)
Many CUDA Cores per SM
Architecture dependent
Special-function units
cos/sin/tan, etc.
Shared memory + L1 cache
Thousands of 32-bit registers
GPU Architecture
CUDA Core
Floating point & Integer unit
IEEE 754-2008 floating-point standard
Fused multiply-add (FMA) instruction for both single and double precision
Logic unit
Move, compare unit
Branch unit
[Figure: CUDA core block diagram: dispatch port, operand collector, FP unit, INT unit, result queue]
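A minimal sketch (illustrative kernel, not from the slides) of using the FMA instruction from device code: the C math function fma() maps to the hardware fused multiply-add, which computes a*b + c with a single rounding step.

__global__ void fma_example(const double *a, const double *b,
                            const double *c, double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fma(a[i], b[i], c[i]);   // double precision; fmaf() for single
}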
Execution Model
Software → Hardware mapping:
Thread → Scalar Processor: threads are executed by scalar processors
Thread Block → Multiprocessor: thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid → Device: a kernel is launched as a grid of thread blocks
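A minimal sketch (illustrative names, not from the slides) of the software side of this mapping: a kernel is launched as a grid of thread blocks, and each thread computes a global index from its block and thread IDs.

// Each thread handles one array element; the grid of blocks covers the array.
__global__ void scale(float *x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        x[i] *= alpha;
}

// Launch as a grid of thread blocks, 256 threads per block:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale<<<blocks, threads>>>(d_x, 2.0f, n);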
Warps
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMT) on a multiprocessor
[Figure: a thread block divided into 32-thread warps scheduled on a multiprocessor]
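A minimal sketch (illustrative) of how a thread can identify the warp it belongs to within its block; warpSize is a built-in variable, currently 32 on all NVIDIA GPUs.

__global__ void warp_and_lane(int *warp_id, int *lane_id) {
    int tid = threadIdx.x;
    warp_id[tid] = tid / warpSize;   // which 32-thread warp within the block
    lane_id[tid] = tid % warpSize;   // position (lane) within that warp
}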
GPU Memory Hierarchy Review
Each SM: registers, L1 cache, shared memory (SMEM)
Shared by all SMs: L2 cache
Device: global memory
GPU Architecture
Memory System on each SM
Extremely fast, but small, i.e., 10s of KB
Programmer chooses how the on-chip memory is split between L1 cache and shared memory
L1
Hardware-managed, used for things like register spilling
Should NOT be treated like a CPU cache
Shared Memory (the programmer MUST synchronize data accesses!)
User-managed scratch pad
Useful when the same data is accessed repeatedly or is needed by multiple threads
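A minimal sketch (illustrative, not from the slides) of shared memory as a user-managed scratch pad: a block stages data in shared memory, synchronizes, and then threads read values written by their neighbors. Assumes the kernel is launched with TILE threads per block.

#define TILE 256

__global__ void neighbor_sum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                 // per-block scratch pad
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();                             // required before reading
                                                 // another thread's element
    if (i < n) {
        float right = (threadIdx.x + 1 < TILE && i + 1 < n)
                          ? tile[threadIdx.x + 1] : 0.0f;
        out[i] = tile[threadIdx.x] + right;
    }
}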
GPU Architecture
Memory system on each GPU board
Unified L2 cache (100s of KB)
Fast, coherent data sharing across all cores in the GPU
Unified/Managed Memory
Since CUDA 6 it has been possible to allocate a single pointer (virtual address) whose physical location is managed by the runtime
Pre-Pascal GPUs – managed by software, limited to GPU memory size
Pascal & beyond – hardware can page fault to manage data location and can oversubscribe GPU memory
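A minimal sketch (illustrative) of Unified/Managed Memory: one pointer is allocated with cudaMallocManaged and dereferenced from both host and device, with the runtime (or, on Pascal and later, the page-faulting hardware) migrating the data.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(void) {
    int n = 1 << 20;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));    // single managed pointer

    for (int i = 0; i < n; i++) data[i] = i;      // touched on the host

    increment<<<(n + 255) / 256, 256>>>(data, n); // touched on the device
    cudaDeviceSynchronize();                      // wait before host reads

    printf("data[0] = %d\n", data[0]);
    cudaFree(data);
    return 0;
}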
Speed v. Throughput
Which is better depends on your needs…
*Images from Wikimedia Commons via Creative Commons
Low Latency or High Throughput?
CPU:
Optimized for low-latency access to cached data sets
Control logic for out-of-order and speculative execution
10s of threads
GPU:
Optimized for data-parallel, throughput computation
Tolerant of memory latency
More transistors dedicated to computation
10,000s of threads
Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
[Figure: a CPU core (low-latency processor) processes threads T1–Tn one after another; a GPU streaming multiprocessor (high-throughput processor) context-switches among warps W1–W4 so that computation overlaps time spent waiting for data]
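Latency hiding only works if enough warps are resident on each SM. A minimal sketch (illustrative kernel name) of asking the CUDA runtime how many blocks of a given kernel can be resident per multiprocessor:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int blocks_per_sm = 0;
    // Resident 256-thread blocks per SM, given the kernel's register
    // and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  my_kernel, 256, 0);
    printf("Blocks per SM: %d, warps per SM: %d\n",
           blocks_per_sm, blocks_per_sm * 256 / 32);
    return 0;
}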
Memory Coalescing
Global memory access happens in transactions of 32 or 128 bytes
The hardware will try to reduce this to as few transactions as possible
Coalesced access: a group of 32 contiguous threads ("warp") accessing adjacent words
Few transactions and high utilization
Uncoalesced access: a warp of 32 threads accessing scattered words
Many transactions and low utilization
[Figure: a warp of threads 0–31 accessing adjacent words vs. scattered words]
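A minimal sketch (illustrative) of the two patterns: in the first kernel consecutive threads of a warp read adjacent words (coalesced); in the second each thread reads with a stride, so a warp touches scattered words (uncoalesced).

// Coalesced: thread i reads word i, so a warp touches 32 adjacent words
// that can be served by a small number of transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: thread i reads word i*stride, so a warp touches scattered
// words and may need a separate transaction for each one.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}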
SIMD and SIMT
Single Instruction Multiple Data (SIMD)
Vector instructions perform the same operation on multiple data elements
Data must be loaded and stored in contiguous buffers
Either the programmer or the compiler must generate vector instructions
Single Instruction Multiple Thread (SIMT)
Scalar instructions are executed simultaneously by multiple hardware threads
Contiguous data is not required
So if something can run in SIMD, it can run in SIMT, but not necessarily the reverse
SIMT can better handle indirection
The hardware enables parallel execution of scalar instructions
[Figure: SIMD as 128-bit vector loads, adds, and stores vs. SIMT as per-thread scalar loads, adds, and stores]
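A minimal sketch (illustrative) of the indirection point above: a gather through an index array needs special vector gather support under SIMD, but under SIMT it is simply an ordinary scalar load in each thread.

// Each thread does an independent scalar load through an index array.
__global__ void gather(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}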
SIMD and SIMT Branching
SIMD
1. Execute converged instructions
2. Generate vector mask for true branch
3. Execute masked vector instruction
4. Generate vector mask for false branch
5. Execute masked vector instruction
6. Continue to execute converged instructions
Divergence is (hopefully) handled by the compiler through masks and/or gather/scatter operations.

SIMT
1. Execute converged instructions
2. Execute true branch
3. Execute false branch
4. Continue to execute converged instructions
Divergence is handled by the hardware through predicated instructions.
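A minimal sketch (illustrative) of SIMT branching: even and odd lanes of a warp take different paths, the hardware executes both paths with the inactive lanes predicated off, and the warp then reconverges.

__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = 2.0f * x[i];   // executed by even lanes
        else
            x[i] = x[i] + 1.0f;   // executed by odd lanes
    }                             // warp reconverges here
}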
Next 2 Lectures
Wednesday – OpenACC Basics
Friday – More OpenACC?