GPU Fundamentals
Who Am I?
2002 – B.S. Computer Science – Furman University
2005 – M.S. Computer Science – UT Knoxville
2002 – Graduate Teaching Assistant
2005 – Graduate Research Assistant (ICL)
2005 – 2013 – Cray, Inc
Worked on porting & optimizing HPC apps @ ORNL, User Training
2013 – Present – NVIDIA Corp.
Porting & optimizing HPC apps @ ORNL, User Training,
Representative to OpenACC & OpenMP
AGENDA
GPU Architecture
Speed v. Throughput
Latency Hiding
Memory Coalescing
SIMD v. SIMT
GPU Architecture
Two Main Components
Global memory
Analogous to RAM in a CPU server
Accessible by both GPU and CPU
Currently up to 16 GB in Tesla products
Streaming Multiprocessors (SM)
Perform the actual computation
Each SM has its own: Control units, registers, execution pipelines, caches
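As a concrete illustration (not from the slides), the global memory size and per-SM resources described above can be queried at runtime with cudaGetDeviceProperties. A minimal sketch, built with nvcc:

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    printf("Device:             %s\n", prop.name);
    printf("Global memory:      %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("SM count:           %d\n", prop.multiProcessorCount);
    printf("Shared mem per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("32-bit regs per SM: %d\n", prop.regsPerMultiprocessor);
    return 0;
}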
GPU Architecture
Streaming Multiprocessor (SM)
Many CUDA Cores per SM
Architecture dependent
Special-function units
cos/sin/tan, etc.
Shared memory + L1 cache
Thousands of 32-bit registers
GPU Architecture
CUDA Core
Floating point & Integer unit
IEEE 754-2008 floating-point standard
Fused multiply-add (FMA) instruction for both single and double precision
Logic unit
Move, compare unit
Branch unit
[Figure: CUDA core block diagram: dispatch port, operand collector, FP unit, INT unit, result queue]
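A minimal sketch (illustrative kernel, not from the slides) of using the FMA instruction from device code: the C math function fma() maps to the hardware fused multiply-add, which computes a*b + c with a single rounding step.

__global__ void fma_example(const double *a, const double *b,
                            const double *c, double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fma(a[i], b[i], c[i]);   // double precision; fmaf() for single
}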
Execution Model
Software → Hardware mapping:
Thread → Scalar Processor: threads are executed by scalar processors
Thread Block → Multiprocessor: thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid → Device: a kernel is launched as a grid of thread blocks
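A minimal sketch (illustrative names, not from the slides) of the software side of this mapping: a kernel is launched as a grid of thread blocks, and each thread computes a global index from its block and thread IDs.

// Each thread handles one array element; the grid of blocks covers the array.
__global__ void scale(float *x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        x[i] *= alpha;
}

// Launch as a grid of thread blocks, 256 threads per block:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   scale<<<blocks, threads>>>(d_x, 2.0f, n);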
Warps
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMT) on a multiprocessor
[Figure: a thread block divided into 32-thread warps scheduled on a multiprocessor]
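A minimal sketch (illustrative) of how a thread can identify the warp it belongs to within its block; warpSize is a built-in variable, currently 32 on all NVIDIA GPUs.

__global__ void warp_and_lane(int *warp_id, int *lane_id) {
    int tid = threadIdx.x;
    warp_id[tid] = tid / warpSize;   // which 32-thread warp within the block
    lane_id[tid] = tid % warpSize;   // position (lane) within that warp
}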
GPU Memory Hierarchy Review
Each SM: registers, L1 cache, shared memory (SMEM)
Shared by all SMs: L2 cache
Device: global memory
GPU Architecture
Memory System on each SM
Extremely fast, but small, i.e., 10s of KB
Programmer chooses how the on-chip memory is split between L1 cache and shared memory
L1
Hardware-managed, used for things like register spilling
Should NOT be treated like a CPU cache
Shared Memory (the programmer MUST synchronize data accesses!)
User-managed scratch pad
Useful when the same data is accessed repeatedly or is needed by multiple threads
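A minimal sketch (illustrative, not from the slides) of shared memory as a user-managed scratch pad: a block stages data in shared memory, synchronizes, and then threads read values written by their neighbors. Assumes the kernel is launched with TILE threads per block.

#define TILE 256

__global__ void neighbor_sum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                 // per-block scratch pad
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();                             // required before reading
                                                 // another thread's element
    if (i < n) {
        float right = (threadIdx.x + 1 < TILE && i + 1 < n)
                          ? tile[threadIdx.x + 1] : 0.0f;
        out[i] = tile[threadIdx.x] + right;
    }
}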
GPU Architecture
Memory system on each GPU board
Unified L2 cache (100s of KB)
Fast, coherent data sharing across all cores in the GPU
Unified/Managed Memory
Since CUDA 6 it has been possible to allocate a single pointer (virtual address) whose physical location is managed by the runtime
Pre-Pascal GPUs – managed by software, limited to GPU memory size
Pascal & beyond – hardware can page fault to manage data location and can oversubscribe GPU memory
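A minimal sketch (illustrative) of Unified/Managed Memory: one pointer is allocated with cudaMallocManaged and dereferenced from both host and device, with the runtime (or, on Pascal and later, the page-faulting hardware) migrating the data.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main(void) {
    int n = 1 << 20;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));    // single managed pointer

    for (int i = 0; i < n; i++) data[i] = i;      // touched on the host

    increment<<<(n + 255) / 256, 256>>>(data, n); // touched on the device
    cudaDeviceSynchronize();                      // wait before host reads

    printf("data[0] = %d\n", data[0]);
    cudaFree(data);
    return 0;
}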
Speed v. Throughput
Which is better depends on your needs…
*Images from Wikimedia Commons via Creative Commons
Low Latency or High Throughput?
CPU:
Optimized for low-latency access to cached data sets
Control logic for out-of-order and speculative execution
10s of threads
GPU:
Optimized for data-parallel, throughput computation
Tolerant of memory latency
More transistors dedicated to computation
10,000s of threads
Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
[Figure: a CPU core (low-latency processor) processes threads T1–Tn one after another; a GPU streaming multiprocessor (high-throughput processor) context-switches among warps W1–W4 so that computation overlaps time spent waiting for data]
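Latency hiding only works if enough warps are resident on each SM. A minimal sketch (illustrative kernel name) of asking the CUDA runtime how many blocks of a given kernel can be resident per multiprocessor:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    int blocks_per_sm = 0;
    // Resident 256-thread blocks per SM, given the kernel's register
    // and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  my_kernel, 256, 0);
    printf("Blocks per SM: %d, warps per SM: %d\n",
           blocks_per_sm, blocks_per_sm * 256 / 32);
    return 0;
}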
Memory Coalescing
Global memory access happens in transactions of 32 or 128 bytes
The hardware will try to reduce this to as few transactions as possible
Coalesced access: a group of 32 contiguous threads ("warp") accessing adjacent words
Few transactions and high utilization
Uncoalesced access: a warp of 32 threads accessing scattered words
Many transactions and low utilization
[Figure: a warp of threads 0–31 accessing adjacent words vs. scattered words]
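A minimal sketch (illustrative) of the two patterns: in the first kernel consecutive threads of a warp read adjacent words (coalesced); in the second each thread reads with a stride, so a warp touches scattered words (uncoalesced).

// Coalesced: thread i reads word i, so a warp touches 32 adjacent words
// that can be served by a small number of transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Uncoalesced: thread i reads word i*stride, so a warp touches scattered
// words and may need a separate transaction for each one.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}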
SIMD and SIMT
Single Instruction Multiple Data (SIMD)
Vector instructions perform the same operation on multiple data elements
Data must be loaded and stored in contiguous buffers
Either the programmer or the compiler must generate vector instructions
Single Instruction Multiple Thread (SIMT)
Scalar instructions are executed simultaneously by multiple hardware threads
Contiguous data is not required
So if something can run in SIMD, it can run in SIMT, but not necessarily the reverse
SIMT can better handle indirection
The hardware enables parallel execution of scalar instructions
[Figure: SIMD as 128-bit vector loads, adds, and stores vs. SIMT as per-thread scalar loads, adds, and stores]
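A minimal sketch (illustrative) of the indirection point above: a gather through an index array needs special vector gather support under SIMD, but under SIMT it is simply an ordinary scalar load in each thread.

// Each thread does an independent scalar load through an index array.
__global__ void gather(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];
}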
SIMD and SIMT Branching
SIMD
1. Execute converged instructions
2. Generate vector mask for true branch
3. Execute masked vector instruction
4. Generate vector mask for false branch
5. Execute masked vector instruction
6. Continue to execute converged instructions
Divergence is (hopefully) handled by the compiler through masks and/or gather/scatter operations.

SIMT
1. Execute converged instructions
2. Execute true branch
3. Execute false branch
4. Continue to execute converged instructions
Divergence is handled by the hardware through predicated instructions.
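A minimal sketch (illustrative) of SIMT branching: even and odd lanes of a warp take different paths, the hardware executes both paths with the inactive lanes predicated off, and the warp then reconverges.

__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            x[i] = 2.0f * x[i];   // executed by even lanes
        else
            x[i] = x[i] + 1.0f;   // executed by odd lanes
    }                             // warp reconverges here
}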
Next 2 Lectures
Wednesday – OpenACC Basics
Friday – More OpenACC?