Lecture 9 Multi-Processor

The document discusses parallel processors and the challenges associated with parallel programming, emphasizing the importance of task-level parallelism and strategies for effective parallelization. It covers concepts like Amdahl's Law, data and task decomposition, and scaling examples to illustrate performance improvements with multiple processors. Additionally, it addresses shared memory multiprocessors, cache coherency, and the protocols used to maintain data consistency across caches.

Parallel Processors from Client to Cloud

Parallel Computers
• Goal: connecting multiple computers to get higher performance
  – Multiprocessors
  – Scalability, availability, power efficiency
• Task-level (process-level) parallelism
  » High throughput for independent jobs
• Parallel processing program
  » Single program run on multiple processors
§6.2 The Difficulty of Creating Parallel Processing Programs
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
  – Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
  – Partitioning
  – Coordination
  – Communications overhead

Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  – Tnew = Tparallelizable/100 + Tsequential
  – Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  – Solving: Fparallelizable = 0.999
• Need the sequential part to be 0.1% of the original time
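A minimal C sketch (mine, not from the slides) that evaluates this formula numerically:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the
   parallelizable fraction and p is the number of processors. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* The slide's example: 100 processors and f = 0.999 give roughly 90x. */
    printf("speedup = %.1f\n", amdahl_speedup(0.999, 100));  /* prints 91.0 */
    return 0;
}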

Parallelization Strategy
• Data decomposition
• Task decomposition
• Objective
  – Minimize the communication overheads as much as possible

Data Decomposition

• Decide how data elements should be divided among processors
• Decide which tasks each processor should be doing
• Example: Find the largest element in an array
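A minimal pthreads sketch of this data decomposition (the thread count, array size, and names are illustrative, not from the slides): each thread scans one contiguous slice of the array, and the per-thread maxima are combined at the end.

#include <pthread.h>
#include <stdio.h>

#define N 1000000              /* array size (assumed) */
#define P 4                    /* number of processors/threads (assumed) */

static double a[N];
static double local_max[P];    /* one partial result per thread */

/* Each thread finds the maximum of its own contiguous chunk. */
static void *find_max(void *arg) {
    long id = (long)arg;
    long lo = id * (N / P), hi = (id == P - 1) ? N : lo + N / P;
    double m = a[lo];
    for (long i = lo + 1; i < hi; i++)
        if (a[i] > m) m = a[i];
    local_max[id] = m;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) a[i] = (double)(i % 12345);   /* sample data */

    pthread_t t[P];
    for (long id = 0; id < P; id++) pthread_create(&t[id], NULL, find_max, (void *)id);
    for (long id = 0; id < P; id++) pthread_join(t[id], NULL);

    double m = local_max[0];            /* combine the partial maxima */
    for (int id = 1; id < P; id++)
        if (local_max[id] > m) m = local_max[id];
    printf("max = %g\n", m);
    return 0;
}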
Task Decomposition

• Divide tasks among processors
• Decide which data elements are going to be accessed (read and/or written) by which processors
• Example
Pipelining

• Special kind of task parallelism
• (Figure: Core 1: Stage 1, Core 2: Stage 2, Core 3: Stage 3, Core 4: Stage 4; at time t1 a data item enters Stage 1, at t2 it moves to Stage 2 while the next item enters Stage 1, and so on, so all four cores work on different items at once.)
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
  – Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  – Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  – Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  – Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  – Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors
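A small C sketch (mine, not from the slides) that reproduces these numbers; it also covers the 100 × 100 matrix of the next slide.

#include <stdio.h>

/* Time model from the slides, in units of t_add: the 10 scalar additions stay
   sequential, the n x n matrix sum is split evenly over p processors. */
static double time_units(int n, int p) {
    return 10.0 + (double)n * n / p;
}

int main(void) {
    int sizes[] = {10, 100};            /* 10 x 10 matrix, then 100 x 100 */
    for (int k = 0; k < 2; k++) {
        int n = sizes[k];
        double t1 = time_units(n, 1);   /* single-processor time */
        for (int p = 10; p <= 100; p *= 10) {
            double s = t1 / time_units(n, p);
            printf("n=%3d  p=%3d  speedup=%5.1f  (%3.0f%% of potential)\n",
                   n, p, s, 100.0 * s / p);
        }
    }
    return 0;
}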

Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
– Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
– Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
– Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
– Speedup = 10010/110 = 91 (91% of potential)
• Assuming load balanced

Strong vs Weak Scaling

• Strong scaling: problem size fixed
• Weak scaling: problem size proportional to number of processors

§7.3 Shared Memory Multiprocessors
Shared Memory Multiprocessor
• SMP: shared memory multiprocessor
– Hardware provides single physical address space for all processors
– Synchronize shared variables using locks
– Memory access time
» UMA (uniform) vs. NUMA (nonuniform)

Cache Coherency
• Traffic per processor and the bus bandwidth determine the # of processors
• Caches can lower bus traffic
  – But replicating shared data in multiple caches creates the cache coherency problem

Cache Coherency
  Time   Event                    Cache A   Cache B   X (memory)
  0                                                   1
  1      CPU A reads X            1                   1
  2      CPU B reads X            1         1         1
  3      CPU A stores 0 into X    0         1         0

• After time 3, CPU B still caches the old value 1: its copy of X is stale, which is the cache coherency problem.
Cache Coherency Protocol
• Snooping Solution (Snoopy Bus):
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
Basic Snoopy Protocols
• Write Invalidate Protocol:
  – Multiple readers, single writer
  – Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  – Read miss:
    » Write-through: memory is always up-to-date
    » Write-back: snoop in caches to find the most recent copy
• Write Update Protocol:
  – Write to shared data: broadcast on bus, processors snoop and update copies
  – Read miss: memory is always up-to-date
• What happens if two processors try to write to the same shared data word in the same clock cycle?
  – Write serialization: the bus serializes requests
Basic Snoopy Protocols
• Invalidation (write-back cache)

  Processor activity       Bus activity           CPU A's cache   CPU B's cache   Memory location X
                                                                                  0
  CPU A reads X            Cache miss for X       0                               0
  CPU B reads X            Cache miss for X       0               0               0
  CPU A writes a 1 to X    Invalidation for X     1                               0
  CPU B reads X            Cache miss for X       1               1               1

• Update

  Processor activity       Bus activity           CPU A's cache   CPU B's cache   Memory location X
                                                                                  0
  CPU A reads X            Cache miss for X       0                               0
  CPU B reads X            Cache miss for X       0               0               0
  CPU A writes a 1 to X    Write broadcast of X   1               1               1
  CPU B reads X                                   1               1               1
Basic Snoopy Protocols
• Write Invalidate versus Broadcast:
– Invalidate requires one transaction per write-run
– Invalidate uses spatial locality: one transaction per block
– Update has lower latency between write and read
– Update: BW (increased) vs. latency (decreased) tradeoff

Invalidate protocol is more popular than update!


An Example Snoopy Protocol

• Invalidation protocol, write-back cache
• Each block of memory is in one state:
  – Clean in all caches and up-to-date in memory
  – OR Dirty in exactly one cache
  – OR Not in any caches
• Each cache block is in one state:
  – Shared: block can be read
  – OR Exclusive: cache has the only copy, it’s writeable, and dirty
  – OR Invalid: block contains no data
• Read misses: cause all caches to snoop
• Writes to a clean line are treated as misses (write invalidate)
Snoopy-Cache State Machine-I
• State machine for CPU requests, for each cache block (CPU side):
  – Invalid → Shared (read only): CPU read; place read miss on bus
  – Invalid → Exclusive (read/write): CPU write; place write miss on bus
  – Shared: CPU read hit (no bus action); CPU read miss stays Shared and places a read miss on bus
  – Shared → Exclusive: CPU write; place write miss on bus
  – Exclusive: CPU read hit and CPU write hit (no bus action)
  – Exclusive → Shared: CPU read miss; write back the block, place read miss on bus
  – Exclusive: CPU write miss; write back the cache block, place write miss on bus
Snoopy-Cache State Machine-II
• State machine for bus requests, for each cache block (snooping side):
  – Shared → Invalid: write miss for this block observed on the bus
  – Exclusive → Invalid: write miss for this block; write back the block (abort the memory access)
  – Exclusive → Shared: read miss for this block; write back the block (abort the memory access)
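The transition lists above can be condensed into one controller per cache block. The sketch below is a C rendering of this write-invalidate, write-back protocol; the enum names and stubbed bus actions are mine, not from the slides.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS,
               BUS_READ_MISS, BUS_WRITE_MISS } Event;

/* Bus actions are just logged in this sketch. */
static void place_read_miss_on_bus(void)  { printf("  bus: read miss\n"); }
static void place_write_miss_on_bus(void) { printf("  bus: write miss\n"); }
static void write_back_block(void)        { printf("  bus: write back block\n"); }

/* One transition of the write-invalidate, write-back snooping protocol. */
static BlockState next_state(BlockState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ_MISS)  { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE_MISS) { place_write_miss_on_bus(); return EXCLUSIVE; }
        return INVALID;                                 /* snooped traffic: nothing to do */
    case SHARED:
        if (e == CPU_READ_HIT)   return SHARED;
        if (e == CPU_READ_MISS)  { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE_HIT || e == CPU_WRITE_MISS)  /* write to a clean line = miss */
                                 { place_write_miss_on_bus(); return EXCLUSIVE; }
        if (e == BUS_WRITE_MISS) return INVALID;        /* another cache is writing */
        return SHARED;
    case EXCLUSIVE:
        if (e == CPU_READ_HIT || e == CPU_WRITE_HIT) return EXCLUSIVE;
        if (e == CPU_READ_MISS)  { write_back_block(); place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE_MISS) { write_back_block(); place_write_miss_on_bus(); return EXCLUSIVE; }
        if (e == BUS_READ_MISS)  { write_back_block(); return SHARED; }   /* supply dirty copy */
        if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
        return EXCLUSIVE;
    }
    return s;
}

int main(void) {
    /* P1's block for A1 during the example that follows: local write miss,
       local read hit, then a remote read miss and a remote write miss. */
    BlockState s = INVALID;
    s = next_state(s, CPU_WRITE_MISS);   /* -> Exclusive */
    s = next_state(s, CPU_READ_HIT);     /* -> Exclusive */
    s = next_state(s, BUS_READ_MISS);    /* -> Shared (with write back) */
    s = next_state(s, BUS_WRITE_MISS);   /* -> Invalid */
    printf("final state: %d\n", s);      /* prints 0 (INVALID) */
    return 0;
}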
Example
• Assumes the initial cache state is Invalid, A1 and A2 map to the same cache block, and A1 != A2
• Operation sequence traced below:
  – P1: Write 10 to A1
  – P1: Read A1
  – P2: Read A1
  – P2: Write 20 to A1
  – P2: Write 40 to A2
Example: Steps 1–5

Completed trace (the slides fill it in one operation per step, highlighting the state-machine transition just taken):

Step                  P1 state/addr/value    P2 state/addr/value    Bus action/proc/addr/value    Memory addr/value
P1: Write 10 to A1    Excl.  A1  10                                 WrMs  P1  A1
P1: Read A1           Excl.  A1  10
P2: Read A1                                  Shar.  A1              RdMs  P2  A1
                      Shar.  A1  10                                 WrBk  P1  A1  10              A1  10
                                             Shar.  A1  10          RdDa  P2  A1  10              A1  10
P2: Write 20 to A1    Inv.                   Excl.  A1  20          WrMs  P2  A1                  A1  10
P2: Write 40 to A2                                                  WrMs  P2  A2                  A1  10
                                             Excl.  A2  40          WrBk  P2  A1  20              A1  20
Coherency Misses: 4th C
Joins Compulsory, Capacity, Conflict

1. True sharing misses arise from the communication of data through the cache coherence mechanism
   • Invalidates due to the 1st write to a shared block
   • Reads by another CPU of a modified block held in a different cache
   • Miss would still occur if block size were 1 word
2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into
   • Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
   • Block is shared, but no word in the block is actually shared ⇒ miss would not occur if block size were 1 word
Example: True vs. False Sharing vs. Hit?
• Assume x1 and x2 are in the same cache block; P1 and P2 have both read x1 and x2 before.

  Time   P1          P2          True, False, Hit? Why?
  1      Write x1                True miss; invalidate x1 in P2
  2                  Read x2     False miss; x1 irrelevant to P2
  3      Write x1                False miss; x1 irrelevant to P2
  4                  Write x2    False miss; x1 irrelevant to P2
  5      Read x2                 True miss; invalidate x2 in P1
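A small pthreads sketch (mine, not from the slides) that typically exhibits false sharing: two threads repeatedly update different words that sit in the same cache block, so each write invalidates the other core's copy even though no data is actually shared. Padding the counters onto separate blocks, as in padded_line, usually removes the extra coherence misses.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Two counters in the same cache block: candidates for false sharing. */
struct { long a, b; } shared_line;

/* Padded version: each counter gets (at least) its own 64-byte block. */
struct { long a; char pad[64]; long b; } padded_line;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_line.a++;   /* invalidates the line in the other core */
    return NULL;
}
static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_line.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared_line.a, shared_line.b);
    /* Repeating the experiment with padded_line.a and padded_line.b in the two
       threads usually runs noticeably faster: same data, fewer coherence misses. */
    return 0;
}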
Communication Models

• Single Address Space: processors communicate with ordinary loads and stores (e.g., P0 executes "store x" and Pn later executes "load x" on the same location in the common physical address space)
• (Figure: the address space consists of a shared portion, visible to all processors P0 … Pn, and a private portion per processor.)
Program Example – Single-Address Space

• Sum 100,000 numbers on 100 processors (load & store)
• First step: each processor Pn sums its subset of the numbers
  – sum is a shared array of partial sums, one entry per processor; P0 handles elements 0–999, P1 the next 1,000, and so on
• Second step: add the partial sums via divide-and-conquer; in each round the lower half of the processors add in the partial sums held by the upper half, until the total ends up in sum[0]
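A C/pthreads sketch of the two steps (the thread setup, array contents, and handling of odd-sized halves are illustrative, not taken verbatim from the slides):

#include <pthread.h>
#include <stdio.h>

#define P 100                    /* processors (here: threads) */
#define N 100000                 /* numbers to sum */

static double a[N];              /* the shared data */
static double sum[P];            /* shared partial sums, one per processor */
static pthread_barrier_t bar;    /* plays the role of synch() */

static void *processor(void *arg) {
    long Pn = (long)arg;

    /* First step: each processor sums its own 1,000-element subset. */
    sum[Pn] = 0.0;
    for (long i = Pn * (N / P); i < (Pn + 1) * (N / P); i++)
        sum[Pn] += a[i];

    /* Second step: divide-and-conquer reduction on the shared sum[] array. */
    int half = P;
    do {
        pthread_barrier_wait(&bar);          /* wait for the previous round */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];         /* P0 picks up the odd leftover element */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) a[i] = 1.0;             /* sample data */
    pthread_barrier_init(&bar, NULL, P);

    pthread_t t[P];
    for (long Pn = 0; Pn < P; Pn++) pthread_create(&t[Pn], NULL, processor, (void *)Pn);
    for (long Pn = 0; Pn < P; Pn++) pthread_join(t[Pn], NULL);

    printf("total = %g\n", sum[0]);                      /* result ends up in sum[0] */
    pthread_barrier_destroy(&bar);
    return 0;
}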


Message Passing Multiprocessors
• Clusters: collections of computers connected via I/O over standard network switches to form a message-passing multiprocessor
• NUMA: Non-Uniform Memory Access, with a directory-based cache coherency protocol
• (Figure: nodes, each containing processors with caches plus local memory and I/O, connected by an interconnection network.)
Communication Models

• Multiple address spaces: message passing
• (Figure: Process P executes "Send x, Q, t" from its local address space; Process Q executes the matching "Recv y, P, t", which copies the value of x into y in Q's local address space.)
Parallel Program – Message Passing

• Sum 100,000 numbers on 100 processors (send & receive)
• First step: the numbers are distributed to the nodes (1,000 per node); each processor Pn sums its local subset
• Second step: add the partial sums via divide-and-conquer; in each round half of the nodes send their partial sums and the other half receive and add, until one node holds the total
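A sketch of the same computation with explicit send and receive; MPI is used here only as one concrete message-passing API (my choice, the slides assume generic send/receive primitives), and the reduction loop handles odd-sized halves the same way the shared-memory version does.

#include <mpi.h>
#include <stdio.h>

#define N_PER_NODE 1000          /* 100,000 numbers / 100 processors */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int Pn, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);   /* this node's number, 0..P-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &P);    /* number of nodes, e.g. 100 */

    double local[N_PER_NODE];
    for (int i = 0; i < N_PER_NODE; i++) local[i] = 1.0;    /* this node's share */

    /* First step: each node sums its own subset, with no communication. */
    double sum = 0.0;
    for (int i = 0; i < N_PER_NODE; i++) sum += local[i];

    /* Second step: divide-and-conquer; in each round the upper group of
       nodes sends its partial sums and the lower group receives and adds. */
    int limit = P, half = P;
    do {
        half = (half + 1) / 2;             /* dividing line between senders and receivers */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += partial;
        }
        limit = half;
    } while (half > 1);

    if (Pn == 0) printf("total = %g\n", sum);   /* node 0 ends up with the total */
    MPI_Finalize();
    return 0;
}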


Bisection Bandwidth is Important
• Total network bandwidth = bandwidth per link × number of links
• Bisection bandwidth = the bandwidth between two equal halves of a multiprocessor
• (Figure: a bus multicore, where all processor+cache nodes share a single bus, versus a ring multicore, where each processor+cache node has its own switch on the ring.)
Network Topology

• (Figure: example network topologies built from switches and processor–memory nodes: ring, 2D torus, cube, fully connected, and multistage networks.)

Network Characteristics

• Performance
– Latency per message (unloaded network)
– Throughput
» Link bandwidth
» Total network bandwidth
» Bisection bandwidth
– Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon

What is multi-core?

• (Figure: a conventional multiprocessor puts one core and its cache on each chip and connects the chips with an off-chip bus; a multi-core chip puts several cores, each with its own cache, on one chip connected by an on-chip bus.)

From Multicore to Manycore
Basic CMP Architecture
• L1 caches are always private to a core
• L2 caches can be private or shared – which is better?
• (Figure: Cores 1–4, each with private I-L1 and D-L1 caches, connected through an interconnection network to the L2.)
Scalable CMP Architecture
• Tiled CMP
– Each tile includes processor, L1, L2, and router
– Physically distributed last level cache
ARM big.LITTLE Technology
• ARM big.LITTLE processing is designed to deliver the vision of the right processor for the right job.
• In current big.LITTLE system implementations, a 'big' ARM Cortex™-A15 processor is paired with a 'LITTLE' Cortex™-A7 processor to create a system that can accomplish both high-intensity and low-intensity tasks in the most energy-efficient manner
  – Cortex-A15: heavy workloads
  – Cortex-A7: light workloads, like operating system activities, user interface, and other always-on, always-connected tasks

Multithreading
• Performing multiple threads of execution in parallel
  – Replicate registers, PC, etc.
  – Fast switching between threads
• Fine-grain multithreading
  – Switch threads after each cycle
  – Interleave instruction execution
  – If one thread stalls, others are executed
• Coarse-grain multithreading
  – Only switch on long stalls (e.g., L2-cache miss)
  – Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Multithreaded Categories
• (Figure: issue-slot utilization over time (processor cycles) for a superscalar, fine-grained multithreading, coarse-grained multithreading, and multiprocessing; colors distinguish Threads 1–5 and idle issue slots.)
Simultaneous Multithreading

• In a multiple-issue, dynamically scheduled processor
  – Schedule instructions from multiple threads
  – Instructions from independent threads execute when function units are available
  – Within threads, dependencies are handled by scheduling and register renaming
• Example: Intel Pentium 4 HT
  – Two threads: duplicated registers, shared function units and caches

Multithreaded Categories
• (Figure: the same issue-slot diagram extended with simultaneous multithreading, where instructions from several threads (Threads 1–5) share the issue slots of a single cycle, leaving far fewer idle slots.)
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Computing Device Classification:
Instruction and Data Streams

                            Data Streams
                            Single                Multiple
  Instruction    Single     SISD:                 SIMD:
  Streams                   Intel Pentium 4       SSE instructions of x86
                 Multiple   MISD:                 MIMD:
                            No examples today     Intel Xeon e5345

• SPMD: Single Program Multiple Data
  – A parallel program on a MIMD computer
  – Conditional code for different processors

SIMD
• SIMD architectures can exploit significant data-level parallelism for:
  – matrix-oriented scientific computing
  – media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
  – Only needs to fetch one instruction per data operation
  – Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially
SIMD Instruction Set Extensions for Multimedia
SIMD Extensions
• Media applications operate on data types narrower than the native word size
  – e.g., 4-byte registers holding R, G, B pixel components, one byte each
• Implementations:
  – Intel MMX (1996)
    » Eight 8-bit integer ops or four 16-bit integer ops
  – Streaming SIMD Extensions (SSE) (1999)
    » Eight 16-bit integer ops
    » Four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Advanced Vector Extensions (AVX) (2010)
    » Four 64-bit integer/fp ops
  – Operands must be in consecutive and aligned memory locations
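A small C illustration of such an extension using SSE intrinsics (my example, not from the slides): one instruction adds four 32-bit floats at a time. Unaligned loads keep the sketch simple; the slide's note about consecutive, aligned operands reflects how the extensions are meant to be used for best performance.

#include <stdio.h>
#include <xmmintrin.h>     /* SSE intrinsics: __m128 holds four 32-bit floats */

/* c[i] = a[i] + b[i], four elements per instruction; n is assumed to be a multiple of 4. */
static void add_floats(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);             /* load 4 consecutive floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));    /* 4 additions in one SIMD op */
    }
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    add_floats(a, b, c, 8);
    for (int i = 0; i < 8; i++) printf("%g ", c[i]);  /* prints 9 eight times */
    printf("\n");
    return 0;
}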
Vector Architectures
• Basic idea:
  – Read sets of data elements into “vector registers” (gather)
  – Operate on those registers
  – Highly pipelined function units
  – Disperse the results back into memory (scatter)

Vector Extension to RISC-V
• v0 to v31: 32 vector registers, each holding 64 × 64-bit elements
• Vector instructions
  – fld.v, fsd.v: load/store vector
  – fadd.d.v: add vectors of doubles
  – fadd.d.vs: add a scalar to each element of a vector of doubles
• Significantly reduces instruction-fetch bandwidth

Example: DAXPY (Y = a × X + Y)
Conventional RISC-V code:
fld f0,a(x3) // load scalar a
addi x5,x19,512 // end of array X
loop: fld f1,0(x19) // load x[i]
fmul.d f1,f1,f0 // a * x[i]
fld f2,0(x20) // load y[i]
fadd.d f2,f2,f1 // a * x[i] + y[i]
fsd f2,0(x20) // store y[i]
addi x19,x19,8 // increment index to x
addi x20,x20,8 // increment index to y
bltu x19,x5,loop // repeat if not done

Vector RISC-V code:


fld f0,a(x3) // load scalar a
fld.v v0,0(x19) // load vector x
fmul.d.vs v0,v0,f0 // vector-scalar multiply
fld.v v1,0(x20) // load vector y
fadd.d.v v1,v1,v0 // vector-vector add
fsd.v v1,0(x20) // store vector y
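For reference, a plain C rendering (mine) of what both sequences compute; the 64-element length matches the 512 bytes of doubles that the scalar loop steps through.

/* DAXPY: Y = a * X + Y over 64 double-precision elements. */
void daxpy(double a, const double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}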

Vector vs. Scalar

• Vector architectures and compilers
  – Simplify data-parallel programming
  – Explicit statement of absence of loop-carried dependences
    » Reduced checking in hardware
  – Regular access patterns benefit from interleaved and burst memory
  – Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
  – Better match with compiler technology

Multiple-Lane Vector units
• Vector units can be a combination of pipelined and arrayed functional units (multiple parallel lanes)

What is GPU?
• Graphics Processing Unit
• A GPU is a device that computes massive numbers of vertices, pixels, and general-purpose data items
• Features
  – High availability
  – High computing performance
  – Low price per unit of computing capability
• (Figure: peak GFLOPS from Jan 2003 to Jul 2007: NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, GeForce 8800 GTX, Quadro FX 5600, Tesla C870) climb toward roughly 500 GFLOPS, far above contemporary 3.0 GHz Pentium 4, Core 2 Duo, and Core 2 Quad CPUs.)
GPU’s History and Evolution
• Early history
  – In the early 90s, graphics were performed only by a video graphics array (VGA) controller
  – In 1997, VGA controllers started to incorporate 3D acceleration functions
  – In 2000, the term GPU was coined to denote that the graphics device had become a processor
• GPU evolution
  – Fixed-function → programmable
    » 1999, NVIDIA GeForce 256: fixed-function vertex transform and pixel pipeline
    » 2001, NVIDIA GeForce 3: 1st programmable vertex processor
    » 2002, ATI Radeon 9700: 1st programmable pixel (fragment) processor
  – Non-unified processors → unified processors
    » 2005, Microsoft XBOX 360: 1st unified shader architecture
  – Tesla GPU series released in 2007
  – Fermi architecture released in 2009
  – Kepler architecture released in 2012
  – Maxwell architecture released in 2014
  – Pascal architecture released in 2016 (16 nm FinFET process)
  – Volta architecture released in 2017 – Tensor Cores for AI (12 nm process)
  – Turing architecture released in 2018

Fixed-function 3D Graphics Pipeline
• (Figure: the application feeds vertices and textures into the fixed-function pipeline, controls it by setting render state and issuing draw calls, and accesses it through graphics programming languages/APIs such as DirectX and OpenGL.)
Programmable 3D Graphics Pipeline
• Shaders are written in a high-level shader language (HLSL)
• Both vertex computation and pixel computation stages are programmable
• In the fixed-function pipeline, transformation, shading, and lighting are configured by setting render state
• In the programmable pipeline, these are done by shader code written by engineers
Unified Shader Architecture
• Use the same shader processors for all types of computation
  – Vertex threads
  – Pixel threads
  – Computation threads
• Advantages
  – Better resource utilization
  – Lower hardware complexity
Modern GPUs: A Computing Device
• GPUs have orders of magnitude more computing power than CPUs
• General-purpose tasks with a high degree of data-level parallelism run faster on a GPU than on a CPU
  ⇒ General-Purpose computing on GPU (GPGPU)
• GPGPU programming models
  – NVIDIA’s CUDA
  – AMD’s StreamSDK
  – OpenCL
• Reported GPGPU speedups over CPU:
  – Medical Imaging: 300×
  – Molecular Dynamics: 150×
  – SPICE: 130×
  – Fourier Transform: 130×
  – Fluid Dynamics: 100×
Fundamental Architectural
Differences between CPU & GPU
• Multi-core CPU
  – Coarse-grain, heavyweight threads
  – Memory latency is resolved through large on-chip caches & out-of-order execution
• Modern GPU
  – Fine-grain, lightweight threads
  – Exploits thread-level parallelism to hide latency
• (Figure: a CPU die devotes much of its area to out-of-order control logic, branch predictor, memory prefetcher, and non-blocking caches around a few ALUs; a GPU die devotes most of its area to many ALUs; both are backed by DRAM.)
SIMD processor

SIMT Execution Model of GPUs
• SIMT (Single Instruction Multiple Threads)
• Warp
  – A group of threads (pixel, vertex, compute, …)
  – Basic scheduling/execution unit
  – Common PC value
• (Figure: a thread block is split into warps: thread IDs 1–32 form Warp 1, 33–64 form Warp 2, and so on; over time the scheduler interleaves instructions from different warps, e.g. Warp 1 instruction 30, Warp 4 instruction 1, Warp 10 instruction 13, Warp 16 instruction 7, then Warp 1 instruction 31, Warp 10 instruction 14, Warp 4 instruction 2, Warp 16 instruction 8.)
GPU Memory Structures

Latency Hiding
• Interleaved warp execution
• (Figure: when the currently executing warp stalls, the scheduler switches to another ready warp: Warp 1 instruction 30, then Warp 4, Warp 10, Warp 16, back to Warp 1 instruction 31, and so on, so the execution units stay busy while stalled warps wait for their data.)
Volta

Turing

Graphics in the System

CPU/GPU Integration: CPU’s Advancement Meets GPU’s
• (Figure: microprocessors advanced from the single-thread era through the multi-core era toward heterogeneous CPU/GPU systems, gaining high-performance programmability and task-parallel execution; GPUs advanced from driver-based vertex/pixel shaders usable by experts only toward power-efficient, data-parallel, system-level programmable throughput engines. The two trajectories meet in mainstream heterogeneous computing.)
Heterogeneous Computing ~ 2011
• Intel Sandy Bridge
  – Shared last-level cache (LLC) and main memory
• AMD Fusion APU (Accelerated Processing Unit)
  – Shared main memory
• (Figure: die photos of the AMD Fusion APU and Intel Sandy Bridge, each integrating CPU cores and a GPU on one chip.)
Evolution of Heterogeneous Computing
▪ Dedicated GPU
– GPU kernel is launched through the device driver
– Separate CPU/GPU address space
– Separate system/GPU memory
– Data copy between CPU/GPU via PCIe

• (Figure: an OpenCL application and runtime library run in user space on the CPU, and the GPU device driver in kernel space launches the kernel. The CPU cores use an address space managed by the OS, backed by coherent system memory; the GPU compute units use an address space managed by the driver, backed by non-coherent GPU memory; data is copied between the two over PCIe.)
Evolution of Heterogeneous Computing
▪ Integrated GPU architecture
– GPU kernel is launched through the device driver
– Separate CPU/GPU address space
– Separate system/GPU memory
– Data copy between CPU/GPU via memory bus

• (Figure: the same software stack as the dedicated-GPU case, but the CPU cores and GPU compute units are on one chip; the coherent system-memory partition and the non-coherent GPU-memory partition both live in physical memory, and copies travel over the memory bus instead of PCIe.)
Evolution of Heterogeneous Computing
▪ Integrated GPU architecture
– GPU kernel is launched through the device driver
– Unified CPU/GPU address space (managed by OS)
– Unified system/GPU memory
– No data copy - data can be retrieved by pointer passing
• (Figure: CPU cores and GPU compute units share a single OS-managed address space and coherent system memory, including a shared LLC, so the GPU kernel can dereference the same pointers the CPU uses without any data copy.)
CUDA Unified Memory

§6.10 Multiprocessor Benchmarks and Performance Models
Parallel Benchmarks
• Linpack: matrix linear algebra
• SPECrate: parallel run of SPEC CPU programs
  – Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
  – Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
  – Computational fluid dynamics kernels
• PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  – Multithreaded applications using Pthreads and OpenMP
Modeling Performance: the Roofline Model
• Target performance metric
  – Achievable GFLOPs/sec
• Hardware: for a given computer, determine
  – Peak GFLOPS (from the data sheet)
  – Peak memory bytes/sec (using the Stream benchmark)
• Software: arithmetic intensity of a kernel
  – FLOPs per byte of memory accessed

Roofline: A Simple Performance Model

• Floating-point ops/sec = bytes/sec × floating-point ops per byte
  – Floating-point ops per byte = arithmetic intensity
• (Figure: example roofline plot, assuming a 16 GB/sec peak memory bandwidth.)

Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
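A tiny C sketch of this bound (the 16 GB/sec bandwidth and 64 GFLOPs/sec peak are illustrative values, not from the slides):

#include <stdio.h>

/* Attainable GFLOPs/sec = min(peak memory BW x arithmetic intensity, peak FP performance). */
static double roofline(double peak_gflops, double peak_bw_gbs, double arith_intensity) {
    double memory_bound = peak_bw_gbs * arith_intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    /* Illustrative machine: 16 GB/sec memory bandwidth, 64 GFLOPs/sec peak. */
    for (double ai = 0.25; ai <= 16.0; ai *= 2)
        printf("AI = %5.2f FLOPs/byte -> attainable %5.1f GFLOPs/sec\n",
               ai, roofline(64.0, 16.0, ai));
    /* Kernels left of the ridge point (AI = 64/16 = 4) are memory-bound,
       kernels to the right of it are compute-bound. */
    return 0;
}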

Comparing Systems
• Example: Opteron X2 vs. Opteron X4
  – 2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
  – Same memory system
• To get higher performance on X4 than X2
  – Need high arithmetic intensity
  – Or the working set must fit in X4’s 2 MB L3 cache

