Lecture 9 Multi-Processor

The document discusses parallel processors and the challenges associated with parallel programming, emphasizing the importance of task-level parallelism and strategies for effective parallelization. It covers concepts like Amdahl's Law, data and task decomposition, and scaling examples to illustrate performance improvements with multiple processors. Additionally, it addresses shared memory multiprocessors, cache coherency, and the protocols used to maintain data consistency across caches.

Parallel Processors from Client to Cloud

Parallel Computers
• Goal: connecting multiple computers to get higher performance
  – Multiprocessors
  – Scalability, availability, power efficiency
• Task-level (process-level) parallelism
  » High throughput for independent jobs
• Parallel processing program
  » Single program run on multiple processors
§6.2 The Difficulty of Creating Parallel Processing Programs
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
  – Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
  – Partitioning
  – Coordination
  – Communications overhead

Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  – Tnew = Tparallelizable/100 + Tsequential
  – Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  – Solving: Fparallelizable = 0.999
• Need the sequential part to be 0.1% of the original time
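A minimal C sketch (mine, not from the slides) that evaluates this formula numerically:

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the
   parallelizable fraction and p is the number of processors. */
static double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    /* The slide's example: 100 processors and f = 0.999 give roughly 90x. */
    printf("speedup = %.1f\n", amdahl_speedup(0.999, 100));  /* prints 91.0 */
    return 0;
}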

Parallelization Strategy
• Data decomposition
• Task decomposition
• Objective
  – Minimize the communication overheads as much as possible

Data Decomposition

• Decide how data elements should be divided among processors
• Decide which tasks each processor should be doing
• Example: Find the largest element in an array
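A minimal pthreads sketch of this data decomposition (the thread count, array size, and names are illustrative, not from the slides): each thread scans one contiguous slice of the array, and the per-thread maxima are combined at the end.

#include <pthread.h>
#include <stdio.h>

#define N 1000000              /* array size (assumed) */
#define P 4                    /* number of processors/threads (assumed) */

static double a[N];
static double local_max[P];    /* one partial result per thread */

/* Each thread finds the maximum of its own contiguous chunk. */
static void *find_max(void *arg) {
    long id = (long)arg;
    long lo = id * (N / P), hi = (id == P - 1) ? N : lo + N / P;
    double m = a[lo];
    for (long i = lo + 1; i < hi; i++)
        if (a[i] > m) m = a[i];
    local_max[id] = m;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) a[i] = (double)(i % 12345);   /* sample data */

    pthread_t t[P];
    for (long id = 0; id < P; id++) pthread_create(&t[id], NULL, find_max, (void *)id);
    for (long id = 0; id < P; id++) pthread_join(t[id], NULL);

    double m = local_max[0];            /* combine the partial maxima */
    for (int id = 1; id < P; id++)
        if (local_max[id] > m) m = local_max[id];
    printf("max = %g\n", m);
    return 0;
}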
Task Decomposition

• Divide tasks among processors
• Decide which data elements are going to be accessed (read and/or written) by which processors
• Example
Pipelining

• Special kind of task parallelism
• (Figure: Core 1: Stage 1, Core 2: Stage 2, Core 3: Stage 3, Core 4: Stage 4; at time t1 a data item enters Stage 1, at t2 it moves to Stage 2 while the next item enters Stage 1, and so on, so all four cores work on different items at once.)
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
  – Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  – Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  – Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  – Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  – Speedup = 110/11 = 10 (10% of potential)
• Assumes load can be balanced across processors
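A small C sketch (mine, not from the slides) that reproduces these numbers; it also covers the 100 × 100 matrix of the next slide.

#include <stdio.h>

/* Time model from the slides, in units of t_add: the 10 scalar additions stay
   sequential, the n x n matrix sum is split evenly over p processors. */
static double time_units(int n, int p) {
    return 10.0 + (double)n * n / p;
}

int main(void) {
    int sizes[] = {10, 100};            /* 10 x 10 matrix, then 100 x 100 */
    for (int k = 0; k < 2; k++) {
        int n = sizes[k];
        double t1 = time_units(n, 1);   /* single-processor time */
        for (int p = 10; p <= 100; p *= 10) {
            double s = t1 / time_units(n, p);
            printf("n=%3d  p=%3d  speedup=%5.1f  (%3.0f%% of potential)\n",
                   n, p, s, 100.0 * s / p);
        }
    }
    return 0;
}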

Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
– Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
– Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
– Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
– Speedup = 10010/110 = 91 (91% of potential)
• Assuming load balanced

Strong vs Weak Scaling

• Strong scaling: problem size fixed
• Weak scaling: problem size proportional to number of processors

§7.3 Shared Memory Multiprocessors
Shared Memory Multiprocessor
• SMP: shared memory multiprocessor
– Hardware provides single physical address space for all processors
– Synchronize shared variables using locks
– Memory access time
» UMA (uniform) vs. NUMA (nonuniform)

Cache Coherency
• Traffic per processor and the bus bandwidth determine the # of processors
• Caches can lower bus traffic
  – But replicating shared data in multiple caches creates the cache coherency problem

Cache Coherency
  Time   Event                    Cache A   Cache B   X (memory)
  0                                                   1
  1      CPU A reads X            1                   1
  2      CPU B reads X            1         1         1
  3      CPU A stores 0 into X    0         1         0

• After time 3, CPU B still caches the old value 1: its copy of X is stale, which is the cache coherency problem.
Cache Coherency Protocol
• Snooping Solution (Snoopy Bus):
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
Basic Snoopy Protocols
• Write Invalidate Protocol:
  – Multiple readers, single writer
  – Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  – Read miss:
    » Write-through: memory is always up-to-date
    » Write-back: snoop in caches to find the most recent copy
• Write Update Protocol:
  – Write to shared data: broadcast on bus, processors snoop and update copies
  – Read miss: memory is always up-to-date
• What happens if two processors try to write to the same shared data word in the same clock cycle?
  – Write serialization: the bus serializes requests
Basic Snoopy Protocols
• Invalidation (write-back cache)

  Processor activity       Bus activity           CPU A's cache   CPU B's cache   Memory location X
                                                                                  0
  CPU A reads X            Cache miss for X       0                               0
  CPU B reads X            Cache miss for X       0               0               0
  CPU A writes a 1 to X    Invalidation for X     1                               0
  CPU B reads X            Cache miss for X       1               1               1

• Update

  Processor activity       Bus activity           CPU A's cache   CPU B's cache   Memory location X
                                                                                  0
  CPU A reads X            Cache miss for X       0                               0
  CPU B reads X            Cache miss for X       0               0               0
  CPU A writes a 1 to X    Write broadcast of X   1               1               1
  CPU B reads X                                   1               1               1
Basic Snoopy Protocols
• Write Invalidate versus Broadcast:
– Invalidate requires one transaction per write-run
– Invalidate uses spatial locality: one transaction per block
– Update has lower latency between write and read
– Update: BW (increased) vs. latency (decreased) tradeoff

Invalidate protocol is more popular than update!


An Example Snoopy Protocol

• Invalidation protocol, write-back cache
• Each block of memory is in one state:
  – Clean in all caches and up-to-date in memory
  – OR Dirty in exactly one cache
  – OR Not in any caches
• Each cache block is in one state:
  – Shared: block can be read
  – OR Exclusive: cache has the only copy, it’s writeable, and dirty
  – OR Invalid: block contains no data
• Read misses: cause all caches to snoop
• Writes to a clean line are treated as misses (write invalidate)
Snoopy-Cache State Machine-I
• State machine for CPU requests, for each cache block (CPU side):
  – Invalid → Shared (read only): CPU read; place read miss on bus
  – Invalid → Exclusive (read/write): CPU write; place write miss on bus
  – Shared: CPU read hit (no bus action); CPU read miss stays Shared and places a read miss on bus
  – Shared → Exclusive: CPU write; place write miss on bus
  – Exclusive: CPU read hit and CPU write hit (no bus action)
  – Exclusive → Shared: CPU read miss; write back the block, place read miss on bus
  – Exclusive: CPU write miss; write back the cache block, place write miss on bus
Snoopy-Cache State Machine-II
• State machine for bus requests, for each cache block (snooping side):
  – Shared → Invalid: write miss for this block observed on the bus
  – Exclusive → Invalid: write miss for this block; write back the block (abort the memory access)
  – Exclusive → Shared: read miss for this block; write back the block (abort the memory access)
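The transition lists above can be condensed into one controller per cache block. The sketch below is a C rendering of this write-invalidate, write-back protocol; the enum names and stubbed bus actions are mine, not from the slides.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS,
               BUS_READ_MISS, BUS_WRITE_MISS } Event;

/* Bus actions are just logged in this sketch. */
static void place_read_miss_on_bus(void)  { printf("  bus: read miss\n"); }
static void place_write_miss_on_bus(void) { printf("  bus: write miss\n"); }
static void write_back_block(void)        { printf("  bus: write back block\n"); }

/* One transition of the write-invalidate, write-back snooping protocol. */
static BlockState next_state(BlockState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ_MISS)  { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE_MISS) { place_write_miss_on_bus(); return EXCLUSIVE; }
        return INVALID;                                 /* snooped traffic: nothing to do */
    case SHARED:
        if (e == CPU_READ_HIT)   return SHARED;
        if (e == CPU_READ_MISS)  { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE_HIT || e == CPU_WRITE_MISS)  /* write to a clean line = miss */
                                 { place_write_miss_on_bus(); return EXCLUSIVE; }
        if (e == BUS_WRITE_MISS) return INVALID;        /* another cache is writing */
        return SHARED;
    case EXCLUSIVE:
        if (e == CPU_READ_HIT || e == CPU_WRITE_HIT) return EXCLUSIVE;
        if (e == CPU_READ_MISS)  { write_back_block(); place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE_MISS) { write_back_block(); place_write_miss_on_bus(); return EXCLUSIVE; }
        if (e == BUS_READ_MISS)  { write_back_block(); return SHARED; }   /* supply dirty copy */
        if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
        return EXCLUSIVE;
    }
    return s;
}

int main(void) {
    /* P1's block for A1 during the example that follows: local write miss,
       local read hit, then a remote read miss and a remote write miss. */
    BlockState s = INVALID;
    s = next_state(s, CPU_WRITE_MISS);   /* -> Exclusive */
    s = next_state(s, CPU_READ_HIT);     /* -> Exclusive */
    s = next_state(s, BUS_READ_MISS);    /* -> Shared (with write back) */
    s = next_state(s, BUS_WRITE_MISS);   /* -> Invalid */
    printf("final state: %d\n", s);      /* prints 0 (INVALID) */
    return 0;
}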
Example
• Assumes the initial cache state is Invalid, A1 and A2 map to the same cache block, and A1 != A2
• Operation sequence traced below:
  – P1: Write 10 to A1
  – P1: Read A1
  – P2: Read A1
  – P2: Write 20 to A1
  – P2: Write 40 to A2
Example: Steps 1–5

Completed trace (the slides fill it in one operation per step, highlighting the state-machine transition just taken):

Step                  P1 state/addr/value    P2 state/addr/value    Bus action/proc/addr/value    Memory addr/value
P1: Write 10 to A1    Excl.  A1  10                                 WrMs  P1  A1
P1: Read A1           Excl.  A1  10
P2: Read A1                                  Shar.  A1              RdMs  P2  A1
                      Shar.  A1  10                                 WrBk  P1  A1  10              A1  10
                                             Shar.  A1  10          RdDa  P2  A1  10              A1  10
P2: Write 20 to A1    Inv.                   Excl.  A1  20          WrMs  P2  A1                  A1  10
P2: Write 40 to A2                                                  WrMs  P2  A2                  A1  10
                                             Excl.  A2  40          WrBk  P2  A1  20              A1  20
Coherency Misses: 4th C
Joins Compulsory, Capacity, Conflict

1. True sharing misses arise from the communication of data through the cache coherence mechanism
   • Invalidates due to the 1st write to a shared block
   • Reads by another CPU of a modified block held in a different cache
   • Miss would still occur if block size were 1 word
2. False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written into
   • Invalidation does not cause a new value to be communicated, but only causes an extra cache miss
   • Block is shared, but no word in the block is actually shared ⇒ miss would not occur if block size were 1 word
Example: True vs. False Sharing vs. Hit?
• Assume x1 and x2 are in the same cache block; P1 and P2 have both read x1 and x2 before.

  Time   P1          P2          True, False, Hit? Why?
  1      Write x1                True miss; invalidate x1 in P2
  2                  Read x2     False miss; x1 irrelevant to P2
  3      Write x1                False miss; x1 irrelevant to P2
  4                  Write x2    False miss; x1 irrelevant to P2
  5      Read x2                 True miss; invalidate x2 in P1
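A small pthreads sketch (mine, not from the slides) that typically exhibits false sharing: two threads repeatedly update different words that sit in the same cache block, so each write invalidates the other core's copy even though no data is actually shared. Padding the counters onto separate blocks, as in padded_line, usually removes the extra coherence misses.

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Two counters in the same cache block: candidates for false sharing. */
struct { long a, b; } shared_line;

/* Padded version: each counter gets (at least) its own 64-byte block. */
struct { long a; char pad[64]; long b; } padded_line;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_line.a++;   /* invalidates the line in the other core */
    return NULL;
}
static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) shared_line.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared_line.a, shared_line.b);
    /* Repeating the experiment with padded_line.a and padded_line.b in the two
       threads usually runs noticeably faster: same data, fewer coherence misses. */
    return 0;
}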
Communication Models

• Single Address Space: processors communicate with ordinary loads and stores (e.g., P0 executes "store x" and Pn later executes "load x" on the same location in the common physical address space)
• (Figure: the address space consists of a shared portion, visible to all processors P0 … Pn, and a private portion per processor.)
Program Example – Single-Address Space

• Sum 100,000 numbers on 100 processors (load & store)
• First step: each processor Pn sums its subset of the numbers
  – sum is a shared array of partial sums, one entry per processor; P0 handles elements 0–999, P1 the next 1,000, and so on
• Second step: add the partial sums via divide-and-conquer; in each round the lower half of the processors add in the partial sums held by the upper half, until the total ends up in sum[0]
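A C/pthreads sketch of the two steps (the thread setup, array contents, and handling of odd-sized halves are illustrative, not taken verbatim from the slides):

#include <pthread.h>
#include <stdio.h>

#define P 100                    /* processors (here: threads) */
#define N 100000                 /* numbers to sum */

static double a[N];              /* the shared data */
static double sum[P];            /* shared partial sums, one per processor */
static pthread_barrier_t bar;    /* plays the role of synch() */

static void *processor(void *arg) {
    long Pn = (long)arg;

    /* First step: each processor sums its own 1,000-element subset. */
    sum[Pn] = 0.0;
    for (long i = Pn * (N / P); i < (Pn + 1) * (N / P); i++)
        sum[Pn] += a[i];

    /* Second step: divide-and-conquer reduction on the shared sum[] array. */
    int half = P;
    do {
        pthread_barrier_wait(&bar);          /* wait for the previous round */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];         /* P0 picks up the odd leftover element */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) a[i] = 1.0;             /* sample data */
    pthread_barrier_init(&bar, NULL, P);

    pthread_t t[P];
    for (long Pn = 0; Pn < P; Pn++) pthread_create(&t[Pn], NULL, processor, (void *)Pn);
    for (long Pn = 0; Pn < P; Pn++) pthread_join(t[Pn], NULL);

    printf("total = %g\n", sum[0]);                      /* result ends up in sum[0] */
    pthread_barrier_destroy(&bar);
    return 0;
}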


Message Passing Multiprocessors
• Clusters: collections of computers connected via I/O over standard network switches to form a message-passing multiprocessor
• NUMA: Non-Uniform Memory Access, with a directory-based cache coherency protocol
• (Figure: nodes, each containing processors with caches plus local memory and I/O, connected by an interconnection network.)
Communication Models

• Multiple address spaces: message passing
• (Figure: Process P executes "Send x, Q, t" from its local address space; Process Q executes the matching "Recv y, P, t", which copies the value of x into y in Q's local address space.)
Parallel Program – Message Passing

• Sum 100,000 numbers on 100 processors (send & receive)
• First step: the numbers are distributed to the nodes (1,000 per node); each processor Pn sums its local subset
• Second step: add the partial sums via divide-and-conquer; in each round half of the nodes send their partial sums and the other half receive and add, until one node holds the total
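A sketch of the same computation with explicit send and receive; MPI is used here only as one concrete message-passing API (my choice, the slides assume generic send/receive primitives), and the reduction loop handles odd-sized halves the same way the shared-memory version does.

#include <mpi.h>
#include <stdio.h>

#define N_PER_NODE 1000          /* 100,000 numbers / 100 processors */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int Pn, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);   /* this node's number, 0..P-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &P);    /* number of nodes, e.g. 100 */

    double local[N_PER_NODE];
    for (int i = 0; i < N_PER_NODE; i++) local[i] = 1.0;    /* this node's share */

    /* First step: each node sums its own subset, with no communication. */
    double sum = 0.0;
    for (int i = 0; i < N_PER_NODE; i++) sum += local[i];

    /* Second step: divide-and-conquer; in each round the upper group of
       nodes sends its partial sums and the lower group receives and adds. */
    int limit = P, half = P;
    do {
        half = (half + 1) / 2;             /* dividing line between senders and receivers */
        if (Pn >= half && Pn < limit)
            MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
        if (Pn < limit / 2) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += partial;
        }
        limit = half;
    } while (half > 1);

    if (Pn == 0) printf("total = %g\n", sum);   /* node 0 ends up with the total */
    MPI_Finalize();
    return 0;
}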


Bisection Bandwidth is Important
• Total network bandwidth = bandwidth per link × number of links
• Bisection bandwidth = the bandwidth between two equal halves of a multiprocessor
• (Figure: a bus multicore, where all processor+cache nodes share a single bus, versus a ring multicore, where each processor+cache node has its own switch on the ring.)
Network Topology

• (Figure: example network topologies built from switches and processor–memory nodes: ring, 2D torus, cube, fully connected, and multistage networks.)

Network Characteristics

• Performance
– Latency per message (unloaded network)
– Throughput
» Link bandwidth
» Total network bandwidth
» Bisection bandwidth
– Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon

What is multi-core?

• (Figure: a conventional multiprocessor puts one core and its cache on each chip and connects the chips with an off-chip bus; a multi-core chip puts several cores, each with its own cache, on one chip connected by an on-chip bus.)

From Multicore to Manycore
Basic CMP Architecture
• L1 caches are always private to a core
• L2 caches can be private or shared – which is better?
• (Figure: Cores 1–4, each with private I-L1 and D-L1 caches, connected through an interconnection network to the L2.)
Scalable CMP Architecture
• Tiled CMP
– Each tile includes processor, L1, L2, and router
– Physically distributed last level cache
ARM big.LITTLE Technology
• ARM big.LITTLE processing is designed to deliver the vision of the right processor for the right job.
• In current big.LITTLE system implementations, a 'big' ARM Cortex™-A15 processor is paired with a 'LITTLE' Cortex™-A7 processor to create a system that can accomplish both high-intensity and low-intensity tasks in the most energy-efficient manner
  – Cortex-A15: heavy workloads
  – Cortex-A7: light workloads, like operating system activities, user interface, and other always-on, always-connected tasks

Multithreading
• Performing multiple threads of execution in parallel
  – Replicate registers, PC, etc.
  – Fast switching between threads
• Fine-grain multithreading
  – Switch threads after each cycle
  – Interleave instruction execution
  – If one thread stalls, others are executed
• Coarse-grain multithreading
  – Only switch on long stalls (e.g., L2-cache miss)
  – Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Multithreaded Categories
• (Figure: issue-slot utilization over time (processor cycles) for a superscalar, fine-grained multithreading, coarse-grained multithreading, and multiprocessing; colors distinguish Threads 1–5 and idle issue slots.)
Simultaneous Multithreading

• In a multiple-issue, dynamically scheduled processor
  – Schedule instructions from multiple threads
  – Instructions from independent threads execute when function units are available
  – Within threads, dependencies are handled by scheduling and register renaming
• Example: Intel Pentium 4 HT
  – Two threads: duplicated registers, shared function units and caches

Multithreaded Categories
• (Figure: the same issue-slot diagram extended with simultaneous multithreading, where instructions from several threads (Threads 1–5) share the issue slots of a single cycle, leaving far fewer idle slots.)
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Computing Device Classification:
Instruction and Data Streams

                            Data Streams
                            Single                Multiple
  Instruction    Single     SISD:                 SIMD:
  Streams                   Intel Pentium 4       SSE instructions of x86
                 Multiple   MISD:                 MIMD:
                            No examples today     Intel Xeon e5345

• SPMD: Single Program Multiple Data
  – A parallel program on a MIMD computer
  – Conditional code for different processors

SIMD
• SIMD architectures can exploit significant data-level parallelism for:
  – matrix-oriented scientific computing
  – media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
  – Only needs to fetch one instruction per data operation
  – Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially
SIMD Instruction Set Extensions for Multimedia
SIMD Extensions
• Media applications operate on data types narrower than the native word size
  – e.g., 4-byte registers holding R, G, B pixel components, one byte each
• Implementations:
  – Intel MMX (1996)
    » Eight 8-bit integer ops or four 16-bit integer ops
  – Streaming SIMD Extensions (SSE) (1999)
    » Eight 16-bit integer ops
    » Four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Advanced Vector Extensions (AVX) (2010)
    » Four 64-bit integer/fp ops
  – Operands must be in consecutive and aligned memory locations
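A small C illustration of such an extension using SSE intrinsics (my example, not from the slides): one instruction adds four 32-bit floats at a time. Unaligned loads keep the sketch simple; the slide's note about consecutive, aligned operands reflects how the extensions are meant to be used for best performance.

#include <stdio.h>
#include <xmmintrin.h>     /* SSE intrinsics: __m128 holds four 32-bit floats */

/* c[i] = a[i] + b[i], four elements per instruction; n is assumed to be a multiple of 4. */
static void add_floats(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);             /* load 4 consecutive floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));    /* 4 additions in one SIMD op */
    }
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    add_floats(a, b, c, 8);
    for (int i = 0; i < 8; i++) printf("%g ", c[i]);  /* prints 9 eight times */
    printf("\n");
    return 0;
}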
Vector Architectures
• Basic idea:
  – Read sets of data elements into “vector registers” (gather)
  – Operate on those registers
  – Highly pipelined function units
  – Disperse the results back into memory (scatter)

Vector Extension to RISC-V
• v0 to v31: 32 vector registers, each holding 64 × 64-bit elements
• Vector instructions
  – fld.v, fsd.v: load/store vector
  – fadd.d.v: add vectors of doubles
  – fadd.d.vs: add a scalar to each element of a vector of doubles
• Significantly reduces instruction-fetch bandwidth

Example: DAXPY (Y = a × X + Y)
Conventional RISC-V code:
fld f0,a(x3) // load scalar a
addi x5,x19,512 // end of array X
loop: fld f1,0(x19) // load x[i]
fmul.d f1,f1,f0 // a * x[i]
fld f2,0(x20) // load y[i]
fadd.d f2,f2,f1 // a * x[i] + y[i]
fsd f2,0(x20) // store y[i]
addi x19,x19,8 // increment index to x
addi x20,x20,8 // increment index to y
bltu x19,x5,loop // repeat if not done

Vector RISC-V code:


fld f0,a(x3) // load scalar a
fld.v v0,0(x19) // load vector x
fmul.d.vs v0,v0,f0 // vector-scalar multiply
fld.v v1,0(x20) // load vector y
fadd.d.v v1,v1,v0 // vector-vector add
fsd.v v1,0(x20) // store vector y
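For reference, a plain C rendering (mine) of what both sequences compute; the 64-element length matches the 512 bytes of doubles that the scalar loop steps through.

/* DAXPY: Y = a * X + Y over 64 double-precision elements. */
void daxpy(double a, const double x[64], double y[64]) {
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}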

Vector vs. Scalar

• Vector architectures and compilers
  – Simplify data-parallel programming
  – Explicit statement of absence of loop-carried dependences
    » Reduced checking in hardware
  – Regular access patterns benefit from interleaved and burst memory
  – Avoid control hazards by avoiding loops
• More general than ad-hoc media extensions (such as MMX, SSE)
  – Better match with compiler technology

Multiple-Lane Vector units
• Vector units can be a combination of pipelined and arrayed functional units (multiple parallel lanes)

What is GPU?
• Graphics Processing Unit
• A GPU is a device that computes massive numbers of vertices, pixels, and general-purpose data items
• Features
  – High availability
  – High computing performance
  – Low price per unit of computing capability
• (Figure: peak GFLOPS from Jan 2003 to Jul 2007: NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, GeForce 8800 GTX, Quadro FX 5600, Tesla C870) climb toward roughly 500 GFLOPS, far above contemporary 3.0 GHz Pentium 4, Core 2 Duo, and Core 2 Quad CPUs.)
GPU’s History and Evolution
• Early history
  – In the early 90s, graphics were performed only by a video graphics array (VGA) controller
  – In 1997, VGA controllers started to incorporate 3D acceleration functions
  – In 2000, the term GPU was coined to denote that the graphics device had become a processor
• GPU evolution
  – Fixed-function → programmable
    » 1999, NVIDIA GeForce 256: fixed-function vertex transform and pixel pipeline
    » 2001, NVIDIA GeForce 3: 1st programmable vertex processor
    » 2002, ATI Radeon 9700: 1st programmable pixel (fragment) processor
  – Non-unified processors → unified processors
    » 2005, Microsoft XBOX 360: 1st unified shader architecture
  – Tesla GPU series released in 2007
  – Fermi architecture released in 2009
  – Kepler architecture released in 2012
  – Maxwell architecture released in 2014
  – Pascal architecture released in 2016 (16 nm FinFET process)
  – Volta architecture released in 2017 – Tensor Cores for AI (12 nm process)
  – Turing architecture released in 2018

Fixed-function 3D Graphics Pipeline
• (Figure: the application feeds vertices and textures into the fixed-function pipeline, controls it by setting render state and issuing draw calls, and accesses it through graphics programming languages/APIs such as DirectX and OpenGL.)
Programmable 3D Graphics Pipeline
• Shaders are written in a high-level shader language (HLSL)
• Both vertex computation and pixel computation stages are programmable
• In the fixed-function pipeline, transformation, shading, and lighting are configured by setting render state
• In the programmable pipeline, these are done by shader code written by engineers
Unified Shader Architecture
• Use the same shader processors for all types of computation
  – Vertex threads
  – Pixel threads
  – Computation threads
• Advantages
  – Better resource utilization
  – Lower hardware complexity
Modern GPUs: A Computing Device
• GPUs have orders of magnitude more computing power than CPUs
• General-purpose tasks with a high degree of data-level parallelism run faster on a GPU than on a CPU
  ⇒ General-Purpose computing on GPU (GPGPU)
• GPGPU programming models
  – NVIDIA’s CUDA
  – AMD’s StreamSDK
  – OpenCL
• Reported GPGPU speedups over CPU:
  – Medical Imaging: 300×
  – Molecular Dynamics: 150×
  – SPICE: 130×
  – Fourier Transform: 130×
  – Fluid Dynamics: 100×
Fundamental Architectural
Differences between CPU & GPU
• Multi-core CPU
  – Coarse-grain, heavyweight threads
  – Memory latency is resolved through large on-chip caches & out-of-order execution
• Modern GPU
  – Fine-grain, lightweight threads
  – Exploits thread-level parallelism to hide latency
• (Figure: a CPU die devotes much of its area to out-of-order control logic, branch predictor, memory prefetcher, and non-blocking caches around a few ALUs; a GPU die devotes most of its area to many ALUs; both are backed by DRAM.)
SIMD processor

SIMT Execution Model of GPUs
• SIMT (Single Instruction Multiple Threads)
• Warp
  – A group of threads (pixel, vertex, compute, …)
  – Basic scheduling/execution unit
  – Common PC value
• (Figure: a thread block is split into warps: thread IDs 1–32 form Warp 1, 33–64 form Warp 2, and so on; over time the scheduler interleaves instructions from different warps, e.g. Warp 1 instruction 30, Warp 4 instruction 1, Warp 10 instruction 13, Warp 16 instruction 7, then Warp 1 instruction 31, Warp 10 instruction 14, Warp 4 instruction 2, Warp 16 instruction 8.)
GPU Memory Structures

Latency Hiding
• Interleaved warp execution
• (Figure: when the currently executing warp stalls, the scheduler switches to another ready warp: Warp 1 instruction 30, then Warp 4, Warp 10, Warp 16, back to Warp 1 instruction 31, and so on, so the execution units stay busy while stalled warps wait for their data.)
Volta

Turing

Graphics in the System

CPU/GPU Integration: CPU’s Advancement Meets GPU’s
• (Figure: microprocessors advanced from the single-thread era through the multi-core era toward heterogeneous CPU/GPU systems, gaining high-performance programmability and task-parallel execution; GPUs advanced from driver-based vertex/pixel shaders usable by experts only toward power-efficient, data-parallel, system-level programmable throughput engines. The two trajectories meet in mainstream heterogeneous computing.)
Heterogeneous Computing ~ 2011
• Intel Sandy Bridge
  – Shared last-level cache (LLC) and main memory
• AMD Fusion APU (Accelerated Processing Unit)
  – Shared main memory
• (Figure: die photos of the AMD Fusion APU and Intel Sandy Bridge, each integrating CPU cores and a GPU on one chip.)
Evolution of Heterogeneous Computing
▪ Dedicated GPU
– GPU kernel is launched through the device driver
– Separate CPU/GPU address space
– Separate system/GPU memory
– Data copy between CPU/GPU via PCIe

• (Figure: an OpenCL application and runtime library run in user space on the CPU, and the GPU device driver in kernel space launches the kernel. The CPU cores use an address space managed by the OS, backed by coherent system memory; the GPU compute units use an address space managed by the driver, backed by non-coherent GPU memory; data is copied between the two over PCIe.)
Evolution of Heterogeneous Computing
▪ Integrated GPU architecture
– GPU kernel is launched through the device driver
– Separate CPU/GPU address space
– Separate system/GPU memory
– Data copy between CPU/GPU via memory bus

• (Figure: the same software stack as the dedicated-GPU case, but the CPU cores and GPU compute units are on one chip; the coherent system-memory partition and the non-coherent GPU-memory partition both live in physical memory, and copies travel over the memory bus instead of PCIe.)
Evolution of Heterogeneous Computing
▪ Integrated GPU architecture
– GPU kernel is launched through the device driver
– Unified CPU/GPU address space (managed by OS)
– Unified system/GPU memory
– No data copy - data can be retrieved by pointer passing
• (Figure: CPU cores and GPU compute units share a single OS-managed address space and coherent system memory, including a shared LLC, so the GPU kernel can dereference the same pointers the CPU uses without any data copy.)
CUDA Unified Memory

§6.10 Multiprocessor Benchmarks and Performance Models
Parallel Benchmarks
• Linpack: matrix linear algebra
• SPECrate: parallel run of SPEC CPU programs
  – Job-level parallelism
• SPLASH: Stanford Parallel Applications for Shared Memory
  – Mix of kernels and applications, strong scaling
• NAS (NASA Advanced Supercomputing) suite
  – Computational fluid dynamics kernels
• PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  – Multithreaded applications using Pthreads and OpenMP
Modeling Performance: the Roofline Model
• Target performance metric
  – Achievable GFLOPs/sec
• Hardware: for a given computer, determine
  – Peak GFLOPS (from the data sheet)
  – Peak memory bytes/sec (using the Stream benchmark)
• Software: arithmetic intensity of a kernel
  – FLOPs per byte of memory accessed

Roofline: A Simple Performance Model

• Floating-point ops/sec = bytes/sec × floating-point ops per byte
  – Floating-point ops per byte = arithmetic intensity
• (Figure: example roofline plot, assuming a 16 GB/sec peak memory bandwidth.)

Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
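A tiny C sketch of this bound (the 16 GB/sec bandwidth and 64 GFLOPs/sec peak are illustrative values, not from the slides):

#include <stdio.h>

/* Attainable GFLOPs/sec = min(peak memory BW x arithmetic intensity, peak FP performance). */
static double roofline(double peak_gflops, double peak_bw_gbs, double arith_intensity) {
    double memory_bound = peak_bw_gbs * arith_intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
    /* Illustrative machine: 16 GB/sec memory bandwidth, 64 GFLOPs/sec peak. */
    for (double ai = 0.25; ai <= 16.0; ai *= 2)
        printf("AI = %5.2f FLOPs/byte -> attainable %5.1f GFLOPs/sec\n",
               ai, roofline(64.0, 16.0, ai));
    /* Kernels left of the ridge point (AI = 64/16 = 4) are memory-bound,
       kernels to the right of it are compute-bound. */
    return 0;
}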

Comparing Systems
• Example: Opteron X2 vs. Opteron X4
  – 2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
  – Same memory system
• To get higher performance on X4 than X2
  – Need high arithmetic intensity
  – Or the working set must fit in X4’s 2 MB L3 cache

