Lecture 9 Multi-Processor
Parallel Computers
• Goal: connecting multiple computers
to get higher performance
– Multiprocessors
– Scalability, availability, power efficiency
Snoopy-Cache State Machine-I
• State machine for CPU requests for each cache block
  – States: Invalid, Shared (read only), Exclusive (read/write)
  – Invalid → Shared on a CPU read miss: place read miss on bus
  – Invalid or Shared → Exclusive on a CPU write: place write miss on bus
  – Shared: a CPU read hit stays Shared; a CPU read miss places a read miss on the bus
  – Exclusive: CPU read and write hits stay Exclusive; a CPU read miss writes back the block and places a read miss on the bus; a CPU write miss writes back the block and places a write miss on the bus
Snoopy-Cache State Machine-II
• State machine for bus requests for each cache block
  – States: Invalid, Shared (read only), Exclusive (read/write)
  – Write miss for this block: Shared → Invalid
  – Write miss for this block: Exclusive → Invalid; write back block (abort memory access)
  – Read miss for this block: Exclusive → Shared; write back block (abort memory access)
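As a rough illustration only (not from the slides), the two diagrams can be condensed into a pair of transition functions. The state names and printed bus actions follow the figures; everything else, including the simplification that every miss refers to the currently cached block, is assumed.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } State;
typedef enum { CPU_READ, CPU_WRITE } CpuOp;
typedef enum { BUS_READ_MISS, BUS_WRITE_MISS } BusOp;

/* State machine I: the cache reacting to its own CPU's request. */
State cpu_transition(State s, CpuOp op) {
    if (op == CPU_READ) {
        if (s == INVALID) { printf("place read miss on bus\n"); return SHARED; }
        return s;                               /* read hit in Shared or Exclusive */
    }
    if (s == EXCLUSIVE) return EXCLUSIVE;       /* write hit */
    printf("place write miss on bus\n");        /* write from Invalid or Shared */
    return EXCLUSIVE;
}

/* State machine II: another cache snooping a miss on the bus. */
State bus_transition(State s, BusOp op) {
    if (s == EXCLUSIVE) printf("write back block; abort memory access\n");
    if (op == BUS_WRITE_MISS) return INVALID;   /* another processor is writing */
    if (s == EXCLUSIVE) return SHARED;          /* read miss: downgrade and share */
    return s;                                   /* read miss leaves Shared/Invalid alone */
}

Hand-tracing these functions reproduces the P1/P2 state columns in the first four steps of the example that follows (step 5 also needs the block-address check that this sketch omits).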
Example
(Assumes A1 and A2 map to the same cache block, and that the block is initially Invalid in both caches.)

step               | P1: State Addr Value | P2: State Addr Value | Bus: Action Proc. Addr Value | Memory: Addr Value
P1: Write 10 to A1 |                      |                      |                              |
P1: Read A1        |                      |                      |                              |
P2: Read A1        |                      |                      |                              |
P2: Write 20 to A1 |                      |                      |                              |
P2: Write 40 to A2 |                      |                      |                              |
Example: Step 1
step               | P1: State Addr Value | P2: State Addr Value | Bus: Action Proc. Addr Value | Memory: Addr Value
P1: Write 10 to A1 | Excl. A1 10          |                      | WrMs P1 A1                   |
P1: Read A1        |                      |                      |                              |
P2: Read A1        |                      |                      |                              |
P2: Write 20 to A1 |                      |                      |                              |
P2: Write 40 to A2 |                      |                      |                              |
Example: Step 2
step               | P1: State Addr Value | P2: State Addr Value | Bus: Action Proc. Addr Value | Memory: Addr Value
P1: Write 10 to A1 | Excl. A1 10          |                      | WrMs P1 A1                   |
P1: Read A1        | Excl. A1 10          |                      |                              |
P2: Read A1        |                      |                      |                              |
P2: Write 20 to A1 |                      |                      |                              |
P2: Write 40 to A2 |                      |                      |                              |
Example: Step 3
step               | P1: State Addr Value | P2: State Addr Value | Bus: Action Proc. Addr Value | Memory: Addr Value
P1: Write 10 to A1 | Excl. A1 10          |                      | WrMs P1 A1                   |
P1: Read A1        | Excl. A1 10          |                      |                              |
P2: Read A1        |                      | Shar. A1             | RdMs P2 A1                   |
                   | Shar. A1 10          |                      | WrBk P1 A1 10                | A1 10
                   |                      | Shar. A1 10          | RdDa P2 A1 10                | A1 10
P2: Write 20 to A1 |                      |                      |                              |
P2: Write 40 to A2 |                      |                      |                              |
Example: Step 4
step               | P1: State Addr Value | P2: State Addr Value | Bus: Action Proc. Addr Value | Memory: Addr Value
P1: Write 10 to A1 | Excl. A1 10          |                      | WrMs P1 A1                   |
P1: Read A1        | Excl. A1 10          |                      |                              |
P2: Read A1        |                      | Shar. A1             | RdMs P2 A1                   |
                   | Shar. A1 10          |                      | WrBk P1 A1 10                | A1 10
                   |                      | Shar. A1 10          | RdDa P2 A1 10                | A1 10
P2: Write 20 to A1 | Inv.                 | Excl. A1 20          | WrMs P2 A1                   | A1 10
P2: Write 40 to A2 |                      |                      |                              |
Example: Step 5
step               | P1: State Addr Value | P2: State Addr Value | Bus: Action Proc. Addr Value | Memory: Addr Value
P1: Write 10 to A1 | Excl. A1 10          |                      | WrMs P1 A1                   |
P1: Read A1        | Excl. A1 10          |                      |                              |
P2: Read A1        |                      | Shar. A1             | RdMs P2 A1                   |
                   | Shar. A1 10          |                      | WrBk P1 A1 10                | A1 10
                   |                      | Shar. A1 10          | RdDa P2 A1 10                | A1 10
P2: Write 20 to A1 | Inv.                 | Excl. A1 20          | WrMs P2 A1                   | A1 10
P2: Write 40 to A2 |                      |                      | WrMs P2 A2                   | A1 10
                   |                      | Excl. A2 40          | WrBk P2 A1 20                | A1 20
Coherency Misses: 4th C
Joins Compulsory, Capacity, Conflict
Program Example – Single-Address Space
[Figure: all processors share one address space; each processor, with its memory and I/O, attaches to an interconnection network, and an array with elements 0..999 (sum) is directly accessible to every processor, e.g., P0]

Parallel Program – Message Passing
[Figure: Process P executes Send x, Q, t and Process Q executes the matching Recv y, P, t; the value of x is delivered into y]
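A minimal MPI sketch of this send/receive pairing (the rank numbers and tag value are assumed for illustration): process P sends x to Q, and Q's matching receive delivers it into y.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int P = 0, Q = 1, t = 7;                  /* assumed ranks and message tag */
    if (rank == P) {
        double x = 3.14;
        MPI_Send(&x, 1, MPI_DOUBLE, Q, t, MPI_COMM_WORLD);        /* Send x, Q, t */
    } else if (rank == Q) {
        double y;
        MPI_Recv(&y, 1, MPI_DOUBLE, P, t, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                               /* Recv y, P, t */
        printf("Q received %f\n", y);
    }

    MPI_Finalize();
    return 0;
}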
Ring Multicore
[Figure: a bus-based multicore, each processor (p) with its cache (c) on a shared BUS, versus a ring multicore in which each processor/cache pair attaches to a switch (s) and the switches are connected in a ring]

Network Topology
[Figure: processor–memory nodes connected through switches in different topologies: fully connected, ring, 2D torus, and n-cube]
• Bisection bandwidth = the bandwidth between two equal parts of a multiprocessor
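As a hedged worked example with the standard link counts (not taken from the slides), for P nodes and per-link bandwidth B:

\text{bus: } BW_{\text{bisection}} = B, \qquad
\text{ring: } BW_{\text{bisection}} = 2B, \qquad
\text{fully connected: } BW_{\text{bisection}} = \left(\tfrac{P}{2}\right)^{2} B

Halving a ring cuts exactly two links, while in a fully connected network every node in one half has a link to every node in the other half.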
Multistage Networks
Network Characteristics
• Performance
– Latency per message (unloaded network)
– Throughput
» Link bandwidth
» Total network bandwidth
» Bisection bandwidth
– Congestion delays (depending on traffic)
• Cost
• Power
• Routability in silicon
[Figure: chips connected by a bus versus cores connected by an on-chip Interconnection Network with distributed L2 banks]
Scalable CMP Architecture
• Tiled CMP
– Each tile includes processor, L1, L2, and router
– Physically distributed last level cache
ARM big.LITTLE Technology
• ARM big.LITTLE processing is designed to deliver the vision of the right processor for the right job.
• In current big.LITTLE system implementations, a ‘big’ ARM Cortex-A15 processor is paired with a ‘LITTLE’ Cortex-A7 processor to create a system that can handle both high-intensity and low-intensity tasks in the most energy-efficient manner.
  – Cortex-A15 – heavy workloads
  – Cortex-A7 – light workloads, such as operating system activities, user interface, and other always-on, always-connected tasks
Multithreading
• Performing multiple threads of execution in
parallel
– Replicate registers, PC, etc.
– Fast switching between threads
• Fine-grain multithreading
– Switch threads after each cycle
– Interleave instruction execution
– If one thread stalls, others are executed
• Coarse-grain multithreading
– Only switch on long stall (e.g., L2-cache miss)
– Simplifies hardware, but doesn’t hide short stalls (e.g., data hazards)
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, and multiprocessing]
Simultaneous Multithreading
§7.6 SISD, MIMD, SIMD, SPMD, and Vector
Computing Device Classification: Instruction and Data Streams

                               Data Streams
                               Single                    Multiple
Instruction   Single    SISD: Intel Pentium 4      SIMD: SSE instructions of x86
Streams       Multiple  MISD: No examples today    MIMD: Intel Xeon e5345
Example: DAXPY (Y = a × X + Y)
Conventional RISC-V code:
fld f0,a(x3) // load scalar a
addi x5,x19,512 // end of array X
loop: fld f1,0(x19) // load x[i]
fmul.d f1,f1,f0 // a * x[i]
fld f2,0(x20) // load y[i]
fadd.d f2,f2,f1 // a * x[i] + y[i]
fsd f2,0(x20) // store y[i]
addi x19,x19,8 // increment index to x
addi x20,x20,8 // increment index to y
bltu x19,x5,loop // repeat if not done
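For comparison, a rough CUDA sketch of the same DAXPY with one thread per element; the kernel name and launch configuration are assumptions, not taken from the lecture.

__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)                                       // guard the last, partially full block
        y[i] = a * x[i] + y[i];
}

// Host-side launch (assumed sizes): 256 threads per block, enough blocks to cover n
// daxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);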
[Figure: peak GFLOPS from Jan 2003 to Jul 2007: NVIDIA GPUs (NV30, NV35, NV40, G70, G70-512, G71, Quadro FX 5600, GeForce 8800 GTX) climb toward roughly 500 GFLOPS, while 3.0 GHz Intel CPUs (Pentium 4, Core 2 Duo, Core 2 Quad) remain below 100 GFLOPS]
GPU’s History and Evolution
• Early History
  – In the early 1990s, graphics were performed only by a video graphics array (VGA) controller
  – In 1997, VGA controllers started to incorporate 3D acceleration functions
  – In 2000, the term GPU was coined to denote that the graphics device had become a processor
• GPU Evolution
  – Fixed-Function → Programmable
    » 1999, NVIDIA GeForce 256: fixed-function vertex transform and pixel pipeline
    » 2001, NVIDIA GeForce 3: 1st programmable vertex processor
    » 2002, ATI Radeon 9700: 1st programmable pixel (fragment) processor
  – Non-unified Processors → Unified Processors
    » 2005, Microsoft Xbox 360: 1st unified shader architecture
  – Tesla GPU series released in 2007
  – Fermi Architecture released in 2009
  – Kepler Architecture released in 2012
  – Maxwell Architecture released in 2014
  – Pascal Architecture released in 2016 (16 nm FinFET process)
  – Volta Architecture released in 2017 – Tensor Core for AI (12 nm process)
  – Turing Architecture released in 2018
Fixed-function 3D Graphics Pipeline
[Figure: the application sets render state and issues draw calls, supplying vertices and textures to a fixed-function pipeline]
• Programming languages/APIs: DirectX, OpenGL
Programmable 3D Graphics Pipeline
[Figure: programmable pipeline stages, e.g., vertex computation, written in a High Level Shader Language (HLSL)]
Unified Shader Architecture
• Use the same shader processors for all types of computation
  – Vertex threads
  – Pixel threads
  – Computation threads
• Advantage
  – Better resource utilization
  – Lower hardware complexity
Modern GPUs: A Computing Device
• GPUs have orders of magnitude more computing power than CPUs
• General-purpose tasks with a high degree of data-level parallelism run faster on the GPU than on the CPU
  => General-Purpose computing on GPU (GPGPU)
• GPGPU programming models
  – NVIDIA’s CUDA
  – AMD’s Stream SDK
  – OpenCL
• GPGPU Performance (reported speedups over CPU): Medical Imaging 300X, Molecular Dynamics 150X, SPICE 130X, Fourier Transform 130X, Fluid Dynamics 100X
Fundamental Architectural Differences between CPU & GPU
• Multi-core CPU
  – Coarse-grain, heavyweight threads
  – Memory latency is resolved through large on-chip caches & out-of-order execution
• Modern GPU
  – Fine-grain, lightweight threads
  – Exploits thread-level parallelism to hide latency
[Figure: CPU and GPU, each attached to its own DRAM]
[Figure: a GPU built as an array of SIMD processors]
SIMT Execution Model of GPUs
• SIMT (Single Instruction, Multiple Threads)
• Warp
  – A group of threads (pixel, vertex, compute, ...)
  – Basic scheduling/execution unit
  – Threads in a warp share a common PC value
[Figure: thread IDs 33~64 form one warp; warps (Warp 2 ... Warp n) are interleaved over time, e.g., Warp 1 Instruction 30, Warp 4 Instruction 1, ..., Warp 1 Instruction 31, Warp 4 Instruction 2]
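A small CUDA sketch (kernel name assumed, not from the lecture) of how consecutive thread IDs in a block map onto warps of warpSize (32 on current NVIDIA GPUs) threads:

#include <cstdio>

__global__ void show_warps() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    int warp = threadIdx.x / warpSize;                   // warp index within the block
    int lane = threadIdx.x % warpSize;                   // position within the warp
    if (lane == 0)                                       // print once per warp
        printf("thread %d is lane 0 of warp %d in block %d\n", tid, warp, blockIdx.x);
}

int main() {
    show_warps<<<2, 64>>>();     // 2 blocks x 64 threads = 2 warps per block
    cudaDeviceSynchronize();     // wait for device-side printf output to flush
    return 0;
}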
Latency Hiding
• Interleaved warp execution
[Figure: over time, Warp 1 Instruction 30 executes and then stalls; Warp 4 Instruction 1, Warp 10 Instruction 13, and Warp 16 Instruction 7 issue while it waits; Warp 1 Instruction 31 and Warp 4 Instruction 2 follow once ready. Legend: executing, waiting (ready), stalled]
Volta
Turing
Graphics in the System
[Figure: where the GPU attaches in the system]

GPU Advancement
[Figure: GPUs plotted by programmability versus throughput performance: evolution from fixed vertex/pixel shaders toward power-efficient, programmable designs; single-thread performance alone is unacceptable, so the mainstream moves to high-performance multi-core heterogeneous CPU/GPU integration]
Heterogeneous Computing ~ 2011
• AMD Fusion
• Intel Sandy Bridge
• Shared last-level cache (LLC) and main memory
Evolution of Heterogeneous Computing
▪ Dedicated GPU
  – GPU kernel is launched through the device driver
  – Separate CPU/GPU address space
  – Separate system/GPU memory
  – Data copy between CPU/GPU via PCIe
[Figure: a process launches the kernel through the driver; the CPU (with LLC) uses coherent system memory, and the GPU (with its L2) uses separate non-coherent GPU memory across PCIe]
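A hedged CUDA sketch of this dedicated-GPU flow (array size, kernel, and launch parameters are assumed): allocate host and device buffers separately, copy the input across PCIe, launch the kernel, and copy the result back.

#include <cstdio>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                       // simple element-wise kernel
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h = new float[n];                            // system (host) memory
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, bytes);                     // separate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);    // explicit copy over PCIe

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);        // kernel launch via the driver

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    // copy the result back
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    delete[] h;
    return 0;
}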
Evolution of Heterogeneous Computing
▪ Integrated GPU architecture
  – GPU kernel is launched through the device driver
  – Separate CPU/GPU address space
  – Separate system/GPU memory
  – Data copy between CPU/GPU via memory bus
[Figure: CPU (LLC) and GPU (L2) integrated on the same chip; the coherent system-memory region and the non-coherent GPU-memory region live in the same physical memory, so copies travel over the memory bus rather than PCIe]
Evolution of Heterogeneous Computing
▪ Integrated GPU architecture
  – GPU kernel is launched through the device driver
  – Unified CPU/GPU address space (managed by OS)
  – Unified system/GPU memory
  – No data copy: data can be retrieved by pointer passing
[Figure: an OpenCL application and OpenCL runtime library run in user space; the CPU prepares data and the GPU computes on it directly in coherent system memory, with part of the address space managed by the OS and part by the driver]
CUDA Unified Memory
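A hedged sketch of the same kind of computation using CUDA Unified Memory (cudaMallocManaged); the kernel and sizes are assumed. One pointer is visible to both CPU and GPU, so the explicit copies disappear.

#include <cstdio>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged((void **)&data, n * sizeof(float));   // single managed allocation

    for (int i = 0; i < n; i++) data[i] = 1.0f;             // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);         // GPU uses the same pointer
    cudaDeviceSynchronize();                                // sync before the CPU reads

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}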
§6.10 Multiprocessor Benchmarks and Performance Models
Parallel Benchmarks
l Linpack: matrix linear algebra
l SPECrate: parallel run of SPEC CPU programs
l Job-level parallelism
l SPLASH: Stanford Parallel Applications for
Shared Memory
l Mix of kernels and applications, strong scaling
l NAS (NASA Advanced Supercomputing) suite
l computational fluid dynamics kernels
l PARSEC (Princeton Application Repository for
Shared Memory Computers) suite
l Multithreaded applications using Pthreads and
OpenMP
Modeling Performance: Roofline Model
l Target performance metric
l Achievable GFLOPs/sec
l Hardware: for a given computer, determine
l Peak GFLOPS (from data sheet)
l Peak memory bytes/sec (using Stream benchmark)
l Software: Arithmetic intensity of a kernel
l FLOPs per byte of memory accessed
Roofline: A Simple Performance Model

  Floating-Point Ops/sec = (Floating-Point Ops/Byte) × (Bytes/sec)
                         = Arithmetic Intensity × Memory Bandwidth

  Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)
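A hedged worked example with assumed machine numbers (16 GFLOP/s peak floating-point performance, 8 GB/s peak memory bandwidth), evaluating the bound for a memory-bound and a compute-bound kernel:

#include <cstdio>
#include <algorithm>

// Attainable GFLOP/s = min(Peak Memory BW x Arithmetic Intensity, Peak FP Performance)
double attainable_gflops(double peak_gflops, double peak_bw_gbs, double intensity) {
    return std::min(peak_bw_gbs * intensity, peak_gflops);
}

int main() {
    const double peak = 16.0, bw = 8.0;          // assumed machine parameters
    printf("AI = 0.5 FLOP/byte -> %.1f GFLOP/s (memory-bound)\n",
           attainable_gflops(peak, bw, 0.5));    // min(8*0.5, 16) = 4
    printf("AI = 4.0 FLOP/byte -> %.1f GFLOP/s (compute-bound)\n",
           attainable_gflops(peak, bw, 4.0));    // min(8*4.0, 16) = 16
    return 0;
}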