COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface, 5th Edition
Parallel Processors from
Client to Cloud
Introduction
Goal: connecting multiple computers
to get higher performance
Multiprocessors
Scalability, availability, power efficiency
Task-level (process-level)
parallelism
High throughput for independent jobs
Parallel processing program
Single program run on multiple
processors
Multicore microprocessors
Chips with multiple processors (cores)
Parallelism Basics – Introduction
Parallelism and Instructions
Synchronization
Parallelism and Computer Arithmetic
Subword Parallelism
Parallelism and Advanced
Instruction-Level Parallelism
Parallelism and Memory
Hierarchies
Cache Coherence
Parallel Computers
Definition: “A parallel computer is a collection of
processing elements that cooperate and
communicate to solve large problems fast.”
Questions about parallel computers:
How large a collection?
How powerful are processing elements?
How do they cooperate and communicate?
How are data transmitted?
What type of interconnection?
What are HW and SW primitives for programmer?
Does it translate into performance?
What Level of Parallelism?
Bit-level parallelism: 1970 to ~1985
4-bit, 8-bit, 16-bit, 32-bit microprocessors
Instruction level parallelism (ILP):
~1985 through today
Pipelining
Superscalar
VLIW
Out-of-Order execution
Limits to benefits of ILP?
Process-level or thread-level parallelism:
mainstream for general-purpose computing?
Servers are parallel
High-end desktop: dual-processor PC
Why Multiprocessors?
1. Microprocessors as the fastest CPUs
• Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
• Do we have enough ideas to sustain 2X/1.5yr?
• Can we deliver such complexity on schedule?
• Limit to ILP due to data dependency
3. Slow (but steady) improvement in parallel
software (scientific apps, databases, OS)
4. Emergence of embedded and server markets
driving microprocessors in addition to desktops
• Embedded functional parallelism
• Network processors exploiting packet-level parallelism
• SMP servers and clusters of workstations for multiple users –
less demand for parallel computing
Instruction-Level Parallelism (ILP)
Pipelining: executing multiple instructions in
parallel
To increase ILP
Deeper pipeline
Less work per stage ⇒ shorter clock cycle
Multiple issue
Replicate pipeline stages ⇒ multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4 GHz, 4-way multiple issue
⇒ 16 BIPS, peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice
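As a small illustration of the peak-rate arithmetic above, the following C sketch (numbers taken from the slide; variable names are my own) computes the peak instruction rate, IPC, and CPI for a 4 GHz, 4-way machine:

#include <stdio.h>

int main(void) {
    double clock_ghz = 4.0;   /* 4 GHz clock rate          */
    int issue_width  = 4;     /* 4-way multiple issue      */
    double peak_bips = clock_ghz * issue_width;   /* 16 billion instructions/s */
    double peak_ipc  = (double)issue_width;       /* 4 instructions per cycle  */
    double peak_cpi  = 1.0 / peak_ipc;            /* 0.25 cycles/instruction   */
    printf("peak: %.0f BIPS, IPC = %.0f, CPI = %.2f\n",
           peak_bips, peak_ipc, peak_cpi);
    return 0;
}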
Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into “issue slots”
Compiler detects and avoids hazards
Dynamic multiple issue
CPU examines instruction stream and chooses
instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at
runtime
Speculation
“Guess” what to do with an instruction
Start operation as soon as possible
Check whether guess was right
If so, complete the operation
If not, roll-back and do the right thing
Common to static and dynamic multiple
issue
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate on load
Roll back if location is updated
Speculation and Exceptions
What if exception occurs on a speculatively
executed instruction?
e.g., speculative load before null-pointer check
Static speculation
Can add ISA support for deferring exceptions
Dynamic speculation
Can buffer exceptions until instruction completion (which
may not occur)
Static Multiple Issue
Compiler groups instructions into “issue packets”
• Group of instructions that can be issued in a single cycle
• Determined by pipeline resources required
Think of an issue packet as a very long
instruction
• Specifies multiple concurrent operations
• Very Long Instruction Word (VLIW)
Scheduling Static Multiple
Issue
Compiler must remove some/all hazards
• Reorder instructions into issue packets
• No dependencies within a packet
• Possibly some dependencies between packets
• Varies between ISAs; compiler must know!
• Pad with nop if necessary
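A rough C-level sketch of the kind of transformation a static-issue compiler relies on: unrolling a loop exposes independent operations that can be packed into issue packets (function and variable names are illustrative, not from the text):

/* Unrolled loop: the four statements in the body are independent of one
   another, so a VLIW/static-issue compiler can place them in the same or
   adjacent issue packets, padding unused slots with nops. */
void scale_add(double *y, const double *x, double s, int n) {
    /* assumes n is a multiple of 4 for simplicity */
    for (int i = 0; i < n; i += 4) {
        y[i]     = y[i]     + s * x[i];
        y[i + 1] = y[i + 1] + s * x[i + 1];
        y[i + 2] = y[i + 2] + s * x[i + 2];
        y[i + 3] = y[i + 3] + s * x[i + 3];
    }
}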
Dynamic Multiple Issue
“Superscalar” processors
CPU decides whether to issue 0, 1, 2, … instructions
each cycle
• Avoiding structural and data hazards
Avoids the need for compiler scheduling
• Though it may still help
• Code semantics ensured by the CPU
Issues in Multiple Issue Processors
• True Data Dependency
• Procedural Dependency
• Resource Conflicts
• Output Dependency
• Antidependency
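A toy C fragment (variable names arbitrary) showing three of the dependency types listed above, which a multiple-issue processor or its compiler must respect when choosing what to issue together:

void dependency_demo(void) {
    int a, b = 5, c;
    a = b + 1;   /* I1: writes a, reads b                                  */
    c = a * 2;   /* I2: true data dependency on I1 (reads a)        - RAW  */
    a = b - 3;   /* I3: output dependency with I1 (both write a)    - WAW  */
    b = 7;       /* I4: antidependency with I1 and I3 (they read b) - WAR  */
    (void)c;     /* silence unused-variable warnings */
}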
Instruction Issue Policy
•In-order Issue with in-order completion
•In-order Issue with out-of-order completion
•Out-of-order Issue with out-of-order completion
Amdahl’s Law and Parallel
Computers
A sequential portion limits parallel speedup
Speedup ≤ 1 / (1 − FracX), where FracX is the fraction that can be parallelized
Ex.: What fraction can be sequential to get 80× speedup from
100 processors? Assume either 1 processor or all 100 are
fully used
80 = 1 / (FracX/100 + (1 − FracX))
0.8·FracX + 80·(1 − FracX) = 80 − 79.2·FracX = 1
FracX = (80 − 1)/79.2 = 0.9975
Only 0.25% of the program can be sequential!
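A minimal C sketch of the same Amdahl's Law arithmetic, where f is the parallelizable fraction and p the processor count (the second call is an extra data point, not from the slide):

#include <stdio.h>

/* speedup = 1 / ((1 - f) + f/p) */
static double speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    printf("f = 0.9975, p = 100 -> %.1fx\n", speedup(0.9975, 100)); /* ~80x */
    printf("f = 0.9900, p = 100 -> %.1fx\n", speedup(0.9900, 100)); /* ~50x */
    return 0;
}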
Strong vs Weak Scaling
Strong scaling: problem size fixed
As in example
Weak scaling: problem size proportional to
number of processors
10 processors, 10 × 10 matrix
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
100 processors, 32 × 32 matrix
Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example
Instruction and Data Streams
An alternate classification
Data streams: single or multiple; instruction streams: single or multiple
SISD (single instruction, single data): e.g., Intel Pentium 4
SIMD (single instruction, multiple data): e.g., SSE instructions of x86
MISD (multiple instruction, single data): no examples today
MIMD (multiple instruction, multiple data): e.g., Intel Xeon e5345
SPMD: Single Program Multiple Data
A parallel program on a MIMD computer
Conditional code for different processors
Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: Vector extension to MIPS
32 × 64-element registers (64-bit elements)
Vector instructions
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth
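For concreteness, this is the sort of data-parallel C loop the vector extension above targets; with 64-element vector registers, each group of 64 iterations reduces to roughly lv, lv, addv.d, sv instead of 64 separate scalar load/add/store sequences (the function itself is an illustrative sketch, not from the text):

/* Element-wise add of two double vectors */
void vadd(const double *x, const double *y, double *z, int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}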
Vector vs. Scalar
Vector architectures and compilers
Simplify data-parallel programming
Explicit statement of absence of loop-
carried dependences
Reduced checking in hardware
Regular access patterns benefit from
interleaved and burst memory
Avoid control hazards by avoiding
loops
More general than ad-hoc media
extensions (such as MMX,
SSE)
Better match with compiler
SIMD
Operate elementwise on vectors of data
E.g., MMX and SSE instructions in x86
Multiple data elements in 128-bit wide registers
All processors execute the same
instruction at the same time
Each with different data address,
etc.
Simplifies synchronization
Reduced instruction control
hardware
Works best for highly data-parallel
applications
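A minimal x86 SSE sketch in C (assuming a compiler that provides <xmmintrin.h>): one 128-bit register holds four single-precision floats, and a single instruction adds all four element pairs at once:

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one SSE register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions     */
    _mm_storeu_ps(c, vc);
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}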
Vector vs. Multimedia Extensions
Vector instructions have a variable vector width,
multimedia extensions have a fixed width
Vector instructions support strided access,
multimedia extensions do not
Vector units can be a combination of pipelined and
arrayed functional units
Hardware Multithreading
Performing multiple threads of execution in
parallel
Replicate registers, PC, etc.
Fast switching between threads
Fine-grain multithreading
Switch threads after each cycle
Interleave instruction execution
If one thread stalls, others are executed
Coarse-grain multithreading
Only switch on long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn’t hide short stalls
(e.g., data hazards)
Simultaneous Multithreading
In multiple-issue dynamically scheduled
processor
Schedule instructions from multiple
threads
Instructions from independent threads execute
when function units are available
Within threads, dependencies handled by
scheduling and register renaming
Example: Intel Pentium-4 HT
Two threads: duplicated registers, shared
function units and caches
Multithreading Example
Future of Multithreading
Will it survive? In what form?
Power considerations ⇒ simplified
microarchitectures
Simpler forms of multithreading
Tolerating cache-miss latency
Thread switch may be most
effective
Multiple simple cores might share
resources more effectively
Shared Memory
SMP: shared memory multiprocessor
Hardware provides single physical
address space for all processors
Synchronize shared variables
using locks
Memory access time
UMA (uniform) vs. NUMA
(nonuniform)
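A minimal POSIX-threads sketch (one possible realization of "synchronize shared variables using locks"; the thread and iteration counts are arbitrary): all threads share one physical address space, and a mutex makes each read-modify-write of the shared counter atomic:

#include <pthread.h>
#include <stdio.h>

#define P     4        /* number of threads ("processors") */
#define ITERS 100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);    /* acquire lock on shared variable */
        counter++;                    /* protected update                */
        pthread_mutex_unlock(&lock);  /* release                         */
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int p = 0; p < P; p++) pthread_create(&t[p], NULL, worker, NULL);
    for (int p = 0; p < P; p++) pthread_join(&t[p], NULL);
    printf("counter = %ld (expect %d)\n", counter, P * ITERS);
    return 0;   /* compile with: cc -pthread ... */
}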
Example: Sum Reduction
Sum 100,000 numbers on 100 processor UMA
Each processor has ID: 0 ≤ Pn ≤ 99
Partition 1000 numbers per processor
Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
Now need to add these partial sums
Reduction: divide and conquer
Half the processors add pairs, then quarter, …
Need to synchronize between reduction steps
Example: Sum Reduction
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor 0 gets the missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
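As a hedged, runnable counterpart to the pseudocode above, here is a POSIX-threads sketch of the same tree reduction. It assumes a Linux-style pthread_barrier_t standing in for synch() and, unlike the pseudocode, assumes the thread count P is a power of two (so the odd-half case never arises); thread and data sizes are arbitrary choices:

#include <pthread.h>
#include <stdio.h>

#define P 4            /* number of threads; must be a power of two here */
#define N 100000

static double A[N];
static double sum[P];
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long Pn = (long)arg;
    long chunk = N / P;

    /* initial summation of this thread's partition */
    sum[Pn] = 0.0;
    for (long i = chunk * Pn; i < chunk * (Pn + 1); i++)
        sum[Pn] += A[i];

    /* tree reduction: half the threads add pairs, then a quarter, ... */
    for (long half = P / 2; half >= 1; half /= 2) {
        pthread_barrier_wait(&barrier);     /* plays the role of synch() */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (long i = 0; i < N; i++) A[i] = 1.0;     /* expected total: N */
    pthread_barrier_init(&barrier, NULL, P);
    for (long p = 0; p < P; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (long p = 0; p < P; p++) pthread_join(&t[p], NULL);
    printf("total = %.0f\n", sum[0]);            /* prints 100000 */
    pthread_barrier_destroy(&barrier);
    return 0;
}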
§6.6 Introduction to Graphics Processing Units
History of GPUs
Early video cards
Frame buffer memory with address generation for
video output
3D graphics processing
Originally high-end computers (e.g., SGI)
Moore’s Law ⇒ lower cost, higher density
3D graphics cards for PCs and game consoles
Graphics Processing Units
Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping,
rasterization
Graphics in the System
GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Less reliance on multi-level caches
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language
(HLSL)
Compute Unified Device Architecture (CUDA)
Example: NVIDIA Tesla
Streaming
multiprocessor
8 × Streaming
processors
Example: NVIDIA Tesla
Streaming Processors
Single-precision FP and integer units
Each SP is fine-grained multithreaded
Warp: group of 32 threads
Executed in parallel,
SIMD style
8 SPs
× 4 clock cycles
Hardware contexts
for 24 warps
Registers, PCs, …
Classifying GPUs
Don’t fit nicely into SIMD/MIMD model
Conditional execution in a thread allows an
illusion of MIMD
But with performance degradation
Need to write general purpose code with care
Instruction-level parallelism – static (discovered at compile time): VLIW; dynamic (discovered at runtime): superscalar
Data-level parallelism – static: SIMD or vector; dynamic: Tesla multiprocessor
GPU Memory Structures
§6.7 Clusters, WSC, and Other Message-Passing MPs
Message Passing
Each processor has private physical
address space
Hardware sends/receives messages
between processors
Loosely Coupled Clusters
Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems
Administration cost (prefer virtual machines)
Low interconnect bandwidth
c.f. processor/memory bandwidth on an SMP
Sum Reduction (Again)
Sum 100,000 numbers on 100 processors
First distribute 1000 numbers to each
Then do partial sums
sum = 0;
for(i = 0; i<1000; i = i + 1)
sum = sum + AN[i];
Reduction
Half the processors send, other half receive
and add
Then a quarter send, a quarter receive and add,
…
Sum Reduction (Again)
Given send() and receive() operations
limit = 100; half = 100; /* 100 processors */
repeat
    half = (half+1)/2; /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */
Send/receive also provide synchronization
Assumes send/receive take similar time to addition
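For comparison, a hedged sketch using MPI (a widely used message-passing library, not named in the slides): each process sums its private slice, and MPI_Reduce performs the tree-style combine that the send/receive pseudocode above spells out by hand. Compile with mpicc and launch with mpirun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process sums 1000 locally held numbers (all 1.0 here) */
    double local = 0.0;
    for (int i = 0; i < 1000; i++)
        local += 1.0;

    /* combine the partial sums; process 0 receives the total */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f over %d processes\n", total, size);
    MPI_Finalize();
    return 0;
}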
Grid Computing
Separate computers interconnected by
long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., SETI@home, World Community
Grid
Interconnection Networks
Network topologies
Arrangements of processors, switches, and links
Bus, ring, 2-D mesh, N-cube (N = 3), fully connected
Multistage Networks