COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface, 5th Edition
Parallel Processors from
Client to Cloud
Introduction
Goal: connecting multiple computers
to get higher performance
Multiprocessors
Scalability, availability, power efficiency
Task-level (process-level)
parallelism
High throughput for independent jobs
Parallel processing program
Single program run on multiple
processors
Multicore microprocessors
Chips with multiple processors (cores)
Parallelism Basics – Introduction
Parallelism and Instructions
Synchronization
Parallelism and Computer Arithmetic
Subword Parallelism
Parallelism and Advanced
Instruction-Level Parallelism
Parallelism and Memory
Hierarchies
Cache Coherence
Parallel Computers
Definition: “A parallel computer is a collection of
processing elements that cooperate and
communicate to solve large problems fast.”
Questions about parallel computers:
How large a collection?
How powerful are processing elements?
How do they cooperate and communicate?
How are data transmitted?
What type of interconnection?
What are HW and SW primitives for programmer?
Does it translate into performance?
What Level of Parallelism?
Bit-level parallelism: 1970 to ~1985
4-bit, 8-bit, 16-bit, 32-bit microprocessors
Instruction level parallelism (ILP):
~1985 through today
Pipelining
Superscalar
VLIW
Out-of-Order execution
Limits to benefits of ILP?
Process-level or thread-level parallelism:
mainstream for general-purpose computing?
Servers are parallel
High-end desktop: dual-processor PC
Why Multiprocessors?
1. Microprocessors as the fastest CPUs
• Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
• Do we have enough ideas to sustain 2X/1.5yr?
• Can we deliver such complexity on schedule?
• Limit to ILP due to data dependency
3. Slow (but steady) improvement in parallel
software (scientific apps, databases, OS)
4. Emergence of embedded and server markets
driving microprocessors in addition to desktops
• Embedded functional parallelism
• Network processors exploiting packet-level parallelism
• SMP servers and clusters of workstations for multiple users –
less demand for parallel computing
Instruction-Level Parallelism (ILP)
Pipelining: executing multiple instructions in
parallel
To increase ILP
Deeper pipeline
Less work per stage ⇒ shorter clock cycle
Multiple issue
Replicate pipeline stages ⇒ multiple pipelines
Start multiple instructions per clock cycle
CPI < 1, so use Instructions Per Cycle (IPC)
E.g., 4 GHz, 4-way multiple issue
⇒ 16 BIPS, peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice
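As a small illustration of the peak-rate arithmetic above, the following C sketch (numbers taken from the slide; variable names are my own) computes the peak instruction rate, IPC, and CPI for a 4 GHz, 4-way machine:

#include <stdio.h>

int main(void) {
    double clock_ghz = 4.0;   /* 4 GHz clock rate          */
    int issue_width  = 4;     /* 4-way multiple issue      */
    double peak_bips = clock_ghz * issue_width;   /* 16 billion instructions/s */
    double peak_ipc  = (double)issue_width;       /* 4 instructions per cycle  */
    double peak_cpi  = 1.0 / peak_ipc;            /* 0.25 cycles/instruction   */
    printf("peak: %.0f BIPS, IPC = %.0f, CPI = %.2f\n",
           peak_bips, peak_ipc, peak_cpi);
    return 0;
}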
Multiple Issue
Static multiple issue
Compiler groups instructions to be issued together
Packages them into “issue slots”
Compiler detects and avoids hazards
Dynamic multiple issue
CPU examines instruction stream and chooses
instructions to issue each cycle
Compiler can help by reordering instructions
CPU resolves hazards using advanced techniques at
runtime
Speculation
“Guess” what to do with an instruction
Start operation as soon as possible
Check whether guess was right
If so, complete the operation
If not, roll-back and do the right thing
Common to static and dynamic multiple
issue
Examples
Speculate on branch outcome
Roll back if path taken is different
Speculate on load
Roll back if location is updated
Speculation and Exceptions
What if exception occurs on a speculatively
executed instruction?
e.g., speculative load before null-pointer check
Static speculation
Can add ISA support for deferring exceptions
Dynamic speculation
Can buffer exceptions until instruction completion (which
may not occur)
Static Multiple Issue
Compiler groups instructions into “issue packets”
• Group of instructions that can be issued in a single cycle
• Determined by pipeline resources required
Think of an issue packet as a very long
instruction
• Specifies multiple concurrent operations
• Very Long Instruction Word (VLIW)
Scheduling Static Multiple
Issue
Compiler must remove some/all hazards
• Reorder instructions into issue packets
• No dependencies within a packet
• Possibly some dependencies between packets
• Varies between ISAs; compiler must know!
• Pad with nop if necessary
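A rough C-level sketch of the kind of transformation a static-issue compiler relies on: unrolling a loop exposes independent operations that can be packed into issue packets (function and variable names are illustrative, not from the text):

/* Unrolled loop: the four statements in the body are independent of one
   another, so a VLIW/static-issue compiler can place them in the same or
   adjacent issue packets, padding unused slots with nops. */
void scale_add(double *y, const double *x, double s, int n) {
    /* assumes n is a multiple of 4 for simplicity */
    for (int i = 0; i < n; i += 4) {
        y[i]     = y[i]     + s * x[i];
        y[i + 1] = y[i + 1] + s * x[i + 1];
        y[i + 2] = y[i + 2] + s * x[i + 2];
        y[i + 3] = y[i + 3] + s * x[i + 3];
    }
}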
Dynamic Multiple Issue
“Superscalar” processors
CPU decides whether to issue 0, 1, 2, … instructions
each cycle
• Avoiding structural and data hazards
Avoids the need for compiler scheduling
• Though it may still help
• Code semantics ensured by the CPU
Issues in Multiple Issue Processors
• True Data Dependency
• Procedural Dependency
• Resource Conflicts
• Output Dependency
• Antidependency
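A toy C fragment (variable names arbitrary) showing three of the dependency types listed above, which a multiple-issue processor or its compiler must respect when choosing what to issue together:

void dependency_demo(void) {
    int a, b = 5, c;
    a = b + 1;   /* I1: writes a, reads b                                  */
    c = a * 2;   /* I2: true data dependency on I1 (reads a)        - RAW  */
    a = b - 3;   /* I3: output dependency with I1 (both write a)    - WAW  */
    b = 7;       /* I4: antidependency with I1 and I3 (they read b) - WAR  */
    (void)c;     /* silence unused-variable warnings */
}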
Instruction Issue Policy
•In-order Issue with in-order completion
•In-order Issue with out-of-order completion
•Out-of-order Issue with out-of-order completion
Amdahl’s Law and Parallel
Computers
A sequential portion limits parallel speedup
Speedup ≤ 1 / (1 − FracX), where FracX is the fraction that can be parallelized
Ex.: What fraction can be sequential to get 80× speedup from
100 processors? Assume either 1 processor or all 100 are
fully used
80 = 1 / (FracX/100 + (1 − FracX))
0.8·FracX + 80·(1 − FracX) = 80 − 79.2·FracX = 1
FracX = (80 − 1)/79.2 = 0.9975
Only 0.25% of the program can be sequential!
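A minimal C sketch of the same Amdahl's Law arithmetic, where f is the parallelizable fraction and p the processor count (the second call is an extra data point, not from the slide):

#include <stdio.h>

/* speedup = 1 / ((1 - f) + f/p) */
static double speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    printf("f = 0.9975, p = 100 -> %.1fx\n", speedup(0.9975, 100)); /* ~80x */
    printf("f = 0.9900, p = 100 -> %.1fx\n", speedup(0.9900, 100)); /* ~50x */
    return 0;
}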
Strong vs Weak Scaling
Strong scaling: problem size fixed
As in example
Weak scaling: problem size proportional to
number of processors
10 processors, 10 × 10 matrix
Time = 10 × tadd + 100/10 × tadd = 20 × tadd
100 processors, 32 × 32 matrix
Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example
Instruction and Data Streams
An alternate classification
Data streams: single or multiple; instruction streams: single or multiple
SISD (single instruction, single data): e.g., Intel Pentium 4
SIMD (single instruction, multiple data): e.g., SSE instructions of x86
MISD (multiple instruction, single data): no examples today
MIMD (multiple instruction, multiple data): e.g., Intel Xeon e5345
SPMD: Single Program Multiple Data
A parallel program on a MIMD computer
Conditional code for different processors
Vector Processors
Highly pipelined function units
Stream data from/to vector registers to units
Data collected from memory into registers
Results stored from registers to memory
Example: Vector extension to MIPS
32 × 64-element registers (64-bit elements)
Vector instructions
lv, sv: load/store vector
addv.d: add vectors of double
addvs.d: add scalar to each element of vector of double
Significantly reduces instruction-fetch bandwidth
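For concreteness, this is the sort of data-parallel C loop the vector extension above targets; with 64-element vector registers, each group of 64 iterations reduces to roughly lv, lv, addv.d, sv instead of 64 separate scalar load/add/store sequences (the function itself is an illustrative sketch, not from the text):

/* Element-wise add of two double vectors */
void vadd(const double *x, const double *y, double *z, int n) {
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}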
Vector vs. Scalar
Vector architectures and compilers
Simplify data-parallel programming
Explicit statement of absence of loop-
carried dependences
Reduced checking in hardware
Regular access patterns benefit from
interleaved and burst memory
Avoid control hazards by avoiding
loops
More general than ad-hoc media
extensions (such as MMX,
SSE)
Better match with compiler
SIMD
Operate elementwise on vectors of data
E.g., MMX and SSE instructions in x86
Multiple data elements in 128-bit wide registers
All processors execute the same
instruction at the same time
Each with different data address,
etc.
Simplifies synchronization
Reduced instruction control
hardware
Works best for highly data-parallel
applications
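A minimal x86 SSE sketch in C (assuming a compiler that provides <xmmintrin.h>): one 128-bit register holds four single-precision floats, and a single instruction adds all four element pairs at once:

#include <xmmintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one SSE register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four additions     */
    _mm_storeu_ps(c, vc);
    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
    return 0;
}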
Vector vs. Multimedia Extensions
Vector instructions have a variable vector width,
multimedia extensions have a fixed width
Vector instructions support strided access,
multimedia extensions do not
Vector units can be a combination of pipelined and
arrayed functional units
Hardware Multithreading
Performing multiple threads of execution in
parallel
Replicate registers, PC, etc.
Fast switching between threads
Fine-grain multithreading
Switch threads after each cycle
Interleave instruction execution
If one thread stalls, others are executed
Coarse-grain multithreading
Only switch on long stall (e.g., L2-cache miss)
Simplifies hardware, but doesn’t hide short stalls
(e.g., data hazards)
Simultaneous Multithreading
In multiple-issue dynamically scheduled
processor
Schedule instructions from multiple
threads
Instructions from independent threads execute
when function units are available
Within threads, dependencies handled by
scheduling and register renaming
Example: Intel Pentium-4 HT
Two threads: duplicated registers, shared
function units and caches
Multithreading Example
Future of Multithreading
Will it survive? In what form?
Power considerations ⇒ simplified
microarchitectures
Simpler forms of multithreading
Tolerating cache-miss latency
Thread switch may be most
effective
Multiple simple cores might share
resources more effectively
Shared Memory
SMP: shared memory multiprocessor
Hardware provides single physical
address space for all processors
Synchronize shared variables
using locks
Memory access time
UMA (uniform) vs. NUMA
(nonuniform)
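A minimal POSIX-threads sketch (one possible realization of "synchronize shared variables using locks"; the thread and iteration counts are arbitrary): all threads share one physical address space, and a mutex makes each read-modify-write of the shared counter atomic:

#include <pthread.h>
#include <stdio.h>

#define P     4        /* number of threads ("processors") */
#define ITERS 100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);    /* acquire lock on shared variable */
        counter++;                    /* protected update                */
        pthread_mutex_unlock(&lock);  /* release                         */
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (int p = 0; p < P; p++) pthread_create(&t[p], NULL, worker, NULL);
    for (int p = 0; p < P; p++) pthread_join(&t[p], NULL);
    printf("counter = %ld (expect %d)\n", counter, P * ITERS);
    return 0;   /* compile with: cc -pthread ... */
}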
Example: Sum Reduction
Sum 100,000 numbers on 100 processor UMA
Each processor has ID: 0 ≤ Pn ≤ 99
Partition 1000 numbers per processor
Initial summation on each processor
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
    sum[Pn] = sum[Pn] + A[i];
Now need to add these partial sums
Reduction: divide and conquer
Half the processors add pairs, then quarter, …
Need to synchronize between reduction steps
Example: Sum Reduction
half = 100;
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor 0 gets the missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
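As a hedged, runnable counterpart to the pseudocode above, here is a POSIX-threads sketch of the same tree reduction. It assumes a Linux-style pthread_barrier_t standing in for synch() and, unlike the pseudocode, assumes the thread count P is a power of two (so the odd-half case never arises); thread and data sizes are arbitrary choices:

#include <pthread.h>
#include <stdio.h>

#define P 4            /* number of threads; must be a power of two here */
#define N 100000

static double A[N];
static double sum[P];
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long Pn = (long)arg;
    long chunk = N / P;

    /* initial summation of this thread's partition */
    sum[Pn] = 0.0;
    for (long i = chunk * Pn; i < chunk * (Pn + 1); i++)
        sum[Pn] += A[i];

    /* tree reduction: half the threads add pairs, then a quarter, ... */
    for (long half = P / 2; half >= 1; half /= 2) {
        pthread_barrier_wait(&barrier);     /* plays the role of synch() */
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    for (long i = 0; i < N; i++) A[i] = 1.0;     /* expected total: N */
    pthread_barrier_init(&barrier, NULL, P);
    for (long p = 0; p < P; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (long p = 0; p < P; p++) pthread_join(&t[p], NULL);
    printf("total = %.0f\n", sum[0]);            /* prints 100000 */
    pthread_barrier_destroy(&barrier);
    return 0;
}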
§6.6 Introduction to Graphics Processing Units
History of GPUs
Early video cards
Frame buffer memory with address generation for
video output
3D graphics processing
Originally high-end computers (e.g., SGI)
Moore’s Law ⇒ lower cost, higher density
3D graphics cards for PCs and game consoles
Graphics Processing Units
Processors oriented to 3D graphics tasks
Vertex/pixel processing, shading, texture mapping,
rasterization
Graphics in the System
GPU Architectures
Processing is highly data-parallel
GPUs are highly multithreaded
Use thread switching to hide memory latency
Less reliance on multi-level caches
Graphics memory is wide and high-bandwidth
Trend toward general purpose GPUs
Heterogeneous CPU/GPU systems
CPU for sequential code, GPU for parallel code
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language
(HLSL)
Compute Unified Device Architecture (CUDA)
Example: NVIDIA Tesla
Streaming
multiprocessor
8 × Streaming
processors
Example: NVIDIA Tesla
Streaming Processors
Single-precision FP and integer units
Each SP is fine-grained multithreaded
Warp: group of 32 threads
Executed in parallel,
SIMD style
8 SPs
× 4 clock cycles
Hardware contexts
for 24 warps
Registers, PCs, …
Classifying GPUs
Don’t fit nicely into SIMD/MIMD model
Conditional execution in a thread allows an
illusion of MIMD
But with performance degradation
Need to write general purpose code with care
Instruction-level parallelism – static (discovered at compile time): VLIW; dynamic (discovered at runtime): superscalar
Data-level parallelism – static: SIMD or vector; dynamic: Tesla multiprocessor
GPU Memory Structures
§6.7 Clusters, WSC, and Other Message-Passing MPs
Message Passing
Each processor has private physical
address space
Hardware sends/receives messages
between processors
Loosely Coupled Clusters
Network of independent computers
Each has private memory and OS
Connected using I/O system
E.g., Ethernet/switch, Internet
Suitable for applications with independent tasks
Web servers, databases, simulations, …
High availability, scalable, affordable
Problems
Administration cost (prefer virtual machines)
Low interconnect bandwidth
c.f. processor/memory bandwidth on an SMP
Sum Reduction (Again)
Sum 100,000 numbers on 100 processors
First distribute 1000 numbers to each
Then do partial sums
sum = 0;
for(i = 0; i<1000; i = i + 1)
sum = sum + AN[i];
Reduction
Half the processors send, other half receive
and add
Then a quarter send, a quarter receive and add,
…
Sum Reduction (Again)
Given send() and receive() operations
limit = 100; half = 100; /* 100 processors */
repeat
    half = (half+1)/2; /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half; /* upper limit of senders */
until (half == 1); /* exit with final sum */
Send/receive also provide synchronization
Assumes send/receive take similar time to addition
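For comparison, a hedged sketch using MPI (a widely used message-passing library, not named in the slides): each process sums its private slice, and MPI_Reduce performs the tree-style combine that the send/receive pseudocode above spells out by hand. Compile with mpicc and launch with mpirun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process sums 1000 locally held numbers (all 1.0 here) */
    double local = 0.0;
    for (int i = 0; i < 1000; i++)
        local += 1.0;

    /* combine the partial sums; process 0 receives the total */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f over %d processes\n", total, size);
    MPI_Finalize();
    return 0;
}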
Grid Computing
Separate computers interconnected by
long-haul networks
E.g., Internet connections
Work units farmed out, results sent back
Can make use of idle time on PCs
E.g., SETI@home, World Community
Grid
Interconnection Networks
Network topologies
Arrangements of processors, switches, and links
Bus, ring, 2-D mesh, N-cube (N = 3), fully connected
Multistage Networks