CpE 440
Computer Architecture
Dr. Haithem Al-Mefleh
Computer Engineering Department
Yarmouk University, Second Semester 2020-2021
Multicores, Multiprocessors, and Clusters
• Multiprocessor – a computer system with at least 2 processors
  • If one processor fails, the others keep working
  • Goals: performance, reliability, availability
• Job-level parallelism (or process-level parallelism)
  • Different programs run on different processors
• Parallel processing program
  • One program runs across different processors
• Cluster – a number of computers connected over a LAN that work together as one large multiprocessor
• Multicore microprocessor – a microprocessor that contains multiple processors (cores) in a single integrated circuit
• Parallel programming – writing programs that execute efficiently in both performance and power
Hardware & Software
• Challenge – making effective use of parallel hardware
• Parallel processing program (parallel software) = sequential or concurrent software running on parallel hardware
Difficulty of Writing Parallel Processing Programs
• It is difficult to write software that uses multiple processors to complete one task faster
• The problem gets worse as the number of processors increases
• The parallel program must deliver better performance and efficiency; otherwise you would just use a sequential program on a uniprocessor, since that is easier to write
• Uniprocessors already exploit superscalar and out-of-order execution without the programmer's involvement
• Why are parallel processing programs so much harder to write than sequential programs?
  • Communication overhead
  • Scheduling
  • Load balancing – the work must be divided equally
  • Synchronization time
• Even small sequential parts must be parallelized for a program to make good use of many cores
  • For example, to reach a speedup of 90 with 100 processors, at most about 0.1% (0.001) of the original computation can remain sequential
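A quick check of that figure with Amdahl's Law (standard formula; s is the sequential fraction, n the number of processors):

$$\text{Speedup} = \frac{1}{s + \frac{1-s}{n}} \qquad 90 = \frac{1}{s + \frac{1-s}{100}} \;\Rightarrow\; 99s \approx 0.11 \;\Rightarrow\; s \approx 0.0011 \approx 0.1\%$$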
Getting good speedup on a multiprocessor while keeping the problem size fixed (strong scaling) is harder than getting good speedup by letting the problem size grow (weak scaling).

Example: sum 10 scalars, plus a matrix sum; time per addition = t.

10 × 10 matrices (100 additions):
• Single processor: 10t + 100t = 110t
• 10 processors: 10t + 100t/10 = 20t → Speedup = 110t/20t = 5.5, i.e., (5.5/10) × 100% = 55% of the potential speedup
• 100 processors: 10t + 100t/100 = 11t → Speedup = 110t/11t = 10, i.e., (10/100) × 100% = 10% of the potential speedup

100 × 100 matrices (10,000 additions):
• Single processor: 10t + 10,000t = 10,010t
• 10 processors: 10t + 10,000t/10 = 1,010t → Speedup ≈ 9.9, i.e., 99% of the potential speedup
• 100 processors: 10t + 10,000t/100 = 110t → Speedup ≈ 91, i.e., 91% of the potential speedup
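The pattern generalizes (N matrix additions, P processors; the 10 scalar additions stay sequential):

$$\text{Time}(P) = 10t + \frac{Nt}{P} \qquad \text{Speedup}(P) = \frac{10t + Nt}{10t + Nt/P}$$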
Load balancing (continuing the previous example, 100 × 100 case with 100 processors):
• Perfect load balance – each processor gets 1% of the load → Speedup ≈ 91
• One processor gets 2% of the load: 2% × 10,000 = 200 additions, and the other 99 processors share 9,800 additions
  Time = max(200t, 9,800t/99) + 10t = 210t → Speedup = 10,010t/210t ≈ 48
• One processor gets 5% of the load: 5% × 10,000 = 500 additions, and the other 99 processors share 9,500 additions
  Time = max(500t, 9,500t/99) + 10t = 510t → Speedup = 10,010t/510t ≈ 20
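In general the slowest processor determines the parallel time (work_i is processor i's share of the 10,000 matrix additions):

$$\text{Time} = \max_i(\text{work}_i)\,t + 10t \qquad \text{Speedup} = \frac{10{,}010\,t}{\text{Time}}$$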
Shared Memory Multiprocessors
How can the programmer's task be simplified? One option: SMP
• A single physical address space that all processors share
• Variables can be made available at any time to any processor
• Independent jobs can still run in their own virtual address spaces
• Processors communicate through shared variables
Two styles of SMP:
• UMA – Uniform Memory Access
  • Accessing main memory takes the same time no matter which processor requests it and no matter which word is requested
• NUMA – Non-Uniform Memory Access
  • The access time depends on which processor requests which word
  • The programming challenges are harder
  • Can scale to larger sizes
  • Can have lower latency to nearby memory
Synchronization
• Processors must coordinate when they share data
• A lock is one mechanism – only one processor may access the shared data at a time (see the sketch below)
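A minimal sketch of lock-based synchronization using POSIX threads (pthread_mutex is the standard POSIX locking API; the shared counter is just an illustrative workload):

```c
#include <pthread.h>
#include <stdio.h>

static long shared_sum = 0;                    /* data shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);      /* only one thread inside at a time */
        shared_sum += 1;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", shared_sum);        /* always 400000 thanks to the lock */
    return 0;
}
```

Without the lock, the concurrent `shared_sum += 1` updates would race and the final value would be unpredictable.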
15
15
• Step 1 – divide the numbers into equal subsets, one per processor
• Step 2 – reduction (divide and conquer): combine the partial sums in halves (see the sketch below)
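A sketch of both steps for a shared-memory machine, following the textbook's example of summing 100,000 numbers (scaled here to 4 POSIX threads; the barrier plays the role of the book's synch() primitive, and `Pn` is the processor number):

```c
#include <pthread.h>
#include <stdio.h>

#define P 4                        /* processors (the textbook uses 100)    */
#define N 100000                   /* numbers to sum                        */

static long A[N], sum[P];
static pthread_barrier_t bar;

static void *task(void *arg) {
    int Pn = (int)(long)arg;       /* this thread's processor number        */

    /* Step 1: each processor sums its own equal subset                     */
    long local = 0;
    for (int i = Pn * (N / P); i < (Pn + 1) * (N / P); i++)
        local += A[i];
    sum[Pn] = local;

    /* Step 2: reduction - halve the number of active processors per round  */
    int half = P;
    do {
        pthread_barrier_wait(&bar);         /* wait for all partial sums    */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];        /* odd count: P0 takes the extra */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1;   /* expected total: 100000       */
    pthread_barrier_init(&bar, NULL, P);
    pthread_t t[P];
    for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, task, (void *)i);
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("sum = %ld\n", sum[0]);
    return 0;
}
```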
Clusters and Other Message-Passing Multiprocessors
• Each processor has its own private physical address space
• Processors communicate by message passing (send and receive); acknowledgments (ACKs) are possible
• Some applications run well on top of either shared or private address spaces
• Disadvantages –
  • Cost of administration: administering a cluster of n machines is roughly like administering n independent machines, while administering a shared-memory multiprocessor with n processors is roughly like administering 1 machine
  • The processors are interconnected using the I/O interconnect of each computer rather than the memory interconnect
  • Overhead of dividing the memory: a cluster of n machines has n separate memories and n copies of the OS
• There are 100 subsets – send one subset to each machine
• Each computer finds the sum of its own subset
11
CpE 440, Second 2020-2021, Yarmouk 2/26/2021
University
• Reduction – add the partial sums by passing messages between the machines (see the sketch below)
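A minimal message-passing version of the same sum (MPI_Reduce is the standard MPI reduction call; each rank's subset here is synthetic, just to make the program self-contained):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each machine sums its own private subset of 1000 numbers */
    long local = 0, total = 0;
    for (int i = 0; i < 1000; i++)
        local += (long)rank * 1000 + i;   /* stand-in for this node's data */

    /* reduction: the partial sums are combined by message passing */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %ld\n", total);
    MPI_Finalize();
    return 0;
}
```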
• Better availability
  • Much easier to disconnect a machine, reinstall software, or replace it, …
  • Whole computers and independent, scalable networks make it easier to expand the system without bringing down the application running on top of the cluster
• Lower cost, high availability, improved power efficiency, and rapid, incremental expandability make clusters attractive to service providers for the World Wide Web
Hardware Multithreading
• Multiple threads share the functional units of a single processor in an overlapping way
• When one thread stalls, the processor switches to another one quickly
• The processor keeps a copy of the state of each thread
• Memory can be shared through virtual memory mechanisms, which already support multiprogramming
• Two approaches
  • Fine-grained – interleaving: switch between individual threads on every instruction
  • Coarse-grained – run one thread until a costly stall, then switch; this incurs pipeline start-up overhead on each switch
Simultaneous Multithreading – SMT
• A variation on hardware multithreading
• Uses the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism
• Multiple instructions from independent threads can be issued regardless of the dependencies among them, thanks to register renaming and dynamic scheduling
• The processor executes instructions from multiple threads, leaving it to the hardware to associate instruction slots and renamed registers with their threads
SISD, MIMD, SIMD, SPMD, and Vector
A characterization of parallel hardware based on the number of instruction streams and the number of data streams.
SISD – a uniprocessor
MIMD – a multiprocessor
• The processors may run different programs, or
• One program whose conditional statements steer different processors through different sections – SPMD (Single Program Multiple Data)
SIMD – a single instruction applied to many data streams
• Vector and array processors
• Example: one add instruction sends 64 data streams to 64 ALUs, producing 64 sums in one clock cycle
• All units are synchronized and share a single PC
• Reduces the cost of the control unit, which is amortized over dozens of execution units
• Reduces the size of program memory – only one copy of the code
• Works best on arrays inside for loops – identically structured data
• Data-level parallelism
SIMD in x86: Multimedia Extensions
• MMX and SSE instructions
• Improve the performance of multimedia programs
• The instructions let the hardware provide many simultaneous ALUs, or split one wide ALU into many narrower simultaneous ALUs
  • e.g., a 64-bit ALU = two 32-bit ALUs = four 16-bit ALUs = eight 8-bit ALUs
• Loads and stores are as wide as the widest ALU
• The width of the operation and of the registers is encoded in the opcode (see the sketch below)
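A small illustration using SSE intrinsics (these are the standard x86 intrinsics from <xmmintrin.h>; the data values are made up):

```c
#include <xmmintrin.h>                /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four additions         */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```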
Vector
• Pipelined ALUs
• Get data into vector registers, operate on the elements sequentially through the pipeline, and store the results back to memory
• A single vector instruction behaves like an entire loop (see the sketch below)
• The hardware doesn't have to check for data hazards within one vector instruction
• Control hazards from loop branches are nonexistent
• The number of elements is held in a separate vector-length register
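For example, a vector machine can execute the loop below as a handful of vector instructions; the MIPS-style mnemonics in the comment (lv, addv.s, sv) follow the textbook's vector extension and are illustrative:

```c
#include <stdio.h>

int main(void) {
    float x[64], y[64], z[64];
    for (int i = 0; i < 64; i++) { x[i] = (float)i; y[i] = 2.0f * i; }

    /* On a vector machine this whole loop is roughly:
       lv v1, x ; lv v2, y ; addv.s v3, v1, v2 ; sv v3, z
       with the element count 64 held in the vector-length register */
    for (int i = 0; i < 64; i++)
        z[i] = x[i] + y[i];

    printf("z[63] = %.0f\n", z[63]);
    return 0;
}
```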
Introduction to Graphics Processing Units (GPUs)
• Originally, additional processors were connected to the graphics displays because the processing time spent on graphics kept increasing
• Controllers were added to accelerate 2D and 3D graphics
• A rapidly growing game market drove the development of dedicated Graphics Processing Units (GPUs)
• A GPU supplements a CPU – it does not need to perform all tasks
  • It may do some tasks poorly or not at all
• Heterogeneous combination – the CPU and GPU are not identical processors
• Programming interface – high-level application programming interfaces (APIs) plus high-level graphics shading languages
  • OpenGL, DirectX
  • NVIDIA's Cg (C for Graphics), Microsoft's High Level Shader Language (HLSL)
• The work is drawing vertices of 3D geometry primitives such as lines, and shading or rendering pixel fragments
• Each vertex or pixel can be drawn/rendered independently – many threads
• The data types are vertices, consisting of (x, y, z, w) coordinates, and pixels, consisting of (red, green, blue, alpha) color components
• The working set can be hundreds of megabytes and does not show the same temporal locality as data in mainstream applications
• There is a great deal of data-level parallelism
• GPUs do not rely on multilevel caches to overcome memory latency
  • They rely on having enough threads in flight instead
• They rely on extensive parallelism for high performance – many parallel processors and many concurrent threads
• Each GPU processor is highly multithreaded
• GPU main memory is oriented toward bandwidth rather than latency
• Heterogeneous (CPU + GPU) rather than identical processors
• Historically GPUs relied on SIMD instructions; recently the focus has shifted toward scalar instructions to improve programmability and efficiency
• There was originally no support for double-precision floating-point arithmetic – it was not needed in graphics applications
General Purpose GPUs, or GPGPUs
• Use the GPU for general applications – for performance
• C-inspired programming languages let programmers write directly for the GPUs
  • Brook – a streaming language for GPUs
  • NVIDIA's CUDA – write C programs that execute on GPUs, with some restrictions; also aimed at parallel programming in general
Introduction to Multiprocessor Network Topologies
Multicore chips use networks on chips (NoCs) to connect their cores.
• Cost depends on
  • The number of switches
  • The number of links on a switch
  • The width (number of bits) per link
  • The length of the links
• Performance
  • Throughput – the maximum number of messages in a given time
  • Latency to send and receive a message
  • Contention
  • …
• Fault tolerance
• Power efficiency
• Links are bidirectional, connecting processor-memory nodes
• Bus
  • Total network BW = BW of the bus = 2 × BWlink
  • Bisection BW = BWlink
• Ring
  • Total network BW = P × BWlink
  • Bisection BW = 2 × BWlink
• Fully connected
  • Each processor has a bidirectional link to every other processor
  • Total network BW = P × (P − 1)/2 × BWlink
  • Bisection BW = (P/2)² × BWlink
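As a small worked instance of these formulas, take P = 8:

$$\text{Ring: } BW_{\text{total}} = 8 \times BW_{\text{link}}, \quad BW_{\text{bisection}} = 2 \times BW_{\text{link}}$$

$$\text{Fully connected: } BW_{\text{total}} = \frac{8 \times 7}{2} \times BW_{\text{link}} = 28 \times BW_{\text{link}}, \quad BW_{\text{bisection}} = \left(\frac{8}{2}\right)^{2} \times BW_{\text{link}} = 16 \times BW_{\text{link}}$$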
Fallacies and Pitfalls
• Do not forget to try the "Check Yourself" sections
• The answers are given at the end of the chapter
Any questions/comments?
Thank you