CMPE 478 Parallel Processing
[Picture: Tianhe, the most powerful computer in the world in November 2013]
CMPE 478 1
Von Neumann Architecture
[Figure: CPU, RAM, and devices connected by a single BUS]
• sequential computer
CMPE 478 2
Memory Hierarchy
Registers    (fast)
Cache
Real memory
Disk
CD           (slow)
CMPE 478 3
History of Computer Architecture
• 4 Generations (identified by logic technology)
1. Tubes
2. Transistors
3. Integrated Circuits
4. VLSI (very large scale integration)
CMPE 478 4
PERFORMANCE TRENDS
CMPE 478 5
PERFORMANCE TRENDS
• Traditional mainframe/supercomputer performance 25%
increase per year
• But … microprocessor performance 50% increase per year
since mid 80’s.
CMPE 478 6
Moore’s Law
• “Transistor density doubles every 18 months”
• Moore is a co-founder of Intel.
• ~60% increase per year, i.e. exponential growth
• PC costs decline.
• PCs are the building bricks of all future systems.
[Chart: transistor counts over time; e.g. the Intel 62-core Xeon Phi (2012) has about 5 billion transistors]
CMPE 478 7
VLSI Generation
CMPE 478 8
Bit Level Parallelism
(up to mid 80’s)
• 4 bit microprocessors replaced by 8 bit, 16 bit, 32 bit etc.
• doubling the width of the datapath reduces the number of
cycles required to perform a full 32-bit operation
• mid 80’s reap benefits of this kind of parallelism (full 32-
bit word operations combined with the use of caches)
CMPE 478 9
Instruction Level Parallelism
(mid 80’s to mid 90’s)
• Basic steps in instruction processing (instruction decode,
integer arithmetic, address calculation) could be performed in
a single cycle
• Pipelined instruction processing
• Reduced instruction set (RISC)
• Superscalar execution
• Branch prediction
CMPE 478 10
Thread/Process Level Parallelism
(mid 90’s to present)
• On average, control transfers occur roughly once in every five
instructions, so exploiting instruction level parallelism at a
larger scale is not possible
• Use multiple independent “threads” or processes
• Concurrently running threads, processes
CMPE 478 11
Evolution of the Infrastructure
• Electronic Accounting Machine Era: 1930-1950
• General Purpose Mainframe and Minicomputer Era: 1959-
Present
• Personal Computer Era: 1981 – Present
• Client/Server Era: 1983 – Present
• Enterprise Internet Computing Era: 1992- Present
CMPE 478 12
Sequential vs Parallel Processing
Sequential:
• physical limits reached
• easy to program
• expensive supercomputers

Parallel:
• “raw” power unlimited
• more memory, multiple caches
• made up of COTS, so cheap
• difficult to program
CMPE 478 13
What is Multi-Core Programming ?
• Answer: It is basically parallel programming on a single
computer box (e.g. a desktop, a notebook, a blade)
CMPE 478 14
Processor Trends
CMPE 478
Another Important Benefit of
Multi-Core : Reduced Energy Consumption
Single core @ 2 GHz vs. dual core @ 1 GHz per core:
• the single core executes a workload of N clock cycles
• each core of the dual core executes a workload of N/2 clock cycles; at half
  the clock frequency the supply voltage can also be roughly halved

Single core:  energy per cycle  Ec  = C * Vdd^2
              Energy            = Ec * N
Dual core:    energy per cycle  E'c = C * (0.5*Vdd)^2 = 0.25 * C * Vdd^2
              Energy'           = 2 * (E'c * 0.5 * N)
                                = E'c * N
                                = 0.25 * (Ec * N)
                                = 0.25 * Energy
CMPE 478 16
SPMD Model
(Single Program Multiple Data)
• Each processor executes the same program asynchronously
• Synchronization takes place only when processors need to
exchange data
• SPMD is an extension of SIMD (relaxes synchronized instruction
execution)
• SPMD is a restriction of MIMD (uses only one source/object)
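• A minimal sketch of the SPMD style in C with MPI (message passing libraries are introduced later in these slides); the array size and the block partitioning of the work are illustrative assumptions:

/* SPMD sketch with MPI: every process runs this same program asynchronously
   and synchronizes only when data is exchanged (here, in MPI_Reduce).
   N and the cyclic partitioning of the iterations are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id (mypid)     */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    /* each process works independently on its own part of the iteration space */
    long long local = 0, global = 0;
    for (long long i = rank; i < N; i += size)
        local += i;

    /* synchronization happens only here, when partial results are combined */
    MPI_Reduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %lld\n", global);
    MPI_Finalize();
    return 0;
}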
CMPE 478 17
Parallel Processing Terminology
• Embarrassingly Parallel:
- applications which are trivial to parallelize
- large amounts of independent computation
- Little communication
• Data Parallelism:
- model of parallel computing in which a single operation can be
applied to all data elements simultaneously
- amenable to SIMD or SPMD style of computation
• Control Parallelism:
- many different operations may be executed concurrently
- requires MIMD/SPMD style of computation
CMPE 478 18
Parallel Processing Terminology
• Scalability:
- If the size of problem is increased, number of processors that can be
effectively used can be increased (i.e. there is no limit on
parallelism).
- Cost of scalable algorithm grows slowly as input size and the
number of processors are increased.
- Data parallel algorithms are more scalable than control parallel
algorithms
• Granularity:
- fine grain machines: employ massive number of weak processors
each with small memory
- coarse grain machines: smaller number of powerful processors each
with large amounts of memory
CMPE 478 19
Models of Parallel Computers
1. Message Passing Model
- Distributed memory
- Multicomputer
2. Shared Memory Model
- Multiprocessor
- Multi-core
3. Theoretical Model
- PRAM
• New architectures: combination of 1 and 2.
CMPE 478 20
Theoretical PRAM Model
• Used by parallel algorithm designers
• Algorithm designers do not want to worry about low level
details: They want to concentrate on algorithmic details
• Extends classic RAM model
• Consists of:
– Control unit (common clock), synchronous
– Global shared memory
– Unbounded set of processors, each with its own private
memory
CMPE 478 21
Theoretical PRAM Model
• Some characteristics
– Each processor has a unique identifier, mypid=0,1,2,…
– All processors operate synchronously under the control of a
common clock
– In each unit of time, each processor is allowed to execute an
instruction or stay idle
CMPE 478 22
Various PRAM Models
(ordered from weakest to strongest, according to how write conflicts
to the same memory location are handled)
• EREW (exclusive read / exclusive write)
• CREW (concurrent read / exclusive write)
• CRCW (concurrent read / concurrent write)
   – Common (all processors must write the same value)
   – Arbitrary (one processor is chosen arbitrarily)
   – Priority (the processor with the lowest index writes)
CMPE 478 23
Flynn’s Taxonomy
• classifies computer architectures according to:
1. Number of instruction streams it can process at a time
2. Number of data elements on which it can operate
simultaneously
                           Data Streams
                           Single     Multiple
Instruction    Single       SISD       SIMD
Streams        Multiple     MISD       MIMD
CMPE 478 24
Shared Memory Machines
Shared Address Space
[Figure: multiple processes (threads) all accessing a single shared address space]
• Memory is globally shared, therefore processes (threads) see single address
space
• Coordination of accesses to locations done by use of locks provided by
thread libraries
• Example Machines: Sequent, Alliant, SUN Ultra, Dual/Quad Board Pentium PC
• Example Thread Libraries: POSIX threads, Linux threads.
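• A minimal sketch of coordinating access to a shared location with POSIX threads and a lock; the thread count and the shared counter are illustrative assumptions:

/* Shared address space sketch with POSIX threads: all threads see the same
   'counter' variable; a lock coordinates accesses to it. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                       /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* coordinate access to shared data */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);        /* 4 * 100000 = 400000 */
    return 0;
}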
CMPE 478 25
Shared Memory Machines
• can be classified as:
- UMA: uniform memory access
- NUMA: nonuniform memory access
based on the amount of time a processor takes to access local and
global memory.
[Figure: (a) UMA organization with processors and memory modules connected through an
interconnection network or bus; (b), (c) NUMA organizations in which each processor
also has a local memory and reaches remote memories over the interconnection network]
CMPE 478 26
Distributed Memory Machines
[Figure: processes, each with its own local memory M, connected by a network]
• Each processor has its own local memory (not directly accessible by others)
• Processors communicate by passing messages to each other
• Example Machines: IBM SP2, Intel Paragon, COWs (cluster of workstations)
• Example Message Passing Libraries: PVM, MPI
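• A minimal sketch of message passing with MPI: rank 0 sends a value to rank 1, since neither process can read the other's local memory directly (the tag and the value are illustrative):

/* Message-passing sketch with MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                                   /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}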
CMPE 478 27
Beowulf Clusters
• Use COTS, ordinary PCs and networking equipment
• Has the best price/performance ratio
[Picture: a PC cluster]
CMPE 478 28
Multi-Core Computing
• A multi-core microprocessor is one which combines two or more
independent processors into a single package, often a single integrated
circuit.
• A dual-core device contains only two independent microprocessors.
CMPE 478 29
Comparison of Different Architectures
[Diagram: one CPU state, one execution unit, one cache]
Single Core Architecture
CMPE 478 30
Comparison of Different Architectures
[Diagram: two processors, each with its own CPU state, execution unit, and cache]
Multiprocessor
CMPE 478 31
Comparison of Different Architectures
[Diagram: two CPU states sharing a single execution unit and a single cache]
Hyper-Threading Technology
CMPE 478 32
Comparison of Different Architectures
[Diagram: two cores on one chip, each with its own CPU state, execution unit, and cache]
Multi-Core Architecture
CMPE 478 33
Comparison of Different Architectures
[Diagram: two cores, each with its own CPU state and execution unit, sharing a single cache]
Multi-Core Architecture with Shared Cache
CMPE 478 34
Comparison of Different Architectures
[Diagram: two cores, each with two CPU states (hardware threads), its own execution unit, and its own cache]
Multi-Core with Hyper-Threading Technology
CMPE 478 35
CMPE 478 36
Top 500 Most Powerful
Supercomputer Lists
• http://www.top500.org/
• ……..
CMPE 478 37
PARALLEL PERFORMANCE MODELS
and
ALGORITHMS
38
CMPE 478
Amdahl’s Law
• The serial percentage of a program is fixed, so the speed-up obtained by
employing parallel processing is bounded.
• Led to pessimism in the parallel processing community and prevented
development of parallel machines for a long time.
• With serial fraction s and P processors:

      Speedup = 1 / (s + (1-s)/P)

• In the limit (P → ∞):  Speedup = 1/s
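• A small sketch in C evaluating the bound for illustrative values of s and P:

/* Amdahl's law: speedup = 1 / (s + (1-s)/P). Values of s and P are illustrative. */
#include <stdio.h>

double amdahl_speedup(double s, double P) {
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void) {
    /* even with a 5% serial fraction the speedup is bounded by 1/0.05 = 20 */
    printf("s=0.05, P=100   -> %.2f\n", amdahl_speedup(0.05, 100));     /* ~16.8 */
    printf("s=0.05, P=10000 -> %.2f\n", amdahl_speedup(0.05, 10000));   /* ~20.0 */
    return 0;
}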
CMPE 478 39
Gustafson’s Law
• Serial percentage is dependent on the number of
processors/input.
• Demonstrated achieving more than 1000 fold speedup using
1024 processors.
• Justified parallel processing
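• The slide does not give the formula; a small sketch using the commonly cited form of Gustafson's scaled speedup, S(P) = P - s*(P-1), with illustrative values:

/* Gustafson's law (scaled speedup) in its commonly cited form, where s is the
   serial fraction of the parallel run and the problem size grows with P. */
#include <stdio.h>

double gustafson_speedup(double s, double P) {
    return P - s * (P - 1.0);
}

int main(void) {
    /* with a small serial fraction, speedup keeps growing with P */
    printf("s=0.01, P=1024 -> %.1f\n", gustafson_speedup(0.01, 1024));  /* ~1013.8 */
    return 0;
}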
CMPE 478 40
Algorithmic Performance Parameters
• Notation:
   – n : input size
   – T*(n) : time complexity of the best sequential algorithm
   – P : number of processors
   – T_P(n) : time complexity of the parallel algorithm when run on P
     processors
   – T_1(n) : time complexity of the parallel algorithm when run on 1
     processor
Algorithmic Performance Parameters
• Speed-Up:    S(P) = T*(n) / T_P(n)
• Efficiency:  E(P) = S(P) / P = T*(n) / (P · T_P(n))
CMPE 478 42
Algorithmic Performance Parameters
• Work = Processors × Time
   – Informally: how much time the parallel algorithm would take to
     simulate on a serial machine
   – Formally:  W(n) = P · T_P(n)
CMPE 478 43
Algorithmic Performance Parameters
• Work Efficient:
   – Informally: a work efficient parallel algorithm does no more
     work than the best serial algorithm
   – Formally: a work efficient algorithm satisfies  P · T_P(n) = O(T*(n))
CMPE 478 44
Algorithmic Performance Parameters
• Scalability:
– Informally, scalability implies that if the size of the problem
is increased, the number of processors effectively used can
be increased (i.e. there is no limit on parallelism)
– Formally, scalability means:
CMPE 478 45
Algorithmic Performance Parameters
• Some remarks:
– Cost of scalable algorithm grows slowly as input size and
the number of processors are increased
– Level of ‘control parallelism’ is usually a constant
independent of problem size
– Level of ‘data parallelism’ is an increasing function of
problem size
– Data parallel algorithms are more scalable than control
parallel algorithms
CMPE 478 46
Goals in Designing Parallel Algorithms
• Scalability:
– Algorithm cost grows slowly, preferably in a
polylogarithmic manner
• Work Efficient:
– We do not want to waste CPU cycles
– May be an important point when we are worried about
power consumption or ‘money’ paid for CPU usage
CMPE 478 47
Summing N numbers in Parallel
x1 x2 x3 x4 x5 x6 x7 x8
step 1
x1+x2 x2 x3+x4 x4 x5+x6 x6 x7+x8 x8
step 2
x1+..+x4 x2 x3+x4 x4 x5+..+x8 x6 x7+x8 x8
step 3
x1+..+x8 x2 x3+x4 x4 x5+..+x8 x6 x7+x8 x8
result
• Array of N numbers can be summed in log(N) steps using
N/2 processors
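• A sequential C sketch that simulates the log(N) parallel steps above (on a PRAM, each inner-loop iteration would be done by a separate processor); N = 8 and the input values are illustrative:

/* Tree summation: in step k, the processor responsible for index i adds in
   the value 2^(k-1) positions to its right. N is assumed to be a power of two. */
#include <stdio.h>

#define N 8

int main(void) {
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (int stride = 1; stride < N; stride *= 2)       /* log2(N) steps              */
        for (int i = 0; i + stride < N; i += 2 * stride)
            x[i] += x[i + stride];                       /* done in parallel on a PRAM */
    printf("sum = %g\n", x[0]);                          /* 36 */
    return 0;
}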
CMPE 478
Prefix Summing N numbers in Parallel
x1 x2 x3 x4 x5 x6 x7 x8
step 1
x1+x2 x2+x3 x3+x4 x4+x5 x5+x6 x6+x7 x7+x8 x8
step 2
x1+..+x4 x2+..+x5 x3+..+x6 x4+..+x7 x5+..+x8 x6+..+x8 x7+x8 x8
step 3
x1+..+x8 x2+..+x8 x3+..+x8 x4+..+x8 x5+..+x8 x6+..+x8 x7+x8 x8
• Computing partial sums of an array of N numbers can be done in
log(N) steps using N processors
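• A sequential C sketch simulating the same log(N) steps for the partial sums (each position i ends up holding x_i + ... + x_N); the input values are illustrative:

/* Data-parallel partial-sum computation: in each step every position adds in
   the value 'stride' places to its right. The copy to 'old' mimics the
   synchronous read-then-write behaviour of a PRAM step. */
#include <stdio.h>

#define N 8

int main(void) {
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    for (int stride = 1; stride < N; stride *= 2) {
        double old[N];
        for (int i = 0; i < N; i++) old[i] = x[i];
        for (int i = 0; i + stride < N; i++)
            x[i] = old[i] + old[i + stride];             /* all positions updated in parallel */
    }
    for (int i = 0; i < N; i++) printf("%g ", x[i]);     /* 36 35 33 30 26 21 15 8 */
    printf("\n");
    return 0;
}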
CMPE 478
Prefix Paradigm for Parallel Algorithm
Design
• Prefix computation forms a paradigm for parallel algorithm
development, just like other well known paradigms such as:
– divide and conquer, dynamic programming, etc.
• Prefix Paradigm:
– If possible, transform your problem to prefix type
computation
– Apply the efficient logarithmic prefix computation
• Examples of Problems solved by Prefix Paradigm:
– Solving linear recurrence equations
– Tridiagonal Solver
– Problems on trees
– Adaptive triangular mesh refinement
CMPE 478
Solving Linear Recurrence Equations
• Given the linear recurrence equation:

      z_i = a_i * z_{i-1} + b_i * z_{i-2}

• we can rewrite it as:

      [ z_i     ]   [ a_i  b_i ] [ z_{i-1} ]
      [ z_{i-1} ] = [ 1    0   ] [ z_{i-2} ]

• if we expand it, we get the solution in terms of partial products of the
coefficient matrices and the initial values z1 and z0:

      [ z_i     ]   [ a_i  b_i ] [ a_{i-1}  b_{i-1} ]     [ a_2  b_2 ] [ z_1 ]
      [ z_{i-1} ] = [ 1    0   ] [ 1        0       ] ... [ 1    0   ] [ z_0 ]

• use prefix computation to compute the partial products
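• A sequential C sketch of the matrix formulation; the partial products are formed left to right here, whereas the parallel algorithm would obtain them with the logarithmic prefix computation (matrix multiplication is associative). The coefficients a_i = b_i = 1 and the initial values are illustrative (they give Fibonacci-like numbers):

/* Solve z_i = a_i*z_{i-1} + b_i*z_{i-2} through products of 2x2 matrices. */
#include <stdio.h>

#define N 8

typedef struct { double m[2][2]; } Mat2;

static Mat2 mul(Mat2 A, Mat2 B) {
    Mat2 C;
    for (int r = 0; r < 2; r++)
        for (int c = 0; c < 2; c++)
            C.m[r][c] = A.m[r][0]*B.m[0][c] + A.m[r][1]*B.m[1][c];
    return C;
}

int main(void) {
    double a[N + 1], b[N + 1], z[N + 1];
    z[0] = 1.0; z[1] = 1.0;                                   /* initial values z0, z1 */
    for (int i = 2; i <= N; i++) { a[i] = 1.0; b[i] = 1.0; }

    Mat2 P = {{{1, 0}, {0, 1}}};                              /* identity */
    for (int i = 2; i <= N; i++) {
        Mat2 Mi = {{{a[i], b[i]}, {1, 0}}};
        P = mul(Mi, P);                                       /* P = M_i * M_{i-1} * ... * M_2 */
        z[i] = P.m[0][0]*z[1] + P.m[0][1]*z[0];               /* [z_i; z_{i-1}] = P * [z1; z0] */
    }
    for (int i = 0; i <= N; i++) printf("z[%d] = %g\n", i, z[i]);  /* 1 1 2 3 5 8 13 21 34 */
    return 0;
}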
CMPE 478
Pointer Jumping Technique
x1 x2 x3 x4 x5 x6 x7 x8
step 1
x1+x2 x2+x3 x3+x4 x4+x5 x5+x6 x6+x7 x7+x8 x8
step 2
x1+..+x4 x2+..+x5 x3+..+x6 x4+..+x7 x5+..+x8 x6+..+x8 x7+x8 x8
step 3
x1+..+x8 x2+..+x8 x3+..+x8 x4+..+x8 x5+..+x8 x6+..+x8 x7+x8 x8
• A linked list of N numbers can be prefix-summed in log(N)
steps using N processors
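• A sequential C sketch simulating pointer jumping on a linked list (the values and the list order are illustrative; on a PRAM every node would be handled by its own processor in every step):

/* Pointer jumping: in each step every node adds the value of the node its
   pointer currently refers to and then the pointer jumps ahead
   (next = next->next). After log(N) steps each node holds the sum from
   itself to the end of the list. */
#include <stdio.h>

#define N 8

int main(void) {
    double val[N]  = {1, 2, 3, 4, 5, 6, 7, 8};
    int    next[N] = {1, 2, 3, 4, 5, 6, 7, -1};          /* -1 marks the end of the list */

    int done = 0;
    while (!done) {                                        /* log2(N) iterations */
        double v[N]; int nx[N];
        for (int i = 0; i < N; i++) { v[i] = val[i]; nx[i] = next[i]; }  /* snapshot = one PRAM step */
        done = 1;
        for (int i = 0; i < N; i++) {                      /* every node acts "in parallel" */
            if (nx[i] != -1) {
                val[i]  += v[nx[i]];
                next[i]  = nx[nx[i]];
                done = 0;
            }
        }
    }
    for (int i = 0; i < N; i++) printf("%g ", val[i]);     /* 36 35 33 30 26 21 15 8 */
    printf("\n");
    return 0;
}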
CMPE 478
Euler Tour Technique
Tree Problems:
• Preorder numbering
• Postorder numbering
• Number of descendants
• Level of each node
[Figure: example tree with root a; a's children are b, c, d; b's children are e, f, g;
g's children are h, i]
• To solve such problems, first transform the tree by linearizing it
into a linked-list and then apply the prefix computation
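• A sequential C sketch of the idea for computing node levels on the example tree: build the Euler tour, assign +1/-1 weights to its edges, and prefix-sum the weights. The prefix sum is a plain loop here; the parallel version would apply pointer-jumping prefix summation to the tour, and the sign convention may differ from the one used on the following slides:

/* Euler tour technique for node levels: each tree edge appears once in each
   direction on the tour; downward edges get weight +1, upward edges -1, and
   a prefix sum over the tour gives level(v) at the edge entering v. */
#include <stdio.h>

#define N 9                      /* nodes 0..8 = a b c d e f g h i */
#define E (2 * (N - 1))          /* number of directed tour edges */

static int first_child[N] = {1, 4, -1, -1, -1, -1, 7, -1, -1};
static int next_sib[N]    = {-1, 2, 3, -1, 5, 6, -1, 8, -1};

static int tour_to[E], weight[E], ne = 0;

static void build_tour(int v) {                 /* DFS that records the tour edges */
    for (int c = first_child[v]; c != -1; c = next_sib[c]) {
        tour_to[ne] = c; weight[ne++] = +1;     /* downward edge <v,c> */
        build_tour(c);
        tour_to[ne] = v; weight[ne++] = -1;     /* upward edge <c,v>   */
    }
}

int main(void) {
    const char *name = "abcdefghi";
    int level[N], prefix = 0;
    build_tour(0);
    level[0] = 0;                               /* level(root) = 0 */
    for (int e = 0; e < E; e++) {               /* prefix sum of the edge weights */
        prefix += weight[e];
        if (weight[e] == +1) level[tour_to[e]] = prefix;  /* read off at the edge entering v */
    }
    for (int v = 0; v < N; v++) printf("level(%c) = %d\n", name[v], level[v]);
    return 0;
}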
CMPE 478
Computing Level of Each Node by Euler
Tour Technique
weight assignment:  w(<u,v>) = -1 if v is a child of u (downward edge),
                    w(<u,v>) = +1 if v is the parent of u (upward edge)

level(v) = pw(<v,parent(v)>)
level(root) = 0

Euler tour:                 a  d  a  c  a  b  g  i  g  h  g  b  f  b  e  b  a
initial weights w(<u,v>):  -1  1 -1  1 -1 -1 -1  1 -1  1  1 -1  1 -1  1  1
prefix pw(<u,v>):           0  1  0  1  0  1  2  3  2  3  2  1  2  1  2  1
CMPE 478
Computing Number of Descendants by
Euler Tour Technique
weight assignment:  w(<u,v>) = 1 if v is a child of u (downward edge),
                    w(<u,v>) = 0 if v is the parent of u (upward edge)

# of descendants(v) = pw(<parent(v),v>) - pw(<v,parent(v)>)
# of descendants(root) = n

Euler tour:                 a  d  a  c  a  b  g  i  g  h  g  b  f  b  e  b  a
initial weights w(<u,v>):   1  0  1  0  1  1  1  0  1  0  0  1  0  1  0  0
prefix pw(<u,v>):           8  7  7  6  6  5  4  3  3  2  2  2  1  1  0  0
CMPE 478
Preorder Numbering by Euler Tour Technique

weight assignment:  w(<u,v>) = 0 if v is a child of u (downward edge),
                    w(<u,v>) = 1 if v is the parent of u (upward edge)

preorder(v) = 1 + pw(<v,parent(v)>)
preorder(root) = 1
(resulting preorder numbers: a=1, b=2, e=3, f=4, g=5, h=6, i=7, c=8, d=9)

Euler tour:                 a  d  a  c  a  b  g  i  g  h  g  b  f  b  e  b  a
initial weights w(<u,v>):   0  1  0  1  0  0  0  1  0  1  1  0  1  0  1  1
prefix pw(<u,v>):           8  8  7  7  6  6  6  6  5  5  4  3  3  2  2  1
CMPE 478
Postorder Numbering by Euler Tour Technique

weight assignment:  w(<u,v>) = 1 if v is a child of u (downward edge),
                    w(<u,v>) = 0 if v is the parent of u (upward edge)

postorder(v) = pw(<parent(v),v>)
postorder(root) = n
(resulting postorder numbers: e=1, f=2, h=3, i=4, g=5, b=6, c=7, d=8, a=9)

Euler tour:                 a  d  a  c  a  b  g  i  g  h  g  b  f  b  e  b  a
initial weights w(<u,v>):   1  0  1  0  1  1  1  0  1  0  0  1  0  1  0  0
prefix pw(<u,v>):           8  7  7  6  6  5  4  3  3  2  2  2  1  1  0  0
CMPE 478
Binary Tree Traversal
• Preorder
• Inorder
• Postorder
CMPE 478
Brent’s Theorem
• Given a parallel algorithm with computation time (depth) D, if the parallel
algorithm performs W operations in total, then P processors can execute
the algorithm in time D + (W-D)/P
• For the proof, consider the DAG representation of the computation
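• A small C sketch applying the bound to summing N numbers (W = N-1 additions, depth D = ceil(log2 N)); the values of N and P are illustrative:

/* Brent's theorem bound: P processors need about D + (W-D)/P steps. */
#include <math.h>
#include <stdio.h>

double brent_time(double W, double D, double P) {
    return D + (W - D) / P;
}

int main(void) {
    double N = 1024.0, W = N - 1.0, D = ceil(log2(N));
    printf("P = 512: %.1f steps\n", brent_time(W, D, 512));   /* ~12.0 */
    printf("P =  32: %.1f steps\n", brent_time(W, D,  32));   /* ~41.7 */
    return 0;
}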
CMPE 478
Work Efficiency
• Parallel Summation
• Parallel Prefix Summation
CMPE 478