CS416 – Parallel and Distributed Computing
Lecture # 03
Spring 2021
FAST – NUCES, Faisalabad Campus
Agenda
A Quick Review
Multi-processor vs Multi-computer
Flynn’s Taxonomy
SISD
MISD
SIMD
MIMD
Physical Organization of Parallel Platforms
PRAM
Routing Techniques and Costs
Interconnections for Parallel Platforms
CS416 - Spring 2021
Review of the Previous Lecture
Amdahl’s Law of Parallel Speedup
Purpose, derivation, and examples
Karp-Flatt Metric
Finding the sequential fraction of a given parallel setup
Types of Parallelism
Data-parallelism
Same operation on different data elements
Functional-parallelism
Different independent tasks with different operations on
different data elements can be parallelized
Pipelining
Overlapping the execution stages of multiple instructions to achieve parallelism
Karp-Flatt Metric
Karp-Flatt Metric (Review)
The metric is used to calculate the serial fraction of a given parallel configuration, i.e., if a parallel program exhibits a speedup S while using p processing units, then the serial fraction e is given by:

e = (1/S − 1/p) / (1 − 1/p)
Example task: Suppose a parallel program achieves a speedup of 1.25x on 5 processors; determine the sequential fraction of the program.
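The example task above can be worked through directly; a small sketch (the function name is mine, not from the slides):

```python
# Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p)
def karp_flatt(speedup, p):
    """Experimentally determined serial fraction e for speedup S on p processors."""
    return (1 / speedup - 1 / p) / (1 - 1 / p)

# Example task: S = 1.25 on p = 5 processors
e = karp_flatt(1.25, 5)
# e = (0.8 - 0.2) / (1 - 0.2) ~= 0.75, i.e. about 75% of the program is serial
```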
Quick Review of the Previous Lecture
Multiprocessor
Centralized multiprocessor
Distributed multiprocessor
Shared address space (NUMA) vs shared memory (UMA)
Multicomputer
Asymmetrical
Symmetrical
Cluster vs Network of Workstations
Multi-Processor
vs
Multi-Computer
Multi-Processor
Multiple CPUs with a shared memory
The same address on two different CPUs refers to the
same memory location.
Generally, there are two categories:
1. Centralized Multi-processors
2. Distributed Multi-processors
Multi-Processor
i. Centralized Multi-processor
Additional CPUs are attached
to the system bus, and all the
processors share the same
primary memory
All the memory is at one place
and has the same access time
from every processor
Also known as a UMA (Uniform Memory Access) multi-processor or SMP (Symmetric Multi-Processor)
Multi-Processor
ii. Distributed Multi-processor
Distributed collection of
memories forms one logical
address space
Again, the same address on
different processors refers to
the same physical memory
location.
Also known as non-uniform
memory access (NUMA)
architecture
Memory access time varies
significantly depending on the
physical location of the
referenced address
Cache consistency issues [assigned reading]
Multi-Computer
Distributed-memory, multi-CPU computer
Unlike NUMA architecture, a multicomputer has
disjoint local address spaces
The same address on different processors refers to
two different physical memory locations.
Each processor has direct access to its local memory only.
Processors interact with each other through
Message-Passing
Multi-Computer
Asymmetric Multi-Computers
A front-end computer that
interacts with users and I/O
devices
The back-end processors are dedicated to “number crunching”
The front-end computer executes a full multiprogrammed OS and provides all the functions needed for program development
The back-end ones are reserved for executing parallel programs
Multi-Computer
Symmetric Multi-Computers
Every computer executes the same OS
Users may log into any of the
computers
This enables multiple users to
concurrently login, edit and
compile their programs.
All the nodes can participate
in execution-invocation of a
parallel program
Network of Workstations
vs
Cluster
Cluster
Usually a co-located collection of low-cost computers and switches, dedicated to running parallel jobs.
All computers run the same version of the operating system.
Some of the computers may not have interfaces for users to log in.
Commodity clusters use high-speed networks for communication, such as Fast Ethernet @ 100 Mbps, Gigabit Ethernet @ 1000 Mbps, and Myrinet @ 1920 Mbps.

Network of Workstations
A dispersed collection of computers. Individual workstations may have different operating systems and executable programs.
Users have the power to log in and power off their workstations.
Ethernet speed for such a network is usually slower, typically in the range of 10 Mbps.
Architectural Classification of Systems
Flynn’s Taxonomy
Widely used architectural classification scheme
Classifies architectures into four types
The classification is based on how data and
instructions flow through the cores.
Flynn’s Taxonomy
SISD (Single Instruction Single Data)
Refers to a traditional computer: a serial architecture
This architecture includes single-core computers
A single instruction stream is in execution at a given time
Similarly, only one data stream is active at any time
Flynn’s Taxonomy
MISD (Multiple Instructions Single Data)
Multiple instruction streams and a single data stream
A pipeline of multiple
independently executing
functional units
Each operating on a single
stream of data and forwarding
results from one to the next
Rarely used in practice
E.g., systolic arrays: networks of primitive processing elements that pump data through them.
Flynn’s Taxonomy
SIMD (Single Instruction Multiple Data)
Refers to parallel architecture with
multiple cores
All the cores execute the same instruction stream at any time, but the data stream is different for each.
Well-suited for scientific computations requiring large matrix-vector operations
Vector computers (e.g., the Cray vector processing machines) and Intel’s MMX instruction-set extension fall under this category.
Used with array operations, image
processing and graphics
Flynn’s Taxonomy
MIMD (Multiple Instructions Multiple Data)
Multiple instruction streams and
multiple data streams
Different CPUs can simultaneously
execute different instruction
streams manipulating different
data
Most of the contemporary parallel
architectures fall under this
category e.g., Multiprocessor and
multicomputer architectures
Many MIMD architectures also include SIMD execution units.
Flynn’s Taxonomy
A typical SIMD architecture (a) and a typical MIMD architecture (b).
SIMD-MIMD Comparison
SIMD computers require less hardware than MIMD
computers (single control unit).
However, since SIMD processors are specially
designed, they tend to be expensive and have long
design cycles.
Not all applications are naturally suited to SIMD
processors.
In contrast, platforms supporting the SPMD (Single Program, Multiple Data) paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
SPMD is a close variant of MIMD.
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
Parallel Random Access Machine (PRAM)
An extension of the ideal sequential model, the random access machine (RAM)
PRAMs consist of p processors
A global memory
Unbounded size
Uniformly accessible to all processors, with the same address space
Processors share a common clock but may execute
different instructions in each cycle.
PRAMs can be further classified based on their simultaneous memory access mechanisms.
Architecture of an Ideal Parallel Computer
Parallel Random Access Machine (PRAM)
PRAMs can be divided into four subclasses.
1. Exclusive-read, exclusive-write (EREW) PRAM
No concurrent read/write operations allowed
Weakest PRAM model, provides minimum memory access
concurrency
2. Concurrent-read, exclusive-write (CREW) PRAM
Multiple read accesses are allowed; write accesses to a memory location are serialized
3. Exclusive-read, concurrent-write (ERCW) PRAM
4. Concurrent-read, concurrent-write (CRCW) PRAM
Most powerful PRAM model
Architecture of an Ideal Parallel Computer
Parallel Random Access Machine (PRAM)
Exclusive reads do not create any semantic inconsistencies
But what about concurrent writes?
We need an arbitration (mediation) mechanism to resolve concurrent write accesses
Architecture of an Ideal Parallel Computer
Parallel Random Access Machine (PRAM)
Commonly used arbitration protocols:
Common: write only if all values are identical
Arbitrary: write the data from a randomly selected
processor and ignore the rest.
Priority: follow a predetermined priority order.
Processor with highest priority succeeds and the rest
fail.
Sum: write the sum of the data items in all the write requests. The model can be extended to any associative operator defined on the data being written.
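The four protocols above can be sketched as a tiny simulation; an illustrative sketch only, with names of my own choosing (not part of any PRAM standard):

```python
def resolve_concurrent_write(requests, protocol):
    """requests: list of (processor_id, value) pairs targeting one memory cell.

    Returns the value that ends up stored, or None if the write fails.
    """
    values = [v for _, v in requests]
    if protocol == "common":
        # Succeeds only if all processors attempt to write the same value.
        return values[0] if len(set(values)) == 1 else None
    if protocol == "arbitrary":
        # Any one request wins; here we simply pick the first one.
        return values[0]
    if protocol == "priority":
        # Predetermined priority order: lowest processor id wins here.
        return min(requests)[1]
    if protocol == "sum":
        # Combine all requests with an associative operator (here, +).
        return sum(values)
    raise ValueError(f"unknown protocol: {protocol}")
```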
Architecture of an Ideal Parallel Computer
Physical Complexity of an Ideal Parallel Computer
Consider a realization of an EREW PRAM
Processors and memories are connected via
switches.
Since these switches must operate in O(1) time at
the level of words, for a system of p processors and
m words, the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true
PRAM is not realizable.
Communication Costs in Parallel Machines
Communication Costs in Parallel Machines
Along with idling, communication is a major
overhead in parallel programs.
The communication cost is usually dependent on a
number of features including the following:
Programming model for communication
Network topology
Data handling and routing
Associated network protocols
Usually, distributed systems suffer from major
communication overheads.
Message Passing Costs in Parallel Computers
The total time to transfer a message over a network comprises the following:
Startup time (ts): Time spent at sending and receiving nodes
(preparing the message[adding headers, trailers, and parity
information ] , executing the routing algorithm, establishing
interface between node and router, etc.).
Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
Also known as node latency.
Per-word transfer time (tw): This time includes all overheads
that are determined by the length of the message. This
includes bandwidth of links, and buffering overheads, etc.
Message Passing Costs in Parallel Computers
Store-and-Forward Routing
A message traversing multiple hops is completely
received at an intermediate hop before being
forwarded to the next hop.
The total communication cost for a message of size m words to traverse l communication links is
t_comm = ts + (m·tw + th)·l
In most platforms, th is small and the above expression can be approximated by
t_comm = ts + m·tw·l
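The store-and-forward cost model, t_comm = ts + (m·tw + th)·l, can be written out as a small sketch (the parameter values below are illustrative, not from the slides):

```python
def store_and_forward_time(ts, th, tw, m, l):
    """Exact model: the whole message is retransmitted on each of the l links."""
    return ts + (m * tw + th) * l

def store_and_forward_approx(ts, tw, m, l):
    """Approximation when th is negligible: t_comm ~= ts + m*tw*l."""
    return ts + m * tw * l

# Illustrative values: ts = 100, th = 1, tw = 2 (time units), m = 1000 words, l = 4 links
exact = store_and_forward_time(100, 1, 2, 1000, 4)   # 100 + (2000 + 1) * 4 = 8104
approx = store_and_forward_approx(100, 2, 1000, 4)   # 100 + 2000 * 4 = 8100
```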
Message Passing Costs in Parallel Computers
Packet Routing
Store-and-forward makes poor use of
communication resources.
Packet routing breaks messages into packets and
pipelines them through the network.
Since packets may take different paths, each
packet must carry routing information, error
checking, sequencing, and other related header
information.
The total communication time for packet routing is approximated by:
t_comm = ts + th·l + tw·m
Here the factor tw also accounts for overheads in packet headers.
Message Passing Costs in Parallel Computers
Cut-Through Routing
Takes the concept of packet routing to an extreme
by further dividing messages into basic units called
flits or flow control digits.
Since flits are typically small, the header information
must be minimized.
This is done by forcing all flits to take the same path,
in sequence.
A tracer message first programs all intermediate
routers. All flits then take the same route.
Error checks are performed on the entire message,
as opposed to flits.
No sequence numbers are needed.
Message Passing Costs in Parallel Computers
Cut-Through Routing
The total communication time for cut-through routing is approximated by:
t_comm = ts + l·th + m·tw
This is identical in form to packet routing; however, tw here is typically much smaller than the tw in packet routing
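To see why cut-through routing helps, the two models can be compared side by side (a sketch with made-up parameter values):

```python
def store_and_forward_time(ts, th, tw, m, l):
    # The whole message pays the m*tw cost on every one of the l links.
    return ts + (m * tw + th) * l

def cut_through_time(ts, th, tw, m, l):
    # Only the header pays the per-hop cost; the message body is pipelined.
    return ts + l * th + m * tw

# Illustrative values: ts = 100, th = 1, tw = 2, m = 1000 words, l = 10 links
sf = store_and_forward_time(100, 1, 2, 1000, 10)  # 100 + 2001 * 10 = 20110
ct = cut_through_time(100, 1, 2, 1000, 10)        # 100 + 10 + 2000 = 2110
# Cut-through avoids multiplying the m*tw term by the number of links.
```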
Message Passing Costs in Parallel Computers
Figure: (a) a message passing through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing.
Message Passing Costs in Parallel Computers
Simplified Cost Model for Communicating Messages
The cost of communicating a message between two nodes l hops away using cut-through routing is given by
t_comm = ts + l·th + tw·m
In this expression, th is typically much smaller than ts and tw. For this reason, the second term on the RHS does not dominate, particularly when m is large.
We can therefore approximate the cost of message transfer by
t_comm = ts + tw·m
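The quality of this simplified model can be checked numerically; a sketch with illustrative parameter values of my own choosing:

```python
def cut_through_time(ts, th, tw, m, l):
    # Full cut-through model: t_comm = ts + l*th + tw*m
    return ts + l * th + tw * m

def simplified_time(ts, tw, m):
    # Drops the l*th term, which is dwarfed by ts and tw*m when m is large.
    return ts + tw * m

# The relative error of the simplified model shrinks as the message grows:
for m in (10, 100, 10_000):
    exact = cut_through_time(100, 1, 2, m, 8)
    approx = simplified_time(100, 2, m)
    print(m, (exact - approx) / exact)
```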
Message Passing Costs in Parallel Computers
Simplified Cost Model for Communicating Messages
It is important to note that the original expression for communication time is valid only for uncongested networks.
Different communication patterns congest different
networks to varying extents.
It is important to understand and account for this in
the communication time accordingly.
Questions
References
1. Flynn, M., “Some Computer Organizations and Their Effectiveness,” IEEE Transactions on Computers, Vol. C-21,
No. 9, September 1972.
2. Kumar, V., Grama, A., Gupta, A., & Karypis, G. (1994). Introduction to parallel computing (Vol. 110). Redwood City,
CA: Benjamin/Cummings.
3. Quinn, M. J. (2003). Parallel Programming in C with MPI and OpenMP. McGraw-Hill.