Dr. Muhammad Anwaar Saeed
Dr. Said Nabi
Ms. Hina Ishaq
CS621 Parallel and Distributed Computing

Performance and Scalability
CS621 Parallel and Distributed Computing

Objectives
• What is performance?
• Introduction to Analytical Modeling

What is Performance?
“Computation performance is a measure of how well a computer system can execute a given set of instructions. It can be measured in terms of execution time, overhead, speedup, and cost, among others.”
Analytical Modeling - Basics
• A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size).
• The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine.
Analytical Modeling - Basics
• The asymptotic runtime of a sequential program is identical on any serial platform.
• The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine.
• An algorithm must therefore be analyzed in the context of the underlying platform.
A number of performance measures are intuitive:
• Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed or the program is ported to another machine altogether?
• How much faster is the parallel version? This begs the obvious follow-up question: what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look better?
• Raw FLOP count - What good are FLOP counts when they don't solve a problem?
Sources of Overhead in Parallel Programs
CS621 Parallel and Distributed Computing

Objectives
• Introduction to Execution Overhead
• Sources of Overhead in Parallel Programs
Sources of Overhead in Parallel Programs
If I use two processors, should not my program run twice as
fast?
No - a number of overheads, including wasted computation,
communication, idling, and contention cause degradation
in performance.
Sources of Overhead in
Parallel Programs
• The execution profile of a
hypothetical parallel program
executing on eight processing
elements.
• Profile indicates times spent
performing computation (both
essential and excess),
communication, and idling.
Sources of Overheads in Parallel
Programs
Inter-process interactions:
Processors working on any non-trivial parallel problem will need to talk to each
other.
Idling:
Processes may idle because of load imbalance, synchronization, or serial
components.
Sources of Overheads in Parallel
Programs
Excess Computation:
• The difference in computation performed by the parallel program and the best serial program is the excess computation overhead incurred by the parallel program.
• This is computation not performed by the serial version.
• This might be because the serial algorithm is difficult to parallelize, or because some computations are repeated across processors to minimize communication.
Performance Metrics for Parallel Systems
CS621 Parallel and Distributed Computing

Objectives
• Serial vs Parallel Performance
• Performance Metrics
Performance Metrics for
Parallel Systems
• It is important to study the performance of parallel
programs with a view to determining the best
algorithm, evaluating hardware platforms, and
examining the benefits from parallelism.
• A number of metrics have been used based on the
desired outcome of performance analysis.
Performance Metrics for Parallel
Systems: Execution Time
Serial runtime of a program is the time elapsed between the beginning and the
end of its execution on a sequential computer.
The parallel runtime is the time that elapses from the moment the first processor
starts to the moment the last processor finishes execution.
We denote the serial runtime by TS and the parallel runtime by TP .
Performance Metrics: Total Parallel
Overhead/Overhead function
The total time collectively spent by all the processing elements over and above
that required by the fastest known sequential algorithm for solving the same
problem on a single processing element.
Let Tall be the total time collectively spent by all the processing elements and TS
is the serial time.
Tall - TS is then the total time spent by all processors combined on non-useful work. This is called the total overhead.
Performance Metrics: Total Parallel
Overhead/Overhead function
The total time collectively spent by all the processing elements
Tall = p TP (p is the number of processors).
The overhead function (To) is therefore given by
To = p TP - TS
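As a quick sanity check, these definitions translate directly into code; a minimal Python sketch with made-up timing values (the 100 s / 30 s figures below are illustrative only):

```python
def total_overhead(p, t_parallel, t_serial):
    """Overhead function To = p * TP - TS: time spent beyond useful serial work."""
    return p * t_parallel - t_serial

# Made-up values for illustration: TS = 100 s serially, TP = 30 s on p = 4 processors.
p, TP, TS = 4, 30.0, 100.0
Tall = p * TP                      # total time over all processing elements = 120 s
To = total_overhead(p, TP, TS)     # 120 - 100 = 20 s of non-useful work
print(Tall, To)
```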
Performance Metrics for Parallel Systems: Speedup
CS621 Parallel and Distributed Computing

Objectives
• Introduction to Speedup
• Speedup Example
Performance Metrics for
Parallel Systems: Speedup
“Speedup (S) is the ratio of the time taken to solve a problem
on a single processor to the time required to solve the same
problem on a parallel computer with p identical processing
elements”
Performance Metrics: Example
• Consider the problem of adding n numbers by using n processing elements.
• If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.
Performance Metrics: Example Cont..
• This figure illustrates the procedure for n
= 16.
• The processing elements are labeled
from 0 to 15.
• Similarly, the 16 numbers to be added
are labeled from 0 to 15.
• The sum of the numbers with consecutive labels from i to j is denoted by Σ_i^j.
• Each step shown in Figure consists of
one addition and the communication of a
single word.
Performance Metrics: Example Cont..
• If an addition takes constant time, say, tc, and communication of a single word takes time ts + tw, we have the parallel time TP = Θ(log n).
• We know that TS = Θ(n).
• Speedup S is given by S = Θ(n / log n).
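A minimal Python sketch of this logarithmic reduction (serial code standing in for what the n processing elements would do concurrently; assumes n is a power of two):

```python
import math

def tree_sum(values):
    """Sum n = 2^k values the way n processing elements would:
    in each step, the element at a multiple of 2*step absorbs the
    partial sum held 'step' positions away."""
    vals = list(values)
    steps = 0
    step = 1
    while step < len(vals):
        for i in range(0, len(vals), 2 * step):
            vals[i] += vals[i + step]
        step *= 2
        steps += 1
    return vals[0], steps

n = 16
total, steps = tree_sum(range(n))
print(total, steps)                  # 120, 4  (4 = log2(16) steps)
assert total == sum(range(n)) and steps == int(math.log2(n))
```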
Performance Metrics: Speedup
For a given problem, there might be many serial algorithms available.
These algorithms may have different asymptotic runtimes and may be
parallelizable to different degrees.
For the purpose of computing speedup, we always consider the best
sequential program as the baseline.
Performance Metrics: Speedup Example
• Consider the problem of parallel bubble sort.
• The serial time for bubble sort is 150 seconds.
• The parallel time for odd-even sort (an efficient parallelization of bubble sort) is 40 seconds.
• The speedup would appear to be 150/40 = 3.75.
• But is this really a fair assessment of the system?
• What if serial quicksort only took 30 seconds? In this case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system.
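The two speedup figures follow directly from the definition; a one-line check in Python:

```python
parallel_time = 40.0           # odd-even sort on the parallel machine (seconds)
print(150.0 / parallel_time)   # 3.75 -- speedup vs. serial bubble sort
print(30.0 / parallel_time)    # 0.75 -- speedup vs. serial quicksort (the fair baseline)
```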
Performance Metrics: Speedup Bounds
• Speedup can be as low as 0 (the parallel program never terminates).
• Speedup can, in theory, never exceed the number of processing elements, p.
• A speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem.
• In this case, a single processor could be time-sliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.
Performance Metrics: Super-linear Speedups
• The phenomenon when the speedup becomes greater than p is known as superlinear speedup.
• One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm.
Performance Metrics: Super-linear Speedups
Resource-based super-linearity:
• The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore super-linearity.
Example:
• A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory.
• If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!
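The 2.43 figure can be reproduced from average memory-access times; the sketch below assumes each processor's computation rate is inversely proportional to its average access time and that both processors work concurrently:

```python
# Single processor: 80% cache hits (2 ns), 20% DRAM (100 ns)
t1 = 0.80 * 2 + 0.20 * 100                 # 21.6 ns average access time

# Two processors: 90% cache hits, 8% local DRAM, 2% remote memory (400 ns)
t2 = 0.90 * 2 + 0.08 * 100 + 0.02 * 400    # 17.8 ns average access time

# Two processors, each running t1/t2 times faster per access than the single one
speedup = 2 * t1 / t2
print(round(speedup, 2))                   # ~2.43 -> super-linear (greater than p = 2)
```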
Performance Metrics: Efficiency
• Efficiency is a measure of the fraction of time for which a processing element is usefully employed.
• Mathematically, it is given by E = S / p = TS / (p TP).
• Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.
Performance Metrics: Efficiency Example
• The speedup of adding n numbers on n processing elements is given by S = Θ(n / log n).
• Efficiency is given by E = S / p = Θ(n / log n) / n = Θ(1 / log n).
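Plugging in numbers (ignoring constant factors) shows how the efficiency of this formulation decays as 1 / log n; a quick Python illustration:

```python
import math

for n in (16, 256, 4096, 65536):
    speedup = n / math.log2(n)       # S = Θ(n / log n) with p = n processing elements
    efficiency = speedup / n         # E = S / p = Θ(1 / log n)
    print(f"n = {n:6d}   S ≈ {speedup:8.1f}   E ≈ {efficiency:.3f}")
```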
Cost of a Parallel System
Cost is the product of parallel runtime and the number of processing
elements used (p x TP ).
Cost reflects the sum of the time that each processing element spends
solving the problem.
A parallel system is said to be cost-optimal if the cost of solving a
problem on a parallel computer is asymptotically identical to serial cost.
Since E = TS / p TP, for cost optimal systems, E = O(1).
Cost is sometimes referred to as work or processor-time product
Cost of a Parallel System: Example
Consider the problem of adding n numbers on n processing elements.
We have, TP = log n (for p = n).
The cost of this system is given by p TP = n log n.
Since the serial runtime of this operation is Θ(n), the algorithm is not
cost optimal.
Impact of Non-Cost Optimality
Consider a sorting algorithm that uses n processing elements to sort the list in time (log n)^2.
Since the serial runtime of a (comparison-based) sort
is n log n, the speedup and efficiency of this
algorithm are given by n / log n and 1 / log n,
respectively.
The p TP product of this algorithm is n (log n)^2.
This algorithm is not cost optimal but only by a factor
of log n.
Impact of Non-Cost Optimality
If p < n, assigning n tasks to p processors gives TP = n (log n)^2 / p.
The corresponding speedup of this formulation is p / log n.
This speedup goes down as the problem size n is increased for a given p!
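A quick numeric illustration of this degradation, assuming a fixed p = 64:

```python
import math

p = 64
for n in (2**10, 2**16, 2**22):
    speedup = p / math.log2(n)       # S = p / log n for this formulation
    print(f"n = {n:8d}   S ≈ {speedup:.2f}")
# The speedup shrinks (6.40 -> 4.00 -> 2.91) even though p is unchanged.
```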
Scalability of Parallel Systems
CS621 Parallel and Distributed Computing

Objectives
• The effect of Granularity on Performance
• Introduction to Scalability of Parallel Systems
Effect of Granularity on Performance
• Often, using fewer processors improves the performance of parallel systems.
• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
• A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled-down processors.
Effect of Granularity on Performance
• Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.
• The communication cost should not increase by this factor since some of the virtual processors assigned to a physical processor might talk to each other.
• This is the basic reason for the improvement from building granularity.
Scalability of Parallel Systems
Can we build granularity in the previous example in a cost-optimal fashion?
• Each processing element locally adds its n / p numbers in time Θ(n / p).
• The p partial sums on p processing elements can be added in time Θ(log p).
A cost-optimal way of computing the sum of 16 numbers using four processing elements.
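A minimal Python sketch of this two-phase scheme for n = 16 and p = 4 (serial loops stand in for what the p processing elements would do concurrently):

```python
def cost_optimal_sum(numbers, p):
    """Phase 1: each of the p processing elements adds its n/p numbers locally.
    Phase 2: the p partial sums are combined with a log p tree reduction."""
    n = len(numbers)
    chunk = n // p
    partial = [sum(numbers[i*chunk:(i+1)*chunk]) for i in range(p)]   # Θ(n/p) work each
    step = 1
    while step < p:                                                   # Θ(log p) steps
        for i in range(0, p, 2 * step):
            partial[i] += partial[i + step]
        step *= 2
    return partial[0]

print(cost_optimal_sum(list(range(16)), p=4))    # 120
```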
Scaling Characteristics of Parallel Programs
The efficiency of a parallel program can be written as:
E = S / p = TS / (p TP)    or    E = 1 / (1 + To / TS)
The total overhead function To is an increasing function of p.
Scaling Characteristics of Parallel Programs: Example
• Consider the problem of adding n numbers on p processing elements.
• We have seen that: TP = n/p + 2 log p, so S = n / (n/p + 2 log p) and E = 1 / (1 + (2p log p) / n).
• These expressions can be used to calculate the speedup and efficiency for any pair of n and p, as in the sketch below.
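A short Python sketch of these expressions (unit-cost additions and word transfers assumed):

```python
import math

def metrics(n, p):
    """Speedup and efficiency for adding n numbers on p processing elements,
    using the model TP = n/p + 2 log2(p) from the slide above."""
    tp = n / p + 2 * math.log2(p)
    s = n / tp                        # TS = n (unit-cost additions)
    return s, s / p

for n in (64, 512, 4096):
    for p in (1, 4, 16, 64):
        s, e = metrics(n, p)
        print(f"n = {n:5d}  p = {p:3d}   S = {s:7.1f}   E = {e:.2f}")
# For small n the speedup saturates as p grows; larger n sustains higher efficiency.
```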
Scaling Characteristics of Parallel
Programs: Example (cont.)
• Plotting the speedup for various input
sizes gives us:
• Speedup versus the number of
processing elements for adding a list of
numbers.
• Speedup tends to saturate and
efficiency drops as a consequence of
Amdahl's law
Scaling Characteristics of Parallel Programs
• The total overhead function To is a function of both the problem size TS and the number of processing elements p.
• In many cases, To grows sub-linearly with respect to TS.
• In such cases, the efficiency increases if the problem size is increased keeping the number of processing elements constant.
• For such systems, we can simultaneously increase the problem size and number of processors to keep efficiency constant.
• We call such systems scalable parallel systems.
Scaling Characteristics of Parallel Programs
• Recall that cost-optimal parallel systems have an efficiency of Θ(1).
• Scalability and cost-optimality are therefore related.
• A scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately.
Isoefficiency Metric of Scalability
CS621 Parallel and Distributed Computing

Objectives
• Isoefficiency Metric
• Isoefficiency Metric Example
Isoefficiency Metric of Scalability
• For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down for all systems.
• For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.
Isoefficiency Metric of Scalability
Variation of efficiency: (a) as the
number of processing elements is
increased for a given problem size;
and (b) as the problem size is
increased for a given number of
processing elements.
The phenomenon illustrated in graph
(b) is not common to all parallel
systems.
Isoefficiency Metric of Scalability
• What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed?
• This rate determines the scalability of the system. The slower this rate, the better.
• Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem.
Isoefficiency Metric of Scalability
• We can write the parallel runtime as: TP = (W + To(W, p)) / p
• The resulting expression for speedup is: S = W / TP = W p / (W + To(W, p))
Isoefficiency Metric of Scalability
• Finally, we write the expression for efficiency as: E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W)
Isoefficiency Metric of Scalability
• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To / W is maintained at a constant value.
• For a desired value E of efficiency, E = 1 / (1 + To(W, p) / W), so To(W, p) / W = (1 - E) / E and W = (E / (1 - E)) To(W, p).
• If K = E / (1 - E) is a constant depending on the efficiency to be maintained, since To is a function of W and p, we have: W = K To(W, p).
Isoefficiency Metric of Scalability
• The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant.
• This function is called the isoefficiency function.
• This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.
Isoefficiency Metric: Example
• The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p.
• Substituting To by 2p log p in W = K To(W, p), we get W = K 2p log p.
• Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).
• If the number of processing elements is increased from p to p', the problem size (in this case, n) must be increased by a factor of (p' log p') / (p log p) to get the same efficiency as on p processing elements.
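A numeric illustration of this isoefficiency relation: if W is grown as K · 2p log p, the efficiency E = 1 / (1 + To / W) stays pinned at K / (K + 1). The sketch below assumes the same To ≈ 2p log p model:

```python
import math

K = 4                                    # K = E / (1 - E)  ->  target efficiency E = 0.8
for p in (2, 8, 32, 128):
    To = 2 * p * math.log2(p)            # overhead of the addition example, To ~ 2 p log p
    W = K * To                           # grow the problem size per the isoefficiency function
    E = 1 / (1 + To / W)
    print(f"p = {p:4d}   W = {W:10.1f}   E = {E:.2f}")   # E stays at 0.80
```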
Isoefficiency Metric: Example
Consider a more complex example where To = p^(3/2) + p^(3/4) W^(3/4):
• Using only the first term of To in W = K To(W, p), we get W = K p^(3/2).
• Using only the second term, the equation yields the following relation between W and p: W = K p^(3/4) W^(3/4), i.e., W = K^4 p^3.
• The larger of these two asymptotic rates determines the isoefficiency. This is given by Θ(p^3).
Isoefficiency Function and Performance Metrics
CS621 Parallel and Distributed Computing

Objectives
• Lower Bound and the Isoefficiency Function
• Degree of Concurrency and the Isoefficiency Function
• Cost-Optimality and the Isoefficiency Function
Cost-Optimality and the Isoefficiency Function
• A parallel system is cost-optimal if and only if: p TP = Θ(W)
• From this, we have:
  W + To(W, p) = Θ(W)
  To(W, p) = O(W)
  W = Ω(To(W, p))
• If we have an isoefficiency function f(p), then it follows that the relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up.
Lower Bound on the Isoefficiency Function
• For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.
• The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function.
Degree of Concurrency and the Isoefficiency Function
• The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.
• If C(W) is the degree of concurrency of a parallel algorithm, then for a problem of size W, no more than C(W) processing elements can be employed effectively.
Degree of Concurrency and the Isoefficiency Function: Example
• Consider solving a system of n equations in n variables by using Gaussian elimination (W = Θ(n^3)).
• The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n^2) computations.
• At most Θ(n^2) processing elements can be kept busy at any time.
• Since W = Θ(n^3) for this problem, the degree of concurrency C(W) is Θ(W^(2/3)).
• Given p processing elements, the problem size should be at least Ω(p^(3/2)) to use them all.
Minimum Execution Time and
Minimum Cost-Optimal Execution
Time
Often, we are interested in the minimum time to solution.
We can determine the minimum parallel runtime TPmin for a given W by differentiating the expression for TP w.r.t. p and equating it to zero:
dTP / dp = 0
If p0 is the value of p as determined by this equation, TP(p0) is the minimum parallel time.
Minimum Execution Time: Example
Consider the minimum execution time for adding n numbers:
TP = n/p + 2 log p
Setting the derivative w.r.t. p to zero, we have p = n/2.
(One may verify that this is indeed a minimum by verifying that the second derivative is positive.)
Note that at this point, the formulation is not cost-optimal.
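A brute-force check of this result for n = 1024, assuming log denotes the natural logarithm so the calculus above gives exactly p = n/2 (with log base 2 the constant shifts slightly, but the Θ(log n) conclusion is unchanged):

```python
import math

n = 1024
# TP(p) = n/p + 2 ln p, so dTP/dp = -n/p^2 + 2/p, which vanishes at p = n/2.
best_p, best_tp = min(
    ((p, n / p + 2 * math.log(p)) for p in range(1, n + 1)),
    key=lambda pair: pair[1],
)
print(best_p, round(best_tp, 2))   # 512 (= n/2), TP = 2 + 2 ln(n/2) ~ 14.48 = Theta(log n)
```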
Minimum Cost-Optimal Parallel Time
• Let TPcost_opt be the minimum cost-optimal parallel time.
• If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally if and only if W = Ω(f(p)).
• In other words, for cost optimality, p = O(f^-1(W)).
• For cost-optimal systems, TP = Θ(W/p), therefore, TPcost_opt = Ω(W / f^-1(W)).
Minimum Cost-Optimal Parallel Time: Example
Consider the problem of adding n numbers.
• The isoefficiency function f(p) of this parallel system is Θ(p log p).
• From this, we have p ≈ n / log n.
• At this processor count, the parallel runtime is: TP = n/p + 2 log p ≈ log n + 2 log (n / log n) = 3 log n - 2 log log n = Θ(log n)
• Note that both TPmin and TPcost_opt for adding n numbers are Θ(log n). This may not always be the case.
Asymptotic Analysis of Parallel
Programs
Consider the problem of sorting a list of n
numbers. The fastest serial programs for this
problem run in time Θ(n log n). Consider four
parallel algorithms, A1, A2, A3, and A4.
• Comparison of four different algorithms for
sorting a given list of numbers. The table shows
number of processing elements, parallel
runtime, speedup, efficiency and the pTP
product.
Asymptotic Analysis of Parallel
Programs
If the metric is speed, algorithm A1 is the best, followed by A3, A4, and A2 (in order of
increasing TP).
In terms of efficiency, A2 and A4 are the best, followed by A3 and A1.
In terms of cost, algorithms A2 and A4 are cost optimal, A1 and A3 are not.
It is important to identify the objectives of analysis and to use appropriate metrics!
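The table itself is not reproduced here, but the metrics can be recomputed for four hypothetical sorting algorithms; the (p, TP) values below are assumptions chosen only to be consistent with the comparisons above, not figures taken from the slide:

```python
import math

n = 1024                                # illustrative input size
ts = n * math.log2(n)                   # best serial sort: Θ(n log n) = 10240 "steps"

# Hypothetical (p, TP) pairs, assumed for illustration only.
algos = {
    "A1": (n ** 2,          1.0),
    "A2": (math.log2(n),    float(n)),
    "A3": (float(n),        math.sqrt(n)),
    "A4": (math.sqrt(n),    math.sqrt(n) * math.log2(n)),
}
for name, (p, tp) in algos.items():
    s = ts / tp                         # speedup over the best serial sort
    e = s / p                           # efficiency
    print(f"{name}: p = {p:9.0f}  TP = {tp:7.1f}  S = {s:8.1f}  E = {e:.3f}  pTP = {p * tp:9.0f}")
# A1 is fastest (smallest TP); A2 and A4 have E = 1 and pTP = n log n (cost-optimal);
# A1 and A3 are not cost-optimal.
```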