Parallel Computing Performance Metrics

Dr. Muhammad Anwaar Saeed
Dr. Said Nabi
Ms. Hina Ishaq

CS621 Parallel and Distributed Computing
Performance and Scalability

CS621 Parallel and Distributed Computing

Objectives
What is performance?
Introduction to Analytical Modeling

What is Performance?

“Computation performance is a measure of how well a computer system can execute a given set of instructions. It can be measured in terms of execution time, overhead, speedup, and cost, among others.”

Analytical Modeling - Basics

A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size).

The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine.

Analytical Modeling - Basics

The asymptotic runtime of a sequential program is identical on any serial platform.

The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine.

An algorithm must therefore be analyzed in the context of the underlying platform.

A number of performance measures are intuitive:

Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed or the program is ported to another machine altogether?

How much faster is the parallel version? This begs the obvious follow-up question: what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look better?

Raw FLOP count - What good are FLOP counts when they don't solve a problem?

Sources of Overhead in Parallel Programs

CS621 Parallel and Distributed Computing

Objectives
Introduction to Execution Overhead
Sources of Overhead in Parallel Programs

Sources of Overhead in Parallel Programs

If I use two processors, shouldn't my program run twice as fast?

No - a number of overheads, including wasted computation, communication, idling, and contention, cause degradation in performance.

Sources of Overhead in Parallel Programs

• The execution profile of a hypothetical parallel program executing on eight processing elements.
• The profile indicates times spent performing computation (both essential and excess), communication, and idling.

Sources of Overhead in Parallel Programs

Inter-process interactions:
Processors working on any non-trivial parallel problem will need to talk to each other.

Idling:
Processes may idle because of load imbalance, synchronization, or serial components.

Sources of Overhead in Parallel Programs

Excess Computation:
• The difference in computation performed by the parallel program and the best serial program is the excess computation overhead incurred by the parallel program.
• This is computation not performed by the serial version.
• This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication.

Performance Metrics for Parallel Systems

CS621 Parallel and Distributed Computing

Objectives
Serial vs Parallel Performance
Performance Metrics

Performance Metrics for Parallel Systems

• It is important to study the performance of parallel programs with a view to determining the best algorithm, evaluating hardware platforms, and examining the benefits from parallelism.
• A number of metrics have been used based on the desired outcome of performance analysis.

Performance Metrics for Parallel Systems: Execution Time

Serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer.

The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution.

We denote the serial runtime by TS and the parallel runtime by TP.


Performance Metrics: Total Parallel Overhead / Overhead Function

The total time collectively spent by all the processing elements over and above that required by the fastest known sequential algorithm for solving the same problem on a single processing element.

Let Tall be the total time collectively spent by all the processing elements and TS be the serial time.

Tall - TS is then the total time spent by all processors combined on non-useful work. This is called the total overhead.

Performance Metrics: Total Parallel Overhead / Overhead Function

The total time collectively spent by all the processing elements is Tall = p TP (p is the number of processors).

The overhead function (To) is therefore given by

To = p TP - TS
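
As a quick illustration, the sketch below computes the overhead function from measured serial and parallel runtimes (the timings are hypothetical, chosen only for this example):

def total_overhead(t_serial, t_parallel, p):
    """Overhead function To = p*TP - TS (time units are arbitrary)."""
    return p * t_parallel - t_serial

# Hypothetical example: TS = 100 s, TP = 30 s on p = 4 processors.
ts, tp, p = 100.0, 30.0, 4
print(total_overhead(ts, tp, p))   # 4*30 - 100 = 20 s of non-useful work
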
Performance Metrics for Parallel Systems: Speedup

CS621 Parallel and Distributed Computing

Objectives
Introduction to Speedup
Speedup Example

Performance Metrics for Parallel Systems: Speedup

“Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements.”

That is, S = TS / TP.

Performance Metrics: Example

Consider the problem of adding n numbers by using n processing elements.

If n is a power of two, we can perform this operation in log n steps by propagating partial sums up a logical binary tree of processors.

Performance Metrics: Example Cont..

• This figure illustrates the procedure for n = 16.
• The processing elements are labeled from 0 to 15.
• Similarly, the 16 numbers to be added are labeled from 0 to 15.
• The sum of the numbers with consecutive labels from i to j is denoted by Σ_i^j.
• Each step shown in the figure consists of one addition and the communication of a single word.

Performance Metrics: Example Cont..

If an addition takes constant time, say tc, and communication of a single word takes time ts + tw, we have the parallel time TP = Θ(log n).

We know that TS = Θ(n).

Speedup S is given by S = Θ(n / log n).
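
A minimal sketch of this logarithmic reduction, simulating the n processing elements sequentially (assuming n is a power of two); each round halves the number of active elements, so there are log2 n addition-and-communication steps:

import math

def tree_sum(values):
    """Simulate the binary-tree reduction: PE i receives from PE i+stride."""
    vals = list(values)
    n = len(vals)
    steps, stride = 0, 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            vals[i] += vals[i + stride]   # one addition + one word communicated
        stride *= 2
        steps += 1
    return vals[0], steps

total, steps = tree_sum(range(16))
print(total, steps)                       # 120, 4 steps (= log2 16)
assert steps == int(math.log2(16))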


Performance Metrics: Speedup

For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees.

For the purpose of computing speedup, we always consider the best sequential program as the baseline.

Performance Metrics: Speedup Example

• Consider the problem of parallel bubble sort.
• The serial time for bubble sort is 150 seconds.
• The parallel time for odd-even sort (an efficient parallelization of bubble sort) is 40 seconds.
• The speedup would appear to be 150/40 = 3.75.
• But is this really a fair assessment of the system?
• What if serial quicksort only took 30 seconds? In this case, the speedup is 30/40 = 0.75. This is a more realistic assessment of the system.

Performance Metrics: Speedup Bounds

Speedup can be as low as 0 (the parallel program never terminates).

Speedup can never exceed the number of processing elements, p.

A speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem.

In this case, a single processor could be time-sliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.

Performance Metrics: Super-linear Speedups

• The phenomenon where the speedup becomes greater than p is known as superlinear speedup.
• One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm.

Performance Metrics: Super-linear Speedups

Resource-based super-linearity:
The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore super-linearity.

Example:
A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, since the problem size per processor is smaller, the hit ratio goes up to 90%. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory.

If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!
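
A back-of-the-envelope check of the 2.43 figure; this sketch assumes the execution rate is inversely proportional to the average memory access time:

# Single processor: 80% cache hits (2 ns), 20% DRAM (100 ns).
t1 = 0.80 * 2 + 0.20 * 100                 # 21.6 ns average access time

# Two processors: 90% cache hits, 8% local DRAM, 2% remote memory (400 ns).
t2 = 0.90 * 2 + 0.08 * 100 + 0.02 * 400    # 17.8 ns average access time

# Two processors, each running at a rate proportional to 1/t2:
speedup = 2 * (t1 / t2)
print(round(speedup, 2))                   # 2.43
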
Performance Metrics: Efficiency

Efficiency is a measure of the fraction of time for which a processing element is usefully employed.

Mathematically, it is given by E = S / p = TS / (p TP).

Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.

Performance Metrics: Efficiency Example

• The speedup of adding n numbers on n processing elements is given by S = Θ(n / log n).
• Efficiency is given by E = S / n = Θ(1 / log n).

Cost of a Parallel System

Cost is the product of parallel runtime and the number of processing elements used (p x TP).

Cost reflects the sum of the time that each processing element spends solving the problem.

A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost.

Since E = TS / (p TP), for cost-optimal systems, E = Θ(1).

Cost is sometimes referred to as work or processor-time product.


Cost of a Parallel System: Example

Consider the problem of adding n numbers on n processors.

We have TP = Θ(log n) (for p = n).

The cost of this system is given by p TP = n log n.

Since the serial runtime of this operation is Θ(n), the algorithm is not
cost optimal.
Impact of Non-Cost Optimality

Consider a sorting algorithm that uses n processing elements to sort the list in time (log n)^2.

Since the serial runtime of a (comparison-based) sort is n log n, the speedup and efficiency of this algorithm are given by n / log n and 1 / log n, respectively.

The p TP product of this algorithm is n (log n)^2.

This algorithm is not cost optimal, but only by a factor of log n.

Impact of Non-Cost Optimality

If p < n, assigning n tasks to p processors gives TP = n (log n)^2 / p.

The corresponding speedup of this formulation is p / log n.

This speedup goes down as the problem size n is increased for a given p!

Scalability of Parallel Systems

CS621 Parallel and Distributed Computing

Objectives
The Effect of Granularity on Performance
Introduction to Scalability of Parallel Systems
Scaling Characteristics

Effect of Granularity on Performance

Often, using fewer processors improves performance of parallel systems.

Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.

A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled-down processors.

Effect of Granularity on Performance

Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.

The communication cost should not increase by this factor, since some of the virtual processors assigned to a physical processor might talk to each other.

This is the basic reason for the improvement from building granularity.

Scalability of Parallel Systems

Can we build granularity in the previous example in a cost-optimal fashion?

• Each processing element locally adds its n / p numbers in time Θ(n / p).
• The p partial sums on p processing elements can be added in time Θ(log p).

A cost-optimal way of computing the sum of 16 numbers using four processing elements.
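
A sketch of this scaled-down scheme for n = 16 and p = 4 (simulated sequentially in ordinary Python): each processing element first adds its own n/p numbers, and the p partial sums are then combined with the same tree reduction as before.

def scaled_parallel_sum(values, p):
    """Each of the p PEs adds n/p numbers locally, then a log p tree reduce."""
    n = len(values)
    chunk = n // p
    # Θ(n/p) local work per processing element.
    partial = [sum(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
    stride = 1
    while stride < p:                      # Θ(log p) combining steps
        for i in range(0, p, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
    return partial[0]

print(scaled_parallel_sum(list(range(16)), 4))   # 120
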
Scaling Characteristics of Parallel Programs

The efficiency of a parallel program can be written as:

E = S / p = TS / (p TP)

or, equivalently,

E = 1 / (1 + To / TS)

The total overhead function To is an increasing function of p.

Scaling Characteristics of Parallel Programs: Example

• Consider the problem of adding n numbers on p processing elements.
• We have seen that:

TP = n/p + 2 log p
S = n / (n/p + 2 log p)
E = 1 / (1 + (2 p log p) / n)

• These expressions can be used to calculate the speedup and efficiency for any pair of n and p.

Scaling Characteristics of Parallel Programs: Example (cont.)

• Plotting the speedup for various input sizes gives us: speedup versus the number of processing elements for adding a list of numbers.
• Speedup tends to saturate and efficiency drops as a consequence of Amdahl's law.
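
Using the expressions above, a short sketch that tabulates S = n / (n/p + 2 log p) makes the saturation visible: for a fixed n, adding processors eventually buys almost nothing (log taken base 2 here).

import math

def speedup(n, p):
    """S = TS / TP with TS = n and TP = n/p + 2*log2(p) (unit-cost model)."""
    tp = n / p + 2 * math.log2(p)
    return n / tp

for n in (64, 512, 4096):
    print(n, [round(speedup(n, p), 1) for p in (1, 4, 16, 64)])
# Larger n keeps the speedup curve from flattening as quickly.
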
Scaling Characteristics of Parallel Programs

The total overhead function To is a function of both the problem size TS and the number of processing elements p.

In many cases, To grows sub-linearly with respect to TS.

In such cases, the efficiency increases if the problem size is increased while keeping the number of processing elements constant.

For such systems, we can simultaneously increase the problem size and the number of processors to keep efficiency constant.

We call such systems scalable parallel systems.

Recall that cost-optimal parallel systems have an efficiency of Θ(1).

Scaling Characteristics of Parallel Programs

Scalability and cost-optimality are therefore related.

A scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately.

Isoefficiency Metric of Scalability

CS621 Parallel and Distributed Computing

Objectives
Isoefficiency Metric
Isoefficiency Metric Example

Isoefficiency Metric of Scalability

For a given problem size, as we increase the number of processing elements, the overall efficiency of the parallel system goes down for all systems.

For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.

Isoefficiency Metric of Scalability

Variation of efficiency: (a) as the number of processing elements is increased for a given problem size; and (b) as the problem size is increased for a given number of processing elements.

The phenomenon illustrated in graph (b) is not common to all parallel systems.

Isoefficiency Metric of Scalability

What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed?

This rate determines the scalability of the system. The slower this rate, the better.

Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem.

Isoefficiency Metric of Scalability

• We can write parallel runtime as:

TP = (W + To(W, p)) / p

• The resulting expression for speedup is:

S = W / TP = W p / (W + To(W, p))

Isoefficiency Metric of Scalability

• Finally, we write the expression for efficiency as:

E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p) / W)

Isoefficiency Metric of Scalability

• For scalable parallel systems, efficiency can be maintained at a fixed value (between 0 and 1) if the ratio To / W is maintained at a constant value.
• For a desired value E of efficiency,

E = 1 / (1 + To(W, p) / W), i.e., To(W, p) / W = (1 - E) / E, i.e., W = (E / (1 - E)) To(W, p)

• If K = E / (1 - E) is a constant depending on the efficiency to be maintained, since To is a function of W and p, we have:

W = K To(W, p)

Isoefficiency Metric of Scalability

The problem size W can usually be obtained as a function of p by algebraic manipulations to keep efficiency constant.

This function is called the isoefficiency function.

This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

Isoefficiency Metric: Example

The overhead function for the problem of adding n numbers on p processing elements is approximately 2p log p.

Substituting To by 2p log p, we get W = K 2p log p.

Thus, the asymptotic isoefficiency function for this parallel system is Θ(p log p).

If the number of processing elements is increased from p to p’, the problem size (in this case, n) must be increased by a factor of (p’ log p’) / (p log p) to get the same efficiency as on p processing elements.
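
A small sketch of this scaling rule: given the isoefficiency function W = Θ(p log p), it computes the factor by which the problem size must grow when moving from p to p' processing elements (the processor counts below are just illustrative).

import math

def growth_factor(p, p_new):
    """Factor (p' log p') / (p log p) by which W must grow to hold efficiency."""
    return (p_new * math.log2(p_new)) / (p * math.log2(p))

print(round(growth_factor(8, 16), 2))   # going from 8 to 16 PEs: grow n by ~2.67x
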
Isoefficiency Metric: Example

Consider a more complex example where the overhead function is To = p^(3/2) + p^(3/4) W^(3/4).

• Using only the first term of To in the equation W = K To, we get W = K p^(3/2).
• Using only the second term, the equation yields the following relation between W and p: W = K p^(3/4) W^(3/4), i.e., W^(1/4) = K p^(3/4), i.e., W = K^4 p^3.
• The larger of these two asymptotic rates determines the isoefficiency. This is given by Θ(p^3).

Isoefficiency Function and Performance Metrics

CS621 Parallel and Distributed Computing

Objectives
Lower Bound and the Isoefficiency Function
Degree of Concurrency and the Isoefficiency Function
Cost-Optimality and the Isoefficiency Function

Cost-Optimality and the Isoefficiency Function

A parallel system is cost-optimal if and only if: p TP = Θ(W)

From this, we have:

W + To(W, p) = Θ(W)
To(W, p) = O(W)
W = Ω(To(W, p))

If we have an isoefficiency function f(p), then it follows that the relation W = Ω(f(p)) must be satisfied to ensure the cost-optimality of a parallel system as it is scaled up.

Lower Bound on the Isoefficiency Function

For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.

The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function.

Degree of Concurrency and the Isoefficiency Function

The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.

If C(W) is the degree of concurrency of a parallel algorithm, then for a problem of size W, no more than C(W) processing elements can be employed effectively.

Degree of Concurrency and the Isoefficiency Function: Example

Consider solving a system of n equations in n variables by using Gaussian elimination (W = Θ(n^3)).

The n variables must be eliminated one after the other, and eliminating each variable requires Θ(n^2) computations.

At most Θ(n^2) processing elements can be kept busy at any time.

Since W = Θ(n^3) for this problem, the degree of concurrency C(W) is Θ(W^(2/3)).

Given p processing elements, the problem size should be at least Ω(p^(3/2)) to use them all.

Minimum Execution Time and Minimum Cost-Optimal Execution Time

Often, we are interested in the minimum time to solution.

We can determine the minimum parallel runtime TPmin for a given W by differentiating the expression for TP w.r.t. p and equating it to zero:

d TP / d p = 0

If p0 is the value of p as determined by this equation, TP(p0) is the minimum parallel time.

Minimum Execution Time: Example

Consider the minimum execution time for adding n numbers:

TP = n/p + 2 log p

Setting the derivative w.r.t. p to zero, we have p = n/2. The corresponding minimum runtime is TPmin ≈ 2 log n.

(One may verify that this is indeed a minimum by verifying that the second derivative is positive.)

Note that at this point, the formulation is not cost-optimal.
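
A quick numeric check of this result; the sketch below simply scans integer values of p for one fixed n, treating "log" as the natural logarithm so that the derivative step above (d/dp of 2 log p equal to 2/p) applies exactly.

import math

n = 1024

def tp(p):
    return n / p + 2 * math.log(p)

best_p = min(range(1, n + 1), key=tp)
print(best_p == n // 2)        # True: the minimum is at p = n/2
print(round(tp(best_p), 2))    # 14.48, close to 2*ln(1024) ~ 13.86, i.e. TPmin = Θ(log n)
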
Minimum Cost-Optimal Parallel Time

• Let TPcost_opt be the minimum cost-optimal parallel time.
• If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally if and only if W = Ω(f(p)).
• In other words, for cost optimality, p = O(f^-1(W)).
• For cost-optimal systems, TP = Θ(W/p); therefore,

TPcost_opt = Ω(W / f^-1(W))

Minimum Cost-Optimal Parallel Time: Example

Consider the problem of adding n numbers.
• The isoefficiency function f(p) of this parallel system is Θ(p log p).
• From this, we have p ≈ n / log n.
• At this processor count, the parallel runtime is:

TPcost_opt = log n + 2 log(n / log n) = 3 log n - 2 log log n = Θ(log n)

• Note that both TPmin and TPcost_opt for adding n numbers are Θ(log n). This may not always be the case.
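
A sketch checking numerically that at p ≈ n / log n the runtime n/p + 2 log p grows like Θ(log n) (again treating log as natural and rounding p to an integer; the chosen values of n are arbitrary):

import math

def tp_cost_opt(n):
    p = max(1, round(n / math.log(n)))    # p ~ n / log n processing elements
    return n / p + 2 * math.log(p)

for n in (2**10, 2**15, 2**20):
    print(n, round(tp_cost_opt(n), 1), round(math.log(n), 1))
# The middle column stays within a small constant multiple of log n.
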
Asymptotic Analysis of Parallel Programs

Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4.

• Comparison of four different algorithms for sorting a given list of numbers. The table shows the number of processing elements, the parallel runtime, the speedup, the efficiency, and the pTP product.

Algorithm   p         TP             S              E                pTP
A1          n^2       1              n log n        (log n)/n        n^2
A2          log n     n              log n          1                n log n
A3          n         sqrt(n)        sqrt(n) log n  (log n)/sqrt(n)  n^1.5
A4          sqrt(n)   sqrt(n) log n  sqrt(n)        1                n log n

Asymptotic Analysis of Parallel
Programs
If the metric is speed, algorithm A1 is the best, followed by A3, A4, and A2 (in order of
increasing TP).

In terms of efficiency, A2 and A4 are the best, followed by A3 and A1.

In terms of cost, algorithms A2 and A4 are cost optimal, A1 and A3 are not.

It is important to identify the objectives of analysis and to use appropriate metrics!
