Principles of Scalable Performance
• Performance measures
• Speedup laws
• Scalability principles
• Scaling up vs. scaling down
Performance metrics and measures
• Parallelism profiles
• Asymptotic speedup factor
• System efficiency, utilization and quality
• Standard performance measures
Parallelism Profiles in Programs
• The degree of parallelism reflects the extent to
which software parallelism matches hardware
parallelism
Degree of parallelism
• A program executing on a parallel computer may use
different numbers of processors at different time
periods during its execution cycle
• The number of processors used to execute the
program during each period is the degree of
parallelism (DOP)
• The DOP is a discrete time function, taking only
non-negative integer values
Degree of parallelism
• The parallelism profile is a plot of the DOP as a
function of time
• It assumes ideally unlimited resources
• Software tools are available to trace the
parallelism profile
Factors affecting parallelism profiles
• Algorithm structure
• Program optimization
• Resource utilization
• Run-time conditions
Degree of parallelism
• The DOP assumes an unbounded number of
available processors and other necessary
resources
• The maximum DOP may therefore not be achievable
on a real computer with limited resources
• When the DOP exceeds the maximum number of
available processors, the parallel branches are
executed in chunks sequentially
Degree of parallelism
• Parallelism still exists within each chunk, limited
by the machine size
• It is further limited by memory & other
non-processor resources
Average parallelism variables
• n – number of homogeneous processors
• m – maximum parallelism in a profile
• Δ – computing capacity of a single processor
(execution rate only, no overhead)
• DOP = i – i processors are busy during an observation
period
Average parallelism
• Total amount of work performed is proportional to
the area under the profile curve
W = Δ ∫_{t1}^{t2} DOP(t) dt

W = Δ Σ_{i=1}^{m} i·t_i

• t_i – total amount of time that DOP = i
• t2 − t1 – total elapsed time
Average parallelism
A = (1/(t2 − t1)) ∫_{t1}^{t2} DOP(t) dt

A = (Σ_{i=1}^{m} i·t_i) / (Σ_{i=1}^{m} t_i)
Example: parallelism profile and average
parallelism
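The profile figure on this slide is not reproduced here. As a sketch of the computation it illustrates, the following uses a hypothetical profile (the DOP values, durations, and Δ = 1 are invented for illustration):

```python
# Hypothetical parallelism profile: maps DOP value i -> total time t_i
# spent executing at that DOP. Delta (computing capacity of a single
# processor) is taken as 1 unit of work per unit time.
profile = {1: 2.0, 2: 3.0, 4: 4.0, 8: 1.0}
DELTA = 1.0

W = DELTA * sum(i * t_i for i, t_i in profile.items())    # total work
elapsed = sum(profile.values())                           # t2 - t1
A = sum(i * t_i for i, t_i in profile.items()) / elapsed  # average parallelism

print(W, A)  # 32.0 3.2
```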
Available Parallelism
• Potential parallelism in application programs
• Engineering & scientific codes exhibit a high DOP due
to data parallelism
• Other computations exhibit less parallelism; little
parallelism is found within basic block boundaries
• A basic block is a block of instructions with single entry
and single exit points
• Compiler optimization & algorithm redesign can increase
the available parallelism
Asymptotic speedup
T(1) = Σ_{i=1}^{m} t_i(1) = Σ_{i=1}^{m} W_i/Δ

T(∞) = Σ_{i=1}^{m} t_i(∞) = Σ_{i=1}^{m} W_i/(i·Δ)

S∞ = T(1)/T(∞) = (Σ_{i=1}^{m} W_i) / (Σ_{i=1}^{m} W_i/i) = A in the ideal case

(T(1) and T(∞) are the response times with one processor and with unlimited processors)
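The identity S∞ = A can be sketched numerically with an invented work distribution (the t_i durations below are hypothetical, Δ = 1 assumed):

```python
# Hypothetical per-DOP durations t_i; W_i = i * Delta * t_i is the work
# performed while DOP = i (Delta = 1 assumed).
t = {1: 2.0, 2: 3.0, 4: 4.0, 8: 1.0}
W = {i: i * t_i for i, t_i in t.items()}

T1 = sum(W.values())                          # T(1)   = sum of W_i / Delta
T_inf = sum(W_i / i for i, W_i in W.items())  # T(inf) = sum of W_i / (i*Delta)
S_inf = T1 / T_inf                            # asymptotic speedup

A = sum(i * t_i for i, t_i in t.items()) / sum(t.values())
assert abs(S_inf - A) < 1e-12   # asymptotic speedup equals average parallelism
```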
Performance measures
• Consider n processors executing m programs in
various modes with different performance levels
• Want to define the mean performance of these
multimode computers:
• Arithmetic mean performance
• Geometric mean performance
• Harmonic mean performance
Arithmetic mean performance
R_a = (1/m) Σ_{i=1}^{m} R_i          Arithmetic mean execution rate
                                     (assumes equal weighting)

R_a* = Σ_{i=1}^{m} f_i·R_i           Weighted arithmetic mean
                                     execution rate

– proportional to the sum of the inverses of the
execution times
Geometric mean performance
R_g = Π_{i=1}^{m} R_i^{1/m}          Geometric mean execution rate

R_g* = Π_{i=1}^{m} R_i^{f_i}         Weighted geometric mean
                                     execution rate

– does not summarize the real performance since it does
not have the inverse relation with the total time
Harmonic mean performance
Mean execution time per instruction
T_i = 1/R_i                          For program i

T_a = (1/m) Σ_{i=1}^{m} T_i = (1/m) Σ_{i=1}^{m} 1/R_i

                                     Arithmetic mean execution time
                                     per instruction
Harmonic mean performance
R_h = 1/T_a = m / (Σ_{i=1}^{m} 1/R_i)     Harmonic mean execution rate

R_h* = 1 / (Σ_{i=1}^{m} f_i/R_i)          Weighted harmonic mean
                                          execution rate

– corresponds to the total number of operations divided by
the total time (closest to the real performance)
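A small sketch (with invented execution rates) showing why the harmonic mean is the one that matches the true aggregate rate when each program performs the same amount of work:

```python
import math

# Hypothetical execution rates (e.g. Mflops) of one machine on m = 3
# programs, each assumed to contain one unit of work.
R = [2.0, 4.0, 8.0]
m = len(R)

arithmetic = sum(R) / m                 # R_a
geometric = math.prod(R) ** (1 / m)     # R_g
harmonic = m / sum(1 / r for r in R)    # R_h

# True aggregate rate = total operations / total time; each program
# takes 1/R_i time for its unit of work.
total_time = sum(1 / r for r in R)
true_rate = m / total_time

assert harmonic == true_rate                # harmonic mean matches reality
assert harmonic <= geometric <= arithmetic  # ordering of the three means
```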
Harmonic Mean Speedup
• Ties the various modes of a program to the
number of processors used
• Program is in execution mode i, if i processors
used
S = T_1/T* = 1 / (Σ_{i=1}^{n} f_i/R_i)

• Sequential execution time T_1 = 1/R_1 = 1
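A minimal sketch of the weighted harmonic mean speedup, with an invented mode distribution f over n = 4 execution modes (R_i = i is the ideal case from the slide):

```python
def harmonic_speedup(f, R):
    # S = 1 / (sum of f_i / R_i), with sequential time T_1 = 1/R_1 = 1
    return 1.0 / sum(f_i / R_i for f_i, R_i in zip(f, R))

# Hypothetical fraction of time spent in each mode (mode i uses i
# processors, executing at rate R_i = i in the ideal case).
f = [0.4, 0.3, 0.2, 0.1]
R = [1, 2, 3, 4]
S = harmonic_speedup(f, R)
print(round(S, 3))  # 1.558
```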
Harmonic Mean Speedup Performance
Amdahl’s Law
• Assume R_i = i, w = (α, 0, 0, …, 1−α)
• System is either sequential, with
probability α, or fully parallel, with probability
1−α

S_n = n / (1 + (n − 1)α)

• Implies S_n → 1/α as n → ∞
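The law above can be sketched numerically; the saturation at 1/α becomes visible as n grows (α = 0.1 is chosen for illustration):

```python
def amdahl_speedup(n, alpha):
    # S_n = n / (1 + (n - 1) * alpha); alpha is the probability of the
    # purely sequential mode, 1 - alpha of the fully parallel mode.
    return n / (1 + (n - 1) * alpha)

alpha = 0.1
for n in (1, 10, 100, 10_000):
    print(n, round(amdahl_speedup(n, alpha), 2))
# even 10,000 processors yield only ~9.99, approaching the 1/alpha = 10 bound
```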
Speedup Performance
System Efficiency
• O(n) is the total # of unit operations
• T(n) is execution time in unit time steps
• T(n) < O(n) and T(1) = O(1)
S(n) = T(1)/T(n)

E(n) = S(n)/n = T(1)/(n·T(n))
Redundancy and Utilization
• Redundancy signifies the extent of matching
software and hardware parallelism
R(n) = O(n)/O(1)
• Utilization indicates the percentage of resources
kept busy during execution
U(n) = R(n)·E(n) = O(n)/(n·T(n))
Quality of Parallelism
• Directly proportional to the speedup and efficiency
and inversely related to the redundancy
• Upper-bounded by the speedup S(n)
Q(n) = S(n)·E(n)/R(n) = T³(1)/(n·T²(n)·O(n))
Example of Performance
• Given O(1) = T(1) = n³, O(n) = n³ + n²·log n, and
T(n) = 4n³/(n+3):
S(n) = (n+3)/4
E(n) = (n+3)/(4n)
R(n) = (n + log n)/n
U(n) = (n+3)(n + log n)/(4n²)
Q(n) = (n+3)²/(16(n + log n))
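The closed forms of this example can be checked numerically; a sketch assuming base-2 logarithms (the slide does not specify the log base):

```python
import math

def metrics(n):
    # Worked example from the slides: O(1) = T(1) = n^3,
    # O(n) = n^3 + n^2*log n, T(n) = 4n^3/(n+3).
    lg = math.log2(n)            # log base assumed to be 2
    T1, Tn = n**3, 4 * n**3 / (n + 3)
    O1, On = n**3, n**3 + n**2 * lg
    S = T1 / Tn                  # speedup
    E = S / n                    # efficiency
    R = On / O1                  # redundancy
    U = R * E                    # utilization
    Q = S * E / R                # quality
    return S, E, R, U, Q

# Check against the slide's closed forms at n = 8 (log2(8) = 3):
n, lg = 8, 3
S, E, R, U, Q = metrics(n)
assert math.isclose(S, (n + 3) / 4)
assert math.isclose(E, (n + 3) / (4 * n))
assert math.isclose(R, (n + lg) / n)
assert math.isclose(U, (n + 3) * (n + lg) / (4 * n**2))
assert math.isclose(Q, (n + 3)**2 / (16 * (n + lg)))
```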
Standard Performance Measures
• MIPS and Mflops
• Describe the instruction execution rate & floating-point
capability of a parallel computer
• MIPS = f·Ic / (C × 10⁶), where f is the clock rate, Ic the
instruction count, and C the total cycle count
• The MIPS rating depends on the instruction set and varies
from program to program
• Mflops depends on the machine hardware design and on
program behavior
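A sketch of the MIPS formula above; the clock rate, instruction count, and cycle count below are invented for illustration:

```python
def mips(f_hz, instr_count, cycle_count):
    # MIPS = (f * Ic) / (C * 10^6), per the slide's definition:
    # f in Hz, Ic = instruction count, C = total clock cycles.
    return f_hz * instr_count / (cycle_count * 1e6)

# e.g. a 40 MHz machine executing 200,000 instructions in
# 500,000 cycles (CPI = 2.5):
rate = mips(40e6, 200_000, 500_000)
print(rate)  # 16.0
```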
Standard Performance Measures
• Dhrystone results
• A CPU-intensive benchmark
• Consists of about 100 high-level language statements
& data types
• Balanced with respect to statement type, data
type, and locality of reference, with no operating
system calls and making no use of library
functions or subroutines
• A measure of the integer performance of modern
processors
Standard Performance Measures
• Whetstone results
• A Fortran-based synthetic benchmark
• A measure of floating-point performance
• The benchmark includes both integer & floating-point
operations involving array indexing, subroutine calls,
parameter passing, and conditional branching
Standard Performance Measures
• Performance depends on the compilers used
• Dhrystone is meant to test the CPU
• Compiler techniques such as procedure in-lining can
distort Dhrystone performance
• This sensitivity to compilers is a drawback of such
benchmarks
Standard Performance Measures
• TPS and KLIPS ratings
• On-line transaction processing applications
demand rapid, interactive processing of a
large number of relatively simple transactions
• They are supported by very large databases
• Automated teller machines & airline reservation
systems are examples
• Measured as transaction performance
Standard Performance Measures
• The throughput of computers for on-line transaction
processing is measured in transactions per second (TPS)
• A transaction may involve database search, query
answering, and database update operations
• In AI applications, the measure is KLIPS (kilo logic
inferences per second)
• It indicates the reasoning power of an AI machine
Standard Performance Measures
• Japan's Fifth Generation Computer System project
targeted a performance of 400 KLIPS
• 400 KLIPS = 40 MIPS, implying roughly 100 instructions
per logic inference
• Logic inference demands symbolic manipulation
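The stated equivalence can be sketched as a unit conversion; the 100-instructions-per-inference figure is derived from the slide's 400 KLIPS = 40 MIPS equivalence, not a universal constant:

```python
def klips_to_mips(klips, instructions_per_inference):
    # 1 KLIPS = 1,000 logic inferences per second; converting to MIPS
    # requires assuming an instruction cost per inference.
    return klips * 1_000 * instructions_per_inference / 1e6

# 400 KLIPS = 40 MIPS corresponds to 100 instructions per inference:
print(klips_to_mips(400, 100))  # 40.0
```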