Unit 1.
Performance
ECE369
Defining Performance
• Which airplane has the best performance?
Response Time and Throughput
• Response time
– How long it takes to do a task
• Throughput
– Total work done per unit time
• e.g., tasks/transactions/… per hour
• How are response time and throughput
affected by
– Replacing the processor with a faster version?
– Adding more processors?
• We’ll focus on response time for now…
Relative Performance
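• Define
$$\text{Performance} = \frac{1}{\text{Execution Time}}$$
• “X is n times faster than Y” means
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$$
• Example: time taken to run a program is 10 s on A, 15 s on B
$$\frac{\text{Execution Time}_B}{\text{Execution Time}_A} = \frac{15\,\text{s}}{10\,\text{s}} = 1.5$$
So A is 1.5 times faster than B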
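The ratio above can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_performance` is an illustrative choice, not something from the course material.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is 1 / execution time, so the performance ratio
    Performance_X / Performance_Y equals time_y / time_x.
    """
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# Numbers from the slide: 10 s on A, 15 s on B.
n = relative_performance(time_x=10.0, time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```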
Execution Time
• Elapsed Time
– Total response time, including all aspects of processing,
such as I/O, OS overhead, idle time
– A useful number, but often not good for comparison
purposes
• CPU time
– Time spent processing a given job
• Discounts I/O time, other jobs’ shares
– Can be broken up into system time and user time
• Our focus: user CPU time
– Time spent executing the lines of code that are "in" our
program
Clock Cycles
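• Instead of reporting execution time in seconds, we often use cycles
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
• Operation of digital hardware is governed by a constant-rate clock
[Figure: clock timing diagram marking the clock period; each clock cycle covers data transfer and computation, then update state]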
• Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
• Clock period: duration of a clock cycle
– e.g., 250 ps = 0.25 ns = $250 \times 10^{-12}$ s (since 1 ps = 0.001 ns)
• Clock frequency (rate): cycles per second
– e.g., 4.0 GHz = 4000 MHz = $4.0 \times 10^{9}$ Hz
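Clock period and clock rate are reciprocals, so the two examples above describe the same clock. A small sketch of the conversion (the helper names are hypothetical):

```python
def period_to_rate_hz(period_s: float) -> float:
    """Clock rate in Hz for a clock period given in seconds."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    return 1.0 / period_s


def rate_to_period_s(rate_hz: float) -> float:
    """Clock period in seconds for a clock rate given in Hz."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate_hz


# 250 ps period <-> 4.0 GHz clock rate
rate_hz = period_to_rate_hz(250e-12)
print(f"250 ps period -> {rate_hz / 1e9:.1f} GHz")
print(f"4.0 GHz -> {rate_to_period_s(4.0e9) * 1e12:.0f} ps period")
```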
CPU Time
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$
• Performance improved by
– Reducing number of clock cycles
– Increasing clock rate
– Hardware designer must often trade off
clock rate against cycle count
CPU Time Example
• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
– Aim for 6s CPU time
– Can do faster clock, but causes 1.2 × clock cycles
• How fast must Computer B clock be?
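$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9} \quad (\text{since } 1\,\text{GHz} = 10^{9}\,\text{Hz})$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$$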
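The same derivation as a short Python sketch. The function name and parameter names are illustrative assumptions; the numbers are the ones given in the example.

```python
def required_clock_rate_hz(time_a_s: float, rate_a_hz: float,
                           cycle_factor: float, target_time_s: float) -> float:
    """Clock rate machine B needs to hit target_time_s.

    cycle_factor is how many times more clock cycles B needs than A
    for the same program (1.2 in the example).
    """
    cycles_a = time_a_s * rate_a_hz      # Clock Cycles_A = CPU Time_A x Clock Rate_A
    cycles_b = cycle_factor * cycles_a   # B needs 1.2x the cycles
    return cycles_b / target_time_s      # Clock Rate_B = Clock Cycles_B / CPU Time_B


rate_b = required_clock_rate_hz(time_a_s=10.0, rate_a_hz=2e9,
                                cycle_factor=1.2, target_time_s=6.0)
print(f"Computer B needs a {rate_b / 1e9:.1f} GHz clock")
```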
Instruction Count and CPI
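$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction (CPI)}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$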
• Instruction Count for a program
– Determined by program, ISA and compiler
• Average cycles per instruction
– Determined by CPU hardware
– If different instructions have different CPI
• Average CPI affected by instruction mix
CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
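$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
A is faster…
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
…by this much:
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$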
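A quick numerical check of this comparison. The instruction count below is a hypothetical placeholder: both machines run the same ISA, so it cancels in the ratio.

```python
def cpu_time_s(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU Time = Instruction Count x CPI x Clock Cycle Time."""
    return instruction_count * cpi * cycle_time_s


IC = 1_000_000  # hypothetical instruction count; any positive value gives the same ratio

time_a = cpu_time_s(IC, cpi=2.0, cycle_time_s=250e-12)
time_b = cpu_time_s(IC, cpi=1.2, cycle_time_s=500e-12)

ratio = time_b / time_a
faster = "A" if time_a < time_b else "B"
print(f"{faster} is faster; CPU Time_B / CPU Time_A = {ratio:.2f}")
```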
CPI in More Detail
• If different instruction classes take
different numbers of cycles
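$$\text{Clock Cycles} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \text{Instruction Count}_i\right)$$
• Weighted average CPI
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$
– The ratio $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$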
CPI Example
• Alternative compiled code sequences using
instructions in classes A, B, C
Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1
• Sequence 1: IC = 5
– Clock Cycles = 2×1 + 1×2 + 2×3 = 10
– Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
– Clock Cycles = 4×1 + 1×2 + 1×3 = 9
– Avg. CPI = 9/6 = 1.5
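The per-class bookkeeping above generalizes to any instruction mix. A minimal sketch, with the class CPIs and counts taken from the table and the function name chosen only for illustration:

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}  # cycles per instruction for each class


def cycles_and_cpi(counts: dict) -> tuple:
    """Return (total clock cycles, average CPI) for per-class instruction counts."""
    total_cycles = sum(CLASS_CPI[cls] * n for cls, n in counts.items())
    total_instructions = sum(counts.values())
    return total_cycles, total_cycles / total_instructions


seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", seq1), ("Sequence 2", seq2)):
    cycles, cpi = cycles_and_cpi(seq)
    print(f"{name}: IC = {sum(seq.values())}, cycles = {cycles}, avg CPI = {cpi:.1f}")
```

Sequence 2 executes more instructions but needs fewer cycles, so instruction count alone does not decide which sequence is faster.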
Performance Summary
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc (clock cycle time)
Component Analysis
Example
• Our favorite program runs in 10 seconds on computer A, which has a
4 GHz clock. We are trying to help a computer designer build a new
machine B that will run this program in 6 seconds. The designer can use
new (or perhaps more expensive) technology to substantially increase the
clock rate, but has informed us that this increase will affect the rest of the
CPU design, causing machine B to require 1.2 times as many clock cycles as
machine A for the same program. What clock rate should we tell the
designer to target?
• Don't panic: we can easily work this out from basic principles
Example
• Our favorite program runs in 10 seconds on computer A, which has
a 4 GHz clock. We are trying to help a computer designer build a
new machine B that will run this program in 6 seconds. The
designer can use new (or perhaps more expensive) technology to
substantially increase the clock rate, but has informed us that this
increase will affect the rest of the CPU design, causing machine B to
require 1.2 times as many clock cycles as machine A for the same
program. What clock rate should we tell the designer to target?
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
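One way to work the problem from the stated numbers, using seconds/program = cycles/program × seconds/cycle (a worked sketch, not taken from the slides):

$$\text{Clock Cycles}_A = 10\,\text{s} \times 4 \times 10^{9}\,\tfrac{\text{cycles}}{\text{s}} = 40 \times 10^{9}\ \text{cycles}$$
$$\text{Clock Cycles}_B = 1.2 \times 40 \times 10^{9} = 48 \times 10^{9}\ \text{cycles}$$
$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{48 \times 10^{9}}{6\,\text{s}} = 8 \times 10^{9}\,\text{Hz} = 8\,\text{GHz}$$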
CPI Example (Repeated problem)
• Suppose we have two implementations of the same instruction set
architecture (ISA). For some program,
– Machine A has a clock cycle time of 250 ps and a CPI of 2.0
– Machine B has a clock cycle time of 500 ps and a CPI of 1.2
• Which machine is faster for this program, and by how much?
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
Let’s Complicate Things a Little Bit… (Repeated problem)
• A compiler designer is trying to decide between two code sequences
for a particular machine. Based on the hardware implementation, there
are three different classes of instructions: Class A, Class B, and
Class C, and they require one, two, and three cycles (respectively).
– The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C
– The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C
• Which sequence will be faster? By how much?
• What is the CPI for each sequence?
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{cycles}}{\text{program}} \times \frac{\text{seconds}}{\text{cycle}}$$
Scary Stuff (New problem)
Op        Frequency   Cycle Count
ALU       43%         1
Load      21%         1
Store     12%         2
Branch    24%         2
Let’s say we were able to reduce the cycle count for
“Store” operations to 1 at the cost of slowing our
clock by 15%. Is this new design worthwhile?
$$\text{CPI}_{\text{original}} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction Count}} = \sum_{i=1}^{n} \text{CPI}_i \times \left(\frac{\text{IC}_i}{\text{Instruction Count}}\right)$$
Example (Contd.)
Old CPI = 0.43×1 + 0.21×1 + 0.12×2 + 0.24×2 = 1.36
New CPI = 0.43×1 + 0.21×1 + 0.12×1 + 0.24×2 = 1.24
Speedup = old time / new time
= (IC × old CPI × T) / (IC × new CPI × 1.15T)
= 1.36 / (1.24 × 1.15) = 1.36 / 1.426
≈ 0.95
The speedup is below 1 (the new design is slower), so don't make this change.
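The trade-off can be recomputed directly from the frequency table. This is a sketch under the slide's assumption that "slowing the clock by 15%" means the cycle time becomes 1.15T; the names are illustrative.

```python
# (frequency, cycle count) per operation class, from the table
original_mix = {
    "ALU": (0.43, 1),
    "Load": (0.21, 1),
    "Store": (0.12, 2),
    "Branch": (0.24, 2),
}


def average_cpi(mix: dict) -> float:
    """Weighted average CPI: sum of frequency x cycle count over all classes."""
    return sum(freq * cycles for freq, cycles in mix.values())


# Proposed design: Store takes 1 cycle, but the cycle time grows to 1.15 T.
new_mix = dict(original_mix, Store=(0.12, 1))
CYCLE_TIME_FACTOR = 1.15  # assumption: a 15% slower clock means a 15% longer cycle

old_cpi = average_cpi(original_mix)
new_cpi = average_cpi(new_mix)
speedup = old_cpi / (new_cpi * CYCLE_TIME_FACTOR)  # IC and T cancel

print(f"old CPI = {old_cpi:.2f}, new CPI = {new_cpi:.2f}, speedup = {speedup:.2f}")
print("worth it" if speedup > 1 else "not worth it: the new design is slower")
```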
Practice problems
What is MIPS?
• MIPS (millions of instructions per second): instruction execution
rate => higher is better
• Issues:
– Cannot compare processors with different instruction sets
– Varies between programs on the same processor
– Can vary inversely with performance
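For reference, the usual definition of MIPS in terms of the quantities used earlier in this unit (stated here as background; the slide itself does not give the formula):

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^{6}}$$

Because instruction count sits in the numerator, a version of a program that executes more (but simpler) instructions can post a higher MIPS rating while taking longer to run, which is how MIPS can vary inversely with performance.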
MIPS Example
Performance Measurement Overview
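$$\text{CPU time} = \text{CPU clock cycles for the program} \times \text{Clock Cycle Time}$$
$$\text{CPU time} = \frac{\text{CPU clock cycles for the program}}{\text{Clock Rate}}$$
$$\text{CPI} = \frac{\text{CPU clock cycles for the program}}{\text{IC}}$$
$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock Cycle Time}$$
$$\text{CPU time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock Rate}}$$
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock Cycle}}$$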
Performance Measurement Overview
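$$\text{CPU clock cycles for the program} = \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i$$
$$\text{CPU time} = \left(\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i\right) \times \text{Clock Cycle Time}$$
$$\text{Overall CPI} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction Count}} = \sum_{i=1}^{n} \text{CPI}_i \times \left(\frac{\text{IC}_i}{\text{Instruction Count}}\right)$$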
Exercise
Suppose we have made the following measurements:
• Frequency of FP operations = 25%
• Average CPI of FP operations = 4.0
• Average CPI of other instructions = 1.33
• Frequency of FPSQR = 2%
• CPI of FPSQR = 20
Assume that the two design alternatives are:
a) to reduce the CPI of FPSQR to 2 or
b) to reduce the average CPI of all FP operations to 2.
Compare these alternatives.
Solution
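$$\text{CPI}_{\text{original}} = \frac{\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i}{\text{Instruction Count}} = \sum_{i=1}^{n} \text{CPI}_i \times \left(\frac{\text{IC}_i}{\text{Instruction Count}}\right) = 4 \times 25\% + 1.33 \times 75\% \approx 2.0$$
$$\text{CPI}_{\text{saved on FPSQR}} = 2\% \times \left(\text{CPI}_{\text{old FPSQR}} - \text{CPI}_{\text{new FPSQR}}\right) = 2\% \times (20 - 2) = 0.36$$
$$\text{CPI}_{\text{overall for new FPSQR}} = \text{CPI}_{\text{original}} - \text{CPI}_{\text{saved on FPSQR}} = 2.0 - 0.36 = 1.64$$
$$\text{CPI}_{\text{overall for new FP}} = 75\% \times 1.33 + 25\% \times 2.0 \approx 1.5$$
$$\text{Speedup}_{\text{FP}} = \frac{\text{CPU Time}_{\text{original}}}{\text{CPU Time}_{\text{new}}} = \frac{\text{IC} \times \text{Clock Cycle} \times \text{CPI}_{\text{original}}}{\text{IC} \times \text{Clock Cycle} \times \text{CPI}_{\text{new}}} = \frac{\text{CPI}_{\text{original}}}{\text{CPI}_{\text{new}}} = \frac{2.00}{1.5} = 1.33$$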
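Both alternatives can be compared side by side with a short script. This is a sketch using the measurements from the exercise; the variable and function names are illustrative, and small differences from the rounded slide values come from carrying 1.33 without rounding the intermediate CPI to 2.0.

```python
def weighted_cpi(classes) -> float:
    """Overall CPI from (frequency, CPI) pairs."""
    return sum(freq * cpi for freq, cpi in classes)


FP_FREQ, FP_CPI = 0.25, 4.0          # all FP operations
OTHER_FREQ, OTHER_CPI = 0.75, 1.33   # everything else
FPSQR_FREQ, FPSQR_CPI = 0.02, 20.0   # FPSQR (a subset of the FP operations)

cpi_original = weighted_cpi([(FP_FREQ, FP_CPI), (OTHER_FREQ, OTHER_CPI)])

# Alternative (a): FPSQR CPI drops from 20 to 2.
cpi_new_fpsqr = cpi_original - FPSQR_FREQ * (FPSQR_CPI - 2.0)

# Alternative (b): average CPI of all FP operations drops to 2.
cpi_new_fp = weighted_cpi([(FP_FREQ, 2.0), (OTHER_FREQ, OTHER_CPI)])

# IC and clock cycle time are unchanged, so speedup is the CPI ratio.
speedup_fpsqr = cpi_original / cpi_new_fpsqr
speedup_fp = cpi_original / cpi_new_fp

print(f"(a) new FPSQR: CPI = {cpi_new_fpsqr:.2f}, speedup = {speedup_fpsqr:.2f}")
print(f"(b) new FP:    CPI = {cpi_new_fp:.2f}, speedup = {speedup_fp:.2f}")
```

Improving all FP operations gives the larger speedup, because FP instructions are a much larger share of the mix than FPSQR alone.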
Amdahl's Law
• The performance enhancement of an improvement is limited by how
much the improved feature is used. In other words: Don’t expect an
enhancement proportional to how much you enhanced something.
• Example:
"Suppose a program runs in 100 seconds on a machine, with
multiply operations responsible for 80 seconds of this time. How much
do we have to improve the speed of multiplication if we want the
program to run 4 times faster?"
How about making it 5 times faster?
Amdahl’s Law
1. Speedup = 4
2. Old execution time = 100 s
3. New execution time = 100/4 = 25 s
4. Affected part (multiplication) = 80 s
5. Unaffected part = 100 − 80 = 20 s
6. Execution time new = Execution time unaffected +
Execution time affected / Improvement
7. 25 = 20 + 80/Improvement
8. Improvement = 16
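The steps above can be turned into a small solver, which also settles the follow-up question about a 5× target. This is a sketch; `required_improvement` is a hypothetical helper name.

```python
from typing import Optional


def required_improvement(total_s: float, affected_s: float,
                         target_speedup: float) -> Optional[float]:
    """Factor by which the affected part must speed up to reach target_speedup.

    Solves  total/target = unaffected + affected/improvement  for improvement.
    Returns None when the target is unreachable, i.e. the unaffected part
    alone already uses up the whole new time budget.
    """
    unaffected_s = total_s - affected_s
    new_total_s = total_s / target_speedup
    budget_s = new_total_s - unaffected_s  # time left for the affected part
    if budget_s <= 0:
        return None
    return affected_s / budget_s


print(required_improvement(100.0, 80.0, 4.0))  # multiplication must be 16x faster
print(required_improvement(100.0, 80.0, 5.0))  # None: 20 s of other work remains
```

For the 5× target the new execution time would have to be 20 s, which the unaffected 20 s already consumes, so no finite improvement to multiplication is enough.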
Example: Speed up using parallel processors
Suppose an application is “almost all” parallel: 90%.
What is the speedup using 10, 100, and 1000 processors?
For P = 10: new time = old time × 10% + (old time × 90%) / 10
Speedup = old time / new time
Speedup (P = 10) = 5.3
Speedup (P = 100) = 9.2
Speedup (P = 1000) = 9.9
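A short sweep over processor counts shows how the serial 10% caps the speedup. This is a sketch that assumes the parallel part scales perfectly with the number of processors, as the slide does.

```python
def parallel_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup when only parallel_fraction of the work is spread over processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


PARALLEL_FRACTION = 0.90

for p in (10, 100, 1000):
    print(f"P = {p:4d}: speedup = {parallel_speedup(PARALLEL_FRACTION, p):.2f}")

# With unlimited processors the speedup approaches 1 / serial fraction.
limit = 1.0 / (1.0 - PARALLEL_FRACTION)
print(f"upper bound: {limit:.1f}")
```

Going from 100 to 1000 processors buys very little, because the remaining serial 10% dominates the run time.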
Amdahl’s Law Overview
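For reference, Amdahl's law in its usual form, which the worked examples in this unit apply. The symbols $F$ and $S$ are notation chosen here, not taken from the slides: $F$ is the fraction of the original execution time that can use the enhancement, and $S$ is the speedup of that fraction.

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution Time}_{\text{old}}}{\text{Execution Time}_{\text{new}}} = \frac{1}{(1 - F) + \dfrac{F}{S}}$$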
Example
• Suppose we are considering an enhancement that
runs 10 times faster than the original machine but is
only usable 40% of the time. What is the overall
speedup gained by incorporating the enhancement?
$$\text{Speedup} = \frac{1}{\dfrac{0.4}{10} + 0.6} \approx 1.56$$
Example
• Implementations of floating point square root vary significantly in
performance. Suppose FP square root (FPSQR) is responsible for
20% of the execution time of a critical benchmark on a machine.
One proposal is to add FPSQR hardware that will speed up this
operation by a factor of 10. The other alternative is just to try to make
all FP instructions run faster; FP instructions are responsible for a
total of 50% of the execution time. The design team believes that they
can make all FP instructions run two times faster with the same effort
as required for the fast square root. Compare these two design
alternatives.
$$\text{Speedup}_{\text{FPSQR}} = \frac{1}{\dfrac{0.2}{10} + (1 - 0.2)} \approx 1.22$$
$$\text{Speedup}_{\text{FP}} = \frac{1}{\dfrac{0.5}{2} + (1 - 0.5)} \approx 1.33$$
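The two alternatives can be compared with one general helper. This is a sketch; the function name and the dictionary labels are illustrative, and the fractions are shares of execution time as stated in the problem.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's law: overall speedup when fraction_enhanced of the execution
    time is made speedup_enhanced times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


alternatives = {
    "FPSQR hardware (20% of time, 10x)": (0.2, 10.0),
    "All FP faster (50% of time, 2x)": (0.5, 2.0),
}

for name, (fraction, factor) in alternatives.items():
    print(f"{name}: overall speedup = {overall_speedup(fraction, factor):.2f}")

best = max(alternatives, key=lambda name: overall_speedup(*alternatives[name]))
print(f"Better choice: {best}")
```

The modest 2× improvement wins because it applies to half of the execution time, while the 10× improvement applies to only a fifth.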