ELEC2300 Computer Organization
Lecture 4: Performance Evaluation
- Professor George Yuan
- Office: Rm. 2527
- Email: [email protected]
Note: some of the slides are adapted from Computer Organization and Design,
Copyright 1998 Morgan Kaufmann Publishers, and from the notes of Prof. Patterson's CS152
class, Copyright 1997 UCB.
OUTLINE
- What is computer performance?
- How do we evaluate performance?
ELEC2300 Computer Organization Fall 2013
Page 2
Which of these airplanes has the best performance?
4 types of airplanes fly between Hong Kong & Shanghai (distance: D mi.)

Airplane            Passengers   Range (mi)   Speed (mph)
Boeing 737-100      101          630          598
Boeing 747          470          4150         610
BAC/Sud Concorde    132          4000         1350
Douglas DC-8-50     146          8720         544

- Time to perform the task (execution time): execution time, response time, latency
    L = D / S
- Tasks per day, hour, week, sec, ns, ...: throughput, bandwidth
    T = (1 / L) * C = (S * C) / D
  (S is the speed; C is the number of passengers carried per flight)
- Latency and throughput are often in opposition
Example
- Execution time of Concorde vs. 747:
  - Concorde is 1350 mph / 610 mph = 2.2 times faster
- Throughput of Concorde vs. 747:
  - The 747 is 286700 pmph / 178200 pmph = 1.6 times faster (470 * 610 = 286700, 132 * 1350 = 178200; pmph = passenger-miles per hour)
- Conclusions:
  - Concorde is 2.2 times faster in terms of flying time.
  - The 747 is 1.6 times faster in terms of throughput.
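The airplane comparison can be reproduced with a few lines of Python. This is a minimal sketch: the passenger and speed figures come from the table, while the function names and the distance `DISTANCE_MI` are assumptions made only for illustration (the slides leave D unspecified, and it cancels out of both ratios).

```python
# Passenger capacity and cruising speed (mph), taken from the airplane table.
AIRPLANES = {
    "Boeing 737-100": (101, 598),
    "Boeing 747": (470, 610),
    "BAC/Sud Concorde": (132, 1350),
    "Douglas DC-8-50": (146, 544),
}

# Assumed placeholder for the distance D; the slides do not give a value.
DISTANCE_MI = 600.0


def latency_hours(distance_mi, speed_mph):
    """L = D / S: time for one flight."""
    return distance_mi / speed_mph


def throughput_pmph(passengers, speed_mph):
    """Passenger-miles per hour (S * C), the unit used in the example."""
    return passengers * speed_mph


for name, (passengers, speed) in AIRPLANES.items():
    print(name, round(latency_hours(DISTANCE_MI, speed), 2), throughput_pmph(passengers, speed))

# Ratios quoted in the example.
concorde_vs_747_time = 1350 / 610
boeing_vs_concorde_throughput = throughput_pmph(470, 610) / throughput_pmph(132, 1350)
print(round(concorde_vs_747_time, 1), round(boeing_vs_concorde_throughput, 1))
```

The same two machines rank differently depending on the metric, which is the point of the example.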
Execution Time vs. Throughput
- Execution time:
  - How long does it take for my job to run?
  - How long does it take to execute a job?
  - How long must I wait for the database query?
- Throughput:
  - How many tasks can the machine run at once?
  - What is the average execution rate?
  - How much work is getting done?
- Computer upgrade:
  1. P3 -> P4 (a faster processor)
  2. 1 P3 -> 2 P3 (a second processor)
- We will focus primarily on execution time for a single job.
Definitions
- For computer study,
    performance_X = 1 / execution time_X
- "X is n times faster than Y" means
    n = performance_X / performance_Y = execution time_Y / execution time_X
- Problem:
  - Machine A runs a program in 20 seconds (1 program/20 sec)
  - Machine B runs the same program in 25 seconds (1 program/25 sec)
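The slide does not state the question for this problem; assuming it asks how much faster machine A is than machine B, the definition can be applied directly. The function names below are hypothetical, chosen for this sketch.

```python
def performance(execution_time_s):
    """performance_X = 1 / execution time_X."""
    return 1.0 / execution_time_s


def times_faster(time_x_s, time_y_s):
    """Return n such that X is n times faster than Y."""
    return time_y_s / time_x_s


TIME_A_S = 20.0  # machine A, from the problem
TIME_B_S = 25.0  # machine B, from the problem

n = times_faster(TIME_A_S, TIME_B_S)
print(n)  # 1.25
```

Under that assumption, A is 1.25 times faster than B.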
Execution Time
- Elapsed time or response time
  - Counts everything (disk and memory accesses, I/O, etc.)
  - A useful number, but often not good for comparison purposes
- CPU time
  - Does not count I/O or time spent running other programs
  - Can be broken up into system time and user time
- Our focus: user CPU time
  - User CPU time: time spent executing the lines of code that are "in" our program
  - System CPU time: time the CPU spends executing system (kernel) code in order to run your program, such as reading files, moving information into and out of virtual memory, etc.
    performance_X = 1 / user CPU time_X
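As an illustration of the three kinds of time, Python's standard library exposes elapsed time through `time.perf_counter()` and user/system CPU time through `os.times()`. The helper `measure` and the workload `busy_loop` are hypothetical names for this sketch, and the numbers printed will vary from machine to machine.

```python
import os
import time


def measure(task):
    """Run task() and return its elapsed, user CPU, and system CPU time in seconds."""
    start_wall = time.perf_counter()
    start_cpu = os.times()
    task()
    end_cpu = os.times()
    end_wall = time.perf_counter()
    return {
        "elapsed": end_wall - start_wall,
        "user": end_cpu.user - start_cpu.user,
        "system": end_cpu.system - start_cpu.system,
    }


def busy_loop():
    """A small CPU-bound workload used only as an example."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


timings = measure(busy_loop)
print(timings)
```

For a CPU-bound task like this one, most of the elapsed time should show up as user CPU time; a task dominated by file I/O would show a larger gap between elapsed and CPU time.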
CPU Time Measurement: Clock Cycles
- Instead of reporting execution time in seconds, we often use cycles:
    seconds / program = (cycles / program) * (seconds / cycle)
- The processor runs machine instructions based on a clock
  [Figure: clock signal over time, with one clock cycle time marked]
- Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
- A 200 MHz clock has a cycle time of 1 / (200 * 10^6 Hz) = 5 ns
Relating the Metrics
- CPU time for a program:
    CPU time = CPU clock cycles * clock cycle time
             = CPU clock cycles / clock rate
- Common ways to improve performance (i.e. shorten CPU execution time):
  - Reduce the number of CPU clock cycles required for a program
  - Shorten the clock cycle time (i.e. increase the clock rate)
Example-Problem
- Description:
  - A program takes 10 seconds to run on a 400 MHz machine (computer A). We want to design a faster machine (computer B) that can run the same program in 6 seconds.
  - The increase in clock rate affects the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the program.
- Problem to solve:
  - What clock rate should machine B have?
Example - Answer
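One way to work the problem, sketched in Python with only the numbers from the problem statement (the function names are hypothetical): first recover machine A's cycle count from CPU time = CPU clock cycles / clock rate, then scale it by 1.2 and divide by the 6-second target.

```python
def cpu_clock_cycles(cpu_time_s, clock_rate_hz):
    """Rearranged from CPU time = CPU clock cycles / clock rate."""
    return cpu_time_s * clock_rate_hz


def required_clock_rate(clock_cycles, target_time_s):
    """Clock rate needed to finish clock_cycles within target_time_s."""
    return clock_cycles / target_time_s


cycles_a = cpu_clock_cycles(10.0, 400e6)  # machine A: 10 s at 400 MHz
cycles_b = 1.2 * cycles_a                 # machine B needs 1.2 times as many cycles
rate_b_hz = required_clock_rate(cycles_b, 6.0)

print(f"{rate_b_hz / 1e6:.0f} MHz")  # 800 MHz
```

Machine B therefore needs twice the clock rate of machine A to deliver a speedup of 10/6, because some of the gain is lost to the extra cycles.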
Cycle Number Calculation
- CPU time for a program:
    CPU time = CPU clock cycles * clock cycle time
             = CPU clock cycles / clock rate
[Figure: program -> (compiler) -> assembly program -> (assembler) -> machine instructions, giving Instruction #; ISA -> processor, giving clock cycles/instruction (CPI)]
    Cycle # = Instruction # * CPI
Cycles Per Instruction
- Wrong assumption:
  - # of CPU clock cycles in a program = # of instructions in the program
- Actual situation:
  - For some processors, some instructions may take more cycles than others:
    - E.g. multiplication takes more cycles than addition
    - Floating-point operations take more cycles than integer operations
    - Memory accesses take more cycles than register accesses
  - Conclusion: not all instructions require the same # of cycles to execute.
- Cycles per instruction (CPI): the average number of clock cycles that each instruction in a program takes to execute.
Cycles Per Instruction (CPI)
- Definition (for a given program):
    CPI = (CPU clock cycles) / (instruction count)
- A program has the same instruction count on two different implementations of the same instruction set architecture, but it may have different CPIs (because an instruction may require different numbers of clock cycles on different implementations). If the number of clock cycles for a program is known, knowing either the instruction count or the CPI determines the other.
- CPI provides a measure for comparing implementations.
- Instruction count can be measured using software tools or simulators.
Cycles Per Instruction
- Let there be n different instruction classes (with different CPIs). For a given program, suppose we know:
  - CPI_i = CPI for instruction class i
  - C_i = # of instructions of class i
- CPU clock cycles = CPI * instruction count. It can be generalized to
    CPU clock cycles = \sum_{i=1}^{n} (CPI_i * C_i)
  and
    CPI = \sum_{i=1}^{n} (CPI_i * C_i) / \sum_{i=1}^{n} C_i
CPI Example
- Suppose we have two implementations of the same instruction set architecture (ISA).
- For some program, machine A has a clock cycle time of 1 ns (1 GHz) and a CPI of 2.0. Machine B has a clock cycle time of 2 ns (500 MHz) and a CPI of 1.2. Which machine is faster for this program, and by how much?
- If two machines have the same ISA, which of our quantities (e.g., clock rate, CPI, execution time, # of instructions, MIPS) will always be identical?
Example - Solution
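A possible working of the first question, as a sketch: both machines run the same program on the same ISA, so the instruction count is the same and cancels out of the ratio. The value `INSTRUCTIONS` below is a hypothetical placeholder, and the function name is an assumption of this sketch.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count * CPI * clock cycle time."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical instruction count; any positive value gives the same ratio.
INSTRUCTIONS = 1_000_000

time_a_ns = cpu_time_ns(INSTRUCTIONS, 2.0, 1.0)  # machine A: CPI 2.0, 1 ns cycle
time_b_ns = cpu_time_ns(INSTRUCTIONS, 1.2, 2.0)  # machine B: CPI 1.2, 2 ns cycle

speedup = time_b_ns / time_a_ns
print(f"A is {speedup:.1f} times faster than B")  # A is 1.2 times faster than B
```

Machine A wins despite its higher CPI because its cycle time is half that of B. For the second question, the quantity that stays identical for the same program is the number of instructions, assuming the same compiled code runs on both machines.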
Relating the metrics
- For a given program X running on a machine A:
    Time = seconds / program
         = (# of instructions / program) * (# of clocks / # of instructions) * (seconds / clock)
         = instruction count * CPI * clock cycle time
         = instruction count * CPI / clock rate
- The only complete and reliable measure is CPU execution time.
- Other measures are unreliable. E.g. changing the instruction set to lower the instruction count may lead to a larger CPI or an organization with a slower clock rate. Either case can offset the improvement in instruction count.
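To see how a lower instruction count can be offset by a larger CPI, consider the sketch below. All numbers are hypothetical and chosen only to illustrate the point; they do not describe any real machine.

```python
def cpu_time_s(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical baseline design.
baseline_s = cpu_time_s(10_000_000, 1.5, 500e6)

# Hypothetical redesign: 20% fewer instructions, but each takes more cycles on average.
redesign_s = cpu_time_s(8_000_000, 2.0, 500e6)

print(f"baseline: {baseline_s * 1e3:.0f} ms, redesign: {redesign_s * 1e3:.0f} ms")
```

With these assumed values the redesign executes fewer instructions yet takes longer (32 ms against 30 ms), so instruction count alone would have suggested the wrong conclusion.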
Example Comparing Code Segments
- Description
  - A particular machine has the following hardware facts:

    Instruction class                A   B   C
    CPI for this instruction class   1   2   3

  - For a given C++ statement, a compiler designer considers two code sequences with the following instruction counts:

                     Instruction counts for instruction classes
    Code sequence    A   B   C
    1                2   1   2
    2                4   1   1

- Problem to solve
  - Which code sequence executes the most instructions? Which is faster? What is the CPI for each sequence?
Example - Answer
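A sketch of the calculation, using the per-class CPIs and instruction counts from the problem statement (the names `CPI_BY_CLASS`, `SEQUENCES`, and the helper functions are assumptions of this sketch):

```python
# CPI of each instruction class, from the problem statement.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two code sequences.
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def instruction_count(counts):
    """Total number of instructions: sum of C_i."""
    return sum(counts.values())


def clock_cycles(counts):
    """CPU clock cycles: sum of CPI_i * C_i."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts):
    """CPI = CPU clock cycles / instruction count."""
    return clock_cycles(counts) / instruction_count(counts)


for seq, counts in SEQUENCES.items():
    print(seq, instruction_count(counts), clock_cycles(counts), average_cpi(counts))
# 1 5 10 2.0
# 2 6 9 1.5
```

Sequence 2 executes more instructions (6 against 5) but needs fewer cycles (9 against 10), so it is the faster one on the same clock; its CPI is 1.5 compared with 2.0 for sequence 1.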
A misleading measure - MIPS
- There are some performance measures that are famous among computer manufacturers and sellers but are misleading!
- MIPS (million instructions per second) ("meaningless indication of processor speed")
  - MIPS = (instruction count) / (execution time * 10^6)
  - MIPS depends on:
    - Instruction set (instructions have different capabilities)
    - Program
  - MIPS can vary inversely with performance
  - Peak performance
Some Processors in MIPS
Processor                            Instructions per second (IPS)   Year
Motorola 68000                       1 MIPS @ 8 MHz                  1979
Intel 386DX                          8.5 MIPS @ 25 MHz               1988
Intel 486DX                          54 MIPS @ 66 MHz                1992
PowerPC G2                           35 MIPS @ 33 MHz                1994
Intel Pentium Pro                    541 MIPS @ 200 MHz              1996
ARM 7500FE                           35.9 MIPS @ 40 MHz              1996
PowerPC G3                           525 MIPS @ 233 MHz              1997
Zilog eZ80                           80 MIPS @ 50 MHz                1999
Intel Pentium III                    1354 MIPS @ 500 MHz             1999
AMD Athlon                           3561 MIPS @ 1.2 GHz             2000
Pentium 4                            9726 MIPS @ 3.2 GHz             2003
ARM Cortex A8                        2000 MIPS @ 1.0 GHz             2005
Xbox360 IBM Xenon Triple Core        6400 MIPS @ 3.2 GHz             2005
AMD Athlon 64 3800+ X2 (Dual Core)   14564 MIPS @ 2.0 GHz            2005
Intel Core2 Extreme QX6700           57063 MIPS @ 3.33 GHz           2006
Another misleading measure - MFLOPS
- MFLOPS (million floating-point operations per second):
  - MFLOPS = (# of floating-point operations) / (execution time * 10^6)
  - MFLOPS considers only floating-point operations (an addition, subtraction, multiplication, or division applied to a number in a single- or double-precision floating-point representation).
  - MFLOPS depends on:
    - Floating-point operation (e.g., addition and multiplication differ in complexity)
    - Program
  - Meaningless if there is little or no floating-point arithmetic.
MIPS example
- Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles, respectively. Both compilers are used to produce code for a large piece of software.
  - The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
  - The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
- What is the execution time for each sequence?
- What is the MIPS rating of this processor for each of the two code sequences?
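The two questions can be worked through as follows. This is a sketch using the figures from the problem statement; the dictionary and function names are assumptions made for illustration.

```python
CLOCK_RATE_HZ = 100_000_000              # 100 MHz machine
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts produced by each compiler, from the problem statement.
COMPILERS = {
    "compiler 1": {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000},
    "compiler 2": {"A": 10_000_000, "B": 1_000_000, "C": 1_000_000},
}


def execution_time_s(counts):
    """Execution time = total clock cycles / clock rate."""
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts):
    """MIPS = instruction count / (execution time * 10^6)."""
    return sum(counts.values()) / (execution_time_s(counts) * 1e6)


for name, counts in COMPILERS.items():
    print(f"{name}: {execution_time_s(counts):.2f} s, {mips(counts):.0f} MIPS")
# compiler 1: 0.10 s, 70 MIPS
# compiler 2: 0.15 s, 80 MIPS
```

The second compiler's code has the higher MIPS rating yet takes 50% longer to run, which is an instance of MIPS varying inversely with performance.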
Summary
- Some related terminology:
  - clock, clock cycle, cycle
  - clock cycle time, cycle time (seconds, us, ns)
  - clock rate, cycle rate (Hz, MHz)
  - CPI (cycles per instruction)
  - MIPS (millions of instructions per second)
- Performance is determined by the execution time
- Execution time calculation:
    Execution time = instruction count * CPI * clock cycle time
                   = instruction count * CPI / clock rate
OUTLINE
- What is computer performance?
- How do we evaluate performance?
Benchmarks
- Execution time calculation:
    Execution time = instruction count * CPI * clock cycle time
                   = instruction count * CPI / clock rate
- Benchmark: a set of specially designed programs to test the performance of a computer
- Performance is best determined by running a real application
  - Benchmarks are application specific: CPU performance, graphics, high-performance computing, object-oriented computing, Java applications, client-server models, mail systems, file systems, Web servers.
- SPEC (System Performance Evaluation Cooperative)
  - Companies have agreed on a set of real programs and inputs
  - Valuable indicator of computer performance
    - Processor (ISA implementation) + compiler
SPEC 89
- Compiler enhancements and performance
[Figure: SPEC performance ratio (vertical axis, ticks 100 to 800) for each benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing the compiler with the enhanced compiler]
SPEC CPU2000
- SPEC ratio
  - Reference: Sun Ultra 5_10 with a 300 MHz processor
- CINT2000, CFP2000
  - Geometric mean of SPEC ratios
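A minimal sketch of how a geometric mean of SPEC ratios could be computed. It assumes a ratio is the reference machine's run time divided by the measured run time and ignores any scaling convention SPEC applies to published scores; the benchmark times are hypothetical values, not real SPEC results.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Assumed definition: reference machine time / measured machine time."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """n-th root of the product of n values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference time, measured time) pairs in seconds.
HYPOTHETICAL_RUNS = [(1400.0, 700.0), (1800.0, 450.0), (1000.0, 1000.0)]

ratios = [spec_ratio(ref, measured) for ref, measured in HYPOTHETICAL_RUNS]
overall = geometric_mean(ratios)
print(ratios, round(overall, 2))
```

A useful property of the geometric mean is that the ranking of two machines does not depend on which machine is chosen as the reference.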
SPEC CPU2000 Benchmarks
SPEC CPU2000 ratings
Amdahl's Law
Execution Time After Improvement =
Execution Time Unaffected +
( Execution Time Affected / Amount of Improvement )
Example:
"Suppose a program runs in 100 seconds on a machine, with
multiplication responsible for 80 seconds of this time. How much do we
have to improve the speed of multiplication if we want the program to run
4 times faster?"
How about making the program 5 times faster?
Principle: Make the common case fast
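The multiplication example can be solved by rearranging Amdahl's Law for the amount of improvement. The sketch below uses the numbers from the example; the function names are hypothetical, and returning `None` for an unreachable target is a design choice of this sketch.

```python
def time_after_improvement(unaffected_s, affected_s, improvement):
    """Execution time after improvement = unaffected + affected / improvement."""
    return unaffected_s + affected_s / improvement


def required_improvement(total_s, affected_s, target_speedup):
    """Improvement factor needed on the affected part, or None if unreachable."""
    unaffected_s = total_s - affected_s
    target_time_s = total_s / target_speedup
    remaining_s = target_time_s - unaffected_s  # time left for the affected part
    if remaining_s <= 0:
        return None
    return affected_s / remaining_s


print(required_improvement(100.0, 80.0, 4.0))  # 16.0
print(required_improvement(100.0, 80.0, 5.0))  # None
```

Running 4 times faster requires multiplication to become 16 times faster. Running 5 times faster is impossible by improving multiplication alone, because the 20 seconds of unaffected time already equal the 20-second target.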
Example
- Suppose we enhance a machine, making all floating-point instructions five times faster. If the execution time of some benchmark before the floating-point enhancement is 10 seconds, what will the speedup be if half of the 10 seconds is spent executing floating-point instructions?
- We are looking for a benchmark to show off the new floating-point unit described above, and want the overall benchmark to show a speedup of 3. One benchmark we are considering runs for 100 seconds with the old floating-point hardware. How much of the execution time would floating-point instructions have to account for in this program in order to yield our desired speedup on this benchmark?
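Both parts follow from Amdahl's Law. The sketch below uses the numbers from the two questions; the function names are hypothetical.

```python
def speedup(total_s, affected_s, improvement):
    """Overall speedup when only affected_s of total_s is made faster."""
    new_time_s = (total_s - affected_s) + affected_s / improvement
    return total_s / new_time_s


def affected_time_for_speedup(total_s, improvement, target_speedup):
    """Solve total / target = (total - f) + f / improvement for f."""
    return (total_s - total_s / target_speedup) / (1.0 - 1.0 / improvement)


# Part 1: 5 of the 10 seconds are floating point, made 5 times faster.
print(f"{speedup(10.0, 5.0, 5.0):.2f}")  # 1.67

# Part 2: floating-point time needed in a 100 s benchmark for a speedup of 3.
print(f"{affected_time_for_speedup(100.0, 5.0, 3.0):.1f}")  # 83.3
```

In part 1 the new time is 5 + 5/5 = 6 seconds, giving a speedup of about 1.67. In part 2 floating-point instructions would have to account for about 83.3 of the 100 seconds.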
Remember
- Performance is specific to a particular program
  - Total execution time is a consistent summary of performance
- For a given architecture, performance increases come from:
  - Increases in clock rate (without adverse CPI effects)
  - Improvements in processor organization that lower CPI
  - Compiler enhancements that lower CPI and/or instruction count
- Pitfall: expecting an improvement in one aspect of a machine's performance to improve the total performance by a proportional amount
- You should not always believe everything you read! Read carefully!