Module-2 Introduction and Performance Analysis
Performance
Two important metrics
• Response Time or Latency or Execution Time – Time
taken to complete a single job (the time between the
start and the completion of an event). Smaller is better.
• Throughput – Number of jobs done per unit of time.
Larger is better.
Comparing Design Alternatives
• “X is n times faster than Y”
Measuring Performance
• Time is not always the metric quoted when
comparing the performance of computers.
• The most reliable measure of performance is the
execution time of real programs.
• What is Time?
– Wall-clock time, response time, or elapsed time: the total
time to complete a task, including disk accesses, memory
accesses, input/output activities, and operating system overhead
– CPU time (user CPU time plus system CPU time), which
excludes I/O time
Measuring Performance
• Computer designers measure how fast
the hardware can perform basic
functions.
• Computers are constructed using a clock
that runs at a constant rate and
determines when events take place in the
hardware.
• Key terms: clock cycles, clock period (cycle
time), and clock rate (the inverse of the clock period)
How Can We Improve Performance?
CPU Performance Evaluation: CPI
• Most computers run synchronously, using a CPU clock that runs at a
constant clock rate (or clock frequency f),
where: Clock rate = 1 / clock cycle time, i.e. f = 1 / C
[Figure: clock waveform showing cycle 1, cycle 2, cycle 3]
• The CPU clock rate depends on the specific CPU organization (design) and
hardware implementation technology (VLSI) used.
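The relation f = 1 / C can be checked with a short Python sketch. The function names are illustrative only, and the 200 MHz value is the clock rate used in the examples later in these notes.

```python
def cycle_time_s(clock_rate_hz: float) -> float:
    """Clock cycle time C in seconds for a clock rate f in Hz (C = 1 / f)."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return 1.0 / clock_rate_hz


def clock_rate_hz(cycle_time: float) -> float:
    """Clock rate f in Hz for a clock cycle time C in seconds (f = 1 / C)."""
    if cycle_time <= 0:
        raise ValueError("cycle time must be positive")
    return 1.0 / cycle_time


c = cycle_time_s(200e6)               # 200 MHz clock
print(f"{c * 1e9:.1f} ns per cycle")  # 5.0 ns per cycle
```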
Instruction Decode: determine required actions and instruction size
Computer Performance Measures: Program
Execution Time
• A specific program compiled to run on a specific machine (CPU)
“A” has the following parameters:
– I: The total executed instruction count of the program.
– CPI: The average number of cycles per instruction (average or effective CPI).
– C: Clock cycle time of machine “A”.
So…The Questions are..
Comparing Computer Performance Using Execution Time
• To compare the performance of two machines (or CPUs) “A”, “B” running
the same program:
PerformanceA = 1 / Execution TimeA
PerformanceB = 1 / Execution TimeB
• Machine A is n times faster than machine B (or slower, if n < 1) means:
$$\text{Speedup} = n = \frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution Time}_B}{\text{Execution Time}_A}$$
• Example:
For a given program:
Execution time on machine A: ExecutionA = 1 second
Execution time on machine B: ExecutionB = 10 seconds
Speedup=
PerformanceA / PerformanceB = Execution TimeB / Execution TimeA
= 10 / 1 = 10
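The comparison above can be written as a small Python sketch. The function name `speedup` is an illustrative choice, not something from the slides.

```python
def speedup(exec_time_a: float, exec_time_b: float) -> float:
    """Return n such that machine A is n times faster than machine B.

    Performance = 1 / execution time, so the ratio of performances is
    the inverse ratio of execution times.
    """
    if exec_time_a <= 0 or exec_time_b <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_b / exec_time_a


n = speedup(exec_time_a=1.0, exec_time_b=10.0)
print(n)  # 10.0
```

A result below 1 means machine A is slower than machine B.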
CPU Execution Time: The CPU Equation
• A program comprises a number of executed instructions, I
– Measured in: instructions/program
• The average instruction executed takes a number of cycles per
instruction (CPI) to be completed.
– Measured in: cycles/instruction, CPI
• CPU has a fixed clock cycle time C = 1/clock rate
– Measured in: seconds/cycle
T = I x CPI x C
where T is the execution time per program in seconds, I is the number of
instructions executed, CPI is the average CPI for the program, and C is the
CPU clock cycle time.
This equation is commonly known as the CPU performance equation
CPU Performance Equation
For a given program executed on a given machine (CPU):
CPI = Total program execution cycles / Instructions count
(i.e., average or effective CPI)
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
CPU time = Instruction count x CPI x Clock cycle Time
= 10,000,000 x 2.5 x 1 / clock rate
= 10,000,000 x 2.5 x 5 x 10^-9
= 0.125 seconds
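As a quick check of the arithmetic, here is a minimal Python sketch of the CPU performance equation. It assumes the 200 MHz clock rate implied by the 5 x 10^-9 second cycle time; the function name is illustrative.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds: T = I x CPI x C, with C = 1 / clock rate."""
    cycle_time = 1.0 / clock_rate_hz
    return instruction_count * cpi * cycle_time


t = cpu_time(instruction_count=10_000_000, cpi=2.5, clock_rate_hz=200e6)
print(f"{t:.3f} seconds")  # 0.125 seconds
```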
Aspects of CPU Execution Time
CPU Time = Instruction count executed x CPI x Clock cycle
– Instruction count (IC, executed) depends on: program used, compiler, ISA
– Average CPI depends on: program used, compiler, ISA, CPU organization
– Clock cycle time (CCT) depends on: CPU organization, technology (VLSI)
Factors Affecting CPU
Performance
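$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
– Compiler: affects instruction count and CPI
– Instruction Set Architecture (ISA): affects instruction count and CPI
– Organization (CPU design): affects CPI and clock cycle time
– Technology (VLSI): affects clock cycle time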
Performance Comparison: Example
• From the previous example: A Program is running on a specific machine
(CPU) with the following parameters:
– Total executed instruction count, IC: 10,000,000 instructions
– Average CPI for the program: 2.5 cycles/instruction.
– CPU clock rate: 200 MHz. Thus: CCT = 1 / (200 x 10^6) = 5 x 10^-9 seconds
$$\text{Speedup} = \frac{\text{Old Execution Time}}{\text{New Execution Time}} = \frac{I_{old} \times CPI_{old} \times \text{Clock cycle}_{old}}{I_{new} \times CPI_{new} \times \text{Clock cycle}_{new}}$$
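The speedup ratio can be sketched in Python as follows. The "old" parameters are the ones listed above. The "new" parameters (9,500,000 instructions, CPI of 3.0, 300 MHz) are hypothetical values chosen only to illustrate the calculation and are not taken from the slides.

```python
def exec_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time in seconds: I x CPI x clock cycle time."""
    return instruction_count * cpi / clock_rate_hz


def speedup(old: tuple, new: tuple) -> float:
    """Old execution time divided by new execution time.

    Each argument is (instruction count, CPI, clock rate in Hz).
    """
    return exec_time(*old) / exec_time(*new)


old_design = (10_000_000, 2.5, 200e6)  # parameters from the example above
new_design = (9_500_000, 3.0, 300e6)   # hypothetical values for illustration

print(f"speedup = {speedup(old_design, new_design):.2f}")
```

Note that the new design has a worse CPI yet still wins overall, because the instruction count and the cycle time both dropped.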
Instruction Types & CPI
• Given a program with n types or classes of instructions executed on a given
CPU with the following characteristics:
Instruction Types & CPI: An Example
• An instruction set has three instruction classes:
Instruction class   CPI
A                   1
B                   2
C                   3
(CPI values are for a specific CPU design.)
$$\text{Fraction of total execution time for instructions of type } i = \frac{CPI_i \times F_i}{CPI}$$
Instruction Type Frequency & CPI:
A RISC Example
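Base Machine (Reg / Reg), typical mix. The program profile (executed
instruction mix) gives the frequency F_i of each instruction type; the
cycle count CPI_i of each type depends on the CPU design.
$$CPI = \sum_{i=1}^{n} CPI_i \times F_i \quad \text{(i.e., average or effective CPI)}$$
$$\text{Fraction of execution time for type } i = \frac{CPI_i \times F_i}{CPI}$$
CPI = .5 x 1 + .2 x 5 + .1 x 3 + .2 x 2
= .5 + 1 + .3 + .4
= 2.2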
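The weighted-sum calculation can be sketched in Python. The operation names (ALU, Load, Store, Branch) are taken from the instruction-mix table that appears later in these notes; the dictionary layout and function names are illustrative.

```python
# Instruction mix: operation -> (frequency F_i, cycles CPI_i)
MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def effective_cpi(mix: dict) -> float:
    """Average CPI: sum over instruction types of CPI_i x F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_fractions(mix: dict) -> dict:
    """Fraction of execution time per type: (CPI_i x F_i) / CPI."""
    cpi = effective_cpi(mix)
    return {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}


print(f"CPI = {effective_cpi(MIX):.1f}")  # CPI = 2.2
for op, frac in time_fractions(MIX).items():
    print(f"{op}: {frac:.0%} of execution time")
```

Loads make up only 20% of the executed instructions but account for the largest share of execution time, because each one costs 5 cycles.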
Metrics of Computer Performance (Measures)
Each level of the system has its own typical metric:
– Application: execution time (target workload, SPEC, etc.)
– Programming language, compiler, ISA: (millions) of instructions per
second, MIPS; (millions) of (F.P.) operations per second, MFLOP/s
– Datapath, control: megabytes per second
– Function units, transistors, wires, pins: cycles per second (clock rate)
Choosing Programs To Evaluate
Performance
Levels of programs or benchmarks that could be used to evaluate
performance:
– Actual Target Workload: Full applications that run on the
target machine.
– Real Full Program-based Benchmarks:
• Select a specific mix or suite of programs that are typical of
targeted applications or workload (e.g., SPEC95, SPEC
CPU2000). These are real programs, not synthetic benchmarks.
– Small “Kernel” Benchmarks:
• Key computationally-intensive pieces extracted from real programs.
– Examples: Matrix factorization, FFT, tree search, etc.
• Best used to test specific aspects of the machine.
– Microbenchmarks:
• Small, specially written programs that isolate a specific aspect of
performance: integer processing, floating point,
local memory, input/output, etc.
Computer Performance Measures :
MIPS (Million Instructions Per Second) Rating
• For a specific program running on a specific CPU the MIPS rating is a measure of
how many millions of instructions are executed per second:
MIPS Rating = Instruction count / (Execution Time x 10^6)
= Instruction count / (CPU clocks x Cycle time x 10^6)
= (Instruction count x Clock rate) / (Instruction count x CPI x 10^6)
= Clock rate / (CPI x 10^6)
• Major problem with MIPS rating: As shown above the MIPS rating does not account for the
count of instructions executed (I).
– A higher MIPS rating in many cases may not mean higher performance or
better execution time, e.g., because of compiler design variations.
• In addition, the MIPS rating:
– Does not account for the instruction set architecture (ISA) used.
• Thus it cannot be used to compare computers/CPUs with different instruction sets.
– Easy to abuse: Program used to get the MIPS rating is often omitted.
• Often the Peak MIPS rating is provided for a given CPU which is obtained using a
program comprised entirely of instructions with the lowest CPI for the given CPU
design which does not represent real programs.
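The two equivalent forms of the MIPS rating can be sketched in Python. The sample numbers reuse the 200 MHz, CPI = 2.5, 10,000,000-instruction program from the earlier CPU time example; the function names are illustrative.

```python
def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """MIPS rating = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def mips_from_time(instruction_count: float, exec_time_s: float) -> float:
    """MIPS rating = instruction count / (execution time x 10^6)."""
    return instruction_count / (exec_time_s * 1e6)


# Both forms agree for the same program on the same machine.
print(mips_from_clock(200e6, 2.5))        # 80.0
print(mips_from_time(10_000_000, 0.125))  # 80.0
```

The instruction count cancels out of the first form, which is why the rating by itself says nothing about how long a program takes to run.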
Compiler Variations, MIPS & Performance:
An Example
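• For a machine (CPU) with instruction classes A, B, and C (CPI of 1, 2,
and 3 respectively) and a 100 MHz clock, two compilers produce code with
the following executed instruction counts (in millions):
Compiler   Class A   Class B   Class C
1          5         1         1
2          10        1         1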
Compiler Variations, MIPS & Performance:
An Example (Continued)
MIPS = Clock rate / (CPI x 10^6) = 100 MHz / (CPI x 10^6)
• For compiler 1:
– CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43
– MIPS Rating1 = (100 x 10^6) / (1.428 x 10^6) = 70.0 MIPS
– CPU time1 = ((5 + 1 + 1) x 10^6 x 1.43) / (100 x 10^6) = 0.10 seconds
• For compiler 2:
– CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25
– MIPS Rating2 = (100 x 10^6) / (1.25 x 10^6) = 80.0 MIPS
– CPU time2 = ((10 + 1 + 1) x 10^6 x 1.25) / (100 x 10^6) = 0.15 seconds
The MIPS rating indicates that compiler 2 is better, while in reality the code
produced by compiler 1 is faster.
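The comparison can be reproduced with a short Python sketch. The class CPIs (1, 2, 3), the instruction counts in millions (5/1/1 and 10/1/1), and the 100 MHz clock are read off the arithmetic above; the function and variable names are illustrative.

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}  # cycles per instruction for each class
CLOCK_RATE_HZ = 100e6                 # 100 MHz


def analyze(counts_millions: dict) -> tuple:
    """Return (average CPI, MIPS rating, CPU time in seconds) for one compiler."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(n * CLASS_CPI[cls] for cls, n in counts_millions.items()) * 1e6
    cpi = cycles / instructions
    mips = CLOCK_RATE_HZ / (cpi * 1e6)
    cpu_time = cycles / CLOCK_RATE_HZ
    return cpi, mips, cpu_time


compilers = {
    "compiler 1": {"A": 5, "B": 1, "C": 1},
    "compiler 2": {"A": 10, "B": 1, "C": 1},
}
for name, counts in compilers.items():
    cpi, mips, cpu_time = analyze(counts)
    print(f"{name}: CPI = {cpi:.2f}, MIPS = {mips:.1f}, CPU time = {cpu_time:.2f} s")
```

Compiler 2 earns the higher MIPS rating only because it executes more cheap class A instructions; its total cycle count, and therefore its CPU time, is larger.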
Computer Performance Measures :
MFLOPS (Million FLOating-Point Operations Per Second)
• A floating-point operation is an addition, subtraction, multiplication, or division
operation applied to numbers represented by a single or a double precision
floating-point representation.
• MFLOPS, for a specific program running on a specific computer, is a measure of
millions of floating-point operations (megaflops) per second:
MFLOPS = Number of floating-point operations / (Execution time x 10^6)
Before:
Execution time without enhancement E (i.e., before the enhancement is applied)
• Shown normalized to 1 = (1 - F) + F
• Unaffected fraction: (1 - F), unchanged by the enhancement
• Affected fraction: F
An Alternative Solution Using CPU Equation
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        .5       23%
Load     20%    5        1.0      45%
Store    10%    3        .3       14%
Branch   20%    2        .4       18%
CPI = 2.2
If a CPU design enhancement improves the CPI of load instructions
from 5 to 2, what is the resulting performance improvement from this
enhancement?
Old CPI = 2.2
New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6
Speedup(E) = Original Execution Time / New Execution Time
= (Instruction count x old CPI x clock cycle) / (Instruction count x new CPI x clock cycle)
= old CPI / new CPI = 2.2 / 1.6 = 1.375
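Because the instruction count and the clock cycle are unchanged by the enhancement, they cancel and the speedup reduces to old CPI / new CPI. A minimal Python sketch of that calculation, with illustrative names:

```python
# Instruction mix from the table: operation -> (frequency, cycles)
OLD_MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def effective_cpi(mix: dict) -> float:
    """Average CPI as the frequency-weighted sum of per-type cycle counts."""
    return sum(freq * cycles for freq, cycles in mix.values())


# The enhancement lowers the load CPI from 5 to 2; everything else is unchanged.
new_mix = dict(OLD_MIX)
new_mix["Load"] = (0.2, 2)

old_cpi = effective_cpi(OLD_MIX)
new_cpi = effective_cpi(new_mix)
speedup = old_cpi / new_cpi  # instruction count and clock cycle cancel

print(f"old CPI = {old_cpi:.1f}, new CPI = {new_cpi:.1f}, speedup = {speedup:.3f}")
```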
Performance Enhancement Example
• A program runs in 100 seconds on a machine with multiply operations
responsible for 80 seconds of this time. By how much must the
speed of multiplication be improved to make the program four times
faster?
Desired speedup = 4 = 100 / Execution time with enhancement
Execution time with enhancement = 100/4 = 25 seconds
25 seconds = (100 - 80 seconds) + 80 seconds / S
25 seconds = 20 seconds + 80 seconds / S
5 = 80 seconds / S
S = 80/5 = 16
Alternatively, it can also be solved by finding enhanced fraction of execution time:
F = 80/100 = .8
and then solving Amdahl’s speedup equation for desired enhancement factor S
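The alternative route (solve Amdahl's speedup equation for S) can be sketched in Python. The function name and the error handling are illustrative choices, not part of the slides.

```python
def required_factor(fraction_enhanced: float, target_speedup: float) -> float:
    """Solve 1 / ((1 - F) + F / S) = target_speedup for S."""
    unaffected = 1.0 - fraction_enhanced
    # Share of the original time that the enhanced part may still occupy.
    remaining = 1.0 / target_speedup - unaffected
    if remaining <= 0:
        raise ValueError("target speedup is not reachable by enhancing this fraction")
    return fraction_enhanced / remaining


s = required_factor(fraction_enhanced=0.8, target_speedup=4)
print(f"multiplication must become {s:.0f} times faster")  # 16 times faster
```

The check on `remaining` reflects the limit in Amdahl's Law: the unaffected 20 seconds put a ceiling on the overall speedup, however fast multiplication becomes.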
Extending Amdahl's Law To Multiple Enhancements
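Suppose n enhancements are applied, each affecting a different portion of
execution time: enhancement i speeds up a fraction F_i of the original
execution time by a factor S_i. The resulting overall speedup is:
$$\text{Speedup} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$
Note: all fractions F_i refer to the original execution time, before the
enhancements are applied.
Example: three enhancements with S1 = 10, S2 = 15, S3 = 30 affect
F1 = 20%, F2 = 15%, F3 = 10% of the original execution time.
• While all three enhancements are in place in the new design, each
enhancement affects a different portion of the code and only one
enhancement can be used at a time.
• What is the resulting overall speedup?
Before:
Execution time normalized to 1 = .55 (unchanged) + .2 + .15 + .1
After:
Execution time with enhancements: .55 + .2 / 10 + .15 / 15 + .1 / 30
= .55 + .02 + .01 + .00333 = .5833
Speedup = 1 / .5833 = 1.71
What if the fractions given are after the enhancements were applied?
How would you solve the problem (i.e., find an expression for the speedup)?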
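Amdahl's Law for multiple enhancements, Speedup = 1 / ((1 - sum of F_i) + sum of F_i / S_i), can be sketched in Python. The fractions 0.20, 0.15, 0.10 and the factors 10, 15, 30 are the ones behind the arithmetic above; the function name is illustrative.

```python
def amdahl_multi(enhancements: list) -> float:
    """Overall speedup for enhancements given as (F_i, S_i) pairs.

    Each F_i is a fraction of the ORIGINAL execution time, and the
    enhancements are assumed to affect disjoint portions of it.
    """
    total_fraction = sum(f for f, _ in enhancements)
    if not 0 <= total_fraction <= 1:
        raise ValueError("fractions must sum to a value between 0 and 1")
    new_time = (1.0 - total_fraction) + sum(f / s for f, s in enhancements)
    return 1.0 / new_time


result = amdahl_multi([(0.20, 10), (0.15, 15), (0.10, 30)])
print(f"speedup = {result:.2f}")  # speedup = 1.71
```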
“Reverse” Multiple Enhancements Amdahl's
Law
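• Multiple Enhancements Amdahl's Law assumes that the fractions given
refer to original execution time.
• If instead each fraction F_i is given as a fraction of the resulting
execution time (after the enhancements were applied), then:
$$\text{Speedup} = \frac{\left(\left(1 - \sum_i F_i\right) + \sum_i F_i \times S_i\right) \times \text{Resulting Execution Time}}{\text{Resulting Execution Time}} = \left(1 - \sum_i F_i\right) + \sum_i F_i \times S_i$$
where (1 - Σ F_i) is the unaffected fraction.
For three enhancements with S = 10, 15, 30 and F = .2, .15, .1 taken as
fractions of the resulting execution time:
Speedup = (1 - .2 - .15 - .1) + .2 x 10 + .15 x 15 + .1 x 30
= .55 + 2 + 2.25 + 3
= 7.8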
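The "reverse" form, Speedup = (1 - sum of F_i) + sum of F_i x S_i, can be sketched in Python, with each fraction measured against the resulting execution time rather than the original one. The function name is illustrative; the inputs are the fractions 0.2, 0.15, 0.1 and factors 10, 15, 30 behind the result 7.8.

```python
def reverse_amdahl(enhancements: list) -> float:
    """Overall speedup for enhancements given as (F_i, S_i) pairs.

    Each F_i is a fraction of the RESULTING execution time, i.e. measured
    after the enhancements were applied.
    """
    total_fraction = sum(f for f, _ in enhancements)
    if not 0 <= total_fraction <= 1:
        raise ValueError("fractions must sum to a value between 0 and 1")
    # Scale each enhanced portion back up by its factor to recover the
    # original time, expressed relative to a resulting time of 1.
    return (1.0 - total_fraction) + sum(f * s for f, s in enhancements)


result = reverse_amdahl([(0.2, 10), (0.15, 15), (0.1, 30)])
print(f"speedup = {result:.1f}")  # speedup = 7.8
```

The same three enhancements give 1.71 when the fractions refer to the original time and 7.8 when they refer to the resulting time, so it matters which baseline the fractions were measured against.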
Probable Conclusions
1. Total Number of instructions is definitely
not a good metric.
2. MIPS is a good metric.
Conclusion
Total execution time is always the better metric, as
it sums up all factors and cannot be replaced
by considering any one of the following alone:
1. MIPS
2. Total number of instructions
3. Clock rate
Measuring Performance
Now that we know that performance
depends on the program, which
program(s) should be used to measure
performance?
Benchmarks.
Benchmarks
• Are a set of programs that are specifically
chosen for measuring performance.
• Types of Benchmarks
– Real Programs
– Kernel
• Extract the key feature from a program
– Component
– Synthetic
• Dhrystone – Integer and String Arithmetic
• Whetstone – Floating Point
– I/O
– Parallel
Challenges
1. Vendors may tinker with benchmarks to
make them run better on their platform.
At times this is permitted.
2. Give data set rather than a single
performance number.
3. Concentrate only on computational
power.
Popular Benchmarks
• SPEC - Standard Performance Evaluation Corporation
– Floating point
– Integer
– Web
– Graphics
• TPC – Transaction Processing Performance Council
– Web Server
– Transaction Processing
– Decision Support Systems
• BAPCo – Business Applications Performance Corporation
– Popular business applications
• EEMBC – Embedded Microprocessor Benchmark Consortium
– Embedded Applications
Statistical Summarization of Data
For Response time metric
Arithmetic Mean
Reference
• David A. Patterson (University of California, Berkeley) and
John L. Hennessy (Stanford University), Computer Organization
and Design: The Hardware/Software Interface.