The document discusses the fundamentals of computer architecture, focusing on performance improvements through advancements in semiconductor technology and computer architectures. It highlights the shift from single-processor performance to multi-processor systems, emphasizing various classes of parallelism and trends in technology, power, and cost. Key principles of computer design, including Amdahl's Law and pipelining, are also addressed to optimize performance.


PADP (18CS73)
Unit 1
Dr. Minal Moharir

Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 1: Fundamentals of Quantitative Design and Analysis

Copyright © 2012, Elsevier Inc. All rights reserved.


Introduction
Computer Technology
■ Performance improvements come from:
■ Improvements in semiconductor technology
■ Feature size, clock speed
■ Improvements in computer architectures
■ Enabled by HLL compilers and UNIX
■ Led to RISC architectures
■ Together these have enabled:
■ Lightweight computers
■ Productivity-based managed/interpreted programming languages


Introduction
Single Processor Performance
(Figure: growth in single-processor performance; annotations mark the rise of RISC and the move to multi-processor designs. Graphical content not preserved in this export.)


Introduction
Current Trends in Architecture
■ Cannot continue to leverage instruction-level parallelism (ILP) alone
■ Single-processor performance improvement ended in 2003
■ New models for performance:
■ Data-level parallelism (DLP)
■ Thread-level parallelism (TLP)
■ Request-level parallelism (RLP)
■ These require explicit restructuring of the application


Classes of Computers
Parallelism
Parallelism at multiple levels is now the driving force of computer design across all four classes of computers, with energy and cost being the primary constraints.
■ Classes of parallelism in applications:
■ Data-Level Parallelism (DLP)
■ Task-Level Parallelism (TLP)
■ Classes of architectural parallelism:
■ Instruction-Level Parallelism (ILP)
■ Vector architectures/Graphics Processor Units (GPUs)
■ Thread-Level Parallelism
■ Request-Level Parallelism


Parallelism
Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:
■ 1. Instruction-Level Parallelism exploits DLP, e.g. via pipelining and speculative execution.
■ 2. Vector Architectures and Graphics Processor Units (GPUs) exploit DLP by applying a single instruction to a collection of data in parallel.
■ 3. Thread-Level Parallelism exploits either DLP or TLP in a tightly coupled hardware model that allows for interaction among parallel threads.
■ 4. Request-Level Parallelism exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.


Classes of Computers
Flynn’s Taxonomy
■ Single instruction stream, single data stream (SISD)
■ Single instruction stream, multiple data streams (SIMD)
■ Vector architectures
■ Multimedia extensions
■ Graphics processor units
■ Multiple instruction streams, single data stream (MISD)
■ No commercial implementation
■ Multiple instruction streams, multiple data streams (MIMD)
■ Tightly coupled MIMD
■ Loosely coupled MIMD: cluster computing


Defining Computer Architecture
■ “Old” view of computer architecture:
■ Instruction Set Architecture (ISA) design
■ i.e. decisions regarding:
■ registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
■ “Real” computer architecture:
■ Specific requirements of the target machine
■ Design to maximize performance within constraints: cost, power, and availability
■ Includes ISA, microarchitecture, hardware


Technology
■ If an ISA is to be successful, it must be designed to survive rapid changes in computer technology. A successful ISA may last decades: for example, the core of the IBM mainframe ISA has been in use for nearly 50 years.
■ An architect must plan for technology changes that can increase the lifetime of a successful computer.
■ To plan for the evolution of a computer, the designer must be aware of rapid changes in implementation technology.
■ Five implementation technologies, which change at a dramatic pace, are critical to modern implementations: integrated circuit logic technology, semiconductor DRAM, semiconductor flash, magnetic disk technology, and network technology.


Trends in Technology
■ Integrated circuit technology
■ Transistor density: 35%/year
■ Die size: 10-20%/year
■ Integration overall: 40-55%/year
■ DRAM capacity: 25-40%/year (slowing)
■ Flash capacity: 50-60%/year
■ 15-20X cheaper/bit than DRAM
■ Magnetic disk technology: 40%/year
■ 15-25X cheaper/bit than Flash
■ 300-500X cheaper/bit than DRAM


Bandwidth and Latency
■ Bandwidth or throughput
■ Total work done in a given time
■ 10,000-25,000X improvement for processors
■ 300-1200X improvement for memory and disks
■ Latency or response time
■ Time between start and completion of an event
■ 30-80X improvement for processors
■ 6-8X improvement for memory and disks


Bandwidth and Latency
(Figure: log-log plot of bandwidth and latency milestones. Graphical content not preserved in this export.)


Transistors and Wires
■ Feature size
■ Minimum size of a transistor or wire in the x or y dimension
■ 10 microns in 1971 to 0.032 microns in 2011
■ Transistor performance scales linearly
■ Wire delay does not improve with feature size!
■ Integration density scales quadratically


Trends in Power and Energy
Power and Energy
■ Problem: get power in, get power out
■ Thermal Design Power (TDP)
■ Characterizes sustained power consumption
■ Used as target for power supply and cooling system
■ Lower than peak power, higher than average power consumption
■ Clock rate can be reduced dynamically to limit power consumption
■ Energy per task is often a better measurement
Trends in Power and Energy
Dynamic Energy and Power
■ Dynamic energy (per transistor switch, 0 -> 1 or 1 -> 0)
■ ½ × Capacitive load × Voltage²
■ Dynamic power
■ ½ × Capacitive load × Voltage² × Frequency switched
■ Reducing clock rate reduces power, not energy
■ For example, processor A may have a 20% higher average power consumption than processor B, but if A executes the task in only 70% of the time needed by B, its energy consumption is 1.2 × 0.7 = 0.84 of B’s, which is clearly better.




Trends in Power and Energy
Power
■ Intel 80386 consumed ~2 W
■ 3.3 GHz Intel Core i7 consumes 130 W
■ Heat must be dissipated from a 1.5 x 1.5 cm chip
■ This is the limit of what can be cooled by air


Trends in Power and Energy
Reducing Power
■ Techniques for reducing power:
■ Do nothing well
■ Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled.
■ Dynamic voltage-frequency scaling (DVFS)
■ There are periods of low activity where there is no need to operate at the highest clock frequency and voltage. Modern microprocessors typically offer a few clock frequencies and voltages that use lower power and energy.


■ Low-power states for DRAM and disks:
■ Low-power modes to save energy
■ Overclocking, turning off cores:
■ Turbo mode: with some cores turned off, the remaining cores can run at a higher clock rate until the temperature rises beyond a certain threshold; as a result, performance varies with time.


Trends in Power and Energy
Static Power
■ Static power consumption
■ Current_static × Voltage
■ Scales with number of transistors
■ To reduce: power gating
■ Power gating is a technique used in integrated circuit design to reduce power consumption by shutting off the current to blocks of the circuit that are not in use.


Trends in Cost
■ Cost driven down by the learning curve
■ Yield
■ DRAM: price closely tracks cost
■ Microprocessors: price depends on volume
■ 10% less for each doubling of volume


Trends in Cost
Integrated Circuit Cost
■ Integrated circuit die yield (Bose-Einstein formula):
■ Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
■ Defects per unit area = 0.016-0.057 defects per square cm (2010)
■ N = process-complexity factor = 11.5-15.5 (40 nm, 2010)




Dependability
■ Module reliability
■ Mean time to failure (MTTF)
■ Mean time to repair (MTTR)
■ Mean time between failures (MTBF) = MTTF + MTTR
■ Availability = MTTF / MTBF




Measuring Performance
■ Typical performance metrics:
■ Response time
■ Throughput
■ Speedup of X relative to Y
■ Speedup = Execution time_Y / Execution time_X
■ Execution time
■ Wall clock time: includes all system overheads
■ CPU time: only computation time
■ Benchmarks
■ Kernels (e.g. matrix multiply)
■ Toy programs (e.g. sorting)
■ Synthetic benchmarks (e.g. Dhrystone)
■ Benchmark suites (e.g. SPEC06fp, TPC-C)


Principles of Computer Design
■ Take advantage of parallelism
■ e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
■ Principle of locality
■ Reuse of data and instructions
■ Temporal locality states that recently accessed items are likely to be accessed in the near future.
■ Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.


Amdahl’s Law
■ Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):
■ Speedup = T(1) / T(j)
■ where T(j) is the time it takes to execute the program when using j processors.


Amdahl’s Law
■ If there are N workers working on a project, we may assume that they would be able to do a job in 1/N the time of one worker working alone.
■ Now, if we assume the strictly serial part of the program is performed in B × T(1) time,
■ then the strictly parallel part is performed in ((1 − B) × T(1)) / N time. With some substitution and manipulation, we get the formula for speedup as:
■ Speedup(N) = 1 / (B + (1 − B) / N)




Principles of Computer Design
■ Focus on the common case
■ Amdahl’s Law


Principles of Computer Design
■ The Processor Performance Equation
■ CPU time = Instruction count × Cycles per instruction (CPI) × Clock cycle time
■ Different instruction types have different CPIs:
■ CPU clock cycles = Σ (IC_i × CPI_i), summed over instruction types i


Principles
Pipelining
(Slides 40-90 contained figures and worked pipelining examples; the graphical content is not preserved in this export.)


Introduction
■ Pipelining became a universal technique in 1985
■ Overlaps execution of instructions
■ Exploits “instruction-level parallelism”
■ Beyond this, there are two main approaches:
■ Hardware-based dynamic approaches
■ Used in server and desktop processors
■ Not used as extensively in PMD (personal mobile device) processors
■ Compiler-based static approaches
■ Not as successful outside of scientific applications


Instruction-Level Parallelism
■ When exploiting instruction-level parallelism, the goal is to minimize CPI
■ Pipeline CPI =
■ Ideal pipeline CPI +
■ Structural stalls +
■ Data hazard stalls +
■ Control stalls
■ Parallelism within a basic block is limited
■ Typical size of a basic block = 3-6 instructions
■ Must optimize across branches


Data Dependence
■ Loop-level parallelism
■ Unroll loop statically or dynamically
■ Use SIMD (vector processors and GPUs)
■ Challenges: data dependence
■ Instruction j is data dependent on instruction i if
■ Instruction i produces a result that may be used by instruction j, or
■ Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i
■ Dependent instructions cannot be executed simultaneously


Data Dependence
■ Dependences are a property of programs
■ Pipeline organization determines if a dependence is detected and if it causes a stall
■ Data dependence conveys:
■ Possibility of a hazard
■ Order in which results must be calculated
■ Upper bound on exploitable instruction-level parallelism
■ Dependences that flow through memory locations are difficult to detect
Name Dependence
■ Two instructions use the same name but there is no flow of information
■ Not a true data dependence, but a problem when reordering instructions
■ Antidependence: instruction j writes a register or memory location that instruction i reads
■ Initial ordering (i before j) must be preserved
■ Output dependence: instruction i and instruction j write the same register or memory location
■ Ordering must be preserved
■ To resolve, use renaming techniques


Other Factors
■ Data hazards
■ Read after write (RAW)
■ Write after write (WAW)
■ Write after read (WAR)
■ Control dependence
■ Ordering of instruction i with respect to a branch instruction
■ An instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
■ An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch


Examples
■ Example 1: the OR instruction is dependent on DADDU and DSUBU
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
L:    ...
      OR    R7,R1,R8
■ Example 2: assume R4 isn’t used after skip; it is possible to move DSUBU before the branch
      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
skip: OR    R7,R8,R9


Compiler Techniques for Exposing ILP
■ Pipeline scheduling
■ Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
■ Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;


Compiler Techniques
Pipeline Stalls
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall              (assume integer load latency is 1)
      BNE    R1,R2,Loop


Compiler Techniques
Pipeline Scheduling
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop


Compiler Techniques
Loop Unrolling
■ Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program.
■ It basically removes or reduces iterations.
■ Loop unrolling increases the program’s speed by eliminating loop-control and loop-test instructions.


Compiler Techniques
Loop Unrolling
// This program does not use loop unrolling.
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("Hello\n");   // print hello 5 times
    return 0;
}

// This program uses loop unrolling.
#include <stdio.h>

int main(void)
{
    // unrolled the for loop in program 1
    printf("Hello\n");
    printf("Hello\n");
    printf("Hello\n");
    printf("Hello\n");
    printf("Hello\n");
    return 0;
}


Compiler Techniques
Loop Unrolling
■ Advantages:
■ Increases program efficiency.
■ Reduces loop overhead.
■ If statements in the loop body are not dependent on each other, they can be executed in parallel.
■ Disadvantages:
■ Increased program code size, which can be undesirable.
■ Possible increased register usage in a single iteration to store temporary variables, which may reduce performance.


Compiler Techniques
Loop Unrolling
■ Unroll by a factor of 4 (assume # elements is divisible by 4)
■ Eliminate unnecessary instructions
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)    ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)   ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1) ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
■ Note: compare the number of live registers vs. the original loop
Compiler Techniques
Loop Unrolling/Pipeline Scheduling
■ Pipeline schedule the unrolled loop:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop


Compiler Techniques
Strip Mining
■ Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD encodings of loops, as well as a means of improving memory performance.
■ By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:
■ It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
■ It reduces the number of iterations of the loop by a factor of the length of each vector, or the number of operations being performed per SIMD operation.


Compiler Techniques
Strip Mining
■ Unknown number of loop iterations?
■ Number of iterations = n
■ Goal: make k copies of the loop body
■ Generate a pair of loops:
■ The first executes n mod k times
■ The second executes n / k times
■ This is “strip mining”


Branch Prediction
■ Basic 2-bit predictor:
■ For each branch:
■ Predict taken or not taken
■ If the prediction is wrong two consecutive times, change the prediction


Branch Prediction
■ Correlating predictor:
■ The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.
■ It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches rather than just the branch we are trying to predict.
if (aa==2) aa=0;
if (bb==2) bb=0;
if (aa!=bb) {


Branch Prediction
■ Correlating predictor:
■ Let’s label these branches b1, b2, and b3.
■ The key observation is that the behavior of branch b3 is correlated with the behavior of branches b1 and b2.
■ Clearly, if branches b1 and b2 are both not taken (i.e., if the conditions both evaluate to true and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are clearly equal.
■ A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.
■ Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.


Branch Prediction
■ Tournament predictor:
■ Combine a correlating predictor with a local predictor


Branch Prediction Performance
(Figure: branch predictor performance. Graphical content not preserved in this export.)


Dynamic Scheduling
■ Rearrange order of instructions to reduce stalls while maintaining data flow
■ Done at run time, in hardware
■ Advantages:
■ Compiler doesn’t need knowledge of the microarchitecture
■ Handles cases where dependences are unknown at compile time
■ Disadvantages:
■ Substantial increase in hardware complexity
■ Complicates exceptions


Dynamic Scheduling
■ Dynamic scheduling implies:
■ Out-of-order execution
■ Out-of-order completion
■ Creates the possibility of WAR and WAW hazards
■ Tomasulo’s approach
■ Tracks when operands are available
■ Introduces register renaming in hardware
■ Minimizes WAW and WAR hazards
Register Renaming
■ Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
■ Antidependences: SUB.D writes F8, which ADD.D reads; MUL.D writes F6, which S.D reads
■ Plus a name (output) dependence on F6 between ADD.D and MUL.D


Register Renaming
■ Example, with temporaries S and T renamed in:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
■ Now only RAW hazards remain, which can be strictly ordered


Dynamic Scheduling: Tomasulo’s Algorithm
(Slides 117-169 stepped through Tomasulo’s algorithm with figures; the graphical content is not preserved in this export.)


Branch Prediction
Register Renaming
■ Register renaming is provided by reservation stations
(RS)
■ Contains:
■ The instruction
■ Buffered operand values (when available)
■ Reservation station number of instruction providing
the operand values
■ RS fetches and buffers an operand as soon as it becomes
available (not necessarily involving register file)
■ Pending instructions designate the RS to which they will send
their output
■ Result values broadcast on a result bus, called the common data bus (CDB)
■ Only the last output updates the register file
■ As instructions are issued, the register specifiers are renamed
with the reservation station
■ May be more reservation stations than registers
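As a concrete illustration of this renaming, here is a minimal Python sketch (not from the book; the `RS` class, `issue` helper, and register-status table are invented for the example). Issuing an instruction either copies a ready value from the register file or records the tag of the producing reservation station:

```python
# Sketch of reservation-station renaming. A register status table maps each
# architectural register to the RS that will produce its next value
# (None means the register file already holds the current value).

class RS:
    def __init__(self, name):
        self.name = name             # tag broadcast on the CDB when done
        self.Vj = self.Vk = None     # buffered operand values
        self.Qj = self.Qk = None     # producing-RS tags for pending operands

def issue(rs, src1, src2, dest, reg_status, regfile):
    """Issue an instruction to reservation station `rs`."""
    for field_v, field_q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        producer = reg_status.get(src)
        if producer is None:                  # value is in the register file
            setattr(rs, field_v, regfile[src])
        else:                                 # wait on the producing RS's tag
            setattr(rs, field_q, producer)
    reg_status[dest] = rs.name                # rename: dest maps to this RS

regfile = {"F2": 3.0, "F4": 4.0}
reg_status = {}
add1 = RS("Add1")
issue(add1, "F2", "F4", "F6", reg_status, regfile)   # F6 <- F2 + F4
mul1 = RS("Mult1")
issue(mul1, "F6", "F2", "F8", reg_status, regfile)   # F8 <- F6 * F2
print(reg_status["F6"])   # Add1: F6 has been renamed to reservation station Add1
print(mul1.Qj)            # Add1: Mult1 waits on Add1's tag, not on F6 itself
```

Note how the second instruction never names F6 again after issue, which is exactly why WAR and WAW hazards on F6 disappear.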
Tomasulo’s Algorithm
■ Load and store buffers
■ Contain data and addresses, act like
reservation stations

■ Top-level design:

Tomasulo’s Algorithm
■ Three Steps:
■ Issue
■ Get next instruction from FIFO queue
■ If a reservation station is available, issue the instruction to it, with any
operand values that are already available
■ If an operand value is not yet available, record the reservation station that
will produce it; if no reservation station is free, the instruction stalls
■ Execute
■ When an operand becomes available, store it in any reservation
stations waiting for it
■ When all operands are ready, begin executing the instruction
■ Loads and stores are maintained in program order through effective-address
calculation
■ No instruction is allowed to initiate execution until all branches that
precede it in program order have completed
■ Write result
■ Write result on CDB into reservation stations and store buffers
■ (Stores must wait until address and value are received)

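The write-result broadcast on the common data bus can be sketched the same way; in this illustrative model (the `broadcast_on_cdb` helper and dict-based stations are invented), every reservation station snoops the bus, and only the latest renaming updates the register file:

```python
# Sketch of the write-result step: a tagged value is broadcast on the CDB,
# and every waiting reservation station captures it if the tag matches its
# pending Qj/Qk field.

def broadcast_on_cdb(tag, value, stations, reg_status, regfile):
    for rs in stations:                      # all RSs snoop the common data bus
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    # only the *latest* renaming of a register updates the register file
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regfile[reg] = value
            reg_status[reg] = None

stations = [{"Qj": "Add1", "Vj": None, "Qk": None, "Vk": 2.0}]
reg_status = {"F6": "Add1"}
regfile = {"F6": 0.0}
broadcast_on_cdb("Add1", 7.0, stations, reg_status, regfile)
print(stations[0]["Vj"], regfile["F6"])   # 7.0 7.0
```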
Example

Hardware-Based Speculation
■ Execute instructions along predicted
execution paths but only commit the
results if prediction was correct
■ Instruction commit: allowing an instruction
to update the register file when instruction
is no longer speculative
■ Need an additional piece of hardware to
prevent any irrevocable action until an
instruction commits
■ I.e. updating state or taking an exception

Reorder Buffer
■ Reorder buffer – holds the result of an instruction between completion
and commit

■ Four fields:
■ Instruction type: branch/store/register
■ Destination field: register number
■ Value field: output value
■ Ready field: completed execution?

■ Modify reservation stations:


■ Operand source is now the reorder buffer instead of the functional unit
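A toy model of the entry layout and of in-order commit, assuming the four fields above (the `make_entry` and `commit` names are invented for the sketch):

```python
# Sketch of a reorder buffer: entries carry the four fields from the slide,
# and commit retires only the head entry, in program order.
from collections import deque

def make_entry(itype, dest):
    return {"type": itype, "dest": dest, "value": None, "ready": False}

def commit(rob, regfile):
    """Retire completed instructions from the head, in program order."""
    while rob and rob[0]["ready"]:
        e = rob.popleft()
        if e["type"] == "register":
            regfile[e["dest"]] = e["value"]   # architectural state updated here

rob = deque([make_entry("register", "R1"), make_entry("register", "R2")])
rob[1]["value"], rob[1]["ready"] = 5, True    # R2's result finished first...
regfile = {}
commit(rob, regfile)
print(regfile)        # {} : R2 cannot commit past the unfinished R1 entry
rob[0]["value"], rob[0]["ready"] = 3, True
commit(rob, regfile)
print(regfile)        # {'R1': 3, 'R2': 5}
```

On a misprediction, recovery is simply clearing the speculative tail of this queue before anything in it reaches the register file.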
Reorder Buffer
■ Register values and memory values are
not written until an instruction commits
■ On misprediction:
■ Speculated entries in ROB are cleared

■ Exceptions:
■ Not recognized until it is ready to commit

Multiple Issue and Static Scheduling
■ To achieve CPI < 1, need to complete
multiple instructions per clock

■ Solutions:
■ Statically scheduled superscalar processors
■ VLIW (very long instruction word) processors
■ Dynamically scheduled superscalar processors

Multiple Issue

VLIW Processors
■ Package multiple operations into one
instruction

■ Example VLIW processor:


■ One integer instruction (or branch)
■ Two independent floating-point operations
■ Two independent memory references

■ Must be enough parallelism in code to fill the available slots
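A greedy packer for this assumed five-slot layout illustrates why insufficient parallelism turns slots into no-ops (the slot names and the `pack_bundle` helper are invented for the sketch):

```python
# Sketch: pack independent operations into a 5-slot VLIW bundle matching the
# example processor above (1 integer/branch, 2 FP, 2 memory). Slots that
# cannot be filled become no-ops.
SLOTS = ("int", "fp", "fp", "mem", "mem")

def pack_bundle(ops):
    """ops: list of (slot_class, op_name). Greedy fill; leftovers wait."""
    bundle, pending = [], list(ops)
    for slot in SLOTS:
        for op in pending:
            if op[0] == slot:
                bundle.append(op[1])
                pending.remove(op)
                break
        else:
            bundle.append("nop")      # not enough parallelism for this slot
    return bundle, pending

bundle, left = pack_bundle([("fp", "ADD.D"), ("mem", "L.D"), ("int", "DADDIU")])
print(bundle)   # ['DADDIU', 'ADD.D', 'nop', 'L.D', 'nop']
```

With only three independent operations available, two of the five slots are wasted, which is the code-size disadvantage listed on the next slide.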
VLIW Processors
■ Disadvantages:
■ Statically finding parallelism
■ Code size
■ No hazard detection hardware
■ Binary code compatibility

Dynamic Scheduling, Multiple Issue, and Speculation

■ Modern microarchitectures:
■ Dynamic scheduling + multiple issue +
speculation

■ Two approaches:
■ Assign reservation stations and update
pipeline control table in half clock cycles
■ Only supports 2 instructions/clock
■ Design logic to handle any possible
dependencies between the instructions
■ Hybrid approaches

Overview of Design

Multiple Issue
■ Limit the number of instructions of a given
class that can be issued in a “bundle”
■ I.e. one FP, one integer, one load, one store

■ Examine all the dependencies among the instructions in the bundle

■ If dependencies exist in bundle, encode them in reservation stations

■ Also need multiple completion/commit


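A minimal sketch of the within-bundle dependence check (illustrative; the `bundle_dependencies` helper and its tuple encoding are assumptions), flagging the RAW dependencies that would be encoded in the reservation stations:

```python
# Sketch: detect RAW dependencies among the instructions of an issue bundle.

def bundle_dependencies(bundle):
    """bundle: list of (dest, srcs). Returns (later_idx, earlier_idx) pairs
    where a later instruction reads a register written earlier in the bundle."""
    deps = []
    for i, (dest_i, _) in enumerate(bundle):
        for j in range(i + 1, len(bundle)):
            _, srcs_j = bundle[j]
            if dest_i in srcs_j:
                deps.append((j, i))
    return deps

# LD R2,0(R1); DADDIU R2,R2,#1  -> second instruction depends on the first
print(bundle_dependencies([("R2", ("R1",)), ("R2", ("R2",))]))   # [(1, 0)]
```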
Example
Loop: LD R2,0(R1) ;R2=array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,LOOP ;branch if not last element

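In a high-level language the loop is roughly the following sketch (note that the slide's BNE exits when an incremented element equals R3, which the sketch mirrors):

```python
# Rough high-level equivalent of the MIPS loop above: walk an array,
# incrementing and storing back each element, until an incremented
# element equals the sentinel value in R3.
def loop(a, r3):
    i = 0
    while True:
        a[i] += 1           # LD / DADDIU / SD
        i += 1              # DADDIU R1,R1,#8 (advance pointer)
        if a[i - 1] == r3:  # BNE R2,R3,LOOP (exit when equal)
            break
    return a

print(loop([3, 7], 8))   # [4, 8]
```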
Example (No Speculation)

Example



Adv. Techniques for Instruction Delivery and Speculation
Branch-Target Buffer
■ Need high instruction bandwidth!
■ Branch-Target buffers
■ Next PC prediction buffer, indexed by current PC

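A branch-target buffer can be sketched as a small direct-mapped table indexed by the fetch PC; the size and helper names below are illustrative assumptions:

```python
# Sketch of a branch-target buffer: predict the next PC in the same cycle
# the instruction is fetched, before it is even decoded.
BTB_SIZE = 4

btb = [None] * BTB_SIZE      # each entry: (tag_pc, predicted_target)

def btb_predict(pc):
    entry = btb[pc % BTB_SIZE]
    if entry and entry[0] == pc:
        return entry[1]      # hit: predicted-taken target
    return pc + 4            # miss: fall through to the next instruction

def btb_update(pc, target):
    btb[pc % BTB_SIZE] = (pc, target)

btb_update(100, 40)          # branch at PC 100 was taken to 40
print(btb_predict(100))      # 40
print(btb_predict(104))      # 108 (no entry: predict fall-through)
```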
Branch Folding
■ Optimization:
■ Larger branch-target buffer
■ Add target instruction into buffer to deal with
longer decoding time required by larger buffer
■ “Branch folding”

Return Address Predictor
■ Most unconditional branches come from
function returns
■ The same procedure can be called from
multiple sites
■ Causes the buffer to potentially forget about
the return address from previous calls
■ Create return address buffer organized
as a stack

Copyright © 2012, Elsevier Inc. All rights reserved. 189
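A return-address stack sketch (illustrative; the depth and helper names are assumed): calls push the return PC and return instructions pop their prediction, so nested or repeated calls to the same procedure no longer clobber a single BTB entry:

```python
# Sketch of a return-address predictor organized as a stack.
ras = []
RAS_DEPTH = 8

def on_call(return_pc):
    if len(ras) < RAS_DEPTH:
        ras.append(return_pc)

def predict_return():
    return ras.pop() if ras else None   # None: fall back to the BTB

on_call(104)      # A calls B, return address 104
on_call(204)      # B is called again from elsewhere, return address 204
print(predict_return())   # 204
print(predict_return())   # 104
```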


Adv. Techniques for Instruction Delivery and Speculation
Integrated Instruction Fetch Unit
■ Design monolithic unit that performs:
■ Branch prediction
■ Instruction prefetch
■ Fetch ahead
■ Instruction memory access and buffering
■ Deal with crossing cache lines

Register Renaming
■ Register renaming vs. reorder buffers
■ Instead of virtual registers from reservation stations and
reorder buffer, create a single register pool
■ Contains visible registers and virtual registers
■ Use hardware-based map to rename registers during issue
■ WAW and WAR hazards are avoided
■ Speculation recovery occurs by copying during commit
■ Still need a ROB-like queue to update table in order
■ Simplifies commit:
■ Record that the mapping between architectural register and physical register
is no longer speculative
■ Free up physical register used to hold older value
■ In other words: SWAP physical registers on commit
■ Physical register de-allocation is more difficult

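A sketch of map-table renaming with a single physical register pool (the structures are invented for the example); note how the second write to r1 receives a fresh physical register, dissolving the WAW hazard:

```python
# Sketch: rename architectural registers through a map table backed by a
# free list of physical registers. Sources read the current mapping first;
# the destination gets a fresh physical register.
free_list = ["p4", "p5", "p6"]
map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}

def rename(dest, srcs):
    srcs_phys = [map_table[s] for s in srcs]   # read current mappings first
    new_phys = free_list.pop(0)                # fresh destination register
    old_phys = map_table[dest]                 # freed later, at commit
    map_table[dest] = new_phys
    return new_phys, srcs_phys, old_phys

# r1 <- r2 + r3 ; r1 <- r1 + r2   (WAW on r1 disappears after renaming)
print(rename("r1", ["r2", "r3"]))   # ('p4', ['p2', 'p3'], 'p1')
print(rename("r1", ["r1", "r2"]))   # ('p5', ['p4', 'p2'], 'p4')
```

The returned `old_phys` is what commit eventually returns to the free list, which is the de-allocation difficulty the slide mentions.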
Integrated Issue and Renaming
■ Combining instruction issue with register
renaming:
■ Issue logic pre-reserves enough physical
registers for the bundle (fixed number?)
■ Issue logic finds dependencies within
bundle, maps registers as necessary
■ Issue logic finds dependencies between
current bundle and already in-flight bundles,
maps registers as necessary

How Much?
■ How much to speculate
■ Mis-speculation degrades performance and
power relative to no speculation
■ May cause additional misses (cache, TLB)
■ Prevent speculative code from causing more costly misses (e.g. in the L2)

■ Speculating through multiple branches


■ Complicates speculation recovery
■ No processor can resolve multiple branches
per cycle
Energy Efficiency
■ Speculation and energy efficiency
■ Note: speculation is only energy efficient
when it significantly improves performance

■ Value prediction
■ Uses:
■ Loads that load from a constant pool
■ Instruction that produces a value from a small set
of values
■ Has not been incorporated into modern processors
■ A similar idea, address aliasing prediction, is used in some processors
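For illustration only (as the slide notes, value prediction has not shipped in real processors): a last-value predictor with a confidence counter, keyed by load PC; all names below are invented:

```python
# Sketch of a last-value predictor: predict a load's result from its history,
# but only once a confidence counter says the value is stable.
history = {}   # pc -> [last_value, confidence]

def predict(pc):
    v = history.get(pc)
    return v[0] if v and v[1] >= 2 else None   # predict only when confident

def train(pc, actual):
    v = history.setdefault(pc, [actual, 0])
    if v[0] == actual:
        v[1] = min(v[1] + 1, 3)
    else:
        history[pc] = [actual, 0]              # value changed: reset

for _ in range(3):
    train(0x40, 7)          # a load that keeps returning the same constant
print(predict(0x40))        # 7
train(0x40, 9)              # value changed: confidence resets
print(predict(0x40))        # None
```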
