The document discusses the fundamentals of computer architecture, focusing on performance improvements through advancements in semiconductor technology and computer architectures. It highlights the shift from single-processor performance to multi-processor systems, emphasizing various classes of parallelism and trends in technology, power, and cost. Key principles of computer design, including Amdahl's Law and pipelining, are also addressed to optimize performance.


PADP (18CS73)
Unit 1
Dr. Minal Moharir

Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 1: Fundamentals of Quantitative Design and Analysis

Copyright © 2012, Elsevier Inc. All rights reserved.


Introduction
Computer Technology
■ Performance improvements come from:
■ Improvements in semiconductor technology
■ Feature size, clock speed
■ Improvements in computer architectures
■ Enabled by HLL compilers and UNIX
■ Led to RISC architectures
■ Together these have enabled:
■ Lightweight computers
■ Productivity-based managed/interpreted programming languages


Introduction
Single Processor Performance
(Figure: growth in single-processor performance; annotations mark the rise of RISC and the move to multi-processor designs. Graphical content not preserved in this export.)


Introduction
Current Trends in Architecture
■ Cannot continue to leverage instruction-level parallelism (ILP) alone
■ Single-processor performance improvement ended in 2003
■ New models for performance:
■ Data-level parallelism (DLP)
■ Thread-level parallelism (TLP)
■ Request-level parallelism (RLP)
■ These require explicit restructuring of the application


Classes of Computers
Parallelism
Parallelism at multiple levels is now the driving force of computer design across all four classes of computers, with energy and cost being the primary constraints.
■ Classes of parallelism in applications:
■ Data-Level Parallelism (DLP)
■ Task-Level Parallelism (TLP)
■ Classes of architectural parallelism:
■ Instruction-Level Parallelism (ILP)
■ Vector architectures/Graphics Processor Units (GPUs)
■ Thread-Level Parallelism
■ Request-Level Parallelism


Parallelism
Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:
■ 1. Instruction-Level Parallelism exploits DLP, e.g. via pipelining and speculative execution.
■ 2. Vector Architectures and Graphics Processor Units (GPUs) exploit DLP by applying a single instruction to a collection of data in parallel.
■ 3. Thread-Level Parallelism exploits either DLP or TLP in a tightly coupled hardware model that allows for interaction among parallel threads.
■ 4. Request-Level Parallelism exploits parallelism among largely decoupled tasks specified by the programmer or the operating system.


Classes of Computers
Flynn’s Taxonomy
■ Single instruction stream, single data stream (SISD)
■ Single instruction stream, multiple data streams (SIMD)
■ Vector architectures
■ Multimedia extensions
■ Graphics processor units
■ Multiple instruction streams, single data stream (MISD)
■ No commercial implementation
■ Multiple instruction streams, multiple data streams (MIMD)
■ Tightly coupled MIMD
■ Loosely coupled MIMD: cluster computing


Defining Computer Architecture
■ “Old” view of computer architecture:
■ Instruction Set Architecture (ISA) design
■ i.e. decisions regarding:
■ registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
■ “Real” computer architecture:
■ Specific requirements of the target machine
■ Design to maximize performance within constraints: cost, power, and availability
■ Includes ISA, microarchitecture, hardware


Technology
■ If an ISA is to be successful, it must be designed to survive rapid changes in computer technology. A successful ISA may last decades: for example, the core of the IBM mainframe ISA has been in use for nearly 50 years.
■ An architect must plan for technology changes that can increase the lifetime of a successful computer.
■ To plan for the evolution of a computer, the designer must be aware of rapid changes in implementation technology.
■ Five implementation technologies, which change at a dramatic pace, are critical to modern implementations: integrated circuit logic technology, semiconductor DRAM, semiconductor flash, magnetic disk technology, and network technology.


Trends in Technology
■ Integrated circuit technology
■ Transistor density: 35%/year
■ Die size: 10-20%/year
■ Integration overall: 40-55%/year
■ DRAM capacity: 25-40%/year (slowing)
■ Flash capacity: 50-60%/year
■ 15-20X cheaper/bit than DRAM
■ Magnetic disk technology: 40%/year
■ 15-25X cheaper/bit than Flash
■ 300-500X cheaper/bit than DRAM


Bandwidth and Latency
■ Bandwidth or throughput
■ Total work done in a given time
■ 10,000-25,000X improvement for processors
■ 300-1200X improvement for memory and disks
■ Latency or response time
■ Time between start and completion of an event
■ 30-80X improvement for processors
■ 6-8X improvement for memory and disks


Bandwidth and Latency
(Figure: log-log plot of bandwidth and latency milestones. Graphical content not preserved in this export.)


Transistors and Wires
■ Feature size
■ Minimum size of a transistor or wire in the x or y dimension
■ 10 microns in 1971 to 0.032 microns in 2011
■ Transistor performance scales linearly
■ Wire delay does not improve with feature size!
■ Integration density scales quadratically


Trends in Power and Energy
Power and Energy
■ Problem: get power in, get power out
■ Thermal Design Power (TDP)
■ Characterizes sustained power consumption
■ Used as target for power supply and cooling system
■ Lower than peak power, higher than average power consumption
■ Clock rate can be reduced dynamically to limit power consumption
■ Energy per task is often a better measurement
Trends in Power and Energy
Dynamic Energy and Power
■ Dynamic energy (per transistor switch, 0 -> 1 or 1 -> 0)
■ ½ × Capacitive load × Voltage²
■ Dynamic power
■ ½ × Capacitive load × Voltage² × Frequency switched
■ Reducing clock rate reduces power, not energy
■ For example, processor A may have a 20% higher average power consumption than processor B, but if A executes the task in only 70% of the time needed by B, its energy consumption is 1.2 × 0.7 = 0.84 of B’s, which is clearly better.




Trends in Power and Energy
Power
■ Intel 80386 consumed ~2 W
■ 3.3 GHz Intel Core i7 consumes 130 W
■ Heat must be dissipated from a 1.5 x 1.5 cm chip
■ This is the limit of what can be cooled by air


Trends in Power and Energy
Reducing Power
■ Techniques for reducing power:
■ Do nothing well
■ Most microprocessors today turn off the clock of inactive modules to save energy and dynamic power. For example, if no floating-point instructions are executing, the clock of the floating-point unit is disabled.
■ Dynamic voltage-frequency scaling (DVFS)
■ There are periods of low activity where there is no need to operate at the highest clock frequency and voltage. Modern microprocessors typically offer a few clock frequencies and voltages that use lower power and energy.


■ Low-power states for DRAM and disks:
■ Low-power modes to save energy
■ Overclocking, turning off cores:
■ Turbo mode: with some cores turned off, the remaining cores can run at a higher clock rate until the temperature rises beyond a certain threshold; as a result, performance varies with time.


Trends in Power and Energy
Static Power
■ Static power consumption
■ Current_static × Voltage
■ Scales with number of transistors
■ To reduce: power gating
■ Power gating is a technique used in integrated circuit design to reduce power consumption by shutting off the current to blocks of the circuit that are not in use.


Trends in Cost
■ Cost driven down by the learning curve
■ Yield
■ DRAM: price closely tracks cost
■ Microprocessors: price depends on volume
■ 10% less for each doubling of volume


Trends in Cost
Integrated Circuit Cost
■ Integrated circuit die yield (Bose-Einstein formula):
■ Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
■ Defects per unit area = 0.016-0.057 defects per square cm (2010)
■ N = process-complexity factor = 11.5-15.5 (40 nm, 2010)




Dependability
■ Module reliability
■ Mean time to failure (MTTF)
■ Mean time to repair (MTTR)
■ Mean time between failures (MTBF) = MTTF + MTTR
■ Availability = MTTF / MTBF




Measuring Performance
■ Typical performance metrics:
■ Response time
■ Throughput
■ Speedup of X relative to Y
■ Speedup = Execution time_Y / Execution time_X
■ Execution time
■ Wall clock time: includes all system overheads
■ CPU time: only computation time
■ Benchmarks
■ Kernels (e.g. matrix multiply)
■ Toy programs (e.g. sorting)
■ Synthetic benchmarks (e.g. Dhrystone)
■ Benchmark suites (e.g. SPEC06fp, TPC-C)


Principles of Computer Design
■ Take advantage of parallelism
■ e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
■ Principle of locality
■ Reuse of data and instructions
■ Temporal locality states that recently accessed items are likely to be accessed in the near future.
■ Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.


Amdahl’s Law
■ Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):
■ Speedup = T(1) / T(j)
■ where T(j) is the time it takes to execute the program when using j processors.


Amdahl’s Law
■ If there are N workers working on a project, we may assume that they would be able to do a job in 1/N the time of one worker working alone.
■ Now, if we assume the strictly serial part of the program is performed in B × T(1) time,
■ then the strictly parallel part is performed in ((1 − B) × T(1)) / N time. With some substitution and manipulation, we get the formula for speedup as:
■ Speedup(N) = 1 / (B + (1 − B) / N)




Principles of Computer Design
■ Focus on the common case
■ Amdahl’s Law


Principles of Computer Design
■ The Processor Performance Equation
■ CPU time = Instruction count × Cycles per instruction (CPI) × Clock cycle time
■ Different instruction types have different CPIs:
■ CPU clock cycles = Σ (IC_i × CPI_i), summed over instruction types i


Principles
Pipelining
(Slides 40-90 contained figures and worked pipelining examples; the graphical content is not preserved in this export.)


Introduction
■ Pipelining became a universal technique in 1985
■ Overlaps execution of instructions
■ Exploits “instruction-level parallelism”
■ Beyond this, there are two main approaches:
■ Hardware-based dynamic approaches
■ Used in server and desktop processors
■ Not used as extensively in PMD (personal mobile device) processors
■ Compiler-based static approaches
■ Not as successful outside of scientific applications


Instruction-Level Parallelism
■ When exploiting instruction-level parallelism, the goal is to minimize CPI
■ Pipeline CPI =
■ Ideal pipeline CPI +
■ Structural stalls +
■ Data hazard stalls +
■ Control stalls
■ Parallelism within a basic block is limited
■ Typical size of a basic block = 3-6 instructions
■ Must optimize across branches


Data Dependence
■ Loop-level parallelism
■ Unroll loop statically or dynamically
■ Use SIMD (vector processors and GPUs)
■ Challenges: data dependence
■ Instruction j is data dependent on instruction i if
■ Instruction i produces a result that may be used by instruction j, or
■ Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i
■ Dependent instructions cannot be executed simultaneously


Data Dependence
■ Dependences are a property of programs
■ Pipeline organization determines if a dependence is detected and if it causes a stall
■ Data dependence conveys:
■ Possibility of a hazard
■ Order in which results must be calculated
■ Upper bound on exploitable instruction-level parallelism
■ Dependences that flow through memory locations are difficult to detect
Name Dependence
■ Two instructions use the same name but there is no flow of information
■ Not a true data dependence, but a problem when reordering instructions
■ Antidependence: instruction j writes a register or memory location that instruction i reads
■ Initial ordering (i before j) must be preserved
■ Output dependence: instruction i and instruction j write the same register or memory location
■ Ordering must be preserved
■ To resolve, use renaming techniques


Other Factors
■ Data hazards
■ Read after write (RAW)
■ Write after write (WAW)
■ Write after read (WAR)
■ Control dependence
■ Ordering of instruction i with respect to a branch instruction
■ An instruction control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
■ An instruction not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch


Examples
■ Example 1: the OR instruction is dependent on DADDU and DSUBU
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R1,R6
L:    ...
      OR    R7,R1,R8
■ Example 2: assume R4 isn’t used after skip; it is possible to move DSUBU before the branch
      DADDU R1,R2,R3
      BEQZ  R12,skip
      DSUBU R4,R5,R6
      DADDU R5,R4,R9
skip: OR    R7,R8,R9


Compiler Techniques for Exposing ILP
■ Pipeline scheduling
■ Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
■ Example:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;


Compiler Techniques
Pipeline Stalls
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall              (assume integer load latency is 1)
      BNE    R1,R2,Loop


Compiler Techniques
Pipeline Scheduling
Scheduled code:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)
      BNE    R1,R2,Loop


Compiler Techniques
Loop Unrolling
■ Loop unrolling is a loop transformation technique that helps to optimize the execution time of a program.
■ It basically removes or reduces iterations.
■ Loop unrolling increases the program’s speed by eliminating loop-control and loop-test instructions.


Compiler Techniques
Loop Unrolling
// This program does not use loop unrolling.
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("Hello\n");   // print hello 5 times
    return 0;
}

// This program uses loop unrolling.
#include <stdio.h>

int main(void)
{
    // unrolled the for loop in program 1
    printf("Hello\n");
    printf("Hello\n");
    printf("Hello\n");
    printf("Hello\n");
    printf("Hello\n");
    return 0;
}


Compiler Techniques
Loop Unrolling
■ Advantages:
■ Increases program efficiency.
■ Reduces loop overhead.
■ If statements in the loop body are not dependent on each other, they can be executed in parallel.
■ Disadvantages:
■ Increased program code size, which can be undesirable.
■ Possible increased register usage in a single iteration to store temporary variables, which may reduce performance.


Compiler Techniques
Loop Unrolling
■ Unroll by a factor of 4 (assume # elements is divisible by 4)
■ Eliminate unnecessary instructions
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)    ;drop DADDUI & BNE
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)   ;drop DADDUI & BNE
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1) ;drop DADDUI & BNE
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop
■ Note: compare the number of live registers vs. the original loop
Compiler Techniques
Loop Unrolling/Pipeline Scheduling
■ Pipeline schedule the unrolled loop:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop


Compiler Techniques
Strip Mining
■ Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD encodings of loops, as well as a means of improving memory performance.
■ By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:
■ It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
■ It reduces the number of iterations of the loop by a factor of the length of each vector, or the number of operations being performed per SIMD operation.


Compiler Techniques
Strip Mining
■ Unknown number of loop iterations?
■ Number of iterations = n
■ Goal: make k copies of the loop body
■ Generate a pair of loops:
■ The first executes n mod k times
■ The second executes n / k times
■ This is “strip mining”


Branch Prediction
■ Basic 2-bit predictor:
■ For each branch:
■ Predict taken or not taken
■ If the prediction is wrong two consecutive times, change the prediction


Branch Prediction
■ Correlating predictor:
■ The 2-bit predictor schemes use only the recent behavior of a single branch to predict the future behavior of that branch.
■ It may be possible to improve the prediction accuracy if we also look at the recent behavior of other branches rather than just the branch we are trying to predict.
if (aa==2) aa=0;
if (bb==2) bb=0;
if (aa!=bb) {


Branch Prediction
■ Correlating predictor:
■ Let’s label these branches b1, b2, and b3.
■ The key observation is that the behavior of branch b3 is correlated with the behavior of branches b1 and b2.
■ Clearly, if branches b1 and b2 are both not taken (i.e., if the conditions both evaluate to true and aa and bb are both assigned 0), then b3 will be taken, since aa and bb are clearly equal.
■ A predictor that uses only the behavior of a single branch to predict the outcome of that branch can never capture this behavior.
■ Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.


Branch Prediction
■ Tournament predictor:
■ Combine a correlating predictor with a local predictor


Branch Prediction Performance
(Figure: branch predictor performance. Graphical content not preserved in this export.)


Dynamic Scheduling
■ Rearrange order of instructions to reduce stalls while maintaining data flow
■ Done at run time, in hardware
■ Advantages:
■ Compiler doesn’t need knowledge of the microarchitecture
■ Handles cases where dependences are unknown at compile time
■ Disadvantages:
■ Substantial increase in hardware complexity
■ Complicates exceptions


Dynamic Scheduling
■ Dynamic scheduling implies:
■ Out-of-order execution
■ Out-of-order completion
■ Creates the possibility of WAR and WAW hazards
■ Tomasulo’s approach
■ Tracks when operands are available
■ Introduces register renaming in hardware
■ Minimizes WAW and WAR hazards
Register Renaming
■ Example:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D   F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
■ Antidependences: SUB.D writes F8, which ADD.D reads; MUL.D writes F6, which S.D reads
■ Plus a name (output) dependence on F6 between ADD.D and MUL.D


Register Renaming
■ Example, with temporaries S and T renamed in:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D   S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
■ Now only RAW hazards remain, which can be strictly ordered


Dynamic Scheduling: Tomasulo’s Algorithm
(Slides 117-169 stepped through Tomasulo’s algorithm with figures; the graphical content is not preserved in this export.)


Branch Prediction
Register Renaming
■ Register renaming is provided by reservation stations
(RS)
■ Contains:
■ The instruction
■ Buffered operand values (when available)
■ Reservation station number of instruction providing
the operand values
■ RS fetches and buffers an operand as soon as it becomes
available (not necessarily involving register file)
■ Pending instructions designate the RS to which they will send
their output
■ Result values broadcast on a result bus, called the common data bus (CDB)
■ Only the last output updates the register file
■ As instructions are issued, the register specifiers are renamed
with the reservation station
■ May be more reservation stations than registers
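As a concrete illustration of this renaming, here is a minimal Python sketch (not from the book; the `RS` class, `issue` helper, and register-status table are invented for the example). Issuing an instruction either copies a ready value from the register file or records the tag of the producing reservation station:

```python
# Sketch of reservation-station renaming. A register status table maps each
# architectural register to the RS that will produce its next value
# (None means the register file already holds the current value).

class RS:
    def __init__(self, name):
        self.name = name             # tag broadcast on the CDB when done
        self.Vj = self.Vk = None     # buffered operand values
        self.Qj = self.Qk = None     # producing-RS tags for pending operands

def issue(rs, src1, src2, dest, reg_status, regfile):
    """Issue an instruction to reservation station `rs`."""
    for field_v, field_q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        producer = reg_status.get(src)
        if producer is None:                  # value is in the register file
            setattr(rs, field_v, regfile[src])
        else:                                 # wait on the producing RS's tag
            setattr(rs, field_q, producer)
    reg_status[dest] = rs.name                # rename: dest maps to this RS

regfile = {"F2": 3.0, "F4": 4.0}
reg_status = {}
add1 = RS("Add1")
issue(add1, "F2", "F4", "F6", reg_status, regfile)   # F6 <- F2 + F4
mul1 = RS("Mult1")
issue(mul1, "F6", "F2", "F8", reg_status, regfile)   # F8 <- F6 * F2
print(reg_status["F6"])   # Add1: F6 has been renamed to reservation station Add1
print(mul1.Qj)            # Add1: Mult1 waits on Add1's tag, not on F6 itself
```

Note how the second instruction never names F6 again after issue, which is exactly why WAR and WAW hazards on F6 disappear.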
Tomasulo’s Algorithm
■ Load and store buffers
■ Contain data and addresses, act like
reservation stations

■ Top-level design:

Tomasulo’s Algorithm
■ Three Steps:
■ Issue
■ Get next instruction from FIFO queue
■ If a reservation station is available, issue the instruction to it, with any
operand values that are already available
■ If an operand value is not yet available, record the reservation station that
will produce it; if no reservation station is free, the instruction stalls
■ Execute
■ When an operand becomes available, store it in any reservation
stations waiting for it
■ When all operands are ready, begin executing the instruction
■ Loads and stores are maintained in program order through effective-address
calculation
■ No instruction is allowed to initiate execution until all branches that
precede it in program order have completed
■ Write result
■ Write result on CDB into reservation stations and store buffers
■ (Stores must wait until address and value are received)

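The write-result broadcast on the common data bus can be sketched the same way; in this illustrative model (the `broadcast_on_cdb` helper and dict-based stations are invented), every reservation station snoops the bus, and only the latest renaming updates the register file:

```python
# Sketch of the write-result step: a tagged value is broadcast on the CDB,
# and every waiting reservation station captures it if the tag matches its
# pending Qj/Qk field.

def broadcast_on_cdb(tag, value, stations, reg_status, regfile):
    for rs in stations:                      # all RSs snoop the common data bus
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    # only the *latest* renaming of a register updates the register file
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regfile[reg] = value
            reg_status[reg] = None

stations = [{"Qj": "Add1", "Vj": None, "Qk": None, "Vk": 2.0}]
reg_status = {"F6": "Add1"}
regfile = {"F6": 0.0}
broadcast_on_cdb("Add1", 7.0, stations, reg_status, regfile)
print(stations[0]["Vj"], regfile["F6"])   # 7.0 7.0
```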
Example

Hardware-Based Speculation
■ Execute instructions along predicted
execution paths but only commit the
results if prediction was correct
■ Instruction commit: allowing an instruction
to update the register file when instruction
is no longer speculative
■ Need an additional piece of hardware to
prevent any irrevocable action until an
instruction commits
■ I.e. updating state or taking an exception

Reorder Buffer
■ Reorder buffer – holds the result of an instruction between completion
and commit

■ Four fields:
■ Instruction type: branch/store/register
■ Destination field: register number
■ Value field: output value
■ Ready field: completed execution?

■ Modify reservation stations:


■ Operand source is now the reorder buffer instead of the functional unit
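A toy model of the entry layout and of in-order commit, assuming the four fields above (the `make_entry` and `commit` names are invented for the sketch):

```python
# Sketch of a reorder buffer: entries carry the four fields from the slide,
# and commit retires only the head entry, in program order.
from collections import deque

def make_entry(itype, dest):
    return {"type": itype, "dest": dest, "value": None, "ready": False}

def commit(rob, regfile):
    """Retire completed instructions from the head, in program order."""
    while rob and rob[0]["ready"]:
        e = rob.popleft()
        if e["type"] == "register":
            regfile[e["dest"]] = e["value"]   # architectural state updated here

rob = deque([make_entry("register", "R1"), make_entry("register", "R2")])
rob[1]["value"], rob[1]["ready"] = 5, True    # R2's result finished first...
regfile = {}
commit(rob, regfile)
print(regfile)        # {} : R2 cannot commit past the unfinished R1 entry
rob[0]["value"], rob[0]["ready"] = 3, True
commit(rob, regfile)
print(regfile)        # {'R1': 3, 'R2': 5}
```

On a misprediction, recovery is simply clearing the speculative tail of this queue before anything in it reaches the register file.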
Reorder Buffer
■ Register values and memory values are
not written until an instruction commits
■ On misprediction:
■ Speculated entries in ROB are cleared

■ Exceptions:
■ Not recognized until it is ready to commit

Multiple Issue and Static Scheduling
■ To achieve CPI < 1, need to complete
multiple instructions per clock

■ Solutions:
■ Statically scheduled superscalar processors
■ VLIW (very long instruction word) processors
■ Dynamically scheduled superscalar processors

Multiple Issue

VLIW Processors
■ Package multiple operations into one
instruction

■ Example VLIW processor:


■ One integer instruction (or branch)
■ Two independent floating-point operations
■ Two independent memory references

■ Must be enough parallelism in code to fill the available slots
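A greedy packer for this assumed five-slot layout illustrates why insufficient parallelism turns slots into no-ops (the slot names and the `pack_bundle` helper are invented for the sketch):

```python
# Sketch: pack independent operations into a 5-slot VLIW bundle matching the
# example processor above (1 integer/branch, 2 FP, 2 memory). Slots that
# cannot be filled become no-ops.
SLOTS = ("int", "fp", "fp", "mem", "mem")

def pack_bundle(ops):
    """ops: list of (slot_class, op_name). Greedy fill; leftovers wait."""
    bundle, pending = [], list(ops)
    for slot in SLOTS:
        for op in pending:
            if op[0] == slot:
                bundle.append(op[1])
                pending.remove(op)
                break
        else:
            bundle.append("nop")      # not enough parallelism for this slot
    return bundle, pending

bundle, left = pack_bundle([("fp", "ADD.D"), ("mem", "L.D"), ("int", "DADDIU")])
print(bundle)   # ['DADDIU', 'ADD.D', 'nop', 'L.D', 'nop']
```

With only three independent operations available, two of the five slots are wasted, which is the code-size disadvantage listed on the next slide.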
VLIW Processors
■ Disadvantages:
■ Statically finding parallelism
■ Code size
■ No hazard detection hardware
■ Binary code compatibility

Dynamic Scheduling, Multiple Issue, and Speculation

■ Modern microarchitectures:
■ Dynamic scheduling + multiple issue +
speculation

■ Two approaches:
■ Assign reservation stations and update
pipeline control table in half clock cycles
■ Only supports 2 instructions/clock
■ Design logic to handle any possible
dependencies between the instructions
■ Hybrid approaches

Overview of Design

Multiple Issue
■ Limit the number of instructions of a given
class that can be issued in a “bundle”
■ I.e. one FP, one integer, one load, one store

■ Examine all the dependencies among the instructions in the bundle

■ If dependencies exist in bundle, encode them in reservation stations

■ Also need multiple completion/commit


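A minimal sketch of the within-bundle dependence check (illustrative; the `bundle_dependencies` helper and its tuple encoding are assumptions), flagging the RAW dependencies that would be encoded in the reservation stations:

```python
# Sketch: detect RAW dependencies among the instructions of an issue bundle.

def bundle_dependencies(bundle):
    """bundle: list of (dest, srcs). Returns (later_idx, earlier_idx) pairs
    where a later instruction reads a register written earlier in the bundle."""
    deps = []
    for i, (dest_i, _) in enumerate(bundle):
        for j in range(i + 1, len(bundle)):
            _, srcs_j = bundle[j]
            if dest_i in srcs_j:
                deps.append((j, i))
    return deps

# LD R2,0(R1); DADDIU R2,R2,#1  -> second instruction depends on the first
print(bundle_dependencies([("R2", ("R1",)), ("R2", ("R2",))]))   # [(1, 0)]
```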
Example
Loop: LD R2,0(R1) ;R2=array element
DADDIU R2,R2,#1 ;increment R2
SD R2,0(R1) ;store result
DADDIU R1,R1,#8 ;increment pointer
BNE R2,R3,LOOP ;branch if not last element

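In a high-level language the loop is roughly the following sketch (note that the slide's BNE exits when an incremented element equals R3, which the sketch mirrors):

```python
# Rough high-level equivalent of the MIPS loop above: walk an array,
# incrementing and storing back each element, until an incremented
# element equals the sentinel value in R3.
def loop(a, r3):
    i = 0
    while True:
        a[i] += 1           # LD / DADDIU / SD
        i += 1              # DADDIU R1,R1,#8 (advance pointer)
        if a[i - 1] == r3:  # BNE R2,R3,LOOP (exit when equal)
            break
    return a

print(loop([3, 7], 8))   # [4, 8]
```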
Example (No Speculation)

Example



Adv. Techniques for Instruction Delivery and Speculation
Branch-Target Buffer
■ Need high instruction bandwidth!
■ Branch-Target buffers
■ Next PC prediction buffer, indexed by current PC

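A branch-target buffer can be sketched as a small direct-mapped table indexed by the fetch PC; the size and helper names below are illustrative assumptions:

```python
# Sketch of a branch-target buffer: predict the next PC in the same cycle
# the instruction is fetched, before it is even decoded.
BTB_SIZE = 4

btb = [None] * BTB_SIZE      # each entry: (tag_pc, predicted_target)

def btb_predict(pc):
    entry = btb[pc % BTB_SIZE]
    if entry and entry[0] == pc:
        return entry[1]      # hit: predicted-taken target
    return pc + 4            # miss: fall through to the next instruction

def btb_update(pc, target):
    btb[pc % BTB_SIZE] = (pc, target)

btb_update(100, 40)          # branch at PC 100 was taken to 40
print(btb_predict(100))      # 40
print(btb_predict(104))      # 108 (no entry: predict fall-through)
```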
Branch Folding
■ Optimization:
■ Larger branch-target buffer
■ Add target instruction into buffer to deal with
longer decoding time required by larger buffer
■ “Branch folding”

Return Address Predictor
■ Most unconditional branches come from
function returns
■ The same procedure can be called from
multiple sites
■ Causes the buffer to potentially forget about
the return address from previous calls
■ Create return address buffer organized
as a stack

Copyright © 2012, Elsevier Inc. All rights reserved. 189
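A return-address stack sketch (illustrative; the depth and helper names are assumed): calls push the return PC and return instructions pop their prediction, so nested or repeated calls to the same procedure no longer clobber a single BTB entry:

```python
# Sketch of a return-address predictor organized as a stack.
ras = []
RAS_DEPTH = 8

def on_call(return_pc):
    if len(ras) < RAS_DEPTH:
        ras.append(return_pc)

def predict_return():
    return ras.pop() if ras else None   # None: fall back to the BTB

on_call(104)      # A calls B, return address 104
on_call(204)      # B is called again from elsewhere, return address 204
print(predict_return())   # 204
print(predict_return())   # 104
```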


Adv. Techniques for Instruction Delivery and Speculation
Integrated Instruction Fetch Unit
■ Design monolithic unit that performs:
■ Branch prediction
■ Instruction prefetch
■ Fetch ahead
■ Instruction memory access and buffering
■ Deal with crossing cache lines

Register Renaming
■ Register renaming vs. reorder buffers
■ Instead of virtual registers from reservation stations and
reorder buffer, create a single register pool
■ Contains visible registers and virtual registers
■ Use hardware-based map to rename registers during issue
■ WAW and WAR hazards are avoided
■ Speculation recovery occurs by copying during commit
■ Still need a ROB-like queue to update table in order
■ Simplifies commit:
■ Record that the mapping between architectural register and physical register
is no longer speculative
■ Free up physical register used to hold older value
■ In other words: SWAP physical registers on commit
■ Physical register de-allocation is more difficult

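A sketch of map-table renaming with a single physical register pool (the structures are invented for the example); note how the second write to r1 receives a fresh physical register, dissolving the WAW hazard:

```python
# Sketch: rename architectural registers through a map table backed by a
# free list of physical registers. Sources read the current mapping first;
# the destination gets a fresh physical register.
free_list = ["p4", "p5", "p6"]
map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}

def rename(dest, srcs):
    srcs_phys = [map_table[s] for s in srcs]   # read current mappings first
    new_phys = free_list.pop(0)                # fresh destination register
    old_phys = map_table[dest]                 # freed later, at commit
    map_table[dest] = new_phys
    return new_phys, srcs_phys, old_phys

# r1 <- r2 + r3 ; r1 <- r1 + r2   (WAW on r1 disappears after renaming)
print(rename("r1", ["r2", "r3"]))   # ('p4', ['p2', 'p3'], 'p1')
print(rename("r1", ["r1", "r2"]))   # ('p5', ['p4', 'p2'], 'p4')
```

The returned `old_phys` is what commit eventually returns to the free list, which is the de-allocation difficulty the slide mentions.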
Integrated Issue and Renaming
■ Combining instruction issue with register
renaming:
■ Issue logic pre-reserves enough physical
registers for the bundle (fixed number?)
■ Issue logic finds dependencies within
bundle, maps registers as necessary
■ Issue logic finds dependencies between
current bundle and already in-flight bundles,
maps registers as necessary

How Much?
■ How much to speculate
■ Mis-speculation degrades performance and
power relative to no speculation
■ May cause additional misses (cache, TLB)
■ Prevent speculative code from causing more costly misses (e.g. in the L2)

■ Speculating through multiple branches


■ Complicates speculation recovery
■ No processor can resolve multiple branches
per cycle
Energy Efficiency
■ Speculation and energy efficiency
■ Note: speculation is only energy efficient
when it significantly improves performance

■ Value prediction
■ Uses:
■ Loads that load from a constant pool
■ Instruction that produces a value from a small set
of values
■ Has not been incorporated into modern processors
■ A similar idea, address aliasing prediction, is used in some processors
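For illustration only (as the slide notes, value prediction has not shipped in real processors): a last-value predictor with a confidence counter, keyed by load PC; all names below are invented:

```python
# Sketch of a last-value predictor: predict a load's result from its history,
# but only once a confidence counter says the value is stable.
history = {}   # pc -> [last_value, confidence]

def predict(pc):
    v = history.get(pc)
    return v[0] if v and v[1] >= 2 else None   # predict only when confident

def train(pc, actual):
    v = history.setdefault(pc, [actual, 0])
    if v[0] == actual:
        v[1] = min(v[1] + 1, 3)
    else:
        history[pc] = [actual, 0]              # value changed: reset

for _ in range(3):
    train(0x40, 7)          # a load that keeps returning the same constant
print(predict(0x40))        # 7
train(0x40, 9)              # value changed: confidence resets
print(predict(0x40))        # None
```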
