Computer Architecture
A Quantitative Approach, Sixth Edition
Chapter 3
Instruction-Level
Parallelism and Its
Exploitation
Copyright © 2019, Elsevier Inc. All Rights Reserved
Unpipelined Architecture
Unpipelined: start and finish a job before moving to the next
[Figure: jobs laid out along the time axis; each job completes Fetch, Decode, and Execute before the next job starts]
Pipelined Architecture
Pipelined: break the job into smaller stages
[Figure: four jobs overlapped along the time axis in F (fetch), D (decode), X (execute) stages]
5-Stage Pipeline
To enable pipelining we need to hold the input to each stage stable; this requires latching data and control signals at each stage in the pipeline → Pipeline Registers
Clocks and Latches
[Figure: pipeline stages separated by latches (L), all driven by a common clock (Clk)]
• Unpipelined: time to execute one instruction = T + Tovh
• For an N-stage pipeline, time per stage = T/N + Tovh
• Total time per instruction = N (T/N + Tovh) = T + N Tovh
• Clock cycle time = T/N + Tovh
• Clock speed = 1 / (T/N + Tovh)
• Ideal speedup = (T + Tovh) / (T/N + Tovh)
• Cycles to complete one instruction = N
• Average CPI (cycles per instr) = 1
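A quick check of the formulas above, using hypothetical numbers (T = 10 ns of combinational logic, Tovh = 0.5 ns of latch/clock overhead, N = 5 stages — these values are illustrative, not from the slides):

```python
# Pipeline timing formulas from the slide, as a small sketch.
def pipeline_metrics(T, Tovh, N):
    cycle_time = T / N + Tovh            # clock cycle time
    time_per_instr = N * cycle_time      # = T + N * Tovh
    speedup = (T + Tovh) / cycle_time    # ideal speedup vs. unpipelined
    return cycle_time, time_per_instr, speedup

cycle, latency, speedup = pipeline_metrics(T=10.0, Tovh=0.5, N=5)
print(cycle)    # 2.5  -> ns per cycle
print(latency)  # 12.5 -> ns to complete one instruction
print(speedup)  # 4.2  -> less than the ideal 5x because of latch overhead
```

Note how the latch overhead Tovh keeps the speedup below the ideal factor of N.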
A 5-Stage Pipeline
[Figure: the 5-stage pipeline datapath]
A 5-Stage Pipeline
Use the PC to access the I-cache and increment PC by 4
A 5-Stage Pipeline
Read registers, compare registers, compute branch target; for now, assume branches take 2 cycles (there is enough work that branches can easily take more)
A 5-Stage Pipeline
ALU computation, effective address computation for load/store
A 5-Stage Pipeline
Memory access to/from the data cache; stores finish in 4 cycles
A 5-Stage Pipeline
Write result of ALU computation or load into register file
Introduction
• Pipelining became a universal technique by 1985
• Overlaps execution of instructions
• Exploits “Instruction Level Parallelism”
Two main approaches:
• Hardware-based dynamic approaches
• Used in server and desktop processors
• Not used as extensively in PMD (personal mobile device) processors
• Compiler-based static approaches
• Not as successful outside of scientific applications
Instruction-Level Parallelism
• When exploiting instruction-level parallelism, goal is to
minimize pipeline CPI
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Parallelism within a basic block is limited
• Typical size of basic block = 3-6 instructions
• Must optimize across branches
Instruction Dependences
• There are three different types of dependences
• Data dependence (True data dependence)
• Name dependence (instructions using the same register names)
• Control dependence (branches)
• An instruction j is data-dependent on instruction i if either of the following holds:
– Instruction i produces a result that may be used by instruction j
– Instruction j is data-dependent on instruction k, and instruction k is data-dependent on instruction i
Data Dependences
• Example of data dependence
Lp: fld f0,0(x1) //f0=array element
fadd.d f4,f0,f2 //add scalar in f2
fsd f4,0(x1) //store result
addi x1,x1,-8 //decrement pointer 8 bytes
bne x1,x2,Lp //branch if x1 ≠ x2
Instruction Dependences
• Dependencies are a property of programs
• Pipeline organization determines if dependence is
detected and if it causes a stall
• Data dependence conveys:
– Possibility of a hazard
– Order in which results must be calculated
– Upper bound on exploitable instruction level parallelism
• Dependencies that flow through memory locations
are difficult to detect
Name Dependences
• A name dependence occurs when two instructions
use the same register or memory location, called a
name, but there is no flow of data between the
instructions associated with that name
• Two types of name dependence
– Antidependence: Write After Read (WAR)
– Output dependence: Write After Write (WAW)
Register Renaming
• Instructions with name dependence can execute
simultaneously or out of order if the registers are
renamed (register renaming)
• Renaming can be done statically at compile time or
dynamically by hardware at run time.
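A minimal sketch of the renaming idea (the instruction encoding as `(dest, sources)` tuples and the function name are illustrative, not how real hardware represents instructions). Each write is given a fresh physical register, which removes WAR and WAW name dependences while preserving true data flow:

```python
# Toy register renamer: architectural registers are remapped to
# an ever-growing pool of physical registers on every write.
def rename(instructions, num_arch_regs=32):
    mapping = {r: f"p{r}" for r in range(num_arch_regs)}  # initial map
    next_phys = num_arch_regs
    renamed = []
    for dst, srcs in instructions:             # (dest reg, [source regs])
        new_srcs = [mapping[s] for s in srcs]  # reads use current mapping
        mapping[dst] = f"p{next_phys}"         # fresh dest: no WAW/WAR
        next_phys += 1
        renamed.append((mapping[dst], new_srcs))
    return renamed

# WAW example: both instructions write x1, but get distinct physical regs,
# so they could complete out of order.
print(rename([(1, [2, 3]), (1, [5, 6])]))
# [('p32', ['p2', 'p3']), ('p33', ['p5', 'p6'])]
```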
Control Dependence
• A control dependence determines the ordering of an
instruction i with respect to a branch instruction
if p1 {
S1;
};
if p2 {
S2;
}
• Instruction S1 is control dependent on p1 and S2 is
control dependent on p2
• Control dependence is preserved by implementing
control hazard detection that causes control stalls.
Control Dependence
• Can we move S1 after (if p2) or S2 before (if p1)?
• Yes, but only without affecting the correctness of the program
if p1 {
S1;
};
if p2 {
S2;
}
• The two properties critical to program correctness are exception behavior and the data flow
add x2,x3,x4
beq x2,x0,L1
ld x1,0(x2)
L1:
• The load instruction may cause a memory
protection exception if moved before the branch
Control Dependence
• It is insufficient to just maintain data dependences because an instruction may be data-dependent on more than one predecessor
add x1,x2,x3
beq x4,x0,L
sub x1,x5,x6
L: ...
or x7,x1,x8
• The or instruction is data-dependent on both the add
and sub instructions
• The data flow must be preserved.
• Speculation helps to lessen the impact of the control
dependence while still maintaining the data flow
Value liveness
• The property of whether a value will be used by an
upcoming instruction is called liveness
• What if we knew that the register destination of the sub instruction (x4) was unused after the instruction labeled skip?
add x1,x2,x3
beq x12,x0,skip
sub x4,x5,x6
add x5,x4,x9
skip: or x7,x8,x9
• Then we can move the sub before the beq
• This type of code scheduling is also a form of speculation, often called software speculation
Hazards
• Structural hazards: different instructions in different stages
(or the same stage) conflicting for the same resource
• Data hazards: an instruction cannot continue because it
needs a value that has not yet been generated by an
earlier instruction
• Control hazards: fetch cannot continue because the outcome of an earlier branch is not yet known – a special case of a data hazard – kept as a separate category because they are treated in different ways
Structural Hazards
• Example: a unified instruction and data cache →
stage 4 (MEM) and stage 1 (IF) can never coincide
• The later instruction and all its successors are delayed
until a cycle is found when the resource is free → these
are pipeline bubbles
• Structural hazards are easy to eliminate – increase the
number of resources (for example, implement separate instruction and data caches)
Enabling and optimizing ILP
• To enable ILP we need to
– Detect data dependences either in software or hardware
– Insert stalls whenever needed for a correct program result
– Flush the pipeline whenever a branch is taken
• To optimize ILP we need to
– Minimize the number of stalls needed for a correct program result
• Know when and how the ordering among instructions may be changed
– Minimize flushing the pipeline
• Predicting branch outcomes.
Compiler Techniques for Exposing ILP
• Pipeline Scheduling
– Separate a dependent instruction from its source instruction by the pipeline latency of the source instruction
• Example
➢ C code:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
➢ Unpipelined RISC-V Code
Loop: fld f0,0(x1) //f0=array element x[i]
fadd.d f4,f0,f2 //add scalar in f2=s
fsd f4,0(x1) //store result
addi x1,x1,-8 //decrement pointer 8 bytes (per DW)
bne x1,x2,Loop //branch if x1≠x2
Where are the data dependencies in the above code? And of which type?
Compiler Techniques for Exposing ILP
➢ Pipelined RISC-V Code
Before scheduling:
Loop: fld f0,0(x1)
stall
fadd.d f4,f0,f2
stall
stall
fsd f4,0(x1)
addi x1,x1,-8
bne x1,x2,Loop
After scheduling:
Loop: fld f0,0(x1)
addi x1,x1,-8
fadd.d f4,f0,f2
stall
stall
fsd f4,8(x1)
bne x1,x2,Loop
Constraints: the addi is moved above the fsd, so the store offset is adjusted from 0 to 8
Compiler Techniques for Exposing ILP
• Loop unrolling
– Replicates the loop body multiple times, adjusting the loop termination code
– Unroll by a factor of 4 (assume # elements is divisible by 4)
– Eliminate unnecessary instructions
Loop: fld f0,0(x1)
fadd.d f4,f0,f2
fsd f4,0(x1) //drop addi & bne
fld f6,-8(x1)
fadd.d f10,f6,f2
fsd f10,-8(x1) //drop addi & bne
fld f8,-16(x1)
fadd.d f12,f8,f2
fsd f12,-16(x1) //drop addi & bne
fld f14,-24(x1)
fadd.d f16,f14,f2
fsd f16,-24(x1)
addi x1,x1,-32
bne x1,x2,Loop
• Eliminating three branches and three decrements of x1
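The same unroll-by-4 transformation, sketched in Python for illustration (function name is mine; like the slide, it assumes the element count is divisible by 4, and it walks the array backward to mirror the pointer-decrement style of the RISC-V code):

```python
# x[i] = x[i] + s, unrolled by a factor of 4:
# one loop test and one index decrement serve four elements.
def add_scalar_unrolled(x, s):
    i = len(x) - 1
    while i >= 0:
        x[i]     = x[i]     + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4              # one decrement + one test per 4 elements
    return x

print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0], 10.0))
# [11.0, 12.0, 13.0, 14.0]
```

In the RISC-V version the same effect is achieved with distinct offsets (0, -8, -16, -24) from the one remaining pointer decrement.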
Compiler Techniques for Exposing ILP
• Pipeline schedule the unrolled loop
Loop: fld f0,0(x1)
fld f6,-8(x1)
fld f8,-16(x1)
fld f14,-24(x1)
fadd.d f4,f0,f2
fadd.d f10,f6,f2
fadd.d f12,f8,f2
fadd.d f16,f14,f2
fsd f4,0(x1)
fsd f10,-8(x1)
fsd f12,-16(x1)
fsd f16,-24(x1)
addi x1,x1,-32
bne x1,x2,Loop
◼ 14 cycles
◼ 3.5 cycles per element
Compiler Techniques for Exposing ILP
❖ Determine that unrolling the loop would be useful by finding that the
loop iterations were independent, except for the loop maintenance
code
❖ Use different registers for different computations to avoid name
dependence.
❖ Eliminate the extra test and branch instructions and adjust the loop
termination and iteration code.
❖ Determine that the loads and stores in the unrolled loop can be interchanged if they are independent, i.e., they do not refer to the same address.
❖ Schedule the code, preserving any dependences needed to yield
the same result as the original code.
Compiler Techniques Limitations
❖ Loop overhead
❖ The amount of overhead that can be reduced decreases with each additional unroll
❖ Code size limitations
❖ Increase in code size → possible increase in cache miss rate
❖ Compiler limitations
❖ Potential shortfall in registers → register pressure.
Branch Prediction
❖ Basic 1-bit predictor:
❖ Predict not taken: just increment PC+4 (do nothing special)
[State diagram: two states, 0 (predict not taken) and 1 (predict taken); a taken branch (T) moves toward state 1, a not-taken branch (N) moves toward state 0]
Basic 1-bit predictor
▪ How does a basic 1-bit branch predictor behave on the following branch patterns?
▪ TTTTTTTTTTTNTTTTTTTTTTTTTTT…..
▪ NNNNNNNNNNNNTNNNNNNNNNNNNN….
▪ TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…..
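A minimal simulation of one 1-bit predictor (the state is simply the branch's last outcome; the initial "predict taken" state is an assumption) run on shortened versions of the patterns above:

```python
# 1-bit predictor: always predict whatever the branch did last time.
def mispredictions_1bit(pattern, state="T"):
    miss = 0
    for outcome in pattern:     # pattern: string of T (taken) / N (not taken)
        if outcome != state:    # prediction was wrong
            miss += 1
        state = outcome         # remember the last outcome
    return miss

print(mispredictions_1bit("TTTTTTTTTTTNTTTT"))
# 2: the lone N is mispredicted, and so is the T right after it
print(mispredictions_1bit("TNTNTNTNTN"))
# 9 of 10: an alternating pattern defeats a 1-bit predictor
```

This shows the classic weakness: every anomaly costs two mispredictions, and alternation costs nearly 100%.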
Basic 1-bit predictor
▪ Assume 30% of instructions are branches and 60% of branches are mispredicted; calculate the pipeline CPI if the branch misprediction penalty is 2 cycles.
Pipeline CPI
= 1 + %Branch Instructions × Branch Misprediction Rate × Branch Misprediction Penalty
= 1 + 0.3 × 0.6 × 2 = 1.36
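The same calculation as code (function name is mine): misprediction stalls are simply added on top of the ideal CPI of 1.

```python
# Pipeline CPI with branch misprediction stalls added to the ideal CPI.
def pipeline_cpi(branch_frac, mispredict_rate, penalty):
    return 1 + branch_frac * mispredict_rate * penalty

print(pipeline_cpi(0.3, 0.6, 2))  # 1.36, matching the worked example
```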
Resources
▪ Memory Timing
▪ https://www.hardwaresecrets.com/understanding-ram-timings/
▪ Memory Architecture
▪ https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
▪ CS6810 Computer Architecture (87 Lectures) by Rajeev Balasubramonian
▪ https://www.youtube.com/playlist?list=PL8EC1756A7B1764F6
Resources
▪ HPCA short Lecture series on High Performance Computer Architecture
▪ Part 1 (161 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPmqpjgrmf4-DGlaeV0om4iP
▪ Part 2 (62 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPkNw98-MFodLzKgi6bYGjZs
▪ Part 3 (169 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPnhRXZ6wuHnnclMLfg_yjHs
▪ Part 4 (120 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPn79fsplIuZG34KwbkYSedj
▪ Part 5 (149 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe
How do we implement a basic 1-bit predictor?
[Diagram: the low 10 bits of the branch PC index a 1K-entry table; each entry is a single bit]
The table keeps track of what the branch did last time
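The structure in the diagram can be sketched as a 1024-entry list indexed by bits of the branch PC. The `>> 2` (dropping the byte offset of a 4-byte instruction before taking 10 bits) is an assumption about which PC bits are used; the function names are mine:

```python
# 1K-entry table of single bits; each entry remembers what the
# branch that maps to it did last time (0 = not taken, 1 = taken).
table = [0] * 1024

def index(pc):
    return (pc >> 2) & 0x3FF      # low 10 bits of the word address

def predict(pc):
    return table[index(pc)]

def update(pc, taken):
    table[index(pc)] = 1 if taken else 0

update(0x4000, True)
print(predict(0x4000))  # 1: predicts taken next time
```

Because only 10 PC bits are used, different branches can alias to the same entry and disturb each other's predictions.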
Basic 2-bit Branch Prediction
❖ Basic 2-bit predictor:
❖ For each branch:
❖ Predict taken or not taken
❖ Change prediction only if the prediction is wrong two consecutive times.
[State diagram: four states 00, 01, 10, 11; T (taken) moves right toward 11, N (not taken) moves left toward 00; states 10 and 11 predict taken, 00 and 01 predict not taken]
▪ Check the following case, assuming we start from the 11 state:
TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…..
▪ We get 50% Correct prediction!
Basic 2-bit Branch Prediction
• For each branch, maintain a 2-bit saturating counter:
if the branch is taken: counter = min(3,counter+1)
if the branch is not taken: counter = max(0,counter-1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)
• Especially useful when multiple branches share the same
counter (some bits of the branch PC are used to index
into the branch predictor)
• Can be easily extended to N-bits (in most processors,
N=2)
• Prediction performance depends on both the prediction
accuracy and the branch frequency
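The saturating-counter rule above, run on the two cases from these slides (starting from counter = 3, i.e., the 11 state; the function name is mine):

```python
# 2-bit saturating counter: predict taken when counter >= 2;
# taken saturates up at 3, not-taken saturates down at 0.
def simulate_2bit(pattern, counter=3):
    miss = 0
    for outcome in pattern:        # string of T (taken) / N (not taken)
        predict_taken = counter >= 2
        if predict_taken != (outcome == "T"):
            miss += 1
        counter = min(3, counter + 1) if outcome == "T" else max(0, counter - 1)
    return miss

print(simulate_2bit("TN" * 10))
# 10 of 20: the alternating pattern still yields only 50% accuracy
print(simulate_2bit("T" * 11 + "N" + "T" * 8))
# 1: a single atypical N costs just one misprediction
```

The second case shows the advantage over the 1-bit scheme, where the same pattern would cost two mispredictions.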
Basic 2-bit Branch Prediction
[Diagram: the low 10 bits of the branch PC index a 1K-entry table; each entry is a 2-bit saturating counter]
The table keeps track of the common-case outcome for the branch