Computer Architecture
A Quantitative Approach, Sixth Edition
Chapter 3
Instruction-Level
Parallelism and Its
Exploitation
Copyright © 2019, Elsevier Inc. All Rights Reserved
Unpipelined Architecture
Unpipelined: start and finish a job before moving to the next
[Figure: jobs laid out along the time axis; each job completes Fetch, Decode, and Execute before the next job starts]
Pipelined Architecture
Pipelined: break the job into smaller stages
[Figure: four jobs overlapped along the time axis in F (fetch), D (decode), X (execute) stages]
5-Stage Pipeline
To enable pipelining we need to hold the input to each stage stable; this requires latching data and control signals at each stage in the pipeline → Pipeline Registers
Clocks and Latches
[Figure: pipeline stages separated by latches (L), all driven by a common clock (Clk)]
• Unpipelined: time to execute one instruction = T + Tovh
• For an N-stage pipeline, time per stage = T/N + Tovh
• Total time per instruction = N (T/N + Tovh) = T + N Tovh
• Clock cycle time = T/N + Tovh
• Clock speed = 1 / (T/N + Tovh)
• Ideal speedup = (T + Tovh) / (T/N + Tovh)
• Cycles to complete one instruction = N
• Average CPI (cycles per instr) = 1
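A quick check of the formulas above, using hypothetical numbers (T = 10 ns of combinational logic, Tovh = 0.5 ns of latch/clock overhead, N = 5 stages — these values are illustrative, not from the slides):

```python
# Pipeline timing formulas from the slide, as a small sketch.
def pipeline_metrics(T, Tovh, N):
    cycle_time = T / N + Tovh            # clock cycle time
    time_per_instr = N * cycle_time      # = T + N * Tovh
    speedup = (T + Tovh) / cycle_time    # ideal speedup vs. unpipelined
    return cycle_time, time_per_instr, speedup

cycle, latency, speedup = pipeline_metrics(T=10.0, Tovh=0.5, N=5)
print(cycle)    # 2.5  -> ns per cycle
print(latency)  # 12.5 -> ns to complete one instruction
print(speedup)  # 4.2  -> less than the ideal 5x because of latch overhead
```

Note how the latch overhead Tovh keeps the speedup below the ideal factor of N.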
A 5-Stage Pipeline
[Figure: the 5-stage pipeline datapath]
A 5-Stage Pipeline
Use the PC to access the I-cache and increment PC by 4
A 5-Stage Pipeline
Read registers, compare registers, compute branch target; for now, assume branches take 2 cycles (there is enough work that branches can easily take more)
A 5-Stage Pipeline
ALU computation, effective address computation for load/store
A 5-Stage Pipeline
Memory access to/from the data cache; stores finish in 4 cycles
A 5-Stage Pipeline
Write result of ALU computation or load into register file
Introduction
• Pipelining became a universal technique by 1985
• Overlaps execution of instructions
• Exploits “Instruction Level Parallelism”
Two main approaches:
• Hardware-based dynamic approaches
• Used in server and desktop processors
• Not used as extensively in PMD (personal mobile device) processors
• Compiler-based static approaches
• Not as successful outside of scientific applications
Instruction-Level Parallelism
• When exploiting instruction-level parallelism, goal is to
minimize pipeline CPI
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Parallelism within a basic block is limited
• Typical size of basic block = 3-6 instructions
• Must optimize across branches
Instruction Dependences
• There are three different types of dependences
• Data dependence (True data dependence)
• Name dependence (instructions using the same register names)
• Control dependence (branches)
• An instruction j is data-dependent on instruction i if either of the following holds:
– Instruction i produces a result that may be used by instruction j
– Instruction j is data-dependent on instruction k, and instruction k is data-dependent on instruction i
Data Dependences
• Example of data dependence
Lp: fld f0,0(x1) //f0=array element
fadd.d f4,f0,f2 //add scalar in f2
fsd f4,0(x1) //store result
addi x1,x1,-8 //decrement pointer 8 bytes
bne x1,x2,Lp //branch if x1 ≠ x2
Instruction Dependences
• Dependencies are a property of programs
• Pipeline organization determines if dependence is
detected and if it causes a stall
• Data dependence conveys:
– Possibility of a hazard
– Order in which results must be calculated
– Upper bound on exploitable instruction level parallelism
• Dependencies that flow through memory locations
are difficult to detect
Name Dependences
• A name dependence occurs when two instructions
use the same register or memory location, called a
name, but there is no flow of data between the
instructions associated with that name
• Two types of name dependence
– Antidependence: Write After Read (WAR)
– Output dependence: Write After Write (WAW)
Register Renaming
• Instructions with name dependence can execute
simultaneously or out of order if the registers are
renamed (register renaming)
• Renaming can be done statically at compile time or
dynamically by hardware at run time.
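A minimal sketch of the renaming idea (the instruction encoding as `(dest, sources)` tuples and the function name are illustrative, not how real hardware represents instructions). Each write is given a fresh physical register, which removes WAR and WAW name dependences while preserving true data flow:

```python
# Toy register renamer: architectural registers are remapped to
# an ever-growing pool of physical registers on every write.
def rename(instructions, num_arch_regs=32):
    mapping = {r: f"p{r}" for r in range(num_arch_regs)}  # initial map
    next_phys = num_arch_regs
    renamed = []
    for dst, srcs in instructions:             # (dest reg, [source regs])
        new_srcs = [mapping[s] for s in srcs]  # reads use current mapping
        mapping[dst] = f"p{next_phys}"         # fresh dest: no WAW/WAR
        next_phys += 1
        renamed.append((mapping[dst], new_srcs))
    return renamed

# WAW example: both instructions write x1, but get distinct physical regs,
# so they could complete out of order.
print(rename([(1, [2, 3]), (1, [5, 6])]))
# [('p32', ['p2', 'p3']), ('p33', ['p5', 'p6'])]
```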
Control Dependence
• A control dependence determines the ordering of an
instruction i with respect to a branch instruction
if p1 {
S1;
};
if p2 {
S2;
}
• Instruction S1 is control dependent on p1 and S2 is
control dependent on p2
• Control dependence is preserved by implementing
control hazard detection that causes control stalls.
Control Dependence
• Can we move S1 after (if p2) or S2 before (if p1)?
• Yes, but only without affecting the correctness of the program
if p1 {
S1;
};
if p2 {
S2;
}
• The two properties critical to program correctness are exception behavior and the data flow
add x2,x3,x4
beq x2,x0,L1
ld x1,0(x2)
L1:
• The load instruction may cause a memory
protection exception if moved before the branch
Control Dependence
• It is insufficient to just maintain data dependences because an instruction may be data-dependent on more than one predecessor
add x1,x2,x3
beq x4,x0,L
sub x1,x5,x6
L: ...
or x7,x1,x8
• The or instruction is data-dependent on both the add
and sub instructions
• The data flow must be preserved.
• Speculation helps to lessen the impact of the control
dependence while still maintaining the data flow
Value liveness
• The property of whether a value will be used by an
upcoming instruction is called liveness
• What if we knew that the register destination of the sub instruction (x4) was unused after the instruction labeled skip?
add x1,x2,x3
beq x12,x0,skip
sub x4,x5,x6
add x5,x4,x9
skip: or x7,x8,x9
• Then we can move the sub before the beq
• This type of code scheduling is also a form of speculation, often called software speculation
Hazards
• Structural hazards: different instructions in different stages
(or the same stage) conflicting for the same resource
• Data hazards: an instruction cannot continue because it
needs a value that has not yet been generated by an
earlier instruction
• Control hazards: fetch cannot continue because the outcome of an earlier branch is not yet known – a special case of a data hazard – kept as a separate category because they are treated in different ways
Structural Hazards
• Example: a unified instruction and data cache →
stage 4 (MEM) and stage 1 (IF) can never coincide
• The later instruction and all its successors are delayed
until a cycle is found when the resource is free → these
are pipeline bubbles
• Structural hazards are easy to eliminate – increase the
number of resources (for example, implement separate instruction and data caches)
Enabling and optimizing ILP
• To enable ILP we need to
– Detect data dependences either in software or hardware
– Insert stalls whenever needed for a correct program result
– Flush the pipeline whenever a branch is taken
• To optimize ILP we need to
– Minimize the number of stalls needed for a correct program result
• Know when and how the ordering among instructions may be changed
– Minimize flushing the pipeline
• Predicting branch outcomes.
Compiler Techniques for Exposing ILP
• Pipeline Scheduling
– Separate a dependent instruction from its source instruction by the pipeline latency of the source instruction
• Example
➢ C code:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
➢ Unpipelined RISC-V Code
Loop: fld f0,0(x1) //f0=array element x[i]
fadd.d f4,f0,f2 //add scalar in f2=s
fsd f4,0(x1) //store result
addi x1,x1,-8 //decrement pointer 8 bytes (per DW)
bne x1,x2,Loop //branch if x1≠x2
Where are the data dependencies in the above code? And of which type?
Compiler Techniques for Exposing ILP
➢ Pipelined RISC-V Code
Before scheduling:
Loop: fld f0,0(x1)
stall
fadd.d f4,f0,f2
stall
stall
fsd f4,0(x1)
addi x1,x1,-8
bne x1,x2,Loop
After scheduling:
Loop: fld f0,0(x1)
addi x1,x1,-8
fadd.d f4,f0,f2
stall
stall
fsd f4,8(x1)
bne x1,x2,Loop
Constraints: the addi is moved above the fsd, so the store offset is adjusted from 0 to 8
Compiler Techniques for Exposing ILP
• Loop unrolling
– Replicates the loop body multiple times, adjusting the loop termination code
– Unroll by a factor of 4 (assume # elements is divisible by 4)
– Eliminate unnecessary instructions
Loop: fld f0,0(x1)
fadd.d f4,f0,f2
fsd f4,0(x1) //drop addi & bne
fld f6,-8(x1)
fadd.d f10,f6,f2
fsd f10,-8(x1) //drop addi & bne
fld f8,-16(x1)
fadd.d f12,f8,f2
fsd f12,-16(x1) //drop addi & bne
fld f14,-24(x1)
fadd.d f16,f14,f2
fsd f16,-24(x1)
addi x1,x1,-32
bne x1,x2,Loop
• Eliminating three branches and three decrements of x1
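The same unroll-by-4 transformation, sketched in Python for illustration (function name is mine; like the slide, it assumes the element count is divisible by 4, and it walks the array backward to mirror the pointer-decrement style of the RISC-V code):

```python
# x[i] = x[i] + s, unrolled by a factor of 4:
# one loop test and one index decrement serve four elements.
def add_scalar_unrolled(x, s):
    i = len(x) - 1
    while i >= 0:
        x[i]     = x[i]     + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4              # one decrement + one test per 4 elements
    return x

print(add_scalar_unrolled([1.0, 2.0, 3.0, 4.0], 10.0))
# [11.0, 12.0, 13.0, 14.0]
```

In the RISC-V version the same effect is achieved with distinct offsets (0, -8, -16, -24) from the one remaining pointer decrement.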
Compiler Techniques for Exposing ILP
• Pipeline schedule the unrolled loop
Loop: fld f0,0(x1)
fld f6,-8(x1)
fld f8,-16(x1)
fld f14,-24(x1)
fadd.d f4,f0,f2
fadd.d f10,f6,f2
fadd.d f12,f8,f2
fadd.d f16,f14,f2
fsd f4,0(x1)
fsd f10,-8(x1)
fsd f12,-16(x1)
fsd f16,-24(x1)
addi x1,x1,-32
bne x1,x2,Loop
◼ 14 cycles
◼ 3.5 cycles per element
Compiler Techniques for Exposing ILP
❖ Determine that unrolling the loop would be useful by finding that the
loop iterations were independent, except for the loop maintenance
code
❖ Use different registers for different computations to avoid name
dependence.
❖ Eliminate the extra test and branch instructions and adjust the loop
termination and iteration code.
❖ Determine that the loads and stores in the unrolled loop can be interchanged if they are independent, i.e., they do not refer to the same address.
❖ Schedule the code, preserving any dependences needed to yield
the same result as the original code.
Compiler Techniques Limitations
❖ Loop overhead
❖ The amount of overhead that can be reduced decreases with each additional unroll
❖ Code size limitations
❖ Increase in code size → possible increase in cache miss rate
❖ Compiler limitations
❖ Potential shortfall in registers → register pressure.
Branch Prediction
❖ Basic 1-bit predictor:
❖ Predict not taken: just increment PC+4 (do nothing special)
[State diagram: two states, 0 (predict not taken) and 1 (predict taken); a taken branch (T) moves toward state 1, a not-taken branch (N) moves toward state 0]
Basic 1-bit predictor
▪ How does a basic 1-bit branch predictor behave on the following branch patterns?
▪ TTTTTTTTTTTNTTTTTTTTTTTTTTT…..
▪ NNNNNNNNNNNNTNNNNNNNNNNNNN….
▪ TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…..
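A minimal simulation of one 1-bit predictor (the state is simply the branch's last outcome; the initial "predict taken" state is an assumption) run on shortened versions of the patterns above:

```python
# 1-bit predictor: always predict whatever the branch did last time.
def mispredictions_1bit(pattern, state="T"):
    miss = 0
    for outcome in pattern:     # pattern: string of T (taken) / N (not taken)
        if outcome != state:    # prediction was wrong
            miss += 1
        state = outcome         # remember the last outcome
    return miss

print(mispredictions_1bit("TTTTTTTTTTTNTTTT"))
# 2: the lone N is mispredicted, and so is the T right after it
print(mispredictions_1bit("TNTNTNTNTN"))
# 9 of 10: an alternating pattern defeats a 1-bit predictor
```

This shows the classic weakness: every anomaly costs two mispredictions, and alternation costs nearly 100%.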
Basic 1-bit predictor
▪ Assume 30% of instructions are branches and 60% of branches are mispredicted; calculate the pipeline CPI if the branch misprediction penalty is 2 cycles.
Pipeline CPI
= 1 + %Branch Instructions × Branch Misprediction Rate × Branch Misprediction Penalty
= 1 + 0.3 × 0.6 × 2 = 1.36
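The same calculation as code (function name is mine): misprediction stalls are simply added on top of the ideal CPI of 1.

```python
# Pipeline CPI with branch misprediction stalls added to the ideal CPI.
def pipeline_cpi(branch_frac, mispredict_rate, penalty):
    return 1 + branch_frac * mispredict_rate * penalty

print(pipeline_cpi(0.3, 0.6, 2))  # 1.36, matching the worked example
```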
Resources
▪ Memory Timing
▪ https://www.hardwaresecrets.com/understanding-ram-timings/
▪ Memory Architecture
▪ https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
▪ CS6810 Computer Architecture (87 Lectures) by Rajeev Balasubramonian
▪ https://www.youtube.com/playlist?list=PL8EC1756A7B1764F6
Resources
▪ HPCA short Lecture series on High Performance Computer Architecture
▪ Part 1 (161 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPmqpjgrmf4-DGlaeV0om4iP
▪ Part 2 (62 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPkNw98-MFodLzKgi6bYGjZs
▪ Part 3 (169 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPnhRXZ6wuHnnclMLfg_yjHs
▪ Part 4 (120 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPn79fsplIuZG34KwbkYSedj
▪ Part 5 (149 Lectures)
▪ https://www.youtube.com/playlist?list=PLAwxTw4SYaPkr-vo9gKBTid_BWpWEfuXe
How do we implement a basic 1-bit predictor?
[Diagram: the low 10 bits of the branch PC index a 1K-entry table; each entry is a single bit]
The table keeps track of what the branch did last time
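The structure in the diagram can be sketched as a 1024-entry list indexed by bits of the branch PC. The `>> 2` (dropping the byte offset of a 4-byte instruction before taking 10 bits) is an assumption about which PC bits are used; the function names are mine:

```python
# 1K-entry table of single bits; each entry remembers what the
# branch that maps to it did last time (0 = not taken, 1 = taken).
table = [0] * 1024

def index(pc):
    return (pc >> 2) & 0x3FF      # low 10 bits of the word address

def predict(pc):
    return table[index(pc)]

def update(pc, taken):
    table[index(pc)] = 1 if taken else 0

update(0x4000, True)
print(predict(0x4000))  # 1: predicts taken next time
```

Because only 10 PC bits are used, different branches can alias to the same entry and disturb each other's predictions.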
Basic 2-bit Branch Prediction
❖ Basic 2-bit predictor:
❖ For each branch:
❖ Predict taken or not taken
❖ Change prediction only if the prediction is wrong two consecutive times.
[State diagram: four states 00, 01, 10, 11; T (taken) moves right toward 11, N (not taken) moves left toward 00; states 10 and 11 predict taken, 00 and 01 predict not taken]
▪ Check the following case, assuming we start from the 11 state:
TNTNTNTNTNTNTNTNTNTNTNTNTNTNTN…..
▪ We get 50% Correct prediction!
Basic 2-bit Branch Prediction
• For each branch, maintain a 2-bit saturating counter:
if the branch is taken: counter = min(3,counter+1)
if the branch is not taken: counter = max(0,counter-1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)
• Especially useful when multiple branches share the same
counter (some bits of the branch PC are used to index
into the branch predictor)
• Can be easily extended to N-bits (in most processors,
N=2)
• Prediction performance depends on both the prediction
accuracy and the branch frequency
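The saturating-counter rule above, run on the two cases from these slides (starting from counter = 3, i.e., the 11 state; the function name is mine):

```python
# 2-bit saturating counter: predict taken when counter >= 2;
# taken saturates up at 3, not-taken saturates down at 0.
def simulate_2bit(pattern, counter=3):
    miss = 0
    for outcome in pattern:        # string of T (taken) / N (not taken)
        predict_taken = counter >= 2
        if predict_taken != (outcome == "T"):
            miss += 1
        counter = min(3, counter + 1) if outcome == "T" else max(0, counter - 1)
    return miss

print(simulate_2bit("TN" * 10))
# 10 of 20: the alternating pattern still yields only 50% accuracy
print(simulate_2bit("T" * 11 + "N" + "T" * 8))
# 1: a single atypical N costs just one misprediction
```

The second case shows the advantage over the 1-bit scheme, where the same pattern would cost two mispredictions.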
Basic 2-bit Branch Prediction
[Diagram: the low 10 bits of the branch PC index a 1K-entry table; each entry is a 2-bit saturating counter]
The table keeps track of the common-case outcome for the branch