Pipeline Part 2 and Data Hazards

The document discusses branch prediction and speculative execution techniques used in processors to enhance performance by reducing stalls caused by conditional branches. It explains both static and dynamic branch prediction methods, including the use of branch target buffers (BTB) and various algorithms for predicting branch outcomes. Additionally, it covers data dependency and operand forwarding in pipelined processors to optimize instruction execution and minimize delays.


Branch Prediction and Speculative Execution

There are various ways in which we can decrease the number of stall cycles in the case of conditional branch statements (ways to avoid stall cycles). These methods do not work successfully all the time, but they do most of the time.

(1) Branch prediction (a guess) with speculative execution. Flushing is involved if the guess is wrong. Suppose we have predicted that the branch won't be taken, i.e., that the condition of the branch is false. So we would start fetching the next sequential instructions and would be executing in sequence; then suddenly, after the complete execution of the branch, we come to know that our assumption was wrong and the branch should be taken. In that case we have to go through the flushing again.

If the assumption is right: good, we avoid stalls and improve performance. If the assumption is wrong: we pay the flush penalty.
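The right/wrong trade-off above is just arithmetic: the flush penalty hurts only in proportion to how often the guess is wrong. A minimal sketch, with made-up numbers for branch frequency, prediction accuracy, and flush cost:

```python
# Back-of-the-envelope cost of branch speculation.
# The numbers are hypothetical: 20% of instructions are branches,
# the guess is right 80% of the time, and a wrong guess costs a
# 3-cycle flush of the pipeline.
def effective_cpi(branch_freq, predict_accuracy, flush_penalty, base_cpi=1.0):
    """Average cycles per instruction once misprediction flushes are counted."""
    mispredict_rate = 1.0 - predict_accuracy
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

print(effective_cpi(0.20, 0.80, 3))   # about 1.12 cycles per instruction
```

Even a modest accuracy improvement shrinks the penalty term directly, which is why the dynamic schemes below are worth their hardware cost.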

There are various ways by which this prediction is implemented (static prediction):

- In some processors, the branch will always be predicted taken. If, after executing the branch, the branch turns out not to be taken, then they incur the penalty (flush the pipeline) and start again with the fall-through instruction.
- In some other processors, the branch will never be predicted taken; the consecutive (next sequential) instruction is always fetched.

The performance of these strategies depends on what types of branch instructions our program contains.

The other form is dynamic prediction.


Its simplest hardware implementation uses a Branch Target Buffer (BTB).

To improve prediction accuracy further, we can use the actual branch behavior to influence the prediction, resulting in dynamic branch prediction.

The simplest form of a dynamic predictor algorithm can use the result of the most recent execution of a branch instruction. The processor assumes that the next time the instruction is executed, the branch decision is likely to be the same as the last time.


Two techniques:

1. Two-state algorithm / single-bit predictor

A simple predictor that uses a single bit to represent the last outcome of a branch (taken or not taken). The algorithm may be described by a two-state machine:

- LT: branch is likely to be taken
- LNT: branch is not likely to be taken

Suppose the algorithm is started in state LNT. When the branch instruction is executed and the branch is taken, the machine moves to state LT; otherwise, it remains in state LNT. The next time the same instruction is encountered, the branch is predicted as taken if the machine is in state LT; otherwise, it is predicted as not taken.

Transitions: in LNT, a taken branch (BT) moves the machine to LT, while a not-taken branch (BNT) keeps it in LNT; in LT, BT keeps it in LT and BNT moves it back to LNT. B is the bit reserved for each branch instruction.
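The two-state machine can be sketched as a short simulation; the outcome string and starting state here are illustrative:

```python
# Single-bit (two-state) predictor: remember only the last outcome.
# States: 'LT' (likely taken) and 'LNT' (likely not taken), as in the notes.
def one_bit_predict(outcomes, start='LNT'):
    """Return the number of correct predictions over a T/N outcome string."""
    state, correct = start, 0
    for actual in outcomes:                          # 'T' = taken, 'N' = not taken
        predicted = 'T' if state == 'LT' else 'N'
        if predicted == actual:
            correct += 1
        state = 'LT' if actual == 'T' else 'LNT'     # remember the last outcome
    return correct

# A loop taken 4 times then not taken, executed twice: the predictor
# misses the first pass and both loop-exit/loop-reentry boundaries.
print(one_bit_predict("TTTTN" * 2))   # 6 of 10 predictions correct
```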


2. Four-state algorithm / double-bit predictor

In the two-state algorithm, repeated execution of the same loop results in two mispredictions: one in the first pass and one in the last pass. Better prediction accuracy can be achieved by keeping more information about execution history. This algorithm uses four states:

- ST: strongly likely to be taken
- LT: likely to be taken
- LNT: likely not to be taken
- SNT: strongly likely not to be taken

The branch is predicted taken in states ST and LT and not taken in states LNT and SNT. A taken branch (BT) moves the machine one step toward ST (SNT → LNT → LT → ST), and a not-taken branch (BNT) moves it one step toward SNT (ST → LT → LNT → SNT).

Repeated execution of the same loop now results in only one misprediction, on the last pass.
A second drawback of the single-bit branch predictor (two-state algorithm):

Consider a branch with the outcome sequence (T = taken, N = not taken):

T N T N T N T N T N ...

The single-bit predictor predicts the next outcome of a branch based only on its present outcome. Hence, when we have an alternating outcome sequence for a branch, the accuracy is very poor.

In the 2-bit predictor, we change the predicted direction only after we predict wrongly twice in a row.
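This two-misses-before-flipping behavior is what the four-state scheme buys. A small sketch, assuming the common saturating-counter formulation of the 2-bit predictor (states numbered 0 to 3 for SNT, LNT, LT, ST), compared against a restated single-bit predictor on a loop-like outcome string:

```python
# Two-bit saturating counter: state 0 = SNT, 1 = LNT, 2 = LT, 3 = ST.
# The branch is predicted taken in states 2 and 3; one wrong guess only
# nudges the counter, so the direction flips after two misses in a row.
def two_bit_predict(outcomes, state=1):
    correct = 0
    for actual in outcomes:                       # 'T' = taken, 'N' = not taken
        correct += (('T' if state >= 2 else 'N') == actual)
        state = min(state + 1, 3) if actual == 'T' else max(state - 1, 0)
    return correct

# Single-bit scheme restated for comparison: predict the last outcome.
def single_bit_predict(outcomes, taken=False):
    correct = 0
    for actual in outcomes:
        correct += (('T' if taken else 'N') == actual)
        taken = (actual == 'T')
    return correct

loop = "TTTTN" * 10   # a loop branch: taken 4 times, then falls through
print(single_bit_predict(loop), two_bit_predict(loop))   # 30 39
print(single_bit_predict("TN" * 10))   # alternating outcomes: 0 correct
```

On the loop the 2-bit counter misses only the exit pass once warmed up; on the pathological alternating sequence the single-bit predictor gets every guess wrong.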

Branch Target Buffer (BTB)

It is a small, fast memory (a cache) where branch information is organized in the form of a lookup table in which each entry includes:

- the address of the branch instruction
- the state bit(s) for the branch prediction algorithm
- the branch target address

| Instruction address | State bit(s) | Branch target address |

The next time the processor comes across the same branch instruction, it obtains the branch prediction state bits based on the address of the instruction being fetched. The BTB is used to find out whether the current fetched instruction is a branch instruction that has already been seen earlier, and whether to fetch the next instruction from the fall-through path or from the target. If the prediction is wrong, the entry in the BTB is updated.
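A BTB entry, as listed above, pairs a branch's address with its prediction state bits and its target. A toy sketch (the class name, the 2-bit state encoding, and the addresses are hypothetical):

```python
# A toy branch target buffer: maps the address of a branch instruction
# to its 2-bit prediction state and its target address.
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}   # branch PC -> [state_bits, target_address]

    def lookup(self, pc):
        """At fetch time: is this PC a known branch predicted taken?"""
        entry = self.entries.get(pc)
        if entry is None:
            return None                        # never seen: fetch sequentially
        state, target = entry
        return target if state >= 2 else None  # taken -> redirect fetch

    def update(self, pc, taken, target):
        """After the branch resolves: create or adjust the entry."""
        state, _ = self.entries.get(pc, [1, target])
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.entries[pc] = [state, target]

btb = BranchTargetBuffer()
btb.update(0x100, True, 0x200)   # branch at 0x100, taken to 0x200
btb.update(0x100, True, 0x200)
print(hex(btb.lookup(0x100)))    # now predicted taken -> 0x200
```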
CHAPTER 7: Pipelining 317

Furthermore, we assume that the branch condition and target address are computed in one cycle and that instructions are fetched four at a time from the instruction cache. Instruction I1 compares two operands, and instruction I2 is a conditional branch based on the result of the comparison. Assume that the fetch unit predicts that the branch will not be taken and leaves instructions I3 and I4 in the instruction queue. Thus, instruction I3 begins execution after I2, followed by I4. The final decision on the branch cannot be made until after the condition code flags have been updated at the end of the compare instruction. (Later, we will see that the decision can sometimes be made a little earlier.) Hence, after step W1, the instruction fetch unit realizes that the prediction was incorrect and that the instructions in the execution pipe must be purged. Purge cycles are marked "p" in the figure. Four new instructions, Ik through Ik+3, are fetched, starting at the address computed in step E2 (only Ik is shown in the figure).

Note that write steps, in which results are stored in the destination registers, cannot be allowed to proceed until the branch condition has been resolved. Had instruction I2 taken longer to complete, step W3 and any subsequent steps would have had to be delayed. We return to this issue in Section 7.6.
Branch prediction can be done in one of two ways. It can be done by the compiler, in which case it is encoded in the branch instruction. The OP-code word of the instruction indicates to the instruction fetch unit whether this branch should be predicted as taken or not taken. The prediction result is the same every time a given branch instruction is encountered; hence, this approach is called static branch prediction.

Another approach is to have the processor hardware assess the likelihood of a branch being taken every time a branch instruction is encountered. This can be done by keeping track of the result of the branch decision the last time that instruction was executed and assuming that the decision in the current instance is likely to be the same. Clearly, the prediction result may be different for different instances of execution of the same branch instruction. This approach is called dynamic branch prediction. Both static and dynamic prediction lead to significant improvement in performance.

7.4 DATA DEPENDENCY

Section 7.1.2 introduced the idea of data dependency, which arises when a source operand for an instruction depends on the results of execution of a preceding instruction. If the results of the execution of the preceding instruction have not yet been recorded in their respective registers, the pipeline is stalled.

Consider a processor that uses the four-stage pipeline in Figure 7.2b. The first stage fetches an instruction from the cache. The second stage decodes the instruction and reads the source operands from the register file. The third stage performs an ALU operation and stores the result in the destination location. Assume that some processor instructions have three operands: two source operands and one destination operand. A hardware organization that supports these features is shown in Figure 7.13. Part a of the figure shows the connection between the ALU and the register file. The register file allows three different locations to be accessed simultaneously during each clock period: two locations provide the two source operands, SRC1 and SRC2, which are transferred

[Figure 7.13: Operand forwarding in a pipelined processor. (a) Part of the data paths of a CPU: the register file's Source 1 and Source 2 read ports feed the ALU input registers SRC1 and SRC2; the ALU result is held in RSLT before being written to the Destination port, and a forwarding data path connects RSLT back to the ALU inputs. (b) Position of the source and result registers in the processor pipeline: SRC1 and SRC2 sit before the O (Operate, ALU) step, RSLT before the W (Write, register file) step.]

in parallel to the input registers of the ALU. At the same time, RSLT, the contents of the result register, are stored in a third location in the register file. Thus, at the end of each clock period, two new source operands are loaded in SRC1 and SRC2, and a new result at the output of the ALU is stored in the RSLT register. That result will be transferred to the register file in the following clock cycle.

Figure 7.13b shows the positions of the three registers, SRC1, SRC2, and RSLT, in the pipeline, assuming the overall structure is similar to that in Figure 7.2b. The three registers are part of the interstage buffers used to pass data from one stage to the next during pipelined operation.

Consider the following two instructions:

I1: Add R1,R2,R3
I2: Shift_left R3

The timing of the execution of these two instructions is depicted in Figure 7.14. The Add instruction, which is fetched in the first clock cycle, performs the operation R3 ← [R1] + [R2]. The contents of the two source registers, R1 and R2, are read during the second clock cycle and are clocked into registers SRC1 and SRC2. Their sum is generated by the ALU and loaded into the ALU output register, RSLT, during clock cycle 3. From there, it is transferred into register R3 in the register file in cycle 4. What happens to the source operand of instruction I2? The processor hardware must detect that the source operand for this instruction is the same as the destination operand of the previous instruction, and must ensure that the updated value is used. Hence, the decode stage, which is responsible for getting the source operand from R3, cannot complete this task until cycle 5; so the pipeline is stalled for two cycles. Subsequent instructions proceed as shown, for the same reasons as Section 7.1.2 explains for Figure 7.6.

To avoid the delay, the hardware can be organized to allow the result of one ALU operation to be available for another ALU operation immediately, in the cycle that follows. This technique is called operand forwarding, and the blue connection lines shown in Figure 7.13 can be added for this purpose. After decoding instruction I2, the control circuitry determines that the source operand of I2 is the same as the destination operand of I1. Hence, the Operand Fetch operation, which would have taken place in the Decode stage, is inhibited. In the next clock cycle, the control hardware in the Operate stage arranges for the source operand to come directly from register RSLT, as indicated by the word "fwd" in Figure 7.15a. Thus, execution of I2 proceeds without interruption.

In the case of a longer pipeline, operation may be stalled for a few cycles, even with operand forwarding. An example of a six-stage pipeline is shown in Figure 7.15b. In

[Figure 7.14: Interruption of pipelined operation because of data dependency. Clock cycles 1-8: I1 (Add R1,R2,R3) proceeds F, D, O, W; I2 (Shift_left R3) is fetched and decoded, but its read of R3 cannot complete until cycle 5, so the pipeline is stalled for two cycles before I2's remaining steps.]

[Figure 7.15: Instruction execution using operand forwarding. (a) Short pipeline: I1 (Add R1,R2,R3) proceeds F, D, O, W; I2 (Shift_left R3) receives its operand via "fwd" in its Operate step and is not stalled. (b) Long pipeline: each instruction's Operate phase consists of three steps O¹, O², O³ (blue shading); forwarding happens only after O³, so the O¹ step of the dependent instruction is delayed by two cycles.]

this case, the Operate phase of instruction execution consists of three pipeline steps, O¹, O², and O³ (see blue shading). When needed, operand forwarding takes place after all three steps have been completed. In this case, the O¹ phase of instruction I2 is delayed by two cycles.

Data dependency also occurs when operands are read from the memory. Consider, for example, the two instructions:

I1: Load (R1),R2
I2: Add R2,R3,R4

One possible sequence of events during execution of these instructions is shown in Figure 7.16. The processor uses one clock cycle to obtain the contents of R1 and one cycle to read the source operand of instruction I1 from the data cache. Because the destination register, R2, appears as a source register in I2, the memory data are forwarded to the ALU directly as soon as they are received. For example, because of the path shown in blue in Figure 7.13a, when the memory data are available at the input of the register file, they are also available at the input of the ALU. Thus, operand forwarding makes it possible to continue execution of I2 without stalling the pipeline. Of course, in the case of a cache miss, the pipeline is stalled and execution of both I1 and I2 is delayed until the requested data are read from the main memory or a secondary cache.
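The detection step described above, which compares the destination register of the instruction in the Operate stage against the source registers of the instruction being decoded, can be sketched as follows; the tuple-based instruction format is invented for illustration:

```python
# Minimal hazard check in the spirit of Figure 7.13: the control logic
# compares the destination register of the instruction in the Operate
# stage with the source registers of the instruction being decoded.
def needs_forwarding(in_operate, in_decode):
    """in_operate and in_decode are (opcode, dest, src1, src2) tuples."""
    _, dest, _, _ = in_operate
    _, _, s1, s2 = in_decode
    return dest in (s1, s2)

add   = ("Add", "R3", "R1", "R2")          # I1: Add R1,R2,R3  (R3 <- [R1] + [R2])
shift = ("Shift_left", "R3", "R3", None)   # I2: Shift_left R3

# I2 reads R3 while I1 is still computing it: take the value from RSLT.
print(needs_forwarding(add, shift))   # True
```

When the check is true, the Operand Fetch in the Decode stage is inhibited and the value is taken from RSLT instead, exactly as the text describes.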

[Figure 7.16: Operand forwarding in a read operation (assuming a cache hit). I1 (Load (R1),R2): F, D (read R1), memory read, W (write R2); I2 (Add R2,R3,R4) obtains R2 by forwarding ("fwd") as soon as the memory data arrive, so the pipeline is not stalled.]

In Figure 7.14, the data dependency is discovered by the hardware while the instruction is being decoded in cycle 3. The dependency involves the contents of R3, which would normally have been read during the same cycle. Since this value is not yet available, the control hardware delays reading it until cycle 5. Such dependencies can also be dealt with in software. In this case, the task of detecting the dependency and introducing the delay needed to resolve it is left to the compiler. For example, the compiler must introduce a two-cycle delay between instructions I1 and I2 in Figure 7.14 by inserting NOP instructions, as follows:

I1: Add R1,R2,R3
    NOP
    NOP
I2: Shift_left R3

Without the NOP instructions, the processor would use the old contents of R3 in instruction I2, producing an incorrect result.

The possibility of having the compiler introduce the delay needed for correct operation illustrates the close link between the compiler and the hardware. A particular feature can be either implemented in hardware or left to the compiler. Leaving tasks such as inserting NOP instructions to the compiler has an important advantage. Being aware of the need for a delay, the compiler can attempt to place useful instructions in the NOP slots. This is done by reordering instructions, as in the case of the delayed branch in Figure 7.10.
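The compiler-side fix can be sketched as a small pass over the instruction stream; the instruction representation and the fixed two-cycle delay are assumptions for illustration:

```python
# Compiler-side fix sketched in the text: when an instruction reads a
# register written by the immediately preceding instruction, pad with
# NOPs to cover the (assumed) two-cycle stall. Instructions are
# (opcode, dest, sources) tuples in a made-up format, not a real ISA.
def insert_nops(program, delay=2):
    out = []
    for instr in program:
        prev = out[-1] if out else None
        if prev and prev[0] != "NOP" and prev[1] in instr[2]:
            out.extend([("NOP", None, ())] * delay)   # fill the stall slots
        out.append(instr)
    return out

prog = [("Add", "R3", ("R1", "R2")), ("Shift_left", "R3", ("R3",))]
padded = insert_nops(prog)
print([op for op, _, _ in padded])   # ['Add', 'NOP', 'NOP', 'Shift_left']
```

A real compiler would then try to fill those NOP slots with independent instructions from elsewhere in the program, as the text notes.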

7.4.1 Side Effects

The data dependencies encountered in the two preceding examples are explicit and easily detected, because the register involved is named as the destination in one instruction and as a source in the next. Sometimes an instruction changes the contents of a register
