Branch Prediction and Speculative Execution
There are various ways in which we can decrease the number of stall cycles in the case of conditional branch (or any branch) statements, i.e., ways to avoid stall cycles. These methods do not work successfully all the time, but they do often enough to be worth using:
(1) Branch prediction (guess)
(2) Speculative execution
Speculative execution means executing instructions before we know for certain that they are needed; flushing is involved if my guess turns out to be wrong (e.g., the branch condition is actually false). Suppose we have predicted that the branch won't be taken, i.e., that the condition for the branch being taken is false. Then we could start fetching and executing the instructions that follow the branch in sequence. After the branch completes execution, if we find that our assumption was wrong and the branch should have been taken, we have to go through flushing again.
Assumption is right: good, we gain performance and avoid stalls.
Assumption is wrong: the speculatively executed instructions must be flushed from the pipeline (stall).
There are various ways in which this prediction is implemented in different processors (static prediction):
- The branch will always be taken (in some processors).
- The branch will never be taken (in some processors): such processors simply fetch the consecutive instruction; if, after executing the branch, they find that the branch is taken, they incur the penalty and flush the pipeline again (stall).
The performance of these strategies depends on what type of branch instructions our program contains.
Dynamic prediction: to improve prediction accuracy further, we can let the actual branch behavior influence the prediction; the resulting scheme is dynamic branch prediction, and its simplest implementation uses a Branch Target Buffer (BTB). The simplest form of a dynamic predictor algorithm uses the result of the most recent execution of a branch instruction: the processor assumes that the next time the instruction is executed, the branch decision is likely to be the same as the last time.
Two techniques:

1. Two-state algorithm (single-bit predictor)
A simple predictor that uses a single bit to represent the last outcome of a branch (taken or not taken). The algorithm may be described by a two-state machine with states:
  LT:  branch is likely to be taken
  LNT: branch is not likely to be taken
Suppose the algorithm is started in state LNT. When the branch instruction is executed and the branch is taken, the machine moves to state LT; otherwise, it remains in state LNT. The next time the same instruction is encountered, the branch is predicted as taken if the machine is in state LT; otherwise, it is predicted as not taken.
State diagram: BT (branch taken) moves the machine from LNT to LT and keeps it in LT; BNT (branch not taken) moves it from LT back to LNT and keeps it in LNT. B is the bit reserved for each branch instruction (B = 1 in state LT, B = 0 in state LNT).
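To make the two-state scheme concrete, here is a minimal sketch in C (mine, not from the notes): one bit per branch records the last outcome, the prediction simply repeats it, and the bit is overwritten after every execution of the branch.

#include <stdio.h>
#include <stdbool.h>

/* Single-bit (two-state) predictor: B = 1 means state LT (predict taken),
   B = 0 means state LNT (predict not taken). */
typedef struct { bool B; } OneBitPredictor;

static bool predict_1bit(const OneBitPredictor *p) {
    return p->B;                 /* predict "taken" iff in state LT */
}

static void update_1bit(OneBitPredictor *p, bool taken) {
    p->B = taken;                /* remember only the most recent outcome */
}

int main(void) {
    /* A loop branch that is taken 3 times and then falls through, executed
       twice: the single-bit predictor mispredicts twice per execution of the
       loop (first pass and last pass). */
    bool outcomes[] = { true, true, true, false,   /* first execution  */
                        true, true, true, false }; /* second execution */
    OneBitPredictor p = { false };                 /* start in LNT     */
    int wrong = 0;
    for (int i = 0; i < 8; i++) {
        if (predict_1bit(&p) != outcomes[i]) wrong++;
        update_1bit(&p, outcomes[i]);
    }
    printf("single-bit predictor: %d mispredictions out of 8\n", wrong);
    return 0;
}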
2. Four-state algorithm (double-bit / 2-bit predictor)
In the two-state algorithm, two mispredictions result on each repeated execution of the same loop: one on the first pass and one on the last pass. Better prediction accuracy can be achieved by keeping more information about the execution history. The 2-bit algorithm uses four states, as shown below:
  ST:  strongly likely to be taken
  LT:  likely to be taken
  LNT: likely not to be taken
  SNT: strongly likely not to be taken
Transitions between the states occur on BT (branch taken) and BNT (branch not taken) outcomes; the branch is predicted as taken while the machine is in state ST or LT, and as not taken while it is in LNT or SNT. With this scheme, repeated execution of the same loop results in only one misprediction, on the last pass.
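A sketch of the four-state idea in C, written as the common 2-bit saturating-counter variant (the exact transition rules are an assumption chosen to match the ST/LT/LNT/SNT states above, not copied from the notes): once the counter is in a strong state, two consecutive outcomes in the other direction are needed before the prediction changes.

#include <stdio.h>
#include <stdbool.h>

/* Four states encoded as a saturating counter:
   0 = SNT (strongly not taken), 1 = LNT, 2 = LT, 3 = ST (strongly taken). */
typedef struct { int state; } TwoBitPredictor;

static bool predict_2bit(const TwoBitPredictor *p) {
    return p->state >= 2;                           /* taken in LT or ST */
}

static void update_2bit(TwoBitPredictor *p, bool taken) {
    if (taken)  { if (p->state < 3) p->state++; }   /* move toward ST  */
    else        { if (p->state > 0) p->state--; }   /* move toward SNT */
}

int main(void) {
    /* Same loop pattern as before: taken 3 times, then falls through,
       executed twice.  After warming up, the 2-bit predictor mispredicts
       only once per execution of the loop (on the last pass). */
    bool outcomes[] = { true, true, true, false,
                        true, true, true, false };
    TwoBitPredictor p = { 1 };                      /* start in LNT */
    int wrong = 0;
    for (int i = 0; i < 8; i++) {
        if (predict_2bit(&p) != outcomes[i]) wrong++;
        update_2bit(&p, outcomes[i]);
    }
    printf("2-bit predictor: %d mispredictions out of 8\n", wrong);
    return 0;
}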
Second drawback of the single-bit branch predictor (two-state algorithm): consider a branch with the outcome sequence (T = taken, N = not taken)
  T N T N T N T N T ...
The single-bit predictor predicts the next outcome of a branch based only on its present outcome. Hence, when a branch has an alternating outcome sequence like this, the accuracy is very poor: every prediction is wrong. In the 2-bit predictor, we change the prediction only after the predictor has been wrong twice in a row, so it does better on such a sequence.
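To see the difference on the alternating pattern above, here is a small self-contained comparison in C (a sketch; the 2-bit counter is deliberately started in the strongly-taken state so that two consecutive misses are needed before its prediction flips): the single-bit predictor is wrong on every outcome of T N T N ..., while the 2-bit predictor is wrong only about half the time.

#include <stdio.h>
#include <stdbool.h>

int main(void) {
    bool one_bit = false;     /* single-bit predictor, starts in LNT          */
    int  two_bit = 3;         /* 2-bit counter 0..3, starts in ST             */
    int  wrong1 = 0, wrong2 = 0;
    const int N = 8;

    for (int i = 0; i < N; i++) {
        bool taken = (i % 2 == 0);            /* outcome sequence T, N, T, N, ... */

        if (one_bit != taken) wrong1++;       /* 1-bit: predict the last outcome  */
        one_bit = taken;

        bool guess = (two_bit >= 2);          /* 2-bit: predict taken if counter >= 2 */
        if (guess != taken) wrong2++;
        if (taken) { if (two_bit < 3) two_bit++; } else { if (two_bit > 0) two_bit--; }
    }
    printf("alternating T/N: 1-bit wrong %d/%d, 2-bit wrong %d/%d\n",
           wrong1, N, wrong2, N);             /* prints 8/8 and 4/8               */
    return 0;
}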
Branch Target Buffer (BTB)
It is a small, fast memory (cache) in which branch information is organized in the form of a lookup table, where each entry includes:
- the address of the branch instruction,
- the state bit(s) for the branch prediction algorithm,
- the branch target address.

  | Branch instruction address | State bit(s) | Branch target address |

The next time we come across the same branch instruction, the processor obtains the branch prediction state bits based on the address of the instruction being fetched from memory. Using this address, the processor finds out whether there is a hit in the BTB, i.e., whether the currently fetched instruction is a branch instruction that has already been seen and entered in the Branch Target Buffer (BTB), and what its predicted behavior is. If the prediction turns out to be wrong, the entry in the BTB is updated.
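A minimal sketch of the lookup-table idea in C (the entry layout, the direct-mapped indexing, and the 2-bit state field are illustrative assumptions, not a description of any particular processor): each entry pairs a branch instruction's address with its prediction state bits and its target address, and the table is consulted with the address of the instruction currently being fetched.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 16                       /* small, fast table */

typedef struct {
    bool     valid;
    uint32_t branch_addr;                    /* address of the branch instruction */
    int      state;                          /* 2-bit prediction state, 0..3      */
    uint32_t target_addr;                    /* branch target address             */
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

/* Look up the address of the instruction being fetched.  A hit means this
   instruction is a branch we have seen before; the prediction is "taken"
   when the stored state bits say so, and *target receives the stored target. */
static bool btb_lookup(uint32_t fetch_addr, bool *predict_taken, uint32_t *target) {
    BTBEntry *e = &btb[(fetch_addr >> 2) % BTB_ENTRIES];
    if (e->valid && e->branch_addr == fetch_addr) {
        *predict_taken = (e->state >= 2);
        *target = e->target_addr;
        return true;                         /* BTB hit  */
    }
    return false;                            /* BTB miss */
}

/* After the branch is resolved, enter it (or update its state bits). */
static void btb_update(uint32_t branch_addr, bool taken, uint32_t target) {
    BTBEntry *e = &btb[(branch_addr >> 2) % BTB_ENTRIES];
    if (!e->valid || e->branch_addr != branch_addr) {
        e->valid = true;
        e->branch_addr = branch_addr;
        e->state = taken ? 2 : 1;            /* weak initial state */
    } else if (taken) {
        if (e->state < 3) e->state++;
    } else {
        if (e->state > 0) e->state--;
    }
    e->target_addr = target;
}

int main(void) {
    bool taken; uint32_t target;
    uint32_t branch = 0x200;                 /* hypothetical branch address */

    printf("hit before update: %d\n", btb_lookup(branch, &taken, &target));
    btb_update(branch, true, 0x100);         /* branch resolved taken, target 0x100 */
    if (btb_lookup(branch, &taken, &target))
        printf("hit: predict %s, target 0x%x\n",
               taken ? "taken" : "not taken", (unsigned)target);
    return 0;
}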
Furthermore, we assume that the branch condition and target address are computed in one cycle and that instructions are fetched four at a time from the instruction cache. Instruction I1 compares two operands, and instruction I2 is a conditional branch based on the result of the comparison. Assume that the fetch unit predicts that the branch will not be taken and leaves instructions I3 and I4 in the instruction queue. Thus, instruction I3 begins execution after I2, followed by I4. The final decision on the branch cannot be made until after the condition code flags have been updated at the end of the compare instruction. (Later, we will see that the decision can sometimes be made a little earlier.) Hence, after step W1, the instruction fetch unit realizes that the prediction was incorrect and that the instructions in the execution pipe must be purged. Purge cycles are marked "p" in the figure. Four new instructions, Ik through Ik+3, are fetched, starting at the address computed in step E2 (only Ik is shown in the figure).
Note that write steps, in which results are stored in the destination registers, cannot be allowed to proceed until the branch condition has been resolved. Had instruction I2 taken longer to complete, step W3 and any subsequent steps would have had to be delayed. We return to this issue in Section 7.6.
Branch prediction can be done in one of two ways. It can be done by the compiler, in which case it is encoded in the branch instruction. The OP-code word of the instruction indicates to the instruction fetch unit whether this branch should be predicted as taken or not taken. The prediction result is the same every time a given branch instruction is encountered; hence, this approach is called static branch prediction.
Another approach is to have the processor hardware assess the likelihood of a branch being taken every time a branch instruction is encountered. This can be done by keeping track of the result of the branch decision the last time that instruction was executed and assuming that the decision in the current instance is likely to be the same. Clearly, the prediction result may be different for different instances of execution of the same branch instruction. This approach is called dynamic branch prediction. Both static and dynamic prediction lead to significant improvement in performance.
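As a small illustration of the static case, the sketch below (in C, with an entirely hypothetical instruction encoding; real instruction sets differ) shows a fetch unit consulting a "predict taken" hint bit that the compiler has placed in the OP-code word:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical 32-bit encoding: bits 31..26 = opcode, bit 25 = "predict
   taken" hint set by the compiler, bits 15..0 = signed word offset.        */
#define OPCODE(w)     ((w) >> 26)
#define HINT_TAKEN(w) (((w) >> 25) & 1u)
#define OFFSET(w)     ((int16_t)((w) & 0xFFFFu))
#define OP_BRANCH     0x11u                          /* made-up branch opcode  */

/* The fetch unit picks the next address using only the instruction word.   */
static uint32_t next_fetch_addr(uint32_t pc, uint32_t word) {
    if (OPCODE(word) == OP_BRANCH && HINT_TAKEN(word))
        return pc + 4 + 4 * (int32_t)OFFSET(word);   /* predicted taken        */
    return pc + 4;                                   /* predicted fall-through */
}

int main(void) {
    /* A branch with the hint bit set and an offset of +6 instructions.     */
    uint32_t word = (OP_BRANCH << 26) | (1u << 25) | 6u;
    printf("next fetch address: 0x%x\n", (unsigned)next_fetch_addr(0x1000, word));
    return 0;
}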
7.4 DATA DEPENDENCY
Section 7.1.2 introduced the idea of data dependency, which arises when a source operand for an instruction depends on the results of execution of a preceding instruction. If the results of the execution of the preceding instruction have not yet been recorded in their respective registers, the pipeline is stalled.
Consider a processor that uses the four-stage pipeline in Figure 7.2b. The first stage fetches an instruction from the cache. The second stage decodes the instruction and reads the source operands from the register file. The third stage performs an ALU operation, and the fourth stage stores the result in the destination location. Assume that some processor instructions have three operands: two source operands and one destination operand. A hardware organization that supports these features is shown in Figure 7.13. Part a of the figure shows the connection between the ALU and the register file. The register file allows three different locations to be accessed simultaneously during each clock period: two locations provide the two source operands, SRC1 and SRC2, which are transferred in parallel to the input registers of the ALU.
[Figure 7.13: Operand forwarding in a pipelined processor. (a) Part of the data paths of a CPU: the register file supplies the Source 1 and Source 2 operands to the ALU input registers SRC1 and SRC2, and the ALU result register RSLT is written back to the destination in the register file. (b) Position of the source and result registers in the processor pipeline: SRC1 and SRC2 feed the O (Operate, ALU) stage, RSLT feeds the W (Write, register file) stage, and a forwarding data path connects RSLT back to the ALU inputs.]
At the same time, the contents of RSLT, the result register, are stored in a third location in the register file. Thus, at the end of each clock period, two new source operands are loaded into SRC1 and SRC2, and a new result at the output of the ALU is stored in the RSLT register. That result will be transferred to the register file in the following clock cycle.
Figure 7.13b shows the positions of the three registers, SRC1, SRC2, and RSLT, in the pipeline, assuming the overall structure is similar to that in Figure 7.2b. The three registers are part of the interstage buffers used to pass data from one stage to the next during pipelined operation.
Consider the following two instructions:
I1: Add R1,R2,R3
I2: Shift_left R3
The timing of the execution of these two instructions is depicted in Figure 7.14. The Add instruction, which is fetched in the first clock cycle, performs the operation R3 ← [R1] + [R2]. The contents of the two source registers, R1 and R2, are read during the second clock cycle and are clocked into registers SRC1 and SRC2. Their sum is generated by the ALU and loaded into the ALU output register, RSLT, during clock cycle 3. From there, it is transferred into register R3 in the register file in cycle 4. What happens to the source operand of instruction I2? The processor hardware must detect that the source operand for this instruction is the same as the destination operand of the previous instruction, and it must ensure that the updated value is used. Hence, the decode stage, which is responsible for getting the source operand from R3, cannot complete this task until cycle 5; so the pipeline is stalled for two cycles. Subsequent instructions proceed as shown, for the same reasons as Section 7.1.2 explains for Figure 7.6.
To avoid the delay, the hardware can be organized to allow the result of one ALU operation to be available for another ALU operation in the cycle that immediately follows. This technique is called operand forwarding, and the blue connection lines shown in Figure 7.13 can be added for this purpose. After decoding instruction I2, the control circuitry determines that the source operand of I2 is the same as the destination operand of I1. Hence, the Operand Fetch operation, which would have taken place in the Decode stage, is inhibited. In the next clock cycle, the control hardware in the Operate stage arranges for the source operand to come directly from register RSLT, as indicated by the word "fwd" in Figure 7.15a. Thus, execution of I2 proceeds without interruption.
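The control decision described above can be sketched in a few lines of C (a software simulation sketch, not the actual hardware; the Decode-stage check and the Operate-stage operand selection are compressed into one step here): after decoding, the new instruction's source registers are compared with the previous instruction's destination, and on a match the stale register-file value is ignored in favour of the value waiting in RSLT.

#include <stdio.h>

typedef struct { int dst, src1, src2; } Instr;   /* register numbers         */

static int regfile[8];                           /* architectural registers  */
static int RSLT;                                 /* ALU output register      */

/* Fetch the operands of 'cur'; if a source matches the destination of the
   immediately preceding instruction 'prev', take the value from RSLT.       */
static void read_operands_with_forwarding(const Instr *cur, const Instr *prev,
                                          int *op1, int *op2) {
    *op1 = (cur->src1 == prev->dst) ? RSLT : regfile[cur->src1];
    *op2 = (cur->src2 == prev->dst) ? RSLT : regfile[cur->src2];
}

int main(void) {
    regfile[1] = 5; regfile[2] = 7;              /* R1 = 5, R2 = 7           */

    Instr i1 = { 3, 1, 2 };                      /* I1: Add R1,R2,R3         */
    Instr i2 = { 3, 3, 3 };                      /* I2: Shift_left R3        */

    /* Operate stage of I1: the sum sits in RSLT, not yet written to R3.     */
    RSLT = regfile[i1.src1] + regfile[i1.src2];

    /* Operand fetch for I2: R3 is still stale in the register file, so the
       value is forwarded from RSLT instead.                                 */
    int op1, op2;
    read_operands_with_forwarding(&i2, &i1, &op1, &op2);
    printf("I2 operand = %d (forwarded); register file still holds %d\n",
           op1, regfile[3]);
    return 0;
}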
In the case of a longer pipeline, operation may be stalled for a few cycles, even with operand forwarding. An example of a six-stage pipeline is shown in Figure 7.15b.
[Figure 7.14: Interruption of pipelined operation because of data dependency. Clock cycles 1 to 8 are shown for I1 (Add, reads R1 and R2, writes R3) and I2 (Shift R3); the operand read of I2 is delayed until the result of I1 has been written into R3.]
[Figure 7.15: Instruction execution using operand forwarding. (a) Short pipeline: the Operate step of I2 (Shift R3) receives the result of I1 (Add, writes R3) via "fwd". (b) Long pipeline: the Operate phase occupies steps O1, O2, and O3, and forwarding takes place only after O3, so the dependent instruction is delayed.]
In this case, the Operate phase of instruction execution consists of three pipeline steps, O1, O2, and O3 (see the blue shading). When needed, operand forwarding takes place after all three steps have been completed. As a result, the O1 phase of instruction I2 is delayed by two cycles.
Data dependency also occurs when operands are read from the memory. Consider, for example, the two instructions:
I1: Load (R1),R2
I2: Add R2,R3,R4
One possible sequence of events during execution of these instructions is shown in Figure 7.16. The processor uses one clock cycle to obtain the contents of R1 and one cycle to read the source operand of instruction I1 from the data cache. Because the destination register, R2, appears as a source register in I2, the memory data are forwarded to the ALU directly as soon as they are received. For example, because of the path shown in blue in Figure 7.13a, when the memory data are available at the input of the register file, they are also available at the input of the ALU. Thus, operand forwarding makes it possible to continue execution of I2 without stalling the pipeline. Of course, in the case of a cache miss, the pipeline is stalled and execution of both I1 and I2 is delayed until the requested data are read from the main memory or a secondary cache.
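In the same spirit as the previous sketch, the fragment below (again a C simulation sketch, not hardware) shows the memory data being used by the Add as soon as the cache supplies them, without waiting for the write into R2; on a cache miss the model simply reports a stall.

#include <stdio.h>
#include <stdbool.h>

static int regfile[8];
static int data_cache_value = 42;        /* toy contents of address [R1]      */

/* Execute "Load (R1),R2" followed by "Add R2,R3,R4" with the memory data
   forwarded to the ALU.  Returns false to signal a stall on a cache miss.    */
static bool load_then_add(bool cache_hit) {
    if (!cache_hit)
        return false;                    /* pipeline stalls until data arrive */

    int mem_data = data_cache_value;     /* data read by the Load (I1)        */
    /* The Add (I2) needs R2, which has not been written back yet; the memory
       data go straight to the ALU input instead.                             */
    int alu_result = mem_data + regfile[3];
    regfile[2] = mem_data;               /* write-back of the Load            */
    regfile[4] = alu_result;             /* write-back of the Add             */
    return true;
}

int main(void) {
    regfile[3] = 8;                      /* R3 = 8                            */
    if (load_then_add(true))
        printf("R4 = %d (Add used the forwarded memory data)\n", regfile[4]);
    if (!load_then_add(false))
        printf("cache miss: both I1 and I2 are delayed\n");
    return 0;
}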
[Figure 7.16: Operand forwarding in a read operation (assuming a cache hit). I1 (Load) fetches, reads R1, reads the memory operand, and loads it into R2; the memory data are forwarded ("fwd") directly to the Add step of I2.]
In Figure 7.14, the data dependency is discovered by the hardware while the instruction is being decoded in cycle 3. The dependency involves the contents of R3, which would normally have been read during the same cycle. Since this value is not yet available, the control hardware delays reading it until cycle 5. Such dependencies can also be dealt with in software. In this case, the task of detecting the dependency and introducing the delay needed to resolve it is left to the compiler. For example, the compiler must introduce a two-cycle delay between instructions I1 and I2 in Figure 7.14 by inserting NOP instructions, as follows:
I1: Add R1,R2,R3
    NOP
    NOP
I2: Shift_left R3

Without the NOP instructions, the processor would use the old contents of R3 in instruction I2, producing an incorrect result.
The possibility of having the compiler introduce the delay needed for correct operation illustrates the close link between the compiler and the hardware. A particular feature can be either implemented in hardware or left to the compiler. Leaving tasks such as inserting NOP instructions to the compiler has an important advantage. Being aware of the need for a delay, the compiler can attempt to place useful instructions in the NOP slots. This is done by reordering instructions, as in the case of the delayed branch in Figure 7.10.
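A sketch of what such a compiler pass might look like, in C (the instruction representation and the fixed two-delay-slot rule are simplifying assumptions matching the four-stage pipeline example above): whenever an instruction reads a register written by a recently emitted instruction, enough NOPs are inserted to separate the two by at least two slots. A smarter pass would try to fill those slots with useful, independent instructions instead of NOPs, as the paragraph above notes.

#include <stdio.h>

#define MAX_INSTR 32
#define NO_REG    (-1)

typedef struct {
    char name[16];                 /* mnemonic text, for printing            */
    int  dst;                      /* destination register, or NO_REG        */
    int  src1, src2;               /* source registers, or NO_REG            */
} Instr;

/* Emit the program with NOPs inserted so that any instruction reading a
   register comes at least three slots after the instruction that writes it
   (i.e. two delay slots, as the four-stage pipeline above requires).        */
static int insert_nops(const Instr *in, int n, Instr *out) {
    int last_writer[32];                              /* emitted slot of last write */
    for (int r = 0; r < 32; r++) last_writer[r] = -100;
    const Instr nop = { "NOP", NO_REG, NO_REG, NO_REG };

    int m = 0;
    for (int i = 0; i < n; i++) {
        int srcs[2] = { in[i].src1, in[i].src2 };
        int need = 0;
        for (int s = 0; s < 2; s++) {
            if (srcs[s] != NO_REG) {
                int gap = m - last_writer[srcs[s]];   /* distance from producer */
                if (gap < 3 && 3 - gap > need) need = 3 - gap;
            }
        }
        while (need-- > 0) out[m++] = nop;            /* pad with NOPs          */
        out[m] = in[i];
        if (in[i].dst != NO_REG) last_writer[in[i].dst] = m;
        m++;
    }
    return m;
}

int main(void) {
    Instr prog[] = {
        { "Add R1,R2,R3",  3, 1, 2      },            /* I1: writes R3          */
        { "Shift_left R3", 3, 3, NO_REG },            /* I2: reads R3           */
    };
    Instr out[MAX_INSTR];
    int m = insert_nops(prog, 2, out);
    for (int i = 0; i < m; i++) printf("%s\n", out[i].name);  /* two NOPs appear */
    return 0;
}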
7.4.1 Side Effects
The data dependencies encountered in the two preceding examples are explicit and easily detected, because the register involved is named as the destination in instruction I1 and as a source in I2. Sometimes an instruction changes the contents of a register other than the one named as the destination.