Lecture #3 - Processing of Control Transfer
Instructions
Review: Data Dependency
We have discussed the different types of data dependencies and how they cause different hazards (e.g., RAW, WAR, WAW). Let us look at an example with some pseudo-instructions:
; i1: load a
mov  r9, [a]
; i2: load b
mov  r10, [b]
; i3: add
lea  rcx, [r9 + r10]
; i4: multiply
mov  rdx, r9
imul rdx, r10
; i5: divide
mov  rax, r9
cqo
idiv r10
mov  r11, rax
Code Example 1: Instruction sequence exhibiting data dependency hazards
Due to the instruction flow (i.e., how registers are used), we have different dependencies that require waiting on data from previous instructions; these waits are what cause the hazards.
It may be hard to see them in the listing, so let us look at a graph version:
Figure 1: Data Dependency Graph of Data Dependencies Hazards (nodes i1–i5 connected by edges labeled RAW, WAR, and WAW)
The Control Hazard Problem
What are Control Transfer Instructions (CTIs)?
• Instructions that change the Program Counter (PC) non-sequentially.
• Branches: Conditional change based on data/flags (e.g., je, jne).
• Jumps: Unconditional change (e.g., jmp).
• Function Calls: Jump + save return address (call).
• Returns: Jump to saved return address (ret).
The Pipeline Problem (Control Hazard):
• The pipeline fetches instructions sequentially (PC+4).
• By the time a CTI is identified and its outcome/target address is known (often late in the pipeline, e.g., the ID or EX stage), several subsequent (potentially incorrect) instructions may have already entered the pipeline.
• Pipeline must stall or flush these incorrect instructions, creating "bubbles" and reducing performance (IPC drops, CPI increases).
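A back-of-the-envelope calculation shows why this matters (the 20% branch frequency and 3-cycle resolution penalty below are assumed numbers for illustration, not measurements). If every branch stalls fetch until it resolves:

CPI = CPI_base + f_branch × penalty = 1.0 + 0.20 × 3 = 1.6

a 60% increase in CPI from control hazards alone, which is why simply waiting is not an option.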
Figure 2: A load instruction followed by an immediate use results in a one-cycle stall
Early Solutions (and their limitations)
• Stall/Freeze Pipeline: Simplest approach. Stop fetching new instructions
once a branch is detected until its outcome and target are known.
– Problem: Creates significant performance loss (multiple cycles per branch).
• Predict Branch Not Taken: Always fetch the sequential instruction (PC+4).
If the branch is taken, flush the incorrectly fetched instruction(s).
– Problem: Many branches are taken (especially loop branches), so there is still significant flushing.
• Predict Branch Taken: Always assume the branch is taken.
– Problem: Requires knowing the target address early. Still stalls if predicted
incorrectly.
• (Optional: Delayed Branch): The instruction(s) immediately following the
branch are always executed, regardless of the branch outcome. Compiler tries
to fill slot(s) with useful work.
– Problem: Hard for compilers to fill slots effectively, complex for deeper pipelines, breaks precise exception model, largely obsolete in modern high-performance designs.
Dynamic Branch Prediction - Core Idea
Goal: Predict the outcome (Taken (T) / Not Taken (NT)) and target address of a branch dynamically at runtime, based on past behavior.
Why it Works: Program behavior, especially branches (e.g., loops, error checks), is
often repetitive and predictable.
Key components:
• Outcome Prediction: Predict (or guess) Taken/Not Taken.
• Target Address Prediction: Predict/guess the destination PC if taken.
Integration: Prediction often happens early (IF or ID stage) to avoid fetch stalls.
Branch Outcome Prediction: Simple Predictors
Branch History Table (BHT) / Pattern History Table (PHT): A small memory indexed by (part of) the branch instruction's PC; each entry stores prediction state.
1-bit Predictor: Stores the outcome of the last execution. Flips prediction on a
single mispredict.
• Problem: Mispredicts twice per execution of a typical loop: once on the exit (last iteration predicted taken, branch is NT) and once on re-entry (the bit has flipped to NT, branch is taken); see the trace below.
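To make the double misprediction concrete, trace a loop branch that is taken three times and then exits (outcomes T, T, T, NT), with the stored bit initially T:

outcome:     T    T    T    NT   |  T     T    T    NT
prediction:  T    T    T    T    |  NT    T    T    T
result:      ok   ok   ok   miss |  miss  ok   ok   miss

The exit flips the stored bit to NT, so the first iteration of the next execution also mispredicts: two mispredictions per loop execution.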
2-bit Saturating Counter Predictor: Uses 4 states (e.g., Strongly Taken, Weakly Taken, Weakly Not Taken, Strongly Not Taken). Requires two consecutive mispredictions to change from strongly T/NT.
• Much better performance, especially for loops. Standard building block.
Figure 3: 2-bit prediction scheme
Transitions between these states occur based on whether the branch is taken or not.
The key advantage is hysteresis: a single misprediction doesn’t immediately flip the
prediction, which stabilizes the predictor in noisy conditions.
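As a minimal C sketch of one such counter (the state encoding below is an arbitrary choice for illustration; hardware implementations vary):

#include <stdint.h>
#include <stdbool.h>

/* States: 0 = strongly NT, 1 = weakly NT, 2 = weakly T, 3 = strongly T. */
typedef uint8_t counter2_t;

/* Predict taken while in either of the two "taken" states. */
static bool predict(counter2_t c) { return c >= 2; }

/* Saturating update: step toward the actual outcome, clamping at 0 and 3. */
static counter2_t update(counter2_t c, bool taken) {
    if (taken) return (c < 3) ? c + 1 : 3;
    else       return (c > 0) ? c - 1 : 0;
}

On the loop trace above, a counter starting in Strongly Taken drops only to Weakly Taken at the exit, so the next execution begins with a correct Taken prediction: one misprediction per loop execution instead of two.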
Advanced Outcome Prediction: Correlating Predictors
Idea: The outcome of a branch may depend on the outcome of other, recent branches.
Example:
if (aa == 2) ...   // B1
if (bb == 2) ...   // B2
if (aa != bb) {    // B3
    ...
}
The outcome of B3 depends on B1 and B2: if both of their conditions hold (aa == 2 and bb == 2), then aa == bb, so B3's condition must be false.
• (m, n) Predictor: Uses the behavior of the last m branches (global history) to choose among 2^m different n-bit predictors for the current branch.
• Global History Register (GHR): Shift register recording outcomes of last
m branches.
• Implementation (e.g., gshare): Combine (XOR) global history with branch
PC bits to index into a single large table of 2-bit counters. Reduces table size
compared to having separate tables per history pattern.
Figure 4: A gshare predictor with 1024 entries (each being a standard 2-bit predictor).
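A hedged C sketch of the gshare scheme from Figure 4 (the 1024-entry size matches the figure; the history width and shift amounts are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

#define GSHARE_BITS 10                 /* 1024 entries, as in Figure 4 */
#define GSHARE_SIZE (1u << GSHARE_BITS)

static uint8_t  pht[GSHARE_SIZE];      /* 2-bit saturating counters */
static uint16_t ghr;                   /* global history register */

/* XOR the low PC bits with the global history to form the table index. */
static uint32_t gshare_index(uint64_t pc) {
    return (uint32_t)((pc >> 2) ^ ghr) & (GSHARE_SIZE - 1);
}

static bool gshare_predict(uint64_t pc) {
    return pht[gshare_index(pc)] >= 2; /* counter in a "taken" state? */
}

/* After the branch resolves: update the counter, then shift the actual
   outcome into the history register. */
static void gshare_update(uint64_t pc, bool taken) {
    uint8_t *c = &pht[gshare_index(pc)];
    if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
    ghr = (uint16_t)(((ghr << 1) | (taken ? 1 : 0)) & (GSHARE_SIZE - 1));
}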
Advanced Outcome Prediction: Tournament Predictors
Idea: Different branches might be predicted better by different strategies (e.g., some
correlate well with global history, others with their own local history). Use multiple
predictors and dynamically select the best one for each branch. Structure:
Figure 5: Tournament Predictor
• Typically combines a local predictor (based only on the history of this branch)
and a global predictor (like gshare).
• A "Choice Predictor" (meta-predictor), often another table of 2-bit counters, tracks which underlying predictor (local or global) has been more accurate recently for a given branch/history, and selects its prediction.
Performance: Generally offers higher accuracy than either local or global alone.
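A minimal C sketch of the selection logic, assuming local_predict, global_predict, and the choice table already exist (all names here are hypothetical):

#include <stdint.h>
#include <stdbool.h>

/* Choice counter per entry: >= 2 means "trust the global predictor". */
extern uint8_t choice[1024];
extern bool local_predict(uint64_t pc);
extern bool global_predict(uint64_t pc);

static bool tournament_predict(uint64_t pc) {
    uint32_t i = (uint32_t)(pc >> 2) & 1023;
    return (choice[i] >= 2) ? global_predict(pc) : local_predict(pc);
}

/* The choice counter moves only when the two predictors disagree,
   stepping toward whichever one turned out to be right. */
static void choice_update(uint64_t pc, bool local_ok, bool global_ok) {
    uint32_t i = (uint32_t)(pc >> 2) & 1023;
    if (local_ok == global_ok) return;
    if (global_ok) { if (choice[i] < 3) choice[i]++; }
    else           { if (choice[i] > 0) choice[i]--; }
}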
Advanced Outcome Prediction: Hybrid/Tagged Predictors
(TAGE)
Motivation: Longer history can be better, but requires huge tables and suffers from
cold starts/interference. Need to balance history length and table size/accuracy.
TAGE (Tagged Geometric History Length): State-of-the-art approach.
• Uses multiple predictor tables, indexed by different (geometrically increasing)
lengths of global history combined with PC.
• Tables are tagged to detect if an entry belongs to the current branch/history
(reduces interference).
• Uses partial tags for efficiency.
• Prediction comes from the longest matching history table entry. Has sophisticated update/allocation mechanisms.
Performance: Outperforms gshare and simple tournament predictors, especially with a limited storage budget.
Figure 6: Five-Component Tagged Hybrid Predictor
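A deliberately simplified C sketch of the lookup rule only ("longest matching tagged table wins, else fall back to a base predictor"); the hash helpers, table sizes, and history lengths are placeholder assumptions, and real TAGE adds usefulness counters and careful allocation on mispredictions:

#include <stdint.h>
#include <stdbool.h>

#define NTABLES 4                     /* tagged tables, longer history each */

typedef struct { uint16_t tag; uint8_t ctr; } tage_entry_t;

extern tage_entry_t tables[NTABLES][1024];
extern int hist_len[NTABLES];         /* geometric lengths, e.g., 4, 8, 16, 32 */

/* Hypothetical hashes of the PC + the first len bits of global history. */
extern uint32_t tage_index(uint64_t pc, int len);
extern uint16_t tage_tag(uint64_t pc, int len);
extern bool base_predict(uint64_t pc); /* simple bimodal fallback */

static bool tage_predict(uint64_t pc) {
    /* Scan from the longest history down; the first tag match wins. */
    for (int t = NTABLES - 1; t >= 0; t--) {
        tage_entry_t *e = &tables[t][tage_index(pc, hist_len[t])];
        if (e->tag == tage_tag(pc, hist_len[t]))
            return e->ctr >= 4;       /* 3-bit counter: >= 4 predicts taken */
    }
    return base_predict(pc);          /* no tagged component matched */
}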
Predicting the Branch Target Address
Problem: Outcome prediction isn't enough for taken branches/jumps. We also need the target address early to redirect fetch, and decoding the instruction to compute the target is too slow.
Solution → Branch Target Buffer (BTB):
• A small cache memory indexed by the address (PC) of the CTI.
• Stores the predicted target address for that CTI (if previously taken).
• Often stores the branch prediction state (e.g., 2-bit counter) as well.
Operation:
• During IF, PC indexes the BTB.
• BTB Hit: Branch is predicted (based on stored state); target address is available immediately. Fetch redirects if predicted taken.
• BTB Miss: Assume not a branch, or assume not taken. Fetch PC+4. If later
found to be a taken branch, flush and redirect (causes penalty).
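A minimal C sketch of the lookup step (the direct-mapped organization, 512-entry size, and field widths are assumptions for illustration):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t tag;      /* upper PC bits, to confirm the entry matches */
    uint64_t target;   /* predicted target address */
    uint8_t  ctr;      /* 2-bit outcome counter stored alongside */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[512];

/* Returns the next fetch PC: predicted target on a taken hit, else PC+4. */
static uint64_t btb_next_pc(uint64_t pc) {
    btb_entry_t *e = &btb[(pc >> 2) & 511]; /* PC bits [10:2] index the table */
    if (e->valid && e->tag == (pc >> 11) && e->ctr >= 2)
        return e->target;   /* hit and predicted taken: redirect fetch */
    return pc + 4;          /* miss, or hit predicted not taken: sequential */
}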
Figure 7: Branch Target Buffer (BTB) mechanism
The BTB enables early fetching of the next instruction (speculatively) before the
branch is resolved, minimizing control hazards and pipeline stalls.
Figure 8: BTB Lookup and Handling Process
This flowchart details how a branch instruction is handled with a BTB:
1. Instruction Fetch (IF): The PC is used to index the BTB.
2. Match Found: If a matching branch is found, fetch begins from the predicted
target.
3. No Match: If no match is found, fetch proceeds sequentially.
4. Branch Execution: The actual branch is resolved in the Execute (EX) stage.
5. Update BTB: If the prediction was incorrect or if the branch was not in the
BTB, update or insert a new entry.
This sequence is vital for understanding speculative execution and control hazard resolution.
Figure 9: Penalty Scenarios Based on BTB State
Handling Returns: Return Address Stack (RAS)
Problem: Function returns are indirect jumps (the target address is in a register or on the stack), and the target varies depending on the call site. BTBs don't work well for them because the same ret instruction goes to different targets.
Observation: Calls and returns are typically nested and matched.
Solution → Return Address Stack (RAS):
• A small hardware stack.
• On a function call (jal, call), hardware pushes the return address (PC+4) onto the RAS.
• On function return (ret, jr), hardware predicts the target by popping the address
from the top of the RAS.
Performance: Very effective, significantly improves return prediction accuracy.
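A C sketch of the stack behavior (the 16-entry depth and wrap-around overflow policy are assumptions; real designs differ in how they handle overflow and misspeculation):

#include <stdint.h>

#define RAS_DEPTH 16

static uint64_t ras[RAS_DEPTH];
static unsigned ras_top;               /* count of pushes minus pops */

/* On call: push the fall-through address (the return point). */
static void ras_push(uint64_t pc) {
    ras[ras_top % RAS_DEPTH] = pc + 4; /* wrap on overflow, losing the oldest */
    ras_top++;
}

/* On ret: predict the target by popping the top of the stack. */
static uint64_t ras_pop(void) {
    if (ras_top > 0) ras_top--;        /* if empty, the prediction is stale */
    return ras[ras_top % RAS_DEPTH];
}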
Figure 10: Return Address Buffer Prediction Accuracy
Most unconditional indirect branches come from function returns. A single BTB entry can remember only one target, so calls from different sites cause it to forget the return address from previous calls. Hence the return address buffer organized as a stack, now a standard feature in nearly all modern superscalar CPUs.
Integrated Instruction Fetch Units (IFU)
Modern processors often have a dedicated IFU responsible for providing a high-bandwidth stream of correct-path instructions.
Combines:
• PC generation logic.
• Branch prediction (outcome prediction tables like PHT/TAGE).
• Branch Target Buffer (BTB).
• Return Address Stack (RAS).
• Instruction Cache (I-Cache) access.
• Instruction Buffering/Queuing.
The IFU is a monolithic hardware unit designed to optimize instruction supply by bundling multiple responsibilities traditionally split across control and decode logic. It enables high-throughput, low-latency instruction delivery in superscalar and speculative execution pipelines.
It handles:
• Branch prediction
• Instruction prefetching
• Fetch-ahead logic
• Instruction memory access
• Instruction buffering
• Cache line boundary management
This integration is essential for wide-issue processors and architectures with speculative execution, where a high instruction fetch rate is critical to keep the backend full.
Branch Folding Optimization: If a BTB entry holds the predicted target instruction itself (not just its address), the IFU can provide the target instruction directly, potentially skipping the I-Cache access latency for taken branches.
Speculative Execution
What are the consequences of Prediction?
• Once a branch is predicted (outcome + target), the processor doesn’t wait. It
fetches and executes instructions from the predicted path. This is speculation.
Why Speculate?
• Avoid stalling the pipeline, crucial for exploiting ILP across branches.
Challenge → What if prediction was wrong?
• Must not let speculative instructions change the architectural state permanently
(registers, memory) until the branch is confirmed.
• Must be able to efficiently discard the results of speculative work and recover
by starting fetch/execute down the correct path.
Mechanisms:
• Reorder Buffer (ROB) or similar structures buffer results of speculative instructions.
• Instructions commit (update architectural state) in program order, only after they are confirmed to be on the correct path.
• A mispredicted branch causes the ROB entries of all subsequent speculative instructions to be flushed; a sketch follows this list.
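A very schematic C sketch of that commit/flush discipline (heavily simplified; a real ROB also tracks destination registers, result values, and exception state, and the empty/full bookkeeping here is naive):

#include <stdbool.h>

#define ROB_SIZE 64

typedef struct {
    bool done;          /* has the instruction finished executing? */
    bool mispredicted;  /* a branch whose prediction turned out wrong */
    /* ... destination register, result value, exception bits ... */
} rob_entry_t;

static rob_entry_t rob[ROB_SIZE];
static unsigned head, tail;   /* head = oldest entry, tail = next free slot */

/* Commit in program order: only the oldest finished instruction may update
   architectural state. A mispredicted branch at the head discards all
   younger (speculative) entries before fetch restarts on the correct path. */
static void rob_commit_step(void) {
    if (head == tail || !rob[head].done) return; /* nothing ready to commit */
    if (rob[head].mispredicted)
        tail = (head + 1) % ROB_SIZE;            /* flush everything younger */
    /* ... write the head's result to architectural state ... */
    head = (head + 1) % ROB_SIZE;
}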
Overall
• Control Transfer Instructions are fundamental but pose a major challenge to
pipelined performance due to control hazards.
• Stalling is too slow for modern processors.
• Dynamic Branch Prediction (predicting outcome and target address) is essential.
– Techniques evolved from simple 2-bit counters to complex correlating, tournament, and hybrid predictors (e.g., TAGE).
– BTBs predict target addresses for direct branches/jumps.
– RAS predicts target addresses for function returns.
• Prediction enables Speculative Execution, allowing the pipeline to continue
working down the predicted path, further hiding branch latency.
• Managing speculation (buffering results, recovering from mispredicts) requires
complex hardware like ROBs.
• Effective handling of control flow is a cornerstone of high-performance processor
design.