Chapter IV: The Processor
By Nguyen Duong Quynh Nhi from USTH Learning Support
Jun 2025
Contents
1 Logic Design Conventions
1.3 Clocking Methodology
2 Building a Datapath
2.1 Instruction Fetch
2.2 R-Format Instructions
2.3 Composing the Elements
4 An Overview of Pipelining
4.1 Pipeline Timing and Performance
4.2 Pipeline Speedup
4.3 Why RISC-V Fits Pipelining Well
4.4 Types of Pipeline Hazards
4.5 Structural Hazards
4.6 Data Hazards
4.7 Control Hazards (Branches)
5 Pipelined Datapath and Control
5.5 Single-Cycle Pipeline Diagram
6 Data Hazards: Forwarding versus Stalling
6.1 Forwarding
6.2 Stalling (Hazard Detection)
7 Control/Branch Hazards
7.1 What is a Branch Hazard?
7.2 How to Reduce Delays from Branches
7.3 What is Branch Prediction?
7.4 1-Bit Predictor
7.5 2-Bit Predictor
8 Exceptions
8.3 Handler Actions
9 Parallelism via Instructions
9.5 Dynamic Multiple Issue
9.8 Power Efficiency
12 Fallacies and Pitfalls
12.1 Fallacies
12.2 Pitfalls
1 Logic Design Conventions
• Binary encoding: information is represented as low or high electrical signals, i.e., as 0s and 1s, with one wire per bit.
• Combinational element: the output depends only on the current input (the output is a function of the input).
• State (sequential) element: its output depends on current inputs and previously stored values (it has internal storage).
– Inputs:
∗ A data value to be stored
∗ A clock signal that determines when the data is stored (typically on rising
edge)
– Output:
∗ The data that was stored in a previous clock cycle
∗ Output remains stable until a new value is written on a clock edge
1.3 Clocking Methodology
• Combinational logic connects two state elements: all signals must propagate through the logic and reach the next state element within one clock cycle (edge-triggered clocking).
2 Building a Datapath
• The datapath is the part of the processor that:
– Processes data
– Handles addresses
2.1 Instruction Fetch
• Fetching an instruction involves three elements:
– Instruction Memory:
∗ A memory unit that holds all the program instructions.
∗ It receives an address (from the PC) and returns the instruction stored at
that location.
– Program Counter (PC):
∗ A special register that keeps track of the address of the current instruction.
∗ It is initially set to the address of the first instruction in the program.
∗ After each instruction is fetched, the PC is updated to point to the next
instruction.
– Adder (Incrementer):
∗ A simple combinational circuit that adds 4 to the current PC value.
∗ This new address (PC + 4) becomes the location of the next instruction,
assuming sequential execution.
2.2 R-Format Instructions
• R-format instructions perform operations using only registers. They use three registers: two source registers that are read and one destination register that is written.
2.3 Composing the Elements
• For memory-access and branch instructions, the datapath operates as follows:
– If it is a load instruction:
∗ Use the computed address to read a value from memory.
∗ Write the loaded value to the destination register.
– If it is a store instruction:
∗ Read the value from the source register (the one to store).
∗ Write this value to the memory at the computed address.
– If it is a branch instruction:
∗ Use the ALU to compare two registers (via subtraction and checking Zero).
∗ If the condition is met:
· Sign-extend and shift the offset left by 1 (to form a byte address).
· Add this to the current PC to compute the branch target.
· Update PC to the new target address.
∗ Otherwise, simply increment PC by 4.
• The datapath uses two separate memory units:
– Instruction memory (for fetching).
– Data memory (for load/store).
• By using this datapath and adding a simple control function, we can build a simple implementation.
• The ALU's role differs by instruction class:
– Load/Store:
∗ Use the ALU to compute the memory address (an addition).
– Branch:
∗ The ALU performs a subtraction (to compare the two registers).
– R-type instructions:
∗ The ALU operation depends on the value of the funct field.
∗ It can perform one of five actions (AND, OR, subtract, add, set on less than).
– NOR is needed for other parts of the MIPS instruction set.
• A 2-bit control signal, ALUOp, is generated based on the instruction type (opcode).
• This signal helps determine the correct operation the ALU should perform (e.g., add, subtract) using simple combinational logic.
• To better understand this, refer to the following table. It shows how the 4-bit ALU control signals are derived from the ALUOp control bits and, for R-type instructions, the funct field:

ALUOp | Funct field | Operation         | ALU control
00    | xxxxxx      | add (lw/sw)       | 0010
01    | xxxxxx      | subtract (beq)    | 0110
10    | 100000      | add               | 0010
10    | 100010      | subtract          | 0110
10    | 100100      | AND               | 0000
10    | 100101      | OR                | 0001
10    | 101010      | set on less than  | 0111
• The ALU control unit is a combinational logic circuit: it generates its outputs directly from these inputs, without any clocking (a C sketch follows).
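• As a minimal C sketch of this two-level decoding (the function name is illustrative; the encodings are the classic MIPS ones from the table above):

#include <stdint.h>

/* Derive the 4-bit ALU control from the 2-bit ALUOp and, for R-type
   instructions, the 6-bit funct field. */
uint8_t alu_control(uint8_t alu_op, uint8_t funct) {
    if (alu_op == 0) return 0x2;      /* lw/sw: add for address calc */
    if (alu_op == 1) return 0x6;      /* beq: subtract to compare    */
    switch (funct) {                  /* ALUOp == 10: R-type         */
    case 0x20: return 0x2;            /* add                         */
    case 0x22: return 0x6;            /* subtract                    */
    case 0x24: return 0x0;            /* AND                         */
    case 0x25: return 0x1;            /* OR                          */
    case 0x2A: return 0x7;            /* set on less than            */
    default:   return 0xF;            /* undefined funct             */
    }
}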
• A multiplexer is added to select the correct destination register (either rt or rd) based
on instruction type.
• Control signals can be defined using a truth table that maps each opcode to a set of
signal values (1 = assert, 0 = deassert, X = don’t care).
⇒ This modular control logic simplifies design and improves speed and clarity.
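• A minimal sketch of such a truth-table decoder in C, using the classic MIPS opcodes and signal names; don't-care entries are encoded here as 0:

#include <stdbool.h>
#include <stdint.h>

/* One row of the main-control truth table. */
typedef struct {
    bool reg_dst, alu_src, mem_to_reg, reg_write;
    bool mem_read, mem_write, branch;
    uint8_t alu_op;                   /* the 2-bit ALUOp */
} Control;

/* Pure combinational decode: opcode in, signal values out. */
Control decode_control(uint8_t opcode) {
    switch (opcode) {
    case 0x00: return (Control){1,0,0,1, 0,0,0, 2};  /* R-format */
    case 0x23: return (Control){0,1,1,1, 1,0,0, 0};  /* lw       */
    case 0x2B: return (Control){0,1,0,0, 0,1,0, 0};  /* sw: RegDst, MemtoReg don't care  */
    case 0x04: return (Control){0,0,0,0, 0,0,1, 1};  /* beq: RegDst, MemtoReg don't care */
    default:   return (Control){0};                  /* unknown: all deasserted */
    }
}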
1. Instruction Fetch:
– Use PC to fetch the instruction from memory.
– Increment PC by 4 to point to the next instruction.
2. Decode and Read Registers:
– Decode instruction type (R, I, S, B, etc.).
– Read operands from source registers.
– Sign-extend immediate if needed.
3. Execute / Compute:
– ALU performs operation:
∗ Arithmetic or logic (e.g., add, sub).
∗ Compute address for memory access.
∗ Compare values for branches.
4. Memory Access / Write Back:
– Load: Read from memory → write to register.
– Store: Write register value → memory.
– R/I-type: Write ALU result to register.
– Branch: Update PC if condition is true.
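These four steps can also be mimicked in software. The following toy C interpreter (a sketch, not a hardware description) walks the same steps for two pre-encoded RV32I instructions; the demo program and memory contents are invented:

#include <stdint.h>
#include <stdio.h>

static uint32_t x[32];           /* register file       */
static uint32_t data_mem[64];    /* data memory (words) */

int main(void) {
    /* add x3, x1, x2 ; lw x4, 0(x3): pre-encoded RV32I words */
    uint32_t instr_mem[] = { 0x002081B3, 0x0001A203 };
    x[1] = 4; x[2] = 4; data_mem[2] = 42;

    for (uint32_t pc = 0; pc < sizeof instr_mem; pc += 4) {
        uint32_t i   = instr_mem[pc / 4];         /* 1. fetch via PC; loop header does PC+4 */
        uint32_t rd  = (i >> 7)  & 31;            /* 2. decode fields                       */
        uint32_t rs1 = (i >> 15) & 31;
        uint32_t rs2 = (i >> 20) & 31;
        int32_t  imm = (int32_t)i >> 20;          /*    sign-extended I-immediate           */
        if ((i & 0x7F) == 0x33)                   /* 3. execute: R-type add                 */
            x[rd] = x[rs1] + x[rs2];              /* 4. no memory step; 5. write-back       */
        else if ((i & 0x7F) == 0x03)              /* 3. compute address for lw              */
            x[rd] = data_mem[(x[rs1] + imm) / 4]; /* 4. memory read; 5. write-back          */
        x[0] = 0;                                 /* x0 is hard-wired to zero               */
    }
    printf("x3=%u x4=%u\n", x[3], x[4]);          /* expect x3=8 x4=42 */
    return 0;
}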
4 An Overview of Pipelining
• The classic analogy is a laundry with four steps per load:
– In a non-pipelined approach, you complete all 4 steps for one load before starting the next:
∗ Wash → Dry → Fold → Put Away
– In pipelining, each stage works in parallel on different loads:
∗ While one load is drying, the next can start washing.
∗ While one load is being folded, another can be drying, and another washing.
⇒ This overlap improves performance.
• Example: the laundry analogy applied to instruction execution.
• The MIPS/RISC-V pipeline breaks instruction execution into five stages (one step per stage): IF (Instruction Fetch), ID (Instruction Decode and register read), EX (Execute or address calculation), MEM (Memory access), and WB (Write Back).
4.1 Pipeline Timing and Performance
• In a single-cycle design, the clock cycle must be long enough to fit the slowest instruction.
• In pipelining, each stage takes one cycle, and a new instruction starts every cycle.
• So even though a single instruction still takes multiple stages, we finish one instruction
per clock cycle after the pipeline fills.
• If stages are not balanced, speedup will be less than the number of stages.
4.2 Pipeline Speedup
• Speedup comes from higher instruction throughput: more instructions completed per unit time.
• Latency (time for a single instruction to complete all stages) stays about the same or
even slightly increases due to overhead.
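• The textbook idealization (assuming perfectly balanced stages and no pipeline overhead) is:

\[ \text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}} \]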
4.3 Why RISC-V Fits Pipelining Well
• All instructions are 32 bits ⇒ easy to fetch and decode in one cycle.
• Load/store model allows address computation and memory access to be separated into
clean stages.
4.5 Structural Hazards
• Problem: instruction fetch and data access both use the same memory in a naive design.
⇒ Fix: use separate instruction and data memories (or caches).
4.6 Data Hazards
• Forwarding: pass results directly between pipeline stages without waiting for the register write.
• Works for most cases, except load-use situations (when a value is loaded and used in
the next instruction).
4.7 Control Hazards (Branches)
• Branches delay execution because the next instruction depends on the outcome.
• Branch Prediction:
– Static: Predict based on fixed rules (e.g., backward branches are taken).
– Dynamic: Use hardware to track actual outcomes and make smarter guesses.
5 Pipelined Datapath and Control
• Special registers (e.g., IF/ID, ID/EX, EX/MEM, MEM/WB) hold data and control signals
between stages.
• These registers separate each stage and help synchronize data flow through the pipeline.
• This increases instruction throughput, though the time to complete a single instruction
(latency) stays the same.
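• As an illustrative C struct (field names are ours, not the book's): the IF/ID register simply latches what the decode stage needs; ID/EX, EX/MEM, and MEM/WB are analogous but carry more fields:

#include <stdint.h>

/* Contents latched into IF/ID at each clock edge. */
typedef struct {
    uint32_t pc;      /* PC of the fetched instruction (needed for branches) */
    uint32_t instr;   /* the fetched 32-bit instruction                      */
} IF_ID_Reg;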
5.5 Single-Cycle Pipeline Diagram
• Control signals are generated during the ID stage based on instruction type.
6 Data Hazards: Forwarding versus Stalling
• Data hazards happen when an instruction depends on a result that is not yet available.
6.1 Forwarding
• Paths: connect the outputs of the EX/MEM and MEM/WB pipeline registers back to the ALU inputs.
⇒ Avoids reading outdated values from the register file.
• Conditions for forwarding:
– The instruction in EX/MEM or MEM/WB will write to a register.
– That destination register matches a source register of the current instruction.
• Double data hazard: two back-to-back instructions both depend on the same earlier instruction:
– Forwarding logic must check both the EX/MEM and MEM/WB stages.
– Forward if the needed data is coming from either stage.
– Priority is given to the most recent value (EX/MEM); see the sketch below.
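• A minimal C sketch of this selection (struct and field names are illustrative):

#include <stdbool.h>
#include <stdint.h>

/* The slice of a pipeline register the forwarding unit inspects. */
typedef struct {
    bool    reg_write;   /* will this older instruction write a register? */
    uint8_t rd;          /* its destination register number               */
} PipeStage;

/* Returns 2 = forward from EX/MEM, 1 = forward from MEM/WB,
   0 = no hazard, read the register file. EX/MEM is checked first
   because it holds the most recent value; register x0 is never
   forwarded since it is hard-wired to zero. */
int forward_select(uint8_t rs, PipeStage ex_mem, PipeStage mem_wb) {
    if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == rs)
        return 2;
    if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == rs)
        return 1;
    return 0;
}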
6.2 Stalling (Hazard Detection)
• Forwarding cannot fix a load-use hazard, where a load is immediately followed by an instruction that uses the loaded register:

lw x1, 0(x2)
add x3, x1, x4 ← needs x1 before it's loaded

– Detected when:
∗ the EX stage is doing a load, and
∗ the ID stage wants to use the loaded register.
– In this case, the pipeline must stall for one cycle.
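• The check itself is one combinational condition (a C sketch with illustrative names):

#include <stdbool.h>
#include <stdint.h>

/* Load-use hazard: the instruction in EX is a load (mem_read set) and
   the instruction in ID reads the register being loaded. Forwarding
   cannot reach back in time, so the pipeline inserts one bubble. */
bool load_use_stall(bool ex_mem_read, uint8_t ex_rd,
                    uint8_t id_rs1, uint8_t id_rs2) {
    return ex_mem_read && ex_rd != 0 &&
           (ex_rd == id_rs1 || ex_rd == id_rs2);
}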
7 Control/Branch Hazards
7.1 What is a Branch Hazard?
• A branch hazard happens when the CPU doesn't know which instruction to fetch next, because it is still waiting to find out whether the branch is taken.
• For example, with beq x1, x2, target the CPU must first check whether x1 == x2.
• While waiting for this result, the pipeline might need to pause. This pause is called a
stall.
7.2 How to Reduce Delays from Branches
• The CPU tries to resolve the branch earlier, during the ID stage, so it doesn't waste time.
• It also calculates the branch target address early: target = PC + (sign-extended offset << 1).
• If a branch depends on data from a previous instruction, the CPU tries to forward
that data. If it’s not ready, it must stall.
7.3 What is Branch Prediction?
• Branch prediction means the CPU guesses whether the branch will be taken or not,
to keep the pipeline running without delay.
• If the guess is wrong, the CPU throws away the wrong instructions and corrects itself.
7.4 1-Bit Predictor
• Dynamic prediction: the CPU watches past branch behavior and uses it to make better guesses in the future.
• A 1-bit predictor remembers only the last outcome and guesses that the branch will do the same thing next time.
7.5 2-Bit Predictor
• More accurate: it needs two wrong guesses in a row to change the prediction.
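• A 2-bit saturating counter captures this in a few lines of C (a sketch; real predictors keep a table of such counters indexed by branch address):

#include <stdbool.h>

/* States 0-1 predict not taken, 2-3 predict taken; it takes two
   consecutive mispredictions to flip the prediction. */
typedef struct { unsigned state; } TwoBitPredictor;   /* state in 0..3 */

bool predict_taken(const TwoBitPredictor *p) { return p->state >= 2; }

void train(TwoBitPredictor *p, bool taken) {
    if (taken  && p->state < 3) p->state++;   /* saturate at 3 */
    if (!taken && p->state > 0) p->state--;   /* saturate at 0 */
}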
8 Exceptions
• Exceptions are events that stop the normal flow of a program so that something important can be handled.
• The handler takes care of the problem, then returns to the original program.
• Instead of handling each type of error with separate logic, the CPU uses a single handler entry point together with a register that records the cause of the exception (SCAUSE in RISC-V).
• This simplifies the datapath and makes exception handling more efficient.
8.3 Handler Actions
• If an exception happens:
– The address of the offending instruction is saved (in SEPC in RISC-V).
– The cause of the exception is recorded (in SCAUSE).
– Control transfers to the handler, which services the event and then resumes the program (or aborts it).
• Precise Exception: CPU state looks like the exception occurred exactly at one in-
struction. Easy to debug.
• Imprecise Exception: Some later instructions may have executed. Harder to handle.
– Happens when the pipeline can't cleanly stop at the faulting instruction.
9 Parallelism via Instructions
• In static multiple issue, the compiler decides which instructions can run in parallel.
• It groups instructions into bundles and schedules them ahead of time.
• This method relies on:
– The compiler to reorder instructions.
– Careful checking for data dependencies.
• Hazards are problems that occur when instructions interact in unsafe ways:
– Data hazard: An instruction needs data that isn’t ready yet.
– Structural hazard: Two instructions need the same hardware.
– Control hazard: Uncertainty in what instruction to fetch next (due to branches).
9.5 Dynamic Multiple Issue
• In dynamic issue, the hardware (not the compiler) decides how many and which in-
structions to issue each cycle.
• The hardware schedules instructions out-of-order to avoid stalls and make better use
of the CPU.
• Speculation means guessing the result of an instruction (like a branch) before it's actually known.
• Types of speculation:
– Compiler (software) speculation: the compiler reorders instructions ahead of time and adds fix-up code in case a guess is wrong.
– Hardware speculation: the processor runs ahead and buffers results until the guess is confirmed, discarding them otherwise.
9.8 Power Efficiency
• Executing more instructions in parallel usually requires more hardware and power.
• Designers must balance speed gains with energy use to build efficient systems.
• The Intel Core i7 is a powerful processor that uses many smart techniques to run
programs faster:
– Multiple Cores: It can run several programs or parts of a program at the same
time.
– Out-of-Order Execution: It doesn’t always wait for one instruction to finish
before starting the next. It runs instructions when the needed parts are ready.
– Branch Prediction: It guesses which path the program will take, so it doesn’t
waste time waiting.
– Register Renaming: It avoids confusion when two instructions use the same
register name.
– Hyper-Threading: Each core can handle two tasks at once, acting like it’s doing
double the work.
• These features work together to make the CPU run faster and more efficiently by doing
more work at the same time.
• On the software side, compiler and coding techniques raise performance further (a sketch of the first follows this list):
– Loop Unrolling: Instead of repeating small steps, we combine them into bigger
steps to save time.
– Reordering Instructions: We change the order of steps to avoid waiting and
use the CPU better.
– Register Blocking: We keep data in the fastest memory (registers), so we don’t
have to go back to slower memory as often.
– Software Pipelining: We start the next step before the current one finishes, so
everything keeps moving.
• These help use the CPU more fully → finish calculations much faster.
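• A minimal C example of loop unrolling (assumes n is a multiple of 4, so no cleanup loop is shown):

/* Unrolled by 4: fewer loop-condition checks per element, and four
   independent adds per iteration that the processor can overlap. */
void vec_add_unrolled(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}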
12 Fallacies and Pitfalls
12.1 Fallacies
• “Pipelining is easy”
– The idea is simple, but real designs are complex and tricky to get right.
– Pipeline design depends on available technology. What worked before (like delayed
branches) may not be best now. Modern designs change with faster chips and
power limits.
12.2 Pitfalls
• Failure to consider instruction set design can adversely impact pipelining.
– Example: MIPS and RISC-V are easier to pipeline than older, complex ISAs like VAX.