Chapter IV: The Processor
By Nguyen Duong Quynh Nhi from USTH Learning Support
Jun 2025
Contents
1 Logic Design Conventions
1.3 Clocking Methodology
2 Building a Datapath
2.1 Instruction Fetch
2.2 R-Format Instructions
2.3 Composing the Elements
4 An Overview of Pipelining
4.1 Pipeline Timing and Performance
4.2 Pipeline Speedup
4.3 Why RISC-V Fits Pipelining Well
4.4 Types of Pipeline Hazards
4.5 Structural Hazards
4.6 Data Hazards
4.7 Control Hazards (Branches)
5 Pipelined Datapath and Control
5.5 Single-Cycle Pipeline Diagram
6 Data Hazards: Forwarding versus Stalling
6.1 Forwarding
6.2 Stalling (Hazard Detection)
7 Control/Branch Hazards
7.1 What is a Branch Hazard?
7.2 How to Reduce Delays from Branches
7.3 What is Branch Prediction?
7.4 1-Bit Predictor
7.5 2-Bit Predictor
8 Exceptions
8.3 Handler Actions
9 Parallelism via Instructions
9.5 Dynamic Multiple Issue
9.8 Power Efficiency
12 Fallacies and Pitfalls
12.1 Fallacies
12.2 Pitfalls
1 Logic Design Conventions
• Binary encoding: information is represented as low or high electrical signals, i.e., as 0s and 1s, with one wire per bit.
• Combinational element: the output depends only on the current input (the output is a function of the input).
• State (sequential) element: its output depends on current inputs and previously stored values (it has internal storage).
– Inputs:
∗ A data value to be stored
∗ A clock signal that determines when the data is stored (typically on rising
edge)
– Output:
∗ The data that was stored in a previous clock cycle
∗ Output remains stable until a new value is written on a clock edge
1.3 Clocking Methodology
• Combinational logic connects two state elements: all signals must propagate through the logic and reach the next state element within one clock cycle (edge-triggered clocking).
2 Building a Datapath
• The datapath is the part of the processor that:
– Processes data
– Handles addresses
2.1 Instruction Fetch
• Fetching an instruction involves three elements:
– Instruction Memory:
∗ A memory unit that holds all the program instructions.
∗ It receives an address (from the PC) and returns the instruction stored at
that location.
– Program Counter (PC):
∗ A special register that keeps track of the address of the current instruction.
∗ It is initially set to the address of the first instruction in the program.
∗ After each instruction is fetched, the PC is updated to point to the next
instruction.
– Adder (Incrementer):
∗ A simple combinational circuit that adds 4 to the current PC value.
∗ This new address (PC + 4) becomes the location of the next instruction,
assuming sequential execution.
2.2 R-Format Instructions
• R-format instructions perform operations using only registers. They use three registers: two source registers that are read and one destination register that is written.
2.3 Composing the Elements
• For memory-access and branch instructions, the datapath operates as follows:
– If it is a load instruction:
∗ Use the computed address to read a value from memory.
∗ Write the loaded value to the destination register.
– If it is a store instruction:
∗ Read the value from the source register (the one to store).
∗ Write this value to the memory at the computed address.
– If it is a branch instruction:
∗ Use the ALU to compare two registers (via subtraction and checking Zero).
∗ If the condition is met:
· Sign-extend and shift the offset left by 1 (to form a byte address).
· Add this to the current PC to compute the branch target.
· Update PC to the new target address.
∗ Otherwise, simply increment PC by 4.
• The datapath uses two separate memory units:
– Instruction memory (for fetching).
– Data memory (for load/store).
• By using this datapath and adding a simple control function, we can build a simple implementation.
• The ALU's role differs by instruction class:
– Load/Store:
∗ Use the ALU to compute the memory address (an addition).
– Branch:
∗ The ALU performs a subtraction (to compare the two registers).
– R-type instructions:
∗ The ALU operation depends on the value of the funct field.
∗ It can perform one of five actions (AND, OR, subtract, add, set on less than).
– NOR is needed for other parts of the MIPS instruction set.
• A 2-bit control signal, ALUOp, is generated based on the instruction type (opcode).
• This signal helps determine the correct operation the ALU should perform (e.g., add, subtract) using simple combinational logic.
• To better understand this, refer to the following table. It shows how the 4-bit ALU control signals are derived from the ALUOp control bits and, for R-type instructions, the funct field:

ALUOp | Funct field | Operation         | ALU control
00    | xxxxxx      | add (lw/sw)       | 0010
01    | xxxxxx      | subtract (beq)    | 0110
10    | 100000      | add               | 0010
10    | 100010      | subtract          | 0110
10    | 100100      | AND               | 0000
10    | 100101      | OR                | 0001
10    | 101010      | set on less than  | 0111
• The ALU control unit is a combinational logic circuit: it generates its outputs directly from these inputs, without any clocking (a C sketch follows).
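• As a minimal C sketch of this two-level decoding (the function name is illustrative; the encodings are the classic MIPS ones from the table above):

#include <stdint.h>

/* Derive the 4-bit ALU control from the 2-bit ALUOp and, for R-type
   instructions, the 6-bit funct field. */
uint8_t alu_control(uint8_t alu_op, uint8_t funct) {
    if (alu_op == 0) return 0x2;      /* lw/sw: add for address calc */
    if (alu_op == 1) return 0x6;      /* beq: subtract to compare    */
    switch (funct) {                  /* ALUOp == 10: R-type         */
    case 0x20: return 0x2;            /* add                         */
    case 0x22: return 0x6;            /* subtract                    */
    case 0x24: return 0x0;            /* AND                         */
    case 0x25: return 0x1;            /* OR                          */
    case 0x2A: return 0x7;            /* set on less than            */
    default:   return 0xF;            /* undefined funct             */
    }
}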
• A multiplexer is added to select the correct destination register (either rt or rd) based
on instruction type.
• Control signals can be defined using a truth table that maps each opcode to a set of
signal values (1 = assert, 0 = deassert, X = don’t care).
⇒ This modular control logic simplifies design and improves speed and clarity.
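• A minimal sketch of such a truth-table decoder in C, using the classic MIPS opcodes and signal names; don't-care entries are encoded here as 0:

#include <stdbool.h>
#include <stdint.h>

/* One row of the main-control truth table. */
typedef struct {
    bool reg_dst, alu_src, mem_to_reg, reg_write;
    bool mem_read, mem_write, branch;
    uint8_t alu_op;                   /* the 2-bit ALUOp */
} Control;

/* Pure combinational decode: opcode in, signal values out. */
Control decode_control(uint8_t opcode) {
    switch (opcode) {
    case 0x00: return (Control){1,0,0,1, 0,0,0, 2};  /* R-format */
    case 0x23: return (Control){0,1,1,1, 1,0,0, 0};  /* lw       */
    case 0x2B: return (Control){0,1,0,0, 0,1,0, 0};  /* sw: RegDst, MemtoReg don't care  */
    case 0x04: return (Control){0,0,0,0, 0,0,1, 1};  /* beq: RegDst, MemtoReg don't care */
    default:   return (Control){0};                  /* unknown: all deasserted */
    }
}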
1. Instruction Fetch:
– Use PC to fetch the instruction from memory.
– Increment PC by 4 to point to the next instruction.
2. Decode and Read Registers:
– Decode instruction type (R, I, S, B, etc.).
– Read operands from source registers.
– Sign-extend immediate if needed.
3. Execute / Compute:
– ALU performs operation:
∗ Arithmetic or logic (e.g., add, sub).
∗ Compute address for memory access.
∗ Compare values for branches.
4. Memory Access / Write Back:
– Load: Read from memory → write to register.
– Store: Write register value → memory.
– R/I-type: Write ALU result to register.
– Branch: Update PC if condition is true.
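These four steps can also be mimicked in software. The following toy C interpreter (a sketch, not a hardware description) walks the same steps for two pre-encoded RV32I instructions; the demo program and memory contents are invented:

#include <stdint.h>
#include <stdio.h>

static uint32_t x[32];           /* register file       */
static uint32_t data_mem[64];    /* data memory (words) */

int main(void) {
    /* add x3, x1, x2 ; lw x4, 0(x3): pre-encoded RV32I words */
    uint32_t instr_mem[] = { 0x002081B3, 0x0001A203 };
    x[1] = 4; x[2] = 4; data_mem[2] = 42;

    for (uint32_t pc = 0; pc < sizeof instr_mem; pc += 4) {
        uint32_t i   = instr_mem[pc / 4];         /* 1. fetch via PC; loop header does PC+4 */
        uint32_t rd  = (i >> 7)  & 31;            /* 2. decode fields                       */
        uint32_t rs1 = (i >> 15) & 31;
        uint32_t rs2 = (i >> 20) & 31;
        int32_t  imm = (int32_t)i >> 20;          /*    sign-extended I-immediate           */
        if ((i & 0x7F) == 0x33)                   /* 3. execute: R-type add                 */
            x[rd] = x[rs1] + x[rs2];              /* 4. no memory step; 5. write-back       */
        else if ((i & 0x7F) == 0x03)              /* 3. compute address for lw              */
            x[rd] = data_mem[(x[rs1] + imm) / 4]; /* 4. memory read; 5. write-back          */
        x[0] = 0;                                 /* x0 is hard-wired to zero               */
    }
    printf("x3=%u x4=%u\n", x[3], x[4]);          /* expect x3=8 x4=42 */
    return 0;
}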
4 An Overview of Pipelining
• The classic analogy is a laundry with four steps per load:
– In a non-pipelined approach, you complete all 4 steps for one load before starting the next:
∗ Wash → Dry → Fold → Put Away
– In pipelining, each stage works in parallel on different loads:
∗ While one load is drying, the next can start washing.
∗ While one load is being folded, another can be drying, and another washing.
⇒ This overlap improves performance.
• Example: the laundry analogy applied to instruction execution.
• The MIPS/RISC-V pipeline breaks instruction execution into five stages (one step per stage): IF (Instruction Fetch), ID (Instruction Decode and register read), EX (Execute or address calculation), MEM (Memory access), and WB (Write Back).
4.1 Pipeline Timing and Performance
• In a single-cycle design, the clock cycle must be long enough to fit the slowest instruction.
• In pipelining, each stage takes one cycle, and a new instruction starts every cycle.
• So even though a single instruction still takes multiple stages, we finish one instruction
per clock cycle after the pipeline fills.
• If stages are not balanced, speedup will be less than the number of stages.
4.2 Pipeline Speedup
• Speedup comes from higher instruction throughput: more instructions completed per unit time.
• Latency (time for a single instruction to complete all stages) stays about the same or
even slightly increases due to overhead.
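• The textbook idealization (assuming perfectly balanced stages and no pipeline overhead) is:

\[ \text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}} \]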
4.3 Why RISC-V Fits Pipelining Well
• All instructions are 32 bits ⇒ easy to fetch and decode in one cycle.
• Load/store model allows address computation and memory access to be separated into
clean stages.
4.5 Structural Hazards
• Problem: instruction fetch and data access both use the same memory in a naive design.
⇒ Fix: use separate instruction and data memories (or caches).
4.6 Data Hazards
• Forwarding: pass results directly between pipeline stages without waiting for the register write.
• Works for most cases, except load-use situations (when a value is loaded and used in
the next instruction).
4.7 Control Hazards (Branches)
• Branches delay execution because the next instruction depends on the outcome.
• Branch Prediction:
– Static: Predict based on fixed rules (e.g., backward branches are taken).
– Dynamic: Use hardware to track actual outcomes and make smarter guesses.
5 Pipelined Datapath and Control
• Special registers (e.g., IF/ID, ID/EX, EX/MEM, MEM/WB) hold data and control signals
between stages.
• These registers separate each stage and help synchronize data flow through the pipeline.
• This increases instruction throughput, though the time to complete a single instruction
(latency) stays the same.
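• As an illustrative C struct (field names are ours, not the book's): the IF/ID register simply latches what the decode stage needs; ID/EX, EX/MEM, and MEM/WB are analogous but carry more fields:

#include <stdint.h>

/* Contents latched into IF/ID at each clock edge. */
typedef struct {
    uint32_t pc;      /* PC of the fetched instruction (needed for branches) */
    uint32_t instr;   /* the fetched 32-bit instruction                      */
} IF_ID_Reg;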
5.5 Single-Cycle Pipeline Diagram
• Control signals are generated during the ID stage based on instruction type.
6 Data Hazards: Forwarding versus Stalling
• Data hazards happen when an instruction depends on a result that is not yet available.
6.1 Forwarding
• Paths: connect the outputs of the EX/MEM and MEM/WB pipeline registers back to the ALU inputs.
⇒ Avoids reading outdated values from the register file.
• Conditions for forwarding:
– The instruction in EX/MEM or MEM/WB will write to a register.
– That destination register matches a source register of the current instruction.
• Double data hazard: two back-to-back instructions both depend on the same earlier instruction:
– Forwarding logic must check both the EX/MEM and MEM/WB stages.
– Forward if the needed data is coming from either stage.
– Priority is given to the most recent value (EX/MEM); see the sketch below.
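• A minimal C sketch of this selection (struct and field names are illustrative):

#include <stdbool.h>
#include <stdint.h>

/* The slice of a pipeline register the forwarding unit inspects. */
typedef struct {
    bool    reg_write;   /* will this older instruction write a register? */
    uint8_t rd;          /* its destination register number               */
} PipeStage;

/* Returns 2 = forward from EX/MEM, 1 = forward from MEM/WB,
   0 = no hazard, read the register file. EX/MEM is checked first
   because it holds the most recent value; register x0 is never
   forwarded since it is hard-wired to zero. */
int forward_select(uint8_t rs, PipeStage ex_mem, PipeStage mem_wb) {
    if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == rs)
        return 2;
    if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == rs)
        return 1;
    return 0;
}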
6.2 Stalling (Hazard Detection)
• Forwarding cannot fix a load-use hazard, where a load is immediately followed by an instruction that uses the loaded register:

lw x1, 0(x2)
add x3, x1, x4 ← needs x1 before it's loaded

– Detected when:
∗ the EX stage is doing a load, and
∗ the ID stage wants to use the loaded register.
– In this case, the pipeline must stall for one cycle.
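• The check itself is one combinational condition (a C sketch with illustrative names):

#include <stdbool.h>
#include <stdint.h>

/* Load-use hazard: the instruction in EX is a load (mem_read set) and
   the instruction in ID reads the register being loaded. Forwarding
   cannot reach back in time, so the pipeline inserts one bubble. */
bool load_use_stall(bool ex_mem_read, uint8_t ex_rd,
                    uint8_t id_rs1, uint8_t id_rs2) {
    return ex_mem_read && ex_rd != 0 &&
           (ex_rd == id_rs1 || ex_rd == id_rs2);
}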
7 Control/Branch Hazards
7.1 What is a Branch Hazard?
• A branch hazard happens when the CPU doesn't know which instruction to fetch next, because it is still waiting to find out whether the branch is taken.
• For example, with beq x1, x2, target the CPU must first check whether x1 == x2.
• While waiting for this result, the pipeline might need to pause. This pause is called a
stall.
7.2 How to Reduce Delays from Branches
• The CPU tries to resolve the branch earlier, during the ID stage, so it doesn't waste time.
• It also calculates the branch target address early: target = PC + (sign-extended offset << 1).
• If a branch depends on data from a previous instruction, the CPU tries to forward
that data. If it’s not ready, it must stall.
7.3 What is Branch Prediction?
• Branch prediction means the CPU guesses whether the branch will be taken or not,
to keep the pipeline running without delay.
• If the guess is wrong, the CPU throws away the wrong instructions and corrects itself.
7.4 1-Bit Predictor
• Dynamic prediction: the CPU watches past branch behavior and uses it to make better guesses in the future.
• A 1-bit predictor remembers only the last outcome and guesses that the branch will do the same thing next time.
7.5 2-Bit Predictor
• More accurate: it needs two wrong guesses in a row to change the prediction.
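• A 2-bit saturating counter captures this in a few lines of C (a sketch; real predictors keep a table of such counters indexed by branch address):

#include <stdbool.h>

/* States 0-1 predict not taken, 2-3 predict taken; it takes two
   consecutive mispredictions to flip the prediction. */
typedef struct { unsigned state; } TwoBitPredictor;   /* state in 0..3 */

bool predict_taken(const TwoBitPredictor *p) { return p->state >= 2; }

void train(TwoBitPredictor *p, bool taken) {
    if (taken  && p->state < 3) p->state++;   /* saturate at 3 */
    if (!taken && p->state > 0) p->state--;   /* saturate at 0 */
}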
8 Exceptions
• Exceptions are events that stop the normal flow of a program so that something important can be handled.
• The handler takes care of the problem, then returns to the original program.
• Instead of handling each type of error with separate logic, the CPU uses a single handler entry point together with a register that records the cause of the exception (SCAUSE in RISC-V).
• This simplifies the datapath and makes exception handling more efficient.
8.3 Handler Actions
• If an exception happens:
– The address of the offending instruction is saved (in SEPC in RISC-V).
– The cause of the exception is recorded (in SCAUSE).
– Control transfers to the handler, which services the event and then resumes the program (or aborts it).
• Precise Exception: CPU state looks like the exception occurred exactly at one in-
struction. Easy to debug.
• Imprecise Exception: Some later instructions may have executed. Harder to handle.
– Happens when the pipeline can't cleanly stop at the faulting instruction.
9 Parallelism via Instructions
• In static multiple issue, the compiler decides which instructions can run in parallel.
• It groups instructions into bundles and schedules them ahead of time.
• This method relies on:
– The compiler to reorder instructions.
– Careful checking for data dependencies.
• Hazards are problems that occur when instructions interact in unsafe ways:
– Data hazard: An instruction needs data that isn’t ready yet.
– Structural hazard: Two instructions need the same hardware.
– Control hazard: Uncertainty in what instruction to fetch next (due to branches).
9.5 Dynamic Multiple Issue
• In dynamic issue, the hardware (not the compiler) decides how many and which in-
structions to issue each cycle.
• The hardware schedules instructions out-of-order to avoid stalls and make better use
of the CPU.
• Speculation means guessing the result of an instruction (like a branch) before it's actually known.
• Types of speculation:
– Compiler (software) speculation: the compiler reorders instructions ahead of time and adds fix-up code in case a guess is wrong.
– Hardware speculation: the processor runs ahead and buffers results until the guess is confirmed, discarding them otherwise.
9.8 Power Efficiency
• Executing more instructions in parallel usually requires more hardware and power.
• Designers must balance speed gains with energy use to build efficient systems.
• The Intel Core i7 is a powerful processor that uses many smart techniques to run
programs faster:
– Multiple Cores: It can run several programs or parts of a program at the same
time.
– Out-of-Order Execution: It doesn’t always wait for one instruction to finish
before starting the next. It runs instructions when the needed parts are ready.
– Branch Prediction: It guesses which path the program will take, so it doesn’t
waste time waiting.
– Register Renaming: It avoids confusion when two instructions use the same
register name.
– Hyper-Threading: Each core can handle two tasks at once, acting like it’s doing
double the work.
• These features work together to make the CPU run faster and more efficiently by doing
more work at the same time.
• On the software side, compiler and coding techniques raise performance further (a sketch of the first follows this list):
– Loop Unrolling: Instead of repeating small steps, we combine them into bigger
steps to save time.
– Reordering Instructions: We change the order of steps to avoid waiting and
use the CPU better.
– Register Blocking: We keep data in the fastest memory (registers), so we don’t
have to go back to slower memory as often.
– Software Pipelining: We start the next step before the current one finishes, so
everything keeps moving.
• These help use the CPU more fully → finish calculations much faster.
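• A minimal C example of loop unrolling (assumes n is a multiple of 4, so no cleanup loop is shown):

/* Unrolled by 4: fewer loop-condition checks per element, and four
   independent adds per iteration that the processor can overlap. */
void vec_add_unrolled(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}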
12 Fallacies and Pitfalls
12.1 Fallacies
• “Pipelining is easy”
– The idea is simple, but real designs are complex and tricky to get right.
– Pipeline design depends on available technology. What worked before (like delayed
branches) may not be best now. Modern designs change with faster chips and
power limits.
12.2 Pitfalls
• Failure to consider instruction set design can adversely impact pipelining.
– Example: MIPS and RISC-V are easier to pipeline than older, complex ISAs like VAX.