Computer Systems Pipelining Guide

Carleton University


Department of Systems and Computer Engineering, Carleton University

SYSC 3320 Computer Systems Design


Processors – Pipelining
Pipelining as a technique to improve performance
• Recall: the ‘Iron Law’ of processor performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)

• Three factors to improve CPU performance
1) Time per cycle
2) Clock cycles per instruction
3) Instructions per program
• Performance is a product of three factors that are not independent of one
another, so it is important to concentrate on reducing all three. Reducing
“instructions per program” is compiler/developer dependent. Reducing
“time/cycle” means a higher clock frequency, which has reached a limit due
to clock technology limitations. Designers therefore focused on improving
cycles/instruction. Pipelining is a technique to increase the number of
instructions executed per clock cycle, which is equivalent to reducing the
cycles/instruction.
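The Iron Law above lends itself to a quick back-of-the-envelope calculation. A minimal sketch in Python (the instruction count, CPI, and clock values are made-up illustrative numbers, not from the slides):

```python
def cpu_time(instructions, cpi, clock_hz):
    """Iron Law: Time/Program = (Instr/Program) x (Cycles/Instr) x (Time/Cycle)."""
    time_per_cycle = 1.0 / clock_hz
    return instructions * cpi * time_per_cycle

# Illustrative values: 1e6 instructions, CPI of 5, 1 GHz clock -> 5 ms.
t = cpu_time(1_000_000, 5, 1e9)

# Halving CPI (e.g., via pipelining) halves the execution time.
t_pipelined = cpu_time(1_000_000, 2.5, 1e9)
```

The same function shows why all three factors matter: improving any one of them scales execution time proportionally.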

Recall: Instruction Execution Cycle
• The instruction execution cycle has three main phases
➢ Fetch
➢ Decode
➢ Execute

[Figure: simplified CPU block diagram — CPU registers (program counter (PC), instruction register (IR), memory address register (MAR), status register (SR), stack pointer (SP), general-purpose registers GPR1, GPR2, …) and an ALU/control unit (ADD, SUB, MULT, DIV, COMPL, SHIFT), connected over the system bus (address, data, and control lines) to memory and I/O devices 1 and 2]

• This is an overly simplified sequence. Real processors have much more
complicated steps to execute an instruction.
Instruction Execution Cycle
• A more realistic instruction execution cycle has the following main phases
1) Instruction Fetch (IF) stage:
Fetch an instruction from instruction memory
2) Instruction Decoding (ID) stage:
Decode the instruction and read registers from the register file (or register bank)
3) Execution (EX) stage:
Execute the instruction: if an ALU operation, perform it; if a load/store, calculate the memory address
4) Memory access (MEM) stage:
Access memory for a load/store instruction
5) Writeback (WB) stage:
Write the results into the register file (or register bank)

• Each phase takes one clock cycle
• Conventional CPUs will implement these phases “in series” or “sequentially”
• This means that the total instruction time equals the sum of all the phase times
• This is a time-consuming process that can be significantly improved if we
can run the different phases “in parallel”. This is called “pipelining”.
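The gain from overlapping the phases can be sketched numerically. Assuming K stages of one cycle each and N instructions (an idealized model, ignoring hazards):

```python
def sequential_cycles(n_instructions, n_stages):
    # Each instruction occupies the processor for all K stages
    # before the next one starts.
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages):
    # After K cycles to fill the pipeline, one instruction
    # completes in every subsequent cycle.
    return n_stages + (n_instructions - 1)

# 100 instructions on a 5-stage machine:
# sequential = 500 cycles, pipelined = 104 cycles.
```

For large N the ratio approaches K, which is the ideal pipeline speedup discussed on the following slides.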

Principles of an ideal pipeline
• Pipelining is a technique to introduce parallelism to the system
• All objects must go through all stages
• Sharing of resources is not allowed
• Propagation delay for all stages is the same
• Ideally there should be no dependency between the stages
• These conditions cannot be fully satisfied in microprocessors (why?)

[Figure: Stage 1 → Stage 2 → Stage 3]

3/19/2023
Instruction Pipelining
• Given the instruction execution stages defined as follows:
Instruction Fetch (IF) stage
Instruction Decoding (ID) stage
Execution (EX) stage
Memory access (MEM) stage
Writeback (WB) stage
• Assume each stage takes one clock cycle T
• A major limitation of conventional instruction execution is that each stage waits
for ALL previous stages to finish before it can proceed.
• Pipelining optimizes this process by running stages “in parallel”, so instruction
executions will “overlap” instead of being completely sequential.

Pipeline diagram
[Figure: pipelined datapath — PC, instruction memory, register file, ALU, and data memory, with adders, a shift unit, an immediate (Imm) path, MUXes, and a control unit]

time           t0  t1  t2  t3  t4  t5  t6  t7  t8
Instruction 1  IF1 ID1 EX1 MA1 WB1
Instruction 2      IF2 ID2 EX2 MA2 WB2
Instruction 3          IF3 ID3 EX3 MA3 WB3
Instruction 4              IF4 ID4 EX4 MA4 WB4
Instruction 5                  IF5 ID5 EX5 MA5 WB5
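A staggered diagram like the one above can be generated programmatically. A small sketch (the stage names follow the table; the layout is my own choice):

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def pipeline_diagram(n_instructions, stages=STAGES):
    """Return one row per instruction, staggered by one cycle per row."""
    rows = []
    for i in range(n_instructions):
        # Pad with empty cells so instruction i starts at cycle i.
        cells = ["  "] * i + [f"{s}{i + 1}" for s in stages]
        rows.append(" ".join(cells))
    return rows

for row in pipeline_diagram(5):
    print(row)
```

Reading any column of the output shows that five different instructions occupy the five stages in the same cycle once the pipeline is full.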
3/20/2023
Instruction Pipelining

Without pipelining:
• Cycles per instruction (CPI) = 5
• Instructions per cycle (IPC) = 1/5

With pipelining:
• Cycles per instruction (CPI) = 1
• Instructions per cycle (IPC) = 1

• Each instruction still takes the same number of clock cycles, but the
overlapped processing reduces the effective cycles per instruction

Pipelining Concept
• Each pipeline stage takes 1 clock cycle
• The clock cycle must be long enough to accommodate the slowest pipeline
stage.
• How much speedup can we get using pipelining?
• Under ideal conditions, approximately equal to the number of stages
• How many pipe stages should we use?
• More pipe stages result in a shorter clock period
• But might result in extra overhead

Time →
1st  IF ID EX MEM WB
2nd     IF ID EX MEM WB
3rd        IF ID EX MEM WB

IF: Instruction Fetch   ID: Instruction Decode   EX: Execution
MEM: Memory access      WB: Write Back

Example/Discussion
Assume a program consisting of 10,000 instructions runs on a non-pipelined
single-cycle processor, A, and on a 5-stage pipelined processor, B. Given that
the clock frequency of processor A is 200 MHz and that of processor B is 1 GHz,
calculate the speedup of processor B compared to A. Assume the pipeline in B is
ideal and the program consists of simple arithmetic instructions.
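One possible solution sketch, worked in code (my own working; it assumes CPI = 1 for the single-cycle processor A and an ideal 5-stage pipeline in B with 4 fill cycles):

```python
N = 10_000  # instructions in the program

# Processor A: non-pipelined single cycle at 200 MHz -> 5 ns per instruction.
t_a = N * (1 / 200e6)          # 50 microseconds

# Processor B: ideal 5-stage pipeline at 1 GHz -> after the pipeline fills,
# one instruction completes per 1 ns cycle.
cycles_b = 5 + (N - 1)         # 10_004 cycles
t_b = cycles_b * (1 / 1e9)     # ~10 microseconds

speedup = t_a / t_b            # ~5.0
```

The speedup is close to 5 = (1 GHz / 200 MHz): B benefits from both the faster clock and the pipelined CPI of ~1, while A completes one instruction per (slow) cycle.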

Pipelining Challenges: Clock Skew
• For a general K-stage pipelining system, ideally, CPI = 1 instead of K (IPC = 1
instead of 1/K). However, these calculations are for the ideal case. In
practice, a pipeline is a hardware structure with a number of registers that
need to be clocked synchronously.

• The additional hardware needed for pipelining will lead to different arrivals of
the clock signal (clock skew) for each stage leading to additional delays.
• In effect, pipelining adds a latency to the clock. If the clock period without
pipelining is ‘t’, this latency adds a factor ‘∆𝑡’ to it. Thus, the overall clock
frequency is reduced.

Pipelining Challenges: Clock Skew-Example
Consider an unpipelined processor with a clock period of 2 ns. This
processor is now re-modeled with a five-stage pipeline, which adds
0.2 ns latency to the clock period.
❑ What are the old and new CPIs?
❑ Calculate the ideal and actual speedups obtained

Pipelining Challenges: Clock Skew-Example
Consider an unpipelined processor with a clock period of 2 ns. This
processor is now re-modeled with a five-stage pipeline, which adds
0.2 ns latency to the clock period.
❑ What are the old and new CPIs?
❑ Calculate the ideal and actual speedups obtained
Ideal Case
Without pipelining: Execution time without pipelining = 5 × 2 = 10 ns.
With pipelining: For an ideal pipeline, after the first instruction one
instruction is delivered at every cycle. Execution time with pipelining =
2 ns.

Speedup = Execution time without pipelining/Execution time


with pipelining = 10/2 = 5 = Number of pipeline stages.
CPI = 5 (without pipelining) and 1 (with pipelining).

Pipelining Challenges: Clock Skew-Example
Consider an unpipelined processor with a clock period of 2 ns. This
processor is now re-modeled with a five-stage pipeline, which adds
0.2 ns latency to the clock period.
❑ What are the old and new CPIs?
❑ Calculate the ideal and actual speedups obtained
Non-ideal Case
With pipelining, latency = 0.2 ns, so clock period = 2 + latency = 2 + 0.2 =
2.2 ns. Instruction execution time with pipelining = 2.2 ns.
Speedup = instruction execution time without pipelining / instruction
execution time with pipelining = 10/2.2 ≈ 4.55. Thus, the speedup is
reduced from 5 to about 4.55 because of the non-ideal nature of the pipeline.
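The arithmetic can be checked in a few lines (note that 10/2.2 is closer to 4.55 than to 4.45):

```python
t_unpipelined = 5 * 2.0   # ns: five 2-ns stages done serially
t_ideal = 2.0             # ns per instruction for an ideal pipeline
t_actual = 2.0 + 0.2      # ns per instruction with clock-skew latency

ideal_speedup = t_unpipelined / t_ideal     # 5.0 = number of stages
actual_speedup = t_unpipelined / t_actual   # ~4.55
```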

Pipelining Challenges: Additional hardware
• The pipeline is clocked and the intermediate results from one stage have to be
forwarded to the next stage in every cycle, while the data from the previous
stage has to be clocked in. This cannot be done without intermediate storage
between stages. Thus, we need inter-stage buffers.
• While designing a pipeline, one mandatory requirement is that the different
stages should be balanced. This means that all stages should have the same or
almost the same latency, as they are to be clocked synchronously. This obviously
means that the pipeline rate is determined by the latency of the slowest stage.

[Figure: pipeline stages separated by inter-stage buffers (intermediate storage between stages)]
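The point that the slowest stage sets the pipeline rate can be made concrete. A short sketch (the stage latencies and register overhead are made-up illustrative values):

```python
def pipeline_clock_period(stage_latencies_ns, register_overhead_ns=0.0):
    # The clock must accommodate the slowest stage plus the
    # inter-stage buffer (pipeline register) overhead.
    return max(stage_latencies_ns) + register_overhead_ns

# Unbalanced stages: the 0.9 ns stage dictates the period.
period = pipeline_clock_period([0.5, 0.9, 0.4, 0.6, 0.5],
                               register_overhead_ns=0.1)
# period is 1.0 ns, even though the total work per instruction
# is only 2.9 ns -- which is why balanced stages matter.
```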

Pipelining Challenges: Additional hardware
• Another hardware challenge is “multiporting” of processor register files.
• We need to read and write registers simultaneously from register banks
• We need register banks with multiple ports (“multiport”)

[Figure: processor and ALU connected through Port1 and Port2 of a multiport register bank (file) containing R1, R2, …, with control (CTRL) logic]
Pipelining Challenges: Hazards
• Situations where the next instruction cannot be executed in the next pipeline
stage
• Structural
• A planned instruction cannot execute in the proper clock cycle
because the hardware does not support the combination of instructions
that are set to execute
• Two instructions need the same hardware resource
• The resource cannot be shared by different instructions
• Data
• An instruction cannot be executed because the data needed to execute
it are not yet available
• Data dependency
• Control
• Conditional branch hazards
• The processor does not recognize a branch until a later pipeline
stage

Structural Hazard
• When two instructions need the same hardware resource
• The resource cannot be shared by different instructions
• Solutions
• Schedule
• Programmer avoids scheduling instructions that need the same
hardware resource at the same time
• Stall
• Wait until the resource is free and then take the next instruction
• Duplicate
• Add more hardware!
• Example: more ports
• Not always possible

• Interesting to know: MIPS processors never have structural hazards because
of the way their ISA has been designed

Structural Hazard Example: One memory
[Figure: pipelined datapath — IF (Fetch), ID (Decode), EX (Execution), MEM (Memory access), WB (Write Back) — with a single memory shared for both data and instructions]

Ld    F D X M W
Add     F D X M W
Add       F D X M W
Add         F D X M W

(The Ld’s memory-access stage and the fourth instruction’s fetch both need the single memory in the same cycle.)

Data Hazard
• Data dependency
• Solutions
• Scheduling
• Programmer avoids scheduling the instructions that cause the data hazard
• Stall
• Like freezing earlier instructions
• Bypass
• A hardware mechanism
• Send some sort of feedback from later stages to earlier stages in
the pipeline
• Extra hardware complexity
• Speculate
• Guess that there is no problem; if incorrect, kill the speculative
instruction

Data Hazard: Scheduling solution example
Reorder code to avoid use of load result in the next instruction
C code: a = b + e; c = b + f;

Original order (13 cycles):       Reordered (11 cycles):
ld x1, 0(x0)                      ld x1, 0(x0)
ld x2, 8(x0)                      ld x2, 8(x0)
(stall)                           ld x4, 16(x0)
add x3, x1, x2                    add x3, x1, x2
sd x3, 24(x0)                     sd x3, 24(x0)
(stall)                           add x5, x1, x4
add x5, x1, x4                    sd x5, 32(x0)
sd x5, 32(x0)
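The 13-cycle vs 11-cycle counts can be reproduced with a simple counting model (my own formulation: total cycles = pipeline fill + one completion per instruction + one bubble per stall, with each load-use hazard costing one stall):

```python
def total_cycles(n_instructions, n_stages, n_stalls):
    # (n_stages - 1) cycles to fill the pipeline, then one
    # instruction completes per cycle, plus stall bubbles.
    return (n_stages - 1) + n_instructions + n_stalls

# Original order: 7 instructions, 2 load-use stalls -> 13 cycles.
# Reordered:      7 instructions, 0 stalls          -> 11 cycles.
```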

Example Data Hazard

[Figure: pipelined datapath — IF (Fetch), ID (Decode), EX (Execution), MEM (Memory access), WB (Write Back) — with the second Add entering the pipeline while the first Add has not yet written its result back]

Add R1, R0, #20   ; R1 <- R0 + 20
Add R4, R1, #30   ; R4 <- R1 + 30

What do we do?

Example Data Hazard: stall solution
[Figure: pipelined datapath — the second Add is held in ID (Decode) until the first Add completes WB (Write Back)]

Add R1, R0, #20   ; R1 <- R0 + 20
Add R4, R1, #30   ; R4 <- R1 + 30   → Read-after-Write (RAW) hazard

We have to wait for a number of clock cycles, which is not efficient.
Stall solution (interlock)

Add R1, R0, #20 R1 <- R0 + 20


Add R4, R1, #30 R4 <- R1 + 30

We must decode the second instruction after the first one is written back
to the register file

Stalled stages
ADD   F D X M W
ADD     F D D D D X M W
                  F D X M W
                    F D X M W

Dependencies & Forwarding
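As a sketch of what forwarding hardware checks, the function below compares the destination register of instructions further down the pipeline against a source register of the instruction in EX. This is a simplified model in the style of textbook forwarding units; the function name and encoding are my own, and register x0 (hard-wired zero) never forwards:

```python
def forward_source(rs, ex_mem_rd, ex_mem_writes, mem_wb_rd, mem_wb_writes):
    """Decide where an ALU source operand should come from.

    Returns 'EX/MEM' or 'MEM/WB' when a newer value is sitting in a
    pipeline register, else 'REGFILE'. EX/MEM takes priority because
    it holds the most recent result.
    """
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX/MEM"
    if mem_wb_writes and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM/WB"
    return "REGFILE"

# add x1, x2, x3   (now in MEM, will write x1; result in EX/MEM register)
# add x4, x1, x5   (now in EX, reads x1) -> forward from EX/MEM
src = forward_source(rs=1, ex_mem_rd=1, ex_mem_writes=True,
                     mem_wb_rd=0, mem_wb_writes=False)
```

This is the feedback path mentioned under the “Bypass” solution: results are routed back from later pipeline registers to the ALU inputs instead of waiting for writeback.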

Stalls and Performance

• Stalls reduce performance
• But are required to get correct results
• The compiler can arrange code to avoid hazards and stalls
• Requires knowledge of the pipeline structure

Control Hazards
Branch determines flow of control
• Fetching next instruction depends on branch outcome
• Pipeline can’t always fetch correct instruction
• Still working on ID stage of branch

Stall on Branch

Wait until the branch outcome is determined before fetching the next instruction

Branch Prediction

• Longer pipelines can’t readily determine branch outcome early
• Stall penalty becomes unacceptable
• Predict outcome of branch
• Only stall if prediction is wrong

More-Realistic Branch Prediction

• Static branch prediction
• Based on typical branch behavior
• Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
• Hardware measures actual branch behavior
• e.g., record recent history of each branch
• Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history

Control Hazard: Branch Hazards
If the branch outcome is determined in MEM:

[Figure: pipelined datapath — flush the instructions fetched after the branch (set their control values to 0) and update the PC]

Data Hazards for Branches

• If a comparison register is the destination of the 2nd or 3rd preceding ALU instruction

add x1, x2, x3       IF ID EX MEM WB
add x4, x5, x6          IF ID EX MEM WB
…                          IF ID EX MEM WB
beq x1, x4, target            IF ID EX MEM WB

• Can resolve using forwarding

Data Hazards for Branches

• If a comparison register is the destination of the immediately preceding ALU
instruction or of the 2nd preceding load instruction
• Need 1 stall cycle

lw  x1, addr         IF ID EX MEM WB
add x4, x5, x6          IF ID EX MEM WB
beq stalled                IF ID
beq x1, x4, target            ID EX MEM WB

Data Hazards for Branches

• If a comparison register is the destination of the immediately preceding load instruction
• Need 2 stall cycles

lw  x1, addr         IF ID EX MEM WB
beq stalled             IF ID
beq stalled                ID
beq x1, x0, target            ID EX MEM WB
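The three cases on the last few slides can be summarized as a small lookup: given what kind of instruction produces the comparison register and how many instructions earlier it appears, how many stall cycles does the branch need? This is my own encoding of the slide cases (assuming, as here, that the branch comparison is done in ID with forwarding):

```python
def branch_stalls(producer, distance):
    """Stall cycles before a beq that compares a register written earlier.

    producer: 'alu' or 'load'.
    distance: 1 = immediately preceding instruction, 2 = 2nd preceding, etc.
    Cases: ALU 2-3 back -> 0 stalls (forwarding suffices),
           ALU 1 back or load 2 back -> 1 stall,
           load 1 back -> 2 stalls.
    """
    if producer == "alu":
        return 1 if distance == 1 else 0
    if producer == "load":
        if distance == 1:
            return 2
        if distance == 2:
            return 1
        return 0
    raise ValueError("producer must be 'alu' or 'load'")
```

Loads cost more because their result is only available after MEM, one stage later than an ALU result.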

Future Lecture
• Memory Technologies

