CST202
COMPUTER ORGANISATION AND
ARCHITECTURE
Module 3
Pipelining : Basic principles, classification of
pipeline processors, instruction and
arithmetic
pipelines (Design examples not required), hazard
detection and resolution.
Pipelining
●
Pipelining is an arrangement of the hardware
elements of the CPU such that more than one instruction
is executed simultaneously, which increases the
overall performance.
●
Pipeline is divided into stages and these stages are
connected with one another to form a pipe like structure.
●
Instructions enter from one end and exit from another
end.
●
Pipelining increases the overall instruction throughput.
Sequential Execution
Pipelined Execution
Steps in pipelined execution
●
A pipelined processor may process
each instruction in four steps.
●
F Fetch : Read the instruction from the memory.
●
D Decode : Decode the instruction and fetch
the source operand(s).
●
E Execute : Perform the operation specified by the
instruction.
●
W Write : Store the result in the destination
location.
A 4-Stage Pipeline.
Steps in pipelined execution
IF - Instruction Fetch
ID - Instruction Decode
OF - Operand Fetch
IE - Instruction Execute
WB - Write Back Result
Stage order: IF, ID, OF, IE, WB. Each stage takes 1 clock cycle time (CCT).
Suppose process P1 with 6 instructions:
I1,I2,I3,I4,I5,I6
NON PIPELINED SYSTEM
Total completion time = 6 Instructions X 5 stages
=6x5
= 30 CCT
Namitha
Ramachandran
PIPELINED SYSTEM:
I1 completes in 5 CCT
I2, I3, I4, I5, I6 (5 instructions) complete their execution in the next 5 clock cycles, one
after another
Total completion time = 1 x 5 + 5 x 1
= 5 + 5 = 10 clock cycles
k-stage pipelined system
n = number of instructions to be executed
The first instruction completes in k clock cycles
The remaining n-1 instructions complete their execution in the next n-1 clock cycles, one
after another
Total completion time = 1 x k + (n-1) x 1 = k + (n-1) clock cycles
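The cycle counts above can be checked with a short Python sketch. It assumes every stage takes exactly one clock cycle and that the pipeline never stalls. The function names are illustrative only, and the speedup ratio is derived here from the two counts rather than taken from the slides.

```python
def sequential_cycles(k: int, n: int) -> int:
    """Non-pipelined system: each of the n instructions uses all k stages in turn."""
    return n * k


def pipelined_cycles(k: int, n: int) -> int:
    """k-stage pipeline: k cycles for the first instruction, then one per remaining instruction."""
    return k + (n - 1)


def speedup(k: int, n: int) -> float:
    """Ratio of non-pipelined to pipelined cycle counts (assumes no stalls)."""
    return sequential_cycles(k, n) / pipelined_cycles(k, n)


# Process P1 from the example: 6 instructions on a 5-stage pipeline.
print(sequential_cycles(5, 6))  # 30
print(pipelined_cycles(5, 6))   # 10
print(speedup(5, 6))            # 3.0
```

As n grows, the ratio n*k / (k + n - 1) approaches k, which is why the potential gain is tied to the number of stages.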
Pipeline Performance
●
The pipelined processor completes the
processing of one instruction in each clock cycle,
which means that the rate of instruction
processing is four times that of sequential
operation.
●
The potential increase in performance resulting
from pipelining is proportional to the number
of pipeline stages.
●
However, this increase would be achieved only
if pipelined operation could be sustained without
interruption throughout program execution.
Pipeline Performance
●
One of the pipeline stages may not be able
to complete its processing task for a given
instruction in the time allotted.
●
For example, stage E in the four-stage pipeline
of Figure is responsible for arithmetic and
logic operations, and one clock cycle is
assigned for this task.
●
Although this may be sufficient for most
operations, some operations, such as
divide, may require more time to complete.
Pipeline Performance
Classification of pipeline processor:
Various types of pipelining can be applied in computer
operations depending on the following factors:
1. Level of Processing
2. Pipeline configuration
3. Type of instruction and data
Classification according to level of processing
➔ Arithmetic pipeline
➔ Processor pipeline
➔ Instruction pipeline
Arithmetic pipeline
An arithmetic pipeline generally breaks an arithmetic
operation into multiple arithmetic steps.
In an arithmetic pipeline, an arithmetic operation such as multiplication or
addition is divided into a series of steps that are executed one
after another in stages of the Arithmetic Logic Unit (ALU).
Complex arithmetic operations, such as multiplication and floating point
operations, consume much of the time of the ALU. These operations can
be pipelined by segmenting the operations of the ALU and, as a
consequence, high speed performance may be achieved.
Listed below are examples of arithmetic pipeline processors:
8 stage pipeline used in TI-ASC
4 stage pipeline used in Star-100
We want to evaluate Ai*Bi + Ci for i = 1, 2, 3, …, 7
Segment 1: R1 <- Ai, R2 <- Bi
Segment 2: R3 <- R1*R2, R4 <- Ci
Segment 3: R5 <- R3 + R4
[Figure: Ai and Bi are loaded into R1 and R2 (Segment 1); a multiplier produces R3 while Ci is loaded into R4 (Segment 2); an adder produces R5 (Segment 3).]
Clock    Segment 1     Segment 2         Segment 3
pulse    R1    R2      R3       R4       R5
1        A1    B1      ----     ----     ----
2        A2    B2      A1*B1    C1       ----
3        A3    B3      A2*B2    C2       A1*B1+C1
4        A4    B4      A3*B3    C3       A2*B2+C2
5        A5    B5      A4*B4    C4       A3*B3+C3
6        A6    B6      A5*B5    C5       A4*B4+C4
7        A7    B7      A6*B6    C6       A5*B5+C5
8        ---   ---     A7*B7    C7       A6*B6+C6
9        ---   ---     ---      ---      A7*B7+C7
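The register contents per clock pulse can be reproduced with a small simulation. This is only a sketch: it models each segment as a register update per clock pulse, uses `None` for an empty register, and the name `run_arithmetic_pipeline` is made up for illustration.

```python
def run_arithmetic_pipeline(A, B, C):
    """Simulate the 3-segment pipeline computing A[i]*B[i] + C[i].

    Returns (results, trace), where trace holds one tuple
    (clock, R1, R2, R3, R4, R5) per clock pulse.
    """
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None
    results, trace = [], []
    for clock in range(1, n + 3):  # n items need n + 2 pulses in 3 segments
        # Update from the last segment to the first, so that each segment
        # reads the values latched during the previous clock pulse.
        R5 = R3 + R4 if R3 is not None else None          # Segment 3: adder
        if R1 is not None:                                # Segment 2: multiplier
            R3, R4 = R1 * R2, C[clock - 2]
        else:
            R3, R4 = None, None
        if clock <= n:                                    # Segment 1: load inputs
            R1, R2 = A[clock - 1], B[clock - 1]
        else:
            R1 = R2 = None
        trace.append((clock, R1, R2, R3, R4, R5))
        if R5 is not None:
            results.append(R5)
    return results, trace


A = [1, 2, 3, 4, 5, 6, 7]
B = [2] * 7
C = [10] * 7
results, trace = run_arithmetic_pipeline(A, B, C)
print(results)      # [12, 14, 16, 18, 20, 22, 24]
print(len(trace))   # 9 clock pulses for 7 items
```

The first result appears at clock pulse 3 and one more result follows on every pulse after that, matching k + (n-1) = 3 + 6 = 9 pulses.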
Processor pipeline
Pipeline processing of the same data stream by a cascade
of processors, each of which processes a specific task.
The data stream passes the first processor with the
results stored in memory block which is also accessible
by the second processor. The second processor then
passes the refined results to the third and so on.
Each processor in the cascade is assigned, and processes, a
specific task.
m1 -> P1 (Task 1) -> m2 -> P2 (Task 2) -> m3 -> P3 (Task 3) -> m4
Instruction Pipeline:
● An instruction cycle may consist of many operations like, fetch opcode,
decode opcode, compute operand addresses, fetch operands, and
execute instructions.
● These operations of the instruction execution cycle can be realized
through the pipelining concept. Each of these operations forms one
stage of a pipeline.
● The overlapping of execution of the operations through the pipeline
provides a speedup over the normal execution. Thus, the pipelines used
for instruction cycle operations are known as instruction pipelines.
Four Segment Instruction Pipeline
In general the computer needs to process each instruction with the following
sequence of steps. (6 steps in 4 segments)
Fig shows the operation of the instruction pipeline. The clock in the horizontal axis is
divided into steps of equal duration. The four segments are represented in the diagram
with an abbreviated symbol.
1. IF is the segment that fetches an instruction.
2. ID is the segment that decodes the instruction and calculates
the effective address.
3. OF is the segment that fetches the operand.
4. EX is the segment that executes the instruction.
Here the instruction is fetched (IF) in the first clock cycle in segment 1. It is
decoded (ID) in the second clock cycle, the operands are fetched (OF) in the third
clock cycle and finally the instruction is executed (EX) in the fourth cycle.
Here the fetch and decode phases overlap due to pipelining. While the
first instruction is being decoded, the next instruction is fetched by the pipeline.
The third instruction is a branch instruction. While
it is being decoded, the fourth instruction is fetched simultaneously. But because it
is a branch instruction, it may point to some other instruction once it is
decoded. Thus the fourth instruction is kept on hold until the branch
instruction is executed. Once it has executed, the fourth instruction is
brought back into the pipeline and the other phases continue as usual. In the absence of a
branch instruction, each segment operates on a different instruction.
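The overlap described above can be pictured as a space-time diagram. The sketch below prints one for the four segments IF, ID, OF and EX. It assumes one clock cycle per segment and does not model the branch delay, so it shows only the branch-free case. The helper name `space_time` is illustrative.

```python
STAGES = ["IF", "ID", "OF", "EX"]


def space_time(n_instructions, stages=STAGES):
    """Return one row per instruction; column c holds the segment active in clock c+1."""
    k = len(stages)
    total_clocks = k + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["--"] * total_clocks
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters segment s in clock i + s + 1
        rows.append(row)
    return rows


for i, row in enumerate(space_time(4), start=1):
    print(f"I{i}: " + " ".join(row))
# I1: IF ID OF EX -- -- --
# I2: -- IF ID OF EX -- --
# I3: -- -- IF ID OF EX --
# I4: -- -- -- IF ID OF EX
```

Reading down the fourth column shows EX, OF, ID and IF all busy in the same clock cycle, each working on a different instruction.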
Classification according to pipeline
configuration:
●
Uni-function Pipelines: When a fixed and dedicated
function is performed through a pipeline, it is called a
Uni-function pipeline.
●
Multifunction Pipelines: When different functions at
different times are performed through the pipeline,
this is known as Multifunction pipeline.
● Multifunction pipelines are reconfigurable at
different times according to the operation being
performed.
Classification according to type of
instruction and data:
●
Scalar Pipelines: This type of pipeline processes
scalar operands of repeated scalar instructions.
●
Vector Pipelines: This type of pipeline
processes vector instructions over vector
operands.
Pipeline - Problems
●
An instruction in pipeline has five stages without any branch
prediction and these stages are Instruction Fetch (IF), Instruction
Decode (ID), Operand Fetch (OF), Execute (EX) and Operand
Write (OW). The stage delays for IF, ID, OF, EX and OW phases
are 5 nsec, 7 nsec, 10 nsec, 8 nsec and 6 nsec, respectively.
There are intermediate storage buffers after each stage and the
delay of each buffer is 1 nsec. A program consisting of 12
instructions I1, I2, …, I12 is executed in the pipelined processor.
Instruction I4 is the only branch instruction and its branch target
is I9. If the branch is taken during the execution of this program,
the time needed to complete the program is:
Pipeline - Problems
●
Minimum clock period = max{5, 7, 10, 8, 6} + 1 = 11 nsec
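The slides stop at the clock period. One way to finish the calculation is sketched below. It relies on an assumption the problem does not state: that the branch outcome is known only at the end of the EX stage (the fourth stage), so the three instructions fetched after I4 are discarded and cost three stall cycles. With a different assumption about where the branch is resolved, the result changes.

```python
stage_delays_ns = [5, 7, 10, 8, 6]   # IF, ID, OF, EX, OW
buffer_delay_ns = 1

# Clock period is set by the slowest stage plus its buffer.
clock_ns = max(stage_delays_ns) + buffer_delay_ns

k = len(stage_delays_ns)

# Branch at I4 is taken to I9, so I5..I8 are never completed.
executed = [1, 2, 3, 4] + [9, 10, 11, 12]

# ASSUMPTION (not stated in the problem): the branch is resolved at the end
# of stage 4 (EX), so 4 - 1 = 3 wrongly fetched instructions are flushed.
branch_resolve_stage = 4
stall_cycles = branch_resolve_stage - 1

total_cycles = k + (len(executed) - 1) + stall_cycles
total_time_ns = total_cycles * clock_ns

print(clock_ns)        # 11
print(total_cycles)    # 15
print(total_time_ns)   # 165
```

Under this assumption the program needs 15 clock cycles of 11 nsec each, that is 165 nsec.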
Pipeline - Problems
●
A four stage pipeline has the stage delays as 150, 120, 160 and
140 ns respectively. Registers are used between the stages
and have a delay of 5 ns each. Assuming constant clocking
rate, the total time taken to process 1000 data items on the
pipeline will be-
● 120.4 microseconds
● 160.5 microseconds
● 165.5 microseconds
● 590.0 microseconds
Pipeline - Problems
Given-
● Four stage pipeline is used
● Delay of stages = 150, 120, 160 and 140 ns
● Delay due to each register = 5 ns
● 1000 data items or instructions are processed
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 150, 120, 160, 140 } + 5 ns
= 160 ns + 5 ns
= 165 ns
Pipeline - Problems
●
Pipeline time to process 1000 data items
●
= Time taken for 1st data item + Time taken for remaining 999
data items
●
= 1 x 4 clock cycles + 999 x 1 clock cycle
●
= 4 x cycle time + 999 x cycle time
●
= 4 x 165 ns + 999 x 165 ns
●
= 660 ns + 164835 ns
●
= 165495 ns
●
= 165.5 μs
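The same steps can be written as a small Python helper. It is a sketch under the assumptions used in the solution: a constant clock set by the slowest stage plus the register delay, and no stalls. The function names are illustrative.

```python
def cycle_time(stage_delays, register_delay=0):
    """Clock period: slowest stage plus the delay of its register."""
    return max(stage_delays) + register_delay


def pipeline_time(stage_delays, n_items, register_delay=0):
    """Total time for n_items: k cycles for the first item, one cycle for each of the rest."""
    k = len(stage_delays)
    return (k + n_items - 1) * cycle_time(stage_delays, register_delay)


total_ns = pipeline_time([150, 120, 160, 140], 1000, register_delay=5)
print(total_ns)          # 165495
print(total_ns / 1000)   # 165.495 (microseconds), about 165.5
```

Passing a different list of stage delays and item count reuses the helper for other problems of this type.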
Pipeline - Problems
●
We have 2 designs D1 and D2 for a synchronous pipeline
processor. D1 has 5 stage pipeline with execution time of 3 ns, 2
ns, 4 ns, 2 ns and 3 ns. While the design D2 has 8 pipeline stages
each with 2 ns execution time. How much time can be saved
using design D2 over design D1 for executing 100 instructions?
●
214 ns
●
202 ns
●
86 ns
●
200 ns
Pipeline - Problems
●
Cycle Time in Design D1-
●
Cycle time = Maximum delay due to any stage + Delay due to its register
= Max { 3, 2, 4, 2, 3 } + 0
= 4 ns
●
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99 instructions
= 1 x 5 clock cycles + 99 x 1 clock cycle
= 5 x cycle time + 99 x cycle time
= 5 x 4 ns + 99 x 4 ns
= 20 ns + 396 ns
= 416 ns
Pipeline - Problems
●
Cycle Time in Design D2-
●
Cycle time = Delay due to a stage + Delay due to its register
= 2 ns + 0 = 2 ns
●
Execution Time For 100 Instructions in Design D2-
= Time taken for 1st instruction + Time taken for remaining 99 instructions
= 1 x 8 clock cycles + 99 x 1 clock cycle
= 8 x cycle time + 99 x cycle time
= 8 x 2 ns + 99 x 2 ns = 16 ns + 198 ns = 214 ns
●
Time saved using D2 over D1 = 416 ns - 214 ns = 202 ns
PIPELINE CONFLICTS:
Resource Conflicts: These are caused by access to memory by two
segments at the same time. Most of these conflicts can be resolved
by using separate instruction and data memories.
Data Dependency: These conflicts arise when an instruction depends
on the result of a previous instruction, but this result is not yet
available.
Branch Difficulties: These arise from branch and other instructions
that change the value of PC.
When one of the pipeline stages is not able to complete its
processing task for a given instruction in the time allotted,
the pipelined operation is said to have stalled.
Any condition that causes the pipeline to stall is called a hazard.
Pipelining - Hazard Types
Data hazard: Any condition in which either the source or
destination operands of an instruction are not available at the
time expected in the pipeline.
As a result, some operation has to be delayed and the pipeline stalls.
Control hazards or Instruction hazards: The pipeline may also
be stalled because of a delay in the availability of an instruction.
For example, this may be the result of a miss in the cache, requiring the
instruction to be fetched from the main memory.
Structural hazards occur when two
instructions require the use of a given hardware resource at the
same time.
The most common case is access to memory. One instruction may need to
access memory as part of the execute or write stage while another is being
fetched.
Data hazard:
Data hazards occur when an instruction's execution depends on the
results of some previous instruction that is still being processed in the
pipeline.
RAW (Read after Write) [True data dependency]
WAR (Write after Read) [Anti-Data dependency]
WAW (Write after Write) [Output data dependency]
RAW hazard occurs when instruction J tries to read data before
instruction I writes it.
I: R2 <- R1 + R3
J: R4 <- R2 + R3
Solution - RAW
1: Introduce three bubbles at the IF stage of the next instruction.
I: R2 <- R1 + R3    IF ID IE WB
J: R4 <- R2 + R3    --- IF ID IE WB
2: Data forwarding: Passing the result directly to the functional unit
that requires it.
The purpose is to make the result available early to the next
instruction.
It adds special circuitry to the pipeline.
I IF ID IE WB
J … IF ID IE WB    (R2 is forwarded from I to J)
3: Code reordering:
A special type of software is needed to reorder the code.
WAW and WAR hazards can only occur when instructions are
executed in parallel or out of order.
WAR hazard occurs when instruction J tries to write
data before instruction I reads it.
I: R2 <- R1 + R3
J: R3 <- R4 + R5
WAW hazard occurs when instruction J tries to write
output before instruction I writes it.
I: R2 <- R1 + R2
J: R2 <- R4 + R2
Solution: These hazards occur because the same register has been allotted by the compiler.
The situation is fixed by having the compiler rename one of the registers.
Delaying the updating of a register until the appropriate value has been produced.
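Register renaming can be illustrated with a short sketch. It is a simplification and not the compiler's actual algorithm: every instruction is written as a `(destination, sources)` pair, every write is given a fresh register, and the names `T0`, `T1`, … and the function `rename_registers` are hypothetical. Mapping the final values back to the original register names is left out.

```python
from itertools import count


def rename_registers(instructions):
    """Give every write a fresh register so WAR and WAW name clashes disappear.

    instructions: list of (dest, (src1, src2)) tuples in program order.
    """
    fresh = (f"T{i}" for i in count())   # hypothetical pool of spare registers
    current = {}                          # original name -> latest renamed register
    renamed = []
    for dest, sources in instructions:
        # Sources read the most recent version of each register (keeps RAW intact).
        new_sources = tuple(current.get(s, s) for s in sources)
        new_dest = next(fresh)
        current[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


# WAR example: J writes R3, which I still has to read.
war = [("R2", ("R1", "R3")), ("R3", ("R4", "R5"))]
print(rename_registers(war))
# [('T0', ('R1', 'R3')), ('T1', ('R4', 'R5'))]

# WAW example: both I and J write R2.
waw = [("R2", ("R1", "R2")), ("R2", ("R4", "R2"))]
print(rename_registers(waw))
# [('T0', ('R1', 'R2')), ('T1', ('R4', 'T0'))]
```

After renaming, J no longer writes a register that I reads or writes, so the WAR and WAW conflicts are gone. In the second example J still reads `T0`, the result of I, because that is a true (RAW) dependency that renaming cannot remove.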
Pipelining- Hazard Detection and Resolution
● We use Resource Object to refer to working
registers, memory locations and special flags.
● The contents of these resource objects are called
data objects.
● Each instruction can be considered a mapping from a set
of data objects to a set of data objects.
● The Domain D(I) of an instruction I is a set of resource
objects whose data objects may affect the execution
of instruction I.
● The Range R(I) of an instruction is the set of resource
objects whose data objects may be modified by the
execution of instruction I.
● Obviously, the operands to be used in an instruction
execution are retrieved (read) from its domain and the results
will be stored (written) in its range.
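Domains and ranges lend themselves to a simple detection check. The sketch below uses the usual set-intersection conditions for an instruction I that precedes J: a possible RAW hazard if R(I) and D(J) overlap, WAR if D(I) and R(J) overlap, and WAW if R(I) and R(J) overlap. These conditions are not spelled out in the text above and are brought in here as an assumption. They flag possible hazards only; whether a stall really occurs depends on the pipeline timing.

```python
def detect_hazards(instr_i, instr_j):
    """Return the possible hazards between I and a later instruction J.

    Each instruction is a (domain, range) pair of sets of resource objects.
    """
    domain_i, range_i = instr_i
    domain_j, range_j = instr_j
    hazards = []
    if range_i & domain_j:      # J reads something I writes
        hazards.append("RAW")
    if domain_i & range_j:      # J writes something I reads
        hazards.append("WAR")
    if range_i & range_j:       # J writes something I writes
        hazards.append("WAW")
    return hazards


# I: R2 <- R1 + R3, J: R4 <- R2 + R3
print(detect_hazards(({"R1", "R3"}, {"R2"}), ({"R2", "R3"}, {"R4"})))  # ['RAW']

# I: R2 <- R1 + R3, J: R3 <- R4 + R5
print(detect_hazards(({"R1", "R3"}, {"R2"}), ({"R4", "R5"}, {"R3"})))  # ['WAR']

# I: R2 <- R1 + R2, J: R2 <- R4 + R2
print(detect_hazards(({"R1", "R2"}, {"R2"}), ({"R4", "R2"}, {"R2"})))  # ['RAW', 'WAR', 'WAW']
```

The last pair is the WAW example from the data hazard slides. Because both instructions also read R2, the check reports RAW and WAR for it as well.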
Pipelining- Hazard Detection and
Resolution
Once the hazard is detected, the system should
resolve the interlock situation.
A straightforward approach is to stop the pipe and to
suspend the execution of the coming instructions until
instruction I has passed the point of resource conflict.
A more sophisticated approach is to suspend only the next
instruction J and continue the flow of the following instructions down the
pipe.
Multilevel hazard detection may be encountered, requiring
more complex control mechanisms to resolve a stack of
hazards.
Pipelining- Hazard Detection and
Resolution
In order to avoid RAW hazards, IBM engineers developed
a short-circuiting approach, which gives a copy of the
data object to be written directly to the instruction waiting
to read the data.
This concept was generalized into a technique known as
data forwarding, which forwards multiple copies of the data
to as many waiting instructions as may wish to read it.
A data forwarding chain can be established in some cases.
The internal forwarding and register-tagging techniques are
helpful in resolving logic hazards in pipelines.
Control hazards or Instruction hazards:
Structural hazard
Solutions
Introduce a bubble, which stalls the pipeline.
This delay is percolated to all the subsequent instructions too.
A better solution is to increase the structural resources in the system using
one of the choices below.
The pipeline may be extended to 5 or more stages, with the
functionality of the stages suitably redefined and the clock frequency adjusted.
The memory may be physically separated into Instruction Memory and Data
Memory.
IF uses Instruction Memory and result writing uses Data Memory. These
become two separate resources, avoiding the dependency.
The ALU can also be a source of resource dependency.
The ALU may be required in the IE machine cycle by one instruction while another
instruction requires the ALU in the IF stage to calculate the Effective Address
based on the addressing mode.
The solution is either to stall or to provide an exclusive ALU for address
calculation.