UNIT II
PIPELINING AND PARALLEL PROCESSING
CONTENTS
Technical terms
TECHNICAL TERMS
1. Pipeline
Literal meaning: The processing of each task in parallel.
Technical meaning: The process of storing and queuing tasks and instructions that are executed simultaneously by the processor in an organized way.
Reference: https://www.techopedia.com/definition/13051/pipelining

2. Instruction pipelining
Literal meaning: A method of organizing the instructions.
Technical meaning: A technique of organising the instructions for execution in such a way that the execution of the current instruction is overlapped by the execution of its subsequent instruction.
Reference: https://binaryterms.com/instruction-pipelining.html

3. Data Dependencies
Literal meaning: A situation in which an instruction is dependent on a previous instruction.
Technical meaning: A situation in which an instruction is dependent on a result from a sequentially previous instruction before it can complete its execution.
Reference: https://www.encyclopedia.com/computing/dictionaries-thesauruses-pictures-and-press-releases/data-dependency

4. Memory Delay
Literal meaning: A fundamental measure of the speed of memory.
Technical meaning: Memory latency is the time between initiating a request for a byte or word in memory until it is retrieved by the processor.
Reference: https://en.wikipedia.org/wiki/Memory_latency

5. Branch Delays
Literal meaning: The instructions appear to execute in an illogical or incorrect order.
Technical meaning: The branch delay slot is a side effect of pipelined architectures due to the branch hazard.
Reference: https://cpentalk.com/887/what-is-a-delayed-branch

6. Superscalar Operation
Literal meaning: The CPU manages multiple instructions for execution.
Technical meaning: A CPU that implements a form of parallelism called instruction-level parallelism within a single processor.
Reference: https://en.wikipedia.org/wiki/Superscalar_processor

7. Multi-core processor
Literal meaning: A multicore processor contains multiple core processing units.
Technical meaning: A multi-core processor is a computer processor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions.
Reference: https://en.wikipedia.org/wiki/Multi-core_processor

8. SISD
Literal meaning: Instructions are executed sequentially.
Technical meaning: SISD stands for 'Single Instruction and Single Data Stream'. It represents the organization of a single computer containing a control unit, a processor unit, and a memory unit.
Reference: https://www.javatpoint.com/sisd

9. SIMD
Literal meaning: All processors receive the same instruction from the control unit but operate on different items of data.
Technical meaning: SIMD stands for 'Single Instruction and Multiple Data Stream'. It represents an organization that includes many processing units under the supervision of a common control unit.
Reference: https://www.javatpoint.com/simd

10. MISD
Literal meaning: In MISD, multiple processing units operate on one single data stream.
Technical meaning: MISD stands for 'Multiple Instruction and Single Data Stream'. The MISD structure is only of theoretical interest since no practical system has been constructed using this organization. Each processing unit operates on the data independently via a separate instruction stream.
Reference: https://www.javatpoint.com/misd

11. MIMD
Literal meaning: In MIMD, each processor has a separate program and an instruction stream is generated from each program.
Technical meaning: MIMD stands for 'Multiple Instruction and Multiple Data Stream'. In this organization, all processors in a parallel computer can execute different instructions and operate on various data at the same time.
Reference: https://www.javatpoint.com/mimd
A pipeline system is like a modern-day assembly line in a factory. For example, in a car manufacturing plant, huge assembly lines are set up and at each point a robotic arm performs a certain task, after which the car moves on to the next arm.
In pipelining, the instruction cycle is divided into segments that different instructions occupy simultaneously. The pipeline will be more efficient if the instruction cycle is divided into segments of equal duration.
Pipeline Conflicts
There are some factors that cause the pipeline to deviate its normal performance. Some of
these factors are given below:
1. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction processing, where different instructions have different operand requirements and thus different processing times.
2. Data Hazards
When several instructions are in partial execution, a problem arises if they reference the same data. We must ensure that the next instruction does not attempt to access the data before the current instruction has finished with it, because this would lead to incorrect results.
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the present instruction is a conditional branch, its result determines which instruction comes next, so the next instruction may not be known until the current one is processed.
4. Interrupts
Interrupts insert unwanted instructions into the instruction stream and affect the execution of instructions.
5. Data Dependency
It arises when an instruction depends upon the result of a previous instruction but this result is
not yet available.
Advantages of Pipelining
• The cycle time of the processor is reduced.
• It increases the throughput of the system
• It makes the system reliable.
Disadvantages of Pipelining
• The design of pipelined processor is complex and costly to manufacture.
In most computer programs, the result from one instruction is used as an operand by another instruction. When such instructions are executed in a pipeline, a breakdown occurs because the result of the first instruction is not yet available when instruction two starts collecting its operands. So, instruction two must stall till instruction one is executed and the result is generated. This type of hazard is called a read-after-write (RAW) pipelining hazard.
Pipelining
To improve the performance of a CPU we have two options:
1) Improve the hardware by introducing faster circuits.
2) Arrange the hardware such that more than one operation can be performed at the same
time.
Since there is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the second option.
Pipelining: Pipelining is a process of arranging the hardware elements of the CPU such that its overall performance is increased. Simultaneous execution of more than one instruction takes place in a pipelined processor.
Let us see a real life example that works on the concept of pipelined operation.
Consider a water bottle packaging plant. Let there be 3 stages that a bottle should pass through,
Inserting the bottle(I), Filling water in the bottle(F), and Sealing the bottle(S). Let us consider
these stages as stage 1, stage 2 and stage 3 respectively. Let each stage take 1 minute to
complete its operation.
Now, in a non pipelined operation, a bottle is first inserted in the plant, after 1 minute it is
moved to stage 2 where water is filled. Now, in stage 1 nothing is happening. Similarly, when
the bottle moves to stage 3, both stage 1 and stage 2 are idle. But in pipelined operation, when
the bottle is in stage 2, another bottle can be loaded at stage 1. Similarly, when the bottle is in
stage 3, there can be one bottle each in stage 1 and stage 2. So, after each minute, we get a new
bottle at the end of stage 3. Hence, the average time taken to manufacture 1 bottle is :
Without pipelining = 9/3 minutes = 3 minutes per bottle
I F S | | | | | |
| | | I F S | | |
| | | | | | I F S   (9 minutes)
With pipelining = 5/3 minutes = 1.67 minutes per bottle
I F S | |
| I F S |
| | I F S   (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
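As a quick illustration of the arithmetic above, the following Python sketch (not part of the original notes; the function names and the equal-stage-time assumption are ours) computes the total completion time for n tasks on a k-stage pipeline and reproduces the 9-minute versus 5-minute figures for the bottle plant.

def non_pipelined_time(n_tasks, n_stages, stage_time=1):
    # Each task passes through all stages before the next task starts.
    return n_tasks * n_stages * stage_time

def pipelined_time(n_tasks, n_stages, stage_time=1):
    # The first task fills the pipeline; every later task finishes one
    # stage_time after the previous one.
    return (n_stages + (n_tasks - 1)) * stage_time

# Bottle plant from the text: 3 stages (I, F, S), 3 bottles, 1 minute per stage.
print(non_pipelined_time(3, 3))   # 9 minutes, i.e. 3 minutes per bottle
print(pipelined_time(3, 3))       # 5 minutes, i.e. about 1.67 minutes per bottle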
Non-overlapped execution:
Stage/Cycle   1    2    3    4    5    6    7    8
S1            I1                  I2
S2                 I1                  I2
S3                      I1                  I2
S4                           I1                  I2
Overlapped execution:
Stage/Cycle   1    2    3    4    5
S1            I1   I2
S2                 I1   I2
S3                      I1   I2
S4                           I1   I2
Note: The cycles per instruction (CPI) value of an ideal pipelined processor is 1
Ideal CPI of the pipelined processor is ‘1’. But due to stalls, it becomes greater than ‘1’.
S = (CPI_non-pipeline × Cycle Time_non-pipeline) / [(1 + Number of stalls per instruction) × Cycle Time_pipeline]
As Cycle Time_non-pipeline = Cycle Time_pipeline,
Speed Up (S) = CPI_non-pipeline / (1 + Number of stalls per instruction)
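The speedup formula can be applied directly. The short Python sketch below (our own illustration, with assumed example numbers) evaluates S = CPI_non-pipeline / (1 + stalls per instruction) under the equal-cycle-time assumption made above.

def pipeline_speedup(cpi_non_pipeline, stalls_per_instruction):
    # Valid when the pipelined and non-pipelined cycle times are equal.
    return cpi_non_pipeline / (1 + stalls_per_instruction)

# Example: a 4-stage design (CPI of 4 without pipelining) that suffers
# 0.25 stall cycles per instruction on average.
print(pipeline_speedup(4, 0.25))   # 3.2, instead of the ideal speedup of 4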
Pipeline hazards are classified into three types:
• Structural Hazards
• Data Hazards
• Control Hazards
There are many specific solutions to dependencies. The simplest is introducing a bubble, which stalls the pipeline and reduces the throughput. The bubble makes the next instruction wait until the earlier instruction is done.
2.3.1 Structural Hazards
Structural hazards arise due to hardware resource conflicts amongst the instructions in the pipeline. A resource here could be the memory, a register in the GPR file, or the ALU. A resource conflict is said to occur when more than one instruction in the pipe requires access to the same resource in the same clock cycle. It is a situation in which the hardware cannot handle all possible combinations of instructions in an overlapped pipelined execution.
Fig. 2.6 Structural Hazard solution using "stall" in a 4 stage pipeline design
This delay percolates to all the subsequent instructions too. Thus, while the ideal 4-stage system would have taken 8 timing states to execute 5 instructions, due to the structural dependency it has now taken 11 timing states. That is not all: by now you would have guessed that this hazard is likely to happen at every 4th instruction, which is not at all acceptable for a heavy load on the CPU. Is there a better way? Yes!
A better solution would be to increase the structural resources in the system using one of the
few choices below:
• The pipeline may be extended to 5 or more stages, with the functionality of the stages suitably redefined and the clock frequency adjusted. This eliminates the issue of the hazard at every 4th instruction in the 4-stage pipeline.
• The memory may be physically separated into instruction memory and data memory. A better choice would be to provide cache memory in the CPU rather than dealing with main memory. IF uses the instruction memory and result writing uses the data memory, so they become two separate resources and the dependency is avoided.
• It is possible to have Multiple levels of Cache in CPU too.
• The ALU can also be a source of resource dependency. The ALU may be required in the IE machine cycle by one instruction while another instruction may require the ALU in the IF stage to calculate an effective address based on its addressing mode. The solution would be either stalling or providing an exclusive ALU for address calculation.
• Register files are used in place of GPRs. Register files have multiport access with exclusive read and write ports. This enables simultaneous access to one write register and one read register.
The last two methods are implemented in modern CPUs. Beyond these, if dependency arises,
Stalling is the only option. Keep in mind that increasing resources involves increased cost. So
the trade-off is a designer’s choice.
2.3.2 Data Hazards
Data hazards occur when an instruction's execution depends on the results of some previous
instruction that is still being processed in the pipeline. Consider the example below.
Solution 1: Introduce three bubbles at the SUB instruction's IF stage. This will allow SUB-ID to take place at t6. Subsequently, all the following instructions are also delayed in the pipe.
Solution 2: Data forwarding. Forwarding is passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another. The purpose is to make the result available to the next instruction as early as possible.
In this case, the ADD result is available at the output of the ALU in the ADD IE stage, i.e., at the end of t3. If this can be controlled and forwarded by the control unit to the SUB IE stage at t4, before it is written to the output register R3, then the pipeline goes ahead without any stalling. This requires extra logic to identify this data hazard and act upon it. Note that although the operand fetch normally happens in the ID stage, the operand is used only in the IE stage; hence forwarding is supplied as an input to the IE stage. Similar forwarding can be done for the OR and AND instructions too.
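To make the forwarding decision concrete, here is a small Python sketch (illustrative only; the instruction tuple format and the stall counts are assumptions based on the 4-stage example above) that detects a read-after-write hazard between two adjacent instructions and reports how many bubbles are needed with and without operand forwarding.

def raw_hazard(producer, consumer):
    # Instructions are modelled as (opcode, dest, src1, src2).
    _, dest, _, _ = producer
    _, _, src1, src2 = consumer
    return dest in (src1, src2)

def stall_cycles(producer, consumer, forwarding):
    if not raw_hazard(producer, consumer):
        return 0
    # Without forwarding the consumer waits for write-back (three bubbles in
    # the example above); with ALU-output to ALU-input forwarding it needs none.
    return 0 if forwarding else 3

add_instr = ("ADD", "R3", "R1", "R2")   # R3 <- R1 + R2
sub_instr = ("SUB", "R5", "R3", "R4")   # reads R3 produced by ADD: RAW hazard
print(stall_cycles(add_instr, sub_instr, forwarding=False))  # 3
print(stall_cycles(add_instr, sub_instr, forwarding=True))   # 0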
Thus a Conditional hazard occurs when the decision to execute an instruction is based
on the result of another instruction like a conditional branch, which checks the condition’s
resultant value.
The branch and jump instructions decide the program flow by loading the appropriate
location in the Program Counter(PC). The PC has the value of the next instruction to be fetched
and executed by CPU. Consider the following sequence of instructions.
1. Stall the pipeline as soon as any kind of branch instruction is decoded, and simply do not allow any more instruction fetches. As always, stalling reduces throughput. Statistics say that in a program at least 30% of the instructions are branches, so with stalling the pipeline essentially operates at 50% capacity.
2. Prediction – Imagine a for or while loop being executed 100 times. We know that for 100 iterations the program flows on without the branch condition being met; only on the 101st time does the program come out of the loop. So, it is wiser to allow the pipeline to
proceed and undo/flush the work when the branch condition is met. This does not affect the throughput of the pipeline as much as stalling does.
3. Dynamic Branch Prediction – A history record is maintained with the help of a Branch Target Buffer (BTB). The BTB is a kind of cache with a set of entries, each holding the PC address of a branch instruction and the corresponding effective branch target address. An entry is maintained for every branch instruction encountered. So whenever a conditional branch instruction is encountered, a lookup for the matching branch instruction address is done in the BTB. If it hits, the corresponding target branch address is used for fetching the next instruction. This is called dynamic branch prediction.
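A BTB can be pictured as a small lookup table keyed by the branch instruction's PC. The Python sketch below is only a rough model (the 4-byte fall-through increment, the class name and the update policy are our assumptions, not part of the notes).

class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                      # branch PC -> predicted target

    def predict(self, pc):
        # Hit: fetch from the stored target. Miss: predict fall-through.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target          # remember the branch target
        else:
            self.entries.pop(pc, None)         # drop branches that fell through

btb = BranchTargetBuffer()
print(hex(btb.predict(0x100)))                 # miss: 0x104 (fall-through)
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.predict(0x100)))                 # hit: 0x200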
• No Dependence
• Dependence requiring Stall
• Dependence solution by Forwarding
• Dependence with access in order
• Out of Order Execution
• Branch Prediction Table and more
Knowledge of pipeline hazards is important for processor designers and compiler writers.
ADDRESSING MODES
Addressing modes should provide the means for accessing a variety of data structures simply
and efficiently. Useful addressing modes include index, indirect, autoincrement, and
autodecrement. Many processors provide various combinations of these modes to increase the
flexibility of their instruction sets. Complex addressing modes, such as those involving double
indexing, are often encountered.
In choosing the addressing modes to be implemented in a pipelined processor, we must
consider the effect of each addressing mode on instruction flow in the pipeline. Two important
considerations in this regard are the side effects of modes such as autoincrement and
autodecrement and the extent to which complex addressing modes cause the pipeline to stall.
Another important factor is whether a given mode is likely to be used by compilers.
To compare various approaches, we assume a simple model for accessing operands in the
memory. The load instruction Load X(R1),R2 takes five cycles to complete execution, as
indicated in Figure 8.5. However, the instruction
Load (R1),R2
can be organized to fit a four-stage pipeline because no address computation is required. Access to
memory can take place in stage E. A more complex addressing mode may require several
accesses to the memory to reach the named operand. For example, the instruction
Load (X(R1)),R2
may be executed as shown in Figure 8.16a, assuming that the index offset, X, is given in the instruction word. After computing the address in cycle 3, the processor needs to access memory twice: first to read location X + [R1] in clock cycle 4, and then to read location [X + [R1]] in cycle 5. If R2 is a source operand in the next instruction, that instruction would be stalled for three cycles, which can be reduced to two cycles with operand forwarding, as shown.
[Figure 2.10: Equivalent operations using complex and simple addressing modes. The timing diagram spans clock cycles 1 to 7 and shows, stage by stage (F, D, E, W), an Add step that computes X + [R1], a Load step that reads [X + [R1]], the next instruction proceeding through F D E W, and a Forward path that passes the computed value to the instruction that needs it.]
To implement the same Load operation using only simple addressing modes requires
several instructions. For example, on a computer that allows three operand addresses, we can
use
Add #X,R1,R2
Load (R2),R2
Load (R2),R2
The Add instruction performs the operation R2 ← X + [R1]. The two Load instructions fetch the address and then the operand from the memory. This sequence of instructions takes exactly the same number of clock cycles as the original, single Load instruction, as shown in Figure 8.16b.
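To see that the two forms really compute the same thing, the Python sketch below models registers and a word-addressed memory as dictionaries (all addresses and values are invented for illustration) and evaluates both the complex-mode Load and the three-instruction sequence.

X = 20
regs = {"R1": 1000, "R2": 0}
mem = {1020: 5000, 5000: 42}   # mem[X + [R1]] holds an address; that address holds the operand

# Complex mode: Load (X(R1)),R2 -> one address computation, two memory reads.
ea = X + regs["R1"]            # address computation
regs["R2"] = mem[mem[ea]]      # read the pointer, then read the operand
print(regs["R2"])              # 42

# Simple-mode sequence: Add #X,R1,R2 ; Load (R2),R2 ; Load (R2),R2
regs["R2"] = X + regs["R1"]    # Add #X,R1,R2  -> R2 = X + [R1]
regs["R2"] = mem[regs["R2"]]   # Load (R2),R2  -> fetch the address
regs["R2"] = mem[regs["R2"]]   # Load (R2),R2  -> fetch the operand
print(regs["R2"])              # 42 again: same result, same number of cycles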
This example indicates that, in a pipelined processor, complex addressing modes that involve
several accesses to the memory do not necessarily lead to faster execution. The main advantage of
such modes is that they reduce the number of instructions needed to perform a given task and
thereby reduce the program space needed in the main memory. Their main disadvantage is that
their long execution times cause the pipeline to stall, thus reducing its effectiveness. They require
more complex hardware to decode and execute them. Also, they are not convenient for compilers
to work with.
The instruction sets of modern processors are designed to take maximum advantage of pipelined
hardware. Because complex addressing modes are not suitable for pipelined execution, they should be
avoided. The addressing modes used in modern processors often have the following features:
• Access to an operand does not require more than one access to the memory.
• Only load and store instructions access memory operands.
• The addressing modes used do not have side effects.
Three basic addressing modes that have these features are register, register indirect, and index. The
first two require no address computation. In the index mode, the address can be computed in one
cycle, whether the index value is given in the instruction or in a register. Memory is accessed in
the following cycle. None of these modes has any side effects, with one possible exception. Some
architectures, such as ARM, allow the address computed in the index mode to be written back into
the index register. This is a side effect that would not be allowed under the guidelines above. Note
also that relative addressing can be used; this is a special case of indexed addressing in which the
program counter is used as the index register.
The three features just listed were first emphasized as part of the concept of RISC processors.
The SPARC processor architecture, which adheres to these guidelines, is presented in Section 8.7.
CONDITION CODES
In many processors, such as those described in Chapter 3, the condition code flags are stored in
the processor status register. They are either set or cleared by many instructions, so that they can
be tested by subsequent conditional branch instructions to change the flow of program execution. An
optimizing compiler for a pipelined processor attempts to reorder instructions to avoid stalling the
pipeline when branches or data dependencies between successive instructions occur. In doing so,
the compiler must ensure that reordering does not cause a change in the outcome of a computation.
The dependency introduced by the condition-code flags reduces the flexibility available for the
compiler to reorder instructions.
Consider the sequence of instructions in Figure 8.17a, and assume that the execution of the Compare and Branch=0 instructions proceeds as in Figure 8.14. The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction. The execution time of the Branch instruction can be reduced by interchanging the Add and Compare instructions, as shown in Figure 8.17b.
These observations lead to two important conclusions about the way condition codes should
be handled. First, to provide flexibility in reordering instructions, the condition-code flags should
be affected by as few instructions as possible. Second, the compiler should be able to specify in
which instructions of a program the condition codes are affected and in which they are not. An
instruction set designed with pipelining in mind usually provides the desired flexibility. Figure
8.17b shows the instructions reordered assuming that the condition code flags are affected only
when this is explicitly stated as part of the instruction OP code. The SPARC and ARM architectures provide this flexibility.
[Figure: Datapath organized for pipelined execution, showing the register file connected to Bus B and Bus C, the PC with its incrementer, the control signal pipeline, the ALU, the instruction queue and instruction decoder, the IMAR used for instruction cache access, and the memory address path used for data access.]
1. There are separate instruction and data caches that use separate address and data
connections to the processor. This requires two versions of the MAR register, IMAR for accessing
the instruction cache and DMAR for accessing the data cache.
2. The PC is connected directly to the IMAR, so that the contents of the PC can
be transferred to IMAR at the same time that an independent ALU operation is taking place.
The basic performance equation gives the program execution time as T = (N × S) / R, where N is the number of instructions executed, S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate. This simple model assumes that instructions are executed one after the other, with no overlap. A useful performance indicator is the instruction throughput, which is the number of instructions executed per second. For sequential execution, the throughput Ps is given by
Ps = R/S
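For a concrete feel for these numbers, the following Python sketch (the clock rate and CPI values are assumed, not taken from the text) evaluates Ps = R/S for sequential execution and for the ideal pipelined case where S approaches 1.

def throughput(clock_rate_hz, cycles_per_instruction):
    # Ps = R / S, in instructions per second.
    return clock_rate_hz / cycles_per_instruction

clock = 500e6                          # assume a 500 MHz clock
print(throughput(clock, 4))            # 1.25e8 instructions/s with S = 4
print(throughput(clock, 1))            # 5.0e8 instructions/s if pipelining brings S down to 1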
In this section, we examine the extent to which pipelining increases instruction throughput.
However, we should reemphasize the point made in Chapter 1 regarding performance measures.
The only real measure of performance is the total execution time of a program. Higher instruction
throughput will not necessarily lead to higher performance if a larger number of instructions is
needed to implement the desired task. For this reason, the SPEC ratings described in Chapter 1 provide
a much better indicator when comparing two processors.
Figure 8.2 shows that a four-stage pipeline may increase instruction throughput by a factor of
four. In general, an n-stage pipeline has the potential to increase throughput n times. Thus, it would
appear that the higher the value of n, the larger the performance gain. This leads to two questions:
• How much of this potential increase in instruction throughput can be realized in practice?
• What is a good value for n?
Any time a pipeline is stalled, the instruction throughput is reduced. Hence, the performance of a pipeline is highly influenced by factors such as branch and cache miss penalties. First, we discuss the effect of these factors on performance, and then we return to the question of how many pipeline stages should be used.
PERFORMANCE EVALUATION
Pipelining is a process of arranging the hardware elements of the CPU such that its overall performance is increased. Simultaneous execution of more than one instruction takes place in a pipelined processor. Thus, pipelined operation increases the efficiency of a system.
In computing, computer performance is the amount of useful work accomplished by a computer
system. Outside of specific contexts, computer performance is estimated in terms of
accuracy, efficiency and speed of executing computer program instructions. When it comes to
high computer performance, one or more of the following factors might be involved:
• Short response time for a given piece of work.
• High throughput (rate of processing work).
• Low utilization of computing resource(s).
o Fast (or highly compact) data compression and decompression.
• High availability of the computing system or application.
• High bandwidth.
• Short data transmission time.
Technical and non-technical definitions
The performance of any computer system can be evaluated in measurable, technical terms, using
one or more of the metrics listed above. This way the performance can be
• Compared relative to other systems or the same system before/after changes
• In absolute terms, e.g. for fulfilling a contractual obligation
Whilst the above definition relates to a scientific, technical approach, the following definition
given by Arnold Allen would be useful for a non-technical audience:
The word performance in computer performance means the same thing that performance means
in other contexts, that is, it means "How well is the computer doing the work it is supposed to
do?"[1]
As an aspect of software quality
Computer software performance, particularly software application response time, is an aspect
of software quality that is important in human–computer interactions.
Performance engineering
Performance engineering within systems engineering encompasses the set of roles, skills,
activities, practices, tools, and deliverables applied at every phase of the systems development life
cycle which ensures that a solution will be designed, implemented, and operationally supported to
meet the performance requirements defined for the solution.
Performance engineering continuously deals with trade-offs between types of performance.
Occasionally a CPU designer can find a way to make a CPU with better overall performance by
improving one of the aspects of performance, presented below, without sacrificing the CPU's
performance in other areas. For example, building the CPU out of better, faster transistors.
However, sometimes pushing one type of performance to an extreme leads to a CPU with worse
overall performance, because other important aspects were sacrificed to get one impressive-
looking number, for example, the chip's clock rate (see the megahertz myth).
Application performance engineering
Application Performance Engineering (APE) is a specific methodology within performance
engineering designed to meet the challenges associated with application performance in
increasingly distributed mobile, cloud and terrestrial IT environments. It includes the roles, skills,
activities, practices, tools and deliverables applied at every phase of the application lifecycle that
ensure an application will be designed, implemented and operationally supported to meet non-
functional performance requirements.
Aspects of performance
Computer performance metrics (things to measure) include availability, response time, channel
capacity, latency, completion time, service time, bandwidth, throughput, relative
efficiency, scalability, performance per watt, compression ratio, instruction path length and speed
up. CPU benchmarks are available.[2]
Availability
Availability of a system is typically measured as a factor of its reliability - as reliability increases,
so does availability (that is, less downtime). Availability of a system may also be increased by the
strategy of focusing on increasing testability and maintainability and not on reliability. Improving
maintainability is generally easier than reliability. Maintainability estimates (Repair rates) are also
generally more accurate. However, because the uncertainties in the reliability estimates are in most
cases very large, it is likely to dominate the availability (prediction uncertainty) problem, even
while maintainability levels are very high.
Response time
Response time is the total amount of time it takes to respond to a request for service. In computing,
that service can be any unit of work from a simple disk IO to loading a complex web page. The
response time is the sum of three numbers:[3]
• Service time - How long it takes to do the work requested.
• Wait time - How long the request has to wait for requests queued ahead of it before it gets to run.
• Transmission time – How long it takes to move the request to the computer doing the work and
the response back to the requestor.
Processing speed
Most consumers pick a computer architecture (normally Intel IA32 architecture) to be able to run
a large base of pre-existing, pre-compiled software. Being relatively uninformed on computer
benchmarks, some of them pick a particular CPU based on operating frequency (see megahertz
myth).
Some system designers building parallel computers pick CPUs based on the speed per dollar.
Channel capacity
Channel capacity is the tightest upper bound on the rate of information that can be reliably
transmitted over a communications channel. By the noisy-channel coding theorem, the channel
capacity of a given channel is the limiting information rate (in units of information per unit time)
that can be achieved with arbitrarily small error probability.[4][5]
Latency
Latency is a time delay between the cause and the effect of some physical change in the system
being observed. Latency is a result of the limited velocity with which any physical interaction can
take place. This velocity is always lower than or equal to the speed of light. Therefore, every physical
system that has spatial dimensions different from zero will experience some sort of latency.
System designers building real-time computing systems want to guarantee worst-case response.
That is easier to do when the CPU has low interrupt latency and when it has deterministic response.
Bandwidth
In computer networking, bandwidth is a measurement of bit-rate of available or consumed data
communication resources, expressed in bits per second or multiples of it (bit/s, kbit/s, Mbit/s,
Gbit/s, etc.).
Bandwidth sometimes defines the net bit rate (aka. peak bit rate, information rate, or physical layer
useful bit rate), channel capacity, or the maximum throughput of a logical or physical
communication path in a digital communication system.
Throughput
In general terms, throughput is the rate of production or the rate at which something can be
processed.
In communication networks, throughput is essentially synonymous to digital bandwidth
consumption. In wireless networks or cellular communication networks, the system spectral
efficiency in bit/s/Hz/area unit, bit/s/Hz/site or bit/s/Hz/cell, is the maximum system throughput
(aggregate throughput) divided by the analog bandwidth and some measure of the system coverage
area.
In integrated circuits, a block in a data flow diagram often has a single input and a single output, and operates on discrete packets of information. Examples of such blocks are FFT modules
or binary multipliers. Because the units of throughput are the reciprocal of the unit for propagation
delay, which is 'seconds per message' or 'seconds per output', throughput can be used to relate a
computational device performing a dedicated function such as an ASIC or embedded processor to
a communications channel, simplifying system analysis.
Relative efficiency
The efficiency of n stages in a pipeline is defined as the ratio of the actual speedup to the maximum possible speedup, which is n.
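In code form (a one-line sketch of our own, with example numbers), the efficiency is simply the actual speedup divided by the stage count n:

def pipeline_efficiency(actual_speedup, n_stages):
    # The maximum possible speedup of an n-stage pipeline is n.
    return actual_speedup / n_stages

print(pipeline_efficiency(3.2, 4))     # 0.8, i.e. 80% of the ideal 4x speedup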
Scalability
Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth.
Power consumption
The amount of electric power used by the computer (power consumption). This becomes
especially important for systems with limited power sources such as solar, batteries, human power.
Performance per watt
System designers building parallel computers, such as Google's hardware, pick CPUs based on
their speed per watt of power, because the cost of powering the CPU outweighs the cost of the
CPU itself. For spaceflight computers, the processing speed per watt ratio is a more useful
performance criterion than raw processing speed.
Compression ratio
Compression is useful because it helps reduce resource usage, such as data storage space or
transmission capacity. Because compressed data must be decompressed to use, this extra
processing imposes computational or other costs through decompression; this situation is far from
being a free lunch. Data compression is subject to a space–time complexity trade-off.
Size and weight
This is an important performance feature of mobile systems, from the smart phones you keep in
your pocket to the portable embedded systems in a spacecraft.
Environmental impact
The effect of a computer or computers on the environment, during manufacturing and recycling as
well as during use. Measurements are taken with the objectives of reducing waste, reducing
hazardous materials, and minimizing a computer's ecological footprint.
Transistor count
The transistor count is the number of transistors on an integrated circuit (IC). Transistor count is
the most common measure of IC complexity.
Benchmarks
Because there are so many programs to test a CPU on all aspects of performance, benchmarks were
developed.
The most famous benchmarks are the SPECint and SPECfp benchmarks developed by Standard
Performance Evaluation Corporation and the Certification Mark benchmark developed by the
Embedded Microprocessor Benchmark Consortium EEMBC.
Software performance testing
In software engineering, performance testing is in general testing performed to determine how a
system performs in terms of responsiveness and stability under a particular workload. It can also
serve to investigate, measure, validate or verify other quality attributes of the system, such as
scalability, reliability and resource usage.
Performance testing is a subset of performance engineering, an emerging computer science
practice which strives to build performance into the implementation, design and architecture of a
system.
Advantages of Pipelining
• Instruction throughput increases.
• Increase in the number of pipeline stages increases the number of instructions executed
simultaneously.
• A faster ALU can be designed when pipelining is used.
• Pipelined CPUs work at higher clock frequencies than the RAM.
• Pipelining increases the overall performance of the CPU.
Disadvantages of Pipelining
• Designing of the pipelined processor is complex.
• Instruction latency increases in pipelined processors.
• The throughput of a pipelined processor is difficult to predict.
• The longer the pipeline, the worse the problem of hazards for branch instructions.
2.7 EXCEPTION HANDLING
Exceptions and interrupts are unexpected events that disrupt the normal flow of instruction
execution. An exception is an unexpected event from within the processor. An interrupt is an
unexpected event from outside the processor.
Exceptions or interrupts are unexpected events that require change in flow of control. Different
ISAs use the terms differently. Exceptions generally refer to events that arise within the CPU, for
example, undefined opcode, overflow, system call, etc. Interrupts point to requests coming from
an external I/O controller or device to the processor. Dealing with these events without sacrificing
performance is hard. For the rest of the discussion, we will not distinguish between the two. We
shall refer to them collectively as exceptions. Some examples of such exceptions are listed below:
• Breakpoint
• FP arithmetic anomaly
• Page fault
• Hardware malfunctions
• Power failure
• Synchronous vs Asynchronous
• Some exceptions may be synchronous, whereas others may be asynchronous. If the same
exception occurs in the same place with the same data and memory allocation, then it is a
synchronous exception.
They are more difficult to handle.
• Devices external to the CPU and memory cause asynchronous exceptions. They can be handled after the current instruction and are hence easier to deal with than synchronous exceptions.
• Some exceptions may be user requested and not automatic. Such exceptions are predictable
and can be handled after the current instruction.
• Coerced exceptions are generally raised by hardware and not under the control of the user
program. They are harder to handle.
• Exceptions can be maskable or unmaskable. They can be masked or unmasked by a user task.
This decides whether the hardware responds to the exception or not. You may have instructions
that enable or disable exceptions.
• Exceptions may have to be handled within the instruction or between instructions. Exceptions handled within an instruction are normally synchronous and are harder, since the instruction has to be stopped and restarted. Catastrophic exceptions like hardware malfunction will normally cause termination.
• Exceptions that can be handled between two instructions are easier to handle.
• Resume vs Terminate
• Some exceptions allow the program to be continued after the exception, while others lead to termination. Things are much more complicated if we have to restart.
• Exceptions that lead to termination are much easier, since we just have to terminate and need not restore the original status.
Therefore, exceptions that occur within instructions and exceptions that must be restartable are
much more difficult to handle.
An ideal processor is one where all constraints on ILP are removed. The only limits on ILP
in such a processor are those imposed by the actual data flows through either registers or memory.
1.Register renaming
—There are an infinite number of virtual registers available, and hence all WAW and WAR
hazards are avoided and an unbounded number of instructions can begin execution simultaneously.
2.Branch prediction
—Branch prediction is perfect. All conditional branches are predicted exactly.
3.Jump prediction
—All jumps (including jump register used for return and computed jumps) are perfectly predicted. When combined with perfect branch prediction, this is equivalent to having a processor with perfect speculation and an unbounded buffer of instructions available for execution.
4.Perfect memory address alias analysis
—All memory addresses are known exactly, so a load can be moved before a store provided that the addresses are not identical.
5.Perfect caches
—All memory accesses take 1 clock cycle. In practice, superscalar processors
will typically consume large amounts of ILP hiding cache misses, making these results highly
optimistic.
To measure the available parallelism, a set of programs was compiled and optimized with
the standard MIPS optimizing compilers. The programs were instrumented and executed to
produce a trace of the instruction and data references. Every instruction in the trace is then
scheduled as early as possible, limited only by the data dependences. Since a trace is used, perfect
branch prediction and perfect alias analysis are easy to do. With these mechanisms, instructions
may bescheduled much earlier than they would otherwise, moving across large numbers of
instructions on which they are not data dependent, including branches, since branches are perfectly
predicted.
The effects of various assumptions are given before looking at some ambitious but
realizable processors.
To build a processor that even comes close to perfect branch prediction and perfect alias
analysis requires extensive dynamic analysis, since static compile time schemes cannot be perfect.
Of course, most realistic dynamic schemes will not be perfect, but the use of dynamic schemes
will provide the ability to uncover parallelism that cannot be analysed by static compile time
analysis. Thus, a dynamic processor might be able to more closely match the amount of parallelism
uncovered by our ideal processor.
Our ideal processor assumes that branches can be perfectly predicted: The outcome of
any branch in the program is known before the first instruction is executed! Of course, no real
processor can ever achieve this.
We assume a separate predictor is used for jumps. Jump predictors are important
primarily with the most accurate branch predictors, since the branch frequency is higher and the
accuracy of the branch predictors dominates.
1.Perfect —All branches and jumps are perfectly predicted at the start of execution.
2.Tournament-based branch predictor —The prediction scheme uses a correlating 2-bit
predictor and a noncorrelating 2-bit predictor together with a selector, which chooses the
best predictor for each branch.
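One of the building blocks of such predictors is the 2-bit saturating counter. The Python sketch below (indexed by branch PC only, for simplicity; a real correlating predictor would also fold in global branch history) shows how the counter needs two consecutive mispredictions before it changes its prediction.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                     # branch PC -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # 2 or 3 means "predict taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:      # a mostly taken branch at 0x40
    print(p.predict(0x40), outcome)
    p.update(0x40, outcome)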
Our ideal processor eliminates all name dependences among register references using an
infinite set of virtual registers. To date, the IBM Power5 has provided the largest numbers of virtual
registers: 88 additional floating-point and 88 additional integer registers, in addition to the 64
registers available in the base architecture. All 240 registers are shared by two threads when
executing in multithreading mode, and all are available to a single thread when in single-thread
mode.
Our optimal model assumes that it can perfectly analyze all memory dependences, as well
as eliminate all register name dependences. Of course, perfect alias analysis is not possible in
practice: The analysis cannot be perfect at compile time, and it requires a potentially unbounded
number of comparisons at run time (since the number of simultaneous memory references is
unconstrained).
1. Global/stack perfect—This model does perfect predictions for global and stack
references and assumes all heap references conflict. This model represents an idealized version of
the best compiler-based analysis schemes currently in production. Recent and ongoing research on
alias analysis for pointers should improve the handling of pointers to the heap in the future.
2. Inspection—This model examines the accesses to see if they can be determined not to
interfere at compile time. For example, if an access uses R10 as a base register with an offset of
20, then another access that uses R10 as a base register with an offset of 100 cannot interfere,
assuming R10 could not have changed. In addition, addresses based on registers that point to
different allocation areas (such as the global area and the stack area) are assumed never to alias.
This analysis is similar to that performed by many existing commercial compilers, though newer compilers can do better, at least for loop-oriented programs; a small sketch of this offset check is given after the list below.
3. None—All memory references are assumed to conflict. As you might expect, for the FORTRAN programs (where no heap references exist), there is no difference between perfect and global/stack perfect analysis.
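The offset check described in the inspection model can be sketched in a few lines of Python (the tuple format and the 4-byte access size are our assumptions for the example; the R10 offsets come from the text).

def may_interfere(access_a, access_b, access_size=4):
    base_a, offset_a = access_a
    base_b, offset_b = access_b
    if base_a != base_b:
        # Different base registers: assume a conflict unless they are known to
        # point to different allocation areas (e.g. the stack vs. the global area).
        return True
    # Same, unchanged base register: the accesses interfere only if the
    # accessed byte ranges overlap.
    return abs(offset_a - offset_b) < access_size

print(may_interfere(("R10", 20), ("R10", 100)))   # False: offsets 20 and 100 cannot overlap
print(may_interfere(("R10", 20), ("R10", 22)))    # True: the 4-byte words overlap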
2.9 MULTICORE PROCESSING
Multicore processing and virtualization bring clear benefits but also present challenges. This section concentrates on multicore processing: it defines its various types, lists current trends, examines the pros and cons, and briefly addresses safety and security ramifications.
Definitions
A multicore processor is a single integrated circuit (a.k.a., chip multiprocessor or CMP) that
contains multiple core processing units, more commonly known as cores. There are many different
multicore processor architectures, which vary in terms of
• Number of cores. Different multicore processors often have different numbers of cores.
For example, a quad-core processor has four cores. The number of cores is usually a power
of two.
• Number of core types.
o Homogeneous (symmetric) cores. All of the cores in a homogeneous multicore
processor are of the same type; typically the core processing units are general-
purpose central processing units that run a single multicore operating system.
o Heterogeneous (asymmetric) cores. Heterogeneous multicore processors have a
mix of core types that often run different operating systems and include graphics
processing units.
• Number and level of caches. Multicore processors vary in terms of their instruction and
data caches, which are relatively small and fast pools of local memory.
• How cores are interconnected. Multicore processors also vary in terms of
their bus architectures.
• Isolation. The amount, typically minimal, of in-chip support for the spatial and temporal
isolation of cores:
o Physical isolation ensures that different cores cannot access the same physical
hardware (e.g., memory locations such as caches and RAM).
o Temporal isolation ensures that the execution of software on one core does not
impact the temporal behavior of software running on another core.
The following figure notionally shows the architecture of a system in which 14 software
applications are allocated by a single host operating system to the cores in a homogeneous quad-
core processor. In this architecture, there are three levels of cache, which are progressively larger
but slower: L1 (consisting of an instruction cache and a data cache), L2, and L3. Note that the L1
and L2 caches are local to a single core, whereas L3 is shared among all four cores.
The following figure notionally shows how these 14 applications could be allocated to four
different operating systems, which in turn are allocated to four different cores, in a heterogeneous,
quad-core processor. From left to right, the cores include a general-purpose central processing unit
core running Windows; a graphical processing unit (GPU) core running graphics-intensive
applications on Linux; a digital signal processing (DSP) core running a real-time operating system
(RTOS); and a high-performance core also running an RTOS.
Multicore processors are replacing traditional, single-core processors so that fewer single-core
processors are being produced and supported. Consequently, single-core processors are
becoming technologically obsolete. Heterogeneous multicore processors, such as computer-on-a-
chip processors, are becoming more common.
Although multicore processors have largely saturated some application domains (e.g., cloud
computing, data warehousing, and on-line shopping), they are just starting to be used in real-time,
safety- and security-critical, cyber-physical systems. One area in which multicore processing is
becoming popular is in environments constrained by size, weight, power, and cooling (SWaP-C), in which significantly increased performance is required.
Multicore processing has become commonplace because it offers advantages in the following seven
areas:
1. Energy Efficiency. By using multicore processors, architects can decrease the number of
embedded computers. They overcome increased heat generation due to Moore's Law (i.e.,
smaller circuits increase electrical resistance, which creates more heat), which in turn
decreases the need for cooling. The use of multicore processing reduces power
consumption (less energy wasted as heat), which increases battery life.
2. True Concurrency. By allocating applications to different cores, multicore processing
increases the intrinsic support for actual (as opposed to virtual) parallel processing within
individual software applications across multiple applications.
3. Performance. Multicore processing can increase performance by running multiple
applications concurrently. The decreased distance between cores on an integrated chip
enables shorter resource access latency and higher cache speeds when compared to using
separate processors or computers. However, the size of the performance increase depends
on the number of cores, the level of real concurrency in the actual software, and the use of
shared resources.
4. Isolation. Multicore processors may improve (but do not guarantee) spatial and temporal
isolation (segregation) compared to single-core architectures. Software running on one
core is less likely to affect software on another core than if both are executing on the same
single core. This decoupling is due to both spatial isolation (of data in core-specific cashes)
and temporal isolation, because threads on one core are not delayed by threads on another
core. Multicore processing may also improve robustness by localizing the impact of defects
to single core. This increased isolation is particularly important in the independent
execution of mixed-criticality applications (mission-critical, safety critical, and security-
critical).
5. Reliability and Robustness. Allocating software to multiple cores increases reliability and
robustness (i.e., fault and failure tolerance) by limiting fault and/or failure propagation
from software on one core to software on another. The allocation of software to multiple
cores also supports failure tolerance by supporting failover from one core to another (and
subsequent recovery).
6. Obsolescence Avoidance. The use of multicore processors enables architects to avoid
technological obsolescence and improve maintainability. Chip manufacturers are applying
the latest technical advances to their multicore chips. As the number of cores continues to
increase, it becomes increasingly hard to obtain single-core chips.
7. Hardware Costs. By using multicore processors, architects can produce systems with
fewer computers and processors.
Although there are many advantages to moving to multicore processors, architects must address
disadvantages and associated risks in the following six areas:
1. Shared Resources. Cores on the same processor share both processor-internal resources
(L3 cache, system bus, memory controller, I/O controllers, and interconnects)
and processor-external resources (main memory, I/O devices, and networks). These shared
resources imply (1) the existence of single points of failure, (2) two applications running
on the same core can interfere with each other, and (3) software running on one core can
impact software running on another core (i.e., interference can violate spatial and temporal
isolation because multicore support for isolation is limited). The diagram below uses the
color red to illustrate six shared resources.
2. Interference. Interference occurs when software executing on one core impacts the
behavior of software executing on other cores in the same processor. This interference
includes failures of both spatial isolation (due to shared memory access) and failure
of temporal isolation (due to interference delays and/or penalties). Temporal isolation is a
bigger problem than spatial isolation since multicore processors may have special hardware
that can be used to enforce spatial isolation (to prevent software running on different cores
from accessing the same processor-internal memory). The number of interference paths
increases rapidly with the number of cores and the exhaustive analysis of all interference
paths is often impossible. The impracticality of exhaustive analysis necessitates the
selection of representative interference paths when analyzing isolation. The following
diagram uses the color red to illustrate three possible interference paths between pairs of
applications involving six shared resources.
3. Concurrency Defects. Cores execute concurrently, creating the potential for concurrency
defects including deadlock, livelock, starvation, suspension, (data) race conditions,
priority inversion, order violations, and atomicity violations. Note that these are essentially
the same types of concurrency defects that can occur when software is allocated to multiple
threads on a single core.
4. Non-determinism. Multicore processing increases non-determinism. For example, I/O
Interrupts have top-level hardware priority (also a problem with single core processors).
Multicore processing is also subject to lock thrashing, which stems from excessive lock
conflicts due to simultaneous access of kernel services by different cores (resulting in
decreased concurrency and performance). The resulting non-deterministic behavior can be
unpredictable, can cause related faults and failures, and can make testing more difficult
(e.g., running the same test multiple times may not yield the same test result).
5. Analysis Difficulty. The real concurrency due to multicore processing requires different
memory consistency models than virtual interleaved concurrency. It also breaks traditional
analysis approaches for work on single core processors. The analysis of maximum time
limits is harder and may be overly conservative. Although interference analysis becomes
The SEI is transitioning this research through activities that include the following:
• a workshop with participants from Carnegie Mellon University, Nagoya University, the
University of Illinois at Urbana-Champaign, and NEC Electronics
• collaboration with Lockheed-Martin researchers to publish papers on the use of multicore
architectures in cyber-physical systems
• invited talks at conferences such as IEEE International Conference on Embedded and Real-
Time Computing Systems and Applications
In this tutorial, we are going to learn about the Flynn's Classification of Computer Architecture
in Computer Science Organization.
Single instruction: Only one instruction stream is being acted on or executed by the CPU during one clock cycle.
Single data stream: Only one data stream is used as input during one clock cycle.
A SIMD system is a multiprocessor machine, capable of executing the same instruction on all the
CPUs but operating on the different data stream.
SPMD
As a computer program is executed, there are many events that can cause the CPU hardware
resources to not be fully utilized every CPU cycle. Such events include:
• Data Cache Misses - the required data must be loaded from memory outside the CPU. The
CPU has to wait for that data to arrive from the remote memory.
• Instruction Cache Misses - the next instruction of the program must be fetched from
memory outside the CPU. Again, the CPU has to wait for the next instruction to arrive
from the remote memory.
• Data dependency stalls - the next instruction can't execute yet as one of its input operands
hasn't been calculated yet.
• Functional Unit stalls - the next instruction can't execute yet as the required hardware
resource is currently busy.
When one portion of the program (known as a thread) is blocked by one of these events, the hardware resources could potentially be used for another thread of execution. By switching to a second thread when the first thread is blocked, the overall throughput of the system can be increased. The idea of speeding up the aggregate execution of all threads in the system is the basis of hardware multi-threading.
If one replicates an entire CPU to execute a second thread, then the technique is known as
multi-processing.
If one replicates only a portion of a CPU to execute a second thread, then the technique is
known as multi-threading.
A simple graphical example can be seen in Figures 1 and 2 - where the multi-threaded
implementation in Figure 2 does more aggregate work in the same number of cycles as the
single-threaded CPU in Figure 1. Instead of having the execution pipeline be idle while
waiting for the Memory data to arrive, the multi-threaded CPU executes code for Thread2
during those same memory access cycles. The idle cycles in black are often known as pipeline
"bubbles".
Sharing hardware resources among multiple threads gives an obvious cost advantage to multi-threading as compared to full-blown multi-processing. Another potential benefit is that multiple threads could be working on the same data. By sharing the same data caches, multiple threads get better utilization of these caches and better synchronization of the shared data.
By minimizing how much hardware is replicated for executing a software thread, multi-threading can boost overall system performance and throughput with relatively little additional hardware cost.
The performance boost from multi-threading comes from filling CPU cycles with useful work that would otherwise go unused due to stalls. Many applications have a low number of instructions executed per cycle when run in single-threaded mode and are therefore good candidates for multi-threading.
Any application which can keep the CPU fully busy every cycle with a single thread is not a good candidate for multi-threading; such applications are relatively rare.
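As a rough, hypothetical illustration of this point (the busy fractions below are invented numbers, not figures from the text), a thread that stalls for a large share of its cycles leaves room for a second thread to fill those cycles, up to the point where the pipeline saturates:

def pipeline_utilization(busy_fractions):
    # Each value is the fraction of cycles one thread could keep the pipeline
    # busy on its own; combined utilization saturates at 1.0.
    return min(sum(busy_fractions), 1.0)

print(pipeline_utilization([0.55]))        # single thread: 0.55
print(pipeline_utilization([0.55, 0.55]))  # two threads sharing the pipeline: 1.0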
Since the introduction of the first MT-enabled MIPS CPUs, there have been multiple studies of the MT performance benefits. Performance boosts of up to 226% have been reported, with the gains coming from both the hardware and from OS task-scheduling changes. Various application notes about MT performance gains, and an app note on MIPS Creator Ci40 Multithreading Benchmarks, can be found on the Imagination website
at: https://community.imgtec.com/developers/mips/resources/application-notes/
Types of Multi-threading
Coarse-Grained MT
Hardware support for this type of multi-threading is meant to allow quick switching between the threads. To achieve this goal, the additional hardware cost is to replicate the program-visible state, such as the GPRs and the program counter, for each thread. For example, to quickly switch between two threads, the hardware cost would be two copies of the GPRs and the program counter.
For this type of multi-threading, only long-latency stalls cause thread switches, because an instruction has to be added to the program for each stall check; it would be too costly to add such check instructions for very short stalls.
Fine-Grained MT
Early implementations of this type of multi-threading caused a thread switch every CPU
cycle. The motivation for switching every cycle was to reduce the possibility of stalling for
a previous result from the same thread. This early type was known as barrel processing, in
which staves of a barrel represented the pipeline stages of the CPU. It was also known as
interleaved or pre-emptive or time-sliced multi-threading. It was conceptually similar to
preemptive multi-tasking, used in operating systems, where the time slice that is given to each
active thread is one CPU cycle.
The additional hardware cost of fine-grained multi-threading is tracking the thread ID of the instruction in each pipeline stage. In addition, since multiple threads are concurrently active, shared resources such as caches and TLBs might need to be increased in size to avoid thrashing between the different threads.
More modern implementations would only cause a thread switch when the currently running
thread became blocked. For these more modern implementations, a thread can continue
executing until it would produce a stall.
Simultaneous MT
In a superscalar CPU that issues several instructions per cycle, simultaneous multi-threading (SMT) means that the instructions issued together can either all be from the same thread or can come from different threads. The hardware thread scheduler picks the most appropriate instructions to maximize the utilization of the execution pipelines.
MIPS Multi-threading
The first multi-threaded processor from MIPS was the MIPS32 34K, which was released in
2005. The 34K implemented fine-grained multi-threading (the more modern kind which
doesn’t have to blindly switch threads every cycle), with a hardware thread scheduler within
the CPU which picks the most appropriate thread to run each CPU cycle.
All subsequent multi-threaded processors from MIPS have also implemented fine-grained
multi-threading. This includes the 1004K symmetric multiprocessing system, and the
interAptiv family of multi-threaded, multi-core CPUs, which deliver increased multi-core
performance and added features such as dcache ECC, Extended Virtual Addressing
(EVA), multi-threaded FPU, updated DSP ASE, and other features.
The R6 versions of the MIPS32/64 architectures were released in 2014 and introduced a
simplified definition for MIPS Multi-threading. In this simplified definition, the entity
executing a software thread is known as a Virtual Processor. The MIPS Warrior I6400 is a
super-scalar CPU and was the first MIPS CPU which also implemented SMT. It also added
features including hardware virtualization.
More recently Imagination introduced the I6500 CPU, a superset of I6400 with numerous new
features at the core and cluster levels, and highly extended capabilities into heterogeneous
compute configurations.
Problem-02:
A four stage pipeline has stage delays of 150, 120, 160 and 140 ns respectively. Registers are used between the stages and have a delay of 5 ns each. Assuming a constant clocking rate, the total time taken to process 1000 data items on the pipeline will be-
1. 120.4 microseconds
2. 160.5 microseconds
3. 165.5 microseconds
4. 590.0 microseconds
Solution-
Given-
• Four stage pipeline is used
• Delay of stages = 150, 120, 160 and 140 ns
• Delay due to each register = 5 ns
• 1000 data items or instructions are processed
Cycle Time-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 150, 120, 160, 140 } + 5 ns
= 160 ns + 5 ns
= 165 ns
Pipeline Time To Process 1000 Data Items-
Pipeline time to process 1000 data items
= Time taken for 1st data item + Time taken for remaining 999 data items
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 165 ns + 999 x 165 ns
= 660 ns + 164835 ns
= 165495 ns
= 165.5 μs
Thus, Option (C) is correct.
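The same arithmetic can be checked with a short script (our own sketch; the variable names are not part of the problem):

stage_delays = [150, 120, 160, 140]        # ns
register_delay = 5                         # ns
n_items = 1000

cycle_time = max(stage_delays) + register_delay        # 165 ns
k = len(stage_delays)                                   # 4 stages
total_ns = (k + n_items - 1) * cycle_time               # (k + n - 1) clock cycles
print(cycle_time, total_ns, total_ns / 1000)            # 165 165495 165.495 (microseconds)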
Problem-03:
Consider a non-pipelined processor with a clock rate of 2.5 gigahertz and average cycles per
instruction of 4. The same processor is upgraded to a pipelined processor with five stages but due
to the internal pipeline delay, the clock speed is reduced to 2 gigahertz. Assume there are no stalls
in the pipeline. The speed up achieved in this pipelined processor is-
1. 3.2
2. 3.0
3. 2.2
4. 2.0
Solution-
Cycle Time in Non-Pipelined Processor-
Frequency of the clock = 2.5 gigahertz
Cycle time
= 1 / frequency
= 1 / (2.5 gigahertz)
= 1 / (2.5 x 10^9 hertz)
= 0.4 ns
Non-Pipeline Execution Time-
Non-pipeline execution time per instruction
= Average CPI x Cycle time
= 4 x 0.4 ns
= 1.6 ns
Cycle Time in Pipelined Processor-
Cycle time
= 1 / frequency
= 1 / (2 gigahertz)
= 0.5 ns
Pipeline Execution Time-
With no stalls, the pipelined processor completes one instruction per clock cycle, so
Pipeline execution time
= 1 clock cycle
= 0.5 ns
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 1.6 ns / 0.5 ns
= 3.2
Thus, Option (A) is correct.
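A quick check of this speed-up (a minimal sketch under the stated assumption of no pipeline stalls):

cpi_non_pipelined = 4
t_non_pipelined_ns = 1 / 2.5          # cycle time in ns at 2.5 GHz
t_pipelined_ns = 1 / 2.0              # cycle time in ns at 2 GHz; CPI = 1 with no stalls
speedup = (cpi_non_pipelined * t_non_pipelined_ns) / (1 * t_pipelined_ns)
print(speedup)                         # 3.2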
Problem-04:
The stage delays in a 4 stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage is
replaced with a functionally equivalent design involving two stages with respective delays 600 and
350 picoseconds.
The throughput increase of the pipeline is _____%.
Solution-
Execution Time in 4 Stage Pipeline-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 800, 500, 400, 300 } + 0
= 800 picoseconds
Thus, Execution time in 4 stage pipeline = 1 clock cycle = 800 picoseconds.
Throughput in 4 Stage Pipeline-
Throughput
= Number of instructions executed per unit time
= 1 instruction / 800 picoseconds
Execution Time in New 5 Stage Pipeline-
After the first stage is replaced by two stages of 600 and 350 picoseconds, the pipeline has five stages with delays 600, 350, 500, 400 and 300 picoseconds.
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 600, 350, 500, 400, 300 } + 0
= 600 picoseconds
Throughput in New 5 Stage Pipeline-
Throughput
= Number of instructions executed per unit time
= 1 instruction / 600 picoseconds
Throughput Increase-
Throughput increase
= { (1/600) - (1/800) } / (1/800) x 100
= { (800/600) - 1 } x 100
= 33.33%
Thus, the throughput increase of the pipeline is 33.33%.
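The throughput comparison can be verified in a few lines (our own sketch, working in instructions per picosecond):

old_stages = [800, 500, 400, 300]            # picoseconds
new_stages = [600, 350, 500, 400, 300]       # first stage replaced by two stages
old_throughput = 1 / max(old_stages)         # instructions per picosecond
new_throughput = 1 / max(new_stages)
increase = (new_throughput - old_throughput) / old_throughput * 100
print(round(increase, 2))                    # 33.33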
Problem-05:
A non-pipelined single cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five stages requiring 2.5 ns, 1.5 ns, 2 ns, 1.5 ns and 2.5 ns respectively. The delay of the latches is 0.5 ns.
The speed up of the pipeline processor for a large number of instructions is-
1. 4.5
2. 4.0
3. 3.33
4. 3.0
Solution-
Cycle Time in Non-Pipelined Processor-
Frequency of the clock = 100 MHz
Cycle time
= 1 / frequency
= 1 / (100 MHz)
= 1 / (100 x 10^6 hertz)
= 0.01 μs
Non-Pipeline Execution Time-
Non-pipeline execution time to process 1 instruction
= Number of clock cycles taken to execute one instruction
= 1 clock cycle
= 0.01 μs
= 10 ns
Cycle Time in Pipelined Processor-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 2.5, 1.5, 2, 1.5, 2.5 } + 0.5 ns
= 2.5 ns + 0.5 ns
= 3 ns
Pipeline Execution Time-
Pipeline execution time
= 1 clock cycle
= 3 ns
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 10 ns / 3 ns
= 3.33
Thus, Option (C) is correct.
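For a large number of instructions the speed-up approaches the ratio of the per-instruction times, as the following sketch (with our own variable names) confirms:

t_non_pipelined = 10.0                        # ns per instruction (one cycle at 100 MHz)
stage_delays = [2.5, 1.5, 2.0, 1.5, 2.5]      # ns
latch_delay = 0.5                             # ns
t_cycle = max(stage_delays) + latch_delay     # 3 ns
n = 1_000_000                                 # a large number of instructions
k = len(stage_delays)
speedup = (n * t_non_pipelined) / ((k + n - 1) * t_cycle)
print(round(speedup, 2))                      # 3.33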
Problem-06:
We have 2 designs D1 and D2 for a synchronous pipeline processor. D1 has a 5 stage pipeline with stage execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages, each with 2 ns execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?
1. 214 ns
2. 202 ns
3. 86 ns
4. 200 ns
Solution-
Cycle Time in Design D1-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 3, 2, 4, 2, 3 } + 0
= 4 ns
Execution Time For 100 Instructions in Design D1-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99 instructions
= 1 x 5 clock cycles + 99 x 1 clock cycle
= 5 x cycle time + 99 x cycle time
= 5 x 4 ns + 99 x 4 ns
= 20 ns + 396 ns
= 416 ns
Cycle Time in Design D2-
Cycle time
= Delay due to a stage + Delay due to its register
= 2 ns + 0
= 2 ns
Execution Time For 100 Instructions in Design D2-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99 instructions
= 1 x 8 clock cycles + 99 x 1 clock cycle
= 8 x cycle time + 99 x cycle time
= 8 x 2 ns + 99 x 2 ns
= 16 ns + 198 ns
= 214 ns
Time Saved-
Time saved using design D2 over design D1
= 416 ns - 214 ns
= 202 ns
Thus, Option (B) is correct.
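The two designs can also be compared with a small helper function (a sketch; the function name is ours):

def pipeline_time_ns(stage_delays, n_instructions, register_delay=0):
    # Total time = (number of stages + n - 1) clock cycles.
    cycle = max(stage_delays) + register_delay
    return (len(stage_delays) + n_instructions - 1) * cycle

d1 = pipeline_time_ns([3, 2, 4, 2, 3], 100)   # 416 ns
d2 = pipeline_time_ns([2] * 8, 100)           # 214 ns
print(d1, d2, d1 - d2)                         # 416 214 202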
Problem-07:
Consider an instruction pipeline with four stages (S1, S2, S3 and S4) each with combinational
circuit only. The pipeline registers are required between each stage and at the end of the last stage.
Delays for the stages are 5 ns, 6 ns, 11 ns and 8 ns respectively, and each pipeline register has a delay of 1 ns.
What is the approximate speed up of the pipeline in steady state under ideal conditions when
compared to the corresponding non-pipeline implementation?
1. 4.0
2. 2.5
3. 1.1
4. 3.0
Solution-
Non-Pipeline Execution Time-
Non-pipeline execution time for 1 instruction
= 5 ns + 6 ns + 11 ns + 8 ns
= 30 ns
Cycle Time in Pipelined Processor-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 5, 6, 11, 8 } + 1 ns
= 11 ns + 1 ns
= 12 ns
Pipeline Execution Time-
Pipeline execution time
= 1 clock cycle
= 12 ns
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 30 ns / 12 ns
= 2.5
Thus, Option (B) is correct.
Problem-08:
An instruction pipeline has four stages S1, S2, S3 and S4. The number of clock cycles each instruction needs in each stage is given below-
Instruction S1 S2 S3 S4
I1 2 1 1 1
I2 1 3 2 2
I3 2 1 1 3
I4 1 2 2 2
What is the number of cycles needed to execute the following loop?
for(i=1 to 2) { I1; I2; I3; I4; }
1. 16
2. 23
3. 28
4. 30
Solution-
The phase-time diagram is-
From here, number of clock cycles required to execute the loop = 23 clock cycles.
Thus, Option (B) is correct.
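A short simulation reproduces this count, under the usual assumption that an instruction may enter a stage as soon as it has finished the previous stage and the preceding instruction has left that stage (the variable names are ours):

# Cycles each instruction needs in stages S1..S4
times = {'I1': [2, 1, 1, 1], 'I2': [1, 3, 2, 2], 'I3': [2, 1, 1, 3], 'I4': [1, 2, 2, 2]}
sequence = ['I1', 'I2', 'I3', 'I4'] * 2        # loop body executed twice

finish = [0, 0, 0, 0]                          # finish time of the previous instruction in each stage
for instr in sequence:
    prev_stage_done = 0
    for s in range(4):
        start = max(prev_stage_done, finish[s])   # wait for this stage to free up and for our previous stage
        finish[s] = start + times[instr][s]
        prev_stage_done = finish[s]
print(finish[3])                                # 23 clock cycles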
Problem-09:
Consider a pipelined processor with the following four stages-
IF : Instruction Fetch
ID : Instruction Decode and Operand Fetch
EX : Execute
WB : Write Back
The IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?
ADD R2, R1, R0    R2 ← R0 + R1
MUL R4, R3, R2    R4 ← R3 x R2
SUB R6, R5, R4    R6 ← R5 - R4
1. 7
2. 8
3. 10
4. 14
Solution-
The phase-time diagram is-
From here, number of clock cycles required to execute the instructions = 8 clock cycles.
Thus, Option (B) is correct.
Problem-10:
Consider the following processors. Assume that the pipeline registers have zero latency.
P1 : 4 stage pipeline with stage latencies 1 ns, 2 ns, 2 ns, 1 ns
P2 : 4 stage pipeline with stage latencies 1 ns, 1.5 ns, 1.5 ns, 1.5 ns
P3 : 5 stage pipeline with stage latencies 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns
P4 : 5 stage pipeline with stage latencies 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns
Which processor has the highest peak clock frequency?
1. P1
2. P2
3. P3
4. P4
Solution-
It is given that pipeline registers have zero latency. Thus,
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Maximum delay due to any stage
For Processor P1:
Cycle time
= Max { 1 ns, 2 ns, 2 ns, 1 ns }
= 2 ns
Clock frequency
= 1 / Cycle time
= 1 / 2 ns
= 0.5 gigahertz
For Processor P2:
Cycle time
= Max { 1 ns, 1.5 ns, 1.5 ns, 1.5 ns }
= 1.5 ns
Clock frequency
= 1 / Cycle time
= 1 / 1.5 ns
= 0.67 gigahertz
For Processor P3:
Cycle time
= Max { 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns }
= 1 ns
Clock frequency
= 1 / Cycle time
= 1 / 1 ns
= 1 gigahertz
For Processor P4:
Cycle time
= Max { 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns }
= 1.1 ns
Clock frequency
= 1 / Cycle time
= 1 / 1.1 ns
= 0.91 gigahertz
Since processor P3 has the highest peak clock frequency (1 gigahertz), Option (C) is correct.
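The four peak frequencies can be checked quickly (our own sketch; stage latencies are in ns, so the reciprocal of the cycle time is in GHz):

designs = {
    'P1': [1, 2, 2, 1],
    'P2': [1, 1.5, 1.5, 1.5],
    'P3': [0.5, 1, 1, 0.6, 1],
    'P4': [0.5, 0.5, 1, 1, 1.1],
}
for name, stages in designs.items():
    print(name, round(1 / max(stages), 2))   # P1 0.5, P2 0.67, P3 1.0, P4 0.91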
Problem-11:
Consider a 3 GHz (gigahertz) processor with a three-stage pipeline and stage latencies T1, T2 and
T3 such that T1 = 3T2/4 = 2T3. If the longest pipeline stage is split into two pipeline stages of equal
latency, the new frequency is ____ GHz, ignoring delays in the pipeline registers.
Solution-
Let ‘t’ be the common multiple of each ratio, then-
• T1 = t
• T2 = 4t / 3
• T3 = t / 2
Pipeline Cycle Time-
Pipeline cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { t, 4t/3, t/2 } + 0
= 4t/3
Frequency Of Pipeline-
Frequency
= 1 / Pipeline cycle time
= 1 / (4t / 3)
= 3 / 4t
Given frequency = 3 GHz. So,
3 / 4t = 3 GHz
4t = 1 ns
t = 0.25 ns
Stage Latencies Of Pipeline-
Substituting t = 0.25 ns, the stage latencies are-
• T1 = t = 0.25 ns
• T2 = 4t/3 = 1/3 ns ≈ 0.33 ns
• T3 = t/2 = 0.125 ns
New Frequency After Splitting The Longest Stage-
The longest stage T2 is split into two stages of equal latency, 1/6 ns each. The new stage latencies are 0.25 ns, 1/6 ns, 1/6 ns and 0.125 ns.
New cycle time
= Maximum delay due to any stage
= 0.25 ns
New frequency
= 1 / 0.25 ns
= 4 GHz
Thus, the new frequency is 4 GHz.
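The same result follows from a few lines of arithmetic (a sketch with our own variable names):

t = 0.25                                   # ns, since 4t/3 must equal 1/3 ns at 3 GHz
stages = [t, 4 * t / 3, t / 2]             # T1, T2, T3
longest = max(stages)
stages.remove(longest)
stages += [longest / 2, longest / 2]       # split the longest stage into two equal stages
print(1 / max(stages))                     # 4.0 GHz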
QUESTION BANK
PART A
1. List the two steps involved in executing an instruction.
2. Define hazard and its types.
PART B
1. Describe in detail about the functional units and the basic implementation scheme of
MIPS with suitable diagram.
2. Explain how the instruction pipeline works. What are the various situations where an
instruction pipeline can stall?
3. Examine the relationships between pipeline execution and addressing modes.
4. (i) Describe the role of cache memory in a pipelined system. (ii) Discuss the influence of
pipelining on instruction set design.
5. What is an instruction hazard? Explain the methods for dealing with instruction hazards.
6. Describe the data path and control considerations for pipelining.
7. Describe the techniques for handling data and instruction hazards in pipelining.
8. Briefly explain about exceptions.
9. Briefly explain about creating a single data path with a neat diagram.
10. Explain the operations of the data path with necessary diagrams.
11. Compare the performance of single-cycle versus pipelined execution.