UNIT II
PIPELINING AND PARALLEL PROCESSING
CONTENTS
Technical terms
TECHNICAL TERMS
1. Pipeline
Literal meaning: The processing of each task in parallel.
Technical meaning: The process of storing and queuing tasks and instructions that are executed simultaneously by the processor in an organized way.
Reference: https://www.techopedia.com/definition/13051/pipelining

2. Instruction pipelining
Literal meaning: A method of organizing the instructions.
Technical meaning: A technique of organising the instructions for execution in such a way that the execution of the current instruction is overlapped by the execution of its subsequent instruction.
Reference: https://binaryterms.com/instruction-pipelining.html

3. Data Dependencies
Literal meaning: A situation in which an instruction is dependent on a previous instruction.
Technical meaning: A situation in which an instruction is dependent on a result from a sequentially previous instruction before it can complete its execution.
Reference: https://www.encyclopedia.com/computing/dictionaries-thesauruses-pictures-and-press-releases/data-dependency

4. Memory Delay
Literal meaning: A fundamental measure of the speed of memory.
Technical meaning: Memory latency is the time between initiating a request for a byte or word in memory until it is retrieved by the processor.
Reference: https://en.wikipedia.org/wiki/Memory_latency

5. Branch Delays
Literal meaning: The instructions appear to execute in an illogical or incorrect order.
Technical meaning: The branch delay slot is a side effect of pipelined architectures due to the branch hazard.
Reference: https://cpentalk.com/887/what-is-a-delayed-branch

6. Superscalar Operation
Literal meaning: The CPU manages multiple instructions for execution.
Technical meaning: A CPU that implements a form of parallelism called instruction-level parallelism within a single processor.
Reference: https://en.wikipedia.org/wiki/Superscalar_processor

7. Multi-core processor
Literal meaning: A multicore processor contains multiple core processing units.
Technical meaning: A multi-core processor is a computer processor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions.
Reference: https://en.wikipedia.org/wiki/Multi-core_processor

8. SISD
Literal meaning: Instructions are executed sequentially.
Technical meaning: SISD stands for 'Single Instruction and Single Data Stream'. It represents the organization of a single computer containing a control unit, a processor unit, and a memory unit.
Reference: https://www.javatpoint.com/sisd

9. SIMD
Literal meaning: All processors receive the same instruction from the control unit but operate on different items of data.
Technical meaning: SIMD stands for 'Single Instruction and Multiple Data Stream'. It represents an organization that includes many processing units under the supervision of a common control unit.
Reference: https://www.javatpoint.com/simd

10. MISD
Literal meaning: In MISD, multiple processing units operate on one single data stream.
Technical meaning: MISD stands for 'Multiple Instruction and Single Data Stream'. The MISD structure is only of theoretical interest since no practical system has been constructed using this organization. Each processing unit operates on the data independently via a separate instruction stream.
Reference: https://www.javatpoint.com/misd

11. MIMD
Literal meaning: In MIMD, each processor has a separate program and an instruction stream is generated from each program.
Technical meaning: MIMD stands for 'Multiple Instruction and Multiple Data Stream'. In this organization, all processors in a parallel computer can execute different instructions and operate on various data at the same time.
Reference: https://www.javatpoint.com/mimd
A pipeline system is like a modern-day assembly line in a factory. For example, in a car manufacturing plant, huge assembly lines are set up and at each point a robotic arm performs a certain task, after which the car moves on to the next arm.
In pipelining, the instruction cycle is divided into segments that different instructions occupy simultaneously. The pipeline will be more efficient if the instruction cycle is divided into segments of equal duration.
Pipeline Conflicts
There are some factors that cause the pipeline to deviate its normal performance. Some of
these factors are given below:
1. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction processing, where different instructions have different operand requirements and thus different processing times.
2. Data Hazards
When several instructions are in partial execution, a problem arises if they reference the same data. We must ensure that the next instruction does not attempt to access the data before the current instruction has finished with it, because this would lead to incorrect results.
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the present instruction is a conditional branch, its result determines which instruction comes next, so the next instruction may not be known until the current one is processed.
4. Interrupts
Interrupts insert unwanted instructions into the instruction stream and affect the execution of instructions.
5. Data Dependency
It arises when an instruction depends upon the result of a previous instruction but this result is
not yet available.
Advantages of Pipelining
• The cycle time of the processor is reduced.
• It increases the throughput of the system
• It makes the system reliable.
Disadvantages of Pipelining
• The design of pipelined processor is complex and costly to manufacture.
In most computer programs, the result from one instruction is used as an operand by another instruction. When such instructions are executed in a pipeline, a breakdown occurs because the result of the first instruction is not yet available when instruction two starts collecting its operands. So, instruction two must stall till instruction one is executed and the result is generated. This type of hazard is called a read-after-write (RAW) pipelining hazard.
Pipelining
To improve the performance of a CPU we have two options:
1) Improve the hardware by introducing faster circuits.
2) Arrange the hardware such that more than one operation can be performed at the same
time.
Since there is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the second option.
Pipelining: Pipelining is a process of arranging the hardware elements of the CPU such that its overall performance is increased. Simultaneous execution of more than one instruction takes place in a pipelined processor.
Let us see a real life example that works on the concept of pipelined operation.
Consider a water bottle packaging plant. Let there be 3 stages that a bottle should pass through,
Inserting the bottle(I), Filling water in the bottle(F), and Sealing the bottle(S). Let us consider
these stages as stage 1, stage 2 and stage 3 respectively. Let each stage take 1 minute to
complete its operation.
Now, in a non pipelined operation, a bottle is first inserted in the plant, after 1 minute it is
moved to stage 2 where water is filled. Now, in stage 1 nothing is happening. Similarly, when
the bottle moves to stage 3, both stage 1 and stage 2 are idle. But in pipelined operation, when
the bottle is in stage 2, another bottle can be loaded at stage 1. Similarly, when the bottle is in
stage 3, there can be one bottle each in stage 1 and stage 2. So, after each minute, we get a new
bottle at the end of stage 3. Hence, the average time taken to manufacture 1 bottle is :
Without pipelining = 9/3 minutes = 3 minutes per bottle
I F S | | | | | |
| | | I F S | | |
| | | | | | I F S   (9 minutes)
With pipelining = 5/3 minutes = 1.67 minutes per bottle
I F S | |
| I F S |
| | I F S   (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
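As a quick illustration of the arithmetic above, the following Python sketch (not part of the original notes; the function names and the equal-stage-time assumption are ours) computes the total completion time for n tasks on a k-stage pipeline and reproduces the 9-minute versus 5-minute figures for the bottle plant.

def non_pipelined_time(n_tasks, n_stages, stage_time=1):
    # Each task passes through all stages before the next task starts.
    return n_tasks * n_stages * stage_time

def pipelined_time(n_tasks, n_stages, stage_time=1):
    # The first task fills the pipeline; every later task finishes one
    # stage_time after the previous one.
    return (n_stages + (n_tasks - 1)) * stage_time

# Bottle plant from the text: 3 stages (I, F, S), 3 bottles, 1 minute per stage.
print(non_pipelined_time(3, 3))   # 9 minutes, i.e. 3 minutes per bottle
print(pipelined_time(3, 3))       # 5 minutes, i.e. about 1.67 minutes per bottle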
Non-overlapped execution:
Stage/Cycle   1    2    3    4    5    6    7    8
S1            I1                  I2
S2                 I1                  I2
S3                      I1                  I2
S4                           I1                  I2
Overlapped execution:
Stage/Cycle   1    2    3    4    5
S1            I1   I2
S2                 I1   I2
S3                      I1   I2
S4                           I1   I2
Note: The cycles per instruction (CPI) value of an ideal pipelined processor is 1
Ideal CPI of the pipelined processor is ‘1’. But due to stalls, it becomes greater than ‘1’.
S = (CPI_non-pipeline × Cycle Time_non-pipeline) / [(1 + Number of stalls per instruction) × Cycle Time_pipeline]
As Cycle Time_non-pipeline = Cycle Time_pipeline,
Speed Up (S) = CPI_non-pipeline / (1 + Number of stalls per instruction)
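The speedup formula can be applied directly. The short Python sketch below (our own illustration, with assumed example numbers) evaluates S = CPI_non-pipeline / (1 + stalls per instruction) under the equal-cycle-time assumption made above.

def pipeline_speedup(cpi_non_pipeline, stalls_per_instruction):
    # Valid when the pipelined and non-pipelined cycle times are equal.
    return cpi_non_pipeline / (1 + stalls_per_instruction)

# Example: a 4-stage design (CPI of 4 without pipelining) that suffers
# 0.25 stall cycles per instruction on average.
print(pipeline_speedup(4, 0.25))   # 3.2, instead of the ideal speedup of 4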
Pipeline hazards are classified into three types:
• Structural Hazards
• Data Hazards
• Control Hazards
There are many specific solutions to dependencies. The simplest is introducing a bubble, which stalls the pipeline and reduces the throughput. The bubble makes the next instruction wait until the earlier instruction is done.
2.3.1 Structural Hazards
Structural hazards arise due to hardware resource conflicts amongst the instructions in the pipeline. A resource here could be the memory, a register in the GPR file, or the ALU. A resource conflict is said to occur when more than one instruction in the pipe requires access to the same resource in the same clock cycle. It is a situation in which the hardware cannot handle all possible combinations of instructions in an overlapped pipelined execution.
Fig. 2.6 Structural Hazard solution using "stall" in a 4 stage pipeline design
This delay percolates to all the subsequent instructions too. Thus, while the ideal 4-stage system would have taken 8 timing states to execute 5 instructions, due to the structural dependency it has now taken 11 timing states. That is not all: by now you would have guessed that this hazard is likely to happen at every 4th instruction, which is not at all acceptable for a heavy load on the CPU. Is there a better way? Yes!
A better solution would be to increase the structural resources in the system using one of the
few choices below:
• The pipeline may be extended to 5 or more stages, with the functionality of the stages suitably redefined and the clock frequency adjusted. This eliminates the issue of the hazard at every 4th instruction in the 4-stage pipeline.
• The memory may be physically separated into instruction memory and data memory. A better choice would be to provide cache memory in the CPU rather than dealing with main memory. IF uses the instruction memory and result writing uses the data memory, so they become two separate resources and the dependency is avoided.
• It is possible to have Multiple levels of Cache in CPU too.
• The ALU can also be a source of resource dependency. The ALU may be required in the IE machine cycle by one instruction while another instruction may require the ALU in the IF stage to calculate an effective address based on its addressing mode. The solution would be either stalling or providing an exclusive ALU for address calculation.
• Register files are used in place of GPRs. Register files have multiport access with exclusive read and write ports. This enables simultaneous access to one write register and one read register.
The last two methods are implemented in modern CPUs. Beyond these, if dependency arises,
Stalling is the only option. Keep in mind that increasing resources involves increased cost. So
the trade-off is a designer’s choice.
2.3.2 Data Hazards
Data hazards occur when an instruction's execution depends on the results of some previous
instruction that is still being processed in the pipeline. Consider the example below.
Solution 1: Introduce three bubbles at the SUB instruction's IF stage. This will allow SUB-ID to take place at t6. Subsequently, all the following instructions are also delayed in the pipe.
Solution 2: Data forwarding. Forwarding is passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another. The purpose is to make the result available to the next instruction as early as possible.
In this case, the ADD result is available at the output of the ALU in the ADD IE stage, i.e., at the end of t3. If this can be controlled and forwarded by the control unit to the SUB IE stage at t4, before it is written to the output register R3, then the pipeline goes ahead without any stalling. This requires extra logic to identify this data hazard and act upon it. Note that although the operand fetch normally happens in the ID stage, the operand is used only in the IE stage; hence forwarding is supplied as an input to the IE stage. Similar forwarding can be done for the OR and AND instructions too.
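To make the forwarding decision concrete, here is a small Python sketch (illustrative only; the instruction tuple format and the stall counts are assumptions based on the 4-stage example above) that detects a read-after-write hazard between two adjacent instructions and reports how many bubbles are needed with and without operand forwarding.

def raw_hazard(producer, consumer):
    # Instructions are modelled as (opcode, dest, src1, src2).
    _, dest, _, _ = producer
    _, _, src1, src2 = consumer
    return dest in (src1, src2)

def stall_cycles(producer, consumer, forwarding):
    if not raw_hazard(producer, consumer):
        return 0
    # Without forwarding the consumer waits for write-back (three bubbles in
    # the example above); with ALU-output to ALU-input forwarding it needs none.
    return 0 if forwarding else 3

add_instr = ("ADD", "R3", "R1", "R2")   # R3 <- R1 + R2
sub_instr = ("SUB", "R5", "R3", "R4")   # reads R3 produced by ADD: RAW hazard
print(stall_cycles(add_instr, sub_instr, forwarding=False))  # 3
print(stall_cycles(add_instr, sub_instr, forwarding=True))   # 0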
Thus a Conditional hazard occurs when the decision to execute an instruction is based
on the result of another instruction like a conditional branch, which checks the condition’s
resultant value.
The branch and jump instructions decide the program flow by loading the appropriate
location in the Program Counter(PC). The PC has the value of the next instruction to be fetched
and executed by CPU. Consider the following sequence of instructions.
1. Stall the pipeline as soon as any kind of branch instruction is decoded, and simply do not allow any more instruction fetches. As always, stalling reduces throughput. Statistics say that in a program at least 30% of the instructions are branches, so with stalling the pipeline essentially operates at 50% capacity.
2. Prediction – Imagine a for or while loop being executed 100 times. We know that for 100 iterations the program flows on without the branch condition being met; only on the 101st time does the program come out of the loop. So, it is wiser to allow the pipeline to
proceed and undo/flush the work when the branch condition is met. This does not affect the throughput of the pipeline as much as stalling does.
3. Dynamic Branch Prediction – A history record is maintained with the help of a Branch Target Buffer (BTB). The BTB is a kind of cache with a set of entries, each holding the PC address of a branch instruction and the corresponding effective branch target address. An entry is maintained for every branch instruction encountered. So whenever a conditional branch instruction is encountered, a lookup for the matching branch instruction address is done in the BTB. If it hits, the corresponding target branch address is used for fetching the next instruction. This is called dynamic branch prediction.
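A BTB can be pictured as a small lookup table keyed by the branch instruction's PC. The Python sketch below is only a rough model (the 4-byte fall-through increment, the class name and the update policy are our assumptions, not part of the notes).

class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                      # branch PC -> predicted target

    def predict(self, pc):
        # Hit: fetch from the stored target. Miss: predict fall-through.
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.entries[pc] = target          # remember the branch target
        else:
            self.entries.pop(pc, None)         # drop branches that fell through

btb = BranchTargetBuffer()
print(hex(btb.predict(0x100)))                 # miss: 0x104 (fall-through)
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.predict(0x100)))                 # hit: 0x200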
• No Dependence
• Dependence requiring Stall
• Dependence solution by Forwarding
• Dependence with access in order
• Out of Order Execution
• Branch Prediction Table and more
Knowledge of pipeline hazards is important for processor designers and compiler writers.
ADDRESSING MODES
Addressing modes should provide the means for accessing a variety of data structures simply
and efficiently. Useful addressing modes include index, indirect, autoincrement, and
autodecrement. Many processors provide various combinations of these modes to increase the
flexibility of their instruction sets. Complex addressing modes, such as those involving double
indexing, are often encountered.
In choosing the addressing modes to be implemented in a pipelined processor, we must
consider the effect of each addressing mode on instruction flow in the pipeline. Two important
considerations in this regard are the side effects of modes such as autoincrement and
autodecrement and the extent to which complex addressing modes cause the pipeline to stall.
Another important factor is whether a given mode is likely to be used by compilers.
To compare various approaches, we assume a simple model for accessing operands in the
memory. The load instruction Load X(R1),R2 takes five cycles to complete execution, as
indicated in Figure 8.5. However, the instruction
Load (R1),R2
can be organized to fit a four-stage pipeline because no address computation is required. Access to
memory can take place in stage E. A more complex addressing mode may require several
accesses to the memory to reach the named operand. For example, the instruction
Load (X(R1)),R2
may be executed as shown in Figure 8.16a, assuming that the index offset, X, is given in the instruction word. After computing the address in cycle 3, the processor needs to access memory twice: first to read location X + [R1] in clock cycle 4, and then to read location [X + [R1]] in cycle 5. If R2 is a source operand in the next instruction, that instruction would be stalled for three cycles, which can be reduced to two cycles with operand forwarding, as shown.
[Figure 2.10: Equivalent operations using complex and simple addressing modes. The timing diagram spans clock cycles 1 to 7 and shows, stage by stage (F, D, E, W), an Add step that computes X + [R1], a Load step that reads [X + [R1]], the next instruction proceeding through F D E W, and a Forward path that passes the computed value to the instruction that needs it.]
To implement the same Load operation using only simple addressing modes requires
several instructions. For example, on a computer that allows three operand addresses, we can
use
Add #X,R1,R2
Load (R2),R2
Load (R2),R2
The Add instruction performs the operation R2 ← X + [R1]. The two Load instructions fetch the address and then the operand from the memory. This sequence of instructions takes exactly the same number of clock cycles as the original, single Load instruction, as shown in Figure 8.16b.
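To see that the two forms really compute the same thing, the Python sketch below models registers and a word-addressed memory as dictionaries (all addresses and values are invented for illustration) and evaluates both the complex-mode Load and the three-instruction sequence.

X = 20
regs = {"R1": 1000, "R2": 0}
mem = {1020: 5000, 5000: 42}   # mem[X + [R1]] holds an address; that address holds the operand

# Complex mode: Load (X(R1)),R2 -> one address computation, two memory reads.
ea = X + regs["R1"]            # address computation
regs["R2"] = mem[mem[ea]]      # read the pointer, then read the operand
print(regs["R2"])              # 42

# Simple-mode sequence: Add #X,R1,R2 ; Load (R2),R2 ; Load (R2),R2
regs["R2"] = X + regs["R1"]    # Add #X,R1,R2  -> R2 = X + [R1]
regs["R2"] = mem[regs["R2"]]   # Load (R2),R2  -> fetch the address
regs["R2"] = mem[regs["R2"]]   # Load (R2),R2  -> fetch the operand
print(regs["R2"])              # 42 again: same result, same number of cycles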
This example indicates that, in a pipelined processor, complex addressing modes that involve
several accesses to the memory do not necessarily lead to faster execution. The main advantage of
such modes is that they reduce the number of instructions needed to perform a given task and
thereby reduce the program space needed in the main memory. Their main disadvantage is that
their long execution times cause the pipeline to stall, thus reducing its effectiveness. They require
more complex hardware to decode and execute them. Also, they are not convenient for compilers
to work with.
The instruction sets of modern processors are designed to take maximum advantage of pipelined
hardware. Because complex addressing modes are not suitable for pipelined execution, they should be
avoided. The addressing modes used in modern processors often have the following features:
• Access to an operand does not require more than one access to the memory.
• Only load and store instructions access memory operands.
• The addressing modes used do not have side effects.
Three basic addressing modes that have these features are register, register indirect, and index. The
first two require no address computation. In the index mode, the address can be computed in one
cycle, whether the index value is given in the instruction or in a register. Memory is accessed in
the following cycle. None of these modes has any side effects, with one possible exception. Some
architectures, such as ARM, allow the address computed in the index mode to be written back into
the index register. This is a side effect that would not be allowed under the guidelines above. Note
also that relative addressing can be used; this is a special case of indexed addressing in which the
program counter is used as the index register.
The three features just listed were first emphasized as part of the concept of RISC processors.
The SPARC processor architecture, which adheres to these guidelines, is presented in Section 8.7.
CONDITION CODES
In many processors, such as those described in Chapter 3, the condition code flags are stored in
the processor status register. They are either set or cleared by many instructions, so that they can
be tested by subsequent conditional branch instructions to change the flow of program execution. An
optimizing compiler for a pipelined processor attempts to reorder instructions to avoid stalling the
pipeline when branches or data dependencies between successive instructions occur. In doing so,
the compiler must ensure that reordering does not cause a change in the outcome of a computation.
The dependency introduced by the condition-code flags reduces the flexibility available for the
compiler to reorder instructions.
Consider the sequence of instructions in Figure 8.17a, and assume that the execution of the Compare and Branch=0 instructions proceeds as in Figure 8.14. The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction. The execution time of the Branch instruction can be reduced by interchanging the Add and Compare instructions, as shown in Figure 8.17b.
These observations lead to two important conclusions about the way condition codes should
be handled. First, to provide flexibility in reordering instructions, the condition-code flags should
be affected by as few instructions as possible. Second, the compiler should be able to specify in
which instructions of a program the condition codes are affected and in which they are not. An
instruction set designed with pipelining in mind usually provides the desired flexibility. Figure
8.17b shows the instructions reordered assuming that the condition code flags are affected only
when this is explicitly stated as part of the instruction OP code. The SPARC and ARM architectures provide this flexibility.
[Figure: Datapath organized for pipelined execution, showing the register file connected to Bus B and Bus C, the PC with its incrementer, the control signal pipeline, the ALU, the instruction queue and instruction decoder, the IMAR used for instruction cache access, and the memory address path used for data access.]
1. There are separate instruction and data caches that use separate address and data
connections to the processor. This requires two versions of the MAR register, IMAR for accessing
the instruction cache and DMAR for accessing the data cache.
2. The PC is connected directly to the IMAR, so that the contents of the PC can
be transferred to IMAR at the same time that an independent ALU operation is taking place.
The basic performance equation gives the program execution time as T = (N × S) / R, where N is the number of instructions executed, S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate. This simple model assumes that instructions are executed one after the other, with no overlap. A useful performance indicator is the instruction throughput, which is the number of instructions executed per second. For sequential execution, the throughput Ps is given by
Ps = R/S
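For a concrete feel for these numbers, the following Python sketch (the clock rate and CPI values are assumed, not taken from the text) evaluates Ps = R/S for sequential execution and for the ideal pipelined case where S approaches 1.

def throughput(clock_rate_hz, cycles_per_instruction):
    # Ps = R / S, in instructions per second.
    return clock_rate_hz / cycles_per_instruction

clock = 500e6                          # assume a 500 MHz clock
print(throughput(clock, 4))            # 1.25e8 instructions/s with S = 4
print(throughput(clock, 1))            # 5.0e8 instructions/s if pipelining brings S down to 1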
In this section, we examine the extent to which pipelining increases instruction throughput.
However, we should reemphasize the point made in Chapter 1 regarding performance measures.
The only real measure of performance is the total execution time of a program. Higher instruction
throughput will not necessarily lead to higher performance if a larger number of instructions is
needed to implement the desired task. For this reason, the SPEC ratings described in Chapter 1 provide
a much better indicator when comparing two processors.
Figure 8.2 shows that a four-stage pipeline may increase instruction throughput by a factor of
four. In general, an n-stage pipeline has the potential to increase throughput n times. Thus, it would
appear that the higher the value of n, the larger the performance gain. This leads to two questions:
• How much of this potential increase in instruction throughput can be realized in practice?
• What is a good value for n?
Any time a pipeline is stalled, the instruction throughput is reduced. Hence, the performance of a pipeline is highly influenced by factors such as branch and cache miss penalties. First, we discuss the effect of these factors on performance, and then we return to the question of how many pipeline stages should be used.
PERFORMANCE EVALUATION
Pipelining is a process of arranging the hardware elements of the CPU such that its overall performance is increased. Simultaneous execution of more than one instruction takes place in a pipelined processor. Thus, pipelined operation increases the efficiency of a system.
In computing, computer performance is the amount of useful work accomplished by a computer
system. Outside of specific contexts, computer performance is estimated in terms of
accuracy, efficiency and speed of executing computer program instructions. When it comes to
high computer performance, one or more of the following factors might be involved:
• Short response time for a given piece of work.
• High throughput (rate of processing work).
• Low utilization of computing resource(s).
o Fast (or highly compact) data compression and decompression.
• High availability of the computing system or application.
• High bandwidth.
• Short data transmission time.
Technical and non-technical definitions
The performance of any computer system can be evaluated in measurable, technical terms, using
one or more of the metrics listed above. This way the performance can be
• Compared relative to other systems or the same system before/after changes
• In absolute terms, e.g. for fulfilling a contractual obligation
Whilst the above definition relates to a scientific, technical approach, the following definition
given by Arnold Allen would be useful for a non-technical audience:
The word performance in computer performance means the same thing that performance means
in other contexts, that is, it means "How well is the computer doing the work it is supposed to
do?"[1]
As an aspect of software quality
Computer software performance, particularly software application response time, is an aspect
of software quality that is important in human–computer interactions.
Performance engineering
Performance engineering within systems engineering encompasses the set of roles, skills,
activities, practices, tools, and deliverables applied at every phase of the systems development life
cycle which ensures that a solution will be designed, implemented, and operationally supported to
meet the performance requirements defined for the solution.
Performance engineering continuously deals with trade-offs between types of performance.
Occasionally a CPU designer can find a way to make a CPU with better overall performance by
improving one of the aspects of performance, presented below, without sacrificing the CPU's
performance in other areas. For example, building the CPU out of better, faster transistors.
However, sometimes pushing one type of performance to an extreme leads to a CPU with worse
overall performance, because other important aspects were sacrificed to get one impressive-
looking number, for example, the chip's clock rate (see the megahertz myth).
Application performance engineering
Application Performance Engineering (APE) is a specific methodology within performance
engineering designed to meet the challenges associated with application performance in
increasingly distributed mobile, cloud and terrestrial IT environments. It includes the roles, skills,
activities, practices, tools and deliverables applied at every phase of the application lifecycle that
ensure an application will be designed, implemented and operationally supported to meet non-
functional performance requirements.
Aspects of performance
Computer performance metrics (things to measure) include availability, response time, channel
capacity, latency, completion time, service time, bandwidth, throughput, relative
efficiency, scalability, performance per watt, compression ratio, instruction path length and speed
up. CPU benchmarks are available.[2]
Availability
Availability of a system is typically measured as a factor of its reliability - as reliability increases,
so does availability (that is, less downtime). Availability of a system may also be increased by the
strategy of focusing on increasing testability and maintainability and not on reliability. Improving
maintainability is generally easier than reliability. Maintainability estimates (Repair rates) are also
generally more accurate. However, because the uncertainties in the reliability estimates are in most
cases very large, it is likely to dominate the availability (prediction uncertainty) problem, even
while maintainability levels are very high.
Response time
Response time is the total amount of time it takes to respond to a request for service. In computing,
that service can be any unit of work from a simple disk IO to loading a complex web page. The
response time is the sum of three numbers:[3]
• Service time - How long it takes to do the work requested.
• Wait time - How long the request has to wait for requests queued ahead of it before it gets to run.
• Transmission time – How long it takes to move the request to the computer doing the work and
the response back to the requestor.
Processing speed
Most consumers pick a computer architecture (normally Intel IA32 architecture) to be able to run
a large base of pre-existing, pre-compiled software. Being relatively uninformed on computer
benchmarks, some of them pick a particular CPU based on operating frequency (see megahertz
myth).
Some system designers building parallel computers pick CPUs based on the speed per dollar.
Channel capacity
Channel capacity is the tightest upper bound on the rate of information that can be reliably
transmitted over a communications channel. By the noisy-channel coding theorem, the channel
capacity of a given channel is the limiting information rate (in units of information per unit time)
that can be achieved with arbitrarily small error probability.[4][5]
Latency
Latency is a time delay between the cause and the effect of some physical change in the system
being observed. Latency is a result of the limited velocity with which any physical interaction can
take place. This velocity is always lower than or equal to the speed of light. Therefore, every physical
system that has spatial dimensions different from zero will experience some sort of latency.
System designers building real-time computing systems want to guarantee worst-case response.
That is easier to do when the CPU has low interrupt latency and when it has deterministic response.
Bandwidth
In computer networking, bandwidth is a measurement of bit-rate of available or consumed data
communication resources, expressed in bits per second or multiples of it (bit/s, kbit/s, Mbit/s,
Gbit/s, etc.).
Bandwidth sometimes defines the net bit rate (aka. peak bit rate, information rate, or physical layer
useful bit rate), channel capacity, or the maximum throughput of a logical or physical
communication path in a digital communication system.
Throughput
In general terms, throughput is the rate of production or the rate at which something can be
processed.
In communication networks, throughput is essentially synonymous to digital bandwidth
consumption. In wireless networks or cellular communication networks, the system spectral
efficiency in bit/s/Hz/area unit, bit/s/Hz/site or bit/s/Hz/cell, is the maximum system throughput
(aggregate throughput) divided by the analog bandwidth and some measure of the system coverage
area.
In integrated circuits, a block in a data flow diagram often has a single input and a single output, and operates on discrete packets of information. Examples of such blocks are FFT modules
or binary multipliers. Because the units of throughput are the reciprocal of the unit for propagation
delay, which is 'seconds per message' or 'seconds per output', throughput can be used to relate a
computational device performing a dedicated function such as an ASIC or embedded processor to
a communications channel, simplifying system analysis.
Relative efficiency
The efficiency of n stages in a pipeline is defined as the ratio of the actual speedup to the maximum possible speedup, which is n.
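In code form (a one-line sketch of our own, with example numbers), the efficiency is simply the actual speedup divided by the stage count n:

def pipeline_efficiency(actual_speedup, n_stages):
    # The maximum possible speedup of an n-stage pipeline is n.
    return actual_speedup / n_stages

print(pipeline_efficiency(3.2, 4))     # 0.8, i.e. 80% of the ideal 4x speedup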
Scalability
Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner, or its ability to be enlarged to accommodate that growth.
Power consumption
The amount of electric power used by the computer (power consumption). This becomes
especially important for systems with limited power sources such as solar, batteries, human power.
Performance per watt
System designers building parallel computers, such as Google's hardware, pick CPUs based on
their speed per watt of power, because the cost of powering the CPU outweighs the cost of the
CPU itself. For spaceflight computers, the processing speed per watt ratio is a more useful
performance criterion than raw processing speed.
Compression ratio
Compression is useful because it helps reduce resource usage, such as data storage space or
transmission capacity. Because compressed data must be decompressed to use, this extra
processing imposes computational or other costs through decompression; this situation is far from
being a free lunch. Data compression is subject to a space–time complexity trade-off.
Size and weight
This is an important performance feature of mobile systems, from the smart phones you keep in
your pocket to the portable embedded systems in a spacecraft.
Environmental impact
The effect of a computer or computers on the environment, during manufacturing and recycling as
well as during use. Measurements are taken with the objectives of reducing waste, reducing
hazardous materials, and minimizing a computer's ecological footprint.
Transistor count
The transistor count is the number of transistors on an integrated circuit (IC). Transistor count is
the most common measure of IC complexity.
Benchmarks
Because there are so many programs to test a CPU on all aspects of performance, benchmarks were
developed.
The most famous benchmarks are the SPECint and SPECfp benchmarks developed by Standard
Performance Evaluation Corporation and the Certification Mark benchmark developed by the
Embedded Microprocessor Benchmark Consortium EEMBC.
Software performance testing
In software engineering, performance testing is in general testing performed to determine how a
system performs in terms of responsiveness and stability under a particular workload. It can also
serve to investigate, measure, validate or verify other quality attributes of the system, such as
scalability, reliability and resource usage.
Performance testing is a subset of performance engineering, an emerging computer science
practice which strives to build performance into the implementation, design and architecture of a
system.
Advantages of Pipelining
• Instruction throughput increases.
• Increase in the number of pipeline stages increases the number of instructions executed
simultaneously.
• A faster ALU can be designed when pipelining is used.
• Pipelined CPUs work at higher clock frequencies than the RAM.
• Pipelining increases the overall performance of the CPU.
Disadvantages of Pipelining
• Designing of the pipelined processor is complex.
• Instruction latency increases in pipelined processors.
• The throughput of a pipelined processor is difficult to predict.
• The longer the pipeline, the worse the problem of hazards for branch instructions.
2.7 EXCEPTION HANDLING
Exceptions and interrupts are unexpected events that disrupt the normal flow of instruction
execution. An exception is an unexpected event from within the processor. An interrupt is an
unexpected event from outside the processor.
Exceptions or interrupts are unexpected events that require change in flow of control. Different
ISAs use the terms differently. Exceptions generally refer to events that arise within the CPU, for
example, undefined opcode, overflow, system call, etc. Interrupts point to requests coming from
an external I/O controller or device to the processor. Dealing with these events without sacrificing
performance is hard. For the rest of the discussion, we will not distinguish between the two. We
shall refer to them collectively as exceptions. Some examples of such exceptions are listed below:
• Breakpoint
• FP arithmetic anomaly
• Page fault
• Hardware malfunctions
• Power failure
• Synchronous vs Asynchronous
• Some exceptions may be synchronous, whereas others may be asynchronous. If the same
exception occurs in the same place with the same data and memory allocation, then it is a
synchronous exception.
They are more difficult to handle.
• Devices external to the CPU and memory cause asynchronous exceptions. They can be handled after the current instruction and are hence easier to deal with than synchronous exceptions.
• Some exceptions may be user requested and not automatic. Such exceptions are predictable
and can be handled after the current instruction.
• Coerced exceptions are generally raised by hardware and not under the control of the user
program. They are harder to handle.
• Exceptions can be maskable or unmaskable. They can be masked or unmasked by a user task.
This decides whether the hardware responds to the exception or not. You may have instructions
that enable or disable exceptions.
• Exceptions may have to be handled within the instruction or between instructions. Exceptions handled within an instruction are normally synchronous and are harder, since the instruction has to be stopped and restarted. Catastrophic exceptions like hardware malfunction will normally cause termination.
• Exceptions that can be handled between two instructions are easier to handle.
• Resume vs Terminate
• Some exceptions allow the program to be continued after the exception, while others lead to termination. Things are much more complicated if we have to restart.
• Exceptions that lead to termination are much easier, since we just have to terminate and need not restore the original status.
Therefore, exceptions that occur within instructions and exceptions that must be restartable are
much more difficult to handle.
An ideal processor is one where all constraints on ILP are removed. The only limits on ILP
in such a processor are those imposed by the actual data flows through either registers or memory.
1.Register renaming
—There are an infinite number of virtual registers available, and hence all WAW and WAR
hazards are avoided and an unbounded number of instructions can begin execution simultaneously.
2.Branch prediction
—Branch prediction is perfect. All conditional branches are predicted exactly.
3.Jump prediction
—All jumps (including jump register used for return and computed jumps) are perfectly predicted. When combined with perfect branch prediction, this is equivalent to having a processor with perfect speculation and an unbounded buffer of instructions available for execution.
4.Perfect memory address alias analysis
—All memory addresses are known exactly, so a load can be moved before a store provided that the addresses are not identical.
5.Perfect caches
—All memory accesses take 1 clock cycle. In practice, superscalar processors
will typically consume large amounts of ILP hiding cache misses, making these results highly
optimistic.
To measure the available parallelism, a set of programs was compiled and optimized with
the standard MIPS optimizing compilers. The programs were instrumented and executed to
produce a trace of the instruction and data references. Every instruction in the trace is then
scheduled as early as possible, limited only by the data dependences. Since a trace is used, perfect
branch prediction and perfect alias analysis are easy to do. With these mechanisms, instructions
may bescheduled much earlier than they would otherwise, moving across large numbers of
instructions on which they are not data dependent, including branches, since branches are perfectly
predicted.
The effects of various assumptions are given before looking at some ambitious but
realizable processors.
To build a processor that even comes close to perfect branch prediction and perfect alias
analysis requires extensive dynamic analysis, since static compile time schemes cannot be perfect.
Of course, most realistic dynamic schemes will not be perfect, but the use of dynamic schemes
will provide the ability to uncover parallelism that cannot be analysed by static compile time
analysis. Thus, a dynamic processor might be able to more closely match the amount of parallelism
uncovered by our ideal processor.
Our ideal processor assumes that branches can be perfectly predicted: The outcome of
any branch in the program is known before the first instruction is executed! Of course, no real
processor can ever achieve this.
We assume a separate predictor is used for jumps. Jump predictors are important
primarily with the most accurate branch predictors, since the branch frequency is higher and the
accuracy of the branch predictors dominates.
1.Perfect —All branches and jumps are perfectly predicted at the start of execution.
2.Tournament-based branch predictor —The prediction scheme uses a correlating 2-bit
predictor and a noncorrelating 2-bit predictor together with a selector, which chooses the
best predictor for each branch.
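One of the building blocks of such predictors is the 2-bit saturating counter. The Python sketch below (indexed by branch PC only, for simplicity; a real correlating predictor would also fold in global branch history) shows how the counter needs two consecutive mispredictions before it changes its prediction.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                     # branch PC -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # 2 or 3 means "predict taken"

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:      # a mostly taken branch at 0x40
    print(p.predict(0x40), outcome)
    p.update(0x40, outcome)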
Our ideal processor eliminates all name dependences among register references using an
infinite set of virtual registers. To date, the IBM Power5 has provided the largest numbers of virtual
registers: 88 additional floating-point and 88 additional integer registers, in addition to the 64
registers available in the base architecture. All 240 registers are shared by two threads when
executing in multithreading mode, and all are available to a single thread when in single-thread
mode.
Our optimal model assumes that it can perfectly analyze all memory dependences, as well
as eliminate all register name dependences. Of course, perfect alias analysis is not possible in
practice: The analysis cannot be perfect at compile time, and it requires a potentially unbounded
number of comparisons at run time (since the number of simultaneous memory references is
unconstrained).
1. Global/stack perfect—This model does perfect predictions for global and stack
references and assumes all heap references conflict. This model represents an idealized version of
the best compiler-based analysis schemes currently in production. Recent and ongoing research on
alias analysis for pointers should improve the handling of pointers to the heap in the future.
2. Inspection—This model examines the accesses to see if they can be determined not to
interfere at compile time. For example, if an access uses R10 as a base register with an offset of
20, then another access that uses R10 as a base register with an offset of 100 cannot interfere,
assuming R10 could not have changed. In addition, addresses based on registers that point to
different allocation areas (such as the global area and the stack area) are assumed never to alias.
This analysis is similar to that performed by many existing commercial compilers, though newer compilers can do better, at least for loop-oriented programs; a small sketch of this offset check is given after the list below.
3. None—All memory references are assumed to conflict. As you might expect, for the FORTRAN programs (where no heap references exist), there is no difference between perfect and global/stack perfect analysis.
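The offset check described in the inspection model can be sketched in a few lines of Python (the tuple format and the 4-byte access size are our assumptions for the example; the R10 offsets come from the text).

def may_interfere(access_a, access_b, access_size=4):
    base_a, offset_a = access_a
    base_b, offset_b = access_b
    if base_a != base_b:
        # Different base registers: assume a conflict unless they are known to
        # point to different allocation areas (e.g. the stack vs. the global area).
        return True
    # Same, unchanged base register: the accesses interfere only if the
    # accessed byte ranges overlap.
    return abs(offset_a - offset_b) < access_size

print(may_interfere(("R10", 20), ("R10", 100)))   # False: offsets 20 and 100 cannot overlap
print(may_interfere(("R10", 20), ("R10", 22)))    # True: the 4-byte words overlap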
2.9 MULTICORE PROCESSING
Multicore processing and virtualization bring clear benefits but also present challenges. This section concentrates on multicore processing: it defines its various types, lists current trends, examines the pros and cons, and briefly addresses safety and security ramifications.
Definitions
A multicore processor is a single integrated circuit (a.k.a., chip multiprocessor or CMP) that
contains multiple core processing units, more commonly known as cores. There are many different
multicore processor architectures, which vary in terms of
• Number of cores. Different multicore processors often have different numbers of cores.
For example, a quad-core processor has four cores. The number of cores is usually a power
of two.
• Number of core types.
o Homogeneous (symmetric) cores. All of the cores in a homogeneous multicore
processor are of the same type; typically the core processing units are general-
purpose central processing units that run a single multicore operating system.
o Heterogeneous (asymmetric) cores. Heterogeneous multicore processors have a
mix of core types that often run different operating systems and include graphics
processing units.
• Number and level of caches. Multicore processors vary in terms of their instruction and
data caches, which are relatively small and fast pools of local memory.
• How cores are interconnected. Multicore processors also vary in terms of
their bus architectures.
• Isolation. The amount, typically minimal, of in-chip support for the spatial and temporal
isolation of cores:
o Physical isolation ensures that different cores cannot access the same physical
hardware (e.g., memory locations such as caches and RAM).
o Temporal isolation ensures that the execution of software on one core does not
impact the temporal behavior of software running on another core.
The following figure notionally shows the architecture of a system in which 14 software
applications are allocated by a single host operating system to the cores in a homogeneous quad-
core processor. In this architecture, there are three levels of cache, which are progressively larger
but slower: L1 (consisting of an instruction cache and a data cache), L2, and L3. Note that the L1
and L2 caches are local to a single core, whereas L3 is shared among all four cores.
The following figure notionally shows how these 14 applications could be allocated to four
different operating systems, which in turn are allocated to four different cores, in a heterogeneous,
quad-core processor. From left to right, the cores include a general-purpose central processing unit
core running Windows; a graphical processing unit (GPU) core running graphics-intensive
applications on Linux; a digital signal processing (DSP) core running a real-time operating system
(RTOS); and a high-performance core also running an RTOS.
Multicore processors are replacing traditional, single-core processors so that fewer single-core
processors are being produced and supported. Consequently, single-core processors are
becoming technologically obsolete. Heterogeneous multicore processors, such as computer-on-a-
chip processors, are becoming more common.
Although multicore processors have largely saturated some application domains (e.g., cloud
computing, data warehousing, and on-line shopping), they are just starting to be used in real-time,
safety- and security-critical, cyber-physical systems. One area in which multicore processing is
becoming popular is in environments constrained by size, weight, power, and cooling (SWaP-C), in which significantly increased performance is required.
Multicore processing has become commonplace because it offers advantages in the following seven
areas:
1. Energy Efficiency. By using multicore processors, architects can decrease the number of
embedded computers. They overcome increased heat generation due to Moore's Law (i.e.,
smaller circuits increase electrical resistance, which creates more heat), which in turn
decreases the need for cooling. The use of multicore processing reduces power
consumption (less energy wasted as heat), which increases battery life.
2. True Concurrency. By allocating applications to different cores, multicore processing
increases the intrinsic support for actual (as opposed to virtual) parallel processing within
individual software applications across multiple applications.
3. Performance. Multicore processing can increase performance by running multiple
applications concurrently. The decreased distance between cores on an integrated chip
enables shorter resource access latency and higher cache speeds when compared to using
separate processors or computers. However, the size of the performance increase depends
on the number of cores, the level of real concurrency in the actual software, and the use of
shared resources.
4. Isolation. Multicore processors may improve (but do not guarantee) spatial and temporal
isolation (segregation) compared to single-core architectures. Software running on one
core is less likely to affect software on another core than if both are executing on the same
single core. This decoupling is due to both spatial isolation (of data in core-specific cashes)
and temporal isolation, because threads on one core are not delayed by threads on another
core. Multicore processing may also improve robustness by localizing the impact of defects
to single core. This increased isolation is particularly important in the independent
execution of mixed-criticality applications (mission-critical, safety critical, and security-
critical).
5. Reliability and Robustness. Allocating software to multiple cores increases reliability and
robustness (i.e., fault and failure tolerance) by limiting fault and/or failure propagation
from software on one core to software on another. The allocation of software to multiple
cores also supports failure tolerance by supporting failover from one core to another (and
subsequent recovery).
6. Obsolescence Avoidance. The use of multicore processors enables architects to avoid
technological obsolescence and improve maintainability. Chip manufacturers are applying
the latest technical advances to their multicore chips. As the number of cores continues to
increase, it becomes increasingly hard to obtain single-core chips.
7. Hardware Costs. By using multicore processors, architects can produce systems with
fewer computers and processors.
Although there are many advantages to moving to multicore processors, architects must address
disadvantages and associated risks in the following six areas:
1. Shared Resources. Cores on the same processor share both processor-internal resources
(L3 cache, system bus, memory controller, I/O controllers, and interconnects)
and processor-external resources (main memory, I/O devices, and networks). These shared
resources imply (1) the existence of single points of failure, (2) two applications running
on the same core can interfere with each other, and (3) software running on one core can
impact software running on another core (i.e., interference can violate spatial and temporal
isolation because multicore support for isolation is limited). The diagram below uses the
color red to illustrate six shared resources.
2. Interference. Interference occurs when software executing on one core impacts the
behavior of software executing on other cores in the same processor. This interference
includes failures of both spatial isolation (due to shared memory access) and failure
of temporal isolation (due to interference delays and/or penalties). Temporal isolation is a
bigger problem than spatial isolation since multicore processors may have special hardware
that can be used to enforce spatial isolation (to prevent software running on different cores
from accessing the same processor-internal memory). The number of interference paths
increases rapidly with the number of cores and the exhaustive analysis of all interference
paths is often impossible. The impracticality of exhaustive analysis necessitates the
selection of representative interference paths when analyzing isolation. The following
diagram uses the color red to illustrate three possible interference paths between pairs of
applications involving six shared resources.
3. Concurrency Defects. Cores execute concurrently, creating the potential for concurrency
defects including deadlock, livelock, starvation, suspension, (data) race conditions,
priority inversion, order violations, and atomicity violations. Note that these are essentially
the same types of concurrency defects that can occur when software is allocated to multiple
threads on a single core.
4. Non-determinism. Multicore processing increases non-determinism. For example, I/O
Interrupts have top-level hardware priority (also a problem with single core processors).
Multicore processing is also subject to lock thrashing, which stems from excessive lock
conflicts due to simultaneous access of kernel services by different cores (resulting in
decreased concurrency and performance). The resulting non-deterministic behavior can be
unpredictable, can cause related faults and failures, and can make testing more difficult
(e.g., running the same test multiple times may not yield the same test result).
5. Analysis Difficulty. The real concurrency due to multicore processing requires different
memory consistency models than virtual interleaved concurrency. It also breaks traditional
analysis approaches for work on single core processors. The analysis of maximum time
limits is harder and may be overly conservative. Although interference analysis becomes
The SEI is transitioning this research through activities that include the following:
• a workshop with participants from Carnegie Mellon University, Nagoya University, the
University of Illinois at Urbana-Champaign, and NEC Electronics
• collaboration with Lockheed-Martin researchers to publish papers on the use of multicore
architectures in cyber-physical systems
• invited talks at conferences such as IEEE International Conference on Embedded and Real-
Time Computing Systems and Applications
In this tutorial, we are going to learn about the Flynn's Classification of Computer Architecture
in Computer Science Organization.
Single instruction: Only one instruction stream is being acted on or executed by the CPU during one clock cycle.
Single data stream: Only one data stream is used as input during one clock cycle.
A SIMD system is a multiprocessor machine, capable of executing the same instruction on all the
CPUs but operating on the different data stream.
SPMD
As a computer program is executed, there are many events that can cause the CPU hardware
resources to not be fully utilized every CPU cycle. Such events include:
• Data Cache Misses - the required data must be loaded from memory outside the CPU. The
CPU has to wait for that data to arrive from the remote memory.
• Instruction Cache Misses - the next instruction of the program must be fetched from
memory outside the CPU. Again, the CPU has to wait for the next instruction to arrive
from the remote memory.
• Data dependency stalls - the next instruction can't execute yet as one of its input operands
hasn't been calculated yet.
• Functional Unit stalls - the next instruction can't execute yet as the required hardware
resource is currently busy.
When one portion of the program (known as a thread) is blocked by one of these events, the hardware resources could potentially be used for another thread of execution. By switching to a second thread when the first thread is blocked, the overall throughput of the system can be increased. The idea of speeding up the aggregate execution of all threads in the system is the basis of hardware multi-threading.
If one replicates an entire CPU to execute a second thread, then the technique is known as
multi-processing.
If one replicates only a portion of a CPU to execute a second thread, then the technique is
known as multi-threading.
A simple graphical example can be seen in Figures 1 and 2 - where the multi-threaded
implementation in Figure 2 does more aggregate work in the same number of cycles as the
single-threaded CPU in Figure 1. Instead of having the execution pipeline be idle while
waiting for the Memory data to arrive, the multi-threaded CPU executes code for Thread2
during those same memory access cycles. The idle cycles in black are often known as pipeline
"bubbles".
Sharing hardware resources among multiple threads gives an obvious cost advantage to multi-threading as compared to full-blown multi-processing. Another potential benefit is that multiple threads could be working on the same data. By sharing the same data caches, multiple threads get better utilization of these caches and better synchronization of the shared data.
By minimizing how much hardware is replicated for executing a software thread, multi-threading can boost overall system performance and throughput with relatively little additional hardware cost.
The performance boost from multi-threading comes from filling CPU cycles with useful work that would otherwise go unused due to stalls. Many applications have a low number of instructions executed per cycle when run in single-threaded mode and are therefore good candidates for multi-threading.
Any application which can keep the CPU fully busy every cycle with a single thread is not a good candidate for multi-threading; such applications are relatively rare.
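As a rough, hypothetical illustration of this point (the busy fractions below are invented numbers, not figures from the text), a thread that stalls for a large share of its cycles leaves room for a second thread to fill those cycles, up to the point where the pipeline saturates:

def pipeline_utilization(busy_fractions):
    # Each value is the fraction of cycles one thread could keep the pipeline
    # busy on its own; combined utilization saturates at 1.0.
    return min(sum(busy_fractions), 1.0)

print(pipeline_utilization([0.55]))        # single thread: 0.55
print(pipeline_utilization([0.55, 0.55]))  # two threads sharing the pipeline: 1.0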
Since the introduction of the first MT-enabled MIPS CPUs, there have been multiple studies of the MT performance benefits. Performance boosts of up to 226% have been reported, with the gains coming from both the hardware and from OS task-scheduling changes. Various application notes about MT performance gains, and an app note on MIPS Creator Ci40 Multithreading Benchmarks, can be found on the Imagination website
at: https://community.imgtec.com/developers/mips/resources/application-notes/
Types of Multi-threading
Coarse-Grained MT
Hardware support for this type of multi-threading is meant to allow quick switching between the threads. To achieve this goal, the additional hardware cost is to replicate the program-visible state, such as the GPRs and the program counter, for each thread. For example, to quickly switch between two threads, the hardware cost would be two copies of the GPRs and the program counter.
For this type of multi-threading, only long-latency stalls cause thread switches, because an instruction has to be added to the program for each stall check; it would be too costly to add such check instructions for very short stalls.
Fine-Grained MT
Early implementations of this type of multi-threading caused a thread switch every CPU
cycle. The motivation for switching every cycle was to reduce the possibility of stalling for
a previous result from the same thread. This early type was known as barrel processing, in
which staves of a barrel represented the pipeline stages of the CPU. It was also known as
interleaved or pre-emptive or time-sliced multi-threading. It was conceptually similar to
preemptive multi-tasking, used in operating systems, where the time slice that is given to each
active thread is one CPU cycle.
The additional hardware cost of fine-grained multi-threading is tracking the thread ID of the instruction in each pipeline stage. In addition, since multiple threads are concurrently active, shared resources such as caches and TLBs might need to be increased in size to avoid thrashing between the different threads.
More modern implementations would only cause a thread switch when the currently running
thread became blocked. For these more modern implementations, a thread can continue
executing until it would produce a stall.
Simultaneous MT
In a superscalar CPU that issues several instructions per cycle, simultaneous multi-threading (SMT) means that the instructions issued together can either all be from the same thread or can come from different threads. The hardware thread scheduler picks the most appropriate instructions to maximize the utilization of the execution pipelines.
MIPS Multi-threading
The first multi-threaded processor from MIPS was the MIPS32 34K, which was released in
2005. The 34K implemented fine-grained multi-threading (the more modern kind which
doesn’t have to blindly switch threads every cycle), with a hardware thread scheduler within
the CPU which picks the most appropriate thread to run each CPU cycle.
All subsequent multi-threaded processors from MIPS have also implemented fine-grained
multi-threading. This includes the 1004K symmetric multiprocessing system, and the
interAptiv family of multi-threaded, multi-core CPUs, which deliver increased multi-core
performance and added features such as dcache ECC, Extended Virtual Addressing
(EVA), multi-threaded FPU, updated DSP ASE, and other features.
The R6 versions of the MIPS32/64 architectures were released in 2014 and introduced a
simplified definition for MIPS Multi-threading. In this simplified definition, the entity
executing a software thread is known as a Virtual Processor. The MIPS Warrior I6400 is a
super-scalar CPU and was the first MIPS CPU which also implemented SMT. It also added
features including hardware virtualization.
More recently Imagination introduced the I6500 CPU, a superset of I6400 with numerous new
features at the core and cluster levels, and highly extended capabilities into heterogeneous
compute configurations.
Problem-02:
A four stage pipeline has stage delays of 150, 120, 160 and 140 ns respectively. Registers are used between the stages and have a delay of 5 ns each. Assuming a constant clocking rate, the total time taken to process 1000 data items on the pipeline will be-
1. 120.4 microseconds
2. 160.5 microseconds
3. 165.5 microseconds
4. 590.0 microseconds
Solution-
Given-
• Four stage pipeline is used
• Delay of stages = 150, 120, 160 and 140 ns
• Delay due to each register = 5 ns
• 1000 data items or instructions are processed
Cycle Time-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 150, 120, 160, 140 } + 5 ns
= 160 ns + 5 ns
= 165 ns
Pipeline Time To Process 1000 Data Items-
Pipeline time to process 1000 data items
= Time taken for 1st data item + Time taken for remaining 999 data items
= 1 x 4 clock cycles + 999 x 1 clock cycle
= 4 x cycle time + 999 x cycle time
= 4 x 165 ns + 999 x 165 ns
= 660 ns + 164835 ns
= 165495 ns
= 165.5 μs
Thus, Option (C) is correct.
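The same arithmetic can be checked with a short script (our own sketch; the variable names are not part of the problem):

stage_delays = [150, 120, 160, 140]        # ns
register_delay = 5                         # ns
n_items = 1000

cycle_time = max(stage_delays) + register_delay        # 165 ns
k = len(stage_delays)                                   # 4 stages
total_ns = (k + n_items - 1) * cycle_time               # (k + n - 1) clock cycles
print(cycle_time, total_ns, total_ns / 1000)            # 165 165495 165.495 (microseconds)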
Problem-03:
Consider a non-pipelined processor with a clock rate of 2.5 gigahertz and average cycles per
instruction of 4. The same processor is upgraded to a pipelined processor with five stages but due
to the internal pipeline delay, the clock speed is reduced to 2 gigahertz. Assume there are no stalls
in the pipeline. The speed up achieved in this pipelined processor is-
1. 3.2
2. 3.0
3. 2.2
4. 2.0
Solution-
Cycle Time in Non-Pipelined Processor-
Frequency of the clock = 2.5 gigahertz
Cycle time
= 1 / frequency
= 1 / (2.5 gigahertz)
= 1 / (2.5 x 10^9 hertz)
= 0.4 ns
Non-Pipeline Execution Time-
Non-pipeline execution time per instruction
= Average CPI x Cycle time
= 4 x 0.4 ns
= 1.6 ns
Cycle Time in Pipelined Processor-
Cycle time
= 1 / frequency
= 1 / (2 gigahertz)
= 0.5 ns
Pipeline Execution Time-
With no stalls, the pipelined processor completes one instruction per clock cycle, so
Pipeline execution time
= 1 clock cycle
= 0.5 ns
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 1.6 ns / 0.5 ns
= 3.2
Thus, Option (A) is correct.
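A quick check of this speed-up (a minimal sketch under the stated assumption of no pipeline stalls):

cpi_non_pipelined = 4
t_non_pipelined_ns = 1 / 2.5          # cycle time in ns at 2.5 GHz
t_pipelined_ns = 1 / 2.0              # cycle time in ns at 2 GHz; CPI = 1 with no stalls
speedup = (cpi_non_pipelined * t_non_pipelined_ns) / (1 * t_pipelined_ns)
print(speedup)                         # 3.2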
Problem-04:
The stage delays in a 4 stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage is
replaced with a functionally equivalent design involving two stages with respective delays 600 and
350 picoseconds.
The throughput increase of the pipeline is _____%.
Solution-
Execution Time in 4 Stage Pipeline-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 800, 500, 400, 300 } + 0
= 800 picoseconds
Thus, Execution time in 4 stage pipeline = 1 clock cycle = 800 picoseconds.
Throughput in 4 Stage Pipeline-
Throughput
= Number of instructions executed per unit time
= 1 instruction / 800 picoseconds
Execution Time in New 5 Stage Pipeline-
After the first stage is replaced by two stages of 600 and 350 picoseconds, the pipeline has five stages with delays 600, 350, 500, 400 and 300 picoseconds.
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 600, 350, 500, 400, 300 } + 0
= 600 picoseconds
Throughput in New 5 Stage Pipeline-
Throughput
= Number of instructions executed per unit time
= 1 instruction / 600 picoseconds
Throughput Increase-
Throughput increase
= { (1/600) - (1/800) } / (1/800) x 100
= { (800/600) - 1 } x 100
= 33.33%
Thus, the throughput increase of the pipeline is 33.33%.
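The throughput comparison can be verified in a few lines (our own sketch, working in instructions per picosecond):

old_stages = [800, 500, 400, 300]            # picoseconds
new_stages = [600, 350, 500, 400, 300]       # first stage replaced by two stages
old_throughput = 1 / max(old_stages)         # instructions per picosecond
new_throughput = 1 / max(new_stages)
increase = (new_throughput - old_throughput) / old_throughput * 100
print(round(increase, 2))                    # 33.33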
Problem-05:
A non-pipelined single cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five stages requiring 2.5 ns, 1.5 ns, 2 ns, 1.5 ns and 2.5 ns respectively. The delay of the latches is 0.5 ns.
The speed up of the pipeline processor for a large number of instructions is-
1. 4.5
2. 4.0
3. 3.33
4. 3.0
Solution-
Cycle Time in Non-Pipelined Processor-
Frequency of the clock = 100 MHz
Cycle time
= 1 / frequency
= 1 / (100 MHz)
= 1 / (100 x 10^6 hertz)
= 0.01 μs
Non-Pipeline Execution Time-
Non-pipeline execution time to process 1 instruction
= Number of clock cycles taken to execute one instruction
= 1 clock cycle
= 0.01 μs
= 10 ns
Cycle Time in Pipelined Processor-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 2.5, 1.5, 2, 1.5, 2.5 } + 0.5 ns
= 2.5 ns + 0.5 ns
= 3 ns
Pipeline Execution Time-
Pipeline execution time
= 1 clock cycle
= 3 ns
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 10 ns / 3 ns
= 3.33
Thus, Option (C) is correct.
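For a large number of instructions the speed-up approaches the ratio of the per-instruction times, as the following sketch (with our own variable names) confirms:

t_non_pipelined = 10.0                        # ns per instruction (one cycle at 100 MHz)
stage_delays = [2.5, 1.5, 2.0, 1.5, 2.5]      # ns
latch_delay = 0.5                             # ns
t_cycle = max(stage_delays) + latch_delay     # 3 ns
n = 1_000_000                                 # a large number of instructions
k = len(stage_delays)
speedup = (n * t_non_pipelined) / ((k + n - 1) * t_cycle)
print(round(speedup, 2))                      # 3.33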
Problem-06:
We have 2 designs D1 and D2 for a synchronous pipeline processor. D1 has a 5 stage pipeline with stage execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages, each with 2 ns execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?
1. 214 ns
2. 202 ns
3. 86 ns
4. 200 ns
Solution-
Cycle Time in Design D1-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 3, 2, 4, 2, 3 } + 0
= 4 ns
Execution Time For 100 Instructions in Design D1-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99 instructions
= 1 x 5 clock cycles + 99 x 1 clock cycle
= 5 x cycle time + 99 x cycle time
= 5 x 4 ns + 99 x 4 ns
= 20 ns + 396 ns
= 416 ns
Cycle Time in Design D2-
Cycle time
= Delay due to a stage + Delay due to its register
= 2 ns + 0
= 2 ns
Execution Time For 100 Instructions in Design D2-
Execution time for 100 instructions
= Time taken for 1st instruction + Time taken for remaining 99 instructions
= 1 x 8 clock cycles + 99 x 1 clock cycle
= 8 x cycle time + 99 x cycle time
= 8 x 2 ns + 99 x 2 ns
= 16 ns + 198 ns
= 214 ns
Time Saved-
Time saved using design D2 over design D1
= 416 ns - 214 ns
= 202 ns
Thus, Option (B) is correct.
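The two designs can also be compared with a small helper function (a sketch; the function name is ours):

def pipeline_time_ns(stage_delays, n_instructions, register_delay=0):
    # Total time = (number of stages + n - 1) clock cycles.
    cycle = max(stage_delays) + register_delay
    return (len(stage_delays) + n_instructions - 1) * cycle

d1 = pipeline_time_ns([3, 2, 4, 2, 3], 100)   # 416 ns
d2 = pipeline_time_ns([2] * 8, 100)           # 214 ns
print(d1, d2, d1 - d2)                         # 416 214 202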
Problem-07:
Consider an instruction pipeline with four stages (S1, S2, S3 and S4) each with combinational
circuit only. The pipeline registers are required between each stage and at the end of the last stage.
Delays for the stages are 5 ns, 6 ns, 11 ns and 8 ns respectively, and each pipeline register has a delay of 1 ns.
What is the approximate speed up of the pipeline in steady state under ideal conditions when
compared to the corresponding non-pipeline implementation?
1. 4.0
2. 2.5
3. 1.1
4. 3.0
Solution-
Non-Pipeline Execution Time-
Non-pipeline execution time for 1 instruction
= 5 ns + 6 ns + 11 ns + 8 ns
= 30 ns
Cycle Time in Pipelined Processor-
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { 5, 6, 11, 8 } + 1 ns
= 11 ns + 1 ns
= 12 ns
Pipeline Execution Time-
Pipeline execution time
= 1 clock cycle
= 12 ns
Speed Up-
Speed up
= Non-pipeline execution time / Pipeline execution time
= 30 ns / 12 ns
= 2.5
Thus, Option (B) is correct.
Problem-08:
An instruction pipeline has four stages S1, S2, S3 and S4. The number of clock cycles each instruction needs in each stage is given below-
Instruction S1 S2 S3 S4
I1 2 1 1 1
I2 1 3 2 2
I3 2 1 1 3
I4 1 2 2 2
What is the number of cycles needed to execute the following loop?
for(i=1 to 2) { I1; I2; I3; I4; }
1. 16
2. 23
3. 28
4. 30
Solution-
The phase-time diagram is-
From here, number of clock cycles required to execute the loop = 23 clock cycles.
Thus, Option (B) is correct.
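A short simulation reproduces this count, under the usual assumption that an instruction may enter a stage as soon as it has finished the previous stage and the preceding instruction has left that stage (the variable names are ours):

# Cycles each instruction needs in stages S1..S4
times = {'I1': [2, 1, 1, 1], 'I2': [1, 3, 2, 2], 'I3': [2, 1, 1, 3], 'I4': [1, 2, 2, 2]}
sequence = ['I1', 'I2', 'I3', 'I4'] * 2        # loop body executed twice

finish = [0, 0, 0, 0]                          # finish time of the previous instruction in each stage
for instr in sequence:
    prev_stage_done = 0
    for s in range(4):
        start = max(prev_stage_done, finish[s])   # wait for this stage to free up and for our previous stage
        finish[s] = start + times[instr][s]
        prev_stage_done = finish[s]
print(finish[3])                                # 23 clock cycles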
Problem-09:
Consider a pipelined processor with the following four stages-
IF : Instruction Fetch
ID : Instruction Decode and Operand Fetch
EX : Execute
WB : Write Back
The IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?
ADD R2, R1, R0    R2 ← R0 + R1
MUL R4, R3, R2    R4 ← R3 x R2
SUB R6, R5, R4    R6 ← R5 - R4
1. 7
2. 8
3. 10
4. 14
Solution-
The phase-time diagram is-
From here, number of clock cycles required to execute the instructions = 8 clock cycles.
Thus, Option (B) is correct.
Problem-10:
Consider the following processors. Assume that the pipeline registers have zero latency.
P1 : 4 stage pipeline with stage latencies 1 ns, 2 ns, 2 ns, 1 ns
P2 : 4 stage pipeline with stage latencies 1 ns, 1.5 ns, 1.5 ns, 1.5 ns
P3 : 5 stage pipeline with stage latencies 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns
P4 : 5 stage pipeline with stage latencies 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns
Which processor has the highest peak clock frequency?
1. P1
2. P2
3. P3
4. P4
Solution-
It is given that pipeline registers have zero latency. Thus,
Cycle time
= Maximum delay due to any stage + Delay due to its register
= Maximum delay due to any stage
For Processor P1:
Cycle time
= Max { 1 ns, 2 ns, 2 ns, 1 ns }
= 2 ns
Clock frequency
= 1 / Cycle time
= 1 / 2 ns
= 0.5 gigahertz
For Processor P2:
Cycle time
= Max { 1 ns, 1.5 ns, 1.5 ns, 1.5 ns }
= 1.5 ns
Clock frequency
= 1 / Cycle time
= 1 / 1.5 ns
= 0.67 gigahertz
For Processor P3:
Cycle time
= Max { 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns }
= 1 ns
Clock frequency
= 1 / Cycle time
= 1 / 1 ns
= 1 gigahertz
For Processor P4:
Cycle time
= Max { 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns }
= 1.1 ns
Clock frequency
= 1 / Cycle time
= 1 / 1.1 ns
= 0.91 gigahertz
Since processor P3 has the highest peak clock frequency (1 gigahertz), Option (C) is correct.
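The four peak frequencies can be checked quickly (our own sketch; stage latencies are in ns, so the reciprocal of the cycle time is in GHz):

designs = {
    'P1': [1, 2, 2, 1],
    'P2': [1, 1.5, 1.5, 1.5],
    'P3': [0.5, 1, 1, 0.6, 1],
    'P4': [0.5, 0.5, 1, 1, 1.1],
}
for name, stages in designs.items():
    print(name, round(1 / max(stages), 2))   # P1 0.5, P2 0.67, P3 1.0, P4 0.91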
Problem-11:
Consider a 3 GHz (gigahertz) processor with a three-stage pipeline and stage latencies T1, T2 and
T3 such that T1 = 3T2/4 = 2T3. If the longest pipeline stage is split into two pipeline stages of equal
latency, the new frequency is ____ GHz, ignoring delays in the pipeline registers.
Solution-
Let ‘t’ be the common multiple of each ratio, then-
• T1 = t
• T2 = 4t / 3
• T3 = t / 2
Pipeline Cycle Time-
Pipeline cycle time
= Maximum delay due to any stage + Delay due to its register
= Max { t, 4t/3, t/2 } + 0
= 4t/3
Frequency Of Pipeline-
Frequency
= 1 / Pipeline cycle time
= 1 / (4t / 3)
= 3 / 4t
Given frequency = 3 GHz. So,
3 / 4t = 3 GHz
4t = 1 ns
t = 0.25 ns
Stage Latencies Of Pipeline-
Substituting t = 0.25 ns, the stage latencies are-
• T1 = t = 0.25 ns
• T2 = 4t/3 = 1/3 ns ≈ 0.33 ns
• T3 = t/2 = 0.125 ns
New Frequency After Splitting The Longest Stage-
The longest stage T2 is split into two stages of equal latency, 1/6 ns each. The new stage latencies are 0.25 ns, 1/6 ns, 1/6 ns and 0.125 ns.
New cycle time
= Maximum delay due to any stage
= 0.25 ns
New frequency
= 1 / 0.25 ns
= 4 GHz
Thus, the new frequency is 4 GHz.
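The same result follows from a few lines of arithmetic (a sketch with our own variable names):

t = 0.25                                   # ns, since 4t/3 must equal 1/3 ns at 3 GHz
stages = [t, 4 * t / 3, t / 2]             # T1, T2, T3
longest = max(stages)
stages.remove(longest)
stages += [longest / 2, longest / 2]       # split the longest stage into two equal stages
print(1 / max(stages))                     # 4.0 GHz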
QUESTION BANK
PART A
1. List the two steps involved in executing an instruction.
2. Define hazard and its types.
PART B
1. Describe in detail about the functional units and the basic implementation scheme of
MIPS with suitable diagram.
2. Explain how the instruction pipeline works. What are the various situations where an
instruction pipeline can stall?
3. Examine the relationships between pipeline execution and addressing modes.
4. (i) Describe the role of cache memory in a pipelined system. (ii) Discuss the influence of
pipelining on instruction set design.
5. What is an instruction hazard? Explain the methods for dealing with instruction hazards.
6. Describe the data path and control considerations for pipelining.
7. Describe the techniques for handling data and instruction hazards in pipelining.
8. Briefly explain about exceptions.
9. Briefly explain about creating a single data path with a neat diagram.
10. Explain the operations of the data path with necessary diagrams.
11. Compare the performance of single-cycle versus pipelined execution.