PIPELINING AND VECTOR PROCESSING
• Parallel Processing
• Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
• RISC Pipeline
PARALLEL PROCESSING
Parallel processing is a term used for a large class of techniques that provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system. Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system performs concurrent data processing to achieve faster execution.
• Ex: while an instruction is being executed in the ALU, the
next instruction can be read from memory.
• The system may have two or more ALUs and be able to
execute two or more instructions at the same time.
• Purpose: To increase the throughput
• Throughput: The amount of processing that can be
accomplished during a given interval of time.
• The amount of hardware increases with parallel processing, and with it, the cost of the system increases.
• However, technological developments have reduced hardware costs to the point where parallel processing techniques are economically feasible.
• Example of parallel processing:
  – Multiple functional units: the execution unit is separated into several functional units operating in parallel.
• One possible way is to separate the execution unit into eight functional units operating in parallel.
• Arithmetic operations with integers are handled by units such as the adder-subtractor and the integer multiplier.
• All units are independent of each other, so different operations can be performed at the same time.
• Parallel processing can be classified from:
  (i) the internal organization of the processors,
  (ii) the interconnection structure between processors, and
  (iii) the flow of information through the system.
• M. J. Flynn classified the organization of a computer system by the number of instruction and data items that are manipulated simultaneously.
• The sequence of instructions read from memory constitutes an instruction stream. The operations performed on the data in the processor constitute a data stream.
• Flynn's classification divides computers into four major groups.
PARALLEL COMPUTERS
Architectural Classification
– Flynn's classification
» Based on the multiplicity of Instruction Streams and Data Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor
                               Number of Data Streams
                               Single        Multiple
  Number of        Single      SISD          SIMD
  Instruction
  Streams          Multiple    MISD          MIMD
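As an illustration (not part of the original slides), the classification can be expressed as a simple lookup on the two stream counts; the function name and example counts below are hypothetical:

# Flynn's classification as a lookup on the multiplicity of
# instruction streams and data streams (illustrative sketch).
FLYNN = {
    ("single",   "single"):   "SISD",
    ("single",   "multiple"): "SIMD",
    ("multiple", "single"):   "MISD",
    ("multiple", "multiple"): "MIMD",
}

def classify(instruction_streams, data_streams):
    key = ("single" if instruction_streams == 1 else "multiple",
           "single" if data_streams == 1 else "multiple")
    return FLYNN[key]

print(classify(1, 1))    # SISD: conventional uniprocessor
print(classify(1, 64))   # SIMD: array processor
print(classify(16, 16))  # MIMD: multiprocessor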
SISD COMPUTER SYSTEMS
Figure: SISD organization. A control unit sends an instruction stream to a processor unit, which exchanges a data stream with a memory unit.
• Characteristics:
Ø One control unit, one processor unit, and one memory unit
Ø Parallel processing may be achieved by means of:
ü multiple functional units
ü pipeline processing
MISD COMPUTER SYSTEMS
Figure: MISD organization. Several control units (CU), each issuing its own instruction stream to a processor unit (P), while all processors operate on a single data stream from a shared memory.
• Characteristics:
  - There is no computer at present that can be classified as MISD.
  - It is of only theoretical interest, since no practical system has been constructed using this organization.
SIMD COMPUTER SYSTEMS
Figure: SIMD organization. A control unit, connected to a memory by a data bus, broadcasts a single instruction stream to an array of processor units (P); the processors exchange data streams with memory modules (M) through an alignment network.
• Characteristics
Ø Only one copy of the program exists
Ø All processors receive the same instruction from the control
unit but operate on different items of data.
MIMD COMPUTER SYSTEMS
Figure: MIMD organization. Processor/memory pairs (P, M) connected through an interconnection network to a shared memory.
• Characteristics:
Ø Multiple processing units (multiprocessor system)
Ø Execution of multiple instructions on multiple data
• Types of MIMD computer systems
- Shared memory multiprocessors
- Message-passing multicomputer (multicomputer system)
PIPELINING
• A technique of decomposing a sequential process into suboperations,
with each subprocess being executed in a special dedicated segment
that operates concurrently with all other segments.
    Ai * Bi + Ci    for i = 1, 2, 3, ..., 7

Figure: three-segment pipeline. Segment 1 loads Ai and Bi from memory into registers R1 and R2; Segment 2 multiplies R1 by R2 into R3 and loads Ci into R4; Segment 3 adds R3 and R4 into R5.

Suboperations in each segment:
    R1 ← Ai,  R2 ← Bi           Load Ai and Bi
    R3 ← R1 * R2,  R4 ← Ci      Multiply and load Ci
    R5 ← R3 + R4                Add
OPERATIONS IN EACH PIPELINE STAGE
Clock pulse    Segment 1        Segment 2               Segment 3
  number       R1      R2       R3          R4          R5
     1         A1      B1       ---         ---         ---
     2         A2      B2       A1 * B1     C1          ---
     3         A3      B3       A2 * B2     C2          A1 * B1 + C1
     4         A4      B4       A3 * B3     C3          A2 * B2 + C2
     5         A5      B5       A4 * B4     C4          A3 * B3 + C3
     6         A6      B6       A5 * B5     C5          A4 * B4 + C4
     7         A7      B7       A6 * B6     C6          A5 * B5 + C5
     8         ---     ---      A7 * B7     C7          A6 * B6 + C6
     9         ---     ---      ---         ---         A7 * B7 + C7
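The table can be reproduced with a short simulation. The following Python sketch (not from the original slides) models the three segments with symbolic operands; register updates are applied from the last segment backward so that each segment uses the values latched on the previous clock pulse:

# Simulate the three-segment pipeline for Ai * Bi + Ci, i = 1..n,
# printing the register contents after each clock pulse.
def simulate_pipeline(n=7):
    R1 = R2 = R3 = R4 = R5 = None            # empty registers
    for clock in range(1, n + 3):            # n + (k - 1) pulses, k = 3
        # Segment 3: add, using values latched by segment 2
        R5 = f"{R3} + {R4}" if R3 is not None else None
        # Segment 2: multiply and load Ci, using values latched by segment 1
        if R1 is not None:
            R3, R4 = f"{R1} * {R2}", f"C{clock - 1}"
        else:
            R3 = R4 = None
        # Segment 1: load the next Ai and Bi while earlier tasks move on
        if clock <= n:
            R1, R2 = f"A{clock}", f"B{clock}"
        else:
            R1 = R2 = None
        print(clock, R1 or "---", R2 or "---", R3 or "---", R4 or "---", R5 or "---")

simulate_pipeline()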
GENERAL PIPELINE
• General Structure of a 4-Segment Pipeline
Figure: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with all registers driven by a common clock.
• Space-Time Diagram
The following diagram shows 6 tasks T1 through T6 executed in 4
segments.
Clock cycles    1    2    3    4    5    6    7    8    9
Segment 1      T1   T2   T3   T4   T5   T6
        2           T1   T2   T3   T4   T5   T6
        3                T1   T2   T3   T4   T5   T6
        4                     T1   T2   T3   T4   T5   T6

No matter how many segments, once the pipeline is full, it takes only one clock period to obtain an output.
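The space-time diagram generalizes to any k and n. A minimal Python sketch (an illustration, not part of the slides): task Ti occupies segment s during clock cycle i + s - 1.

# Print the space-time diagram of a k-segment pipeline executing n tasks.
def space_time(k=4, n=6):
    total = k + n - 1                        # clock cycles to finish all tasks
    for s in range(1, k + 1):
        row = []
        for c in range(1, total + 1):
            t = c - s + 1                    # task in segment s at cycle c
            row.append(f"T{t}" if 1 <= t <= n else "  ")
        print(f"Segment {s}: " + " ".join(row))

space_time()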
PIPELINE SPEEDUP
Consider the case where a k-segment pipeline is used to execute n tasks.
 Ø n = 6 in the previous example
 Ø k = 4 in the previous example
• Pipelined machine (k segments, n tasks)
 Ø Clock cycles to complete the n tasks = k + (n - 1)  (9 in the previous example)
• Conventional machine (non-pipelined)
 Ø Cycles to complete each task = k (assuming each task takes as long as one pass through all k segments)
 Ø For n tasks, nk cycles are required
• Speedup (S)
 Ø S = non-pipelined time / pipelined time
 Ø For n tasks: S = nk / (k + n - 1)
 Ø As n becomes much larger than k - 1, k + n - 1 approaches n; therefore, S approaches nk/n = k
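A quick numerical check of the speedup formula (illustrative Python, not from the slides):

# Pipeline speedup S = n*k / (k + n - 1); it approaches k as n grows.
def speedup(n, k):
    return (n * k) / (k + n - 1)

print(speedup(6, 4))       # previous example: 24 / 9   ≈ 2.67
print(speedup(100, 4))     # example below:    400 / 103 ≈ 3.88
print(speedup(10**6, 4))   # very large n: approaches k = 4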
PIPELINE AND MULTIPLE FUNCTION UNITS
Example:
  - 4-stage pipeline
  - 100 tasks to be executed in sequence
  - each task takes 4 clock cycles on the non-pipelined system
Pipelined system:      k + n - 1 = 4 + 99 = 103 clock cycles
Non-pipelined system:  n * k = 100 * 4 = 400 clock cycles
Speedup:               S = 400 / 103 ≈ 3.88
Types of Pipelining
• Arithmetic Pipeline
• Instruction Pipeline
Arithmetic Pipeline
• Pipeline arithmetic units are usually found in very high speed computers.
• They are used to implement floating-point operations (addition and subtraction), multiplication of fixed-point numbers, and similar computations.
• The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers.
• Floating-point addition and subtraction can be performed in four segments, as shown in the figure below.
• The registers labeled R are placed between the segments to store intermediate results.
• The suboperations performed in the four segments are:
  1. Compare the exponents
  2. Align the mantissas
  3. Add or subtract the mantissas
  4. Normalize the result
Floating-point adder pipeline (four segments):
  [1] Compare the exponents
  [2] Align the mantissas
  [3] Add or subtract the mantissas
  [4] Normalize the result

Figure: the exponents a, b and mantissas A, B enter Segment 1, which compares the exponents by subtraction; Segment 2 chooses the larger exponent and aligns the mantissa of the smaller number; Segment 3 adds or subtracts the mantissas; Segment 4 adjusts the exponent and normalizes the result. Registers R separate the segments.

Example:  X = A x 10^a = 0.9504 x 10^3
          Y = B x 10^b = 0.8200 x 10^2

1) Compare exponents:  3 - 2 = 1
2) Align mantissas:    X = 0.9504  x 10^3
                       Y = 0.08200 x 10^3
3) Add mantissas:      Z = 1.0324  x 10^3
4) Normalize result:   Z = 0.10324 x 10^4
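The four suboperations can be traced in a few lines of Python. This is an illustrative sketch using decimal (mantissa, exponent) pairs, not the actual hardware organization:

# Floating-point addition in four steps; a number (m, e) means m * 10**e.
def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare the exponents (by subtraction)
    diff = ea - eb
    # Segment 2: choose the larger exponent and align the smaller mantissa
    if diff >= 0:
        e, mb = ea, mb / (10 ** diff)
    else:
        e, ma = eb, ma / (10 ** -diff)
    # Segment 3: add (or subtract) the mantissas
    m = ma + mb
    # Segment 4: normalize so that 0.1 <= |mantissa| < 1
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    while 0 < abs(m) < 0.1:
        m, e = m * 10, e - 1
    return m, e

X = (0.9504, 3)
Y = (0.8200, 2)
print(fp_add(X, Y))    # ≈ (0.10324, 4), i.e. 0.10324 x 10^4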
• The comparator, shifter, adder-subtractor, incrementer, and decrementer in the floating-point pipeline are implemented with combinational circuits.
• Suppose the individual segment delays are 60, 70, 100, and 80 ns (a total of 310 ns), and the register delay is 10 ns.
• Non-pipelined total delay = 310 + 10 = 320 ns.
• Pipelined clock period = longest segment delay + register delay = 100 + 10 = 110 ns.
• Speedup = 320 / 110 ≈ 2.9.
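The same arithmetic as an illustrative snippet (the delay values are the ones assumed above):

# Speedup of the pipelined adder from the segment and register delays.
segment_delays = [60, 70, 100, 80]    # ns, one entry per segment
register_delay = 10                   # ns

non_pipelined = sum(segment_delays) + register_delay     # 320 ns
pipelined_clock = max(segment_delays) + register_delay   # 110 ns
print(non_pipelined / pipelined_clock)                    # ≈ 2.9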
INSTRUCTION PIPELINE
Pipeline processing can occur not only in the data stream but in the
instruction stream as well.
An instruction pipeline reads consecutive instructions from memory
while previous instructions are being executed in other segments.
Six Phases* in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
• * Some instructions skip some phases
• * Effective address calculation can be done as part of the decoding phase
• * Storage of the operation result into a register is done automatically in the execution phase
• ==> 4-Stage Pipeline
  [1] FI: Fetch an instruction from memory
  [2] DA: Decode the instruction and calculate the effective address of the operand
  [3] FO: Fetch the operand
  [4] EX: Execute the operation
Execution of Three Instructions in a 4-Stage Pipeline
Conventional (non-pipelined):
  i      FI  DA  FO  EX
  i+1                    FI  DA  FO  EX
  i+2                                    FI  DA  FO  EX

Pipelined:
  i      FI  DA  FO  EX
  i+1        FI  DA  FO  EX
  i+2            FI  DA  FO  EX
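The difference between the two timings can be expressed as a simple scheduling rule. The Python sketch below (an illustration, not from the slides) gives the clock cycle in which instruction i occupies each stage:

# Start cycle of each stage for instruction i (0-based) in conventional
# versus pipelined execution of a 4-stage FI-DA-FO-EX pipeline.
STAGES = ["FI", "DA", "FO", "EX"]

def conventional(i, stage):
    return i * len(STAGES) + stage      # each instruction waits for the previous one

def pipelined(i, stage):
    return i + stage                    # a new instruction enters every cycle

for i in range(3):
    print("conventional", i, [conventional(i, s) for s in range(4)])
    print("pipelined   ", i, [pipelined(i, s) for s in range(4)])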
INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE
Figure: instruction flow through the 4-stage pipeline.
  Segment 1: fetch instruction from memory
  Segment 2: decode instruction and calculate effective address; if the instruction is a branch, update the PC and empty the pipe
  Segment 3: fetch operand from memory
  Segment 4: execute instruction; if an interrupt is pending, handle the interrupt, then update the PC and empty the pipe
Timing of the instruction pipeline when instruction 3 is a branch:

Step:           1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1   FI  DA  FO  EX
            2       FI  DA  FO  EX
  (Branch)  3           FI  DA  FO  EX
            4               FI          FI  DA  FO  EX
            5                               FI  DA  FO  EX
            6                                   FI  DA  FO  EX
            7                                       FI  DA  FO  EX
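The branch penalty above can be reproduced with a small scheduler. This Python sketch is an illustration only; it assumes the target instruction cannot be fetched until the branch finishes its EX stage, and that the instruction fetched behind the branch is discarded:

# Schedule a 4-stage FI-DA-FO-EX pipeline where one instruction is a branch.
def schedule(n_instructions, branch_at):
    timeline = {}                     # instruction -> [FI, DA, FO, EX] clock steps
    fetch_step = 1
    for i in range(1, n_instructions + 1):
        steps = [fetch_step, fetch_step + 1, fetch_step + 2, fetch_step + 3]
        timeline[i] = steps
        if i == branch_at:
            # next useful fetch happens the cycle after the branch's EX stage
            fetch_step = steps[3] + 1
        else:
            fetch_step += 1
    return timeline

for instr, steps in schedule(7, branch_at=3).items():
    print(instr, dict(zip(["FI", "DA", "FO", "EX"], steps)))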
Pipeline Conflicts
– Pipeline conflicts: 3 major difficulties
  1) Resource conflicts: memory accesses by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
  2) Data dependency: an instruction depends on the result of a previous instruction, but this result is not yet available.
  3) Branch difficulties: branch and other instructions (interrupt, return, etc.) that change the value of the PC.
RISC Computer
• RISC (Reduced Instruction Set Computer)
- Machine with a very fast clock cycle that executes at the rate of one
instruction per cycle.
• Major Characteristics
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
8. Relatively large number of registers in the processor unit
9. Efficient instruction pipeline
10. Compiler support for efficient translation of high-level language
programs into machine language programs
RISC PIPELINE
• The instruction cycle can be divided into three suboperations and implemented in three segments (I, A, E):
  I segment: fetches the instruction from program memory.
  A segment: decodes the instruction and performs an ALU operation.
  E segment: transfers the output of the ALU to a register, to memory, or to the PC.
• Types of instructions
- Data Manipulation Instructions
- Load and Store Instructions
- Program Control Instructions
• Example: three-segment instruction pipeline
  – Pipeline timing with data conflict (figure (a))
  – Pipeline timing with delayed load (figure (b))
• In figure (a), there is a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment:
  1. LOAD:  R1 ← M[address 1]
  2. LOAD:  R2 ← M[address 2]
  3. ADD:   R3 ← R1 + R2
  4. STORE: M[address 3] ← R3
• Solution: delayed load. The compiler inserts a no-op (or a useful instruction that does not depend on the load) after the second LOAD, so that R2 is available by the time the ADD reaches the A segment.
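A compiler-style sketch of the delayed-load fix (illustrative Python, not from the slides; the program encoding below is hypothetical): it inserts a no-op after any LOAD whose destination register is used by the very next instruction.

# Each instruction is (opcode, destination, list_of_sources).
program = [
    ("LOAD",  "R1", ["M1"]),         # R1 <- M[address 1]
    ("LOAD",  "R2", ["M2"]),         # R2 <- M[address 2]
    ("ADD",   "R3", ["R1", "R2"]),   # R3 <- R1 + R2
    ("STORE", "M3", ["R3"]),         # M[address 3] <- R3
]

def delayed_load(prog):
    out = []
    for idx, instr in enumerate(prog):
        out.append(instr)
        op, dest, _ = instr
        nxt = prog[idx + 1] if idx + 1 < len(prog) else None
        # A LOAD result is not ready for the A segment of the very next
        # instruction, so a no-op delay slot is inserted after the LOAD.
        if op == "LOAD" and nxt is not None and dest in nxt[2]:
            out.append(("NOP", None, []))
    return out

for instr in delayed_load(program):
    print(instr)    # LOAD R1, LOAD R2, NOP, ADD R3, STORE M3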