Basic Processing Unit
Overview
Instruction Set Processor (ISP)
Central Processing Unit (CPU)
A typical computing task consists of a series
of steps specified by a sequence of machine
instructions that constitute a program.
An instruction is executed by carrying out a
sequence of more rudimentary operations.
Some Fundamental Concepts
The processor fetches one instruction at a time and
performs the operation specified.
Instructions are fetched from successive memory
locations until a branch or a jump instruction is
encountered.
The processor keeps track of the address of the memory
location containing the next instruction to be fetched
using the Program Counter (PC).
The fetched instruction is held in the Instruction Register (IR) while it is executed.
Executing an Instruction
Fetch the contents of the memory location pointed
to by the PC. The contents of this location are
loaded into the IR (fetch phase).
IR ← [[PC]]
Assuming that the memory is byte addressable and each instruction
occupies 4 bytes, increment the contents of the PC by 4 (fetch phase).
PC ← [PC] + 4
Carry out the actions specified by the instruction in
the IR (execution phase).
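The fetch and execution phases above can be sketched as a small loop. This is only an illustration: the dictionary standing in for memory, the tuple instruction encoding, the opcode names, and the `run` helper are all assumptions made for the example, not part of the text.

```python
def run(memory, registers, pc=0):
    """Fetch-execute loop: IR <- [[PC]], PC <- [PC] + 4, then execute IR."""
    while True:
        ir = memory[pc]          # fetch phase: IR <- [[PC]]
        pc += 4                  # fetch phase: PC <- [PC] + 4
        op = ir[0]               # execution phase: decode and act
        if op == "halt":         # hypothetical opcode that stops the loop
            return pc
        if op == "add":          # ("add", dst, src1, src2)
            _, dst, a, b = ir
            registers[dst] = registers[a] + registers[b]
        elif op == "jump":       # ("jump", target_address)
            pc = ir[1]
        else:
            raise ValueError(f"unknown opcode: {op}")


# Instructions sit at successive 4-byte addresses until the jump is reached.
program = {
    0: ("add", "R3", "R1", "R2"),
    4: ("jump", 12),
    8: ("add", "R3", "R3", "R3"),   # skipped by the jump
    12: ("halt",),
}
regs = {"R1": 5, "R2": 7, "R3": 0}
final_pc = run(program, regs)
```

Note that the PC is incremented before the instruction is executed, so a jump simply overwrites the already updated value.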
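Processor Organization
Figure 7.1. Single-bus organization of the datapath inside a processor. (Textbook page 413)
[Figure: the PC, MAR, MDR, IR, Y, Z, TEMP, and the general-purpose registers R0 through R(n-1) are connected to the internal processor bus. MAR drives the address lines and MDR connects to the data lines of the memory bus. A MUX, controlled by Select, passes either Y or the constant 4 to ALU input A; ALU input B comes from the bus, and the ALU result goes to Z. The ALU control lines (Add, Sub, ..., XOR) and Carry-in come from the instruction decoder and control logic, which also issues the control signals.]
Note: MDR has two inputs and two outputs.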
Executing an Instruction
Transfer a word of data from one processor
register to another or to the ALU.
Perform an arithmetic or a logic operation
and store the result in a processor register.
Fetch the contents of a given memory
location and load them into a processor
register.
Store a word of data from a processor
register into a given memory location.
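Register Transfers
Figure 7.2. Input and output gating for the registers in Figure 7.1.
[Figure: register Ri is loaded from the internal processor bus when Riin is asserted and drives the bus when Riout is asserted. Y is loaded under Yin; the MUX, controlled by Select, passes Y or the constant 4 to ALU input A; the ALU output is loaded into Z under Zin and placed on the bus under Zout.]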
Register Transfers
All operations and data transfers are controlled by the processor clock.
Figure 7.3. Input and output gating for one register bit.
[Figure: a D flip-flop driven by Clock; Riin controls loading the bit from the bus, and Riout gates the output Q onto the bus.]
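The gating of one register bit can be modeled roughly as follows. This is a sketch under assumptions: it assumes the bit keeps its value when Riin is 0, and it uses `None` to represent an output that is disconnected from the bus; the class and method names are made up for the example.

```python
class RegisterBit:
    """One register bit: loads from the bus when Ri_in is 1, otherwise holds."""

    def __init__(self, q=0):
        self.q = q

    def clock_edge(self, bus, ri_in):
        # Ri_in = 1 selects the bus as the next value; Ri_in = 0 keeps Q.
        if ri_in:
            self.q = bus

    def drive(self, ri_out):
        # Ri_out = 1 places Q on the bus; None models a disconnected output.
        return self.q if ri_out else None


bit = RegisterBit(q=0)
bit.clock_edge(bus=1, ri_in=0)   # no load: Q stays 0
held = bit.q
bit.clock_edge(bus=1, ri_in=1)   # load: Q becomes 1
on_bus = bit.drive(ri_out=1)
off_bus = bit.drive(ri_out=0)
```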
Performing an Arithmetic or
Logic Operation
The ALU is a combinational circuit that has no
internal storage.
The ALU gets its two operands from the MUX (input A) and the bus (input B).
The result is temporarily stored in register Z.
What is the sequence of operations to add the
contents of register R1 to those of R2 and store the
result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
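The three control steps above can be traced in a few lines of code. This is a minimal sketch that treats each register as a dictionary entry and the bus as a local variable; the function name and the sample values are assumptions for illustration.

```python
def add_registers(regs):
    """Run the three control steps for R3 <- [R1] + [R2] on a single bus."""
    # Step 1: R1out, Yin -> R1 drives the bus; Y latches it.
    bus = regs["R1"]
    regs["Y"] = bus
    # Step 2: R2out, SelectY, Add, Zin -> the MUX passes Y to input A,
    # the bus carries R2 to input B, and Z latches the ALU sum.
    bus = regs["R2"]
    regs["Z"] = regs["Y"] + bus
    # Step 3: Zout, R3in -> Z drives the bus; R3 latches it.
    bus = regs["Z"]
    regs["R3"] = bus
    return regs


state = add_registers({"R1": 10, "R2": 32, "R3": 0, "Y": 0, "Z": 0})
```

Three steps are needed because only one value can be on the single bus at a time, so Y and Z act as temporary holding registers around the ALU.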
Fetching a Word from Memory
Address into MAR; issue Read operation; data into MDR.
Figure: Connection and control signals for register MDR.
Fetching a Word from Memory
The response time of each memory access varies
(cache miss, memory-mapped I/O,…).
To accommodate this, the processor waits until it
receives an indication that the requested operation
has been completed (Memory-Function-Completed,
MFC).
Move (R1), R2
MAR ← [R1]
Start a Read operation on the memory bus
Wait for the MFC response from the memory
Load MDR from the memory bus
R2 ← [MDR]
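The wait for MFC can be illustrated with a toy memory whose response time is configurable. This is a sketch only: the `SlowMemory` class, its `latency` parameter, and the cycle counting are hypothetical and are not taken from the text.

```python
class SlowMemory:
    """Hypothetical memory that needs `latency` cycles before asserting MFC."""

    def __init__(self, contents, latency):
        self.contents = contents
        self.latency = latency
        self.remaining = 0
        self.address = None

    def start_read(self, address):
        self.address = address
        self.remaining = self.latency

    def tick(self):
        """Advance one clock cycle; return True when MFC is asserted."""
        self.remaining -= 1
        return self.remaining <= 0

    def data(self):
        return self.contents[self.address]


def move_indirect(regs, memory):
    """Move (R1), R2 with an explicit wait for MFC; returns cycles waited."""
    regs["MAR"] = regs["R1"]            # MAR <- [R1]
    memory.start_read(regs["MAR"])      # start a Read on the memory bus
    waited = 1
    while not memory.tick():            # wait for the MFC response
        waited += 1
    regs["MDR"] = memory.data()         # load MDR from the memory bus
    regs["R2"] = regs["MDR"]            # R2 <- [MDR]
    return waited


regs = {"R1": 100, "R2": 0, "MAR": 0, "MDR": 0}
mem = SlowMemory({100: 77}, latency=3)
cycles = move_indirect(regs, mem)
```

The processor code is the same whether the memory answers in one cycle or many; only the number of iterations of the wait loop changes.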
Figure 7.5. Timing of a memory Read operation.
[Figure: timing diagram over clock steps 1 to 3 showing the signals Clock, MARin, Address, Read, MR, MDRinE, Data, MFC, and MDRout, annotated with the five actions of Move (R1), R2 listed above.]
Assume that MAR is always available on the address lines of the memory bus.
Execution of a Complete
Instruction
Add (R3), R1
Fetch the instruction
Fetch the first operand (the contents of the
memory location pointed to by R3)
Perform the addition
Load the result into R1
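Architecture
Figure: Input and output gating for the registers.
[Figure: register Ri is connected to the internal processor bus through gates controlled by Riin and Riout; Y (loaded under Yin) and the constant 4 feed the MUX controlled by Select; the MUX output and the bus feed ALU inputs A and B; the ALU result is loaded into Z under Zin and placed on the bus under Zout.]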
Execution of a Complete Instruction
Add (R3), R1
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End
Figure: Control sequence for execution of the instruction Add (R3), R1.
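The seven control steps for Add (R3), R1 can be traced with a short simulation. This is a sketch under simplifying assumptions: memory is a dictionary, a read is treated as complete at the WMFC step, and the instruction word is stored as a plain string; none of these details come from the text.

```python
def execute_add_indirect(regs, memory):
    """Trace the seven control steps of Add (R3), R1 on the single-bus datapath."""
    # Step 1: PCout, MARin, Read, Select4, Add, Zin
    regs["MAR"] = regs["PC"]
    regs["Z"] = regs["PC"] + 4          # the MUX selects the constant 4
    # Step 2: Zout, PCin, Yin, WMFC
    regs["PC"] = regs["Z"]
    regs["Y"] = regs["Z"]
    regs["MDR"] = memory[regs["MAR"]]   # the instruction read completes (MFC)
    # Step 3: MDRout, IRin
    regs["IR"] = regs["MDR"]
    # Step 4: R3out, MARin, Read
    regs["MAR"] = regs["R3"]
    # Step 5: R1out, Yin, WMFC
    regs["Y"] = regs["R1"]
    regs["MDR"] = memory[regs["MAR"]]   # the operand read completes (MFC)
    # Step 6: MDRout, SelectY, Add, Zin
    regs["Z"] = regs["Y"] + regs["MDR"]
    # Step 7: Zout, R1in, End
    regs["R1"] = regs["Z"]
    return regs


memory = {0: "Add (R3), R1", 200: 30}
regs = {"PC": 0, "R1": 12, "R3": 200,
        "MAR": 0, "MDR": 0, "IR": None, "Y": 0, "Z": 0}
execute_add_indirect(regs, memory)
```

Steps 1 to 3 are the fetch phase and are the same for every instruction; steps 4 to 7 are specific to this Add.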
Execution of Branch Instructions
A branch instruction replaces the contents of
PC with the branch target address, which is
usually obtained by adding an offset X, given
in the branch instruction, to the updated value of the PC.
The offset X is usually the difference between
the branch target address and the address
immediately following the branch instruction.
Conditional branch
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     Offset-field-of-IRout, Add, Zin
5     Zout, PCin, End
Figure: Control sequence for an unconditional branch instruction.
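The relationship between the offset X and the branch target can be checked numerically. This is a minimal sketch assuming 4-byte instructions; the function names and the addresses are invented for the example.

```python
def branch_offset(branch_address, target_address, instruction_size=4):
    """Offset X = target minus the address right after the branch."""
    return target_address - (branch_address + instruction_size)


def execute_branch(pc, offset, instruction_size=4):
    """Follow the control sequence: update the PC, then add the offset to it."""
    updated_pc = pc + instruction_size   # steps 1-2: PC <- [PC] + 4, also held in Y
    z = updated_pc + offset              # step 4: Offset-field-of-IRout, Add, Zin
    return z                             # step 5: Zout, PCin


x = branch_offset(branch_address=1000, target_address=1040)
new_pc = execute_branch(pc=1000, offset=x)
# A backward branch gives a negative offset.
back = execute_branch(pc=2000, offset=branch_offset(2000, 1960))
```

Because the PC has already been incremented by the time the offset is added, the offset is measured from the instruction after the branch, not from the branch itself.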
Pipelining
Overview
Pipelining is widely used in modern
processors.
Pipelining improves system performance in
terms of throughput.
Pipelined organization requires sophisticated
compilation techniques.
Basic Concepts
Making the Execution of
Programs Faster
Use faster circuit technology to build the
processor and the main memory.
Arrange the hardware so that more than one
operation can be performed at the same time.
In the latter way, the number of operations
performed per second is increased even though
the elapsed time needed to perform any one
operation is not changed.
Traditional Pipeline Concept
Laundry Example
Ann, Brian, Cathy, Dave (A, B, C, D)
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes
Sequential laundry takes 6
hours for 4 loads
[Figure: timeline from 6 PM to midnight; loads A, B, C, D run one after another, each taking 30 + 40 + 20 minutes.]
If they learned pipelining, how
long would laundry take?
Pipelined laundry takes 3.5
hours for 4 loads
[Figure: timeline from 6 PM; loads A, B, C, D overlap in task order, with successive intervals of 30, 40, 40, 40, 40, and 20 minutes.]
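The two totals in the laundry example can be reproduced with a short calculation. This is a sketch that assumes each machine handles one load at a time and that a finished load can wait between machines; the function names are made up for the example.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage handles one load at a time; loads flow through in order."""
    free_at = [0] * len(stage_minutes)   # time at which each stage becomes free
    done = 0
    for _ in range(loads):
        t = 0                            # all loads are ready at time 0
        for s, minutes in enumerate(stage_minutes):
            start = max(t, free_at[s])   # wait for the load and for the machine
            t = start + minutes
            free_at[s] = t
        done = t
    return done


stages = [30, 40, 20]                    # washer, dryer, folder
seq = sequential_minutes(stages, loads=4)
pipe = pipelined_minutes(stages, loads=4)
print(seq / 60, pipe / 60)
```

The pipelined total is 30 + 4 × 40 + 20 minutes: once the pipeline is full, a load finishes every 40 minutes, which is the time of the slowest stage.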
Traditional Pipeline Concept
Pipelining doesn’t help the latency of a single task; it helps the
throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously using different resources
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” the pipeline and time to “drain” it reduce speedup
Stall for dependences
[Figure: pipelined laundry timeline from 6 PM; loads A, B, C, D in task order, with successive intervals of 30, 40, 40, 40, 40, and 20 minutes.]
Use the Idea of Pipelining in a
Computer
Fetch + Execution
Figure: Basic idea of instruction pipelining.
(a) Sequential execution: F1 E1 F2 E2 F3 E3 for instructions I1, I2, I3.
(b) Hardware organization: instruction fetch unit, interstage buffer B1, execution unit.
(c) Pipelined execution:
Clock cycle  1   2   3   4
I1           F1  E1
I2               F2  E2
I3                   F3  E3
Fetch + Decode + Execution + Write
Figure: A 4-stage pipeline.
(a) Instruction execution divided into four steps
Clock cycle  1   2   3   4   5   6   7
I1           F1  D1  E1  W1
I2               F2  D2  E2  W2
I3                   F3  D3  E3  W3
I4                       F4  D4  E4  W4
(b) Hardware organization: four stages separated by interstage buffers B1, B2, B3
F: Fetch instruction
D: Decode instruction and fetch operands
E: Execute operation
W: Write results
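The timing of the 4-stage pipeline in part (a) can be generated programmatically. This is a sketch of the ideal case only, assuming every stage takes exactly one clock cycle and nothing stalls; the function names are illustrative.

```python
def pipeline_schedule(num_instructions, stages=("F", "D", "E", "W")):
    """Map each instruction to {stage: clock cycle} for an ideal pipeline."""
    schedule = {}
    for i in range(1, num_instructions + 1):
        # Instruction i enters the first stage in cycle i and advances
        # one stage per cycle.
        schedule[f"I{i}"] = {s: i + k for k, s in enumerate(stages)}
    return schedule


def total_cycles(schedule):
    """Clock cycle in which the last stage of the last instruction runs."""
    return max(max(row.values()) for row in schedule.values())


sched = pipeline_schedule(4)
for name, row in sched.items():
    print(name, " ".join(f"{s}@{c}" for s, c in row.items()))
```

With four instructions and four stages, the last Write step falls in clock cycle 7, matching the seven columns of the figure.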
Role of Cache Memory
Each pipeline stage is expected to complete in one clock
cycle.
The clock period should be long enough to let the slowest
pipeline stage complete.
Faster stages can only wait for the slowest one to
complete.
Since main memory is very slow compared to
execution, if each instruction had to be fetched from
main memory, the pipeline would be almost useless.
Fortunately, a cache allows most instruction fetches to complete within one clock cycle.
Pipeline Performance
The potential increase in performance resulting
from pipelining is proportional to the number of
pipeline stages.
However, this increase would be achieved only if
all pipeline stages require the same time to
complete, and there is no interruption throughout
program execution.
Unfortunately, these conditions do not always hold in practice.
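For reference, the ideal-case speedup can be worked out directly. The sketch below assumes k stages of one clock cycle each, no stalls, and a non-pipelined processor that spends k cycles on every instruction; the function names and the chosen values of n are assumptions for illustration.

```python
def pipelined_cycles(n, k):
    """Cycles to finish n instructions on an ideal k-stage pipeline."""
    return k + (n - 1)      # k cycles for the first, then one completion per cycle


def speedup(n, k):
    """Ratio of unpipelined cycles (n * k) to pipelined cycles."""
    return (n * k) / pipelined_cycles(n, k)


for n in (4, 100, 1000):
    print(n, round(speedup(n, 4), 3))
```

The speedup approaches the number of stages k only for long instruction sequences, because the k - 1 cycles needed to fill the pipeline are spread over more instructions.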
Figure: Effect of an execution operation taking more than one clock cycle.
[Figure: timing diagram over clock cycles 1 to 9 for instructions I1 to I5, showing steps F1 D1 E1 W1, F2 D2 E2 W2, F3 D3 E3 W3, F4 D4 E4 W4, and F5 D5 E5.]
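The effect of a slow Execute step can be explored with a small in-order pipeline model. This is a sketch under stated assumptions, not a reproduction of the figure: it assumes that the second instruction's Execute step takes three cycles, and that an instruction stays in its stage until the next stage has been vacated (no extra buffering). The `simulate` function and the latency table are invented for the example.

```python
STAGES = ("F", "D", "E", "W")


def simulate(latencies):
    """Return the cycle in which each instruction enters each stage.

    `latencies` is a list of per-instruction dicts giving the number of
    cycles spent in each stage. An instruction may enter a stage only after
    finishing the previous stage and after the instruction ahead of it has
    moved on.
    """
    entered = []
    for i, lat in enumerate(latencies):
        row = {}
        for s, stage in enumerate(STAGES):
            # Earliest cycle at which this instruction's own work allows entry.
            if s == 0:
                ready = 1
            else:
                prev_stage = STAGES[s - 1]
                ready = row[prev_stage] + lat[prev_stage]
            # Earliest cycle at which the stage has been vacated.
            if i == 0:
                free = 1
            elif s + 1 < len(STAGES):
                free = entered[i - 1][STAGES[s + 1]]
            else:
                free = entered[i - 1][stage] + latencies[i - 1][stage]
            row[stage] = max(ready, free)
        entered.append(row)
    return entered


ones = {"F": 1, "D": 1, "E": 1, "W": 1}
program = [dict(ones) for _ in range(5)]
program[1]["E"] = 3          # assumption: I2's Execute step takes three cycles
timing = simulate(program)
for i, row in enumerate(timing, start=1):
    print(f"I{i}", row)
```

In this model one slow step delays every instruction behind it: the instructions that follow I2 cannot advance until the Execute stage is free again, so the pipeline completes nothing during the extra cycles.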