Processor Architecture and Design

Introduction

HOW DOES A MICROPROCESSOR HANDLE AN INSTRUCTION?
Fetch Cycle
The fetch cycle reads the required instruction from memory and stores it in the instruction register.

Execute Cycle
The execute cycle carries out the actual actions specified by the instruction.
[Figure: Block Diagram of a Microprocessor. The Bus Interface Unit (BIU: memory interface, program address generator, instruction register) drives the address and data buses to ROM, RAM, I/O ports, video, and discs; the Execution Unit (EU: ALU, registers A-H, control & timing, CLK) executes the fetched instructions.]
RISC VS CISC
Which is better?

WHAT IS THE EFFECT?
If operands can be present anywhere (register or memory), the size of an instruction varies, which complicates the instruction decoder.
ISA
CISC: operands for an arithmetic/logic operation can be in registers or memory.
RISC: operands for an arithmetic/logic operation may only be in registers (a register-register, or load-store, architecture).
RISC Vs CISC
Goal: multiply the data in memory location A with the data in B and put the result back in A.

CISC:
  MUL A, B

RISC:
  LDA R0, A
  LDA R1, B
  MUL R0, R1
  STR A, R0

[Figure: memory locations A, B, C and a register file R0-R3 feeding an ALU (×, ÷, +, −)]
$$ \frac{\text{Time}}{\text{Program}} = \frac{\text{Time}}{\text{Cycle}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Instructions}}{\text{Program}} $$

RISC aims to lower the cycles per instruction; CISC aims to lower the instructions per program.
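As a quick numeric illustration (the instruction counts and CPIs below are invented for this sketch, not from the slides), the equation can be evaluated for a CISC-like and a RISC-like machine in Python:

def exec_time_ns(cycle_ns, cpi, inst_count):
    # Time/Program = Time/Cycle x Cycles/Instruction x Instructions/Program
    return cycle_ns * cpi * inst_count

# Hypothetical machines: CISC runs fewer but slower instructions,
# RISC runs more but simpler ones.
print(exec_time_ns(10, 4.0, 1_000_000))  # CISC-like: 40,000,000.0 ns
print(exec_time_ns(10, 1.2, 2_500_000))  # RISC-like: 30,000,000.0 ns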
Processor Speed-up
Introduction
Speed-up techniques:
• Deeply pipelined machines
• Many instructions per cycle
• Out-of-order execution of instructions
• Aggressive branch-prediction techniques
The Three Walls
• The Power Wall
• The Memory Wall
• The ILP Wall
Power Wall
• Power dissipation depends on clock rate, capacitive load, and voltage
• Increases in clock frequency dissipate more power and demand more cooling
• Decreases in voltage reduce dynamic power consumption but increase static leakage through the transistors
• We have reached the practical power limit for cooling
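For reference, the standard CMOS relation behind these bullets (not printed on the slide) is

$$ P_{dynamic} \approx \alpha \, C \, V^{2} \, f $$

where α is the switching-activity factor, C the capacitive load, V the supply voltage, and f the clock frequency; static leakage power adds to this and grows as the supply and threshold voltages are lowered.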
The Memory Wall: the growing gap between processor speed and memory speed.
ILP (Course)
• Pipelined
• VLIW
• Superscalar

DLP
• SIMD
• Vector architectures
• GPU

TLP (Course)
• MIMD
• Multi-threaded
• Distributed-memory MIMD
• Shared-memory MIMD
Arch, Implementation & Realization
• Architecture: the ISA; the functional-level behavior of the processor
• Implementation: the micro-architecture; the logic structure that implements the architecture
• Realization: the physical implementation
ISA
• The contract between hardware and software
• Multiple machines can implement the same ISA
• Advantage: program portability
• Microprocessor design starts with the ISA
• The ISA gives rise to the micro-architecture
• The micro-architecture has to be rigorously verified
ISA
• Development is very slow
• ISAs have varied in:
  • Number of operands
  • Implied operands
  • Whether operands may be stored on a stack
Dynamic-Static Interface
The DSI separates what is done statically at compile time from what is done dynamically at run time.
DSI
[Figure: the DSI sits between the program (software) and the machine (hardware); complexity exposed to software is handled statically by the compiler, while complexity exposed to hardware is handled dynamically.]
DSI
[Figure: placement of the DSI between a high-level language (HLL) and the hardware for different ISA styles (DEL, CISC, VLIW, RISC) at interface levels DSI1, DSI2, and DSI3.]
What is parallel computing? Serial Computing
• Traditionally, software has been written for serial computation:
  • To be run on a single computer having a single Central Processing Unit (CPU)
  • A problem is broken into a discrete series of instructions
  • Instructions are executed one after another
  • Only one instruction may execute at any moment in time
Serial Computing
[Figure: a problem is processed by one CPU as a sequence of instructions T1, T2, T3, …, TN]
What is parallel computing
• In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
  • To be run using multiple CPUs
  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down into a series of instructions
  • Instructions from each part execute simultaneously on different CPUs
What is parallel computing
[Figure: a problem split into four parts, each solved simultaneously on CPU 1 through CPU 4]
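A minimal sketch of this decomposition in Python (the worker function and the four-way split are invented for illustration):

from multiprocessing import Pool

def solve_part(part):
    # Each discrete part is a series of instructions run on its own CPU.
    return sum(x * x for x in part)

if __name__ == "__main__":
    problem = list(range(1_000_000))
    parts = [problem[i::4] for i in range(4)]  # break the problem into 4 parts
    with Pool(processes=4) as pool:            # CPU 1 .. CPU 4
        results = pool.map(solve_part, parts)  # parts are solved concurrently
    print(sum(results))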
Parallel Computing
• A single processor with multiple cores
• A single computer with multiple processors
• An arbitrary number of computers connected by a network
• A combination of all three
Parallel Computing
• The computational problem should be able to:
  • Be broken apart into discrete pieces of work that can be solved simultaneously
  • Execute multiple program instructions at any moment in time
  • Be solved in less time with multiple compute resources than with a single compute resource
The most important law in micro-architecture
Amdahl’s Law

$$ S = \frac{T_{total}}{T_{improved}} = \frac{T_{total}}{\left(T_{total} - T_{component}\right) + \dfrac{T_{component}}{n}} $$
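A worked example (numbers chosen purely for illustration): if the improved component accounts for 40% of the original execution time and is sped up n = 2 times, then, normalizing T_total to 1,

$$ S = \frac{1}{(1 - 0.4) + 0.4/2} = \frac{1}{0.8} = 1.25 $$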
Law of Diminishing Returns
[Figure: as the enhancement factor grows, total time approaches the unenhanced fraction 1 − f, which bounds the achievable speedup.]
Types and Levels of parallelism
• Functional parallelism: irregular
• Data-level parallelism: regular
Functional Parallelism
• Instruction level
• Loop level (recurrences)
• Procedure level
• Program level
Flynn’s Taxonomy
• SISD: single instruction, single data
• SIMD: single instruction, multiple data
• MISD: multiple instruction, single data
• MIMD: multiple instruction, multiple data
Basic Parallel Techniques
• Pipelining
• Replication

ILP
TYPES OF ILP-PROCESSORS
• Traditional von Neumann: sequential issue, sequential execution
• Scalar ILP: sequential issue, parallel execution
• Superscalar ILP: parallel issue, parallel execution
  • VLIW: static schedule
  • Superscalar: dynamic schedule
INTERNAL OPERATION
• Pipelined processors
• VLIW / superscalar
VLIW & SUPERSCALAR ARCHITECTURE
[Figure: execution units EU1, EU2, EU3 sharing a common register file]
VLIW
[Figure: a single instruction-fetch unit feeding EU1, EU2, EU3 directly over a shared register file]
SUPERSCALAR
[Figure: an instruction-fetch unit followed by a dispatch unit that issues to EU1, EU2, EU3 over a shared register file]
PIPELINE (Scalar)
AMDAHL’S LAW

$$ S = \frac{T_{total}}{T_{improved}} = \frac{T_{total}}{\left(T_{total} - T_{component}\right) + \dfrac{T_{component}}{n}} $$
PIPELINE – N STAGES
• Phase 1: filling
• Phase 2: full
• Phase 3: draining
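The standard timing behind these phases (not spelled out on the slide): k instructions on an N-stage pipeline take

$$ T_{pipe} = N + (k - 1) \text{ cycles}, \qquad S(k) = \frac{kN}{N + k - 1} \to N \text{ as } k \to \infty $$

so it is the filling and draining phases that keep a real pipeline below its ideal speedup of N.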
IDEALIZED PIPELINE EXECUTION
[Figure (repeated over three slides): idealized execution profile of an N-stage pipeline; the pipeline is full for a fraction g of the time, and the remaining 1 − g is spent filling and draining.]
REALISTIC PIPELINE EXECUTION PROFILE
[Figure (repeated over two slides): a realistic execution profile of an N-stage pipeline, with repeated filling and draining phases between full periods.]
AMDAHL’S LAW

$$ S = \frac{1}{(1 - g) + \dfrac{g}{N}} $$
AMDAHL’S LAW
N = 5:  g = 100% → S = 5;   g = 90% → S = 3.57
N = 10: g = 100% → S = 10;  g = 90% → S = 5.26
N = 20: g = 100% → S = 20;  g = 90% → S = 6.897
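A small check of these figures (a sketch, not from the slides):

def amdahl_speedup(g, n):
    # S = 1 / ((1 - g) + g/N): N-stage pipeline, full a fraction g of the time
    return 1.0 / ((1.0 - g) + g / n)

for n in (5, 10, 20):
    for g in (1.0, 0.9):
        print(f"N={n:2d}, g={g:.0%}: S = {amdahl_speedup(g, n):.3f}")
# Reproduces the table: 5.000, 3.571, 10.000, 5.263, 20.000, 6.897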
AMDAHL’S LAW

$$ S = \frac{1}{\dfrac{g_1}{1} + \dfrac{g_2}{2} + \cdots + \dfrac{g_N}{N}} $$

where g_i is the fraction of time i operations proceed in parallel.
AMDAHL’S LAW - SUPERSCALAR

$$ S = \frac{1}{(1 - f) + \dfrac{f}{N}} $$
AMDAHL’S LAW - SUPERSCALAR

$$ S = \frac{1}{\dfrac{1 - f}{N_1} + \dfrac{f}{N_2}} $$
3F
Reported ILP estimates:
• Flynn: 2
• Foster: 51
• Fisher: 90
PARAMETERS – JOUPPI CLASSIFICATION
• Operation latency (OL)
• Issue latency (IL)
• Machine parallelism (MP)
• Issue parallelism (IP)
All are static parameters.
Super-pipelined machines divide the base cycle into minor cycles.
[Figure: a base pipeline of Fetch (F unit), Decode (D unit), Execute (E unit), and Writeback stages, connected to the register file and cache memory]
BASE PIPELINE
Stages: IF → DE → EX → WB
OL = 1
MP = 4
IL = 1
IP = 1
SUPER-PIPELINED
Stages: IF → DE → EX → WB, each divided into 3 minor cycles
OL = 1 (= 3 minor cycles)
MP = 12
IL = 1 per minor cycle
IP = 3
PIPELINE
• Under-pipelined: execution > issue
• Super-pipelined: issue > execution; a deeply pipelined machine with restrictions on the forwarding paths
MIPS R4000
• 8 physical stages
• Each stage: 10 ns
• CLK: 50 MHz
• A clock doubler is present internally
• 20 ns base line (the 50 MHz external clock gives a 20 ns cycle; the internal doubler yields the 10 ns stage time)

MIPS R4000 pipeline: IF1 | IF2 | RF | EX | DF1 | DF2 | TC | WB
SUPERSCALAR
Three parallel pipelines: IF → DE → EX → WB
OL = 1
MP = 12
IL = 1
IP = 3
SUPERSCALAR - SUPER PIPELINED
Three parallel pipelines: IF → DE → EX → WB, each stage divided into 3 minor cycles
OL = 1 (= 3 minor cycles)
MP = 36
IL = 1 per minor cycle
IP = 9
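One pattern consistent with all four configurations above (my summary, not a slide formula):

$$ MP = (\text{base stages}) \times (\text{minor cycles per stage}) \times (\text{issue width}) $$

giving 4 × 1 × 1 = 4 (base), 4 × 3 × 1 = 12 (super-pipelined), 4 × 1 × 3 = 12 (superscalar), and 4 × 3 × 3 = 36 (combined).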
VLIW
[Figure: a single IF and DE stage feeding three parallel EX → WB units]
DYNAMIC BEHAVIOR: Effect of Dependencies
DEPENDENCIES BETWEEN INSTRUCTIONS
• Data
• Control
• Resource
DATA DEPENDENCY
• Straight-line code
• Loops
STRAIGHT LINE CODE
RAW / true dependency
Load-use:
I1: load r1, a
I2: add r2, r1, r1
Define-use:
I1: mul r1, r4, r5
I2: add r2, r1, r1
STRAIGHT LINE CODE
WAR / false / anti dependency
I1: mul r1, r2, r3
I2: add r2, r4, r5
STRAIGHT LINE CODE
WAW / output dependency
I1: mul r1, r2, r3
I2: add r1, r4, r5
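WAR and WAW are name dependencies rather than value dependencies, so they can be removed by register renaming. A minimal Python sketch (the renamer, register counts, and instruction encoding are invented for illustration):

def rename(instructions, num_arch_regs=8):
    # Map architectural to physical registers; every write gets a fresh
    # physical register, which eliminates WAR and WAW hazards.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    free = [f"p{i}" for i in range(num_arch_regs, 2 * num_arch_regs)]
    renamed = []
    for op, dst, *srcs in instructions:
        srcs = [mapping[s] for s in srcs]  # read current mappings (keeps RAW)
        mapping[dst] = free.pop(0)         # fresh destination register
        renamed.append((op, mapping[dst], *srcs))
    return renamed

# The WAW pair above: both write r1, but after renaming they write different
# physical registers and may complete in any order.
print(rename([("mul", "r1", "r2", "r3"), ("add", "r1", "r4", "r5")]))
# [('mul', 'p8', 'p2', 'p3'), ('add', 'p9', 'p4', 'p5')]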
RECURRENCES
Inter-iteration / loop-carried dependency:
do I = 2, n
  X(I) = A * X(I-1) + B
end do
• First order
• kth order
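Why recurrences resist parallelization, in a minimal Python sketch (not from the slides): each iteration reads the value written by the previous one, so the iterations cannot simply run concurrently.

A, B = 2.0, 1.0
X = [1.0] * 8
for i in range(1, len(X)):   # loop-carried (inter-iteration) RAW dependency
    X[i] = A * X[i - 1] + B  # reads the result of the previous iteration
print(X)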
DATA DEPENDENCY GRAPH
i1: load r1, a
i2: load r2, b
i3: add r3, r2, r1
i4: mul r1, r2, r4
i5: div r1, r2, r4
[Figure: DDG with true (δt) edges i1 → i3 and i2 → i3, an anti (δa) edge i3 → i4, and an output (δo) edge i4 → i5]
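The edge types in such a graph can be derived mechanically. A minimal sketch (the (destination, sources) encoding is invented for illustration):

def classify(i1, i2):
    # i1 and i2 are (destination, set_of_sources) pairs, with i1 first.
    d1, s1 = i1
    d2, s2 = i2
    deps = []
    if d1 in s2: deps.append("RAW (true)")    # i2 reads what i1 wrote
    if d2 in s1: deps.append("WAR (anti)")    # i2 overwrites what i1 read
    if d1 == d2: deps.append("WAW (output)")  # both write the same register
    return deps

print(classify(("r3", {"r2", "r1"}), ("r1", {"r2", "r4"})))  # i3 → i4: ['WAR (anti)']
print(classify(("r1", {"r2", "r4"}), ("r1", {"r2", "r4"})))  # i4 → i5: ['WAW (output)']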
DIFFERENCE BETWEEN DFG & DDG
• A DFG has no control statements
• A DFG shows only RAW dependencies
• Compilers create both the DDG and the DFG
BASIC BLOCK
Straight-line code with a single entry and a single exit; the block ends at the branch.
calc: add r3, r1, r2
      sub r4, r1, r2
      mul r5, r3, r4
      mul r7, r6, r6
      sub r8, r7, r5
      jn negproc
CONTROL DEPENDENCIES
mul r1, r2, r3
jz zproc
sub r4, r7, r1
...
zproc: load r1, x
CONTROL DEPENDENCIES
Branch frequency:
• General-purpose programs: 20-30%
• Scientific/technical programs: 5-10%
Average branch distance:
• 4.6 (a branch every 3rd-6th instruction)
• 9.2 (a branch every 10th-20th instruction)
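These distances track the frequencies (my arithmetic, not on the slide): the average branch distance is roughly the reciprocal of the branch frequency, e.g.

$$ d \approx \frac{1}{f_{branch}}, \qquad f_{branch} \approx 22\% \Rightarrow d \approx 4.6 \text{ instructions} $$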
VLIW
RESOURCE DEPENDENCY
A single non-pipelined division unit:
div r1, r2, r3
div r4, r5, r6
The second divide must wait until the unit is free.