CSC405 Microprocessor
Pentium Processor
1
Pentium Processor – Features
• Wider data Bus : 64 bit data bus
• 32 bit address bus
• Separate data and instruction caches (L1, 8 KB each)
• Parallel Integer Execution Units
• Floating point Unit
• Branch prediction
[email protected] 2
Pentium Processor
Block Diagram
[email protected] 4
Code Cache
2 way set associative
cache.
256 lines code cache and
prefetch buffer, permitting
prefetching of 32 bytes
(256/8) of instructions.
Translation Lookaside
Buffer (TLB) - translate
linear addresses to the
physical addresses.
[email protected] 5
Prefetcher
Instructions are
requested from
code cache by the
prefetcher
If the requested line
is not in cache, a
burst bus cycle is run
to external memory
to perform a cache
line fill
[email protected] 6
Prefetch Buffers
Four prefetch buffers within the
processor works as two
independent pairs. (64 Byte)
When instructions are
prefetched from cache, they
are placed into one set of
prefetch buffers.
The other set is used as &
when a branch operation is
predicted.
Prefetch buffer sends a pair of
instructions to instruction
decoder
[email protected] 7
Branch Target Buffer
Pentium uses Branch Target Buffer(BTB)
for dynamic branch prediction.
BTB is a special cache which stores the
branch instruction that occur in the
instruction stream.
When branch operation is predicted,
BTB requests the predicted branch’s
target addresses from cache, which
are placed in second pair of buffers
that was previously idle.
[email protected] 8
Pentium@ Processor Pipe-line Execution
PF Prefetch
D1 Instruction Decode
D2 Address Generate
EX Execute - ALU and Cache Access
WB Writeback
[email protected] 9
Super Scalar
[email protected] 10
Instruction Decode Unit
Instruction Decode occurs in
two stages – Decode1 (D1)
and Decode2 (D2)
During D1 – opcode is
decoded in both pipelines
- determine whether
instructions can be paired
D2 calculates the address of
memory resident operands
[email protected] 11
Control Unit
This unit interprets the instruction
word and microcode entry point
fed to it by Instruction Decode
Unit
It handles exceptions,
breakpoints and interrupts.
It controls the integer pipelines
and floating point sequences
Microcode ROM :
Stores microcode sequences
[email protected] 12
Address Generators
Pentium provides two
address generators
(one for each
pipeline).
They generate the
address specified by
the instruction in their
respective pipeline.
[email protected] 13
Data Cache
A separate internal Data
Cache holds copies of the
most frequently used data
requested by the two integer
pipelines and Floating Point
Unit.
The internal data cache is an
8KB write-back cache,
organized as two-way set
associative with 32 byte-
lines.
[email protected] 14
Paging Unit
If paging is enabled ,the Paging Unit translates the linear
address from address generator to a physical address.
It can handle two linear addresses at the same time to
support both pipelines.
[email protected] 15
Arithmetic/Logic Units (ALUs)
Two ALUs perform the
arithmetic and logical
operations specified by
their instructions in their
respective pipeline
ALU for the “U” pipeline
can complete the
operation prior to ALU
in the “V” pipeline but
the opposite is not true.
[email protected] 16
Floating Point Unit
Perform floating
point operations.
It can accept up to
two floating point
operations per clock
when one instruction
is an exchange
instruction.
[email protected] 17
BUS UNIT
It provides a physical interface between the Pentium
Processor and the rest of the system.
The Pentium communicates with the outside world via a
32-bit address bus and a 64-bit data bus.
[email protected] 18
BUS UNIT
It consist of following units:
Address Drivers and Receivers :
During bus cycles the address drivers push the address onto the
processor’s local address bus.
The address bus transfers addresses back to the Pentium address
receivers during cache snoop cycles.
Write Buffers:
The Pentium processor provides two write buffers ,one for each of the two
internal execution pipelines.
This architecture improves performance when back-to-back writes occur.
Data Bus Transceivers:
It consists of bidirectional tristate buffers for data.
[email protected] 19
BUS UNIT
Bus Master Control :
In multiprocessor system, request from the Bus Arbiter is handle by this unit.
Bus Control Logic:
This unit generates control signals for the system bus.
Level Two(L2)Cache Control:
To check whether L2 cache is present or not.
Internal Cache Control:
Internal Cache control logic monitors input signal to determine when to
snoop the address bus.
It also ensures proper cache coherency.
Parity Generation and Control:
To assure error free transmission of data
[email protected] 20
Integer Pipeline Stages
In Pentium, there are two instruction pipelines U Pipeline and V
pipeline.
These are five stage pipelines and operate independently.
PF : Prefetch
D1 : Instruction Decode
D2 : Address generate
EX : Execute, Cache and ALU Access
WB : Write back
[email protected] 21
Integer Pipeline Stages
Prefetch(PF) :
Instructions are prefetched from the instruction cache or memory and fed into
the PF stage of both U and V Pipeline.
Decode1(D1):
In this stage, decoder in each pipeline checks if the current pair of instructions
can execute together.
If the instruction contains a prefix byte ,an additional clock cycle is required in
this stage. Also such an instruction may only execute in the U pipeline and may
not be paired with any other instruction.
It decodes the instruction to generate a control word
A single control word causes direct execution of an instruction
Complex instructions require microcode control sequencing
[email protected] 22
Integer Pipeline Stages
Address Generate Stage(D2):
Decodes the control word
Address of memory resident operands are calculated
Execute (EX):
ALU operation
Data cache is accessed at this stage
For both ALU and data cache access require more than one clock.
Write back(WB):
The CPU stores the result and updates the flags
[email protected] 23
Integer Instruction Pairing Rules
In order to issue two instructions simultaneously they must satisfy the
following conditions:
Both instructions in the pair must be "simple"
There must be no read-after-write or write-after-write register
dependencies between them
Register Contention (read-after-write)
mov ax, 4b
mov [bp], ax
[email protected] 24
Integer Instruction Pairing Rules
Register Contention (write-after-write)
Mov ax, 4b
Mov ax, [bp]
Neither instruction may contain both a displacement and an
immediate
Implicit Register contention
If two instructions imply reference to the same register
Exceptions
Flag References - compare and branch
Stack Pointer Reference - PUSHes or POPs
[email protected] 25
Integer Instruction Pairing Rules
What are Simple Instructions?
They are entirely hardwired
They do not require any microcode control
Executes in one clock cycle
Exception: ALU mem, reg and ALU reg, mem are 3
and 2 clock operations respectively(Arithmetic and
logic instructions that use both register and memory
operand)
[email protected] 26
Integer Instruction Pairing Rules
integer instructions are considered simple and may be paired
1. mov reg, reg/mem/imm
2. mov mem, reg/imm
3. alu reg, reg/mem/imm
4. alu mem, reg/imm
5. inc reg/mem
6. dec reg/mem
7. push reg/mem
8. pop reg
9. Lea reg,mem
10. jmp/call/jcc near
11. nop
[email protected] 27
Instruction Issue Algorithm
Decode the two consecutive instructions I1 and I2
If the following are all true
I1 and I2 are simple instructions
I1 is not a jump instruction
Destination of I1 is not a source of I2
Destination of I1 is not a destination of I2
Then issue I1 to u pipeline and I2 to v pipeline
Else issue I1 to u pipeline
[email protected] 28
Floating Point Pipeline
The floating point pipeline has 8 stages as follows:
1. Prefetch(PF) :
Instructions are prefetched from the on-chip instruction cache
2. Instruction Decode(D1):
Two parallel decoders attempt to decode and issue the next two
sequential instructions
It decodes the instruction to generate a control word
A single control word causes direct execution of an instruction
Complex instructions require microcode control sequencing
[email protected] 29
Floating Point Pipeline
3. Address Generate (D2):
Decodes the control word
Address of memory resident operands are calculated
4. Memory and Register Read (Execution Stage) (EX):
Register read, memory read or memory write performed as
required by the instruction to access an operand.
5. Floating Point Execution Stage 1(X1):
Information from register or memory is written into FP register.
Data is converted to floating point format before being loaded into
the floating point unit
[email protected] 30
Floating Point Pipeline
6. Floating Point Execution Stage 2(X2):
Floating point operation performed within floating point unit.
7. Write FP Result (WF):
Floating point results are rounded and the result is written to the
target floating point register.
8. Error Reporting(ER)
If an error is detected, an error reporting stage is entered where
the error is reported and FPU status word is updated
[email protected] 31
Instruction Issue for Floating Point Unit
The rules of how floating-point (FP) instructions get issued on the
Pentium processor are :
1. When a pair of FP instructions is issued to the FPU, only the FXCH
instruction can be the second instruction of the pair.
The first instruction of the pair must be one of a set F where F = [
FLD,FADD, FSUB, FMUL, FDIV, FCOM, FUCOM, FTST, FABS, FCHS].
2. FP instructions other than FXCH and instructions belonging to set F,
always get issued singly to the FPU.
3. FP instructions that are not directly followed by an FXCH instruction
are issued singly to the FPU.
[email protected] 32
Branch Prediction Logic
Other than the Superscalar ability of the Pentium processor, the
branch prediction mechanism is a much improvement.
Predicting the behaviors of branches can have a very strong impact
on the performance of a machine. Since a wrong prediction would
result in a flush of the pipes and wasted cycles.
The branch prediction mechanism is done through a branch target
buffer. The branch target buffer contains the information about all
branches
[email protected] 33
Branch Prediction Logic
BTB
4 way set associative cache with 256 entries
Directory entry of each line :
A valid bit – whether or not the entry is in use
History Bits – how often the branch has been taken before
Memory address of the branch inst. – for identification
Target address of the branch is stored in the corresponding entry
in BTB
[email protected] 34
Branch Target Buffer Branch History Bits
Movement when branch is not taken
Strongly Weakly Weakly Strongly
Taken Taken NOT Taken NOT Taken
(ST) (WT) (WNT) (SNT)
History
11 10 01 00
bits
Movement when branch is taken
[email protected] 35
Branch Prediction Logic
The prediction of whether a jump will occur or no, is based on the
branch’s previous behavior.
There are four possible states that depict a branch’s disposition to
jump:
Stage 0: Very unlikely a jump will occur
Stage 1: Unlikely a jump will occur
Stage 2: Likely a jump will occur
Stage 3: Very likely a jump will occur
[email protected] 36
Branch Prediction Logic
When a branch has its address in the branch target
buffer, its behavior is tracked.
This diagram portrays the four states associated with
branch prediction.
If a branch doesn’t jump two times in a row, it will go
down to State 0.
Once in State 0, the algorithm won’t predict another
jump unless the branch will jump for two consecutive
jumps (so it will go from State 0 to State 2)
Once in State 3, the algorithm won’t predict another
no jump unless the branch is not taken for two
consecutive times.
[email protected] 37
Code Cache
8KB in size
2 way set associative cache
Cache line - 32 bytes wide
Each cache way – 128 cache lines
128 entry directory associated with
each of the cache way
Triple ported
Split line access capability from
prefetcher & snooping
[email protected] 38
Code Cache
Code Cache Directory entry format
20 - bit tag field
One of one mega 4KB pages
within 4GB address space
State – Shared / Invalid
Parity – used to detect errors when
reading each entry
Code Cache line
4 Quad Word info
[email protected] 39
Code Cache
Line Storage Algorithm
Split Line Access
[email protected] 40
Data Cache
8KB, two-way set
associative
4GB of memory
address space is
viewed as 1M
pages each of
which is 4KB size
each page 128
lines containing 32
bytes data
Data cache is banked
on four bytes
(doubleword)
boundaries
[email protected] 41
Data Cache
[email protected] 42
Data Cache Consistency Protocol (MESI Protocol)
State Description
Modified The line in cache has been modified due to a write hit in
the cache. This alerts the cache subsystem to snoop the
system bus and write the modified line back to memory
when a snoop hit to the line is detected
Exclusive Indicates that this cache knows that no other cache in
the system possesses a copy of this line therefore it is
exclusive to this line
Shared Indicates that the line may be present in several caches
and if so that an exact duplicate of the information exists
in each source
Invalid The initial state after reset indicating that the line is not
present in the cache
[email protected] 43
Data Cache Consistency Protocol (MESI Protocol)
[email protected] 44
[email protected] 45