Multiple Instruction Issue
and
Hardware Based Speculation
Soner Önder
Michigan Technological University, Houghton MI
www.cs.mtu.edu/~soner
Hardware Based Speculation 2
•Exploiting more ILP requires that we overcome the
limitation of control dependence:
With branch prediction we allowed the processor to continue
issuing instructions past a branch based on a prediction:
Those fetched instructions do not modify the processor state.
These instructions are squashed if prediction is incorrect.
We now allow the processor to execute these instructions
before we know if it is ok to execute them:
We need to correctly restore the processor state if such an
instruction should not have been executed.
We need to pass the results from these instructions to future
instructions as if the program is just following that path.
Hardware Based Speculation 3
•Assume the processor predicts B1 to be taken and executes.
•What will happen if the prediction was wrong?
•What value of each variable should be used if the processor
predicts B1 and B2 taken and executes instructions along the way?
Control flow graph:
B1: x < y?
  N: A=b+c; C=c-1
  T: C=0; A=0
B2: x < z?
  N: B=b+1; A=a+1
  T: C=a
D=a+b+c
….
Use d
Hardware Based Speculation 4
•In order to execute instructions speculatively, we
need to provide the means:
To roll back the values of both registers and the memory to their
correct values upon a misprediction,
To communicate speculatively calculated values to the new uses
of those values.
•Both can be provided by using a simple structure
called the Reorder Buffer (ROB).
Reorder Buffer 5
•It is a simple circular array with a head and a tail
pointer:
A new instruction is allocated a position at the tail in program
order.
Each entry provides a location for storing the instruction’s result.
New instructions look for their source values by searching
backward from the tail.
When the instruction at the head completes and becomes
non-speculative, its value is committed and the instruction is
removed from the buffer.
[Figure: circular buffer with Tail and Head pointers]
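The circular array described above can be sketched in a few lines of Python. This is a minimal illustration, not a hardware description: the class and method names (ReorderBuffer, allocate, lookup, commit) and the dictionary used as a register file are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    instr: str                   # instruction text, kept for display only
    dest: str                    # destination register name
    value: Optional[int] = None  # result; None until execution completes
    done: bool = False


class ReorderBuffer:
    def __init__(self, size: int) -> None:
        self.slots: List[Optional[Entry]] = [None] * size
        self.head = 0   # oldest instruction in flight
        self.tail = 0   # next free position
        self.count = 0

    def allocate(self, instr: str, dest: str) -> int:
        """Place a new instruction at the tail, in program order."""
        if self.count == len(self.slots):
            raise RuntimeError("reorder buffer full")
        slot = self.tail
        self.slots[slot] = Entry(instr, dest)
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return slot

    def write_result(self, slot: int, value: int) -> None:
        entry = self.slots[slot]
        entry.value, entry.done = value, True

    def lookup(self, reg: str) -> Optional[int]:
        """Search from the tail backward for the youngest producer of reg."""
        for i in range(1, self.count + 1):
            slot = (self.tail - i) % len(self.slots)
            if self.slots[slot].dest == reg:
                return slot
        return None   # no producer in flight: read the register file

    def commit(self, regfile: Dict[str, int]) -> bool:
        """Retire the head entry if its result is present."""
        if self.count == 0 or not self.slots[self.head].done:
            return False
        entry = self.slots[self.head]
        regfile[entry.dest] = entry.value
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return True
```

Searching backward from the tail returns the most recent in-flight writer of a register, which is the value a newly issued instruction should consume. Commit only ever touches the head, so the register file is updated in program order.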
Reorder Buffer 6
3 fields: instr, destination, value
Reorder buffer can be an operand source => more registers, like
the reservation stations (RS)
Use the reorder buffer number instead of the reservation station
number to tag the result when execution completes
Supplies operands between execution complete & commit
Once an instruction commits, its result is put into the register
Instructions commit in program order
As a result, it's easy to undo speculated instructions
on mispredicted branches
or on exceptions
Steps of Speculative Tomasulo Algorithm
7
1. Issue [get instruction from FP Op Queue]
• Check if the reorder buffer is full.
• Check if a reservation station is available.
• Access the register file and the reorder buffer for the current
values of the source operands.
• Send the instruction, its reorder buffer slot number and the
source operands to the reservation station.
Once issued, the instruction stays in the reservation
station until it gets both operands.
Steps of Speculative Tomasulo Algorithm
8
2. Execute [operate on operands (EX)]
When both operands ready and a functional
unit is available, the instruction executes.
This step resolves RAW hazards: as long as operands are
not ready, the reservation station watches the CDB for results.
Steps of Speculative Tomasulo Algorithm
9
3. Write result [finish execution (WB)]
Write on Common Data Bus to all awaiting FUs and
the reorder buffer; mark reservation station available.
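The interaction between the execute and write-result steps can be sketched as follows. This is a simplified model under stated assumptions: each operand is either a value or a tag naming the reorder buffer slot of its producer, and the names ReservationStation, snoop, and broadcast are hypothetical, chosen only for this illustration.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Tuple

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass
class ReservationStation:
    op: str
    rob_slot: int                     # tag under which the result is broadcast
    operands: List[Tuple[str, int]]   # ("val", v) or ("tag", producer ROB slot)

    def ready(self) -> bool:
        """Both operands have arrived, so no RAW hazard remains."""
        return all(kind == "val" for kind, _ in self.operands)

    def snoop(self, tag: int, value: int) -> None:
        """Watch the CDB: capture the value if we are waiting on this tag."""
        self.operands = [
            ("val", value) if (kind, x) == ("tag", tag) else (kind, x)
            for kind, x in self.operands
        ]

    def execute(self) -> Tuple[int, int]:
        a, b = (x for _, x in self.operands)
        return self.rob_slot, OPS[self.op](a, b)


def broadcast(tag: int, value: int,
              stations: List[ReservationStation],
              rob_values: Dict[int, int]) -> None:
    """Write result: one CDB transfer reaches the ROB and all waiting stations."""
    rob_values[tag] = value
    for rs in stations:
        rs.snoop(tag, value)
```

For example, a multiply waiting on reorder buffer slot 0 becomes ready only after the add that owns slot 0 executes and its result is broadcast.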
Steps of Speculative Tomasulo Algorithm
10
4. Commit [update register file with reorder result]
When the instruction reaches the head of the reorder buffer,
its result is present, and
no exception is associated with the instruction,
the instruction becomes non-speculative:
Update register file with result (or store to memory)
Remove the instruction from the reorder buffer.
A mispredicted branch flushes the reorder buffer.
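The commit step can be sketched as a loop over the head of the buffer. This is a minimal sketch with several assumptions: the reorder buffer is modelled as a deque whose left end is the head, the entry fields (kind, done, exception, mispredicted) are hypothetical, and an exception at the head is handled by simply discarding the buffer contents.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class RobEntry:
    kind: str                      # "alu", "store", or "branch"
    dest: Optional[object] = None  # register name, or memory address for a store
    value: Optional[int] = None
    done: bool = False             # result is present
    exception: bool = False
    mispredicted: bool = False     # meaningful for branches only


def commit(rob: Deque[RobEntry], regs: Dict[str, int], mem: Dict[int, int]) -> str:
    """Retire entries from the head in order; report why retirement stopped."""
    while rob:
        head = rob[0]
        if not head.done:
            return "waiting"       # head has no result yet; nothing younger may commit
        if head.exception:
            rob.clear()            # assumption: discard all speculative work
            return "exception"
        rob.popleft()
        if head.kind == "branch":
            if head.mispredicted:
                rob.clear()        # everything younger is on the wrong path
                return "flush"
        elif head.kind == "store":
            mem[head.dest] = head.value    # memory changes only at commit
        else:
            regs[head.dest] = head.value   # register file changes only at commit
    return "empty"
```

Because registers and memory are written only here, flushing the buffer is enough to undo every instruction that followed a mispredicted branch.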
MIPS FP Unit 11
Renaming Registers 12
Common variation of speculative design
Reorder buffer keeps instruction information
but not the result
Extend register file with extra
renaming registers to hold speculative results
Rename register allocated at issue;
result into rename register on execution complete;
rename register into real register on commit
Operands read either from register file
(real or speculative) or via Common Data Bus
Advantage: operands are always from single source
(extended register file)
Renaming Registers 13
1. Index a MAP table using the source register identifiers to
get the physical register number.
2. Get the previous physical register number for the
destination register.
3. Allocate a free physical register and modify the MAP
table by indexing it with the destination register identifier.
4. When the instruction commits, return the previous physical
register to the pool.
[Figure: map table with entries 0 to 31; physical registers 0 to 127]
Renaming Registers 14
Map table (logical -> physical): 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7
Free physical registers: 9, 10, 22, 13, 17
Code sequence
R7=r4+r3
R6=r2+r6
R3=r6+r7
R6=r6+10
Renaming Registers 15
Map table (logical -> physical): 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7
Free physical registers: 9, 10, 22, 13, 17
Code sequence    Renamed code sequence
R7=r4+r3
R6=r2+r6
R3=r6+r7
R6=r6+10
Renaming Registers 16
Map table (logical -> physical): 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->9
Free physical registers: 10, 22, 13, 17
Code sequence    Renamed code sequence    Previous dest
R7=r4+r3         R9=r4+r3                 R7
R6=r2+r6
R3=r6+r7
R6=r6+10
Renaming Registers 17
Map table (logical -> physical): 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->10, 7->9
Free physical registers: 22, 13, 17
Code sequence    Renamed code sequence    Previous dest
R7=r4+r3         R9=r4+r3                 R7
R6=r2+r6         R10=r2+r6                R6
R3=r6+r7
R6=r6+10
Renaming Registers 18
Map table (logical -> physical): 0->0, 1->1, 2->2, 3->22, 4->4, 5->5, 6->10, 7->9
Free physical registers: 13, 17
Code sequence    Renamed code sequence    Previous dest
R7=r4+r3         R9=r4+r3                 R7
R6=r2+r6         R10=r2+r6                R6
R3=r6+r7         R22=r10+r9               R3
R6=r6+10
Renaming Registers 19
Map table (logical -> physical): 0->0, 1->1, 2->2, 3->22, 4->4, 5->5, 6->13, 7->9
Free physical registers: 17
Code sequence    Renamed code sequence    Previous dest
R7=r4+r3         R9=r4+r3                 R7
R6=r2+r6         R10=r2+r6                R6
R3=r6+r7         R22=r10+r9               R3
R6=r6+10         R13=r10+10               R10
Renaming Registers 20
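Map table (logical -> physical): 0->0, 1->1, 2->2, 3->22, 4->4, 5->5, 6->13, 7->9
Free physical registers: 17
Code sequence    Renamed code sequence    Previous dest
R7=r4+r3         R9=r4+r3                 R7
R6=r2+r6         R10=r2+r6                R6
R3=r6+r7         R22=r10+r9               R3
R6=r6+10         R13=r10+10               R10
Physical register 10 is returned to the free pool when R13=r10+10
retires.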
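The renaming walk-through can be reproduced with a short Python sketch. It follows the first three map-table steps (look up sources, remember the previous destination mapping, allocate a new physical register). The function name rename, the textual instruction format, and the use of a deque as the free pool are assumptions made for this example.

```python
import re
from collections import deque
from typing import Deque, Dict, List, Tuple


def rename(code: List[str], map_table: Dict[int, int],
           free: Deque[int]) -> List[Tuple[str, int]]:
    """Rename 'rD=expr' instructions; return (renamed text, previous physical dest)."""
    renamed = []
    for line in code:
        dest, expr = line.lower().replace(" ", "").split("=")
        # Step 1: translate every source register through the map table.
        new_expr = re.sub(r"r(\d+)",
                          lambda m: f"r{map_table[int(m.group(1))]}", expr)
        d = int(dest[1:])
        prev = map_table[d]       # Step 2: previous physical register of the dest
        new = free.popleft()      # Step 3: allocate a free physical register
        map_table[d] = new
        renamed.append((f"r{new}={new_expr}", prev))
    return renamed


if __name__ == "__main__":
    table = {i: i for i in range(8)}
    pool = deque([9, 10, 22, 13, 17])
    for text, prev in rename(["r7=r4+r3", "r6=r2+r6", "r3=r6+r7", "r6=r6+10"],
                             table, pool):
        print(text, "previous dest:", prev)
```

Step 4 corresponds to appending each recorded previous register back onto the pool when its instruction commits. For instance, physical register 10 would be returned when the renamed fourth instruction retires.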
Limits to ILP 21
Assumptions for ideal/perfect machine to start:
1. Register renaming–infinite virtual registers and all
WAW & WAR hazards are avoided
2. Branch prediction–perfect; no mispredictions
3. Jump prediction–all jumps perfectly predicted =>
machine with perfect speculation & an unbounded buffer of
instructions available
4. Memory-address alias analysis–addresses are known &
a load can be moved before a store provided the addresses
are not equal
1 cycle latency for all instructions; unlimited number of
instructions issued per clock cycle
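Under these assumptions only true (RAW) dependences limit performance, so the achievable IPC of a trace is its length divided by the length of its longest dependence chain. The sketch below computes that bound for a register-only trace. The function name ideal_ipc and the (dest, sources) trace format are assumptions for illustration, and memory dependences are left out.

```python
from typing import Dict, List, Tuple


def ideal_ipc(trace: List[Tuple[str, List[str]]]) -> float:
    """Dataflow-limit IPC: unit latency, unlimited issue, perfect renaming.

    Each trace item is (destination register, list of source registers),
    given in program order.
    """
    ready: Dict[str, int] = {}   # register -> cycle that produces its latest value
    last_cycle = 0
    for dest, srcs in trace:
        # An instruction executes one cycle after its latest-arriving source.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle      # younger readers of dest depend on this write
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if trace else 0.0


if __name__ == "__main__":
    # r7=r4+r3; r6=r2+r6; r3=r6+r7; r6=r6+10
    trace = [("r7", ["r4", "r3"]), ("r6", ["r2", "r6"]),
             ("r3", ["r6", "r7"]), ("r6", ["r6"])]
    print(ideal_ipc(trace))
```

On the four-instruction sequence used in the renaming example, the first two instructions are independent and the last two each depend only on earlier results, so the trace finishes in two cycles.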
Upper Limit to ILP: Ideal Machine 22
[Figure: instruction issues per cycle (IPC) on the ideal machine, by program]
Program     IPC
gcc         54.8
espresso    62.6
li          17.9
fpppp       75.2
doduc       118.7
tomcatv     150.1
Integer: 18 - 60
FP: 75 - 150
More Realistic HW: Branch Impact
23
Change from an infinite window to a window of 2000 instructions
to examine, and a maximum issue of 64 instructions per clock cycle
[Figure: instruction issues per cycle (IPC) by program (gcc, espresso,
li, fpppp, doduc, tomcatv) for each branch predictor: Perfect,
Selective predictor, Standard 2-bit, Static, None]
Integer: 6 - 12
FP: 15 - 45
More Realistic HW: Register Impact
24
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction
[Figure: instruction issues per cycle (IPC) by program (gcc, espresso,
li, fpppp, doduc, tomcatv) for each number of renaming registers:
Infinite, 256, 128, 64, 32, None]
Integer: 5 - 15
FP: 11 - 45
More Realistic HW: Alias Impact
25
Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction,
256 renaming registers
[Figure: instruction issues per cycle (IPC) by program (gcc, espresso,
li, fpppp, doduc, tomcatv) for each alias analysis model: Perfect,
Global/stack perfect, Inspection, None]
Integer: 4 - 9
FP: 4 - 45 (Fortran, no heap)
Realistic HW for ‘9X: Window Impact
26
Perfect disambiguation (HW), 1K Selective Prediction, 16 entry
return, 64 registers, issue as many as window
[Figure: instruction issues per cycle (IPC) by program (gcc, espresso,
li, fpppp, doduc, tomcatv) for each window size: Infinite, 256, 128,
64, 32, 16, 8, 4]
Integer: 6 - 12
FP: 8 - 45