Computer Architecture: Assoc. Prof. Nguyễn Trí Thành, PhD
2
Structural Hazards - Reminder
Structural hazard: the hardware is inadequate to support all instructions in the pipeline simultaneously in the same clock cycle
E.g., suppose the pipeline below has a single memory (not separate instruction and data memories) with one read port;
then there is a structural hazard between the first and fourth lw instructions
[Figure: pipelined execution (time in ns: 2, 4, 6, ..., 14) of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0); lw $4, 400($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg stages, 2 ns per stage, with a new instruction starting every 2 ns. With a single memory, the instruction fetch of the fourth lw falls in the same cycle as the data access of the first lw: hazard if single memory.]
Loop / sequential code example:
lw   $t2, 0($t3)
lw   $t3, 4($t3)
beq  $t2, $t3, Label    # assume not equal
add  $t5, $t2, $t3
sw   $t5, 8($t3)
Label: j 32
11
Taking Advantage of Locality
Memory hierarchy
Store everything on disk
Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  Main memory
Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  Cache memory attached to CPU
12
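As a software-level illustration of taking advantage of locality, here is a minimal C sketch (the array size and names are arbitrary): traversing a 2-D array row by row touches consecutive addresses, so after one miss brings a block into the cache the neighbouring elements hit (spatial locality), while column-by-column traversal keeps jumping to new blocks.

#include <stdio.h>

#define N 1024

static int a[N][N];   /* C stores this row-major: a[i][0..N-1] are contiguous */

int main(void) {
    long sum = 0;

    /* Row-by-row traversal: consecutive addresses, good spatial locality. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-by-column traversal: successive accesses are N*sizeof(int)
       bytes apart, so each one tends to land in a different cache block. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%ld\n", sum);
    return 0;
}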
Cache position
13
Intel
14
Memory Hierarchy Levels
Block (aka line): unit of copying
May be multiple words
If accessed data is present in upper level
  Hit: access satisfied by upper level
  Hit ratio: hits/accesses
If accessed data is absent
  Miss: block copied from lower level
  Time taken: miss penalty
  Miss ratio: misses/accesses = 1 – hit ratio
  Then accessed data supplied from upper level
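A minimal C sketch of these definitions (the counter values are made up for illustration): hit ratio = hits/accesses and miss ratio = misses/accesses = 1 – hit ratio.

#include <stdio.h>

struct cache_stats { unsigned long hits, misses; };

int main(void) {
    struct cache_stats s = { 980, 20 };          /* example counts */
    unsigned long accesses = s.hits + s.misses;
    double hit_ratio  = (double)s.hits / accesses;
    double miss_ratio = 1.0 - hit_ratio;         /* = misses/accesses */
    printf("hit ratio = %.3f, miss ratio = %.3f\n", hit_ratio, miss_ratio);
    return 0;
}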
Direct Mapped Cache
Location determined by address
Direct mapped: only one choice
(Block address) modulo (#Blocks in cache)
#Blocks is a power of 2
Use low-order address bits
16
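A minimal C sketch of the direct-mapped placement rule (the constants are illustrative): for a power-of-two number of blocks, (block address) modulo (#Blocks) equals keeping the low-order address bits, which is why the index can be taken directly from the address.

#include <stdio.h>
#include <assert.h>

#define NUM_BLOCKS 8u   /* must be a power of 2 for the bit-mask shortcut */

int main(void) {
    for (unsigned block_addr = 0; block_addr < 32; block_addr++) {
        unsigned index_mod  = block_addr % NUM_BLOCKS;        /* (block address) modulo (#Blocks) */
        unsigned index_bits = block_addr & (NUM_BLOCKS - 1);  /* keep low-order log2(#Blocks) bits */
        assert(index_mod == index_bits);
        printf("block %2u -> cache index %u\n", block_addr, index_mod);
    }
    return 0;
}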
§5.2 The Basics of Caches
Cache Memory
Cache memory: the level of the memory hierarchy closest to the CPU
Given accesses X1, …, Xn–1, Xn
How do we know if the data is present?
Where do we look?
17
Tags and Valid Bits
18
Cache Example
8-blocks, 1 word/block, direct mapped
Initial state
[Figure: addresses 0–15 grouped into blocks of four: addresses 0–3, 4–7, 8–11 and 12–15 form blocks 0–3, and within each block the block offset runs 0, 1, 2, 3.]
The program is divided into chunks of the same size as the cache block
BlockIndex = x / block_size
BlockOffset = x mod block_size
For example: x=14
• BlockIndex = 14 / 4 = 3
• BlockOffset = 14 mod 4 = 2
Map: use the BlockIndex to calculate which block in the cache the chunk maps to
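A minimal C sketch of this mapping using the numbers above (x = 14, block_size = 4); the final modulo step assumes the 8-block direct-mapped cache from the earlier slide:

#include <stdio.h>

int main(void) {
    unsigned x = 14, block_size = 4, cache_blocks = 8;
    unsigned block_index  = x / block_size;              /* 14 / 4  = 3 */
    unsigned block_offset = x % block_size;              /* 14 mod 4 = 2 */
    unsigned cache_block  = block_index % cache_blocks;  /* direct-mapped placement */
    printf("BlockIndex=%u BlockOffset=%u -> cache block %u\n",
           block_index, block_offset, cache_block);
    return 0;
}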
Address Subdivision
26
Example: Larger Block Size
64 blocks, 16 bytes/block, 32-bit addresses
To what block number does address 1200 map?
Offset=?
Index=?
Tag=?
Address fields: Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits), Offset = bits 3–0 (4 bits)
27
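A minimal C sketch of the answer under the 22/6/4 field split above: byte address 1200 has block address 1200/16 = 75, which maps to cache block (index) 75 mod 64 = 11, with offset 0 and tag 1.

#include <stdio.h>

int main(void) {
    unsigned addr = 1200;
    unsigned offset = addr & 0xF;           /* bits 3-0:   1200 mod 16      = 0  */
    unsigned index  = (addr >> 4) & 0x3F;   /* bits 9-4:   (1200/16) mod 64 = 11 */
    unsigned tag    = addr >> 10;           /* bits 31-10: 1200 / 1024      = 1  */
    printf("offset=%u index=%u tag=%u\n", offset, index, tag);
    return 0;
}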
Cache usage protocol
28
Example
Suppose a direct-mapped cache of 16 KB with slot size = 32 bytes and 32-bit addresses. Identify the bits used for tag, index and offset (the 16 KB counts data only; ignore the tag and valid-bit storage).
Given a direct-mapped cache with 64 slots, slot size = 32 bytes and 16-bit addresses, show hit/miss for the reference sequence of instruction addresses 184, 188, 192, 196, 200, 204, 208, 212, 216, 192, 196, 200, 204 (see the simulation sketch below).
29
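A minimal C sketch of a direct-mapped cache simulation for the second exercise (addresses and configuration are taken from the slide; with 64 slots of 32 bytes the index is (address/32) mod 64 and the tag is the remaining high bits):

#include <stdio.h>
#include <string.h>

#define SLOTS     64
#define SLOT_SIZE 32   /* bytes: 5 offset bits, 6 index bits */

int main(void) {
    unsigned addrs[] = {184, 188, 192, 196, 200, 204, 208, 212, 216,
                        192, 196, 200, 204};
    unsigned tags[SLOTS];
    int valid[SLOTS];
    memset(valid, 0, sizeof valid);

    for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++) {
        unsigned a     = addrs[i];
        unsigned index = (a / SLOT_SIZE) % SLOTS;   /* which slot          */
        unsigned tag   = a / (SLOT_SIZE * SLOTS);   /* remaining high bits */
        int hit = valid[index] && tags[index] == tag;
        printf("addr %3u -> index %2u: %s\n", a, index, hit ? "hit" : "miss");
        valid[index] = 1;
        tags[index]  = tag;
    }
    return 0;
}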
Associative Caches
Fully associative
Allow a given block to go in any cache entry
Requires all entries to be searched at once
Comparator per entry (expensive)
n-way set associative
Each set contains n entries
Block number determines which set
(Block number) modulo (#Sets in cache)
Search all entries in a given set at once
n comparators (less expensive)
30
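A minimal C sketch of the n-way lookup just described (the sizes and names are illustrative, not a specific design): the block number selects a set, and every entry in that set is compared against the tag; hardware would do the n comparisons in parallel.

#include <stdio.h>
#include <stdbool.h>

#define NUM_SETS 64
#define WAYS      4    /* n-way set associative */

struct entry { bool valid; unsigned tag; };
struct cache { struct entry sets[NUM_SETS][WAYS]; };

/* Set index = (block number) modulo (#Sets); tag = remaining high bits. */
static bool lookup(const struct cache *c, unsigned block_number) {
    unsigned set = block_number % NUM_SETS;
    unsigned tag = block_number / NUM_SETS;
    for (int way = 0; way < WAYS; way++)
        if (c->sets[set][way].valid && c->sets[set][way].tag == tag)
            return true;
    return false;
}

int main(void) {
    static struct cache c;   /* static storage is zero-initialized, so all entries are invalid */
    printf("block 100 -> %s\n", lookup(&c, 100) ? "hit" : "miss");
    return 0;
}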
Associative Cache Example
31
Spectrum of Associativity
For a cache with 8 entries
32
Associativity Example
Compare 4-block caches
Direct mapped, 2-way set associative,
fully associative
Block access sequence: 0, 8, 0, 6, 8
Direct mapped
Block address   Cache index   Hit/miss   Cache content after access
      0              0          miss     index 0: Mem[0]
      8              0          miss     index 0: Mem[8]
      0              0          miss     index 0: Mem[0]
      6              2          miss     index 0: Mem[0], index 2: Mem[6]
      8              0          miss     index 0: Mem[8], index 2: Mem[6]
33
Associativity Example
2-way set associative
Block address   Cache index   Hit/miss   Cache content after access
      0              0          miss     set 0: Mem[0]
      8              0          miss     set 0: Mem[0], Mem[8]
      0              0          hit      set 0: Mem[0], Mem[8]
      6              0          miss     set 0: Mem[0], Mem[6]
      8              0          miss     set 0: Mem[8], Mem[6]
Fully associative
Block address   Hit/miss   Cache content after access
      0           miss     Mem[0]
      8           miss     Mem[0], Mem[8]
      0           hit      Mem[0], Mem[8]
      6           miss     Mem[0], Mem[8], Mem[6]
      8           hit      Mem[0], Mem[8], Mem[6]
34
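A minimal C sketch that replays the 2-way case above (4 blocks arranged as 2 sets of 2 ways; LRU replacement within a set is assumed here, replacement policies are discussed later), reproducing the miss, miss, hit, miss, miss pattern of the table:

#include <stdio.h>

#define SETS 2
#define WAYS 2

struct way { int valid; unsigned block; unsigned age; };

int main(void) {
    unsigned seq[] = {0, 8, 0, 6, 8};
    struct way cache[SETS][WAYS] = {0};
    unsigned now = 0;

    for (size_t i = 0; i < sizeof seq / sizeof seq[0]; i++) {
        unsigned b = seq[i], set = b % SETS;
        int hit_way = -1, victim = 0;

        for (int w = 0; w < WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].block == b)
                hit_way = w;
            /* prefer an invalid way; otherwise pick the least recently used one */
            if (!cache[set][w].valid ||
                (cache[set][victim].valid && cache[set][w].age < cache[set][victim].age))
                victim = w;
        }
        if (hit_way < 0) {                     /* miss: fill or replace the victim */
            hit_way = victim;
            cache[set][hit_way].valid = 1;
            cache[set][hit_way].block = b;
            printf("block %u -> set %u: miss\n", b, set);
        } else {
            printf("block %u -> set %u: hit\n", b, set);
        }
        cache[set][hit_way].age = ++now;       /* mark as most recently used */
    }
    return 0;
}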
How Much Associativity
35
Set Associative Cache Organization
36
Exercise
37
Replacement Policy
40
Write-Through
On data-write hit, could just update the block in cache
But then cache and memory would be inconsistent
Write through: also update memory
But makes writes take longer
e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
Effective CPI = 1 + 0.1×100 = 11
Solution: write buffer
Holds data waiting to be written to memory
CPU continues immediately
Only stalls on write if write buffer is already full
41
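A minimal C sketch of the arithmetic above (values from the slide): without a write buffer every store stalls for the full memory write, giving the effective CPI of 11; with a write buffer the store is queued and the CPU continues, so the stall is paid only when the buffer is already full.

#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;
    double store_frac = 0.10;    /* 10% of instructions are stores */
    double write_cost = 100.0;   /* cycles for a write to memory   */

    /* Write-through with no write buffer: every store waits for memory. */
    double effective_cpi = base_cpi + store_frac * write_cost;
    printf("effective CPI without write buffer = %.0f\n", effective_cpi);  /* 11 */

    /* With a (rarely full) write buffer the 100-cycle write is overlapped
       with execution, so the effective CPI stays close to the base CPI.   */
    return 0;
}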
Write-Back
42
Write Allocation
43
Main Memory Supporting Caches
Use DRAMs for main memory
Fixed width (e.g., 1 word)
Connected by fixed-width clocked bus
Bus clock is typically slower than CPU clock
Example cache block read
1 bus cycle for address transfer
15 bus cycles per DRAM access
1 bus cycle per data transfer
For 4-word block, 1-word-wide DRAM
Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
44
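A minimal C sketch of the miss-penalty and bandwidth arithmetic for the 1-word-wide DRAM case above (parameter names are illustrative):

#include <stdio.h>

int main(void) {
    int addr_cycles     = 1;   /* bus cycles to send the address     */
    int dram_cycles     = 15;  /* bus cycles per DRAM access         */
    int transfer_cycles = 1;   /* bus cycles to transfer one word    */
    int words_per_block = 4;   /* 4-word block, 1-word-wide DRAM/bus */

    int miss_penalty = addr_cycles
                     + words_per_block * (dram_cycles + transfer_cycles);
    double bandwidth = (double)(4 * words_per_block) / miss_penalty;  /* bytes per bus cycle */

    printf("miss penalty = %d bus cycles\n", miss_penalty);   /* 1 + 4x15 + 4x1 = 65 */
    printf("bandwidth    = %.2f B/cycle\n", bandwidth);       /* 16/65 ~ 0.25        */
    return 0;
}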
Increasing Memory Bandwidth
48
Multilevel Cache Example
Given
CPU base CPI = 1, clock rate = 4GHz
Miss rate/instruction = 2%
Main memory access time = 100ns
With just primary cache
CPU clock cycle = 1/(4×10^9) s = 0.25 ns
Miss penalty = 100ns/0.25ns = 400 cycles
Effective CPI = 1 + 0.02 × 400 = 9
49
Example (cont.)
Now add L-2 cache
Access time = 5ns
Global miss rate to main memory = 0.5%
Primary miss with L-2 hit
Penalty = 5ns/0.25ns = 20 cycles
Primary miss with L-2 miss
Extra penalty = 400 cycles (the 100 ns main memory access)
CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
Performance ratio = 9/3.4 ≈ 2.6
50
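A minimal C sketch of the two-level CPI arithmetic (all values taken from the example above):

#include <stdio.h>

int main(void) {
    double base_cpi     = 1.0;
    double clock_ns     = 0.25;    /* 4 GHz clock cycle                 */
    double mem_ns       = 100.0;   /* main memory access time           */
    double l1_miss_rate = 0.02;    /* primary miss rate per instruction */
    double l2_ns        = 5.0;     /* L-2 access time                   */
    double l2_miss_rate = 0.005;   /* global miss rate to main memory   */

    double mem_cycles = mem_ns / clock_ns;   /* 400 cycles */
    double l2_cycles  = l2_ns  / clock_ns;   /*  20 cycles */

    double cpi_l1_only = base_cpi + l1_miss_rate * mem_cycles;
    double cpi_l1_l2   = base_cpi + l1_miss_rate * l2_cycles
                                  + l2_miss_rate * mem_cycles;

    printf("CPI with primary cache only = %.1f\n", cpi_l1_only);              /* 9.0 */
    printf("CPI with L-2 cache added    = %.1f\n", cpi_l1_l2);                /* 3.4 */
    printf("performance ratio           = %.1f\n", cpi_l1_only / cpi_l1_l2);  /* 2.6 */
    return 0;
}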
Performance Summary