EE282 Lecture 4
Advanced Caching (2)
Jacob Leverich
http://eeclass.stanford.edu/ee282
EE282 – Spring 2011 – Lecture 04
Announcements
HW1 out
Due Wed 4/20 @ 5pm, box outside Gates 305
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Advanced Cache Optimizations
Multi-level caches and inclusion
Victim caches
Pseudo-associative caches
Skew-associative caches
Critical word first
Non-blocking caches
Prefetching
Multi-ported caches
Readings: H&P 5.1-2 and 4.2
Read on your own about way prediction, pipelined caches, merging
write buffers, compiler optimizations
Non-blocking or Lockup-Free Caches
Basic idea
Allow for hits while serving a miss (hit-under-miss)
Allow for more than one outstanding miss (miss-under-miss)
When does it make sense (for L1, L2, …)
When the processor can handle >1 pending load/store
This is the case with superscalar processors
When the cache serves >1 processor or other cache
When the lower level allows for multiple pending accesses
More on this later
What is difficult about non-blocking caches
Handling multiple misses at the same time
Handling loads to pending misses
Handling stores to pending misses
Potential of Non-blocking Caches
[Figure: three execution timelines.
Blocking cache: the CPU stalls on a miss for the full miss penalty.
Hit under miss: the CPU keeps executing and serving hits during the miss penalty, stalling only when the missing result is needed.
Multiple outstanding misses: several miss penalties overlap, hiding even more latency.]
Miss Status Handling Register
Keeps track of
Outstanding cache misses
Pending loads & stores that refer to that cache block
Fields of an MSHR
Valid bit
Cache block address
Must support associative search
Issued bit (1 if the request was already issued to memory)
For each pending load or store
Valid bit
Type (load/store) and format (byte/halfword/…)
Block offset
Destination register for load OR store buffer entry for stores
MSHR
[MSHR entry layout (field widths in bits):
Valid (1) | Block Address (27) | Issued (1)
plus four load/store entries, each with:
Valid (1) | Type (3) | Block Offset (5) | Destination (5)]
Non-blocking Caches: Operation
On a cache miss:
Search MSHRs for a pending access to the same cache block
If found, just allocate a new load/store entry
If not, allocate a free MSHR
Update block address and first load/store entry
If no MSHR or load/store entry is free, stall
When one word/sub-block for a cache line becomes available:
Check which loads/stores are waiting for it
Forward data to the LSU
Mark those loads/stores as invalid
Write the word into the cache
When the last word for a cache line is available:
Mark the MSHR as invalid
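The bookkeeping above can be made concrete with a short C sketch. This is a minimal illustration, not any particular processor's design: the table sizes, field widths, and the handle_miss() helper are all assumptions.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS        8   /* assumed sizes, purely illustrative */
#define TARGETS_PER_MSHR 4

typedef struct {
    bool    valid;
    bool    is_store;       /* type: load or store */
    uint8_t block_offset;   /* which word within the block */
    uint8_t dest;           /* destination register, or store buffer entry */
} Target;

typedef struct {
    bool     valid;
    bool     issued;        /* request already sent to memory? */
    uint32_t block_addr;
    Target   targets[TARGETS_PER_MSHR];
} MSHR;

static MSHR mshrs[NUM_MSHRS];

/* Returns true if the miss was recorded, false if the cache must stall. */
bool handle_miss(uint32_t block_addr, bool is_store,
                 uint8_t offset, uint8_t dest)
{
    MSHR *free_mshr = NULL;

    /* Associative search for a pending miss to the same cache block. */
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            if (!free_mshr) free_mshr = &mshrs[i];
            continue;
        }
        if (mshrs[i].block_addr == block_addr) {
            /* Merge: just allocate a new load/store target entry. */
            for (int t = 0; t < TARGETS_PER_MSHR; t++) {
                if (!mshrs[i].targets[t].valid) {
                    mshrs[i].targets[t] = (Target){true, is_store, offset, dest};
                    return true;
                }
            }
            return false;   /* no load/store entry free: stall */
        }
    }

    if (!free_mshr) return false;   /* no MSHR free: stall */

    /* New miss: allocate an MSHR and record the first target. */
    *free_mshr = (MSHR){ .valid = true, .issued = false,
                         .block_addr = block_addr };
    free_mshr->targets[0] = (Target){true, is_store, offset, dest};
    return true;
}

A real MSHR file performs the associative search in parallel in hardware; the loops here just model that lookup.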
Non-blocking Cache Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Non-blocking cache
Prefetching
Idea: fetch data into the cache before processors
request them
Can address cold misses
Can be done by the programmer, compiler, or hardware
Characteristics of ideal prefetching
You only prefetch data that are truly needed
Avoid bandwidth waste
You issue prefetch requests early enough
To hide the memory latency
You don’t issue prefetch requests too early
To avoid cache pollution
Software Prefetching

for (i=0; i<N; i++) {
  __prefetch(a[i+8]);
  __prefetch(b[i+8]);
  sum += a[i]*b[i];
}

Issues with software prefetching:
Takes up issue slots
Not a big issue with superscalar processors
Takes up system bandwidth
Must have non-blocking caches
Doesn't have to be correct! E.g., __prefetch(-1);
Prefetch distance depends on the specific system implementation
Non-portable code
Not easy to use for pointer-based structures
Requires a ninja programmer/compiler!
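For comparison, a compilable version of the loop above using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 8 is an assumed tuning value.

#include <stddef.h>

/* Dot product with software prefetching via the GCC/Clang intrinsic.
 * The distance (8) is an assumption; the right value depends on memory
 * latency and per-iteration work on the target machine. */
double dot(const double *a, const double *b, size_t n)
{
    const size_t DIST = 8;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n) {
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 3 /* high locality */);
            __builtin_prefetch(&b[i + DIST], 0, 3);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

The bounds check is optional: as the slide notes, prefetches do not have to be correct, and a prefetch of an invalid address is simply dropped.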
Hardware Prefetching
Same goal as software prefetching, but initiated by hardware
Can tune to specific system implementation
Does not waste instruction issue bandwidth
More portable code
Major design questions
Where to place a prefetch engine?
L1, L2, …
What to prefetch?
Next sequential cache line(s), strided patterns, pointers, …
When to prefetch?
On a load, on a miss, when other prefetched data used, …
Where to place prefetched data?
In the cache or in a special prefetch buffer
How to handle VM exceptions?
Don’t prefetch beyond a page?
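On the last point, one common answer is simply to drop any prefetch that would cross a page boundary, since the next page's translation and permissions are unknown without a TLB walk. A minimal sketch, assuming 4KB pages:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096   /* assumed 4KB pages */

/* Drop a candidate prefetch that would cross into another page. */
static inline bool same_page(uint64_t demand_addr, uint64_t prefetch_addr)
{
    return (demand_addr / PAGE_SIZE) == (prefetch_addr / PAGE_SIZE);
}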
Simple Sequential Prefetching
On a cache miss, fetch two sequential memory
blocks
Exploits spatial locality in both instructions & data
Exploits high bandwidth for sequential accesses
Called “Adjacent Cache Line Prefetch” or “Spatial
Prefetch” by Intel
Extend to fetching N sequential memory blocks
Pick N large enough to hide the memory latency
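A back-of-the-envelope example with assumed numbers: if a miss to memory costs ~200 cycles and the loop consumes one cache line every ~25 cycles of computation, the prefetcher must run at least N = 200 / 25 = 8 lines ahead to fully hide the latency.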
Stream Prefetching
Sequential prefetching problem
Performance slows down once every N cache lines
Stream prefetching is a continuous version of sequential prefetching
Stream buffer can fit N cache lines
On a miss, start fetching N sequential cache lines
On a stream buffer hit:
Move cache line to cache, start fetching line (N+1)
In other words, stream buffer tries to stay N cache lines ahead
Design issues
When is a stream buffer allocated
When is a stream buffer released
Can use multiple stream buffers to capture multiple streams
E.g. a program operating on 2 arrays
Stream Buffer Design
[Figure: stream buffer design: a FIFO of prefetched cache lines with tags, sitting between the cache and the next memory level.]
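A sketch of the stream-buffer policy in C. The depth, line size, and the fetch_line()/move_line_to_cache() hooks are illustrative assumptions, not a specific design:

#include <stdbool.h>
#include <stdint.h>

#define STREAM_DEPTH 4        /* N: how far ahead the buffer runs (assumed) */
#define LINE_BYTES   64

typedef struct {
    bool     valid;
    uint64_t next_line;               /* next sequential line to fetch */
    uint64_t entries[STREAM_DEPTH];   /* tags of prefetched lines (data omitted) */
    int      count;
} StreamBuffer;

/* Stubs standing in for the rest of the memory system. */
void fetch_line(uint64_t line)         { (void)line; /* would issue a request */ }
void move_line_to_cache(uint64_t line) { (void)line; /* would install the line */ }

/* Called on an L1 miss to 'addr'. Returns true if the stream buffer hit. */
bool stream_buffer_access(StreamBuffer *sb, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;

    if (sb->valid && sb->count > 0 && sb->entries[0] == line) {
        /* Hit: move the head line into the cache... */
        move_line_to_cache(line);
        for (int i = 1; i < sb->count; i++)   /* shift the FIFO */
            sb->entries[i - 1] = sb->entries[i];
        sb->count--;
        /* ...and stay N lines ahead by starting to fetch line (N+1). */
        sb->entries[sb->count++] = sb->next_line;
        fetch_line(sb->next_line);
        sb->next_line++;
        return true;
    }

    /* Miss: (re)allocate the buffer; the missing line itself goes through
     * the normal miss path, while the buffer fetches the N lines after it. */
    sb->valid = true;
    sb->count = 0;
    sb->next_line = line + 1;
    for (int i = 0; i < STREAM_DEPTH; i++) {
        sb->entries[sb->count++] = sb->next_line;
        fetch_line(sb->next_line);
        sb->next_line++;
    }
    return false;
}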
Strided Prefetching

Idea: detect and prefetch strided accesses
for (i=0; i<N; i++) A[i*1024]++;

Stride detected using a PC-based table:

PC       Stride  Last Addr  Conf
0x08ab0  8       0xff024    10
0x03fa8  1024    0xf0ab2    11

For each PC, remember the stride
Stride detection:
Remember the last address used for this PC
Compare to the address currently being used by this PC
Track confidence using a two-bit saturating counter
Increment when the stride is correct, decrement when incorrect
How to use the PC-based table:
Similar to stream prefetching, except using the stride instead of +1
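The table update can be sketched in C as follows; the table size, indexing by PC, and the confidence threshold are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 256   /* assumed table size, indexed by PC */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    uint8_t  conf;       /* two-bit saturating counter: 0..3 */
} StrideEntry;

static StrideEntry stride_table[TABLE_SIZE];

uint64_t prefetch_addr;  /* set when should_prefetch() returns true */

/* Update the stride table on each load; return true if the entry is
 * confident enough to issue a prefetch for the predicted next address. */
bool should_prefetch(uint64_t pc, uint64_t addr)
{
    StrideEntry *e = &stride_table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;            /* stride confirmed */
    } else {
        if (e->conf > 0) e->conf--;            /* stride mispredicted */
        if (e->conf == 0) e->stride = stride;  /* retrain on the new stride */
    }
    e->last_addr = addr;

    if (e->conf >= 2) {                        /* confident: predict next access */
        prefetch_addr = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}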
Sandy Bridge Prefetching (Intel Core i7-2600K)
“Intel 64 and IA-32 Architectures Optimization
Reference Manual, Jan 2011”, pg 2-24
http://www.intel.com/Assets/PDF/manual/248966.pdf
Other Ideas in Prefetching
Prefetch for pointer-based data structures
Predict if fetched data contain a pointer & follow it
Works for linked-lists, graphs, etc
Must be very careful:
What is a pointer?
How far to prefetch?
Different correlation techniques
Markov prefetchers
Delta correlation prefetchers
Prefetching Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Prefetching
Multi-ported Caches
Idea: allow for multiple accesses in parallel
Processor with many LSUs, I+D access in L2, …
Can be implemented in multiple ways
True multi-porting
Multiple banks
What is difficult about multi-porting
Interaction between parallel accesses (especially for
stores)
True Multi-porting
True multiporting
Use 2-ported tag/data storage
Problem: large area increase
Problem: hit time increase
[Figure: one cache with two ports: Request 1 → Data 1 and Request 2 → Data 2, served in parallel.]
Multi-banked Caches
[Figure: two banks, each with its own port: Request 1 → Cache Bank 1 → Read Data 1, Request 2 → Cache Bank 2 → Read Data 2.]
Partition address space into multiple banks
Bank0 caches addresses from partition 0, bank1 from partition 1…
Can use least or most significant address bits for partitioning
What are the advantages of each approach?
Benefits: accesses can go in parallel if no conflicts
Challenges: conflicts, distribution network, bank
utilization
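To make the partitioning choice concrete, here is a hypothetical sketch with 64B lines and 8 banks: low-order line-address bits interleave consecutive lines across banks, which spreads sequential traffic, while high-order bits give each bank a contiguous region of the address space.

#include <stdint.h>

#define LINE_BYTES 64
#define NUM_BANKS  8          /* assumed: power of two */
#define ADDR_BITS  32

/* Low-order interleaving: consecutive cache lines map to different banks. */
static inline unsigned bank_low(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_BANKS;
}

/* High-order partitioning: each bank owns one contiguous 1/8th of memory. */
static inline unsigned bank_high(uint32_t addr)
{
    return addr >> (ADDR_BITS - 3);   /* top log2(NUM_BANKS) = 3 bits */
}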
Sun UltraSPARC T2
8-bank L2 cache
Multi-porting Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Multi-porting
Summary of Advanced Cache Optimizations

                      ------ Miss rate ------
Cache optimization    Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Multi-level                                      +
Victim cache                          ~          +
Pseudo-assoc.                         +/~
Skew-assoc.                           +                         ~
Non-blocking                                     +                         ~
Critical-word-first                              +
Prefetching           +     -                    +
Multi-porting                                                   ~          +

Also see Figure 5.11 in H&P
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Cache Coherence Problem

[Figure: three processors P1, P2, P3, each with a private cache ($), connected by a bus to memory and I/O devices.
(1) P1 reads u=5 from memory. (2) P3 reads u=5. (3) P3 writes u=7, updating only its own cache.
(4) P1 reads u again and sees the stale 5 in its cache. (5) P2 reads u and gets the stale 5 from memory.]

Cores may see different values for u
With write-back caches, the value written back to memory depends on happenstance: which cache flushes or writes back its value, and when
Threads or processes accessing main memory may see a very stale value
Unacceptable for programming, and it's frequent!
Hardware Cache Coherence
Using Snooping
Hardware guarantees that loads from all
cores will return the value of the latest
write
Coherence mechanisms
Metadata to track state for cached data
Controller that snoops bus (or interconnect)
activity and reacts if needed to adjust the state
of the cache data
There needs to be a serialization point
Shared L3, memory controller, or memory bus
MSI: Simple Coherence Protocol for Write-Back Caches

Each cache line has an address tag plus state bits:
M: Modified
S: Shared
I: Invalid

[State diagram, showing the cache state in processor P1:
M: stays in M while P1 reads or writes
M → S: another processor reads; P1 writes the block back
M → I: another processor signals intent to write
S: stays in S on reads by any processor
S → I: another processor signals intent to write
I → S: read miss
(A write by P1 moves the line to M, invalidating copies in other caches.)]
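The transitions above can be written down as a small C state function. This is an illustrative sketch of the MSI policy as described on this slide, not a full controller (it omits data movement and bus arbitration):

#include <stdbool.h>

typedef enum { I, S, M } MSIState;
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_INTENT_TO_WRITE } Event;

/* Next state for one cache line, as seen by processor P1.
 * PR_* events come from our processor; BUS_* events are snooped. */
MSIState msi_next(MSIState st, Event ev, bool *writeback)
{
    *writeback = false;
    switch (st) {
    case M:                                  /* only copy, and it is dirty */
        if (ev == BUS_READ)            { *writeback = true; return S; }
        if (ev == BUS_INTENT_TO_WRITE) { *writeback = true; return I; }
        return M;                            /* own reads and writes stay in M */
    case S:                                  /* clean, possibly shared copy */
        if (ev == PR_WRITE)            return M;  /* broadcast intent to write */
        if (ev == BUS_INTENT_TO_WRITE) return I;
        return S;                            /* reads by any processor stay in S */
    default: /* I */
        if (ev == PR_READ)             return S;  /* read miss: fetch block */
        if (ev == PR_WRITE)            return M;  /* write miss: fetch + intent */
        return I;
    }
}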
Quick Questions
How many copies of a cache line can you have
in S state?
How many copies can you have in M state?
How does L2 inclusion help?
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Software-managed Memory
Caches are complex, hard to design, hard to
optimize, hard to analyze, hard to use well,
hard to keep coherent…
Private on-chip memory with its own address
space
Not implicitly backed by main memory
Also called “Local Store”, “Local Memory”,
“Scratchpad”, “Stream Register File”
Ubiquitous in embedded computing space
Local Stores in the wild
IBM Cell Processor
256KB LS per core
Shared by inst. and data!
Playstation 3!
Cache vis-à-vis Local Store
[Two-column comparison of Cache vs. Local Store; see the Pros and Cons slide later in this lecture.]
Local Stores: AMAT
AMAT = HitTime + MissRate * MissPenalty
MissRate = 0%!
Consequences?
Simpler performance analysis
Less motivation for out-of-order cores
Cell processor is in-order
High clock rate and low power
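With assumed numbers for illustration: a cache with a 1-cycle hit time, a 5% miss rate, and a 200-cycle miss penalty gives AMAT = 1 + 0.05 × 200 = 11 cycles, while a local store with the same 1-cycle access gives AMAT = 1 cycle, every time.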
Local Stores: Operation
LD/ST instructions to LS proceed normally
No LD/ST to non-LS memory
DMA transfers (Direct Memory Access) to
move data to/from main memory and LS
Bulk, like memcpy()
Asynchronous
dma(void *local_address, void *remote_address,
int size, int tag, boolean direction);
Stream Programming
[Figure: two timelines. Serial: get(a), do_something(a), get(b), do_something(b). Overlapped: get(a) and get(b) are issued early, so do_something(a) runs while get(b) is in flight, followed by do_something(b).]
Overlap communication with computation
Hide memory latency
“Macroscopic” software prefetching
No ugly prefetch instructions interlaced w/ your code
Doesn’t waste instruction issue bandwidth
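A sketch of this pattern as classic double buffering in C, using the dma() call from the Local Stores: Operation slide plus a hypothetical dma_wait(tag) completion primitive; the buffer size and direction encoding are assumptions:

#include <stddef.h>

#define CHUNK 4096                 /* bytes per DMA transfer (assumed) */
enum { GET = 0, PUT = 1 };         /* assumed encoding of 'direction' */

/* Provided by the platform: dma() from the earlier slide, plus an
 * assumed primitive that blocks until a tagged transfer completes. */
void dma(void *local_address, void *remote_address,
         int size, int tag, int direction);
void dma_wait(int tag);

void process(char *buf, int n);    /* the computation on one chunk */

/* Assumes total is a nonzero multiple of CHUNK. */
void stream(char *remote, size_t total)
{
    static char local[2][CHUNK];   /* two local-store buffers */
    size_t nchunks = total / CHUNK;

    dma(local[0], remote, CHUNK, 0, GET);          /* prime the pipeline */
    for (size_t i = 0; i < nchunks; i++) {
        int cur = i & 1, nxt = (i + 1) & 1;
        if (i + 1 < nchunks)                       /* start the next get... */
            dma(local[nxt], remote + (i + 1) * CHUNK, CHUNK, nxt, GET);
        dma_wait(cur);                             /* ...wait for this one */
        process(local[cur], CHUNK);                /* compute while next get flies */
    }
}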
Local Stores: Pros and Cons
Pros:
No coherence!
Simple to implement
Less overhead (no tags)
Predictable performance, great for in-order cores
Can potentially hide all memory latency

Cons:
No coherence…
Complex to program
Can't run existing SW
Unpredictable access patterns perform poorly
Pointer chasing difficult (linked lists, trees, etc.)
People resort to implementing set-associative caches in software…
Local Store Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Local Store
SW Complexity
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Everything is a Cache for Something Else

Level            Access Time         Capacity   Managed by
Registers        1 cycle             ~500B      Software/compiler
Level 1 Cache    1-3 cycles          ~64KB      Hardware
Level 2 Cache    5-10 cycles         1-10MB     Hardware
DRAM             ~100 cycles         ~10GB      Software/OS
Disk             10^6-10^7 cycles    ~1TB       Software/OS
(and below disk: tape, and "the Interwebs")
Example: File cache

Do files exhibit locality?
Prefetching? Microsoft "SuperFetch": load common programs at boot
Write back or write through? When should we write to disk?
Coherence? "Leases" in network filesystems
Associativity? Place arbitrarily and keep an index
Most disks have caches
Example: Browser cache

Do web pages you visit exhibit locality?
Write back or write through? No writes!
Coherence? Did the page change since I last checked? Relaxed coherence: the "If-Modified-Since" header
Replacement policy? Probably LRU
AMAT?
Caching is a ubiquitous tool
Same design issues in system design as in
processor design
Placement, lookup, write policies, replacement
policies, coherence
Same optimization dimensions
Size, associativity, granularity
Hit time, miss rate, miss penalty, bandwidth,
complexity
Next Lecture
DRAM (Main Memory)