Average access time (AAT)
• Recall:
Memory stall cycles = Number of misses × Miss penalty (in cycles)
• Equivalent measure in units of time:
Memory stall time = Number of misses × Miss penalty (in seconds)
• Total time spent on memory references, including both hits and misses:
Total access time = (Number of references) × (Hit time) + (Number of misses) × (Miss penalty)
Note: in the above expression, (1) “Hit time” is the cache access time in seconds, and (2) “Miss penalty” is in seconds.
• Average access time (AAT) for a single memory reference:
AAT = Total access time / Number of references
AAT = (Hit time) + (Number of misses / Number of references) × (Miss penalty)
AAT = (Hit time) + (Miss rate) × (Miss penalty)
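As a quick check of the formula, a minimal C sketch of the AAT computation (the numbers are the ones from the worked example on the next slide):

    #include <stdio.h>

    /* AAT = hit time + miss rate × miss penalty (all times in one unit) */
    double aat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* Numbers from the worked example on the next slide (ns). */
        printf("AAT = %.2f ns\n", aat(1.0, 0.01, 108.0));   /* 2.08 ns */
        return 0;
    }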
Measuring cache performance
• Run a program and collect a trace of accesses
• Simulate the “tag store” part of the caches under consideration
• Measure the miss rate
  - Can use it to estimate average access time (see the simulator sketch at the end of this slide):
Average access time = Hit time + Miss rate × Miss penalty
Miss penalty = Memory access latency + Block size (bytes) / Memory bandwidth (bytes/sec)
Example:
    Hit time = 1 ns
    Miss rate = 0.01
    Memory access latency = 100 ns
    Memory bandwidth = 8 GB/s (= 8 B/ns)
    Block size = 64 B

    Miss penalty = 100 ns + (64 B / 8 B/ns) = 108 ns
    Average access time = 1 + 0.01 × 108 = 2.08 ns
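Below is a minimal sketch of this methodology in C, assuming a direct-mapped cache, 32-bit addresses, and a trace supplied as an array (all illustrative choices, not specified in the slides). Only the tag store is modeled: tags and valid bits suffice to measure miss rate, so no data is stored.

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64   /* bytes */
    #define NUM_SETS   512  /* direct-mapped: one block per set */

    static uint32_t tags[NUM_SETS];
    static int      valid[NUM_SETS];

    /* Returns 1 on a miss, 0 on a hit; installs the block on a miss. */
    int access_cache(uint32_t addr) {
        uint32_t block = addr / BLOCK_SIZE;
        uint32_t set   = block % NUM_SETS;
        uint32_t tag   = block / NUM_SETS;
        if (valid[set] && tags[set] == tag)
            return 0;            /* hit */
        valid[set] = 1;          /* miss: install the new tag */
        tags[set]  = tag;
        return 1;
    }

    int main(void) {
        /* Hypothetical trace, for illustration only. */
        uint32_t trace[] = { 0x1000, 0x1004, 0x2000, 0x1008, 0x41000 };
        int n = sizeof(trace) / sizeof(trace[0]), misses = 0;
        for (int i = 0; i < n; i++)
            misses += access_cache(trace[i]);
        printf("miss rate = %.2f\n", (double)misses / n);
        return 0;
    }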
Improving cache performance
• Reduce miss rate
  - Block size, cache size, associativity
  - Prefetching: hardware, software
  - Layout of instructions and data
• Reduce miss penalty
  - Write buffers
  - L2 caches
  - Victim cache
  - Subblocking
  - Early restart
  - Critical word first
• Reduce hit time
  - Avoid address translation (TLB accesses in parallel)
  - Simple caches, small caches
  - Pipeline writes
Categories of misses (3C’s model)
• Compulsory misses
  - To have something in the cache, it must first be fetched
  - The initial fetch of anything is a miss
  - Also called unique references or first-time references
• Capacity misses
  - A miss that occurs due to the limited capacity of the cache
  - The block was replaced before it was re-referenced
  - Also called dimensional misses
• Conflict misses
  - For set-associative or direct-mapped caches only
  - The difference between capacity and conflict misses: in the latter, the sets have limited capacity, even if the cache does not
  - For example...
    ◊ Suppose a 2-way set-assoc. cache has capacity for 256 blocks
    ◊ Suppose a program accesses only 4 blocks, all of which map to the same set
    ◊ Then the 4 blocks keep evicting each other from the 2-entry set even though the cache as a whole is nearly empty; these are conflict misses (see the sketch below)
  - Also called mapping misses
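To make the example concrete, a minimal C sketch (illustrative parameters, not from the slides): with 256 blocks arranged as 128 sets × 2 ways, any addresses spaced by NUM_SETS × BLOCK_SIZE bytes land in the same set.

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64
    #define NUM_SETS   128   /* 256 blocks / 2 ways */

    int main(void) {
        /* Addresses spaced by NUM_SETS * BLOCK_SIZE bytes share a set. */
        for (int i = 0; i < 4; i++) {
            uint32_t addr = (uint32_t)(i * NUM_SETS * BLOCK_SIZE);
            uint32_t set  = (addr / BLOCK_SIZE) % NUM_SETS;
            printf("addr 0x%05x -> set %u\n", (unsigned)addr, (unsigned)set);
        }
        /* Cycling through these 4 blocks misses on every access after the
         * first pass: with LRU, each access evicts one of the other three
         * from the 2-entry set, while 254 other blocks sit unused. */
        return 0;
    }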
Reduce miss rate: Block size
• Increase block size
  - Idea: exploit spatial locality
  - Problems:
    ◊ Don’t overdo it: cache pollution from useless data
    ◊ Also increases miss penalty (have to bring more in)
[Figure: miss rate vs. block size. Miss rate first falls as larger blocks exploit spatial locality, then rises again once “cache pollution” sets in.]
Reduce miss rate: Cache size
• Advantages:
  - Larger caches hold more
• Disadvantages:
  - Increases hit time: larger caches are slower to access
  - Yields diminishing returns: double size != double performance
  - Steals resources from other units (esp. for on-chip caches)
[Figure: cache structure, tag store and data (block) store. The tag comparison (“=?”) selects the matching block and “word select” picks the word out of it; the larger the physical distance across the data store, the longer it takes to drive and latch the contents of a block.]

[Figure: miss rate vs. log(cache size). Miss rate falls as cache size grows, with “diminishing returns”.]
Reduce miss rate: Inc. assoc.
• Increase associativity
  - Advantages:
    ◊ For the same total cache size, fully-associative has a lower miss rate than direct-mapped
  - Disadvantages:
    ◊ Increases hit time: slower (searching sets), for the same total cache size
    ◊ Diminishing returns
      · 4-way set-associative is almost equivalent to fully-associative in many cases
[Figure: miss rate vs. log(associativity). Miss rate falls as associativity increases, with diminishing returns.]
Reduce miss rate: Prefetch
• Idea: get it before you need it
• Prefetching can be implemented in hardware, software (e.g., compiler), or both
Hardware prefetching
• General idea
  - Autonomous hardware prefetcher sits alongside the cache
  - Predicts which blocks may be accessed in the future
  - Prefetches these predicted blocks
• Simplest hardware prefetchers: stride prefetchers
  - +1 prefetch (stride = 1): fetch the missing block, and the next sequential block (sketched at the end of this slide)
    ◊ Works great for streams with high sequential locality, e.g., instruction caches
    ◊ Uses unused memory bandwidth between misses
      · Can “hurt” if there isn’t enough leftover bandwidth
  - +n prefetch (stride = n): observe that memory is being accessed every n blocks, so prefetch block +n
    · Example of code with this behavior (each block holds 4 elements of b, so the loop below touches every other block, i.e., n = 2):

      for (i = 1; i < MAX; i += 8)
          a[i] = b[i];

      block X  :  b[0]  b[1]  b[2]  b[3]
      block X+1:  b[4]  b[5]  b[6]  b[7]
      block X+2:  b[8]  b[9]  b[10] b[11]
      block X+3:  b[12] b[13] b[14] b[15]
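A minimal C sketch of the +1 policy (illustrative, not from the slides), layered on the same kind of direct-mapped tag store as earlier; a real prefetcher would fetch block B+1 asynchronously using idle bandwidth rather than instantly:

    #include <stdint.h>

    #define BLOCK_SIZE 64
    #define NUM_SETS   512

    static uint32_t tags[NUM_SETS];
    static int      valid[NUM_SETS];

    static int lookup_or_install(uint32_t block) {
        uint32_t set = block % NUM_SETS, tag = block / NUM_SETS;
        if (valid[set] && tags[set] == tag)
            return 0;                 /* hit */
        valid[set] = 1;               /* install */
        tags[set]  = tag;
        return 1;                     /* miss */
    }

    /* Returns 1 if the demand access missed. */
    int access_with_plus1_prefetch(uint32_t addr) {
        uint32_t block = addr / BLOCK_SIZE;
        int miss = lookup_or_install(block);
        if (miss)
            (void)lookup_or_install(block + 1);  /* the +1 prefetch */
        return miss;
    }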
Compiler directed prefetching
• Need a “nonbinding prefetch” instruction
  - Doesn’t cause a page fault
  - Doesn’t change the processor’s state
  - Doesn’t delay the processor on a miss
• Compiler estimates which accesses will miss
• Compiler inserts prefetch instructions far enough ahead to prevent the disaster of a cache miss
• Reduces compulsory misses for the original instructions (the compulsory misses don’t disappear, they simply move to the prefetch instructions, which still take the misses)
Original loop:

    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++)
            x[i][j] = c * x[i][j];

With prefetching:

    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++) {
            prefetch(x[i+k][j]);
            x[i][j] = c * x[i][j];
        }

where k depends on (1) the miss penalty and (2) the time it takes to execute an iteration assuming hits.
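For reference, a compilable version of the transformed loop: a sketch assuming GCC or Clang, whose __builtin_prefetch intrinsic is one real example of a nonbinding prefetch. The value k = 11 is taken from the worked example on the next slide.

    /* Sketch assuming GCC/Clang; __builtin_prefetch is nonbinding. */
    #define N 100
    #define K 11   /* prefetch distance from the next slide's example */

    void scale(double x[N][N], double c) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                if (i + K < N)                      /* stay in bounds */
                    __builtin_prefetch(&x[i + K][j]);
                x[i][j] = c * x[i][j];
            }
    }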
Compiler directed prefetching (cont.)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++) {
            prefetch(x[i+k][j]);
            x[i][j] = c * x[i][j];
        }

where k depends on (1) the miss penalty and (2) the time it takes to execute an iteration assuming hits:

    k = miss penalty / (time for 1 iteration, assuming hits)

In the example below, k = 11.

[Figure: timeline. The CPU is currently in iteration i and issues the prefetch of x[i+k][j] now; the data arrives just as iteration i+k needs it, because the prefetch distance k covers the miss penalty (the time to service a miss) measured in units of the execution time of one inner-loop iteration, assuming cache hits.]
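The slides don’t give the numbers behind k = 11, but any combination with roughly an 11:1 ratio works; for instance, assuming a 110 ns miss penalty and 10 ns per iteration:

    /* Prefetch distance: ceiling of miss penalty over per-iteration time,
     * so the data arrives no later than the iteration that needs it.
     * The inputs (110 ns, 10 ns) are assumed values, not from the slides. */
    int prefetch_distance(int miss_penalty_ns, int iter_time_ns) {
        return (miss_penalty_ns + iter_time_ns - 1) / iter_time_ns;
    }
    /* prefetch_distance(110, 10) == 11 */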
Potential issues with prefetching
• Cache pollution
  - Inaccurate prefetches bring in useless blocks, displacing useful ones
  - Must be careful not to increase the miss rate
  - Solution: prefetch the block into a “stream buffer” or “candidate cache”, and transfer it to the main cache only when the block is actually referenced by the program (see the sketch at the end of this slide)
• Bandwidth hog
  - Inaccurate prefetches waste bandwidth throughout the memory hierarchy
  - Must be careful that prefetch misses (prefetch traffic) do not delay demand misses (legitimate traffic)
  - Solutions:
    ◊ Strike a reasonable balance between prefetch coverage and prefetch accuracy
    ◊ Request queues throughout the memory hierarchy should prioritize demand misses over prefetch misses
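A minimal C sketch of the stream-buffer idea (illustrative structure and parameters, not from the slides): prefetches fill a small side buffer, and a block is promoted into the main cache only on a real program reference, so inaccurate prefetches never displace useful blocks.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS   512                 /* direct-mapped main cache */
    #define SB_ENTRIES 4                   /* small stream buffer */

    static uint32_t tags[NUM_SETS];
    static bool     valid[NUM_SETS];
    static uint32_t sb_block[SB_ENTRIES];
    static bool     sb_valid[SB_ENTRIES];

    static bool cache_lookup(uint32_t block) {
        uint32_t set = block % NUM_SETS;
        return valid[set] && tags[set] == block / NUM_SETS;
    }

    static void cache_install(uint32_t block) {
        uint32_t set = block % NUM_SETS;
        valid[set] = true;
        tags[set]  = block / NUM_SETS;
    }

    /* Prefetches fill only the stream buffer, never the main cache, so
     * an inaccurate prefetch cannot displace a useful block. */
    void prefetch(uint32_t block, int slot) {
        sb_block[slot % SB_ENTRIES] = block;
        sb_valid[slot % SB_ENTRIES] = true;
    }

    /* Demand reference: a stream-buffer hit promotes the block into the
     * main cache. Returns true on a hit in either structure. */
    bool reference(uint32_t block) {
        if (cache_lookup(block))
            return true;                   /* main-cache hit */
        for (int i = 0; i < SB_ENTRIES; i++)
            if (sb_valid[i] && sb_block[i] == block) {
                cache_install(block);      /* promote on real use */
                sb_valid[i] = false;
                return true;
            }
        cache_install(block);              /* ordinary demand miss */
        return false;
    }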