Cache Performance Optimization Guide

The document discusses techniques for measuring and improving cache performance. It defines average access time as the hit time plus the miss rate multiplied by the miss penalty. It describes ways to reduce the miss rate, such as increasing the block size, cache size, and associativity. Prefetching, in hardware or software, can also reduce miss rates by fetching data before it is needed. Reducing the miss penalty involves techniques like larger L2 caches and write buffers. Reducing the hit time focuses on avoiding address translation overhead and using simple, small caches.

Average access time (AAT)

• Recall:
  Memory stall cycles = Number of misses × Miss penalty (in cycles)

• Equivalent measure in units of time:
  Memory stall time = Number of misses × Miss penalty (in seconds)

• Total time spent on memory references, including both hits and misses:
  Total access time = (Number of references) × (Hit time) + (Number of misses) × (Miss penalty)

  Note that in the above expression: (1) "Hit time" is the cache access time in seconds, and (2) "Miss penalty" is in seconds.

• Average access time (AAT) for a single memory reference:
  AAT = Total access time / Number of references
      = (Hit time) + (Number of misses / Number of references) × (Miss penalty)
      = (Hit time) + (Miss rate) × (Miss penalty)
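As a quick illustration, here is a minimal C helper that evaluates the AAT formula above; the function name and parameter choices are ours, not part of the original notes:

    #include <stdio.h>

    /* AAT = hit_time + miss_rate * miss_penalty (times in consistent units, e.g., ns) */
    double aat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Values from the worked example later in these notes */
        printf("AAT = %.2f ns\n", aat(1.0, 0.01, 108.0));  /* prints AAT = 2.08 ns */
        return 0;
    }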

ECE 463/521, Profs. Conte/Rotenberg/Sair, Dept. of ECE, NC State University

Measuring cache performance

• Run a program and collect a trace of accesses
• Simulate the "tag store" part of caches under consideration
• Measure the miss rate
  - Can use it to estimate average access time:

  Average access time = Hit time + Miss rate × Miss penalty
  Miss penalty = Memory access latency + block size (bytes) / memory bandwidth (bytes/sec)

Example:
  Hit time = 1 ns
  Miss rate = 0.01
  Memory access latency = 100 ns
  Memory bandwidth = 8 GB/s (= 8 B/ns)
  Block size = 64 B

  Miss penalty = 100 ns + (64 B) / (8 B/ns) = 108 ns
  Average access time = 1 + 0.01 × 108 = 2.08 ns
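A minimal sketch of the trace-driven methodology above: simulate only the tag store of a direct-mapped cache over an address trace and count misses. The cache geometry, trace format, and helper names are illustrative assumptions, not part of the original notes.

    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64ull    /* bytes per block (assumed) */
    #define NUM_SETS   1024ull  /* direct-mapped: one block per set (assumed) */

    /* Tag store only: no data is kept, just tags and valid bits */
    static unsigned long long tags[NUM_SETS];
    static bool valid[NUM_SETS];
    static unsigned long long misses, references;

    static void access_cache(unsigned long long addr)
    {
        unsigned long long block = addr / BLOCK_SIZE;
        unsigned long long set   = block % NUM_SETS;
        unsigned long long tag   = block / NUM_SETS;

        references++;
        if (!valid[set] || tags[set] != tag) {   /* miss: allocate the block */
            misses++;
            valid[set] = true;
            tags[set]  = tag;
        }
    }

    int main(void)
    {
        unsigned long long addr;
        while (scanf("%llx", &addr) == 1)        /* trace: one hex address per line */
            access_cache(addr);

        if (references > 0) {
            double miss_rate = (double)misses / (double)references;
            /* AAT estimate using hit time = 1 ns, miss penalty = 108 ns (example above) */
            printf("miss rate = %.4f, estimated AAT = %.2f ns\n",
                   miss_rate, 1.0 + miss_rate * 108.0);
        }
        return 0;
    }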

Improving cache performance

• Reduce miss rate
  - Block size, cache size, associativity
  - Prefetching: hardware, software
  - Layout of instructions and data
• Reduce miss penalty
  - Write buffers
  - L2 caches
  - Victim cache
  - Subblocking
  - Early restart
  - Critical word first
• Reduce hit time
  - Avoid address translation overhead (access the TLB in parallel with the cache)
  - Simple caches, small caches
  - Pipeline writes


Categories of misses (3C's model)

• Compulsory misses
  - To have something in the cache, it must first be fetched
  - The initial fetch of anything is a miss
  - Also called unique references or first-time references
• Capacity misses
  - A miss that occurs due to the limited capacity of the cache
  - The block was replaced before it was re-referenced
  - Also called dimensional misses
• Conflict misses
  - For set-associative or direct-mapped caches only
  - The difference between capacity and conflict misses: in the latter, the sets have limited capacity, even if the cache as a whole does not
  - For example (see the sketch after this list)...
    ◊ Suppose a 2-way set-associative cache has capacity for 256 blocks
    ◊ Suppose a program accesses only 4 blocks, all of which map to the same set: the cache has ample capacity, yet the set can hold only 2 of the 4 blocks at a time, so they evict each other
  - Also called mapping misses
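A small C sketch of the conflict-miss example above: it computes the set index the way a set-associative cache would and shows four addresses landing in the same set. The specific geometry (64 B blocks, 128 sets, 2 ways, so 256 blocks total) is an assumption for illustration.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64u    /* assumed geometry: 2-way, 128 sets = 256 blocks */
    #define NUM_SETS   128u

    int main(void)
    {
        /* Four addresses exactly NUM_SETS * BLOCK_SIZE = 0x2000 bytes apart:
           same set index, different tags */
        uint64_t addrs[4] = { 0x0000, 0x2000, 0x4000, 0x6000 };

        for (int i = 0; i < 4; i++) {
            uint64_t set = (addrs[i] / BLOCK_SIZE) % NUM_SETS;
            printf("addr 0x%04llx -> set %llu\n",
                   (unsigned long long)addrs[i], (unsigned long long)set);
        }
        /* All four map to set 0; a 2-way set holds only 2 of them at a time,
           so they keep evicting each other: conflict misses, not capacity misses */
        return 0;
    }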

Reduce miss rate: Block size

• Increase block size
  - Idea: exploit spatial locality
  - Problems:
    ◊ Don't overdo it: cache pollution from useless data
    ◊ Also increases miss penalty (have to bring more in)

[Figure: miss rate vs. block size. Miss rate first falls as blocks grow, then rises again once "cache pollution" sets in.]


Reduce miss rate: Cache size

• Advantages:
  - Larger caches hold more blocks, so fewer capacity misses
• Disadvantages:
  - Increases hit time: larger caches are slower to access
  - Yields diminishing returns: doubling the size does not double performance
  - Steals resources from other units (esp. for on-chip caches)

[Figure: miss rate vs. log(cache size), flattening out ("diminishing returns"). Inset: cache organization with a tag store and a data (block) store, a "=?" tag comparator, and word select; the larger the distance across the data store, the longer it takes to drive and latch the contents of a block.]

Reduce miss rate: Increase associativity

• Increase associativity
  - Advantages:
    ◊ For the same total cache size, fully-associative has a lower miss rate than direct-mapped
  - Disadvantages:
    ◊ Increases hit time: slower (searching sets), for the same total cache size
    ◊ Diminishing returns
      · 4-way set-associative is almost equivalent to fully-associative in many cases

[Figure: miss rate vs. log(associativity), flattening out: diminishing returns.]


Reduce miss rate: Prefetch

• Idea: get it before you need it
• Prefetching can be implemented in hardware, software (e.g., compiler), or both

Hardware prefetching

• General idea
  - An autonomous hardware prefetcher sits alongside the cache
  - Predicts which blocks may be accessed in the future
  - Prefetches these predicted blocks
• Simplest hardware prefetchers: stride prefetchers
  - +1 prefetch (stride = 1): fetch the missing block, and the next sequential block
    ◊ Works great for streams with high sequential locality, e.g., instruction caches
    ◊ Uses unused memory bandwidth between misses
      · Can "hurt" if there isn't enough leftover bandwidth
  - +n prefetch (stride = n): observe that memory is being accessed every n blocks, so prefetch block +n
    · Example of code with this behavior (with 4 array elements per block as laid out below, the loop touches every 2nd block, so stride = 2); a hardware sketch follows this slide:

      for (i = 1; i < MAX; i += 8)
          a[i] = b[i];

      Layout of b[] in memory:
      block X:   b[0]  b[1]  b[2]  b[3]
      block X+1: b[4]  b[5]  b[6]  b[7]
      block X+2: b[8]  b[9]  b[10] b[11]
      block X+3: b[12] b[13] b[14] b[15]
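A minimal software model of the stride prefetcher described above, written as a sketch rather than actual hardware: it watches successive miss addresses, detects a constant stride (in blocks), and issues a prefetch for the next predicted block. The issue_prefetch stand-in and the single-stream design are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 64u

    /* Stand-in for the action of fetching a block into the cache early */
    static void issue_prefetch(uint64_t block)
    {
        printf("prefetch block %llu\n", (unsigned long long)block);
    }

    /* Single-stream stride detector: called on every cache miss */
    void on_miss(uint64_t addr)
    {
        static uint64_t last_block;
        static int64_t  last_stride;
        static int      have_last;

        uint64_t block  = addr / BLOCK_SIZE;
        int64_t  stride = (int64_t)block - (int64_t)last_block;

        /* Two consecutive misses with the same stride => predict a third */
        if (have_last && stride != 0 && stride == last_stride)
            issue_prefetch(block + stride);

        last_stride = stride;
        last_block  = block;
        have_last   = 1;
    }

    int main(void)
    {
        /* Misses every 2nd block, like the b[i] example above (stride = 2) */
        on_miss(0x0040); on_miss(0x00C0); on_miss(0x0140); on_miss(0x01C0);
        return 0;   /* prints prefetches for blocks 7 and 9 */
    }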


Compiler directed prefetching

• Need a "nonbinding prefetch" instruction
  - Doesn't cause a page fault
  - Doesn't change the processor's state
  - Doesn't delay the processor on a miss
• Compiler estimates which accesses will miss
• Inserts prefetch instructions far enough ahead to prevent the disaster of a cache miss
• Reduces compulsory misses for the original instructions (the compulsory misses simply move around, since the prefetch instructions still generate the misses)

Original loop:

    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++)
            x[i][j] = c * x[i][j];

With prefetching inserted:

    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++) {
            prefetch(x[i+k][j]);
            x[i][j] = c * x[i][j];
        }

where k depends on (1) the miss penalty and (2) the time it takes to execute one iteration assuming hits.
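In real C code, the nonbinding prefetch above maps onto a compiler intrinsic. A sketch using GCC/Clang's __builtin_prefetch; the value k = 16 is an arbitrary assumption standing in for the computed lookahead:

    /* Compile with GCC or Clang; __builtin_prefetch is nonbinding:
       it never faults and never stalls the processor */
    #define N 100

    void scale(double x[N][N], double c)
    {
        const int k = 16;   /* assumed lookahead distance */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                if (i + k < N)   /* don't prefetch past the array */
                    __builtin_prefetch(&x[i + k][j], 1, 3);  /* rw=1 (write), high locality */
                x[i][j] = c * x[i][j];
            }
    }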
Compiler directed prefetching (cont.)

    for (j = 0; j < 100; j++)
        for (i = 0; i < 100; i++) {
            prefetch(x[i+k][j]);
            x[i][j] = c * x[i][j];
        }

where k depends on (1) the miss penalty and (2) the time it takes to execute one iteration assuming hits:

    k = miss penalty / (time for 1 iteration assuming hits)

In the example shown on the slide, k = 11.

[Figure: timeline of inner-loop iterations ... i ... i+k .... While the CPU is in iteration i, it issues prefetch(x[i+k][j]); the miss penalty (the time to service a miss) elapses over the next k iterations, so the block arrives just in time for iteration i+k. Each tick on the timeline is the execution time for one iteration of the inner loop, assuming cache hits.]

Potential issues with prefetching

• Cache pollution
  - Inaccurate prefetches bring in useless blocks, displacing useful ones
  - Must be careful not to increase the miss rate
  - Solution: prefetch blocks into a "stream buffer" or "candidate cache", and transfer a block into the main cache only when it is actually referenced by the program (see the sketch after this list)
• Bandwidth hog
  - Inaccurate prefetches waste bandwidth throughout the memory hierarchy
  - Must be careful that prefetch misses (prefetch traffic) do not delay demand misses (legitimate traffic)
  - Solutions:
    ◊ Strike a reasonable balance between prefetch coverage and prefetch accuracy
    ◊ Request queues throughout the memory hierarchy should prioritize demand misses over prefetch misses
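A minimal sketch of the stream-buffer idea above: prefetched blocks wait in a small FIFO beside the cache and are promoted into the cache only on an actual reference, so inaccurate prefetches never pollute the cache. The 4-entry depth and helper names are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_ENTRIES 4            /* assumed stream buffer depth */

    static uint64_t sb_blocks[SB_ENTRIES];
    static bool     sb_valid[SB_ENTRIES];

    /* Prefetch lands in the stream buffer, NOT the cache (no pollution) */
    void prefetch_into_stream_buffer(uint64_t block)
    {
        static int next;                 /* simple FIFO replacement */
        sb_blocks[next] = block;
        sb_valid[next]  = true;
        next = (next + 1) % SB_ENTRIES;
    }

    /* On a cache miss, check the stream buffer before going to memory.
       Returns true if the block was found and promoted into the cache. */
    bool stream_buffer_lookup(uint64_t block)
    {
        for (int i = 0; i < SB_ENTRIES; i++) {
            if (sb_valid[i] && sb_blocks[i] == block) {
                sb_valid[i] = false;     /* promote to main cache... */
                /* install_in_cache(block);  ...hypothetical cache hook */
                return true;
            }
        }
        return false;                    /* go to the next memory level */
    }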

