CPSC 312
Cache Memories
Slides Source: Bryant (class11.ppt)
Topics
Generic cache memory organization
Direct-mapped caches
Set associative caches
Impact of caches on performance
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware.
They hold frequently accessed blocks of main memory.
The CPU looks for data first in L1, then in L2, then in main memory.
Typical bus structure:
[diagram: CPU chip (register file, ALU, L1 cache, bus interface) connected via the cache bus to the L2 cache, and via the system bus to an I/O bridge, which connects via the memory bus to main memory]
Inserting an L1 Cache Between the CPU and Main Memory
The tiny, very fast CPU register file has room for four 4-byte words.
The transfer unit between the CPU register file and the cache is a 4-byte block.
The small, fast L1 cache has room for two 4-word blocks (lines 0 and 1).
The transfer unit between the cache and main memory is a 4-word block (16 bytes).
The big, slow main memory has room for many 4-word blocks.
[diagram: register file, two-line L1 cache, and main memory blocks 10 (a b c d), 21 (p q r s), 30 (w x y z), ...]
General Organization of a Cache Memory
A cache is an array of S = 2^s sets.
Each set contains E lines.
Each line holds a block of B = 2^b data bytes, plus 1 valid bit and t tag bits.
Cache size: C = B x E x S data bytes.
[diagram: sets 0 .. S-1, each with E lines of the form (valid, tag, bytes 0 .. B-1)]
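As an added illustration (not from the original slides), this organization can be written down directly as C data structures; the type names and parameter values below are made up for this sketch.

/* Sketch of the generic cache organization above; B, E, S are example values. */
#define B 32          /* bytes per block, B = 2^b */
#define E 2           /* lines per set */
#define S 1024        /* sets, S = 2^s */

typedef struct {
    int valid;                /* 1 valid bit per line */
    unsigned long tag;        /* t tag bits per line */
    unsigned char block[B];   /* B = 2^b bytes of data */
} cache_line_t;

typedef struct {
    cache_line_t line[E];     /* E lines per set */
} cache_set_t;

typedef struct {
    cache_set_t set[S];       /* S = 2^s sets */
} cache_t;

/* Cache size C = B x E x S data bytes = 32 * 2 * 1024 = 64 KB here. */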
Addressing Caches
An m-bit address A is partitioned into <tag> (t bits), <set index> (s bits), and <block offset> (b bits), from most to least significant.
The word at address A is in the cache if the tag bits in one of the valid lines in set <set index> match <tag>.
The word's contents begin at offset <block offset> bytes from the beginning of the block.
[diagram: address bits m-1 .. 0 split into tag / set index / block offset, indexing into sets 0 .. S-1]
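A minimal sketch, added here and not part of the original slides, of how the three address fields might be extracted with shifts and masks; the field widths (b = 5, s = 10) are arbitrary example values.

#include <stdio.h>

enum { B_BITS = 5, S_BITS = 10 };   /* example: 32-byte blocks, 1024 sets */

int main(void) {
    unsigned long A = 0x12345678;   /* example address */

    unsigned long offset = A & ((1UL << B_BITS) - 1);              /* block offset */
    unsigned long set    = (A >> B_BITS) & ((1UL << S_BITS) - 1);  /* set index */
    unsigned long tag    = A >> (B_BITS + S_BITS);                 /* tag */

    printf("tag=%lx  set index=%lx  block offset=%lx\n", tag, set, offset);
    return 0;
}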
Direct-Mapped Cache
Simplest kind of cache
Characterized by exactly one line per set (E = 1).
[diagram: sets 0 .. S-1, each with a single line (valid, tag, cache block)]
Accessing Direct-Mapped Caches
Set selection
Use the set index bits to determine the set of interest.
[diagram: address <tag | set index = 00001 | block offset> selecting one of sets 0 .. S-1]
Accessing Direct-Mapped Caches
Line matching and word selection
Line matching: find a valid line in the selected set with a matching tag.
Word selection: then extract the word.
(1) The valid bit must be set.
(2) The tag bits in the cache line must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[diagram: selected set i holds a valid line with tag 0110 and words w0-w3; the address has tag 0110, set index i, block offset 100]
Direct-Mapped Cache Simulation
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 line/set
Address bits: t = 1 tag bit, s = 2 set-index bits, b = 1 block-offset bit
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]

(1) 0 [0000] (miss): set 0 loads block M[0-1] with tag 0
(2) 1 [0001] (hit): found in set 0 (tag 0)
(3) 13 [1101] (miss): set 2 loads block M[12-13] with tag 1
(4) 8 [1000] (miss): set 0 loads block M[8-9] with tag 1, evicting M[0-1]
(5) 0 [0000] (miss): set 0 reloads block M[0-1] with tag 0, evicting M[8-9]
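The trace above can be checked mechanically; the short simulation below was written for this handout (it is not from the slides) and hard-codes the parameters S = 4, E = 1, B = 2, t = 1, s = 2, b = 1.

#include <stdio.h>

int main(void) {
    int valid[4] = {0}, tag[4] = {0};
    int trace[5] = {0, 1, 13, 8, 0};        /* the read addresses above */

    for (int i = 0; i < 5; i++) {
        int addr = trace[i];
        int set = (addr >> 1) & 0x3;        /* s = 2 set-index bits */
        int t   = addr >> 3;                /* t = 1 tag bit */

        if (valid[set] && tag[set] == t) {
            printf("%2d: hit  (set %d)\n", addr, set);
        } else {
            printf("%2d: miss (set %d loads M[%d-%d], tag %d)\n",
                   addr, set, (addr >> 1) * 2, (addr >> 1) * 2 + 1, t);
            valid[set] = 1;
            tag[set] = t;
        }
    }
    return 0;    /* prints: miss, hit, miss, miss, miss */
}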
Why Use Middle Bits as Index?
High-Order Bit Indexing
Adjacent memory lines would map to the same cache entry
Poor use of spatial locality
Middle-Order Bit Indexing
Consecutive memory lines map to different cache lines
Can hold a C-byte region of the address space in the cache at one time
[diagram: 4-line cache (sets 00-11) and 16 memory lines (0000-1111), mapped by high-order vs. middle-order index bits]
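To make the mapping concrete, the loop below (added for this handout, not from the slides) prints the set each of the 16 memory lines would map to under the two schemes; with middle-order indexing, any four consecutive lines occupy four different sets.

#include <stdio.h>

int main(void) {
    /* 16 memory lines (4-bit line numbers), 4-set cache (2 index bits).
     * High-order indexing uses bits 3:2 of the line number; middle-order
     * indexing uses bits 1:0 (the bits just above the block offset). */
    for (int line = 0; line < 16; line++) {
        int high_set = (line >> 2) & 0x3;
        int mid_set  = line & 0x3;
        printf("line %2d -> high-order set %d, middle-order set %d\n",
               line, high_set, mid_set);
    }
    return 0;
}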
Set Associative Caches
Characterized by more than one line per set.
[diagram: sets 0 .. S-1, each with E = 2 lines of the form (valid, tag, cache block)]
Accessing Set Associative Caches
Set selection
Identical to direct-mapped cache.
[diagram: address <tag | set index = 00001 | block offset> selecting one of sets 0 .. S-1, each containing two lines]
Accessing Set Associative Caches
Line matching and word selection
Must compare the tag in each valid line in the selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
[diagram: selected set i holds two lines with tags 1001 and 0110 and words w0-w3; the address has tag 0110, set index i, block offset 100]
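A sketch, added here, of the E-way line-matching loop as code; the lookup() helper and its types are illustrative only, not a real cache API.

#include <stddef.h>

#define E 2      /* lines per set */
#define B 32     /* bytes per block */

typedef struct {
    int valid;
    unsigned long tag;
    unsigned char block[B];
} line_t;

/* Scan every line in the selected set: return a pointer to the requested
 * byte on a hit ((1) valid bit set and (2) tags match), or NULL on a miss.
 * (3) The block offset then selects the starting byte within the block. */
unsigned char *lookup(line_t set[E], unsigned long tag, size_t offset) {
    for (int i = 0; i < E; i++)
        if (set[i].valid && set[i].tag == tag)
            return &set[i].block[offset];
    return NULL;
}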
Multi-Level Caches
Options: separate data and instruction caches, or a unified cache

Processor: Regs -> L1 d-cache / L1 i-cache -> Unified L2 Cache -> Memory -> disk

            Regs     L1 d/i-cache   Unified L2     Memory         disk
size:       200 B    8-64 KB        1-4 MB SRAM    128 MB DRAM    30 GB
speed:      3 ns     3 ns           6 ns           60 ns          8 ms
$/Mbyte:                            $100/MB        $1.50/MB       $0.05/MB
line size:  8 B      32 B           32 B           8 KB

larger, slower, cheaper (Get an update with current data)
Intel Pentium Cache Hierarchy
Regs
L1 Data: 16 KB, 4-way assoc, write-through, 32 B lines, 1-cycle latency
L1 Instruction: 16 KB, 4-way assoc, 32 B lines
L2 Unified: 128 KB-2 MB, 4-way assoc, write-back, write-allocate, 32 B lines
Main Memory: up to 4 GB
[diagram: register file and L1 caches on the processor chip, backed by the unified L2 cache and main memory]
Cache Performance Metrics
Miss Rate
Fraction of memory references not found in cache
(misses/references)
Typical numbers:
3-10% for L1
can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
Time to deliver a line in the cache to the processor (includes
time to determine whether the line is in the cache)
Typical numbers:
1 clock cycle for L1
3-8 clock cycles for L2
Miss Penalty
Additional time required because of a miss
Typically 25-100 cycles for main memory
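These three metrics combine into an average access time; the snippet below, added for this handout, just plugs the "typical" figures quoted above into the standard hit-time + miss-rate x miss-penalty formula (the specific values are illustrative, not measurements).

#include <stdio.h>

int main(void) {
    double l1_hit       = 1.0;   /* cycles (typical L1 hit time) */
    double l1_miss_rate = 0.05;  /* 5%, within the 3-10% range above */
    double l2_hit       = 6.0;   /* cycles, within the 3-8 range above */
    double l2_miss_rate = 0.01;  /* ~1% of all references also miss in L2 */
    double mem_penalty  = 75.0;  /* cycles, within the 25-100 range above */

    /* Every access pays the L1 hit time; L1 misses add an L2 access;
     * the (rarer) references that also miss in L2 add the memory penalty. */
    double avg = l1_hit + l1_miss_rate * l2_hit + l2_miss_rate * mem_penalty;
    printf("average access time ~ %.2f cycles\n", avg);   /* ~2.05 */
    return 0;
}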
Writing Cache Friendly Code
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Examples:
cold cache, 4-byte words, 4-word cache blocks
int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%
The Memory Mountain
Read throughput (read bandwidth)
Number of bytes read from memory per second (MB/s)
Memory mountain
Measured read throughput as a function of spatial and temporal locality.
Compact way to characterize memory system performance.
Memory Mountain Test Function

/* The test function: data is the global array defined in mountain.c */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* measure cycles for test(elems, stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}
Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)   /* ... up to 8 MB */
#define MAXSTRIDE 16         /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];          /* The array we'll be traversing */

int main()
{
    int size;                /* Working set size (in bytes) */
    int stride;              /* Stride (in array elements) */
    double Mhz;              /* Clock frequency */

    init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
    Mhz = mhz(0);               /* Estimate the clock frequency */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}
The Memory Mountain
Pentium III Xeon, 550 MHz
16 KB on-chip L1 d-cache
16 KB on-chip L1 i-cache
512 KB off-chip unified L2 cache
[surface plot: read throughput (MB/s, roughly 0-1200) vs. stride (s1-s15 words) and working set size (2 KB-8 MB), showing ridges of temporal locality (L1, L2, mem regions) and slopes of spatial locality]
Ridges of Temporal Locality
A slice through the memory mountain with stride = 1 illuminates the read throughputs of the different caches and of memory.
[plot: read throughput (MB/s) vs. working set size (1 KB-8 MB), showing distinct L1 cache, L2 cache, and main memory regions]
A Slope of Spatial Locality
A slice through the memory mountain with size = 256 KB shows the effect of the cache block size.
[plot: read throughput (MB/s) vs. stride (s1-s16 words); throughput falls with increasing stride until there is one access per cache line]
Matrix Multiplication Example
Major Cache Effects to Consider
Total cache size
Exploit temporal locality and keep the working set small (e.g., by using blocking)
Block size
Exploit spatial locality
Description:
Multiply N x N matrices
O(N^3) total operations
Accesses
N reads per source element
N values summed per destination (but may be able to hold in register)
Variable sum held in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Miss Rate Analysis for Matrix Multiply
Assume:
Line size = 32B (big enough for 4 64-bit words)
Matrix dimension (N) is very large
Approximate 1/N as 0.0
Cache is not even big enough to hold multiple rows
Analysis Method:
Look at access pattern of inner loop
[diagram: inner-loop access patterns over matrices A, B, and C]
Layout of C Arrays in Memory
(review)
C arrays allocated in row-major order
each row in contiguous memory locations
Stepping through columns in one row:
    for (i = 0; i < N; i++)
        sum += a[0][i];
accesses successive elements
if block size (B) > 4 bytes, exploit spatial locality
    compulsory miss rate = 4 bytes / B
Stepping through rows in one column:
    for (i = 0; i < M; i++)
        sum += a[i][0];
accesses distant elements
no spatial locality!
    compulsory miss rate = 1 (i.e., 100%)
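A small check of the row-major layout, added for this handout: for an int a[M][N], element a[i][j] sits (i*N + j)*sizeof(int) bytes from the start of the array, which is why stepping along a row is stride-1 and stepping down a column is not.

#include <stdio.h>

#define M 4
#define N 6

int main(void) {
    int a[M][N];
    char *base = (char *)a;

    /* Verify &a[i][j] == base + (i*N + j)*sizeof(int) for every element. */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            if ((char *)&a[i][j] != base + (i * N + j) * sizeof(int))
                printf("unexpected layout at a[%d][%d]\n", i, j);

    printf("row stride: %zu bytes, column stride: %zu bytes\n",
           sizeof(int), N * sizeof(int));
    return 0;
}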
Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row (i,*) accessed row-wise, B column (*,j) accessed column-wise, C element (i,j) fixed
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: A row (i,*) accessed row-wise, B column (*,j) accessed column-wise, C element (i,j) fixed
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A element (i,k) fixed, B row (k,*) accessed row-wise, C row (i,*) accessed row-wise
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: A element (i,k) fixed, B row (k,*) accessed row-wise, C row (i,*) accessed row-wise
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column (*,k) accessed column-wise, B element (k,j) fixed, C column (*,j) accessed column-wise
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: A column (*,k) accessed column-wise, B element (k,j) fixed, C column (*,j) accessed column-wise
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores, misses/iter = 1.25
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            sum = 0.0;
            for (k=0; k<n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }

kij (& ikj): 2 loads, 1 store, misses/iter = 0.5
    for (k=0; k<n; k++) {
        for (i=0; i<n; i++) {
            r = a[i][k];
            for (j=0; j<n; j++)
                c[i][j] += r * b[k][j];
        }
    }

jki (& kji): 2 loads, 1 store, misses/iter = 2.0
    for (j=0; j<n; j++) {
        for (k=0; k<n; k++) {
            r = b[k][j];
            for (i=0; i<n; i++)
                c[i][j] += a[i][k] * r;
        }
    }
Pentium Matrix Multiply Performance
Miss rates are helpful but not perfect predictors.
Code scheduling matters, too.
[plot: cycles/iteration (0-60) vs. array size n (25-400) for the kji, jki, kij, ikj, jik, and ijk versions]
Improving Temporal Locality by Blocking
Example: Blocked matrix multiplication
"Block" (in this context) does not mean a cache block.
Instead, it means a sub-block within the matrix.
Example: N = 8; sub-block size = 4

    [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
    [ A21 A22 ] X [ B21 B22 ] = [ C21 C22 ]

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
    C11 = A11B11 + A12B21
    C12 = A11B12 + A12B22
    C21 = A21B11 + A22B21
    C22 = A21B12 + A22B22
Blocked Matrix Multiply (bijk)
for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
            c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
            for (j=jj; j < min(jj+bsize,n); j++) {
                sum = 0.0;
                for (k=kk; k < min(kk+bsize,n); k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] += sum;
            }
        }
    }
}
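The fragment above assumes that min(), the matrices, n, bsize, and the loop indices are all declared in the enclosing function; one way to package it, sketched here with a hypothetical bmm() wrapper and a fixed MAXN bound (neither is from the slides), is:

#define MAXN 512
#define min(a, b) ((a) < (b) ? (a) : (b))

void bmm(double a[MAXN][MAXN], double b[MAXN][MAXN], double c[MAXN][MAXN],
         int n, int bsize)
{
    int i, j, k, jj, kk;
    double sum;

    for (jj = 0; jj < n; jj += bsize) {
        for (i = 0; i < n; i++)                        /* zero the C columns */
            for (j = jj; j < min(jj + bsize, n); j++)  /* in this block */
                c[i][j] = 0.0;
        for (kk = 0; kk < n; kk += bsize) {
            for (i = 0; i < n; i++) {
                for (j = jj; j < min(jj + bsize, n); j++) {
                    sum = 0.0;
                    for (k = kk; k < min(kk + bsize, n); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
            }
        }
    }
}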
Blocked Matrix Multiply Analysis
Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C
Loop over i steps through n row slivers of A & C, using the same block of B

    for (i=0; i<n; i++) {
        for (j=jj; j < min(jj+bsize,n); j++) {
            sum = 0.0;
            for (k=kk; k < min(kk+bsize,n); k++) {
                sum += a[i][k] * b[k][j];
            }
            c[i][j] += sum;
        }
    }

[diagram: a row sliver of A is accessed bsize times, the block of B is reused n times in succession, and successive elements of the C sliver are updated]
Pentium Blocked Matrix Multiply Performance
Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
Relatively insensitive to array size.
[plot: cycles/iteration (0-60) vs. array size n for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25)]
Concluding Observations
Programmer can optimize for cache performance
How data structures are organized
How data are accessed
Nested loop structure
Blocking is a general technique
All systems favor cache-friendly code
Getting absolute optimum performance is very platform specific
Cache sizes, line sizes, associativities, etc.
Can get most of the advantage with generic code
Keep working set reasonably small (temporal locality)
Use small strides (spatial locality)