COMP 206: Computer Architecture and Implementation
Montek Singh
Mon, Oct 31, 2005 Topic: Memory Hierarchy Design (HP3 Ch. 5)
(Caches, Main Memory and Virtual Memory)
Outline
  Motivation for caches
    Principle of locality
  Levels of the memory hierarchy
  Cache organization
  Cache read/write policies
    Block replacement policies
    Write-back vs. write-through caches
    Write buffers
Reading: HP3 Sections 5.1-5.2
The Big Picture: Where Are We Now?
The five classic components of a computer:
  Processor (Control + Datapath), Memory, Input, Output
This lecture (and the next few): the Memory System
The Motivation for Caches
Motivation:
  Large (cheap) memories (DRAM) are slow
  Small (costly) memories (SRAM) are fast
Make the average access time small:
  service most accesses from a small, fast memory
  reduce the bandwidth required of the large memory
[Diagram: memory system -- Processor <-> Cache <-> DRAM]
The Principle of Locality
[Figure: frequency of reference vs. address (0 to 2^n) -- accesses cluster in a few hot regions of the address space]
A program accesses a relatively small portion of the address space at any instant of time
  Example: 90% of the time in 10% of the code
Two different types of locality:
  Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon
  Spatial locality (locality in space): if an item is referenced, items close by tend to be referenced soon
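Spatial locality is what block-sized fetches exploit. The toy model below (parameters assumed for illustration, not from the slides) counts how often a sequential scan of 4-byte words lands in a 64-byte block that has already been fetched:

```python
# Toy illustration of spatial locality (hypothetical parameters): a
# sequential scan of 4-byte words touches each 64-byte block 16 times,
# so with block-sized fetches only the first access to a block misses.

BLOCK_SIZE = 64   # bytes per cache block (assumed)
WORD_SIZE = 4     # bytes per access (assumed)

def sequential_scan_hit_rate(num_words):
    """Fraction of accesses served from an already-fetched block."""
    seen_blocks = set()
    hits = 0
    for i in range(num_words):
        addr = i * WORD_SIZE
        block = addr // BLOCK_SIZE
        if block in seen_blocks:
            hits += 1               # spatial locality: block already present
        else:
            seen_blocks.add(block)  # first touch of this block -> miss
    return hits / num_words

print(sequential_scan_hit_rate(1024))  # 64 misses out of 1024 -> 0.9375
```

With 16 words per block, 15 of every 16 accesses hit, which is why fetching whole blocks pays off for scans.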
Levels of the Memory Hierarchy

  Level           Capacity       Access time  Cost/bit             Transfer unit      Managed by
  CPU registers   500 bytes      0.25 ns      ~$.01                words (1-8 B)      programmer/compiler
  Cache (L1, L2)  16K-1M bytes   1 ns         ~$.0001              blocks (8-128 B)   cache controller
  Main memory     64M-2G bytes   100 ns       ~$.0000001           pages (4-64 KB)    OS
  Disk            100 G bytes    5 ms         10^-5 - 10^-7 cents  files (Mbytes)     user/operator
  Tape/Network    infinite       seconds      10^-8 cents

Upper levels are smaller and faster; lower levels are larger and slower. Each level stages data for the level above it, moving data in the transfer unit shown.
Memory Hierarchy: Principles of Operation
At any given time, data is copied between only 2 adjacent levels
  Upper level (cache): the one closer to the processor
    Smaller, faster, and uses more expensive technology
  Lower level (memory): the one further away from the processor
    Bigger, slower, and uses less expensive technology
Block: the smallest unit of information that can either be present or not present in the two-level hierarchy
[Diagram: the processor reads/writes the upper level (cache, holding Blk X); blocks such as Blk Y are transferred between the upper level and the lower level (memory)]
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (e.g., Blk X in the previous slide)
  Hit rate = fraction of memory accesses found in the upper level
  Hit time = time to access the upper level
    = memory access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (e.g., Blk Y in the previous slide)
  Miss rate = 1 - (hit rate)
  Miss penalty: includes the time to fetch a new block from the lower level
    = time to replace a block in the upper level + time to deliver the block to the processor
Hit time is significantly less than the miss penalty
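These quantities combine into the standard average memory access time (AMAT) formula from this chapter: AMAT = hit time + miss rate x miss penalty. The numbers in the sketch below are illustrative, not from the slides:

```python
# Average memory access time for a two-level hierarchy:
#   AMAT = hit_time + miss_rate * miss_penalty
# All times in the same units (ns here); parameters are illustrative.

def amat(hit_time, miss_rate, miss_penalty):
    """Average access time of a two-level memory hierarchy."""
    return hit_time + miss_rate * miss_penalty

# 1 ns cache hit, 5% miss rate, 100 ns to fetch a block from main memory:
print(amat(1.0, 0.05, 100.0))  # 6.0 ns on average
```

Note how a small miss rate still dominates the average when the miss penalty is two orders of magnitude larger than the hit time.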
Cache Addressing
[Diagram: sets 0..j-1; each set holds blocks 0..k-1 plus replacement info; each block holds a tag and sectors 0..m-1; each sector holds bytes 0..n-1 plus Valid, Dirty, and Shared bits]
Block/line is the unit of allocation
Sector/sub-block is the unit of transfer and coherence
Cache parameters j, k, m, n are integers, and generally powers of 2
Cache Shapes (16 blocks total; A = associativity, S = number of sets)
  Direct-mapped (A = 1, S = 16)
  2-way set-associative (A = 2, S = 8)
  4-way set-associative (A = 4, S = 4)
  8-way set-associative (A = 8, S = 2)
  Fully associative (A = 16, S = 1)
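For a fixed capacity, changing A only changes how an address is split into tag, set index, and block offset. A small sketch, assuming 16 blocks of 32 bytes (the parameters above; the block size is an assumption for illustration):

```python
# How a byte address splits into (tag, set index, block offset) as the
# associativity A varies with total capacity fixed at 16 blocks.
# Block size of 32 bytes is assumed for illustration.

NUM_BLOCKS = 16
BLOCK_SIZE = 32  # bytes

def decompose(addr, assoc):
    """Return (tag, set_index, offset) for an A-way cache."""
    num_sets = NUM_BLOCKS // assoc          # S = 16 / A, as in the shapes above
    offset = addr % BLOCK_SIZE              # byte within the block
    set_index = (addr // BLOCK_SIZE) % num_sets
    tag = addr // (BLOCK_SIZE * num_sets)   # remaining high-order bits
    return tag, set_index, offset

addr = 0x1234
for assoc in (1, 2, 4, 8, 16):
    print(assoc, decompose(addr, assoc))
```

At A = 16 (fully associative) the set index disappears entirely and the whole block number becomes the tag.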
Examples of Cache Configurations

  # Sets  # Blocks  # Sectors  # Bytes  Name
  1       k         m          n        Fully associative
  j       1         m          n        Direct mapped
  j       k         1          n        A cache that is not sectored
  j       4         m          n        4-way set-associative cache
  64      8         2          32       PowerPC 601
Storage Overhead of Cache

  Total number of bits = j x [repl + k x (tag + m x (3 + 8n))]
  Number of data bits  = 8jkmn
  Storage overhead     = (repl + k x tag + 3km) / (8kmn)

  System               # Address bits  (j,k,m,n)     Cache size  Storage overhead
  IBM 360/85           24              (1,16,16,64)  16 KB       0.85%
  IBM 3033             32              (64,16,1,64)  64 KB       5.95%
  Motorola 68030       32              (24,4,2,2)    256 B       28.10%
  Intel i486           32              (128,4,1,16)  8 KB        19.90%
  DEC Alpha AXP 21064  34              (256,1,1,32)  8 KB        9.37%
  IBM PowerPC 601      32              (64,8,2,32)   32 KB       5.76%
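A small calculator for the overhead from the (j,k,m,n) parameters. It assumes 3 status bits per sector and takes the per-set replacement-bit count `repl` as a parameter; with repl = 0 it reproduces the DEC Alpha AXP 21064 entry (9.37%):

```python
import math

# Storage overhead of a cache described by (j sets, k blocks/set,
# m sectors/block, n bytes/sector) and a given address width.
# Assumes 3 status bits per sector and 'repl' replacement bits per set.

def storage_overhead(j, k, m, n, addr_bits, repl=0):
    offset_bits = int(math.log2(m * n))   # byte within a block
    index_bits = int(math.log2(j))        # which set
    tag_bits = addr_bits - index_bits - offset_bits
    overhead_bits = j * (repl + k * (tag_bits + 3 * m))
    data_bits = 8 * j * k * m * n
    return overhead_bits / data_bits

# DEC Alpha AXP 21064 from the table: (256,1,1,32), 34 address bits,
# direct mapped so no replacement bits are needed.
print(100 * storage_overhead(256, 1, 1, 32, 34))  # 9.375, the 9.37% entry
```

Entries with set-associative organizations (i486, PowerPC 601) need a nonzero `repl` to match the table exactly.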
Cache Organization
Direct-mapped cache:
  Each memory location can be mapped to only 1 cache location
  No need to make any decision :-)
    The current item replaces the previous item in that cache location
N-way set-associative cache:
  Each memory location has a choice of N cache locations
Fully associative cache:
  Each memory location can be placed in ANY cache location
Cache miss in an N-way set-associative or fully associative cache:
  Bring in the new block from memory
  Throw out a cache block to make room for the new block
  Need to decide which block to throw out!
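The direct-mapped case above needs no replacement decision at all, which a minimal tags-only model (sizes assumed for illustration) makes concrete:

```python
# Minimal direct-mapped cache model (tags only, no data): each memory
# block maps to exactly one slot, and a new block simply replaces
# whatever was there. Sizes are illustrative.

NUM_SLOTS = 8
BLOCK_SIZE = 16  # bytes

class DirectMapped:
    def __init__(self):
        self.tags = [None] * NUM_SLOTS    # one tag per slot; None = invalid

    def access(self, addr):
        """Return True on a hit; on a miss, install the new block."""
        block = addr // BLOCK_SIZE
        slot = block % NUM_SLOTS          # the only slot this block can use
        tag = block // NUM_SLOTS
        if self.tags[slot] == tag:
            return True
        self.tags[slot] = tag             # no decision needed: just replace
        return False

c = DirectMapped()
print(c.access(0x00))   # False: cold miss
print(c.access(0x04))   # True: same 16-byte block
print(c.access(0x80))   # False: block 8 maps to slot 0, evicting block 0
print(c.access(0x00))   # False: conflict miss -- block 0 was evicted
```

The last two accesses show the downside: two blocks that share a slot keep evicting each other even when the rest of the cache is empty.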
Write Allocate versus Write No-Allocate
Assume that a write to a memory location causes a cache miss
Do we read in the block?
  Yes: Write Allocate
  No: Write No-Allocate
Basics of Cache Operation: Overview

              Write Through                          Write Back
  READ hit    CPU reads from cache                   CPU reads from cache
  READ miss   Allocate and load block from MM,       Allocate and load block from MM,
              then CPU reads from it                 then CPU reads from it
  WRITE hit   Write into cache plus write            Write into cache only and set the
              through into MM                        dirty bit
  WRITE miss  Write through into MM, with or         Write allocate, then write into cache
              without write allocate                 and set the dirty bit (so that on
                                                     replacement, a block is written back
                                                     to MM only if modified)
Details of Simple Blocking Cache

Write Through:
  READ hit: CPU reads cache
  READ miss: CPU detects miss, stalls; cache selects replacement block; new block loaded from MM; requested word sent to CPU; CPU resumes operation
  WRITE hit: CPU writes cache; CPU writes MM and stalls until the write completes
  WRITE miss: CPU detects miss; CPU writes MM (cache also, if write allocate); stalls until the write completes

Write Back:
  READ hit: CPU reads cache
  READ miss: CPU detects miss, stalls; cache selects replacement block (written back to MM if dirty); new block loaded from MM; word sent to CPU; CPU resumes operation
  WRITE hit: CPU writes cache
  WRITE miss: CPU detects miss, stalls; cache selects replacement block; old block evicted from cache (written back to MM if dirty); new block loaded from MM (write allocate); CPU resumes operation
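The practical difference between the two policies shows up in main-memory write traffic. A sketch for the extreme case of repeated stores to one cached block (counts only, no timing):

```python
# Main-memory write traffic for repeated stores to a single cached
# block: write-through sends every store to MM; write-back only writes
# the block once, when it is eventually replaced.

def mm_writes(policy, num_stores):
    """Count MM writes for 'num_stores' stores to one resident block."""
    writes = 0
    dirty = False
    for _ in range(num_stores):
        if policy == "write-through":
            writes += 1          # every store goes through to MM
        else:                    # write-back: just set the dirty bit
            dirty = True
    if policy == "write-back" and dirty:
        writes += 1              # single write-back on replacement
    return writes

print(mm_writes("write-through", 100))  # 100 MM writes
print(mm_writes("write-back", 100))     # 1 MM write
```

This is the "greatly reduces the memory bandwidth requirement" advantage of write back, at the cost of the dirty bit and more complex control.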
A-way Set-Associative Cache
A-way set associative: A entries for each cache index
  = A direct-mapped caches operating in parallel
Example: two-way set-associative cache
  Cache Index selects a set from the cache
  The two tags in the set are compared in parallel
  Data is selected based on the tag comparison result
[Diagram: the Cache Index selects one set; each of the two ways holds a valid bit, cache tag, and cache block; two comparators match the address tag against both ways; their outputs are ORed to form Hit and drive a mux (SEL0/SEL1) that selects the hitting way's cache block]
Fully Associative Cache
Push the set-associative idea to its limit!
  Forget about the Cache Index
  Compare the cache tags of all cache entries in parallel
Example: with a 32 B block size and 32-bit addresses, the tag is 27 bits (bits 31-5; bits 4-0 are the byte select), so we need N 27-bit comparators
[Diagram: every entry's 27-bit cache tag is compared against the address tag at once, qualified by the valid bit; the byte select (e.g., 0x01) picks a byte out of the 32-byte cache block]
Cache Block Replacement Policies
Random replacement:
  Hardware randomly selects a cache block and throws it out
Least Recently Used (LRU):
  Hardware keeps track of the access history
  Replace the entry that has not been used for the longest time
  For a 2-way set-associative cache, one bit suffices for LRU replacement
Example of a simple pseudo-LRU implementation:
  Assume 64 fully associative entries
  A hardware replacement pointer points to one cache entry
  Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
  Otherwise: do not move the pointer
[Diagram: replacement pointer cycling over Entry 0 .. Entry 63]
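The pointer scheme above can be sketched in a few lines (4 entries instead of 64, for brevity). The pointer is only nudged off an entry when that entry is touched, so the victim it identifies is an entry that has not been accessed recently:

```python
# Sketch of the pseudo-LRU replacement pointer described above,
# scaled down to a 4-entry fully associative cache for brevity.

NUM_ENTRIES = 4

class PseudoLRU:
    def __init__(self):
        self.pointer = 0

    def on_access(self, entry):
        # The pointer advances only when the accessed entry is the one
        # it currently points to; accesses elsewhere leave it alone.
        if entry == self.pointer:
            self.pointer = (self.pointer + 1) % NUM_ENTRIES

    def victim(self):
        return self.pointer          # entry to replace on a miss

p = PseudoLRU()
p.on_access(0)      # pointer moves off the just-used entry 0 -> now 1
p.on_access(2)      # not the pointed-to entry: pointer stays at 1
print(p.victim())   # 1
```

This is far cheaper than true LRU (a single log2(64)-bit pointer instead of full access-order bookkeeping), which is the point of the approximation.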
Cache Write Policy
Cache reads are much easier to handle than cache writes
  An instruction cache is much easier to design than a data cache
Cache writes: how do we keep the data in the cache and memory consistent?
Two options (decision time again :-)
  Write Back: write to cache only; write the cache block to memory when that cache block is being replaced on a cache miss
    Need a dirty bit for each cache block
    Greatly reduces the memory bandwidth requirement
    Control can be complex
  Write Through: write to cache and memory at the same time
    What!!! How can this be? Isn't memory too slow for this?
Write Buffer for Write Through
[Diagram: Processor -> Cache -> DRAM, with a Write Buffer between cache and DRAM]
A write buffer is needed between the cache and main memory
  Processor: writes data into the cache and the write buffer
  Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO
  Typical number of entries: 4
  Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
The memory system designer's nightmare:
  Store frequency (w.r.t. time) > 1 / DRAM write cycle
  Write buffer saturation
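The FIFO behavior and the saturation condition can be sketched with a toy cycle-by-cycle model (buffer depth from the slide; the 10-cycle DRAM write time and one-store-per-cycle issue rate are assumptions for illustration):

```python
from collections import deque

# Toy model of a 4-entry write buffer: the CPU may issue one store
# every 'store_every' cycles; the memory controller retires one
# buffered write every DRAM_CYCLES cycles. The CPU stalls (the store
# waits) whenever the FIFO is full.

BUFFER_ENTRIES = 4
DRAM_CYCLES = 10   # cycles per DRAM write (assumed)

def run(stores, store_every):
    """Return CPU stall cycles for 'stores' stores at the given rate."""
    buf = deque()
    stalls = 0
    issued = 0
    cycle = 0
    while issued < stores or buf:
        if cycle % DRAM_CYCLES == 0 and buf:
            buf.popleft()                    # one write retired to DRAM
        if issued < stores and cycle % store_every == 0:
            if len(buf) < BUFFER_ENTRIES:
                buf.append(issued)           # store absorbed by the FIFO
                issued += 1
            else:
                stalls += 1                  # buffer full: CPU must wait
        cycle += 1
    return stalls

print(run(20, 20))      # store rate well below 1/DRAM cycle: no stalls
print(run(20, 1) > 0)   # a store every cycle: the buffer saturates
```

Once the steady-state store rate exceeds the DRAM retire rate, no buffer depth helps; the FIFO only smooths short bursts.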
Write Buffer Saturation
[Diagram: Processor -> Cache -> Write Buffer -> DRAM]
Store frequency (w.r.t. time) > 1 / DRAM write cycle
  If this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row):
    the store buffer will overflow no matter how big you make it, because CPU cycle time << DRAM write cycle time
Solutions for write buffer saturation:
  Use a write-back cache
  Install a second-level (L2) cache
[Diagram: Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM]