Lecture 13-16 Post
Caching: A Simple First Example

[Figure: a four-entry direct-mapped cache (Index 00-11, each entry with Valid, Tag, and Data fields) in front of a main memory of one-word blocks at word addresses 0000xx through 1111xx, where xx is the byte offset]
One-word blocks; the two low-order address bits define the byte in the word (32b words)
Q1: Is it there?
Q2: How do we find it? Use the next 2 low-order memory address bits – the index – to select the cache block
For the example reference stream: 8 requests, 4 misses
What kind of locality are we taking advantage of?

MIPS Direct Mapped 4KB Cache Example

One-word blocks, cache size = 1K words (or 4KB)
[Figure: the 32-bit address (bits 31 30 ... 13 12 | 11 ... 2 | 1 0) splits into a 20-bit Tag, a 10-bit Index that selects one of the 1024 (Valid, Tag, Data) entries, and a 2-bit byte offset; Hit is asserted when the selected entry is valid and its tag matches, and the 32-bit Data word is returned]
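To make the address breakdown concrete, here is a minimal sketch (mine, not from the slides) that splits a 32-bit byte address into the tag, index, and byte-offset fields of the 4KB direct-mapped cache above; the function name and example addresses are made up for illustration:

```python
def split_address(addr: int, index_bits: int = 10, offset_bits: int = 2):
    """Split a 32-bit byte address into (tag, index, byte offset).

    Assumes a direct-mapped cache with one-word (4-byte) blocks and
    2**index_bits entries, i.e. the 1K-word / 4KB example above.
    """
    offset = addr & ((1 << offset_bits) - 1)                  # bits 1..0: byte within the word
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)   # bits 11..2: cache index
    tag = addr >> (offset_bits + index_bits)                  # bits 31..12: 20-bit tag
    return tag, index, offset

# Two addresses that map to the same cache entry (same index, different tags)
# and would therefore conflict in a direct-mapped cache.
print(split_address(0x0000_1234))
print(split_address(0x0040_1234))
```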
For a block (line) size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
What is the size of the tag field? 32 – (n + m + 2) bits, for a cache with 2^n blocks
The total number of bits in a direct-mapped cache is then
  2^n x (block size + tag field size + valid field size)
How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address? 16KB = 4K words (2^12), i.e. 1024 blocks (2^10):
  2^10 x [ 4x32b data + (32-10-2-2)b tag + 1b valid ] = 147 Kb
…about 1.15 times as many bits as needed just for the data storage

[Figure: miss rate vs. block size (16, 32, 64, 128, 256 bytes) for several cache sizes, including 64 KB and 256 KB]
Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)
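The 147 Kb figure above can be checked with a few lines of arithmetic. This is a sketch of my own (function name and parameters are illustrative), assuming 32-bit words and one valid bit per block:

```python
def dm_cache_bits(n_index_bits: int, words_per_block: int, addr_bits: int = 32) -> int:
    """Total bits in a direct-mapped cache: 2^n x (data + tag + valid)."""
    m = words_per_block.bit_length() - 1           # block-offset bits (2^m words per block)
    data_bits = words_per_block * 32               # 32-bit words
    tag_bits = addr_bits - (n_index_bits + m + 2)  # 2 bits of byte offset
    valid_bits = 1
    return (1 << n_index_bits) * (data_bits + tag_bits + valid_bits)

# 16KB of data with 4-word blocks -> 1024 blocks (n = 10)
total = dm_cache_bits(n_index_bits=10, words_per_block=4)
print(total, total / 1024)   # 150528 bits = 147 Kbits, ~1.15x the 128 Kbit data array
```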
Handling Cache Hits

Read hits (I$ and D$)
  this is what we want!
Write hits (D$ only)
  If we require the cache and memory to be consistent
  - always write the data into both the cache block and the next level in the memory hierarchy (write-through)
  - writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a write buffer and stall only if the write buffer is full
  If we allow cache and memory to be inconsistent
  - write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is “evicted”)
  - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a write buffer to help “buffer” write-backs of dirty blocks
  (a short sketch of the two write policies follows the 3 Cs list below)

Sources of Cache Misses (3 Cs)

Compulsory (cold start or process migration, first reference):
  First access to a block, “cold” fact of life, not a whole lot you can do about it. If you are going to run “millions” of instructions, compulsory misses are insignificant
  Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
Capacity:
  Cache cannot contain all blocks accessed by the program
  Solution: increase cache size (may increase access time)
Conflict (collision):
  Multiple memory locations mapped to the same cache location
  Solution 1: increase cache size
  Solution 2: increase associativity (stay tuned) (may increase access time)
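As referenced above, here is a minimal sketch of the two write-hit policies (hypothetical Python of my own, with memory standing in for the next level of the hierarchy and each cache line modeled as a small dict):

```python
# Minimal write-hit sketch: each cache line is {"tag": ..., "data": ..., "dirty": bool}.
def write_hit_write_through(line, value, memory, addr):
    """Write-through: update the cache block AND the next level (slow, or buffered)."""
    line["data"] = value
    memory[addr] = value          # in real hardware this usually goes through a write buffer

def write_hit_write_back(line, value):
    """Write-back: update only the cache block and mark it dirty;
    memory is updated later, when the block is evicted."""
    line["data"] = value
    line["dirty"] = True

def evict_write_back(line, memory, addr):
    """On eviction, a dirty block must be written back to the next level."""
    if line["dirty"]:
        memory[addr] = line["data"]
        line["dirty"] = False

memory = {0x1000: 7}
line = {"tag": 0x1, "data": 7, "dirty": False}
write_hit_write_through(line, 42, memory, 0x1000)   # memory updated immediately
write_hit_write_back(line, 43)                      # only the cached copy changes
evict_write_back(line, memory, 0x1000)              # memory catches up at eviction
print(memory[0x1000])                               # 43
```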
Miss Rates per Cache Miss Type
[Figure: miss rates broken down by miss type]

Handling Cache Misses (Single Word Blocks)
Read misses (I$ and D$)
stall the pipeline, fetch the block from the next level in the memory
hierarchy, install it in the cache (which may involve having to evict a
dirty block if using a write-back cache), and send the requested
word to the core, then let the pipeline resume.
Write misses (D$ only)
Since no data is returned to the requester on write operations, a
decision needs to be made on write misses: should the block be
loaded into the cache or not?
Write allocate – just write the word (and its tag) into the cache
(which may involve having to evict a dirty block if using a write-back
cache), no need to check for cache hit, no need to stall
No-write allocate – skip the cache write (but must invalidate that
cache block since it will now hold stale data) and just write the word
to the write buffer (and eventually to the next memory level), no
need to stall if the write buffer isn’t full
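A correspondingly small sketch (again hypothetical, and a simplification of the two options just described) of write-miss handling for a single-word-block D$; write_buffer stands in for the buffer in front of the next memory level:

```python
def handle_write_miss(cache, set_index, tag, value, write_buffer, write_allocate=True):
    """Sketch of the two write-miss options for a single-word-block D$."""
    if write_allocate:
        # Write allocate: install the word and its tag directly in the cache
        # (a dirty victim would first be written back in a write-back cache).
        cache[set_index] = {"valid": True, "tag": tag, "data": value, "dirty": True}
    else:
        # No-write allocate: send the word to the write buffer, and make sure the
        # cache cannot keep a stale copy of this address.
        line = cache.get(set_index)
        if line and line["valid"] and line["tag"] == tag:
            line["valid"] = False
        write_buffer.append((tag, set_index, value))   # drained to the next level later

cache, buf = {}, []
handle_write_miss(cache, 5, 0x12, 99, buf, write_allocate=True)
handle_write_miss(cache, 7, 0x34, 11, buf, write_allocate=False)
print(cache, buf)
```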
Multiword Block Considerations

Read misses (I$ and D$)
  Processed the same as for single-word blocks – a miss returns the entire block from memory
  Miss penalty grows as block size grows
  - Early restart – core resumes execution as soon as the requested word of the block is returned
  - Requested word first – requested word is transferred from the memory to the cache (and core) first
  - Nonblocking cache – allows the core to continue to access the cache while the cache is handling an earlier miss
Write misses (D$ only)
  If using write allocate, must first fetch the block from memory and then write the word to the block

Measuring Cache Performance

Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU time = IC × CPI × CC
           = IC × (CPI_ideal + Memory-stall cycles) × CC
  where the term in parentheses is CPI_stall
Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)
  Read-stall cycles = reads/program × read miss rate × read miss penalty
  Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
For write-through caches, we can simplify this to
  Memory-stall cycles = accesses/program × miss rate × miss penalty
For write-back caches, additional stalls arising from write-backs of dirty blocks should also be considered
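Plugging made-up numbers into the formulas above (the instruction mix, miss rates, and penalty below are illustrative, not from the lecture):

```python
# Hypothetical numbers: 36% of instructions are loads/stores, I$ miss rate 2%,
# D$ miss rate 4%, miss penalty 100 cycles, ideal CPI of 1.
cpi_ideal     = 1.0
mem_per_instr = 0.36          # data accesses per instruction
icache_miss   = 0.02
dcache_miss   = 0.04
miss_penalty  = 100

# Write-through simplification: stalls = accesses/instruction x miss rate x penalty
instr_stalls = 1.0 * icache_miss * miss_penalty            # every instruction is fetched
data_stalls  = mem_per_instr * dcache_miss * miss_penalty
cpi_stall = cpi_ideal + instr_stalls + data_stalls
print(cpi_stall)   # 1 + 2.0 + 1.44 = 4.44
```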
The previous examples and equations assume that the hit time is not a factor in determining cache performance
Clearly, a larger cache will have a longer access time; an increase in hit time will likely add another stage to the pipeline
At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance

Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses
  AMAT = Time for a hit + Miss rate × Miss penalty
What is the AMAT for a core with a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?
  AMAT = 1 + 0.02 × 50 = 2 cycles

Reducing miss rate
  More associativity
  Alternatives/enhancements to associativity
  - Victim caches, hashing, pseudo-associativity, skewed associativity
  Better replacement/insertion policies
  Software approaches

Reducing miss latency/cost
  Multi-level caches
  Critical word first
  Subblocking/sectoring
  Better replacement/insertion policies
  Non-blocking caches (multiple cache misses in parallel)
  Multiple accesses per cycle
  Software approaches
Implementing LRU

Idea: Evict the least recently accessed block
Problem: Need to keep track of the access ordering of blocks
Question: in a 2-way set associative cache, what do you need to implement LRU? (see the sketch below)

Approximations of LRU

Most modern processors do not implement “true LRU” in highly-associative caches
Why?
  True LRU is complex
  LRU is itself only an approximation to predicting locality anyway (i.e., not the best possible replacement policy)
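For the 2-way question above, a single LRU bit per set is enough; here is a small sketch of my own (class and method names are illustrative):

```python
class TwoWayLRUSet:
    """One set of a 2-way set-associative cache with true LRU replacement.
    A single bit per set suffices: it names the least-recently-used way."""

    def __init__(self):
        self.tags = [None, None]   # tag stored in way 0 and way 1
        self.lru = 0               # index of the least-recently-used way

    def access(self, tag):
        if tag in self.tags:                      # hit
            way = self.tags.index(tag)
            self.lru = 1 - way                    # the other way is now LRU
            return True
        way = self.lru                            # miss: evict the LRU way
        self.tags[way] = tag
        self.lru = 1 - way
        return False

s = TwoWayLRUSet()
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
# [False, False, True, False, False] -- C evicts B (the LRU block), so B misses again
```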
Comparing Cache Memory Architectures

[Figures: Intel i7 and ARM Cortex-A8 data cache miss rates]

Improving Cache Performance #3: Hardware Prefetching

Fetch blocks into the cache proactively (speculatively)
Key is to anticipate the upcoming miss addresses accurately
Relies on having unused memory bandwidth available

[Figure: core/cache diagram with a 1-bit Valid and 1-bit Read/Write per block; on allocation the old block is written back (if dirty, otherwise it is clean and simply dropped) and the new block is read in]
Block size is the data that is associated with an address tag, not necessarily the unit of transfer between hierarchies
  - Sub-blocking: a block divided into multiple pieces (each with a V bit) – can improve “write” performance

Large cache blocks can take a long time to fill into the cache
  - fill the cache line critical word first
  - restart the cache access before the fill is complete
Associativity

How many blocks can map to the same index (or set)?
Larger associativity
  lower miss rate, less variation among programs
  diminishing returns, higher hit latency
Smaller associativity
  lower cost
  lower hit latency
  - Especially important for L1 caches
Power of 2 associativity? (a small sketch of the set-index arithmetic follows the SMP diagram below)
[Plot: hit rate vs. associativity]

Shared-memory Multiprocessors (SMP)

[Figure: several processors (P), each with a private cache ($), sharing a single Memory]
Intel Coffee Lake processor, conceptual diagram: 6 cores + graphics co-processor, shared LLC
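As noted above, here is a small sketch (mine, not from the slides) of how associativity changes the set-index computation; the cache geometry numbers are illustrative:

```python
def set_index(addr, num_blocks=1024, assoc=1, block_bytes=4):
    """Set that a byte address maps to: num_blocks/assoc sets of assoc ways each."""
    num_sets = num_blocks // assoc
    return (addr // block_bytes) % num_sets

# Two addresses exactly one (direct-mapped) cache size apart: they always land in
# the same set, but once assoc >= 2 that set has room for both blocks, so they
# stop conflicting with each other.
a, b = 0x0000_1000, 0x0000_2000
for ways in (1, 2, 4):
    print(ways, set_index(a, assoc=ways), set_index(b, assoc=ways))
```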
Complexities of an SMP

• Cache coherence: defines the ordering of writes to a single address location
• Memory consistency: defines the ordering of reads and writes to all memory locations
• Synchronization: allowing only one processor to access data at a time

[Figure: three processors (P) with private caches ($) on a shared bus to Memory; one cache holds X = -100]
Example (Write-through Cache)

• All the writes will be shown as a transaction on the shared bus to memory
• Two protocols
  • Update-based Protocol
  • Invalidation-based Protocol

Defining Coherence

[Figures: three processors with private caches on a shared bus to Memory; one processor, then another, performs Load X and each receives a copy of X = 505, while every bus transaction is snooped by the other caches]
[State diagram: per-line states of the write-through cache, with transitions labeled PrRd / ---, PrWr / BusWr, and BusWr / ---]
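The transition labels above (PrRd / ---, PrWr / BusWr, BusWr / ---) appear to come from the two-state (Valid/Invalid) controller of a write-through invalidation protocol. Here is a rough sketch under my own naming, assuming write-allocate on processor writes:

```python
class VIWriteThroughLine:
    """One cache line's controller for the 2-state (Valid/Invalid) write-through
    invalidation protocol.  Processor events: pr_read / pr_write.  Snooped bus
    event: snoop_bus_write.  'bus' is just a list collecting bus transactions."""

    def __init__(self):
        self.state = "Invalid"

    def pr_read(self, bus):
        if self.state == "Invalid":          # read miss: fetch the block
            bus.append("BusRd")
            self.state = "Valid"
        # Valid: PrRd / --- (hit, no bus transaction)

    def pr_write(self, bus):
        bus.append("BusWr")                  # PrWr / BusWr: every write goes to memory
        self.state = "Valid"                 # assumes write-allocate on a write miss

    def snoop_bus_write(self):
        self.state = "Invalid"               # BusWr from another cache: invalidate our copy

bus = []
a, b = VIWriteThroughLine(), VIWriteThroughLine()
a.pr_read(bus); b.pr_read(bus)               # both caches get a Valid copy
a.pr_write(bus); b.snoop_bus_write()         # a's write invalidates b's copy
print(bus, a.state, b.state)                 # ['BusRd', 'BusRd', 'BusWr'] Valid Invalid
```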
Cache Coherence Protocol

[Figure: processors perform Store X over the shared bus; a cache ends up holding X = 444 while the bus transaction is snooped by the other caches]

MSI Writeback Invalidation Protocol
(Invalidation-based Protocol on Writeback cache)

• Modified
  • Dirty
  • Only this cache has a valid copy
• Shared
  • Memory is consistent
  • One or more caches have a valid copy
• Invalid
• Writeback protocol: a cache line can be written multiple times before the memory is updated.
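A compact sketch of the MSI transitions described above (my own simplification; a real protocol also handles data responses, write-backs on eviction, and additional events):

```python
# Simplified MSI transitions for one cache line, keyed by (state, event).
# Processor events: PrRd, PrWr.  Snooped bus events: BusRd, BusRdX (read for ownership).
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),    # read miss: fetch a shared copy
    ("I", "PrWr"):   ("M", "BusRdX"),   # write miss: fetch with intent to modify
    ("I", "BusRd"):  ("I", None),       # snoops are ignored while Invalid
    ("I", "BusRdX"): ("I", None),
    ("S", "PrRd"):   ("S", None),       # read hit
    ("S", "PrWr"):   ("M", "BusRdX"),   # upgrade: invalidate the other copies
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),       # another core wants to write: invalidate
    ("M", "PrRd"):   ("M", None),       # hit; line is exclusive and dirty
    ("M", "PrWr"):   ("M", None),       # write hit: no bus traffic (writeback cache)
    ("M", "BusRd"):  ("S", "Flush"),    # supply the dirty data, keep a shared copy
    ("M", "BusRdX"): ("I", "Flush"),    # supply the dirty data, then invalidate
}

state = "I"
for event in ("PrRd", "PrWr", "BusRd"):     # read miss, write upgrade, then a remote read
    state, bus_action = MSI[(state, event)]
    print(event, "->", state, bus_action)
```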
MSI Summary

• MSI ensures single-writer-multiple-reader
  • Often called the SWMR (pronounced as “swimmer”) invariant
• We can still do better
  • MESI, MOESI, ….
  • I will not go into further detail.

CSE 431 Computer Architecture, Spring 2024
Exploiting the Memory Hierarchy: TLBs
Review: Caches

Caches exploit temporal and spatial locality
  Convert “data reuse” into “data locality”
Managed by hardware
Built typically as a hierarchy (e.g., L1-L2-L3)
Hardware optimizations include flexible placement of data, prefetching, careful hierarchy design
Software optimizations include computation reordering and data layout restructuring

Review: The Memory Hierarchy

Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
[Figure: Core – L1$ – L2$ – Main Memory – Secondary Memory, with increasing distance from the processor in access time; transfer units grow from 4-8 bytes (word) between core and L1$, to 8-32 bytes (block) between L1$ and L2$, to 1 to 4 blocks between L2$ and main memory, to 1,024+ bytes (disk sector = page) between main memory and secondary memory]
Inclusive – what is in L1$ is a subset of what is in L2$
Noninclusive – what is in L1$ is not necessarily a subset of what is in L2$
Per-entry status bits (TLB and page table): V = Valid?, Dirty = is the page dirty (so it will have to be written back on replacement)?, Ref = Referenced recently?, Access = Write access allowed?
TLB access time is typically much smaller than cache
access time (because TLBs are much smaller than caches)
TLBs are typically not more than 512 entries even on high end
machines
A TLB in the Memory Hierarchy

[Figure: the core presents a virtual address (VA) to the TLB Lookup (¼ t); on a hit the physical address (PA) goes to the Cache (¾ t), which returns the data on a hit or goes to Main Memory on a miss; on a TLB miss the Page Table provides the translation]

A TLB miss – is it a page fault or merely a TLB miss?
  If the page is loaded into main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB
  - Takes 10’s of cycles to find and load the translation info into the TLB
  If the page is not in main memory, then it’s a true page fault
  - Takes 1,000,000’s of cycles to service a page fault
  - Page faults can be handled in software because the overhead will be small compared to the disk access time
  TLB misses are much more frequent than true page faults

TLB Event Combinations

  TLB   Page Table   Cache      Possible? Under what circumstances?
  Hit   Hit          Hit        Yes – this is what we want!
  Hit   Hit          Miss       Yes – although the page table is not checked after the TLB hits
  Miss  Hit          Hit        Yes – TLB missed, but PA is in page table and data is in cache; update TLB
  Miss  Hit          Miss       Yes – TLB missed, but PA is in page table, data not in cache; update TLB
  Miss  Miss         Miss       Yes – page fault; OS takes control
  Hit   Miss         Miss/Hit   No – TLB translation is not possible if the page is not present in main memory
  Miss  Miss         Hit        No – data is not allowed in the cache if the page is not in memory
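A toy sketch (not from the lecture; the data structures are simplified dicts) of the lookup order behind the table above. It tries the TLB first, falls back to the page table, and takes a page fault if the page is not in memory:

```python
def translate(vpn, tlb, page_table):
    """Return the physical page number (PPN) for a virtual page number (VPN).
    tlb maps VPN -> PPN; page_table maps VPN -> (resident_in_memory, PPN)."""
    if vpn in tlb:                      # TLB hit: fast path, no page-table access
        return tlb[vpn]
    if vpn in page_table and page_table[vpn][0]:
        ppn = page_table[vpn][1]        # TLB miss, page in memory: ~10's of cycles
        tlb[vpn] = ppn                  # install the translation in the TLB
        return ppn
    raise RuntimeError("page fault")    # OS takes control: ~1,000,000's of cycles

tlb = {}
page_table = {0x12: (True, 0x7A), 0x13: (False, None)}
print(hex(translate(0x12, tlb, page_table)))   # TLB miss, then hit in the page table
print(hex(translate(0x12, tlb, page_table)))   # now a TLB hit
```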
Software-managed TLB
Data structure design is flexible since the OS controls the page
table walk
Miss handler is also instructions
- It may itself miss in the instruction cache
Data cache may be polluted by the page table walk
Two programs which are sharing data will have two different virtual addresses for the same physical address – aliasing – so they will have two copies of the shared data in the cache and two entries in the TLB, which would lead to coherence issues
- Must update all cache entries with the same physical address or the memory becomes inconsistent
Possible cache organizations:
1) physically-indexed, physically-tagged
2) virtually-indexed, virtually-tagged
3) virtually-indexed, physically-tagged