Lectures 13-16: The Memory Hierarchy

These lecture slides discuss the memory hierarchy and why it is used. A system has multiple levels of memory, from small but fast cache levels close to the processor to larger but slower main memory and secondary storage farther away. The slides describe the characteristics and technologies used at each level (SRAM, DRAM, flash, disk) and how the hierarchy is managed between levels.


Uploaded by

Siddharth Shah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views24 pages

Lecture 13 16 Post

The document discusses memory hierarchy and why it is used. It has multiple levels of memory from fast but small cache levels close to the processor to larger but slower main memory and secondary storage further away. It describes the characteristics and technologies used at each level like SRAM, DRAM, and flash and how the hierarchy is managed between levels.

Uploaded by

Siddharth Shah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

Why Memory Hierarchy?

- We want memory that is both fast and large
- But we cannot achieve both with a single level of memory
- Idea: have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) and ensure most of the data the processor needs is kept in the fast(er) level(s)
- The number of levels keeps increasing

Memory Locality

- A "typical" program has a lot of locality in memory references
  - Typical programs are composed of "loops" (repeated sequences of instructions)
  - At any given time, they access a small fraction of data
- Temporal: a program tends to reference the same memory location many times, all within a small window of time
- Spatial: a program tends to reference a cluster of memory locations at a time
  - Most notable examples:
    1. instruction memory references
    2. array/data structure references (e.g., A[i], where i is a loop iterator)
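
As a concrete illustration of both kinds of locality, the short C loop below sums an array: each A[i] access is sequential (spatial locality), while the loop's instructions and the accumulator are reused every iteration (temporal locality). The array name and size are made up for the example.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        int A[N];
        for (int i = 0; i < N; i++) A[i] = i;   /* sequential writes: spatial locality */

        long sum = 0;
        for (int i = 0; i < N; i++) {
            sum += A[i];   /* A[i], A[i+1], ... fall in the same cache blocks (spatial);  */
        }                  /* the loop body and 'sum' are reused each iteration (temporal) */
        printf("%ld\n", sum);
        return 0;
    }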

Review: A Typical Memory Hierarchy

- Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology
- [Figure: on-chip components (RegFile and datapath, control, split L1 instruction/data caches with ITLB/DTLB, L2 cache, memory controller), a last-level cache (LLC), main memory (DRAM), and secondary memory (disks, SSDs)]
- Typical transfer units between levels: 4-8 bytes (word) between registers and L1$, 8-32 bytes (block) between L1$ and L2$, 1 to 4 words (blocks) between the caches and main memory, 1,024+ bytes (disk sector, page, block) between main memory and secondary memory
- Speed (cycles): 1/2's (registers), 1's (L1$), 10's (L2$/LLC), 100's (main memory), 10,000's (secondary memory)
- Size (bytes): 100's, 10K's, M's, G's, T's; cost/bit: highest at the top, lowest at the bottom

Characteristics of the Memory Hierarchy

- Increasing distance from the core means increasing access time
- Inclusive caches: what is in L1$ is a subset of what is in L2$
- Noninclusive caches: what is in L1$ is not (necessarily) a subset of what is in L2$
The Memory Hierarchy: Terminology

- Block (or line, or page): the minimum unit of information that is present (or not) in a level of the memory hierarchy
- Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy
  - Hit Time: time to access that level, which consists of the time to access the block + the time to determine hit/miss
- Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy = 1 - (Hit Rate)
  - Miss Penalty: time to replace a block in that level with the corresponding block from a lower level, which consists of the time to determine that there is a miss + the time to access that block in the lower level + the time to transmit that block to the level that experienced the miss + the time to insert the block in that level + the time to pass the block to the requestor
- Hit Time << Miss Penalty

Memory Hierarchy Technologies (1/2)

- Caches use SRAM for speed and technology compatibility with the core
  - Fast (typical access times of 100 psec to 2 nsec)
  - Lower density (6-transistor cells), higher power, expensive ($200 to $800 per GB)
  - Static: content will last "forever" (as long as power is left on)
- Main memory uses DRAM for size (density)
  - Slower (typical access times of 10 nsec to 70 nsec)
  - Higher density (1-transistor cells), lower power, cheaper ($5 to $10 per GB)
  - Dynamic: needs to be "refreshed" regularly (~every 64 ms); refresh consumes ~1% of the active cycles of the DRAM
  - Addresses are divided into 2 halves (row and column): RAS (Row Access Strobe) triggers the row decoder, CAS (Column Access Strobe) triggers the column selector

Memory Hierarchy Technologies (2/2)

- Flash memory is a type of EEPROM (electrically erasable programmable read-only memory)
  - Must be erased before it can be rewritten
  - Unlike disks and DRAM, but like other EEPROM technologies, writes can wear out flash memory bits
  - Worn-out bits are not reliable, so one may want to balance writes across different bits (wear leveling)
- A magnetic hard disk consists of a collection of platters, which rotate on a spindle at 5,400 to 15,000 revolutions per minute
  - To read/write information, a movable arm containing a small electromagnetic coil called a read-write head is located just above each surface
  - Each surface is divided into tracks (concentric circles) and each track is in turn divided into sectors
  - Latency = seek time + rotational delay + transfer time

How is the Hierarchy Managed?

- registers ↔ memory
  - by the compiler (programmer?)
- cache ↔ main memory
  - by the cache controller hardware
- main memory ↔ secondary storage
  - by the operating system (virtual memory); virtual-to-physical address mapping assisted by the hardware (TLB)
  - by the file system/user
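
A rough worked example of that latency formula, using assumed numbers (not from the slides) that are typical of a commodity drive: a 4 ms average seek, a 7,200 RPM spindle, and a 100 MB/s transfer rate for a 4 KB read.

    #include <stdio.h>

    int main(void) {
        /* Assumed example parameters (not from the slides) */
        double seek_ms       = 4.0;                        /* average seek time            */
        double rpm           = 7200.0;
        double rotation_ms   = 60.0 * 1000.0 / rpm;        /* one full rotation ~ 8.33 ms  */
        double rotational_ms = rotation_ms / 2.0;          /* average: half a rotation     */
        double transfer_ms   = 4096.0 / 100e6 * 1000.0;    /* 4 KB at 100 MB/s ~ 0.04 ms   */

        double latency_ms = seek_ms + rotational_ms + transfer_ms;
        printf("latency ~ %.2f ms\n", latency_ms);         /* ~8.2 ms, dominated by the
                                                              mechanical delays            */
        return 0;
    }
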
Caching

- Caching is perhaps the most important example of the big idea of prediction
- It relies on the principle of locality to try to find the desired data in the higher levels of the memory hierarchy, and provides mechanisms to ensure that when the prediction is wrong, it finds and uses the proper data from the lower levels of the memory hierarchy

Cache Basics

- We now replace the memories in the datapath we covered so far with caches
- Two questions to answer (in hardware):
  - Q1: How do we know if a data item is in the cache?
  - Q2: If it is, how do we find it?
- Direct mapped cache
  - Each memory block is mapped to exactly one block in the cache, so lots of memory blocks must share a block in the cache
  - Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache)
  - Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1)

Caching: A Simple First Example

- [Figure: a 4-entry direct mapped cache (indexes 00, 01, 10, 11, each with a valid bit, tag, and one-word data) in front of a 16-block main memory (block addresses 0000xx through 1111xx); the two low-order address bits define the byte within the 32-bit word]
- Q2: How do we find it? Use the next 2 low-order memory address bits, the index, to determine which cache block, i.e., (block address) modulo (# of blocks in the cache)
- Q1: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache

MIPS Direct Mapped 4KB Cache Example

- One-word blocks, cache size = 1K words (or 4KB)
- The 32-bit byte address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0)
- The index selects one of the 1024 entries (each with a valid bit, 20-bit tag, and 32-bit data word); Hit is asserted when the entry is valid and its stored tag matches the address tag, and the 32-bit data word is returned
- What kind of locality are we taking advantage of?
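
The address split above maps directly onto a few shifts and masks. The sketch below is a minimal illustration for this 1K-entry, one-word-block cache; the struct and function names are made up for the example, not part of the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 1024                /* 1K one-word blocks = 4KB of data */

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NUM_BLOCKS];

    /* Hypothetical lookup: returns true on a hit and writes the word to *out. */
    bool cache_lookup(uint32_t addr, uint32_t *out) {
        uint32_t index = (addr >> 2) & (NUM_BLOCKS - 1);  /* bits 11..2  */
        uint32_t tag   = addr >> 12;                      /* bits 31..12 */
        if (cache[index].valid && cache[index].tag == tag) {
            *out = cache[index].data;     /* the 2-bit byte offset would select within the word */
            return true;
        }
        return false;                     /* miss: fetch from the next level, then fill */
    }
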
Multiword Block Direct Mapped 4KB Cache

- Four words/block, cache size = 1K words
- The 32-bit byte address is split into a 20-bit tag (bits 31-12), an 8-bit index (bits 11-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0)
- The index selects one of 256 entries (each with a valid bit, 20-bit tag, and four 32-bit data words); on a hit, the block offset selects the word within the block
- What kind of locality are we taking advantage of?

Taking Advantage of Spatial Locality

- Let the cache block hold more than one word
- Example: word reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked as not valid) that has two two-word blocks:
  - 0 miss (block {Mem(1), Mem(0)} fills one cache block), 1 hit, 2 miss (block {Mem(3), Mem(2)} fills the other), 3 hit, 4 miss ({Mem(5), Mem(4)} replaces {Mem(1), Mem(0)}), 3 hit, 4 hit, 15 miss ({Mem(15), Mem(14)} replaces {Mem(3), Mem(2)})
- 8 requests, 4 misses
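
The miss/hit pattern above can be reproduced with a few lines of simulation. This is a hedged sketch of a direct mapped cache with two-word blocks; the two-block geometry matches the example trace, and the variable names are made up.

    #include <stdio.h>
    #include <stdbool.h>

    #define BLOCK_WORDS 2     /* words per block                           */
    #define NUM_BLOCKS  2     /* blocks in the cache (matches the example) */

    int main(void) {
        int trace[] = {0, 1, 2, 3, 4, 3, 4, 15};   /* word addresses from the slide */
        bool valid[NUM_BLOCKS] = {false};
        int  tag[NUM_BLOCKS];

        int misses = 0;
        for (int i = 0; i < 8; i++) {
            int block = trace[i] / BLOCK_WORDS;    /* block address        */
            int index = block % NUM_BLOCKS;        /* which cache block    */
            int t     = block / NUM_BLOCKS;        /* remaining bits: tag  */
            if (valid[index] && tag[index] == t) {
                printf("%2d hit\n", trace[i]);
            } else {
                printf("%2d miss\n", trace[i]);
                valid[index] = true;
                tag[index]   = t;                  /* fill the whole block on a miss */
                misses++;
            }
        }
        printf("8 requests, %d misses\n", misses); /* prints 4, as on the slide */
        return 0;
    }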

Cache Field Sizes

- The number of bits in a cache includes both the storage for data and for the tags
  - 32-bit byte address
  - A direct mapped cache with 2^n blocks has an n-bit index
  - For a block (line) size of 2^m words (2^(m+2) bytes), m bits are used to address the word within the block and 2 bits are used to address the byte within the word
- What is the size of the tag field? 32 - (n + m + 2)
- The total number of bits in a direct-mapped cache is then 2^n x (block size + tag field size + valid field size)
- How many total bits are required for a direct mapped cache with 16KB of data and 4-word blocks, assuming a 32-bit address?
  - 16KB = 4K words (2^12), so 1024 blocks (2^10)
  - 2^10 x [4 x 32 b data + (32-10-2-2) b tag + 1 b valid] = 147 Kb
  - ...about 1.15 times as many bits as needed just for storing the data

Miss Rate vs Block Size vs Cache Size

- [Figure: miss rate (%) versus block size (16 to 256 bytes) for 8 KB, 16 KB, 64 KB, and 256 KB caches]
- Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)
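
The same arithmetic can be packaged as a small helper; this sketch just evaluates the formula above for the 16KB, 4-word-block example (the function name is made up).

    #include <stdio.h>

    /* Total bits in a direct mapped cache with 2^n blocks of 2^m words each,
       for a 32-bit byte address: 2^n * (data + tag + valid). */
    long cache_total_bits(int n, int m) {
        long blocks    = 1L << n;
        long data_bits = (1L << m) * 32;      /* 2^m words of 32 bits   */
        long tag_bits  = 32 - (n + m + 2);    /* remaining address bits */
        return blocks * (data_bits + tag_bits + 1 /* valid */);
    }

    int main(void) {
        /* 16KB of data, 4-word blocks: n = 10, m = 2 */
        long bits = cache_total_bits(10, 2);
        printf("%ld bits (= %ld Kb)\n", bits, bits / 1024);  /* 150528 bits = 147 Kb */
        return 0;
    }
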
Handling Cache Hits

- Read hits (I$ and D$)
  - This is what we want!
- Write hits (D$ only)
  - If we require the cache and memory to be consistent:
    - always write the data into both the cache block and the next level in the memory hierarchy (write-through)
    - writes run at the speed of the next level in the memory hierarchy (so slow!), or we can use a write buffer and stall only if the write buffer is full
  - If we allow cache and memory to be inconsistent:
    - write the data only into the cache block (write back the cache block to the next level in the memory hierarchy when that cache block is "evicted")
    - need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted; can use a write buffer to help "buffer" write-backs of dirty blocks

Sources of Cache Misses (3 Cs)

- Compulsory (cold start or process migration, first reference):
  - First access to a block; a "cold" fact of life, not a whole lot you can do about it. If you are going to run "millions" of instructions, compulsory misses are insignificant
  - Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size (may increase access time)
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity (stay tuned) (may increase access time)

Miss Rates per Cache Miss Type

- [Figure: miss rates broken down by miss type]

Handling Cache Misses (Single Word Blocks)

- Read misses (I$ and D$)
  - Stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), send the requested word to the core, then let the pipeline resume
- Write misses (D$ only)
  - Since no data is returned to the requester on write operations, a decision needs to be made on write misses: whether or not the data should be loaded into the cache
  - Write allocate: just write the word (and its tag) into the cache (which may involve having to evict a dirty block if using a write-back cache); no need to check for a cache hit, no need to stall
  - No-write allocate: skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full

Multiword Block Considerations

- Read misses (I$ and D$)
  - Processed the same as for single word blocks: a miss returns the entire block from memory
  - Miss penalty grows as block size grows
    - Early restart: the core resumes execution as soon as the requested word of the block is returned
    - Requested word first: the requested word is transferred from the memory to the cache (and core) first
  - Nonblocking cache: allows the core to continue to access the cache while the cache is handling an earlier miss
- Write misses (D$ only)
  - If using write allocate, must first fetch the block from memory and then write the word into the block

Measuring Cache Performance

- Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU time = IC x CPI_stall x CC = IC x (CPI_ideal + Memory-stall cycles) x CC
- Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)
  Read-stall cycles = reads/program x read miss rate x read miss penalty
  Write-stall cycles = (writes/program x write miss rate x write miss penalty) + write buffer stalls
- For write-through caches, we can simplify this to
  Memory-stall cycles = accesses/program x miss rate x miss penalty
- For write-back caches, additional stalls arising from write-backs of dirty blocks should also be considered
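
A minimal sketch of how these formulas combine for a split I$/D$ (the function and variable names are assumptions; the parameter values match the worked example in the "Impacts of Cache Performance" slides that follow).

    #include <stdio.h>

    /* Memory-stall cycles per instruction, per the slide's model:
       every instruction accesses the I$; only loads/stores access the D$. */
    double stall_cpi(double i_miss_rate, double d_miss_rate,
                     double ldst_frac, double miss_penalty) {
        return i_miss_rate * miss_penalty + ldst_frac * d_miss_rate * miss_penalty;
    }

    int main(void) {
        double cpi_ideal = 2.0;
        double stalls = stall_cpi(0.02, 0.04, 0.36, 100.0);  /* 2 + 1.44 = 3.44 */
        printf("CPI_stall = %.2f\n", cpi_ideal + stalls);    /* 5.44            */
        return 0;
    }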

Impacts of Cache Performance

- Relative cache penalty increases as core performance improves (faster clock rate and/or lower CPI)
  - The memory speed is unlikely to improve as fast as the core cycle time. When calculating CPI_stall, the cache miss penalty is measured in core clock cycles needed to handle a miss
  - The lower the CPI_ideal, the more pronounced the impact of stalls
- Example: a core with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates
  Memory-stall cycles = 2% x 100 + 36% x 4% x 100 = 3.44
  So CPI_stall = 2 + 3.44 = 5.44
- What if the core clock rate is doubled (doubling the miss penalty)?
  CPI_stall = 2 + (2% x 200 + 36% x 4% x 200) = 8.88, more than twice the CPI_ideal!

Impacts of Cache Performance, Con't

- Relative cache penalty increases as core performance improves (lower CPI)
- What if the CPI_ideal is reduced to 1? 0.5? 0.25?
  With CPI_ideal = 1, CPI_stall = 1 + 3.44 = 4.44, and the memory-stall fraction rises from 3.44/5.44 = 63% to 3.44/4.44 = 77%
- What if the D$ miss rate went up by 1%? 2%?
  With a 5% D$ miss rate, CPI_stall = 2 + (2% x 100 + 36% x 5% x 100) = 5.80

Average Memory Access Time (AMAT)

- The previous examples and equations assume that the hit time is not a factor in determining cache performance
- Clearly, a larger cache will have a longer access time; an increase in hit time will likely add another stage to the pipeline
  - At some point the increase in hit time for a larger cache will overcome the improvement in hit rate, leading to a decrease in performance
- Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses
  AMAT = Time for a hit + Miss rate x Miss penalty
- What is the AMAT for a core with a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction, and a cache access time of 1 clock cycle?
  AMAT = 1 + 0.02 x 50 = 2 cycles

Improving Basic Cache Performance

- Reducing miss rate
  - More associativity
  - Alternatives/enhancements to associativity: victim caches, hashing, pseudo-associativity, skewed associativity
  - Better replacement/insertion policies
  - Software approaches
- Reducing miss latency/cost
  - Multi-level caches
  - Critical word first
  - Subblocking/sectoring
  - Better replacement/insertion policies
  - Non-blocking caches (multiple cache misses in parallel)
  - Multiple accesses per cycle
  - Software approaches
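
A quick check of the AMAT example above, as a minimal sketch (the function name is assumed):

    #include <stdio.h>

    double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        printf("AMAT = %.1f cycles\n", amat(1.0, 0.02, 50.0));  /* 1 + 0.02*50 = 2.0 */
        return 0;
    }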

Improving Cache Performance #1: Allow More Flexible Block Placement

- In a direct mapped cache, a memory block maps to exactly one cache block
- At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache
- A compromise is to divide the cache into sets, each of which consists of n "ways" (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)
  (block address) modulo (# sets in the cache)

Another Reference String Mapping

- Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty direct mapped cache with four one-word blocks (all blocks initially marked as not valid)
  - Words 0 and 4 both map to cache index 00, so every access misses and evicts the other: 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss, 0 miss, 4 miss
- 8 requests, 8 misses
- Ping-pong effect due to conflict misses: two memory locations that map into the same cache block

Set Associative Cache Example

- [Figure: a 2-way set associative cache with 2 sets (one-word blocks, each entry with a valid bit, tag, and data) in front of a 16-block main memory (0000xx through 1111xx); the two low-order address bits define the byte in the 32-bit word]
- Q2: How do we find it? Use the next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)
- Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache

Another Reference String Mapping

- Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty 2-way set associative cache (all blocks initially marked as not valid)
  - 0 miss (one way of set 0 gets Mem(0)), 4 miss (the other way gets Mem(4)), then 0 hit, 4 hit, 0 hit, 4 hit, 0 hit, 4 hit
- 8 requests, 2 misses
- This solves the ping-pong effect in a direct mapped cache due to conflict misses, since now two memory locations that map into the same cache set can co-exist!
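
A hedged sketch of that comparison: the little simulator below runs the 0 4 0 4 ... string through a 2-way set associative cache with LRU replacement and reports 2 misses, versus the 8 misses of the direct mapped case above (the structure and names are made up for illustration).

    #include <stdio.h>
    #include <stdbool.h>

    #define SETS 2
    #define WAYS 2

    struct way { bool valid; int tag; };
    static struct way cache[SETS][WAYS];
    static int lru[SETS];                     /* which way to evict next (LRU for 2 ways) */

    static bool access_word(int addr) {
        int set = addr % SETS;
        int tag = addr / SETS;
        for (int w = 0; w < WAYS; w++) {
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                lru[set] = 1 - w;             /* the other way becomes LRU */
                return true;                  /* hit                       */
            }
        }
        int victim = lru[set];                /* miss: fill the LRU way    */
        cache[set][victim].valid = true;
        cache[set][victim].tag   = tag;
        lru[set] = 1 - victim;
        return false;
    }

    int main(void) {
        int trace[] = {0, 4, 0, 4, 0, 4, 0, 4};
        int misses = 0;
        for (int i = 0; i < 8; i++)
            if (!access_word(trace[i])) misses++;
        printf("8 requests, %d misses\n", misses);   /* 2, vs. 8 for direct mapped */
        return 0;
    }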

Four-Way Set Associative 4KB Cache

- 2^8 = 256 sets, each with four ways (each way holding one one-word block)
- The 32-bit byte address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset
- [Figure: the index selects a set; the four ways' valid bits and tags are checked in parallel by four comparators, and a 4-to-1 select driven by the way-hit signals delivers the Hit Data]

Range of Set Associative Caches

- For a fixed size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit
- Address fields: Tag (used for tag compare) | Index (selects the set) | Block offset (selects the word in the block) | Byte offset
- Decreasing associativity toward direct mapped (only one way): smaller tags, only a single comparator
- Increasing associativity toward fully associative (only one set): the tag is all the bits except the block and byte offset; data can be located anywhere in the cache
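
The index/tag trade-off described above is easy to tabulate. The sketch below (an assumed helper, not from the slides) prints the field widths for a fixed 4KB cache with one-word blocks as the associativity is doubled; each doubling shrinks the index by one bit and grows the tag by one bit.

    #include <stdio.h>

    static int log2i(long x) { int b = 0; while ((1L << b) < x) b++; return b; }

    int main(void) {
        long cache_bytes = 4096;     /* fixed 4KB cache */
        long block_bytes = 4;        /* one-word blocks */
        long blocks      = cache_bytes / block_bytes;

        printf("ways  sets  index_bits  tag_bits\n");
        for (long ways = 1; ways <= blocks; ways *= 2) {
            long sets      = blocks / ways;
            int index_bits = log2i(sets);
            int tag_bits   = 32 - index_bits - log2i(block_bytes);
            printf("%4ld  %4ld  %10d  %8d\n", ways, sets, index_bits, tag_bits);
        }
        return 0;                    /* 1 way: 10-bit index, 20-bit tag;
                                        4 ways: 8-bit index, 22-bit tag;
                                        1024 ways (fully assoc.): 0-bit index, 30-bit tag */
    }
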
Costs of Set Associative Caches

- When a miss occurs, which way's block do we pick for replacement?
  - Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time
    - Must have hardware to keep track of when each way's block was used relative to the other blocks in the set
    - For 2-way set associative, this takes one bit per way: set the bit when a block is referenced (and reset the other way's bit)
- N-way set associative cache additional costs
  - N comparators (delay and area), one per way
  - MUX delay (way selection) before data is available
  - Data is available only after way selection (and the Hit/Miss decision). In a direct mapped cache, the cache block is available before the Hit/Miss decision
    - So it is not possible to just assume a hit, continue, and recover later if it was a miss

Benefits of Set Associative Caches

- The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation
- [Figure: miss rate versus associativity for several cache sizes]
- Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)

Implementing LRU

- Idea: evict the least recently accessed block
- Problem: need to keep track of the access ordering of the blocks
- Question: for a 2-way set associative cache, what do you need to implement LRU?
- Question: for a 4-way set associative cache:
  - How many different orderings are possible for the 4 blocks in the set?
  - How many bits are needed to encode the LRU order of a block?
  - What logic is needed to determine the LRU victim?

Approximations of LRU

- Most modern processors do not implement "true LRU" in highly-associative caches
- Why?
  - True LRU is complex
  - LRU itself only approximates the ideal policy by predicting locality anyway (i.e., it is not the best possible replacement policy)
- Examples:
  - Not MRU (not most recently used)
  - Hierarchical LRU: divide the 4-way set into 2-way "groups", track the MRU group and the MRU way in each group
  - Victim-NextVictim replacement: only keep track of the victim and the next victim
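
One common answer to the 4-way questions: there are 4! = 24 possible recency orderings, so a full encoding needs 5 bits per set, or equivalently about 2 bits per way if each way keeps a small "age", which is what this hedged sketch does for one set (the representation and names are illustrative, not from the slides).

    #include <stdio.h>

    #define WAYS 4

    /* True LRU for one set, using a small "age" per way:
       age 0 = most recently used, age WAYS-1 = least recently used. */
    static int age[WAYS] = {0, 1, 2, 3};

    static void touch(int w) {                 /* call on every access to way w */
        int old = age[w];
        for (int i = 0; i < WAYS; i++)
            if (age[i] < old) age[i]++;        /* everyone younger ages by one  */
        age[w] = 0;                            /* w becomes the MRU way         */
    }

    static int lru_victim(void) {              /* way to evict on a miss */
        for (int i = 0; i < WAYS; i++)
            if (age[i] == WAYS - 1) return i;
        return 0;                              /* unreachable while ages stay a permutation */
    }

    int main(void) {
        touch(2); touch(0); touch(3);          /* access ways 2, 0, 3 */
        printf("LRU victim: way %d\n", lru_victim());   /* way 1, untouched the longest */
        return 0;
    }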

Multi-level Caching in a Pipelined Design

- First-level caches (instruction and data)
  - Decisions very much affected by cycle time
  - Small, lower associativity
  - Tag store and data store accessed in parallel
- Second-level caches
  - Decisions need to balance hit rate and access latency
  - Usually large and highly associative; latency not as important
  - Tag store and data store accessed serially
- Serial vs. parallel access of levels
  - Serial: the second-level cache is accessed only if the first level misses
  - The second level does not see the same accesses as the first
    - The first level acts as a filter
    - Trade-off between performance and power consumption

Split vs Unified Caches

- Split L1I$ and L1D$: instructions and data in different caches at L1
  - To minimize structural hazards and t_hit
    - So low capacity/associativity (to reduce t_hit)
    - So small to medium block size (to reduce conflict misses)
  - To optimize the L1I$ for wide output (superscalar) and no writes
- Unified L2, L3, ...: instructions and data together in one cache
  - To minimize %miss (t_hit is less important due to (hopefully) infrequent accesses)
    - So high capacity/associativity/block size (to reduce %miss)
  - Fewer capacity misses: unused instruction capacity can be used for data
  - More conflict misses: instruction/data conflicts (a smaller effect in large caches)
  - Instruction/data structural hazards are rare (would take a simultaneous L1I$ and L1D$ miss)

Comparing Cache Memory Architectures

- [Figure-only slides: cache organizations and data cache miss rates of the ARM Cortex-A8 and the Intel i7]

Improving Cache Performance #3: Hardware Prefetching

- Fetch blocks into the cache proactively (speculatively)
  - The key is to anticipate the upcoming miss addresses accurately
  - Relies on having unused memory bandwidth available
- A simple case is next-block prefetching
  - Miss on address X: anticipate the next miss at X + block-size
  - Works well for instructions (sequential execution) and for arrays of data
- Need to initiate prefetches sufficiently in advance
- If we prefetch instructions/data that are not going to be used, then we have polluted the cache with unnecessary data (possibly evicting useful data)
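
A toy sketch of next-block prefetching under a big simplification: a "prefetched" block is simply installed immediately, whereas real hardware mostly gains by overlapping the prefetch with other work. The point is only to show the X + block-size address pattern being anticipated during a sequential sweep (all names are made up).

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 256                    /* direct mapped, for simplicity */

    static bool valid[NUM_BLOCKS];
    static int  tag[NUM_BLOCKS];

    static void fill(int block) {             /* install a block in the cache */
        valid[block % NUM_BLOCKS] = true;
        tag[block % NUM_BLOCKS]   = block;
    }

    static bool present(int block) {
        return valid[block % NUM_BLOCKS] && tag[block % NUM_BLOCKS] == block;
    }

    int main(void) {
        int hits = 0, misses = 0;
        for (int block = 0; block < 64; block++) {   /* a sequential, array-like sweep */
            if (present(block)) {
                hits++;
            } else {
                misses++;
                fill(block);                  /* demand fill                        */
                fill(block + 1);              /* next-block prefetch: also bring in
                                                 the sequentially next block        */
            }
        }
        printf("%d hits, %d misses\n", hits, misses);  /* 32 hits, 32 misses, vs. 64 cold
                                                          misses with no prefetching   */
        return 0;
    }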

FSM Cache Controller

- Key characteristics for a simple L1 cache
  - Direct mapped
  - Write-back using write-allocate
  - Block size of four 32-bit words (so 16B); cache size of 16KB (so 1024 blocks)
  - 18-bit tags, 10-bit index, 2-bit block offset, 2-bit byte offset, dirty bit, valid bit, LRU bits (if set associative)
- [Figure: the cache controller sits between the core and memory. Core side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 32-bit data in each direction, 1-bit Ready. Memory side: 1-bit Read/Write, 1-bit Valid, 32-bit address, 128-bit data in each direction, 1-bit Ready]

Four State Cache Controller

- Idle: wait for a valid CPU request, then go to Compare Tag
- Compare Tag: if Valid && Hit, it is a cache hit; mark the cache ready, set valid and the tag, and if it is a write set the dirty bit, then return to Idle. On a cache miss, go to Allocate if the old block is clean, or to Write Back if the old block is dirty
- Write Back: write the old (dirty) block to memory; when memory is ready, go to Allocate
- Allocate: read the new block from memory; when memory is ready, return to Compare Tag
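
A hedged sketch of that four-state controller as a software state machine. The signal names are made up; a real controller would be RTL, and this only models the state transitions, not the tag/data path.

    #include <stdio.h>
    #include <stdbool.h>

    enum state { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE };

    /* Inputs sampled each cycle (assumed signal names). */
    struct inputs {
        bool cpu_req_valid;   /* core has a valid request           */
        bool hit;             /* tag match && valid in Compare Tag  */
        bool old_block_dirty; /* victim block needs write-back      */
        bool mem_ready;       /* memory finished the last transfer  */
    };

    /* One step of the controller: returns the next state. */
    enum state next_state(enum state s, struct inputs in) {
        switch (s) {
        case IDLE:
            return in.cpu_req_valid ? COMPARE_TAG : IDLE;
        case COMPARE_TAG:
            if (in.hit)                return IDLE;       /* mark cache ready; set valid/tag,
                                                             and dirty on a write            */
            return in.old_block_dirty ? WRITE_BACK : ALLOCATE;
        case WRITE_BACK:
            return in.mem_ready ? ALLOCATE : WRITE_BACK;  /* old block -> memory */
        case ALLOCATE:
            return in.mem_ready ? COMPARE_TAG : ALLOCATE; /* new block <- memory */
        }
        return IDLE;
    }

    int main(void) {
        struct inputs in = { .cpu_req_valid = true };
        enum state s = next_state(IDLE, in);
        printf("%d\n", s == COMPARE_TAG);   /* 1: a valid request leaves Idle */
        return 0;
    }
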
Cache Coherence Issues - More Details to Come

- Need the cache controller to also ensure cache coherence; the most popular approach is snooping
  - The cache controller monitors (snoops) the broadcast medium (e.g., a bus) with duplicate address tag hardware (so it doesn't interfere with the core's access to the cache) to determine if its cache has a copy of a block that is requested
- Write invalidate protocol: writes require exclusive access and invalidate all other copies
  - Exclusive access ensures that no other readable or writable copies of an item exist
- If two cores attempt to write the same data at the same time, one of them wins the race, causing the other core's copy to be invalidated. For the other core to complete, it must obtain a new copy of the data, which must now contain the updated value, thus enforcing write serialization

Improving Cache Performance #4: Code and Data Layout Transformations

- Code transformations change the data access pattern, influencing cache hits and misses
- A large body of compiler transformations is designed to maximize cache performance
  - Data Reuse => Data Locality => Application Performance
- Examples
  - Loop interchange
  - Iteration space tiling (aka blocking)
  - Loop fusion
  - Statement scheduling
  - Software prefetching

Loop Fusion

- Merges two adjacent countable loops into a single loop
- Reduces the cost of test and branch code
- Fusing loops that refer to the same data enhances temporal locality
- One potential drawback is that the larger loop body may reduce instruction locality when the instruction cache is very small

      Original:                    Fused:
        for i = 1, N                 for i = 1, N
          A(i) = B(i) + C(i)           A(i) = B(i) + C(i)
        for i = 1, N                   D(i) = E(i) + B(i)
          D(i) = E(i) + B(i)

- What type of locality do we exploit?

Loop Interchange

- Changes the direction of array traversal by swapping two loops
- The goal is to align the data access direction with the memory layout order

      Original:                    Interchanged:
        for i = 1, N                 for j = 1, N
          for j = 1, N                 for i = 1, N
            ... A(j,i) ...               ... A(j,i) ...

      assuming ROW-MAJOR memory layout
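
Here is the interchange written out in C, which is row-major, so the claim can be checked directly: in the first version the inner loop walks down a column (consecutive accesses are N elements apart), while the interchanged version walks along a row (stride 1), touching each cache block's words consecutively. The array size is made up for the example.

    #include <stdio.h>

    #define N 512

    static double A[N][N];

    int main(void) {
        /* Original: the inner loop varies the ROW index j, so consecutive accesses
           are N doubles apart in memory (poor spatial locality in row-major C). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[j][i] += 1.0;

        /* Interchanged: the inner loop varies i, the last subscript, so consecutive
           accesses are adjacent in memory and share cache blocks. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                A[j][i] += 1.0;

        printf("%f\n", A[N-1][N-1]);   /* keep the work from being optimized away */
        return 0;
    }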

Restructuring Data Layout (1/2)

- Pointer-based traversal (e.g., of a linked list)
- Assume a huge linked list (1M nodes) and unique keys
- Why does the code below have a poor cache hit rate?
  - The "other fields" occupy most of the cache line even though they are rarely accessed!

      struct Node {
          struct Node* next;
          int key;
          char name[256];
          char school[256];
      };

      while (node) {
          if (node->key == input_key) {
              // access other fields of node
          }
          node = node->next;
      }

Restructuring Data Layout (2/2)

- Idea: separate the frequently-used fields of a data structure and pack them into a separate data structure

      struct Node {
          struct Node* next;
          int key;
          struct Node_data* node_data;
      };

      struct Node_data {
          char name[256];
          char school[256];
      };

      while (node) {
          if (node->key == input_key) {
              // access node->node_data
          }
          node = node->next;
      }

- Who should do this?
  - Programmer
  - Compiler (profiling vs. dynamic)
  - Hardware?
  - Who can determine what is frequently used?

Summary: Improving Cache Performance

0. Reduce the time to hit in the cache
  - smaller cache
  - direct mapped cache
  - smaller blocks
  - for writes
    - no write allocate: no "hit" on the cache, just write to the write buffer
    - write allocate: to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to the cache

1. Reduce the miss rate
  - bigger cache
  - more flexible placement (increase associativity)
  - larger blocks (16 to 64 bytes typical)
  - victim cache: a small buffer holding the most recently discarded blocks

2. Reduce the miss penalty
  - smaller blocks
  - use a write buffer to hold dirty blocks being replaced, so we don't have to wait for the write to complete before reading
  - check the write buffer (and/or victim cache) on a read miss; we may get lucky
  - for large blocks, fetch the critical word first
  - use multiple cache levels: L2, L3, ...
  - faster backing store/improved memory bandwidth
    - wider buses
    - memory interleaving, DDR SDRAMs

Cache Performance Summary: Cache Parameters vs Miss Rate

- Cache size
- Block size
- Associativity
- Replacement policy
- Insertion/Placement policy

Cache Size

- Cache size: total data (not including tag) capacity
  - bigger can exploit temporal locality better
  - not ALWAYS better
- Too large a cache adversely affects hit and miss latency
  - smaller is faster => bigger is slower
  - access time may degrade the critical path
- Too small a cache
  - doesn't exploit temporal locality well
  - useful data replaced often
- Working set: the whole set of data the executing application references within a time interval
- [Sketch: hit rate versus cache size rises steeply until the cache reaches the "working set" size, then flattens]

Block Size

- Block size is the data that is associated with an address tag
  - not necessarily the unit of transfer between hierarchy levels
  - Sub-blocking: a block divided into multiple pieces (each with its own valid bit); can improve "write" performance
- Too small blocks
  - don't exploit spatial locality well
  - have larger tag overhead
- Too large blocks
  - too few total blocks
    - likely-useless data transferred
    - extra bandwidth/energy consumed
- [Sketch: hit rate versus block size rises, peaks, then falls as blocks become too large a fraction of the cache]

Large Blocks: Critical-Word and Subblocking

- Large cache blocks can take a long time to fill into the cache
  - fill the cache line critical word first
  - restart the cache access before the complete fill
- Large cache blocks can waste bus bandwidth
  - divide a block into sub-blocks
  - associate separate valid bits with each sub-block
  - [Sketch: one tag shared by several sub-blocks, each with its own valid (v) and dirty (d) bits]
  - When is this useful?

Associativity

- How many blocks can map to the same index (or set)?
- Larger associativity
  - lower miss rate, less variation among programs
  - diminishing returns, higher hit latency
- Smaller associativity
  - lower cost
  - lower hit latency (especially important for L1 caches)
- Power-of-2 associativity?
- [Sketch: hit rate versus associativity rises with diminishing returns]

Shared-memory Multiprocessors (SMP)

- [Conceptual diagram: several processors P, each with a private cache $, sharing a single Memory. Example: an Intel Coffee Lake processor with 6 cores + a graphics co-processor and a shared LLC]
- Complexities of an SMP
  - Cache coherence: defines the ordering of writes to a single address location
  - Memory consistency: defines the ordering of reads and writes to all memory locations
  - Synchronization: allowing only one processor to access data at a time

Cache Coherence Example (Writeback Cache)

- If P writes to a location X and then reads X, and no other processor wrote to X in between, P should read the value it wrote
- If P1 writes to a location X and P2 later reads X, with no other writes to X in between and the two accesses sufficiently separated in time, P2 should read the value P1 wrote
- Two writes to the same location by any two processors are seen in the same order by all processors (writes are serialized)
- Seem obvious?
- [Figure: three processors with private writeback caches share one memory holding X = -100; two caches still hold X = -100 while a third has written X = 505 that has not reached memory, so readers ("Rd?") of X can see stale values]

Example (Write-through Cache)

- [Figure: with write-through caches, a write of X = 505 updates memory immediately, but another processor's cache can still hold the stale copy X = -100 and return it on a read ("Rd?")]

Defining Coherence

- A multiprocessor is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order
- Implicit definition of coherence:
  - Write propagation: writes are visible to other processes
  - Write serialization: all writes to the same location are seen in the same order by all processes (extended to "all" locations, this is called write atomicity)
    - E.g., w1 followed by w2 seen by a read from P1 will be seen in the same order by all reads by other processors Pi

Bus Snooping based on a Write-Through Cache

- All writes appear as transactions on the shared bus to memory
- Two protocols
  - Update-based protocol
  - Invalidation-based protocol

Bus Snooping (Invalidation-based Protocol on a Write-Through Cache)

- [Figure: a processor's Load X appears as a bus transaction; memory and the sharing caches hold X = 505, and every cache controller watches the bus ("bus snoop")]
- Each processor's cache controller constantly snoops on the bus
- Invalidate local copies upon a snoop hit

A Simple Invalidation-based Coherence Protocol for a WT, Write-Allocate Cache

- [State diagram: two states per block, Valid and Invalid; arcs are labeled "observed event / bus transaction", with processor-initiated and bus-snooper-initiated transitions drawn separately]
  - In Valid: PrRd / --- (reads hit silently) and PrWr / BusWr (every write also goes onto the bus, since the cache is write-through)
  - BusWr / ---: a write snooped from another processor invalidates the local copy (Valid -> Invalid)
  - From Invalid, a processor access allocates the block and returns it to Valid (PrWr / BusWr, the cache being write-allocate; a read miss fetches the block over the bus)

How about a Writeback Cache?

- Write-through wastes a lot of bus bandwidth (every update creates traffic)
- Use a writeback (WB) cache to reduce the bandwidth requirement
  - The majority of local writes are hidden behind the processor nodes
- But: how do we snoop? And how do we preserve write ordering?

Cache Coherence Protocols for WB Caches

- A cache has an exclusive copy of a line if
  - it is the only cache having a valid copy
  - memory may or may not have it
- Modified (dirty) cache line
  - the cache having the line is the owner of the line, because it must supply the block

Cache Coherence Protocol (Invalidation-based Protocol on a Writeback Cache)

- [Figure: one processor stores to X repeatedly; the first store invalidates the copies in the sharing processors' caches, and subsequent stores hit locally without bus traffic]
- Invalidate the data copies in the sharing processor nodes
- Reduced traffic when a processor node keeps updating the same memory location

MSI Writeback Invalidation Protocol

- Modified
  - dirty
  - only this cache has a valid copy
- Shared
  - memory is consistent
  - one or more caches have a valid copy
- Invalid
- Writeback protocol: a cache line can be written multiple times before the memory is updated

MSI Writeback Invalidation Protocol (Processor Request)

- Two types of request from the processor
  - PrRd
  - PrWr
- Three types of bus transactions posted by the cache controller
  - BusRd
    - PrRd misses the cache
    - memory or another cache supplies the line
  - BusRdX (BusRd eXclusive, i.e., read-to-own)
    - PrWr is issued to a line which is not in the Modified state
  - BusWB
    - writeback due to replacement
    - the processor is not directly involved in initiating this operation

MSI Writeback Invalidation Protocol

- [State diagram, processor-initiated transitions only]
  - Invalid --PrRd / BusRd--> Shared
  - Invalid --PrWr / BusRdX--> Modified
  - Shared: PrRd / --- (stays Shared); PrWr / BusRdX --> Modified
  - Modified: PrRd / --- and PrWr / --- (stays Modified)

MSI Writeback Invalidation Protocol

- [State diagram with both processor-initiated and bus-snooper-initiated arcs]
  - Invalid --PrRd / BusRd--> Shared; Invalid --PrWr / BusRdX--> Modified
  - Shared: PrRd / --- and snooped BusRd / --- (stays Shared); PrWr / BusRdX --> Modified; snooped BusRdX / --- --> Invalid
  - Modified: PrRd / --- and PrWr / --- (stays Modified); snooped BusRd / Flush --> Shared; snooped BusRdX / Flush --> Invalid

MSI Example

- [Figure: P1, P2, P3 with private caches on a shared bus, plus memory]

      Processor Action   State in P1   State in P2   State in P3   Bus Transaction   Data Supplier
      P1 reads X          S             ---           ---           BusRd             Memory
      P3 reads X          S             ---           S             BusRd             Memory
      P3 writes X         I             ---           M             BusRdX            P3 Cache
      P1 reads X          S             ---           S             BusRd             P3 Cache
      P2 reads X          S             S             S             BusRd             Memory
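
A hedged sketch of the MSI transitions above as a per-line state machine. The event names mirror the slide, but the function is only an illustration, not a full coherence controller: data movement and writebacks on replacement (BusWB) are not modeled, and the output flags merely report which bus transaction or Flush the cache would issue.

    #include <stdio.h>
    #include <stdbool.h>

    enum msi { INVALID, SHARED, MODIFIED };
    enum evt { PR_RD, PR_WR, BUS_RD, BUS_RDX };   /* processor- and snoop-side events */

    /* Apply one MSI event to a line. *bus_rd / *bus_rdx report transactions the
       cache must issue; *flush reports that the dirty line must be supplied. */
    enum msi msi_next(enum msi s, enum evt e, bool *bus_rd, bool *bus_rdx, bool *flush) {
        *bus_rd = *bus_rdx = *flush = false;
        switch (s) {
        case INVALID:
            if (e == PR_RD) { *bus_rd  = true; return SHARED;   }
            if (e == PR_WR) { *bus_rdx = true; return MODIFIED; }
            return INVALID;                        /* snooped traffic: nothing to do */
        case SHARED:
            if (e == PR_WR)   { *bus_rdx = true; return MODIFIED; }
            if (e == BUS_RDX) return INVALID;      /* another writer invalidates us  */
            return SHARED;                         /* PrRd or snooped BusRd          */
        case MODIFIED:
            if (e == BUS_RD)  { *flush = true; return SHARED;  }  /* supply the block */
            if (e == BUS_RDX) { *flush = true; return INVALID; }
            return MODIFIED;                       /* local reads and writes hit     */
        }
        return INVALID;
    }

    int main(void) {
        bool rd, rdx, fl;
        enum msi s = msi_next(INVALID, PR_WR, &rd, &rdx, &fl);  /* I -> M, issuing BusRdX      */
        s = msi_next(s, BUS_RD, &rd, &rdx, &fl);                /* snooped read: M -> S, Flush */
        printf("state=%d flush=%d\n", s, fl);                   /* state=1 (SHARED), flush=1   */
        return 0;
    }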

MSI Summary

- MSI ensures single-writer-multiple-readers
  - often called the SWMR (pronounced "swimmer") invariant
- We can still do better
  - MESI, MOESI, ...
  - not covered in further detail here

CSE 431 Computer Architecture, Spring 2024
Exploiting the Memory Hierarchy: TLBs
Cyan Subhra Mishra

[Adapted from Computer Organization and Design, 5th Edition, Patterson & Hennessy, © 2014, Morgan Kaufmann]
[Slides adapted from Mahmut Kandemir, Jack Sampson, Kiwan Maeng]

Review: Caches

- Caches exploit temporal and spatial locality
  - Convert "data reuse" into "data locality"
- Managed by hardware
- Built typically as a hierarchy (e.g., L1-L2-L3)
- Hardware optimizations include flexible placement of data, prefetching, careful hierarchy design
- Software optimizations include computation reordering and data layout restructuring

Review: The Memory Hierarchy

- Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology
- [Figure: Core, L1$, L2$, main memory, and secondary memory, with transfer units of 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, and 1,024+ bytes (disk sector = page); distance from the processor increases access time, and the relative size of each level grows going down]
- Inclusive: what is in L1$ is a subset of what is in L2$
- Noninclusive: what is in L1$ is not (necessarily) a subset of what is in L2$

How is the Hierarchy Managed?

- registers ↔ memory
  - by the compiler (programmer?)
- registers ↔ cache ↔ main memory
  - by the cache controller hardware
- main memory ↔ secondary memory (flash, disk)
  - by the operating system (virtual memory)
    - virtual address to physical address mapping
    - assisted by the hardware (TLB, page tables)
  - by the programmer with OS support (files)

Virtual Memory Concepts

- Use main memory as a "cache" for secondary memory
  - Allows efficient and safe sharing of main memory among multiple processes/threads (running programs)
    - Each program is compiled into its own private virtual address space
  - Provides the ability to run programs and data sets larger than the size of physical memory
  - Simplifies loading a program for execution by providing for code relocation (i.e., the code/data can be loaded in main memory anywhere the OS can find space for it)
- The core and OS work together to translate virtual addresses to physical addresses
  - A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault
- What makes it work efficiently? The principle of locality
  - Programs tend to access only a small portion of their address space over long portions of their execution time

Two Programs Sharing Physical Memory

- A program's address space is divided into pages (all one fixed size) or segments (variable sizes)
  - The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table
- [Figure: Program 1's and Program 2's virtual address spaces both map into the same physical address space; e.g., main memory is 1GB while each virtual address space is 4GB]

Address Translation

- A virtual address is translated to a physical address by a combination of hardware and software
- Virtual Address (VA): bits 31 ... 12 hold the virtual page number and bits 11 ... 0 hold the page offset; the offset width determines the page size (e.g., 2^12 bytes)
- Translation maps the virtual page number to a physical page number; the Physical Address (PA) is the physical page number (bits 29 ... 12) concatenated with the unchanged page offset (bits 11 ... 0)
- So, each memory request first requires an address translation from the virtual space to the physical space
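
A minimal sketch of that split-and-concatenate step for 4KB pages, using an assumed flat page-table array (real page tables are more elaborate; the names and mappings here are made up).

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS  12                     /* 4KB pages: 12-bit page offset */
    #define NUM_VPAGES 16                     /* tiny toy address space        */

    /* Toy page table: PPN per VPN; -1 means not mapped (made-up values). */
    static int page_table[NUM_VPAGES] = { 7, 3, -1, -1, -1, -1, -1, -1,
                                          -1, -1, -1, -1, -1, -1, -1, -1 };

    /* Translate a virtual address; returns 0 to stand in for a page fault. */
    uint32_t translate(uint32_t va) {
        uint32_t vpn    = va >> PAGE_BITS;            /* virtual page number */
        uint32_t offset = va & ((1u << PAGE_BITS) - 1);
        if (vpn >= NUM_VPAGES || page_table[vpn] < 0)
            return 0;                                 /* page fault: the OS would handle it */
        return ((uint32_t)page_table[vpn] << PAGE_BITS) | offset;
    }

    int main(void) {
        printf("0x%x\n", translate(0x1ABC));   /* VPN 1 -> PPN 3, so 0x3ABC */
        return 0;
    }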

Virtual Address Translation Mechanisms

- [Figure: the virtual page number indexes the page table (located via the Page Table Register); each page table entry holds a valid bit and the physical page base address. The physical page number is concatenated with the unchanged page offset to form the physical address. If the valid bit for a virtual page is off (0), a page fault occurs and the page must be brought from disk storage into main memory. The page table is stored in main memory]
- The page table, together with the program counter and the registers, specifies the state of a program

Design Choices in Virtual Memory Systems

- Pages should be large enough to amortize the high access time
- Organizations that reduce the page fault rate (e.g., fully associative placement of pages in main memory) are attractive
- Page faults can be handled in software

Virtual Addressing with a Cache

- It takes an extra memory access to translate a VA to a PA
  - [Figure: Core -> Translation -> Cache, with main memory accessed on a cache miss]
- This makes memory (cache) accesses very expensive (if every access is really two accesses)
- The hardware fix is to use a Translation Lookaside Buffer (TLB): a fast, small cache that keeps track of recently used address mappings to avoid having to do a page table lookup in memory (i.e., cache or main memory)
- Typical TLBs: 16 to 512 PTEs, 0.5 to 2 cycles for a hit, 10-100 cycles for a miss, 0.01% to 1% miss rate

Making Address Translation Fast

- [Figure: the TLB holds a subset of the page table entries, tagged by virtual page number, each with a valid bit and the physical page base address; the full page table lives in physical memory (with pages on disk when not resident), and the Page Table Register points to it]


Translation Lookaside Buffers (TLBs)

- Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
  - SimpleScalar defaults are itlb:16:4096:4:l (16 sets per way, 4-way set associative so 64 entries, 4096B pages), dtlb:32:4096:4:l, and tlb:lat 30 (cycles to service a TLB miss)
- Each TLB entry holds: V | Virtual Page # | Physical Page # | Dirty | Ref | Access
  - V = valid?; Dirty = is the page dirty (so it will have to be written back on replacement)?; Ref = referenced recently?; Access = write access allowed?
- TLB access time is typically much smaller than cache access time (because TLBs are much smaller than caches)
  - TLBs are typically not more than 512 entries even on high-end machines

TLB with Cache Example

- [Figure: worked example of a TLB lookup feeding the cache]
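
A hedged sketch of a TLB lookup using the entry fields listed above, made fully associative for simplicity; the structure, the protection handling, and the page-table fallback are illustrative assumptions, not any particular implementation.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define TLB_ENTRIES 64
    #define PAGE_BITS   12

    struct tlb_entry {
        bool     valid, dirty, ref, write_ok;   /* V, Dirty, Ref, Access (check omitted) */
        uint32_t vpn, ppn;                      /* virtual / physical page numbers       */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stand-in page-table walk for the sketch: identity-map every page. */
    static bool page_table_lookup(uint32_t vpn, uint32_t *ppn) { *ppn = vpn; return true; }

    /* Translate va; returns false if even the page table has no mapping (page fault). */
    bool tlb_translate(uint32_t va, bool is_write, uint32_t *pa) {
        uint32_t vpn = va >> PAGE_BITS;
        for (int i = 0; i < TLB_ENTRIES; i++) {          /* fully associative: check all */
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                tlb[i].ref = true;                       /* remember recent use          */
                if (is_write) tlb[i].dirty = true;
                *pa = (tlb[i].ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
                return true;                             /* TLB hit: 0.5-2 cycles        */
            }
        }
        uint32_t ppn;                                    /* TLB miss: 10-100 cycles      */
        if (!page_table_lookup(vpn, &ppn)) return false; /* true page fault: OS handles  */
        tlb[0] = (struct tlb_entry){ .valid = true, .vpn = vpn, .ppn = ppn,
                                     .ref = true, .dirty = is_write };  /* naive refill  */
        *pa = (ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
        return true;
    }

    int main(void) {
        uint32_t pa;
        if (tlb_translate(0x1ABC, false, &pa)) printf("0x%x\n", pa);  /* miss, then refill */
        if (tlb_translate(0x1ABC, true,  &pa)) printf("0x%x\n", pa);  /* now a TLB hit     */
        return 0;
    }
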
A TLB in the Memory Hierarchy

- [Figure: the core sends the VA to the TLB lookup (roughly 1/4 of the hit time); on a TLB hit the PA goes to the cache (the remaining ~3/4 of the hit time), with main memory accessed on a cache miss; on a TLB miss the page table provides the translation]
- A TLB miss: is it a page fault or merely a TLB miss?
  - If the page is loaded into main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB
    - Takes 10's of cycles to find and load the translation info into the TLB
  - If the page is not in main memory, then it is a true page fault
    - Takes 1,000,000's of cycles to service a page fault
    - Page faults can be handled in software because the overhead will be small compared to the disk access time
- TLB misses are much more frequent than true page faults

TLB Event Combinations

      TLB    Page Table   Cache     Possible? Under what circumstances?
      Hit    Hit          Hit       Yes - this is what we want!
      Hit    Hit          Miss      Yes - although the page table is not checked after the TLB hits
      Miss   Hit          Hit       Yes - TLB missed, but the PA is in the page table and the data is in the cache; update the TLB
      Miss   Hit          Miss      Yes - TLB missed, but the PA is in the page table; the data is not in the cache; update the TLB
      Miss   Miss         Miss      Yes - page fault; the OS takes control
      Hit    Miss         Miss/Hit  No - a TLB translation is not possible if the page is not present in main memory
      Miss   Miss         Hit       No - data is not allowed in the cache if the page is not in memory

Handling a TLB Miss

- A TLB miss can indicate one of two possibilities:
  - The page is present in memory, and we need only create the missing TLB entry
  - The page is not present in memory, and we need to transfer control to the operating system to deal with a page fault
- MIPS traditionally handles a TLB miss in software
- Handling a TLB miss or a page fault requires using the exception mechanism to interrupt the active process, transferring control to the operating system, and later resuming execution of the interrupted process
- A TLB miss or page fault exception must be asserted by the end of the same clock cycle in which the memory access occurs, so that the next clock cycle will begin exception processing rather than continue normal instruction execution

Handling a Page Fault

- Once the operating system knows the virtual address that caused the page fault, it must complete three steps:
  - Look up the page table entry using the virtual address and find the location of the referenced page on disk
  - Choose a physical page to replace; if the chosen page is dirty, it must be written out to disk before we can bring a new virtual page into this physical page
  - Start a read to bring the referenced page from disk into the chosen physical page
- The last step will take millions of clock cycles (so will the second if the replaced page is dirty)
- Accordingly, the operating system will usually select/schedule another process to execute in the processor until the disk access completes (context switching)
- When the read of the page from the disk completes, the operating system can restore the state of the process that originally caused the page fault and execute the instruction that returns from the exception
- The user process (application) then re-executes the instruction that faulted

TLB Management

- Hardware-managed TLB
  - No need for expensive interrupts
  - The pipeline remains largely unaffected
  - The OS cannot employ an alternate page table design
- Software-managed TLB
  - Data structure design is flexible, since the OS controls the page table walk
  - The miss handler is itself instructions
    - It may itself miss in the instruction cache
  - The data cache may be polluted by the page table walk

Why Not a Virtually Addressed Cache?

- A virtually addressed cache would only require address translation on cache misses
  - [Figure: Core -> Cache accessed with the VA; translation to a PA happens only on the path to main memory on a miss]
- But: two programs which are sharing data will have two different virtual addresses for the same physical address (aliasing), so there will be two copies of the shared data in the cache and two entries in the TLB, which would lead to coherence issues
  - Must update all cache entries with the same physical address, or the memory becomes inconsistent
- Possible cache organizations:
  1) physically-indexed, physically-tagged
  2) virtually-indexed, virtually-tagged
  3) virtually-indexed, physically-tagged
