Chapter 4
Memory Hierarchy Design
Computer Components
There are three basic hardware modules
(Bell, Newell: Computer Structures, 1971):
Processors
Memory
Communication
Memory / Storage Evaluation
Costs
Capacity
Speed
Reliability
Volatility
Memory/storage hierarchies
Balancing performance with cost
Small memories are fast but expensive
Large memories are slow but cheap
Exploit locality to get the best of both worlds: capacity and performance.
Locality = re-use/nearness of accesses; it allows most accesses to use small, fast memory.
An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit toward the top; larger, slower, and cheaper (per byte) storage devices sit toward the bottom:
L0: CPU registers hold words retrieved from the L1 cache.
L1: On-chip L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
L2: Off-chip L2 cache (SRAM) holds cache lines retrieved from main memory.
L3: Main memory (DRAM) holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks) holds files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).
Main Memory
Most of the main memory in a general-purpose computer is made up of RAM integrated circuit chips, but a portion of the memory may be constructed with ROM chips.
RAM: Random Access Memory
Integrated RAM chips are available in two possible operating modes, static and dynamic.
ROM: Read Only Memory
Random-Access Memory (RAM)
Static RAM (SRAM)
Each cell stores a bit with a six-transistor circuit.
Retains its value indefinitely, as long as it is kept powered.
Relatively insensitive to disturbances such as electrical noise.
Faster (8-16 times faster) and more expensive (8-16 times more expensive as well) than DRAM.
Dynamic RAM (DRAM)
Each cell stores a bit with a capacitor and a transistor.
Value must be refreshed every 10-100 ms.
Sensitive to disturbances.
Slower and cheaper than SRAM.
SRAM vs DRAM Summary
        Trans.   Access
        per bit  time   Persist?  Sensitive?  Cost  Applications
SRAM    6        1X     Yes       No          100X  Cache memories
DRAM    1        10X    No        Yes         1X    Main memories, frame buffers
Virtually all desktop and server computers since 1975 have used DRAM for main memory and SRAM for cache.
ROM
ROM is used for storing programs that are PERMANENTLY resident in the computer and for tables of constants that do not change in value once the production of the computer is completed.
The ROM portion of main memory is needed for storing an initial program called the bootstrap loader, whose function is to start the computer software operating when power is turned on.
Introduction
Programmers want unlimited amounts of memory with
low latency
Fast memory technology is more expensive per bit than
slower memory
Solution: organize memory system into a hierarchy
Entire addressable memory space available in largest, slowest
memory
Incrementally smaller and faster memories, each containing a
subset of the memory below it, proceed in steps up toward the
processor
Temporal and spatial locality ensure that nearly all
references can be found in smaller memories
Gives the illusion of a large, fast memory being presented to the
processor
Since 1980, CPU has outpaced DRAM ...
[Chart: performance (1/latency) vs. year. CPU performance grows 60% per year (2X in 1.5 years); DRAM performance grows 9% per year (2X in 10 years); the gap grew 50% per year.]
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.
Memory Hierarchy
Exploiting the Memory Hierarchy
Not all stored data is equally important.
Put important data in the upper ranges
of the memory / storage hierarchy.
Put unimportant data in the lower
ranges.
The Principle of Locality
The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time. (This is kind of like real life: we all have a lot of friends, but at any given time most of us can only keep in touch with a small group of them.)
Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is referenced, it
will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (Locality in Space): If an item is referenced,
items whose addresses are close by tend to be referenced soon
(e.g., straight-line code, array access)
For the last 15 years, hardware has relied on locality for speed.
Locality is a property of programs which is exploited in machine design.
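As a concrete illustration (a minimal C sketch; the matrix size and loop structure are mine, not from the lecture), summing a matrix row by row walks memory sequentially (spatial locality), while the accumulator and loop indices are reused on every iteration (temporal locality). Traversing by columns defeats spatial locality:

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive iterations touch adjacent
 * addresses (spatial locality), so most accesses hit in the cache.
 * The accumulator `sum` and indices i, j are reused every iteration
 * (temporal locality) and stay in registers or the L1 cache. */
double sum_rows(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];          /* stride-1 accesses */
    return sum;
}

/* Column-major traversal touches addresses N*sizeof(double) apart,
 * defeating spatial locality; many accesses miss in the cache. */
double sum_cols(void) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];          /* stride-N accesses */
    return sum;
}

int main(void) {
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}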
Exploiting the Memory Hierarchy
Locality
Spatial Locality:
Data is more likely to be accessed if
neighboring data is accessed.
Temporal Locality:
Data is more likely to be accessed if it has been
recently accessed.
Exploiting the Memory Hierarchy
Executables
Program executions tend to spend a great portion of time
in loops.
Spatial locality: if a statement in the loop is executed,
then so are the statements surrounding it.
Temporal locality: if a statement is executed, it is likely to
be executed again.
Exploiting the Memory Hierarchy
Relational Databases
Store data in relations
A relation consists of records with fields, often with a record ID.
Stored in a B+ tree or in a (linear) hash table.
Spatial Locality
When accessing all records in order (records are stored in a B+ tree), it makes sense to move records in bunches from disk / tape to main memory.
A typical transaction, however, has no spatial locality: it accesses a record here and there, all over the place.
Exploiting the Memory Hierarchy
Relational Databases
Temporal Locality
Some records are hot, most are cold.
Records of current students vs. records of graduates.
Active accounts in a bank database.
Current patients versus other patients.
Some transactions look at the same record
several times (due to inefficiencies).
Exploiting the Memory Hierarchy
File System
Temporal Locality:
Few files are frequently accessed (OS kernel,
killer apps, data in current projects).
Most are written and never read again.
Spatial Locality:
Not only individual files, but also directories can
become hot.
Exploiting the Memory Hierarchy
Caching strategy:
Keep popular items in expensive, small,
and fast memory.
Keep less popular items in cheap, big, and
slow memory.
Use spatial & temporal locality to guess
what items are popular.
Cache Analysis
Assume two levels of memory:
Cache: fast, small, expensive.
Main: slow, large, cheap.
Performance and Power
High-end microprocessors have >10 MB of on-chip cache, which consumes a large amount of the area and power budget.
Memory Hierarchy Basics
When a word is not found in the cache, a miss
occurs:
Fetch word from lower level in hierarchy, requiring a
higher latency reference
Lower level may be another cache or the main
memory
Also fetch the other words contained within the block
Takes advantage of spatial locality
Place block into cache in any location within its set, determined by the address:
set index = (block address) MOD (number of sets)
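A small C sketch of this placement rule; the 64-byte block size and 128-set geometry are assumed values for illustration:

#include <stdint.h>
#include <stdio.h>

/* Assumed illustrative geometry: 64-byte blocks, 128 sets. */
#define BLOCK_SIZE 64u
#define NUM_SETS   128u

/* Set index = (block address) MOD (number of sets). */
static uint32_t set_index(uint32_t byte_addr) {
    uint32_t block_addr = byte_addr / BLOCK_SIZE;
    return block_addr % NUM_SETS;
}

int main(void) {
    /* Two addresses exactly BLOCK_SIZE*NUM_SETS bytes apart map to the
     * same set; alternating references to them cause conflict misses. */
    printf("%u %u\n", set_index(0x0000), set_index(0x2000)); /* 0x2000 = 64*128 */
    return 0;
}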
Cache Hits and Misses
Hit: data appears in some block in the upper level
(example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: Time to access the upper level which consists of
RAM access time + Time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower
level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: Time to replace a block in the upper level +
Time to deliver the block to the processor
[Figure: the processor exchanges words with the upper-level memory, which holds Blk X; the upper level exchanges blocks with the lower-level memory, which holds Blk Y.]
Memory Hierarchy Terms
The goal of the memory hierarchy is to keep the
contents that are needed now at or near the top of
the hierarchy
We discuss the performance of the memory hierarchy
using the following terms:
Hit – when the datum being accessed is found at the current
level
Miss – when the datum being accessed is not found and the next
level of the hierarchy must be examined
Hit rate – how many hits out of all memory accesses
Miss rate – how many misses out of all memory accesses
NOTE: hit rate = 1 – miss rate, miss rate = 1 – hit rate
Hit time – time to access this level of the hierarchy
Miss penalty – time to access the next level
Hit Rate and Miss Penalty
Hit rate: fraction of accesses found in that level
Usually so high that we talk about the miss rate instead
Miss rate fallacy: miss rate is as misleading a proxy for average memory access time as MIPS is for CPU performance
Average memory-access time
= Hit time + Miss rate x Miss penalty
(ns or clocks)
Miss penalty: time to replace a block from the lower level, including time to replace in the CPU
access time: time to reach the lower level = f(latency to lower level)
transfer time: time to transfer the block = f(BW between upper & lower levels, block size)
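The formula translates directly into code. A minimal C sketch (the parameter names and the sample numbers in main are mine, for illustration), with the miss penalty decomposed into access time plus transfer time as above:

#include <stdio.h>

/* Average memory access time = hit time + miss rate * miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Miss penalty = access time (latency to the lower level)
 *              + transfer time (block size / bandwidth). */
static double miss_penalty(double latency, double block_bytes, double bytes_per_ns) {
    return latency + block_bytes / bytes_per_ns;
}

int main(void) {
    /* Illustrative (assumed) numbers: 1 ns hit time, 5% miss rate,
     * 50 ns lower-level latency, 64-byte blocks at 16 bytes/ns. */
    double mp = miss_penalty(50.0, 64.0, 16.0);       /* 54 ns */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.05, mp));  /* 3.70 ns */
    return 0;
}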
Single Cache
The average read access time = Hit Ratio * Time taken in case of hit + (1 - Hit Ratio) * Time taken in case of miss
Average access time = H1*T1 + (1-H1)*T2
Two-Level Cache
Average access time = H1*T1 + (1-H1)*H2*T2 + (1-H1)*(1-H2)*Hm*Tm
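These formulas drop straight into code. A minimal C sketch (function names are mine; the variable names follow the H1/T1 notation above), checked against the values from Examples 1 and 2 below:

#include <stdio.h>

/* Single cache: H1*T1 + (1-H1)*T2. */
static double avg_one_level(double h1, double t1, double t2) {
    return h1 * t1 + (1.0 - h1) * t2;
}

/* Two-level cache: H1*T1 + (1-H1)*H2*T2 + (1-H1)*(1-H2)*Hm*Tm. */
static double avg_two_level(double h1, double t1, double h2, double t2,
                            double hm, double tm) {
    return h1 * t1
         + (1.0 - h1) * h2 * t2
         + (1.0 - h1) * (1.0 - h2) * hm * tm;
}

int main(void) {
    /* Example 1: 5 ns hit, 50 ns miss, 80% hit ratio -> 14 ns. */
    printf("%.1f ns\n", avg_one_level(0.8, 5.0, 50.0));
    /* Example 2: T1=1 ns, T2=10 ns, Tm=500 ns, H1=0.8, H2=0.9, Hm=1 -> 12.6 ns. */
    printf("%.1f ns\n", avg_two_level(0.8, 1.0, 0.9, 10.0, 1.0, 500.0));
    return 0;
}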
Example 1
Assume that for a certain processor, a read request takes
50 nanoseconds on a cache miss and 5 nanoseconds on
a cache hit. Suppose while running a program, it was
observed that 80% of the processor’s read requests
result in a cache hit. The average read access time in
nanoseconds is____________.
(A) 10
(B) 12
(C) 13
(D) 14
Solution 1
Hit Ratio=0.8
Time taken in case of hit=5ns
Time taken in case of miss=50ns
The average read access time= Hit Ratio*Time taken in
case of hit +(1-Hit Ratio)*Time taken in case of miss
The average read access time in nanoseconds
= 0.8 * 5 + (1-0.8)*50
= 4 + 10
= 14 ns, so the answer is (D)
Example 2
Consider a system with 2 level caches. Access times of
Level 1 cache, Level 2 cache and main memory are 1
ns, 10ns, and 500 ns, respectively. The hit rates of
Level 1 and Level 2 caches are 0.8 and 0.9,
respectively. What is the average access time of the
system ignoring the search time within the cache?
(A) 13.0 ns
(B) 12.8 ns
(C) 12.6 ns
(D) 12.4 ns
Solution 2
Average access time = H1*T1 + (1-H1)*H2*T2 + (1-H1)*(1-H2)*Hm*Tm, where
H1 = Hit rate of level 1 cache = 0.8
T1 = Access time for level 1 cache = 1 ns
H2 = Hit rate of level 2 cache = 0.9
T2 = Access time for level 2 cache = 10 ns
Hm = Hit rate of Main Memory = 1
Tm = Access time for Main Memory = 500 ns
So, Average Access Time = ( 0.8 * 1 ) + ( 0.2 * 0.9 * 10 ) + ( 0.2
* 0.1 * 1 * 500)
= 0.8 + 1.8 + 10
= 12.6 ns, so the answer is (C)
Example 3
A computer system has an L1 cache, an L2 cache, and a main memory unit
connected as shown below. The block size in L1 cache is 4 words. The
block size in L2 cache is 16 words. The memory access times are 2 nanoseconds, 20 nanoseconds, and 200 nanoseconds for L1 cache, L2 cache, and main memory unit respectively.
When there is a miss in L1 cache and a hit in L2 cache, a block is
transferred from L2 cache to L1 cache. What is the time taken for this
transfer?
Example 4
The memory access time is 1 nanosecond for a read operation
with a hit in cache, 5 nanoseconds for a read operation with a
miss in cache, 2 nanoseconds for a write operation with a hit in
cache and 10 nanoseconds for a write operation with a miss in
cache. Execution of a sequence of instructions involves 100
instruction fetch operations, 60 memory operand read
operations and 40 memory operand write operations. The cache
hit-ratio is 0.9. The average memory access time (in
nanoseconds) in executing the sequence of instructions is
__________.
(A) 1.26
(B) 1.68
(C) 2.46
(D) 4.52
The question is to find the time taken for "100 fetch operations, 60 operand read operations and 40 memory operand write operations" / "total number of instructions".
Total number of instructions = 100 + 60 + 40 = 200
Time taken for 100 fetch operations (a fetch is a read)
= 100*((0.9*1)+(0.1*5))   // 1 ns is the read time on a cache hit; 0.9 is the cache hit rate
= 140 ns
Time taken for 60 read operations = 60*((0.9*1)+(0.1*5)) = 84 ns
Time taken for 40 write operations = 40*((0.9*2)+(0.1*10)) = 112 ns
// Here 2 ns and 10 ns are the write times on a cache hit
// and a cache miss respectively
So, the total time taken for 200 operations is 140 + 84 + 112 = 336 ns
Average time taken = time taken per operation = 336/200 = 1.68 ns, so the answer is (B)
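As a quick check, the same computation in C (a sketch; the operation mix and access times are taken from the example above):

#include <stdio.h>

int main(void) {
    const double hit = 0.9, miss = 1.0 - hit;
    /* Per-access times (ns) from the example. */
    double read_avg  = hit * 1.0 + miss * 5.0;   /* 1.4 ns */
    double write_avg = hit * 2.0 + miss * 10.0;  /* 2.8 ns */
    /* 100 fetches + 60 reads use the read times; 40 writes use the write times. */
    double total = (100 + 60) * read_avg + 40 * write_avg;  /* 336 ns */
    printf("average = %.2f ns\n", total / 200.0);           /* 1.68 ns */
    return 0;
}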
Data access using cache
Writing to Cache
Writing to cache: two strategies
Write-through
Immediately update lower levels of hierarchy
Write-back
Only update lower levels of hierarchy when an updated block
is replaced
Both strategies use write buffer to make writes
asynchronous
Q4: What happens on a write?

                                   Write-Through              Write-Back
Policy                             Data written to the        Write data only to the
                                   cache block is also        cache; update the lower
                                   written to lower-level     level when a block falls
                                   memory                     out of the cache
Debug                              Easy                       Hard
Do read misses produce writes?     No                         Yes
Do repeated writes make it
to lower level?                    Yes                        No

Additional option (on a miss): let writes to an un-cached address allocate a new cache line (“write-allocate”).
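To make the two policies concrete, here is a toy one-line cache in C. The structure and memory model are invented for illustration; a real cache tracks many blocks, each with its own dirty bit:

#include <stdbool.h>
#include <stdint.h>

#define MEM_WORDS 1024

static uint32_t memory[MEM_WORDS];   /* lower-level memory */

typedef struct {
    uint32_t addr;    /* address of the cached word */
    uint32_t data;
    bool     valid;
    bool     dirty;   /* used only by write-back */
} CacheLine;

static CacheLine line;  /* a one-line "cache" for illustration */

/* Write-through: update the cache AND lower-level memory on every write,
 * so repeated writes all reach the lower level. */
static void write_through(uint32_t addr, uint32_t data) {
    line = (CacheLine){ .addr = addr, .data = data, .valid = true, .dirty = false };
    memory[addr] = data;
}

/* Write-back: update only the cache; lower-level memory is updated when
 * the dirty line is evicted (here: when a different address displaces it). */
static void write_back(uint32_t addr, uint32_t data) {
    if (line.valid && line.dirty && line.addr != addr)
        memory[line.addr] = line.data;   /* write the victim back on eviction */
    line = (CacheLine){ .addr = addr, .data = data, .valid = true, .dirty = true };
}

int main(void) {
    write_back(1, 42);    /* memory[1] is still stale */
    write_back(2, 7);     /* evicts addr 1: memory[1] = 42 now */
    write_through(3, 9);  /* memory[3] = 9 immediately */
    return 0;
}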
Write Buffers for Write-Through Caches
[Figure: Processor -> Cache -> Lower Level Memory, with a Write Buffer between the cache and lower-level memory.]
The write buffer holds data awaiting write-through to lower-level memory.
Q. Why a write buffer? A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or send the read first after checking the write buffers.
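A minimal C sketch of a write buffer with the RAW check described above; the 4-entry FIFO and its interface are assumptions for illustration, not from the slides:

#include <stdint.h>

#define BUF_ENTRIES 4
#define MEM_WORDS   1024

static uint32_t memory[MEM_WORDS];

/* A small FIFO of pending write-through stores. */
static struct { uint32_t addr, data; } buf[BUF_ENTRIES];
static int head = 0, count = 0;

/* Retire the oldest buffered write to lower-level memory. */
static void drain_one(void) {
    memory[buf[head].addr] = buf[head].data;
    head = (head + 1) % BUF_ENTRIES;
    count--;
}

/* CPU-side write: buffer it so the CPU doesn't stall;
 * stall (drain) only when the buffer is full. */
static void buffered_write(uint32_t addr, uint32_t data) {
    while (count == BUF_ENTRIES) drain_one();
    int tail = (head + count) % BUF_ENTRIES;
    buf[tail].addr = addr;
    buf[tail].data = data;
    count++;
}

/* CPU-side read: check the buffer first to avoid RAW hazards;
 * the newest matching entry holds the correct value. */
static uint32_t buffered_read(uint32_t addr) {
    for (int i = count - 1; i >= 0; i--) {
        int idx = (head + i) % BUF_ENTRIES;
        if (buf[idx].addr == addr) return buf[idx].data;
    }
    return memory[addr];   /* no pending write: read the lower level */
}

int main(void) {
    buffered_write(5, 99);
    return buffered_read(5) == 99 ? 0 : 1;  /* RAW satisfied from the buffer */
}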
Performance - Cache Memory System
Te: Effective memory access time in
cache memory system
Tc: Cache access time
Tm: Main memory access time
Te = Tc + (1 - h) × Tm
Example: Tc = 0.4 ns, Tm = 1.2 ns, h = 85%
Te = 0.4 + (1 - 0.85) × 1.2 = 0.58 ns
Types and Causes of Misses
Causes of misses
Compulsory
First reference to a block
Capacity
Blocks discarded because the cache cannot contain all the blocks needed, and later retrieved
Conflict
Program makes repeated references to multiple
addresses from different blocks that map to the same
location in the cache
Memory Access Time
Note that speculative and multithreaded
processors may execute other instructions
during a miss
Reduces performance impact of misses
Improve Cache Performance
To improve cache and memory access times:
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
Reduce each of these! Simultaneously?
CPU time = IC * (CPI_Execution + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty) * Clock Cycle Time
• Improve performance by:
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
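A small C sketch of this CPU time equation (the function name and the sample inputs are mine; the miss penalty is expressed in clock cycles):

#include <stdio.h>

/* CPU time = IC * (CPI_Execution + accesses/instruction * miss rate
 *                  * miss penalty) * clock cycle time.
 * Miss penalty is in clock cycles; cycle time is in ns. */
static double cpu_time(double ic, double cpi_exec, double accesses_per_instr,
                       double miss_rate, double miss_penalty_cycles,
                       double cycle_time_ns) {
    double cpi = cpi_exec + accesses_per_instr * miss_rate * miss_penalty_cycles;
    return ic * cpi * cycle_time_ns;
}

int main(void) {
    /* Illustrative (assumed) values: 1e9 instructions, base CPI 1.0,
     * 1.5 accesses/instruction, 2% miss rate, 100-cycle penalty, 0.5 ns clock. */
    printf("%.3g ns\n", cpu_time(1e9, 1.0, 1.5, 0.02, 100.0, 0.5));
    return 0;
}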