25/03/2020 1
BVM ENGG. COLLEGE
AY: 2019-20 Sem: even Div: 9(EC)
Cache Memory
Microprocessor and Computer Architecture
Dr. Bhargav Goradiya
The Memory Hierarchy
This storage organization can be thought of as a pyramid:
The Memory Hierarchy
To access a particular piece of data, the CPU
first sends a request to its nearest memory,
usually cache.
If the data is not in cache, then main memory
is queried. If the data is not in main memory,
then the request goes to disk.
Once the data is located, the data and a
number of its nearby data elements are
fetched into cache memory.
Principle of Locality of Reference
Programs tend to reuse data and instructions
they have used recently
Instructions in localized areas are executed
repeatedly
“Make the common case fast”: favor
accesses to such data
Keep recently accessed data in the fastest
memory
Principle of Locality of Reference
Most program execution time spent on routines in
which many instructions are executed repeatedly
Temporal
– Recently executed instruction is likely to be executed again
very soon
– Whenever an item is first needed, it is first brought to the
cache, where it will hopefully remain until it is needed
again. Also influences choice on which item to discard when
cache is full
Spatial
– Instructions in close proximity to a recently executed
instruction are also likely to be executed soon
– Instead of fetching just one item into the cache, fetch
several adjacent data items as well (block/cache line)
Cache Memory
Cache Memory is intended to give:
– Memory speed approaching that of the fastest
memories available.
– Large memory size at the price of less expensive
types of semiconductor memories.
Small amount of fast memory.
Sits between normal main memory and CPU.
May be located on CPU chip or module.
Conceptual Operation
Relatively large and slow main memory together
with faster, smaller cache.
Cache contains a copy of portions of main memory.
When processor attempts to read a word from
memory, a check is made to determine if the word
exists in cache.
– If it is, the word is delivered to the processor.
– If not, a block of main memory is read into the cache, then the
word is delivered to the processor.
[Figure: the CPU exchanges single words with the cache (word transfer), while the cache exchanges whole blocks with main memory (block transfer).]
Hit Ratio
A measure of the efficiency of the cache
structure.
– When the CPU refers to memory and the word is
found in the cache, this is called a hit.
– When the word is not found in the cache, this is
called a miss.
Hit ratio is the total number of hits divided
by the total number of access attempts (hits
+ misses).
– It has been shown in practice that hit ratios
higher than 0.9 are achievable.
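With illustrative counts, the definition reads (the numbers are made up, not from the slides):

```python
# Hit ratio = hits / (hits + misses); the counts here are illustrative only.
def hit_ratio(hits, misses):
    return hits / (hits + misses)

print(hit_ratio(950, 50))  # 0.95
```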
Cache vs. Main Memory Structure
[Figure: the cache consists of C lines (0 to C-1), each holding a tag plus a block of K words; main memory consists of 2^n addressable words, viewed as a sequence of blocks of K words each.]
Main Memory and Cache Memory
Main Memory consists of 2^n words.
– Each word has a unique n-bit address.
We can consider that main memory is made
up of blocks of K words each.
Cache consists of C lines of K words each.
A block of main memory is copied into a line
of Cache.
– The “tag” field of the line identifies which main
memory block each cache line represents
Memory Hierarchy Design
Block placement:
– Where can a block be placed in the upper
level?
Block identification:
– How is a block found if it is in the upper
level?
Block replacement:
– Which block should be replaced on a miss?
Write strategy:
– What happens on a write?
Where can a block be placed in a cache?
Mapping function determines how a
block is placed in the cache
Mapping Functions
Three Types
– Direct Mapping
– Associative Mapping
– Set-Associative Mapping
Examples assume a 64K-word main
memory and a 2K-word cache
One block consists of 16 words
65536/16 = 4096 blocks in memory
2048/16 = 128 blocks in cache
Where can a block be placed in a cache?
How is a block found?
[Figure: a 128-block cache (each block with a tag) mapped from a 4096-block main memory; the 16-bit address is split into a 12-bit block field and a 4-bit word field.]
Direct Mapping
Block j of main memory maps onto block (j modulo
128) of the cache.
Example:
– Block 2103 of main memory maps to block (2103 mod 128)
= block 55
Each main memory block has only one place in
cache
Cache Block 0
– Memory Blocks 0, 128, 256...
Cache Block 1
– Memory Blocks 1, 129, 257...
Contention may occur even when cache is not full
Replacement algorithm is trivial
Simplest
– Very inflexible
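A minimal sketch of this placement rule, using the slides' 128-block cache:

```python
# Direct mapping: memory block j can live only in cache block (j mod 128).
CACHE_BLOCKS = 128

def cache_block(memory_block):
    return memory_block % CACHE_BLOCKS

print(cache_block(2103))  # 55, matching the example above
print(cache_block(0), cache_block(128), cache_block(256))  # 0 0 0: all contend for cache block 0
```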
Direct Mapping
16-bit address (64K words)
16 words per block → lower 4 bits
Cache block position → middle 7 bits
32 memory blocks map to the same
cache block
Higher 5 bits tell which of the 32 blocks
is currently mapped
Higher 5 bits are stored in the 5 tag bits
associated with the cache location
How is a block found if it is in the cache?
Direct Mapping
Middle 7 bits determine which location
in the cache is used
Higher-order 5 bits are matched with
the tag bits in the cache to check if the
desired block is the one stored there
Direct Mapping
[Figure: direct mapping from a 4096-block main memory onto a 128-block cache; the 16-bit address is split into a 5-bit tag, a 7-bit block field, and a 4-bit word field.]
Direct Mapping Cache Organization
[Figure: block diagram of a direct-mapped cache organization.]
Example: Direct Mapping
Main Memory: 2M words = 2^21 words
Cache: 64K words = 2^16 words
Words per Block: 8 words = 2^3 words
No. of blocks in the cache:
– Cache Size / Block Size = 2^16 / 2^3 = 2^13
No. of bits in Tag field:
Total Bits − Block Bits − Word Bits
21 − 13 − 3 = 5
Address fields: Tag 5 bits | Block 13 bits | Word 3 bits
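The field widths can be re-derived from the sizes with a short helper (the function name is illustrative; all sizes are in words and assumed to be powers of two):

```python
import math

def direct_fields(mem_words, cache_words, block_words):
    # Word field addresses a word inside a block; block field selects a
    # cache block; the tag is whatever remains of the address.
    word_bits  = int(math.log2(block_words))
    block_bits = int(math.log2(cache_words // block_words))
    tag_bits   = int(math.log2(mem_words)) - block_bits - word_bits
    return tag_bits, block_bits, word_bits

print(direct_fields(2**21, 2**16, 2**3))  # (5, 13, 3)
```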
Associative Mapping
To improve the hit ratio of the cache,
another mapping technique is often used:
“associative mapping”.
A block of main memory may be mapped
into ANY line of the cache.
– A block of memory is no longer restricted to a
single line of cache.
The higher-order 12 bits of the address are
stored in the tag bits
How is a block found if it is in the cache?
Associative Mapping
Tag bits (Higher-order 12 bits) of an
address are compared with tag bits of
each block to check if desired block is
present
Higher cost than direct mapping due to
need to search all 128 tags
Tags must be searched in parallel for
performance reasons
Associative Mapping
[Figure: associative mapping; any of the 4096 main memory blocks may occupy any of the 128 cache blocks; the 16-bit address is split into a 12-bit tag and a 4-bit word field.]
Set-Associative Mapping
Cache blocks are grouped into sets
A main memory block can reside in any
block of a specific set
Less contention than direct mapping
Less cost than associative mapping
Set = (Block Address) MOD (Number of
Sets in Cache)
k-way set associative cache: k blocks
per set
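The placement rule can be sketched in one line; the 64-set figure comes from the running example (128 cache blocks, two blocks per set):

```python
# Set-associative placement: set = (block address) mod (number of sets).
def target_set(block_addr, num_sets):
    return block_addr % num_sets

# 128 cache blocks grouped two per set -> 64 sets:
print(target_set(129, 64))  # 1: memory block 129 goes to set 1
```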
How is a block found if it is in the cache?
Set-Associative Mapping
Example: the cache groups two blocks per
set → 64 sets (6-bit set field)
64 memory blocks can be mapped onto one set
Tag bits in each cache block store the
upper 6 bits of the address to tell which of
the 64 blocks is currently in the cache
Set-Associative Mapping
[Figure: set-associative mapping with 64 sets of two blocks (Set 0 to Set 63); the 16-bit address is split into a 6-bit tag, a 6-bit set field, and a 4-bit word field.]
Example: Set-Associative Mapping
Main Memory: 32M words = 2^25 words
Cache: 128K words = 2^17 words
Words per Block: 16 words = 2^4 words
8 blocks per set = 2^3 blocks
No. of sets in the cache:
– 2^17 / (2^3 × 2^4) = 2^10
No. of tag bits: 25 − 10 − 4 = 11
Address fields: Tag 11 bits | Set 10 bits | Word 4 bits
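The same arithmetic, checked in a few lines (variable names are illustrative):

```python
mem_words   = 2**25   # 32M words
cache_words = 2**17   # 128K words
block_words = 2**4    # 16 words per block
ways        = 2**3    # 8 blocks per set

num_sets  = cache_words // (block_words * ways)   # 2**10 sets
set_bits  = num_sets.bit_length() - 1             # 10
word_bits = block_words.bit_length() - 1          # 4
tag_bits  = 25 - set_bits - word_bits             # 11
print(tag_bits, set_bits, word_bits)  # 11 10 4
```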
Levels of Set Associativity
Direct mapping: 1 block per set → 128 sets
Fully associative mapping: 128 blocks
per set → 1 set
Set-associative mapping is in between
direct and fully associative
The different mappings are just different
degrees of set associativity
Which block should be replaced on a cache
miss?
Replacement Algorithm
– Determines which block in the cache is to
be replaced when a cache miss occurs
and the cache is full
Trivial for direct mapped caches
Which block should be replaced on a cache
miss?
Replacement algorithms
– Random Replacement
– First-In First-Out (FIFO)
– Optimal Algorithm
– Least Recently Used (LRU)
– Least Frequently Used
– Most Frequently Used
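As a sketch of one of these policies, here is a minimal LRU cache for a fully associative cache (or a single set of a set-associative one). This is an illustration, not an implementation from the slides:

```python
from collections import OrderedDict

class LRUCache:
    """Tracks which blocks are resident; evicts the least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block number -> (data would go here)

    def access(self, block):
        if block in self.blocks:               # hit: mark most recently used
            self.blocks.move_to_end(block)
            return True
        if len(self.blocks) >= self.capacity:  # miss with full cache: evict LRU
            self.blocks.popitem(last=False)
        self.blocks[block] = None
        return False

cache = LRUCache(2)
print([cache.access(b) for b in (0, 1, 0, 2, 1)])
# [False, False, True, False, False]: block 1 was the LRU victim when 2 arrived
```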
What happens on a write?
Write policies
– Write-through
– Write-back
Write-Through
Cache location and main memory
location are updated simultaneously
Simpler, but results in unnecessary
write operations if a word is updated
many times during its cache residency
Requires only a valid bit
Valid Bit
Indicates whether the block stored in the
cache is still valid
Set to 1 when the block is initially loaded
into the cache
Transfers from disk to main memory use
DMA and bypass the cache
When a main memory block is updated by a
source that bypasses the cache, and the block
is also in the cache, its valid bit is set to 0
Write-Back
Update only the cache location and mark it
updated using dirty bit/modified bit
Main memory location is updated later when
block is replaced
Writes at the speed of the cache
Also results in unnecessary writes because
whole block is written back to memory even
if only one word is updated
Requires valid bit and dirty bit
Dirty Bit
Tells whether block in cache has been
modified/has newer data than main
memory block
Problem: Transfer from main memory to
disk bypassing the cache
Solution: Flush the cache (write back all
dirty blocks) before DMA transfer begins
What happens on a write miss?
No-write allocate: data is written
directly to main memory
Write allocate: the block is first loaded
from main memory into the cache, then
the cache block is written to
Write Buffer
Used as temporary holding location for
data to be written to memory
Processor need not wait for write to
finish
Data in write buffer will be written
when memory is available for writing
Works for both write-through and
write-back caches
Example of Mapping Techniques
Consider a data cache with 8 blocks of data
Each block of data consists of only one word
These are greatly simplified parameters
Consider a 4 × 10 array of numbers, stored
in column order
40 elements = 28h words, stored from 7A00h
to 7A27h
Example of Mapping Techniques
Address   Array element
7A00h     A(0,0)
7A01h     A(1,0)
7A02h     A(2,0)
7A03h     A(3,0)
7A04h     A(0,1)
7A05h     A(1,1)
7A06h     A(2,1)
7A07h     A(3,1)
…         …
7A24h     A(0,9)
7A25h     A(1,9)
7A26h     A(2,9)
7A27h     A(3,9)
Tag bits of the 16-bit address: direct mapped 13, set-associative 15, associative 16.
Example of Mapping Techniques
Consider the following algorithm
SUM := 0
for j:= 0 to 9 do
SUM := SUM + A(0,j)
end
AVE := SUM / 10
for i:= 9 downto 0 do
A(0,i) := A(0,i) / AVE
end
This computes the average of the first row
(row 0), then stores back each element of
that row divided by the average
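A direct Python rendering of the pseudocode, using made-up values for row 0 so the snippet runs standalone:

```python
# A(0,j) for j = 0..9; the values 0.0 .. 9.0 are illustrative only.
A0 = [float(j) for j in range(10)]

SUM = 0.0
for j in range(10):            # for j := 0 to 9 do
    SUM = SUM + A0[j]
AVE = SUM / 10                 # average of row 0
for i in range(9, -1, -1):     # for i := 9 downto 0 do
    A0[i] = A0[i] / AVE

print(AVE)    # 4.5
print(A0[9])  # 9 / 4.5 = 2.0
```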
Example of Mapping Techniques
SUM := 0
for j := 0 to 9 do
  SUM := SUM + A(0,j)
end
AVE := SUM / 10
for i := 9 downto 0 do
  A(0,i) := A(0,i) / AVE
end

Array     Address
A(0,0)    7A00h
A(0,1)    7A04h
A(0,2)    7A08h
A(0,3)    7A0Ch
A(0,4)    7A10h
A(0,5)    7A14h
A(0,6)    7A18h
A(0,7)    7A1Ch
A(0,8)    7A20h
A(0,9)    7A24h
Direct-Mapped Cache
Position  j=1     j=3     j=5     j=7     j=9     i=6     i=4     i=2     i=0
0         A(0,0)  A(0,2)  A(0,4)  A(0,6)  A(0,8)  A(0,6)  A(0,4)  A(0,2)  A(0,0)
1
2
3
4         A(0,1)  A(0,3)  A(0,5)  A(0,7)  A(0,9)  A(0,7)  A(0,5)  A(0,3)  A(0,1)
5
6
7
Direct-Mapped Cache
Only two cache locations are used, due
to the way the array is arranged
First loop (j = 0 to 9): all references
result in cache misses
Second loop (i = 9 down to 0): the references
to A(0,9) and A(0,8) are in the cache; the
rest are not
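The pattern can be reproduced with a tiny simulation; since each block holds one word, the cache position is simply the address mod 8:

```python
# Direct-mapped cache of 8 one-word blocks; A(0,j) lives at 7A00h + 4*j.
cache = [None] * 8

def access(addr):
    pos = addr % 8               # direct-mapped position
    hit = (cache[pos] == addr)
    cache[pos] = addr            # load on miss (no-op on hit)
    return hit

addrs = [0x7A00 + 4 * j for j in range(10)]
first  = [access(a) for a in addrs]            # j = 0 .. 9
second = [access(a) for a in reversed(addrs)]  # i = 9 .. 0
print(first.count(True), second.count(True))   # 0 2: only A(0,9) and A(0,8) hit
```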
Associative-Mapped Cache
j=7 j=8 j=9 i=1 i=0
0 A(0,0) A(0,8) A(0,8) A(0,8) A(0,0)
1 A(0,1) A(0,1) A(0,9) A(0,1) A(0,1)
2 A(0,2) A(0,2) A(0,2) A(0,2) A(0,2)
3 A(0,3) A(0,3) A(0,3) A(0,3) A(0,3)
4 A(0,4) A(0,4) A(0,4) A(0,4) A(0,4)
5 A(0,5) A(0,5) A(0,5) A(0,5) A(0,5)
6 A(0,6) A(0,6) A(0,6) A(0,6) A(0,6)
7 A(0,7) A(0,7) A(0,7) A(0,7) A(0,7)
Associative-Mapped Cache
All cache locations are used
First loop (0 to 9): all references result
in cache miss
Second loop (i = 9 down to 0): references
to A(0,9) through A(0,2) are in the cache;
the rest are not
Good utilization of the cache because of
the order of the second loop
Set-Associative-Mapped Cache
Position        j=3     j=7     j=9     i=4     i=2     i=0
Set 0:  0       A(0,0)  A(0,4)  A(0,8)  A(0,4)  A(0,4)  A(0,0)
        1       A(0,1)  A(0,5)  A(0,9)  A(0,5)  A(0,5)  A(0,1)
        2       A(0,2)  A(0,6)  A(0,6)  A(0,6)  A(0,2)  A(0,2)
        3       A(0,3)  A(0,7)  A(0,7)  A(0,7)  A(0,3)  A(0,3)
Set 1:  4–7     (unused)
Set-Associative-Mapped Cache
Only one set (half of the cache) was used
First loop (j = 0 to 9): all references
result in cache misses
Second loop (i = 9 down to 0): references
to A(0,9) through A(0,6) are in the cache;
the rest are not
Example of Mapping Techniques
In general, fully associative mapping
performs best
Second is set-associative mapping
Worst performance: direct mapping
But fully associative mapping is
expensive to implement
Compromise: set-associative mapping
Measuring Cache Performance
Hit Rate: h
– Ratio of number of hits to number of all
attempted accesses
Miss Rate
– Ratio of number of misses to number of all
attempted accesses
Hit Rate + Miss Rate = 1
Miss Penalty: M
– Extra time to bring desired block into cache
Hit time: C
– time to hit in the cache; time to access data from
the cache
Measuring Cache Performance
Average memory access time: tavg
– Hit Rate x Hit Time + Miss Rate x Miss
Penalty
tavg = hC + (1–h)M
Measuring Cache Performance
Example
– 10 clock cycles for memory access
– 17 clock cycles to load a block into the cache
(miss penalty)
– 1 clock cycle to load word from cache (hit time)
– Assume 30% of instructions perform a read or write
→ 130 memory accesses for every 100 instructions
– Hit rates: 0.95 for instructions, 0.9 for data
Time without cache = 130 * 10 = 1300
Time with cache = 100 (0.95*1 + 0.05*17)
+ 30 (0.9*1 + 0.1*17) = 258
Ratio: 5.04
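The example's arithmetic can be verified directly:

```python
# Times in clock cycles; 100 instruction fetches plus 30 data accesses.
time_without = 130 * 10
time_with = (100 * (0.95 * 1 + 0.05 * 17)    # instruction fetches
             + 30 * (0.90 * 1 + 0.10 * 17))  # data accesses
print(time_without, round(time_with))        # 1300 258
print(round(time_without / time_with, 2))    # 5.04
```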
Cache Examples
Intel 386 has no internal cache
Intel 486 – on-chip 8KB unified cache
Intel Pentium – 8KB/8KB split L1 cache
Intel Pentium II – 8KB/16KB split L1
cache / 256KB L2 Cache
Pentium III Cache
Two cache levels L1 and L2
L1
– 16-KB data cache
• Four-way set-associative / write-back or write-
through
– 16-KB instruction cache
• Two-way set-associative
L2
– Unified cache
– 256K (Coppermine), 512K (Tualatin)
Pentium III Cache
[Figure: Pentium III cache hierarchy; the processing units access split L1 instruction and data caches; a bus interface unit connects to the L2 cache over a dedicated cache bus, and to main memory and input/output over the system bus.]
Pentium IV Cache
Up to 3 levels of caches
L1
– 8K data cache, 4-way set-associative,
write-through
– 12K µop trace cache (serves as the instruction cache)
L2
– Unified cache: 256 KB, 512 KB, write-back
L1 and L2 implemented on chip
Memory Interleaving
Each memory module has its own ABR
(address buffer register) and DBR
(data buffer register)
Consecutive addresses are located in
successive modules
Lower-order address bits select a
module
Higher-order address bits select a
location within the module
Memory Interleaving
[Figure: consecutive words in a module; the high-order k bits of the main memory address select the module and the remaining m bits select the address within the module; each module has its own ABR (address buffer register) and DBR (data buffer register).]
Memory Interleaving
[Figure: interleaved organization; the low-order k bits of the main memory address select the module and the high-order m bits select the address within the module, so consecutive words lie in consecutive modules.]
Example
Assume a cache with 8-word blocks is used.
On a read miss, the block that contains
the desired word must be copied from
main memory into the cache.
The hardware takes 1 cycle to send the address.
Main memory is, say, DRAM that allows the
first word to be accessed in 8 cycles and
each subsequent word in 4 cycles.
25/03/2020 Cache Memory 62
If consecutive words are in a single
module, then the time required to load
the desired block into the cache is
1 + 8 + (7 × 4) + 1 = 38 cycles
If consecutive words are in consecutive
modules (interleaved), then the time required
to load the desired block into the cache is
1 + 8 + 4 + 4 = 17 cycles
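The two block-load times can be expressed as small formulas; the per-term comments are one plausible reading of the slide's breakdown:

```python
WORDS = 8  # words per block

# Single module: 1 cycle for the address, 8 for the first word,
# 4 for each of the remaining 7 words, and 1 final transfer cycle.
single_module = 1 + 8 + (WORDS - 1) * 4 + 1

# Consecutive modules (interleaved): later words arrive in parallel
# with earlier transfers, so only 1 + 8 + 4 + 4 cycles remain visible.
interleaved = 1 + 8 + 4 + 4

print(single_module, interleaved)  # 38 17
```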
Example Based on Cache Memory
A program consists of two nested
loops. The general structure of the
program is given below. The program
is to be run on a computer that has an
instruction cache organized in the direct-
mapped manner with the following parameters:
Main Memory Size: 64K words
Cache Size: 1K words
Block Size: 128 words
[Figure: program structure; START at address 17; an outer loop from address 23 to 1200 executed 10 times; an inner loop from address 165 to 239 executed 20 times per outer pass; END at address 1500.]
The cycle time of the main memory is 10τ and that of the cache is 1τ.
(a) Specify the number of bits in the TAG, BLOCK and WORD fields of the main memory address.
(b) Compute the total time needed for instruction fetching during the execution of the program.
Solution
Main memory address length is 16 bits.
BLOCK field is 3 bits (8 blocks).
WORD field is 7 bits (128 words/block).
TAG field is 6 bits.
Solution
Hence, the sequence of reads from the main
memory blocks into cache blocks follows the
loop structure: the program (addresses 17 to
1500) spans main memory blocks 0 through 11,
and block j maps onto cache block (j mod 8),
so memory blocks 8 and 9 compete with blocks
0 and 1 for cache blocks 0 and 1.
Both the beginning and the end of the outer
loop therefore use blocks 0 and 1 in the cache,
and they overwrite each other on each pass
through the loop. Blocks 2 to 7 remain resident
in the cache until the outer loop is completed.
Solution
The total time for reading the blocks from
the main memory into the cache is therefore
(10 + 4 × 9 + 2) × 128 × 10τ = 61,440τ
Executing the program out of the cache:
Outer loop minus inner loop =
[(1200 − 22) − (239 − 164)] × 10 × 1τ = 11,030τ
Inner loop = (239 − 164) × 200 × 1τ = 15,000τ
End section of program =
(1500 − 1200) × 1τ = 300τ
Total execution time = 87,770τ
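The arithmetic can be checked directly (all times in units of τ):

```python
# 48 block transfers of 128 words, each word taking 10 tau:
block_fetch = (10 + 4 * 9 + 2) * 128 * 10     # 61,440
# Outer loop body minus the inner loop, 10 passes at 1 tau per word:
outer = ((1200 - 22) - (239 - 164)) * 10 * 1  # 11,030
inner = (239 - 164) * 200 * 1                 # 15,000 (inner loop runs 200 times)
tail  = (1500 - 1200) * 1                     # 300
print(block_fetch + outer + inner + tail)     # 87770
```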
Example Set-associative Mapping
Problem 5.10
Solution
TAG field is 10 bits.
SET field is 4 bits.
WORD field is 6 bits.
Words 0, 1, 2,…., 4351 occupy blocks 0 to
67 in the main memory (MM).
After blocks 0, 1, 2,…., 63 have been read
from MM into the cache on the first pass, the
cache is full. Because of the fact that the
replacement algorithm is LRU, MM blocks
that occupy the first four sets of the 16
cache sets are always overwritten before
they can be used on a successive pass.
Solution
In particular, MM blocks 0, 16, 32, 48, and 64
continually displace each other in competing for the
4 block positions in cache set 0. The same thing
occurs in cache set 1 (MM blocks, 1, 17, 33, 49, 65),
cache set 2 (MM blocks 2, 18, 34, 50, 66) and cache
set 3 (MM blocks 3, 19, 35, 51, 67).
MM blocks that occupy the last 12 sets (sets 4
through 15) are fetched once on the first pass and
remain in the cache for the next 9 passes.
On the first pass, all 68 blocks of the loop must be
fetched from the MM. On each of the 9 successive
passes, blocks in the last 12 sets of the cache (4 x
12 = 48) are found in the cache, and the remaining
20 (68-48) blocks must be fetched from the MM
Solution
Time with Cache:
= 1 × 68 × 11τ + 9 × (20 × 11τ + 48 × 1τ)
= 748τ + 2,412τ = 3,160τ
Time without Cache:
= 10 × 68 × 10τ = 6,800τ
Improvement Factor = 6,800τ / 3,160τ ≈ 2.15