
Unit-4

S Raghavendra Kumar
SSNCE
Memory Technology

Memory Technology    Typical Access Time          $ per GiB in 2012
SRAM                 0.5–2.5 ns                   $500–$1000
DRAM                 50–70 ns                     $10–$20
Flash                5,000–50,000 ns              $0.75–$1.00
Magnetic Disk        5,000,000–20,000,000 ns      $0.05–$0.10

Ideal memory
• Access time of SRAM
• Capacity and cost/GB of disk
Strategy: arrange memory in a hierarchy
• Smaller and faster memory for data currently being accessed
• Larger and slower memory for data not currently being accessed

Memory Hierarchy

Memory hierarchy
▪ Store everything on flash/disk
▪ Copy recently accessed (and nearby) items from disk to smaller DRAM memory
▪ This is the main memory
▪ Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
▪ This is the cache memory, attached to the CPU
▪ "Recently accessed" is a good predictor of "currently needed" because of the principle of locality.

Principle of Locality
Programs access a small proportion of their address space at any time.
• Temporal locality (locality in time): items accessed recently are likely to be accessed again soon, e.g., instructions in a loop, induction variables
– Keep the most recently accessed items in the cache
• Spatial locality (locality in space): items near those accessed recently are likely to be accessed soon, e.g., sequential instruction access, array data
– Move blocks consisting of contiguous words closer to the processor

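As a small illustration (not from the slides), the loop below exhibits both kinds of locality:

    # The loop instructions and `total` are reused on every iteration
    # (temporal locality); data[i] touches consecutive elements (spatial).
    data = list(range(1024))
    total = 0
    for i in range(len(data)):
        total += data[i]
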
Memory Hierarchy Levels
• Block (aka cache line): the unit of copying; may be multiple words
• If the accessed data is present in the upper level
– Hit: access satisfied by the upper level
• Hit ratio: hits/accesses
– Hit time: time to access the block + time to determine hit/miss
• If the accessed data is absent
– Miss: data not in the upper level
• Miss ratio: misses/accesses = 1 – hit ratio
– Miss penalty: time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor

Average Memory Access Time (AMAT)

T_avg = H × Hit_time + (1 − H) × Miss_time

Example: Suppose the CPU references memory 100 times and 20% of the accesses miss. The memory access time is 10 ns on a hit and 100 ns on a miss. Find the AMAT.

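A minimal sketch of the formula in code (the function name and the print are illustrative, not from the slides); it also answers the exercise above:

    def amat(hit_ratio, hit_time_ns, miss_time_ns):
        # T_avg = H * Hit_time + (1 - H) * Miss_time
        return hit_ratio * hit_time_ns + (1 - hit_ratio) * miss_time_ns

    print(amat(0.8, 10, 100))  # 0.8*10 + 0.2*100 = 28.0 ns
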
Cache Memory

• Cache memory: the level of the memory hierarchy closest to the CPU
• Given accesses X1, …, Xn–1, Xn
• How do we know if the data is present?
• Where do we look?

Direct Mapped Cache
• Location determined by address
• Direct mapped: only one choice
– (Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits

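A one-line sketch of the placement rule (helper name is illustrative):

    def dm_index(block_address, num_blocks):
        # With num_blocks = 2^k, this keeps the low-order k bits.
        return block_address % num_blocks

    print(dm_index(22, 8))  # 22 mod 8 = 6 (binary 110), as in the example below
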
Tags and Valid Bits

• How do we know which particular block is stored in a cache location?
– Store the block address as well as the data
– Actually, only the high-order bits are needed
– These are called the tag
• What if there is no data in a location?
– Valid bit: 1 = valid, 0 = not valid
– Initially 0

Cache Example

◼ 8 blocks, 1 word/block, direct mapped
◼ Initial state

Index     V   Tag   Data
000 (0)   N
001 (1)   N
010 (2)   N
011 (3)   N
100 (4)   N
101 (5)   N
110 (6)   N
111 (7)   N

◼ Access sequence (word addresses): 22, 26, 22, 26, 16, 3, 16, 18, 16

Cache Example

Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Miss       110

Index   V   Tag   Data
000     N
001     N
010     N
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example

Word addr   Binary addr   Hit/miss   Cache block
26          11 010        Miss       010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example

Word addr   Binary addr   Hit/miss   Cache block
22          10 110        Hit        110
26          11 010        Hit        010

Index   V   Tag   Data
000     N
001     N
010     Y   11    Mem[11010]
011     N
100     N
101     N
110     Y   10    Mem[10110]
111     N


Cache Example

Word addr   Binary addr   Hit/miss   Cache block
16          10 000        Miss       000
3           00 011        Miss       011

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   11    Mem[11010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N

Cache Example

Word addr   Binary addr   Hit/miss   Cache block
16          10 000        Hit        000
18          10 010        Miss       010

Index   V   Tag   Data
000     Y   10    Mem[10000]
001     N
010     Y   10    Mem[10010]
011     Y   00    Mem[00011]
100     N
101     N
110     Y   10    Mem[10110]
111     N

Address Subdivision

[Figure: a direct-mapped cache datapath showing the subdivision of the address into tag, index, and offset fields, with the tag comparison producing the hit signal]

Larger Block Size

◼ 64 blocks, 16 bytes/block
◼ To what block number does address 1200 map?
◼ Block address = 1200/16 = 75
◼ Block number = 75 modulo 64 = 11

Bits 31–10: Tag (22 bits) | Bits 9–4: Index (6 bits) | Bits 3–0: Offset (4 bits)

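A sketch of the field extraction for this configuration (64 blocks give 6 index bits, 16-byte blocks give 4 offset bits; names are illustrative):

    def split_address(addr, index_bits=6, offset_bits=4):
        offset = addr & ((1 << offset_bits) - 1)
        index = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset

    print(split_address(1200))  # (1, 11, 0): block 1200/16 = 75, 75 mod 64 = 11
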
Question 1

◼ 100 blocks, 16 bytes/block
◼ To what block number does address 2000 map?
◼ Find the tag size, index size, and offset size.

Question 2

◼ 1024 blocks, 32 bytes/block
◼ Find the tag size, index size, and offset size.

Multiword Block Direct Mapped Cache

[Figure: a direct-mapped cache containing 256 blocks with 16 words per block]

Direct Mapped Cache Address Bits

• Cache with 2^n blocks, 2^m bytes/block

Bits 31–(m+n): Tag (32−m−n bits) | Bits (m+n−1)–m: Block index (n bits) | Bits (m−1)–2: Block offset (m−2 bits) | Bits 1–0: Byte offset (2 bits)

Size of a direct-mapped cache:
– Data bits = 2^n × 2^m × 8
– Valid bits = 2^n (1 bit per block)
– Tag bits = 2^n × (32 − m − n) (1 tag per block)
– Total bits = 2^n × (1 + [32 − m − n] + 2^m × 8)
– Efficiency = data bits / total bits = 2^m × 8 / (1 + [32 − m − n] + 2^m × 8)

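The formulas above, transcribed directly into code (assumed helper, not from the slides):

    def cache_bits(n, m):
        data_bits = (2 ** n) * (2 ** m) * 8
        valid_bits = 2 ** n                     # 1 valid bit per block
        tag_bits = (2 ** n) * (32 - m - n)      # 1 tag per block
        total = valid_bits + tag_bits + data_bits
        return total, data_bits / total         # total bits, efficiency

    print(cache_bits(10, 5))  # 1024 blocks of 32 bytes: (280576, ~0.93)
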
Impact of Block Size
• Advantages of larger blocks:
– Reduces the miss rate, due to spatial locality
– Amortizes the overhead of the tag bits
• Disadvantages (assuming a fixed-size cache):
– Fewer blocks (increases the miss rate)
– Underutilizes blocks (pollution) if there is no spatial locality
– Larger miss penalty (more bytes to fetch)
– Can outweigh the benefit of the reduced miss rate
– Early restart and critical-word-first can help

Writing to Caches

• On a write hit, if we just update the block in the cache, the cache and memory would be inconsistent.
• Need to ensure that both are eventually updated.

Write-Through Cache

• A write-through cache updates both the cache and memory at the time of the write.
• Disadvantage:
– Makes writes take longer, because they must wait for the lower levels of the hierarchy.
• Solution: use a write buffer
– Buffers data that is waiting to be written to memory.
– The processor continues while the write buffer writes in the background.
• It only stalls if the write buffer is already full.

Write-Back Cache
• A write-back cache updates only the cache.
• The value is written to memory when the block is evicted.
– Need to know which blocks have changed
– Use a dirty bit.
• Now evictions take longer
– Solution: use a write buffer for evicted dirty blocks.

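A minimal sketch of the write-back bookkeeping (class and names are illustrative):

    class WriteBackLine:
        def __init__(self):
            self.valid = self.dirty = False
            self.tag = self.data = None

    def write_hit(line, data):
        line.data = data
        line.dirty = True                    # defer the memory update

    def evict(line, memory, block_addr):
        if line.valid and line.dirty:
            memory[block_addr] = line.data   # write back only on eviction
        line.valid = line.dirty = False
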
Block Placement in Various Mappings

[Figure: where a block can be placed under direct-mapped, set-associative, and fully associative mapping]

Associative Caches
▪ Fully associative
▪ Allow a given block to go in any cache entry
▪ Requires all entries to be searched at once
▪ Comparator per entry (expensive)
▪ n-way set associative
▪ Each set contains n entries
▪ Search all entries in a given set at once
▪ Block address determines which set
▪ (Block address) modulo (#Sets in cache)
▪ n comparators (less expensive)

Spectrum of Associativity

[Figure: the same cache organized at increasing degrees of associativity, from direct mapped to fully associative]

Set Associative Cache Organization

[Figure: a four-way set-associative cache with 256 sets and 1 word/block]

Example: 2-Way Set Associative

Address     Hit/Miss   Tag (3 bits)   Set index (2 bits)   Block offset (1 bit)   Byte offset (2 bits)
10110000    Miss       101            10                   0                      00
11101100    Miss       111            01                   1                      00
10010000    Miss       100            10                   0                      00
10110100    Hit        101            10                   1                      00
11010000

            Way 0                             Way 1
Index   V   Tag   Word 0   Word 1     V   Tag   Word 0   Word 1
00      0                             0
01      1   111   x        x          0
10      1   101   x        x          1   100   x        x
11      0                             0

Size of Tags versus Set Associativity
Problem: Increasing associativity requires more comparators and more
tag bits per cache block. Assuming a cache of 4096 blocks, a 4-word
block size, and a 32-bit address, find the total number of sets and the
total number of tag bits for caches that are direct mapped, two-way and
four-way set associative, and fully associative.

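A sketch that works the problem numerically (loop and names are illustrative):

    import math

    BLOCKS, OFFSET_BITS, ADDR_BITS = 4096, 4, 32   # 4-word = 16-byte blocks
    for ways in (1, 2, 4, BLOCKS):                 # DM ... fully associative
        sets = BLOCKS // ways
        index_bits = int(math.log2(sets))
        tag_bits = ADDR_BITS - index_bits - OFFSET_BITS
        print(ways, sets, BLOCKS * tag_bits)
    # direct mapped: 4096 sets, 16-bit tags -> 65536 tag bits;
    # fully associative: 1 set, 28-bit tags -> 114688 tag bits
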
Range of Set Associative Caches

For a fixed-size cache, each factor-of-two increase in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.

Replacement Policy
• Direct mapped: no choice
• Set associative
– Prefer a non-valid entry, if there is one
– Otherwise, choose among the entries in the set
– Least recently used (LRU)
• Choose the one unused for the longest time (see the sketch after this list)
– Simple for 2-way, manageable for 4-way, too hard beyond that
– Random
• Gives approximately the same performance as LRU for high associativity
– FIFO
• Replace the block that has been in the cache longest.

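A minimal LRU sketch for one set, as promised above (an OrderedDict stands in for the recency bookkeeping; names are illustrative):

    from collections import OrderedDict

    class LRUSet:
        def __init__(self, ways):
            self.ways = ways
            self.lines = OrderedDict()          # tag -> data, oldest first

        def access(self, tag):
            if tag in self.lines:
                self.lines.move_to_end(tag)     # hit: mark most recently used
                return True
            if len(self.lines) == self.ways:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[tag] = None
            return False
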
Sources of Cache Misses

▪ Compulsory (cold start or process migration, first reference): the first access to a block; a "cold" fact of life, not a whole lot you can do about it. If you are going to run millions of instructions, compulsory misses are insignificant.
▪ Solution: increase block size
▪ Capacity: the cache cannot contain all the blocks accessed by the program
▪ Solution: increase cache size
▪ Conflict (collision): multiple memory locations map to the same cache location
▪ Solution 1: increase cache size
▪ Solution 2: increase associativity

Reducing Cache Miss Rates

• Use multiple levels of caches
– Primary (L1) cache attached to the CPU
• Small, but fast
• Separate L1 I$ and L1 D$
– Level-2 cache services misses from the L1 cache
• Larger and slower, but still faster than main memory
• Unified cache for both instructions and data
• Main memory services L2 cache misses
• Some high-end systems include an L3 cache

Multilevel Cache Considerations

Primary cache
– Focus on minimal hit time
– Smaller total size with smaller block size
L2 cache
– Focus on a low miss rate to avoid main memory accesses
– Hit time has less overall impact
– Larger total size with larger block size
– Higher levels of associativity

Global vs. Local Miss Rate

Global miss rate
– The fraction of references that miss in all levels of a multilevel cache
– Dictates how often main memory is accessed
Local miss rate
– The fraction of references to one level of a cache that miss

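A sketch combining the two notions with the AMAT formula from earlier (the numbers and names are illustrative):

    def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss, mem_time):
        # The global L2 miss rate is l1_miss_rate * l2_local_miss.
        return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss * mem_time)

    print(amat_two_level(1, 0.05, 10, 0.20, 100))  # 1 + 0.05*(10+20) = 2.5 cycles
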
Virtual Memory

◼ A memory management technique developed for multitasking computer architectures
◼ Virtualizes various forms of data storage
◼ Allows a program to be designed as if there is only one type of memory, i.e., "virtual" memory
◼ Each program runs in its own virtual address space
◼ Uses main memory as a "cache" for secondary (disk) storage
◼ Allows efficient and safe sharing of memory among multiple programs
◼ Provides the ability to easily run programs larger than the size of physical memory
◼ Simplifies loading a program for execution by providing for code relocation

Two Programs Sharing Physical Memory
▪ A program's address space is divided into pages (fixed size)
▪ The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table

[Figure: the virtual address spaces of Program 1 and Program 2 both mapping pages into main memory, which serves as a "cache" of the hard drive]

Address Translation
▪ A virtual address is translated to a physical address by a combination of hardware and software

Virtual Address (VA):    bits 31 . . . 12: Virtual page number | bits 11 . . . 0: Page offset
                                          | Translation
Physical Address (PA):   bits 29 . . . 12: Physical page number | bits 11 . . . 0: Page offset

◼ So each memory request first requires an address translation from the virtual space to the physical space
◼ A virtual memory miss (i.e., when the page is not in physical memory) is called a page fault

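A sketch of the translation step, assuming 4 KiB pages (the 12-bit offset above) and a dictionary standing in for the page table:

    PAGE_OFFSET_BITS = 12        # 4 KiB pages

    def translate(va, page_table):
        vpn = va >> PAGE_OFFSET_BITS
        offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
        if vpn not in page_table:
            raise RuntimeError("page fault")   # page not in physical memory
        return (page_table[vpn] << PAGE_OFFSET_BITS) | offset

    print(hex(translate(0x12345, {0x12: 0x99})))  # 0x99345
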
Page Tables

◼ Store placement information
◼ An array of page table entries, indexed by virtual page number
◼ A page table register in the CPU points to the page table in physical memory
◼ If the page is present in memory
◼ The page table entry stores the physical page number
◼ Plus other status bits (referenced, dirty, …)
◼ If the page is not present
◼ The page table entry can refer to a location in swap space on disk
◼ Swap space: the space on disk reserved for the full virtual memory space of a process

Address Translation Mechanisms

[Figure: the page table register points to the page table in main memory; the virtual page number indexes the table, and a valid entry supplies the physical page base address that is combined with the page offset, while invalid entries refer to disk storage]

Address Translation Example

Translation Using a Page Table

Page Fault Penalty

◼ On a page fault, the page must be fetched from disk
◼ Takes millions of clock cycles
◼ Handled by OS code
◼ Try to minimize the page fault rate
◼ Fully associative placement
◼ Smart replacement algorithms

Replacement and Writes

◼ To reduce the page fault rate, prefer least-recently used (LRU) replacement
◼ A reference bit (aka use bit) in the PTE is set to 1 on access to the page
◼ Periodically cleared to 0 by the OS
◼ A page with reference bit = 0 has not been used recently
◼ Disk writes take millions of cycles
◼ Write a block at once, not individual locations
◼ Write-through is impractical
◼ Use write-back
◼ A dirty bit in the PTE is set when the page is written

Virtual Addressing with a Cache

◼ Thus it takes an extra memory access to translate a VA to a PA

CPU --VA--> Translation --PA--> Cache --miss--> Main Memory
(on a hit, the cache returns the data directly to the CPU)

▪ This makes memory accesses very expensive (if every access were really two accesses)
▪ The hardware fix is to use a Translation Lookaside Buffer (TLB): a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup

Fast Translation Using a TLB

◼ Address translation would appear to require extra memory references
◼ One to access the PTE
◼ Then the actual memory access
◼ But access to page tables has good locality
◼ So use a fast cache of PTEs within the CPU
◼ Called a Translation Look-aside Buffer (TLB)
◼ Misses can be handled by hardware or software

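A sketch of the TLB-in-front-of-the-page-table lookup (a dict stands in for the TLB; names are illustrative):

    tlb = {}                               # vpn -> ppn, a tiny cache of PTEs

    def translate_with_tlb(vpn, page_table):
        if vpn in tlb:
            return tlb[vpn]                # TLB hit: no page table access
        ppn = page_table[vpn]              # TLB miss: extra memory reference
        tlb[vpn] = ppn                     # cache the mapping for next time
        return ppn
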
Making Address Translation Fast

[Figure: the TLB holds valid, tagged copies of page table entries (virtual page number tag mapping to physical page base address); on a TLB miss, the page table register locates the full page table in physical memory, whose invalid entries refer to disk storage]

Translation Lookaside Buffers (TLBs)

◼ Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped

TLB entry: V | Tag | Physical Page # | Dirty | Ref | Access

▪ TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches)
▪ TLBs are typically not more than 512 entries, even on high-end machines

TLB Misses

◼ If the page is in memory
◼ Load the PTE from memory and retry
◼ Could be handled in hardware
◼ Can get complex for more complicated page table structures
◼ Or in software
◼ Raise a special exception, with an optimized handler
◼ If the page is not in memory (page fault)
◼ The OS handles fetching the page and updating the page table
◼ Then restart the faulting instruction

TLB Miss Handler

◼ A TLB miss indicates either
◼ Page present, but PTE not in the TLB
◼ Page not present
◼ The handler copies the PTE from memory to the TLB
◼ Then restarts the instruction
◼ If the page is not present, a page fault will occur

Page Fault Handler

◼ Use the faulting virtual address to find the PTE
◼ Locate the page on disk
◼ Choose a page to replace
◼ If dirty, write it to disk first
◼ Read the page into memory and update the page table
◼ Make the process runnable again
◼ Restart from the faulting instruction

Modes of Data Transfer

• Programmed IO
• Interrupt Driven IO
• DMA Transfer

Programmed IO

• Each data item transfer is initiated by an instruction in the program. Usually the transfer is to and from a CPU register and the peripheral.
• Transferring data under program control requires constant monitoring of the peripheral by the CPU.
• Once the data transfer is initiated, the CPU is required to monitor the interface to see when the transfer can be made.
• In the programmed IO method the CPU stays in a program loop until the IO unit indicates that it is ready for data transfer. This is called polling.
• This is a time-consuming process, since it keeps the processor busy needlessly.
• In the programmed IO method the IO device does not have direct access to memory.
• A transfer from an IO device to memory requires the execution of several instructions by the CPU, including an input instruction to transfer the data from the device to the CPU, and a store instruction to transfer the data from the CPU to memory.
• Other instructions are needed to verify that the data are available from the device and to count the number of words transferred.

Programmed IO
• When a byte of data is available, the device places it on the IO bus and enables its "data valid" line.
• The interface accepts the byte into its data register and enables the "data accepted" line.
• The interface sets a bit in the status register, the F (flag) bit. The device can now disable the "data valid" line, but will not transfer another byte until the "data accepted" line is disabled by the interface.
• A program is written to check the flag in the status register to determine whether a byte has been placed in the data register by the IO device. This is done by reading the status register into a CPU register and checking the value of the flag bit.
• If the flag is equal to 1, the CPU reads the data from the data register. The flag bit is reset to 0 by either the CPU or the interface, depending on how the interface circuits are designed.

Programmed IO

• The transfer of each byte requires three instructions:
1. Read the status register.
2. Check the status of the flag bit; branch to step 1 if not set, or to step 3 if set.
3. Read the data register.
• The programmed IO method is particularly useful in small, low-speed computers.

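A sketch of the three-step polling loop (read_status and read_data stand in for IO register reads; the flag-bit position is illustrative):

    FLAG_BIT = 0x01

    def read_byte_polled(read_status, read_data):
        # Steps 1-2: read the status register and branch on the flag bit.
        while not (read_status() & FLAG_BIT):
            pass                           # the CPU busy-waits (polls) here
        return read_data()                 # step 3: read the data register
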
Interrupt driven data transfer
• The daisy-chaining method of establishing priority consists of a serial connection of all the devices that can request an interrupt.
• The device with the highest priority is placed in the first position, followed by lower-priority devices, down to the lowest-priority device, which is placed last in the chain.
• The interrupt request line is common to all the devices and forms a wired-logic connection.
• If any device has its interrupt signal in the low-level state, the interrupt line goes to the low-level state and enables the interrupt input in the CPU.
• When no interrupts are pending, the interrupt line stays in the high-level state and no interrupts are recognized by the CPU. This is equivalent to a negative-logic OR operation.

Interrupt driven data transfer
• The CPU responds to an interrupt request by enabling the interrupt acknowledge line.
• This signal is received by device 1 at its PI (priority in) input.
• The acknowledge signal passes to the next device through the PO (priority out) output only if device 1 is not requesting an interrupt.
• If device 1 has a pending interrupt, it blocks the acknowledge signal from the next device by placing a 0 on its PO output.
• It then proceeds to insert its own interrupt vector address (VAD) onto the data bus for the CPU to use during the interrupt cycle.

Interrupt driven data transfer

• A device with a 0 at its PI input generates a 0 at its PO output to inform the next lower-priority device that the acknowledge signal has been blocked.
• A device that is requesting an interrupt and has a 1 at its PI input intercepts the acknowledge signal by placing a 0 on its PO output.
• If the device does not have any pending interrupts, it transmits the acknowledge signal to the next device by placing a 1 on its PO output.
• Thus the device with PI = 1 and PO = 0 is the highest-priority device that is requesting an interrupt, and this device places its vector address on the data bus.
• The daisy-chain arrangement gives the highest priority to the device that receives the acknowledge signal from the CPU.

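A sketch of the PI/PO priority logic over a chain of devices (list order is chain order; the vector addresses are illustrative):

    def daisy_chain(requests, vads):
        pi = 1                        # the CPU's acknowledge enters device 0
        for req, vad in zip(requests, vads):
            if pi and req:
                return vad            # PI = 1, PO = 0: this device wins
            pi = pi and not req       # pass the acknowledge along if idle
        return None                   # no pending interrupt

    print(hex(daisy_chain([0, 1, 1], [0x10, 0x20, 0x30])))  # 0x20: device 1 wins
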
DMA mode of data transfer
• The transfer of data between a fast storage device such as a magnetic disk and memory is limited by the speed of the CPU.
• Removing the CPU from the path and letting the peripheral device manage the memory buses directly improves the speed of the transfer.
• This transfer technique is called direct memory access (DMA). During a DMA transfer the CPU is idle and has no control of the memory buses.
• A DMA controller takes control of the buses to manage the transfer directly between the IO device and the memory.

DMA mode of data transfer
• The bus request (BR) line is used by the DMAC (DMA controller) to request that the CPU relinquish control of the buses.
• When this input is active, the CPU terminates the execution of the current instruction and places the address bus, data bus, and R/W lines into a high-impedance state.
• The high-impedance state behaves like an open circuit, which means that the output is disconnected and has no logic significance.
• The CPU activates the bus grant (BG) output to inform the external DMAC that the buses are in the high-impedance state.
• The DMAC that originated the bus request can now take control of the buses to conduct memory transfers without processor intervention.

DMA mode of data transfer
• A DMA transfer can be:
• Burst mode: a block sequence consisting of a number of memory words is transferred in a continuous burst while the DMA controller is master of the memory buses. This mode of transfer is needed for fast devices such as magnetic disks, where the data transfer cannot be stopped or slowed down until an entire block is transferred.
• Cycle stealing: allows the DMA controller to transfer one data word at a time, after which it must return control of the buses to the CPU. The CPU merely delays its operation for one memory cycle to allow the direct memory IO transfer to "steal" one memory cycle.

DMA mode of data transfer

• When the peripheral device asserts the DMA request line, the DMAC activates the BR line, informing the CPU to relinquish the buses.
• The CPU responds with its BG line, informing the DMAC that its buses are disabled.
• The DMAC then puts the current value of its address register onto the address bus, initiates the RD or WR signal, and sends a DMA acknowledge (ACK) to the peripheral device.
• The RD and WR lines of the DMAC are bidirectional. The direction of transfer depends on the status of the BG signal.
• If BG = 0, RD and WR are input lines allowing the CPU to communicate with the internal DMA registers.

DMA mode of data transfer

• If BG = 1, RD and WR are output lines from the DMAC to the RAM specifying the read or write operation for the data.
• When the peripheral device receives the DMA ACK, it puts a word on the data bus (for a write) or receives a word from the data bus (for a read).
• Thus the DMAC controls the read/write operation and supplies the addresses for the memory.
• The peripheral unit can then communicate with the memory through the data bus for direct transfer between the two units while the CPU is momentarily disabled.

DMA mode of data transfer

• For each word that is transferred, the DMAC increments its address register and decrements its word count register.
• When the word count register reaches zero, the DMAC stops any further transfers and removes its bus request.
• It also informs the CPU of the termination by means of an interrupt.
• When the CPU responds to the interrupt, it reads the contents of the word count register.
• A zero value in the register indicates that all words were transferred successfully.

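A sketch of the DMAC's per-word bookkeeping described above (the register names are illustrative):

    def dma_transfer(memory, words, start_addr):
        addr, count = start_addr, len(words)   # address and word count registers
        for w in words:
            memory[addr] = w                   # one word per stolen memory cycle
            addr += 1                          # increment the address register
            count -= 1                         # decrement the word count register
        # count == 0: remove the bus request and interrupt the CPU; the CPU
        # then reads the zero word count to confirm a successful transfer
        return count

    mem = {}
    print(dma_transfer(mem, [7, 8, 9], 0x100))  # 0; mem now holds three words
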
