Unit 4
S Raghavendra Kumar
SSNCE
Memory Technology
(Table: memory technologies with their typical access times and $ per GiB in 2012.)
Ideal memory
• Access time of SRAM
• Capacity and cost/GB of disk
Strategy: arrange memory in a hierarchy
• Smaller and faster memory for data currently being accessed
• Larger and slower memory for data not currently being accessed
Memory Hierarchy
▪ Store everything on flash/disk
▪ Copy recently accessed (and nearby) items from disk to smaller DRAM memory
▪ Main memory
▪ Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
▪ Cache memory attached to CPU
▪ "Recently" is a good predictor of "currently" because of the principle of locality.
Principle of Locality
Programs access a small proportion of their address space at any time.
• Temporal locality (locality in time): items accessed recently are likely to be accessed again soon, e.g., instructions in a loop, induction variables
– Keep the most recently accessed items in the cache
• Spatial locality (locality in space): items near those accessed recently are likely to be accessed soon, e.g., sequential instruction access, array data
– Move blocks consisting of contiguous words closer to the processor
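Both kinds of locality show up in an ordinary summation loop. This Python fragment is only an illustration of the access pattern, not anything from the slides:

```python
# Summing an array: a toy illustration of locality of reference.
data = list(range(100))

total = 0
for i in range(len(data)):
    # 'total' and 'i' are reused on every iteration: temporal locality.
    # data[0], data[1], ... are adjacent in memory:   spatial locality.
    total += data[i]

print(total)  # 4950
```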
Memory Hierarchy Levels
• Block (aka cache line): unit of copying; may be multiple words
• If accessed data is present in the upper level
– Hit: access satisfied by upper level
• Hit ratio: hits/accesses
– Hit time: time to access the block + time to determine hit/miss
• If accessed data is absent
– Miss: data not in upper level
• Miss ratio: misses/accesses = 1 − hit ratio
– Miss penalty: time to access the block in the lower level + time to transmit that block to the level that experienced the miss + time to insert the block in that level + time to pass the block to the requestor
Average Memory Access Time (AMAT)
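The body of this slide is not reproduced in the text. The standard definition is AMAT = hit time + miss rate × miss penalty; a minimal sketch with hypothetical numbers:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time;
    the fraction of accesses that miss additionally pay the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical figures: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # 6.0 cycles on average
```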
Cache Memory
Direct Mapped Cache
• Location determined by address
• Direct mapped: only one choice
– (Block address) modulo (#Blocks in cache)
• #Blocks is a power of 2
• Use low-order address bits
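Because the block count is a power of two, the modulo reduces to keeping the low-order bits. A small sketch (the example numbers are made up):

```python
def dm_index(block_address, num_blocks):
    # Direct mapped: index = (block address) modulo (#blocks in cache).
    # With a power-of-two block count this is just the low-order bits.
    assert num_blocks & (num_blocks - 1) == 0, "#blocks must be a power of 2"
    return block_address & (num_blocks - 1)

print(dm_index(75, 64))   # 11, identical to 75 % 64
print(dm_index(139, 8))   # 3, identical to 139 % 8
```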
Tags and Valid Bits
9 v 1.2
Cache Example
Larger Block Size
◼ 64 blocks, 16 bytes/block
◼ To what block number does address 1200 map?
◼ Block address = 1200/16 = 75
◼ Block number = 75 modulo 64 = 11
Address fields (32-bit byte address): Tag = bits 31–10 (22 bits), Index = bits 9–4 (6 bits), Offset = bits 3–0 (4 bits)
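The field widths above can be checked by slicing address 1200 with shifts and masks; this sketch assumes the same 32-bit byte address as the example:

```python
OFFSET_BITS = 4   # 16 bytes per block
INDEX_BITS = 6    # 64 blocks

addr = 1200
offset = addr & ((1 << OFFSET_BITS) - 1)
index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
tag = addr >> (OFFSET_BITS + INDEX_BITS)

# Block address 1200/16 = 75; 75 modulo 64 = 11, as in the example.
print(addr >> OFFSET_BITS, index, offset)  # 75 11 0
```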
Question 1
Question 2
Multiword Block Direct Mapped Cache
Cache contains 256 blocks with 16 words per block
Direct Mapped Cache Address Bits
Impact of Block size
• Advantages of larger blocks:
– Reduces miss rate due to spatial locality
– Amortizes the overhead of the tag bits
• Disadvantages (assuming fixed-size cache):
– Fewer blocks (increases the miss rate)
– Underutilizes blocks (pollution) if there is no spatial locality
– Larger miss penalty (more bytes to fetch)
– Can override the benefit of the reduced miss rate
– Early restart and critical-word-first can help
Writing to Caches
Write Through Cache
Write-back Cache
• A write updates only the cache.
• The value is written to memory when the block is evicted.
– Need to know which blocks have changed
– Use a dirty bit.
• Now eviction takes longer
– Solution: use a write buffer for evicted dirty blocks.
Block Placement in Various Mapping
Associative Caches
▪ Fully associative
▪ Allow a given block to go in any cache entry
▪ Requires all entries to be searched at once
▪ Comparator per entry (expensive)
▪ n-way set associative
▪ Each set contains n entries
▪ Search all entries in a given set at once
▪ Block address determines which set
▪ (Block address) modulo (#Sets in cache)
▪ n comparators (less expensive)
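The set computation can be sketched the same way as the direct-mapped case; the 128-block, 4-way figures here are hypothetical:

```python
def set_index(block_address, num_sets):
    # n-way set associative: set = (block address) modulo (#sets in cache);
    # within the chosen set, the block may occupy any of the n ways.
    return block_address % num_sets

BLOCKS, WAYS = 128, 4
SETS = BLOCKS // WAYS       # 32 sets of 4 entries each

print(set_index(75, SETS))  # 11
```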
Spectrum of Associativity
Set Associative Cache Organization
4-way set associative, 256 sets in cache, 1 word/block
Example: 2-way set associative
Address fields: Tag (3 bits) | Set index (2 bits) | Block offset (1 bit) | Byte offset (2 bits)

Address    Hit/Miss   Tag   Set index   Block offset   Byte offset
10110000   Miss       101   10          0              00
11010000

       Way 0                        Way 1
Index  V   Tag   Word 0   Word 1    V   Tag   Word 0   Word 1
00     0                            0
01     1   111   x        x         0
10     1   101   x        x         1   100   x        x
11     0                            0
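The field split in the table can be reproduced with shifts and masks; a sketch for the 8-bit addresses used above:

```python
def split(addr):
    # 8-bit address: tag (3) | set index (2) | block offset (1) | byte offset (2)
    byte_off = addr & 0b11
    block_off = (addr >> 2) & 0b1
    set_idx = (addr >> 3) & 0b11
    tag = addr >> 5
    return tag, set_idx, block_off, byte_off

print(split(0b10110000))  # (0b101, 0b10, 0, 0): tag 101, set index 10
print(split(0b11010000))  # tag 110, and it maps to the same set (10)
```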
Size of Tags versus Set Associativity
Problem: Increasing associativity requires more comparators and more
tag bits per cache block. Assuming a cache of 4096 blocks, a 4-word
block size, and a 32-bit address, find the total number of sets and the
total number of tag bits for caches that are direct mapped, two-way and
four-way set associative, and fully associative.
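The arithmetic for this problem can be sketched in a few lines (a 4-word block is 16 bytes, so 4 offset bits leave a 28-bit block address):

```python
# Parameters from the problem: 4096 blocks, 4-word (16-byte) blocks,
# 32-bit byte address, hence 4 offset bits and a 28-bit block address.
BLOCKS = 4096
BLOCK_ADDR_BITS = 32 - 4

results = {}
for ways in (1, 2, 4, BLOCKS):        # direct mapped ... fully associative
    sets = BLOCKS // ways
    index_bits = sets.bit_length() - 1          # log2(#sets)
    tag_bits = BLOCK_ADDR_BITS - index_bits     # tag bits per block
    results[ways] = (sets, tag_bits * BLOCKS)   # (#sets, total tag bits)

print(results[1])     # (4096, 65536): 12 index bits, 16-bit tags
print(results[4096])  # (1, 114688): fully associative, 28-bit tags
```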
Range of Set Associative Caches
Replacement policies:
– Random
• Gives approximately the same performance as LRU for high associativity
– FIFO
• Replace the block that has been in the cache the longest.
Sources of Cache Misses
Reducing Cache Miss Rates
Multilevel Cache Considerations
Global vs. Local Miss Rate
Virtual Memory
(Figure: the virtual address spaces of Program 1 and Program 2 both map, through translation, into the one physical main memory.)

(Figure: page-table translation. The page table register points to the page table held in main memory. Each entry holds a valid bit V and a physical page base address; the physical page number is concatenated with the unchanged page offset. Entries with V = 0 refer to pages held on disk storage.)
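The translation in the figure can be sketched as follows; the 4 KiB page size and the table contents are assumptions chosen for illustration:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages; the slides do not fix a size

# Hypothetical page table: virtual page # -> (valid bit, physical page #).
page_table = {0: (1, 7), 1: (1, 3), 2: (0, None)}  # VPN 2 is on disk

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    valid, ppn = page_table[vpn]
    if not valid:
        raise RuntimeError("page fault: OS must fetch the page from disk")
    return ppn * PAGE_SIZE + offset   # the offset is unchanged by translation

print(hex(translate(0x1234)))  # VPN 1, offset 0x234 -> 0x3234
```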
Address Translation Example
Translation Using a Page Table
Page Fault Penalty
Replacement and Writes
Virtual Addressing with a Cache
Making Address Translation Fast
(Figure: the TLB caches recent translations. Each TLB entry holds a valid bit, a tag derived from the virtual page number, and the physical page base address. On a TLB miss, the page table register locates the full page table in physical memory; page-table entries with V = 0 refer to pages held on disk storage.)
Fast Translation Using a TLB
Translation Lookaside Buffers (TLBs)
TLB Misses
◼ If page is in memory
◼ Load the PTE from memory and retry
◼ Could be handled in hardware
◼ Can get complex for more complicated page table structures
◼ Or in software
◼ Raise a special exception, with optimized handler
◼ If page is not in memory (page fault)
◼ OS handles fetching the page and updating the page
table
◼ Then restart the faulting instruction
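The decision tree above can be sketched with dictionaries standing in for the TLB and the page table (both structures here are hypothetical simplifications of the real hardware):

```python
tlb = {}                     # virtual page # -> physical page #
page_table = {0: 7, 1: 3}    # PTEs for pages currently in memory

def lookup(vpn):
    if vpn in tlb:
        return tlb[vpn]                 # TLB hit
    if vpn in page_table:
        tlb[vpn] = page_table[vpn]      # TLB miss: load the PTE, then retry
        return tlb[vpn]
    # Page not in memory: the OS fetches it, updates the page table,
    # and the faulting instruction restarts.
    raise RuntimeError("page fault")

print(lookup(1), lookup(1))  # miss then hit, same answer: 3 3
```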
TLB Miss Handler
Page Fault Handler
Modes of Data Transfer
• Programmed IO
• Interrupt Driven IO
• DMA Transfer
Programmed IO
Programmed IO
• Each data item transfer is initiated by an instruction in the program. Usually the transfer is to and from a CPU register and the peripheral.
• Transferring data under program control requires constant monitoring of the peripheral by the CPU.
• Once the data transfer is initiated, the CPU is required to monitor the interface to see when the transfer can be made.
• In the programmed IO method the CPU stays in a program loop until the IO unit indicates that it is ready for data transfer (polling).
• This is a time-consuming process, since it keeps the processor busy needlessly.
• In the programmed IO method the IO device does not have direct access to memory.
• A transfer from an IO device to memory requires the execution of several instructions by the CPU, including an input instruction to transfer the data from the device to the CPU, and a store instruction to transfer the data from the CPU to memory.
• Other instructions are needed to verify that the data are available from the device and to count the number of words transferred.
Programmed IO
• When a byte of data is available, the device places it on the IO bus and enables its data valid line.
• The interface accepts the byte into its data register and enables the data accepted line.
• The interface sets a bit in the status register, the flag bit F. The device can now disable the data valid line, but will not transfer another byte until the data accepted line is disabled by the interface.
• A program is written to check the flag in the status register to determine if a byte has been placed in the data register by the IO device. This is done by reading the status register into a CPU register and checking the value of the flag bit.
• If the flag is equal to 1, the CPU reads the data from the data register. The flag bit is reset to 0 by either the CPU or the interface, depending on how the interface circuits are designed.
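The polling loop described above might look like the following sketch; read_status and read_data are stand-ins for reads of the interface's registers, not a real device API:

```python
FLAG = 0x01   # flag bit F in the status register

def read_byte_polled(read_status, read_data):
    # The CPU loops ("polls") until the interface sets the flag bit,
    # then transfers the byte from the data register.
    while not (read_status() & FLAG):
        pass          # the processor is kept busy doing nothing useful
    return read_data()

# Simulated interface: the flag appears on the third status read.
status_values = [0x00, 0x00, 0x01]
byte = read_byte_polled(lambda: status_values.pop(0), lambda: 0x5A)
print(hex(byte))  # 0x5a
```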
Programmed IO
Programmed IO
Interrupt driven data transfer
Interrupt driven data transfer
• The daisy-chaining method of establishing priority consists of a serial connection of all the devices that can request an interrupt.
• The device with the highest priority is placed in the first position, followed by lower-priority devices, down to the device with the lowest priority, which is placed last in the chain.
• The interrupt request line is common to all the devices and forms a wired-logic connection.
• If any device has its interrupt signal in the low-level state, the interrupt line goes to the low-level state and enables the interrupt input of the CPU.
• When no interrupts are pending, the interrupt line stays in the high-level state and no interrupts are recognized by the CPU. This is equivalent to a negative-logic OR operation.
Interrupt driven data transfer
• The CPU responds to an interrupt request by enabling the interrupt acknowledge line.
• This signal is received by device 1 at its PI (priority in) input.
• The acknowledge signal passes to the next device through the PO (priority out) output only if device 1 is not requesting an interrupt.
• If device 1 has a pending interrupt, it blocks the acknowledge signal from reaching the next device by placing a 0 on its PO output.
• It then proceeds to insert its own interrupt vector address (VAD) onto the data bus for the CPU to use during the interrupt cycle.
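The PI/PO propagation can be sketched as a scan down the chain; the vector addresses used here are made up for illustration:

```python
def daisy_chain_ack(requests, vads):
    # The CPU's acknowledge enters device 1's PI input; each device
    # forwards it on PO only if it is not itself requesting. The first
    # requesting device keeps the acknowledge and drives its VAD.
    pi = 1
    for requesting, vad in zip(requests, vads):
        if pi and requesting:
            return vad      # this device sets PO = 0: chain blocked here
    return None             # no device had a pending interrupt

# Three devices in priority order; devices 2 and 3 both request.
print(daisy_chain_ack([0, 1, 1], [0x10, 0x20, 0x30]))  # 32 (0x20): device 2 wins
```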
Interrupt driven data transfer
DMA mode of data transfer
• The transfer of data between a fast storage device such as a magnetic disk and memory is limited by the speed of the CPU.
• Removing the CPU from the path and letting the peripheral device manage the memory buses directly would improve the speed of the transfer.
• This transfer technique is called direct memory access (DMA). During a DMA transfer the CPU is idle and has no control of the memory buses.
• A DMA controller takes control of the buses to manage the transfer directly between the IO device and the memory.
DMA mode of data transfer
DMA mode of data transfer
• The bus request (BR) input is used by the DMAC (DMA controller) to request that the CPU relinquish control of the buses.
• When this input is active, the CPU terminates the execution of the current instruction and places the address bus, data bus and R/W lines into the high-impedance state.
• The high-impedance state behaves like an open circuit, which means that the output is disconnected and does not have logic significance.
• The CPU activates the bus grant (BG) output to inform the external DMAC that the buses are in the high-impedance state.
• The DMAC that originated the bus request can now take control of the buses to conduct memory transfers without processor intervention.
DMA mode of data transfer
• DMA transfer can be:
DMA mode of data transfer
DMA mode of data transfer
• When the peripheral device activates the DMA request line, the DMAC activates the BR line, informing the CPU to relinquish the buses.
• The CPU responds with its BG line, informing the DMAC that its buses are disabled.
• The DMAC then puts the current value of its address register on the address bus, initiates the RD or WR signal, and sends a DMA ACK to the peripheral device.
• The RD and WR lines of the DMAC are bidirectional. The direction of transfer depends on the status of the BG signal.
• If BG = 0, RD and WR are input lines, allowing the CPU to communicate with the internal DMAC registers.
DMA mode of data transfer
• If BG = 1, RD and WR are output lines from the DMAC to the RAM to specify the read or write operation for the data.
• When the peripheral device receives the DMA ACK, it puts a word on the data bus (write) or receives a word from the data bus (read).
• Thus the DMAC controls the R/W operation and supplies the addresses for the memory.
• The peripheral unit can then communicate with the memory through the data bus for direct transfer between the two units while the CPU is momentarily disabled.
DMA mode of data transfer
• For each word that is transferred, the DMAC increments its address register and decrements its word count register.
• If the word count register reaches zero, the DMAC stops any further transfers and removes its bus request.
• It also informs the CPU of the termination by means of an interrupt.
• When the CPU responds to the interrupt, it reads the contents of the word count register.
• A zero value of the register indicates that all words were transferred successfully.
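The register bookkeeping described above can be sketched as follows; the class and its fields are illustrative, not a real controller's programming model:

```python
class DMAC:
    """Toy DMA controller: per-word address/count bookkeeping."""
    def __init__(self, start_addr, word_count):
        self.addr = start_addr      # address register
        self.count = word_count     # word count register
        self.bus_request = True     # BR asserted while transferring

    def transfer_word(self, memory, word):
        memory[self.addr] = word
        self.addr += 1              # increment address register
        self.count -= 1             # decrement word count register
        if self.count == 0:
            # Done: remove the bus request and interrupt the CPU,
            # which then reads the (now zero) word count register.
            self.bus_request = False

memory = {}
dma = DMAC(start_addr=100, word_count=3)
for w in (7, 8, 9):
    dma.transfer_word(memory, w)
print(memory, dma.count, dma.bus_request)  # {100: 7, 101: 8, 102: 9} 0 False
```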