dce

2013

COMPUTER ARCHITECTURE
CSE Fall 2013
Faculty of Computer Science and Engineering
Department of Computer Engineering, BK TP.HCM

Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong

Chapter 5
Memory

Computer Architecture – Chapter 5 ©Fall 2013, CS 2

Presentation Outline
 Random Access Memory and its Structure

 Memory Hierarchy and the need for Cache Memory

 The Basics of Caches

 Cache Performance and Memory Stall Cycles

 Improving Cache Performance

 Multilevel Caches

Random Access Memory
 Large arrays of storage cells
 Volatile memory
 Hold the stored data as long as it is powered on

 Random Access
 Access time is practically the same to any data on a RAM chip

 Output Enable (OE) control signal
 Specifies read operation
 Write Enable (WE) control signal
 Specifies write operation
 2^n × m RAM chip: n-bit address and m-bit data
[Figure: RAM block with an n-bit Address input, m-bit Data lines, and OE and WE control inputs]

Memory Technology
 Static RAM (SRAM) for Cache
 Requires 6 transistors per bit
 Requires low power to retain bit
 Dynamic RAM (DRAM) for Main Memory
 One transistor + capacitor per bit
 Must be re-written after being read
 Must also be periodically refreshed
 Each row can be refreshed simultaneously
 Address lines are multiplexed
 Upper half of address: latched by the Row Address Strobe (RAS)
 Lower half of address: latched by the Column Address Strobe (CAS)

Static RAM Storage Cell
 Static RAM (SRAM): fast but expensive RAM
 6-Transistor cell with no static current
 Typically used for caches
 Provides fast access time
 Cell Implementation:
 Cross-coupled inverters store bit
 Two pass transistors
 Row decoder selects the word line
 Pass transistors enable the cell to be read and written
[Figure: typical SRAM cell, showing the word line, Vcc, and the two bit lines]

Dynamic RAM Storage Cell
 Dynamic RAM (DRAM): slow, cheap, and dense memory
 Typical choice for main memory
 Cell Implementation:
 1-Transistor cell (pass transistor)
 Trench capacitor (stores bit)
 Bit is stored as a charge on capacitor
 Must be refreshed periodically
 Because of leakage of charge from tiny capacitor
 Refreshing for all memory rows
 Reading each row and writing it back to restore the charge
[Figure: typical DRAM cell, showing the word line, pass transistor, capacitor, and bit line]

Dynamic RAM Storage Cell
 The need for a refresh cycle

Typical DRAM Packaging
 24-pin dual in-line package for 16 Mbit = 2^22 × 4 memory
 22-bit address is divided into
 11-bit row address
 11-bit column address
 Interleaved on same address lines

Legend
Ai   Address bit i
CAS  Column address strobe
Dj   Data bit j
NC   No connection
OE   Output enable
RAS  Row address strobe
WE   Write enable

Pins 24 down to 13
Pin     24   23  22  21   20  19  18  17  16  15  14  13
Signal  Vss  D4  D3  CAS  OE  A9  A8  A7  A6  A5  A4  Vss

Pins 1 up to 12
Pin     1    2   3   4   5    6   7    8   9   10  11  12
Signal  Vcc  D1  D2  WE  RAS  NC  A10  A0  A1  A2  A3  Vcc
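
The row/column multiplexing described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dram_address` is hypothetical, and it assumes the upper 11 bits of the 22-bit address form the row and the lower 11 bits form the column, as stated on the Memory Technology slide.

```python
ROW_BITS = 11     # 11-bit row address (from the slide)
COLUMN_BITS = 11  # 11-bit column address (from the slide)

def split_dram_address(address: int) -> tuple[int, int]:
    """Split a 22-bit address into (row, column).

    Assumption: the upper half is the row (sent with RAS) and the
    lower half is the column (sent with CAS) on the same 11 pins.
    """
    if not 0 <= address < (1 << (ROW_BITS + COLUMN_BITS)):
        raise ValueError("address must fit in 22 bits")
    row = address >> COLUMN_BITS
    column = address & ((1 << COLUMN_BITS) - 1)
    return row, column

row, column = split_dram_address(0x000801)
print(row, column)
```

Because both halves travel over the same A0..A10 pins, the package needs 11 address pins rather than 22.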

Typical Memory Structure

 Row decoder
 Select row to read/write
 Column decoder
 Select column to read/write
 Cell Matrix
 2D array of tiny memory cells
 Sense/Write amplifiers
 Sense & amplify data on read
 Drive bit line with data in on write
 Same data lines are used for data in/out
[Figure: an r-bit row address feeds the Row Decoder, which selects one row of the 2^r × 2^c × m bit Cell Matrix; the sense/write amplifiers and a Row Latch hold 2^c × m bits; a c-bit column address feeds the Column Decoder, which selects the m-bit Data]

DRAM Operation
 Row Access (RAS)
 Latch and decode row address to enable addressed row
 Small change in voltage detected by sense amplifiers
 Latch whole row of bits
 Sense amplifiers drive bit lines to recharge storage cells
 Column Access (CAS) read and write operation
 Latch and decode column address to select m bits
 m = 4, 8, 16, or 32 bits depending on DRAM package
 On read, send latched bits out to chip pins
 On write, charge storage cells to required value
 Can perform multiple column accesses to same row (burst mode)

Burst Mode Operation
 Block Transfer
 Row address is latched and decoded
 A read operation causes all cells in a selected row to be read
 Selected row is latched internally inside the SDRAM chip
 Column address is latched and decoded
 Selected column data is placed in the data output register
 Column address is incremented automatically
 Multiple data items are read depending on the block length

 Fast transfer of blocks between memory and cache

 Fast transfer of pages between memory and disk

Trends in DRAM
Year      Chip size  Type   Row     Column  Cycle time
produced                    access  access  (new request)
1980      64 Kbit    DRAM   170 ns  75 ns   250 ns
1983      256 Kbit   DRAM   150 ns  50 ns   220 ns
1986      1 Mbit     DRAM   120 ns  25 ns   190 ns
1989      4 Mbit     DRAM   100 ns  20 ns   165 ns
1992      16 Mbit    DRAM   80 ns   15 ns   120 ns
1996      64 Mbit    SDRAM  70 ns   12 ns   110 ns
1998      128 Mbit   SDRAM  70 ns   10 ns   100 ns
2000      256 Mbit   DDR1   65 ns   7 ns    90 ns
2002      512 Mbit   DDR1   60 ns   5 ns    80 ns
2004      1 Gbit     DDR2   55 ns   5 ns    70 ns
2006      2 Gbit     DDR2   50 ns   3 ns    60 ns
2010      4 Gbit     DDR3   35 ns   1 ns    37 ns
2012      8 Gbit     DDR3   30 ns   0.5 ns  31 ns

SDRAM and DDR SDRAM

 SDRAM is Synchronous Dynamic RAM

 Added clock to DRAM interface
 SDRAM is synchronous with the system clock
 Older DRAM technologies were asynchronous
 As system bus clock improved, SDRAM delivered higher performance than asynchronous DRAM
 DDR is Double Data Rate SDRAM
 Like SDRAM, DDR is synchronous with the system clock, but the difference is that DDR transfers data on both the rising and falling edges of the clock signal

Transfer Rates & Peak Bandwidth
Standard   Memory     Millions of transfers  Module    Peak
name       bus clock  per second             name      bandwidth
DDR-200    100 MHz    200 MT/s               PC-1600   1600 MB/s
DDR-333    167 MHz    333 MT/s               PC-2700   2667 MB/s
DDR-400    200 MHz    400 MT/s               PC-3200   3200 MB/s
DDR2-667   333 MHz    667 MT/s               PC-5300   5333 MB/s
DDR2-800   400 MHz    800 MT/s               PC-6400   6400 MB/s
DDR2-1066  533 MHz    1066 MT/s              PC-8500   8533 MB/s
DDR3-1066  533 MHz    1066 MT/s              PC-8500   8533 MB/s
DDR3-1333  667 MHz    1333 MT/s              PC-10600  10667 MB/s
DDR3-1600  800 MHz    1600 MT/s              PC-12800  12800 MB/s
DDR4-3200  1600 MHz   3200 MT/s              PC-25600  25600 MB/s

 1 Transfer = 64 bits = 8 bytes of data
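
The relation between the columns of the table can be checked with a short Python sketch. The function names are illustrative only; the sketch assumes two transfers per bus clock (double data rate) and 8 bytes per transfer, as stated above.

```python
def ddr_transfer_rate_mt_per_s(bus_clock_mhz: float) -> float:
    """DDR transfers data on both clock edges: 2 transfers per clock."""
    return 2 * bus_clock_mhz

def peak_bandwidth_mb_per_s(transfer_rate_mt_per_s: float,
                            bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth = transfers per second x bytes per transfer."""
    return transfer_rate_mt_per_s * bytes_per_transfer

# DDR-400: 200 MHz bus clock
rate = ddr_transfer_rate_mt_per_s(200)
print(rate, peak_bandwidth_mb_per_s(rate))
```

Rows such as DDR-333 use rounded clock values in the table, so recomputing from the rounded figures gives a slightly different number than the listed bandwidth.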

DRAM Refresh Cycles
 Refresh cycle is about tens of milliseconds
 Refreshing is done for the entire memory
 Each row is read and written back to restore the charge
 Some of the memory bandwidth is lost to refresh cycles

[Figure: cell voltage versus time, marking the voltage for 1, the threshold voltage, and the voltage for 0; a 1 is written and then refreshed once per refresh cycle, and a stored 0 is shown for comparison]

Expanding the Data Bus Width
 Memory chips typically have a narrow data bus
 We can expand the data bus width by a factor of p
 Use p RAM chips and feed the same address to all chips
 Use the same Output Enable and Write Enable control signals

[Figure: p RAM chips side by side, each with Address, OE, and WE inputs and an m-bit Data port]
Data width = m × p bits

Processor-Memory Performance Gap

[Figure: CPU and DRAM performance over time, with the performance gap between them widening]
 CPU performance: 55% per year, slowing down after 2004
 DRAM: 7% per year

 1980 – No cache in microprocessor

 1995 – Two-level cache on microprocessor
The Need for Cache Memory
 Widening speed gap between CPU and main memory
 Processor operation takes less than 1 ns
 Main memory requires more than 50 ns to access

 Each instruction involves at least one memory access

 One memory access to fetch the instruction
 A second memory access for load and store instructions

 Memory bandwidth limits the instruction execution rate

 Cache memory can help bridge the CPU-memory gap
 Cache memory is small in size but fast

Typical Memory Hierarchy
 Registers are at the top of the hierarchy
 Typical size < 1 KB
 Access time < 0.5 ns
 Level 1 Cache (8 – 64 KB)
 Access time: 1 ns
 L2 Cache (512KB – 8MB)
 Access time: 3 – 10 ns
 Main Memory (4 – 16 GB)
 Access time: 50 – 100 ns
 Disk Storage (> 200 GB)
 Access time: 5 – 10 ms
[Figure: hierarchy from top to bottom: Registers, L1 Cache, and L2 Cache inside the microprocessor; Main Memory on the memory bus; Magnetic or Flash Disk on the I/O bus. Levels get faster toward the top and bigger toward the bottom]

Principle of Locality of Reference
 Programs access a small portion of their address space
 At any time, only a small set of instructions & data is needed

 Temporal Locality (in time)

 If an item is accessed, probably it will be accessed again soon
 Same loop instructions are fetched each iteration
 Same procedure may be called and executed many times

 Spatial Locality (in space)

 Tendency to access contiguous instructions/data in memory
 Sequential execution of Instructions
 Traversing arrays element by element

What is a Cache Memory ?
 Small and fast (SRAM) memory technology
 Stores the subset of instructions & data currently being accessed

 Used to reduce average access time to memory

 Caches exploit temporal locality by …
 Keeping recently accessed data closer to the processor

 Caches exploit spatial locality by …

 Moving blocks consisting of multiple contiguous words

 Goal is to achieve
 Fast speed of cache memory access
 Balance the cost of the memory system

Cache Memories in the Datapath
[Figure: pipelined datapath with an I-Cache addressed by the PC and a D-Cache addressed by the ALU result; on a miss, each cache sends a block address to the next level and receives an instruction block or data block]
I-Cache miss or D-Cache miss causes pipeline to stall
Interface to L2 Cache or Main Memory

Almost Everything is a Cache !
 In computer architecture, almost everything is a cache!
 Registers: a cache on variables – software managed
 First-level cache: a cache on second-level cache
 Second-level cache: a cache on memory
 Memory: a cache on hard disk
 Stores recent programs and their data
 Hard disk can be viewed as an extension to main memory

 Branch target and prediction buffer

 Cache on branch target and prediction information

Four Basic Questions on Caches
 Q1: Where can a block be placed in a cache?
 Block placement
 Direct Mapped, Set Associative, Fully Associative
 Q2: How is a block found in a cache?
 Block identification
 Block address, tag, index
 Q3: Which block should be replaced on a miss?
 Block replacement
 FIFO, Random, LRU
 Q4: What happens on a write?
 Write strategy
 Write Back or Write Through (with Write Buffer)

Block Placement: Direct Mapped
 Block: unit of data transfer between cache and memory
 Direct Mapped Cache:
 A block can be placed in exactly one location in the cache

[Figure: an 8-block cache (indices 000 to 111) and a 32-block main memory (block addresses 00000 to 11111); each memory block maps to the cache block whose index equals the low-order 3 bits of its address]
In this example: Cache index = least significant 3 bits of Memory address
Direct-Mapped Cache
 A memory address is divided into
 Block address: identifies block in memory
 Block offset: to access bytes within a block
 A block address is further divided into
 Index: used for direct cache access
 Tag: most-significant bits of block address
Index = Block Address mod Cache Blocks
 Tag must be stored also inside cache
 For block identification
 A valid bit is also required to indicate
 Whether a cache block is valid or not
[Figure: the address is split into Tag, Index, and offset (Tag and Index together form the Block Address); the Index selects a cache entry holding V, Tag, and Block Data; the stored tag is compared (=) with the address tag to produce Hit, and the block supplies Data]

Direct Mapped Cache – cont’d
 Cache hit: block is stored inside cache
 Index is used to access cache block
 Cache data size = 2^(n+b) bytes (n index bits, b block-offset bits)

Mapping an Address to a Cache Block
 Example
 Consider a direct-mapped cache with 256 blocks
 Block size = 16 bytes
 Compute tag, index, and byte offset of address: 0x01FFF8AC
 Solution
 32-bit address is divided into: Tag (20 bits), Index (8 bits), offset (4 bits)
 4-bit byte offset field, because block size = 2^4 = 16 bytes
 8-bit cache index, because there are 2^8 = 256 blocks in cache
 20-bit tag field
 Byte offset = 0xC = 12 (least significant 4 bits of address)
 Cache index = 0x8A = 138 (next 8 bits of address)
 Tag = 0x01FFF (upper 20 bits of address)
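
The same field extraction can be written as a small Python sketch. The function name `split_address` is illustrative, and the sketch assumes the number of blocks and the block size are both powers of two.

```python
def split_address(address: int, num_blocks: int, block_size: int):
    """Return (tag, index, offset) for a direct-mapped cache.

    Assumes num_blocks and block_size are powers of two.
    """
    offset_bits = block_size.bit_length() - 1  # log2(block_size)
    index_bits = num_blocks.bit_length() - 1   # log2(num_blocks)
    offset = address & (block_size - 1)
    index = (address >> offset_bits) & (num_blocks - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = split_address(0x01FFF8AC, num_blocks=256, block_size=16)
print(hex(tag), hex(index), hex(offset))
```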

Example on Cache Placement & Misses
 Consider a small direct-mapped cache with 32 blocks
 Cache is initially empty, Block size = 16 bytes
 The following memory addresses (in decimal) are referenced:
1000, 1004, 1008, 2548, 2552, 2556.
 Map addresses to cache blocks and indicate whether hit or miss
 Solution: the address is divided into Tag (23 bits), Index (5 bits), offset (4 bits)
 1000 = 0x3E8 cache index = 0x1E Miss (first access)
 1004 = 0x3EC cache index = 0x1E Hit
 1008 = 0x3F0 cache index = 0x1F Miss (first access)
 2548 = 0x9F4 cache index = 0x1F Miss (different tag)
 2552 = 0x9F8 cache index = 0x1F Hit
 2556 = 0x9FC cache index = 0x1F Hit

Fully Associative Cache
 A block can be placed anywhere in cache ⇒ no indexing
 If m blocks exist then
 m comparators are needed to match tag
 Cache data size = m × 2^b bytes
[Figure: m-way associative cache; the address is split into Tag and offset; every entry holds V, Tag, and Block Data; m comparators (=) check the tags in parallel, a mux selects the Data, and any match signals Hit]

Set-Associative Cache
 A set is a group of blocks that can be indexed
 A block is first mapped onto a set
 Set index = Block address mod Number of sets in cache

 If there are m blocks in a set (m-way set associative) then

 m tags are checked in parallel using m comparators

 If 2^n sets exist then set index consists of n bits

 Cache data size = m × 2^(n+b) bytes (with 2^b bytes per block)
 Without counting tags and valid bits

 A direct-mapped cache has one block per set (m = 1)

 A fully-associative cache has one set (2^n = 1 or n = 0)
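
A short Python sketch of the two formulas on this slide follows. The function names are illustrative; m, n, and b have the meanings given above.

```python
def set_index(block_address: int, num_sets: int) -> int:
    """Set index = Block address mod Number of sets in cache."""
    return block_address % num_sets

def cache_data_size_bytes(m: int, n: int, b: int) -> int:
    """Data size of an m-way cache with 2^n sets and 2^b bytes per block.

    Tags and valid bits are not counted.
    """
    return m * 2 ** (n + b)

# Direct mapped (m = 1), 256 sets, 16-byte blocks
print(cache_data_size_bytes(m=1, n=8, b=4))
# Fully associative (n = 0): every block address maps to set 0
print(set_index(12345, num_sets=1))
```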

Set-Associative Cache Diagram

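[Figure: m-way set-associative cache; the address is split into Tag, Index, and offset; the Index selects one set, whose m entries each hold V, Tag, and Block Data; m comparators (=) check the tags in parallel, a mux selects the Data, and any match signals Hit]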

Write Policy
 Write Through:
 Writes update cache and lower-level memory
 Cache control bit: only a Valid bit is needed
 Memory always has latest data, which simplifies data coherency
 Can always discard cached data when a block is replaced
 Write Back:
 Writes update cache only
 Cache control bits: Valid and Modified bits are required
 Modified cached data is written back to memory when replaced
 Multiple writes to a cache block require only one write to memory
 Uses less memory bandwidth than write-through and less power
 However, more complex to implement than write through

Write Miss Policy
 What happens on a write miss?
 Write Allocate:
 Allocate new block in cache
 Write miss acts like a read miss, block is fetched and updated
 No Write Allocate:
 Send data to lower-level memory
 Cache is not modified
 Typically, write back caches use write allocate
 Hoping subsequent writes will be captured in the cache
 Write-through caches often use no-write allocate
 Reasoning: writes must still go to lower level memory

Write Buffer
 Decouples the CPU write from the memory bus writing
 Permits writes to occur without stall cycles until buffer is full

 Write-through: all stores are sent to lower level memory

 Write buffer eliminates processor stalls on consecutive writes

 Write-back: modified blocks are written when replaced

 Write buffer is used for evicted blocks that must be written back

 The address and modified data are written in the buffer

 The write is finished from the CPU perspective
 CPU continues while the write buffer prepares to write memory

 If buffer is full, CPU stalls until buffer has an empty entry
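
The behavior described on this slide can be modeled as a bounded queue. The Python sketch below is a simplified illustration, not a hardware description: the class and method names are hypothetical, and memory is modeled as a dictionary.

```python
from collections import deque

class WriteBuffer:
    """Bounded FIFO of (address, data) pairs waiting to reach memory."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()

    def cpu_write(self, address: int, data: int) -> bool:
        """Return True if the write is accepted, False if the CPU must stall."""
        if len(self.entries) >= self.capacity:
            return False  # buffer full: stall until an entry frees up
        self.entries.append((address, data))
        return True  # the write is finished from the CPU perspective

    def drain_one(self, memory: dict) -> None:
        """Retire the oldest buffered write to memory, if there is one."""
        if self.entries:
            address, data = self.entries.popleft()
            memory[address] = data

memory = {}
buffer = WriteBuffer(capacity=2)
print(buffer.cpu_write(0x10, 1), buffer.cpu_write(0x20, 2),
      buffer.cpu_write(0x30, 3))
buffer.drain_one(memory)
print(memory, buffer.cpu_write(0x30, 3))
```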

What Happens on a Cache Miss?
 Cache sends a miss signal to stall the processor
 Decide which cache block to allocate/replace
 One choice only when the cache is direct mapped
 Multiple choices for set-associative or fully-associative cache
 Transfer the block from lower level memory to this cache
 Set the valid bit and the tag field from the upper address bits
 If block to be replaced is modified then write it back
 Modified block is moved into a Write Buffer
 Otherwise, block to be replaced can be simply discarded
 Restart the instruction that caused the cache miss
 Miss Penalty: clock cycles to process a cache miss

Replacement Policy
 Which block to be replaced on a cache miss?
 No selection alternatives for direct-mapped caches
 m blocks per set to choose from for associative caches
 Random replacement
 Candidate blocks are randomly selected
 One counter for all sets (0 to m – 1): incremented on every cycle
 On a cache miss replace block specified by counter
 First In First Out (FIFO) replacement
 Replace oldest block in set
 One counter per set (0 to m – 1): specifies oldest block to replace
 Counter is incremented on a cache miss
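
FIFO replacement with one counter per set can be sketched as follows. This is a minimal Python model under stated assumptions: only tags are tracked, `None` marks an empty way, and the class name `FIFOSet` is illustrative.

```python
class FIFOSet:
    """One cache set with m ways and FIFO replacement."""

    def __init__(self, ways: int):
        self.tags = [None] * ways  # None marks an empty way
        self.oldest = 0            # per-set counter in the range 0..m-1

    def access(self, tag: int) -> bool:
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.tags:
            return True  # a hit does not change the FIFO order
        self.tags[self.oldest] = tag                       # replace oldest
        self.oldest = (self.oldest + 1) % len(self.tags)   # bump on a miss
        return False

cache_set = FIFOSet(ways=2)
print([cache_set.access(tag) for tag in [1, 2, 1, 3, 1]])
```

In this trace the final access to tag 1 misses even though tag 1 was used recently, because FIFO evicts by load order rather than by use.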

Replacement Policy – cont’d
 Least Recently Used (LRU)
 Replace block that has been unused for the longest time
 Order blocks within a set from least to most recently used
 Update ordering of blocks on each cache hit
 With m blocks per set, there are m! possible permutations

 Pure LRU is too costly to implement when m > 2

 m = 2, there are 2 permutations only (a single bit is needed)
 m = 4, there are 4! = 24 possible permutations
 LRU approximation is used in practice

 For large m > 4, Random replacement can be as effective as LRU
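
Pure LRU for a single set can be modeled in software by keeping the tags ordered from least to most recently used. The Python sketch below uses `collections.OrderedDict` for that ordering; the class name `LRUSet` is illustrative, and real hardware would typically use an approximation, as noted above.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with m ways and pure LRU replacement."""

    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()  # tags ordered least to most recent

    def access(self, tag: int) -> bool:
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # update ordering on each hit
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict least recently used
        self.blocks[tag] = None
        return False

cache_set = LRUSet(ways=2)
print([cache_set.access(tag) for tag in [1, 2, 1, 3, 1]])
```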

Comparing Random, FIFO, and LRU
 Data cache misses per 1000 instructions
 10 SPEC2000 benchmarks on Alpha processor
 Block size of 64 bytes
 LRU and FIFO outperform Random for a small cache
 Little difference between LRU and Random for a large cache
 LRU is expensive for large associativity (# blocks per set)
 Random is the simplest to implement in hardware

         2-way                 4-way                 8-way
Size     LRU    Rand   FIFO    LRU    Rand   FIFO    LRU    Rand   FIFO
16 KB    114.1  117.3  115.5   111.7  115.1  113.3   109.0  111.8  110.4
64 KB    103.4  104.3  103.9   102.4  102.3  103.1   99.7   100.5  100.3
256 KB   92.2   92.1   92.5    92.1   92.1   92.5    92.1   92.1   92.5

Hit Rate and Miss Rate
 Hit Rate = Hits / (Hits + Misses)
 Miss Rate = Misses / (Hits + Misses)
 I-Cache Miss Rate = Miss rate in the Instruction Cache
 D-Cache Miss Rate = Miss rate in the Data Cache
 Example:
 Out of 1000 instructions fetched, 150 missed in the I-Cache
 25% are load-store instructions, 50 missed in the D-Cache
 What are the I-cache and D-cache miss rates?

 I-Cache Miss Rate = 150 / 1000 = 15%

 D-Cache Miss Rate = 50 / (25% × 1000) = 50 / 250 = 20%

Memory Stall Cycles
 The processor stalls on a Cache miss
 When fetching instructions from the Instruction Cache (I-cache)
 When loading or storing data into the Data Cache (D-cache)

Memory stall cycles = Combined Misses × Miss Penalty

 Miss Penalty: clock cycles to process a cache miss
Combined Misses = I-Cache Misses + D-Cache Misses
I-Cache Misses = I-Count × I-Cache Miss Rate
D-Cache Misses = LS-Count × D-Cache Miss Rate
LS-Count (Load & Store) = I-Count × LS Frequency
 Cache misses are often reported per thousand instructions

Memory Stall Cycles Per Instruction
 Memory Stall Cycles Per Instruction =
Combined Misses Per Instruction × Miss Penalty
 Miss Penalty is assumed equal for I-cache & D-cache
 Miss Penalty is assumed equal for Load and Store
 Combined Misses Per Instruction =
I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate
 Therefore, Memory Stall Cycles Per Instruction =
I-Cache Miss Rate × Miss Penalty +
LS Frequency × D-Cache Miss Rate × Miss Penalty

Example on Memory Stall Cycles
 Consider a program with the given characteristics
 Instruction count (I-Count) = 10^6 instructions
 30% of instructions are loads and stores
 D-cache miss rate is 5% and I-cache miss rate is 1%
 Miss penalty is 100 clock cycles for instruction and data caches
 Compute combined misses per instruction and memory stall cycles
 Combined misses per instruction in I-Cache and D-Cache
 1% + 30% × 5% = 0.025 combined misses per instruction
 Equal to 25 misses per 1000 instructions
 Memory stall cycles
 0.025 × 100 (miss penalty) = 2.5 stall cycles per instruction
 Total memory stall cycles = 10^6 × 2.5 = 2,500,000
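
The arithmetic of this example can be checked with a short Python sketch. The function names are illustrative, and the same miss penalty is assumed for the I-cache and D-cache, as in the formulas above.

```python
def combined_misses_per_instruction(i_miss_rate, ls_frequency, d_miss_rate):
    """I-Cache Miss Rate + LS Frequency x D-Cache Miss Rate."""
    return i_miss_rate + ls_frequency * d_miss_rate

def stall_cycles_per_instruction(i_miss_rate, ls_frequency, d_miss_rate,
                                 miss_penalty):
    """Combined misses per instruction x miss penalty (in clock cycles)."""
    return combined_misses_per_instruction(
        i_miss_rate, ls_frequency, d_miss_rate) * miss_penalty

instruction_count = 10 ** 6
per_instruction = stall_cycles_per_instruction(0.01, 0.30, 0.05, 100)
print(per_instruction, instruction_count * per_instruction)
```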

CPU Time with Memory Stall Cycles

CPU Time = I-Count × CPI_MemoryStalls × Clock Cycle

CPI_MemoryStalls = CPI_PerfectCache + Mem Stalls per Instruction

 CPI_PerfectCache = CPI for ideal cache (no cache misses)

 CPI_MemoryStalls = CPI in the presence of memory stalls

 Memory stall cycles increase the CPI

Example on CPI with Memory Stalls
 A processor has CPI of 1.5 without any memory stalls
 Cache miss rate is 2% for instruction and 5% for data
 20% of instructions are loads and stores
 Cache miss penalty is 100 clock cycles for I-cache and D-cache
 What is the impact on the CPI?
 Answer:
Mem Stalls per Instruction = 0.02×100 (instruction) + 0.2×0.05×100 (data) = 3
CPI_MemoryStalls = 1.5 + 3 = 4.5 cycles per instruction
CPI_MemoryStalls / CPI_PerfectCache = 4.5 / 1.5 = 3
Processor is 3 times slower due to memory stall cycles
CPI_NoCache = 1.5 + (1 + 0.2) × 100 = 121.5 (a lot worse)
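
The numbers in this answer can be reproduced with a small Python sketch. The function names are illustrative; the no-cache case assumes every instruction fetch and every load or store pays the full miss penalty, which is what the last line of the answer computes.

```python
def cpi_with_memory_stalls(cpi_perfect, i_miss_rate, ls_frequency,
                           d_miss_rate, miss_penalty):
    """CPI_PerfectCache + memory stall cycles per instruction."""
    stalls = (i_miss_rate * miss_penalty
              + ls_frequency * d_miss_rate * miss_penalty)
    return cpi_perfect + stalls

def cpi_without_cache(cpi_perfect, ls_frequency, miss_penalty):
    """Every fetch and every load/store goes to memory (miss rate = 1)."""
    return cpi_perfect + (1 + ls_frequency) * miss_penalty

cpi = cpi_with_memory_stalls(1.5, 0.02, 0.2, 0.05, 100)
print(cpi, cpi / 1.5, cpi_without_cache(1.5, 0.2, 100))
```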
Average Memory Access Time
 Average Memory Access Time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty
 Time to access a cache for both hits and misses
 Example: Find the AMAT for a cache with
 Cache access time (Hit time) of 1 cycle = 2 ns
 Miss penalty of 20 clock cycles
 Miss rate of 0.05 per access

 Solution:
AMAT = 1 + 0.05 × 20 = 2 cycles = 4 ns
Without the cache, AMAT will be equal to Miss penalty = 20 cycles
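
The AMAT formula is easy to evaluate in code. The Python sketch below reproduces the example; the function name `amat_cycles` is illustrative, and the 2 ns cycle time comes from the example itself.

```python
def amat_cycles(hit_time, miss_rate, miss_penalty):
    """AMAT = Hit time + Miss rate x Miss penalty, all in clock cycles."""
    return hit_time + miss_rate * miss_penalty

CYCLE_TIME_NS = 2  # from the example: 1 cycle = 2 ns

cycles = amat_cycles(hit_time=1, miss_rate=0.05, miss_penalty=20)
print(cycles, cycles * CYCLE_TIME_NS)
```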

Improving Cache Performance
 Average Memory Access Time (AMAT)
AMAT = Hit time + Miss rate × Miss penalty

 Used as a framework for optimizations

 Reduce the Hit time
 Small and simple caches

 Reduce the Miss Rate

 Larger cache size, higher associativity, and larger block size

 Reduce the Miss Penalty

 Multilevel caches

Small and Simple Caches
 Hit time is critical: affects the processor clock cycle
 Fast clock rate demands small and simple L1 cache designs
 Small cache reduces the indexing time and hit time
 Indexing a cache represents a time consuming portion
 Tag comparison also adds to this hit time
 Direct-mapped overlaps tag check with data transfer
 Associative cache uses additional mux and increases hit time
 Size of L1 caches has not increased much
 L1 caches are the same size on Alpha 21264 and 21364
 Same also on UltraSparc II and III, AMD K6 and Athlon
 Reduced from 16 KB in Pentium III to 8 KB in Pentium 4

Classifying Misses – Three Cs
 Conditions under which misses occur
 Compulsory: program starts with no block in cache
 Also called cold start misses
 Misses that would occur even if a cache has infinite size

 Capacity: misses happen because cache size is finite

 Blocks are replaced and then later retrieved
 Misses that would occur in a fully associative cache of a finite size

 Conflict: misses happen because of limited associativity

 Limited number of blocks per set
 Non-optimal replacement algorithm

Classifying Misses – cont’d

 Compulsory misses are independent of cache size
 Very small for long-running programs
 Capacity misses decrease as capacity increases
 Conflict misses decrease as associativity increases
 Data were collected using LRU replacement
[Figure: miss rate (0 to 14%) versus cache size (1 to 128 KB) for 1-way, 2-way, 4-way, and 8-way caches, with the capacity and compulsory components marked]

Larger Size and Higher Associativity
 Increasing cache size reduces capacity misses

 It also reduces conflict misses

 Larger cache size spreads out references to more blocks

 Drawbacks: longer hit time and higher cost

 Larger caches are especially popular as 2nd level caches

 Higher associativity also improves miss rates

 In practice, an eight-way set-associative cache is about as effective as a fully associative cache

Larger Block Size
 Simplest way to reduce miss rate is to increase block size
 However, it increases conflict misses if cache is small

 64-byte blocks are common in L1 caches
 128-byte blocks are common in L2 caches
[Figure: miss rate (0% to 25%) versus block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; larger blocks reduce compulsory misses, but conflict misses increase for the small caches]

Multilevel Caches
 Top level cache should be kept small to
 Keep pace with processor speed
 Adding another cache level
 Can reduce the memory gap
 Can reduce memory bus loading
 Local miss rate
 Number of misses in a cache / Memory accesses to this cache
 Miss Rate_L1 for L1 cache, and Miss Rate_L2 for L2 cache
 Global miss rate
 Number of misses in a cache / Memory accesses generated by CPU
 Miss Rate_L1 for L1 cache, and Miss Rate_L1 × Miss Rate_L2 for L2 cache
[Figure: I-Cache and D-Cache above a Unified L2 Cache, which sits above Main Memory]
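
The difference between local and global miss rates is easiest to see with numbers. The Python sketch below uses made-up counts chosen only for illustration, and it assumes that every L1 miss becomes exactly one access to L2.

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Misses divided by the accesses counted for this rate."""
    return misses / accesses

# Hypothetical counts, chosen only for illustration
cpu_accesses = 1000
l1_misses = 40  # assumed: each L1 miss is one access to L2
l2_misses = 20

l1_local = miss_rate(l1_misses, cpu_accesses)   # L1 local = L1 global
l2_local = miss_rate(l2_misses, l1_misses)      # per access reaching L2
l2_global = miss_rate(l2_misses, cpu_accesses)  # per access made by the CPU

print(l1_local, l2_local, l2_global)
```

With these counts the global L2 miss rate equals the product of the two local rates, matching the last line of the slide.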

Power 7 On-Chip Caches [IBM 2010]

 32KB I-Cache/core and 32KB D-Cache/core: 3-cycle latency
 256KB Unified L2 Cache/core: 8-cycle latency
 32MB Unified Shared L3 Cache, Embedded DRAM: 25-cycle latency to local slice

Multilevel Cache Policies
 Multilevel Inclusion
 L1 cache data is always present in L2 cache

 A miss in L1, but a hit in L2 copies block from L2 to L1

 A miss in L1 and L2 brings a block into L1 and L2

 A write in L1 causes data to be written in L1 and L2

 Typically, write-through policy is used from L1 to L2

 Typically, write-back policy is used from L2 to main memory

 To reduce traffic on the memory bus

 A replacement or invalidation in L2 must be propagated to L1

Multilevel Cache Policies – cont’d
 Multilevel exclusion
 L1 data is never found in L2 cache – Prevents wasting space
 Cache miss in L1, but a hit in L2 results in a swap of blocks
 Cache miss in both L1 and L2 brings the block into L1 only
 Block replaced in L1 is moved into L2
 Example: AMD Athlon

 Same or different block size in L1 and L2 caches

 Choosing a larger block size in L2 can improve performance
 However, different block sizes complicate implementation
 Pentium 4 has 64-byte blocks in L1 and 128-byte blocks in L2

29 pages
Assignment: Embedded Systems
No ratings yet
Assignment: Embedded Systems
6 pages
Lecture19 New2024
No ratings yet
Lecture19 New2024
25 pages
Unit 5 (Memory)
No ratings yet
Unit 5 (Memory)
108 pages
Module 5 - 5.1 Overview of Computer Memory
No ratings yet
Module 5 - 5.1 Overview of Computer Memory
65 pages
Chapter5 PDF
No ratings yet
Chapter5 PDF
95 pages
Computer Architecture Chapter 4: The Processor Part 3: Dr. Phạm Quốc Cường
No ratings yet
Computer Architecture Chapter 4: The Processor Part 3: Dr. Phạm Quốc Cường
23 pages
E3 Res Elm04 15 PDF
No ratings yet
E3 Res Elm04 15 PDF
7 pages
Query Processing & Optimization
No ratings yet
Query Processing & Optimization
100 pages
Chapter - 5 Concurrency Control PDF
No ratings yet
Chapter - 5 Concurrency Control PDF
57 pages
Exercises 3 PDF
No ratings yet
Exercises 3 PDF
4 pages
Basic Authentification For Web Services: Mobile Application Development
No ratings yet
Basic Authentification For Web Services: Mobile Application Development
9 pages
Mobile Application Development: Luong The Nhan
No ratings yet
Mobile Application Development: Luong The Nhan
35 pages
CO3043 MobileApplicationDevelopment PDF
No ratings yet
CO3043 MobileApplicationDevelopment PDF
4 pages
Web Services for CS Students
No ratings yet
Web Services for CS Students
19 pages
Atr42 - 72 MPC Specs
No ratings yet
Atr42 - 72 MPC Specs
233 pages
EECS Undergrad Manual
No ratings yet
EECS Undergrad Manual
50 pages
MCQs Format 1
60% (5)
MCQs Format 1
3 pages
ATS22D88Q
No ratings yet
ATS22D88Q
10 pages
Report - Smart Irrigation System
No ratings yet
Report - Smart Irrigation System
32 pages
Philips hts6120
No ratings yet
Philips hts6120
51 pages
Manu Op. - 0001 - HBDR21990 - 2100001 - 0001
No ratings yet
Manu Op. - 0001 - HBDR21990 - 2100001 - 0001
178 pages
IEC 61850 Relay Testing Guide
No ratings yet
IEC 61850 Relay Testing Guide
5 pages
High Output: Industrial VRS Magnetic Speed Sensors
No ratings yet
High Output: Industrial VRS Magnetic Speed Sensors
8 pages
ProMariner ProSport Marine Battery Charger
No ratings yet
ProMariner ProSport Marine Battery Charger
32 pages
06 - CG Instrument Transformers
No ratings yet
06 - CG Instrument Transformers
12 pages
Low Power Estimation
No ratings yet
Low Power Estimation
82 pages
Bi-Directional Visitor Counter
No ratings yet
Bi-Directional Visitor Counter
21 pages
S-Quad Sensor and Sounder Datasheet
No ratings yet
S-Quad Sensor and Sounder Datasheet
2 pages
JBL PRX712.v1
No ratings yet
JBL PRX712.v1
2 pages
TESB10605R1
No ratings yet
TESB10605R1
11 pages
DC Generator & Motor Problems
No ratings yet
DC Generator & Motor Problems
35 pages
Siprotec 7ss85 Profile
No ratings yet
Siprotec 7ss85 Profile
2 pages
LCD Displayand Irremote Control: Brief Explanation
No ratings yet
LCD Displayand Irremote Control: Brief Explanation
8 pages
VT 300d
No ratings yet
VT 300d
3 pages
IVI-4.1 Scope v3
No ratings yet
IVI-4.1 Scope v3
251 pages
Op-Amp ICs: Types, Structure, Functions
No ratings yet
Op-Amp ICs: Types, Structure, Functions
79 pages
Cable Analysis Report
67% (3)
Cable Analysis Report
458 pages
Level 3 Repair: 8-1. Block Diagram
No ratings yet
Level 3 Repair: 8-1. Block Diagram
31 pages
Confidence Power Point
No ratings yet
Confidence Power Point
11 pages
Title Kundur Nptel Material Unit 1 - Introduction
No ratings yet
Title Kundur Nptel Material Unit 1 - Introduction
2 pages
High-Efficiency Solar Inverters
No ratings yet
High-Efficiency Solar Inverters
1 page
Application of Shift Registers
No ratings yet
Application of Shift Registers
4 pages
Cinema: ONE Solution Guide
No ratings yet
Cinema: ONE Solution Guide
12 pages
Dewalt DCS331 Sticksåg
No ratings yet
Dewalt DCS331 Sticksåg
17 pages


