EE282 Lecture 4
Advanced Caching (2)
Jacob Leverich
http://eeclass.stanford.edu/ee282
EE282 – Spring 2011 – Lecture 04
Announcements
HW1 out
Due Wed 4/20 @ 5pm, box outside Gates 305
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Advanced Cache Optimizations
Multi-level caches and inclusion
Victim caches
Pseudo-associative caches
Skew-associative caches
Critical word first
Non-blocking caches
Prefetching
Multi-ported caches
Readings: H&P 5.1-2 and 4.2
Read on your own about way prediction, pipelined caches, merging
write buffers, compiler optimizations
Non-blocking or Lockup-Free Caches
Basic idea
Allow for hits while serving a miss (hit-under-miss)
Allow for more than one outstanding miss (miss-under-miss)
When does it make sense (for L1, L2, …)
When the processor can handle >1 pending load/store
This is the case with superscalar processors
When the cache serves >1 processor or other cache
When the lower level allows for multiple pending accesses
More on this later
What is difficult about non-blocking caches
Handling multiple misses at the same time
Handling loads to pending misses
Handling stores to pending misses
Potential of Non-blocking Caches
[Figure: three execution timelines.
Blocking cache: the CPU stalls on a miss for the full miss penalty.
Hit under miss: the CPU keeps executing and serving hits during the miss penalty, stalling only when the missing result is needed.
Multiple outstanding misses: several miss penalties overlap, hiding even more latency.]
Miss Status Handling Register
Keeps track of
Outstanding cache misses
Pending loads & stores that refer to that cache block
Fields of an MSHR
Valid bit
Cache block address
Must support associative search
Issued bit (1 if the request was already issued to memory)
For each pending load or store
Valid bit
Type (load/store) and format (byte/halfword/…)
Block offset
Destination register for load OR store buffer entry for stores
MSHR
[MSHR entry layout (field widths in bits):
Valid (1) | Block Address (27) | Issued (1)
plus four load/store entries, each with:
Valid (1) | Type (3) | Block Offset (5) | Destination (5)]
Non-blocking Caches: Operation
On a cache miss:
Search MSHRs for a pending access to the same cache block
If found, just allocate a new load/store entry
If not, allocate a free MSHR
Update block address and first load/store entry
If no MSHR or load/store entry is free, stall
When one word/sub-block for a cache line becomes available:
Check which loads/stores are waiting for it
Forward data to the LSU
Mark those loads/stores as invalid
Write the word into the cache
When the last word for a cache line is available:
Mark the MSHR as invalid
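The bookkeeping above can be made concrete with a short C sketch. This is a minimal illustration, not any particular processor's design: the table sizes, field widths, and the handle_miss() helper are all assumptions.

#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS        8   /* assumed sizes, purely illustrative */
#define TARGETS_PER_MSHR 4

typedef struct {
    bool    valid;
    bool    is_store;       /* type: load or store */
    uint8_t block_offset;   /* which word within the block */
    uint8_t dest;           /* destination register, or store buffer entry */
} Target;

typedef struct {
    bool     valid;
    bool     issued;        /* request already sent to memory? */
    uint32_t block_addr;
    Target   targets[TARGETS_PER_MSHR];
} MSHR;

static MSHR mshrs[NUM_MSHRS];

/* Returns true if the miss was recorded, false if the cache must stall. */
bool handle_miss(uint32_t block_addr, bool is_store,
                 uint8_t offset, uint8_t dest)
{
    MSHR *free_mshr = NULL;

    /* Associative search for a pending miss to the same cache block. */
    for (int i = 0; i < NUM_MSHRS; i++) {
        if (!mshrs[i].valid) {
            if (!free_mshr) free_mshr = &mshrs[i];
            continue;
        }
        if (mshrs[i].block_addr == block_addr) {
            /* Merge: just allocate a new load/store target entry. */
            for (int t = 0; t < TARGETS_PER_MSHR; t++) {
                if (!mshrs[i].targets[t].valid) {
                    mshrs[i].targets[t] = (Target){true, is_store, offset, dest};
                    return true;
                }
            }
            return false;   /* no load/store entry free: stall */
        }
    }

    if (!free_mshr) return false;   /* no MSHR free: stall */

    /* New miss: allocate an MSHR and record the first target. */
    *free_mshr = (MSHR){ .valid = true, .issued = false,
                         .block_addr = block_addr };
    free_mshr->targets[0] = (Target){true, is_store, offset, dest};
    return true;
}

A real MSHR file performs the associative search in parallel in hardware; the loops here just model that lookup.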
Non-blocking Cache Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Non-blocking cache
Prefetching
Idea: fetch data into the cache before processors
request them
Can address cold misses
Can be done by the programmer, compiler, or hardware
Characteristics of ideal prefetching
You only prefetch data that are truly needed
Avoid bandwidth waste
You issue prefetch requests early enough
To hide the memory latency
You don’t issue prefetch requests too early
To avoid cache pollution
Software Prefetching

for (i=0; i<N; i++) {
  __prefetch(a[i+8]);
  __prefetch(b[i+8]);
  sum += a[i]*b[i];
}

Issues with software prefetching:
Takes up issue slots
Not a big issue with superscalar processors
Takes up system bandwidth
Must have non-blocking caches
Doesn't have to be correct! E.g., __prefetch(-1);
Prefetch distance depends on the specific system implementation
Non-portable code
Not easy to use for pointer-based structures
Requires a ninja programmer/compiler!
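For comparison, a compilable version of the loop above using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 8 is an assumed tuning value.

#include <stddef.h>

/* Dot product with software prefetching via the GCC/Clang intrinsic.
 * The distance (8) is an assumption; the right value depends on memory
 * latency and per-iteration work on the target machine. */
double dot(const double *a, const double *b, size_t n)
{
    const size_t DIST = 8;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n) {
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 3 /* high locality */);
            __builtin_prefetch(&b[i + DIST], 0, 3);
        }
        sum += a[i] * b[i];
    }
    return sum;
}

The bounds check is optional: as the slide notes, prefetches do not have to be correct, and a prefetch of an invalid address is simply dropped.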
Hardware Prefetching
Same goal as software prefetching, but initiated by hardware
Can tune to specific system implementation
Does not waste instruction issue bandwidth
More portable code
Major design questions
Where to place a prefetch engine?
L1, L2, …
What to prefetch?
Next sequential cache line(s), strided patterns, pointers, …
When to prefetch?
On a load, on a miss, when other prefetched data used, …
Where to place prefetched data?
In the cache or in a special prefetch buffer
How to handle VM exceptions?
Don’t prefetch beyond a page?
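On the last point, one common answer is simply to drop any prefetch that would cross a page boundary, since the next page's translation and permissions are unknown without a TLB walk. A minimal sketch, assuming 4KB pages:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096   /* assumed 4KB pages */

/* Drop a candidate prefetch that would cross into another page. */
static inline bool same_page(uint64_t demand_addr, uint64_t prefetch_addr)
{
    return (demand_addr / PAGE_SIZE) == (prefetch_addr / PAGE_SIZE);
}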
Simple Sequential Prefetching
On a cache miss, fetch two sequential memory
blocks
Exploits spatial locality in both instructions & data
Exploits high bandwidth for sequential accesses
Called “Adjacent Cache Line Prefetch” or “Spatial
Prefetch” by Intel
Extend to fetching N sequential memory blocks
Pick N large enough to hide the memory latency
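A back-of-the-envelope example with assumed numbers: if a miss to memory costs ~200 cycles and the loop consumes one cache line every ~25 cycles of computation, the prefetcher must run at least N = 200 / 25 = 8 lines ahead to fully hide the latency.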
Stream Prefetching
Sequential prefetching problem
Performance slows down once every N cache lines
Stream prefetching is a continuous version of sequential prefetching
Stream buffer can fit N cache lines
On a miss, start fetching N sequential cache lines
On a stream buffer hit:
Move cache line to cache, start fetching line (N+1)
In other words, stream buffer tries to stay N cache lines ahead
Design issues
When is a stream buffer allocated
When is a stream buffer released
Can use multiple stream buffers to capture multiple streams
E.g. a program operating on 2 arrays
Stream Buffer Design
[Figure: stream buffer design: a FIFO of prefetched cache lines with tags, sitting between the cache and the next memory level.]
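A sketch of the stream-buffer policy in C. The depth, line size, and the fetch_line()/move_line_to_cache() hooks are illustrative assumptions, not a specific design:

#include <stdbool.h>
#include <stdint.h>

#define STREAM_DEPTH 4        /* N: how far ahead the buffer runs (assumed) */
#define LINE_BYTES   64

typedef struct {
    bool     valid;
    uint64_t next_line;               /* next sequential line to fetch */
    uint64_t entries[STREAM_DEPTH];   /* tags of prefetched lines (data omitted) */
    int      count;
} StreamBuffer;

/* Stubs standing in for the rest of the memory system. */
void fetch_line(uint64_t line)         { (void)line; /* would issue a request */ }
void move_line_to_cache(uint64_t line) { (void)line; /* would install the line */ }

/* Called on an L1 miss to 'addr'. Returns true if the stream buffer hit. */
bool stream_buffer_access(StreamBuffer *sb, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;

    if (sb->valid && sb->count > 0 && sb->entries[0] == line) {
        /* Hit: move the head line into the cache... */
        move_line_to_cache(line);
        for (int i = 1; i < sb->count; i++)   /* shift the FIFO */
            sb->entries[i - 1] = sb->entries[i];
        sb->count--;
        /* ...and stay N lines ahead by starting to fetch line (N+1). */
        sb->entries[sb->count++] = sb->next_line;
        fetch_line(sb->next_line);
        sb->next_line++;
        return true;
    }

    /* Miss: (re)allocate the buffer; the missing line itself goes through
     * the normal miss path, while the buffer fetches the N lines after it. */
    sb->valid = true;
    sb->count = 0;
    sb->next_line = line + 1;
    for (int i = 0; i < STREAM_DEPTH; i++) {
        sb->entries[sb->count++] = sb->next_line;
        fetch_line(sb->next_line);
        sb->next_line++;
    }
    return false;
}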
Strided Prefetching

Idea: detect and prefetch strided accesses
for (i=0; i<N; i++) A[i*1024]++;

Stride detected using a PC-based table:

PC       Stride  Last Addr  Conf
0x08ab0  8       0xff024    10
0x03fa8  1024    0xf0ab2    11

For each PC, remember the stride
Stride detection:
Remember the last address used for this PC
Compare to the address currently being used by this PC
Track confidence using a two-bit saturating counter
Increment when the stride is correct, decrement when incorrect
How to use the PC-based table:
Similar to stream prefetching, except using the stride instead of +1
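The table update can be sketched in C as follows; the table size, indexing by PC, and the confidence threshold are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 256   /* assumed table size, indexed by PC */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    uint8_t  conf;       /* two-bit saturating counter: 0..3 */
} StrideEntry;

static StrideEntry stride_table[TABLE_SIZE];

uint64_t prefetch_addr;  /* set when should_prefetch() returns true */

/* Update the stride table on each load; return true if the entry is
 * confident enough to issue a prefetch for the predicted next address. */
bool should_prefetch(uint64_t pc, uint64_t addr)
{
    StrideEntry *e = &stride_table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;            /* stride confirmed */
    } else {
        if (e->conf > 0) e->conf--;            /* stride mispredicted */
        if (e->conf == 0) e->stride = stride;  /* retrain on the new stride */
    }
    e->last_addr = addr;

    if (e->conf >= 2) {                        /* confident: predict next access */
        prefetch_addr = addr + (uint64_t)e->stride;
        return true;
    }
    return false;
}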
Sandy Bridge Prefetching (Intel Core i7-2600K)
“Intel 64 and IA-32 Architectures Optimization
Reference Manual, Jan 2011”, pg 2-24
http://www.intel.com/Assets/PDF/manual/248966.pdf
Other Ideas in Prefetching
Prefetch for pointer-based data structures
Predict if fetched data contain a pointer & follow it
Works for linked-lists, graphs, etc
Must be very careful:
What is a pointer?
How far to prefetch?
Different correlation techniques
Markov prefetchers
Delta correlation prefetchers
Prefetching Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Prefetching
Multi-ported Caches
Idea: allow for multiple accesses in parallel
Processor with many LSUs, I+D access in L2, …
Can be implemented in multiple ways
True multi-porting
Multiple banks
What is difficult about multi-porting
Interaction between parallel accesses (especially for
stores)
True Multi-porting
True multiporting
Use 2-ported tag/data storage
Problem: large area increase
Problem: hit time increase
[Figure: one cache with two ports: Request 1 → Data 1 and Request 2 → Data 2, served in parallel.]
Multi-banked Caches
[Figure: two banks, each with its own port: Request 1 → Cache Bank 1 → Read Data 1, Request 2 → Cache Bank 2 → Read Data 2.]
Partition address space into multiple banks
Bank0 caches addresses from partition 0, bank1 from partition 1…
Can use least or most significant address bits for partitioning
What are the advantages of each approach?
Benefits: accesses can go in parallel if no conflicts
Challenges: conflicts, distribution network, bank
utilization
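To make the partitioning choice concrete, here is a hypothetical sketch with 64B lines and 8 banks: low-order line-address bits interleave consecutive lines across banks, which spreads sequential traffic, while high-order bits give each bank a contiguous region of the address space.

#include <stdint.h>

#define LINE_BYTES 64
#define NUM_BANKS  8          /* assumed: power of two */
#define ADDR_BITS  32

/* Low-order interleaving: consecutive cache lines map to different banks. */
static inline unsigned bank_low(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_BANKS;
}

/* High-order partitioning: each bank owns one contiguous 1/8th of memory. */
static inline unsigned bank_high(uint32_t addr)
{
    return addr >> (ADDR_BITS - 3);   /* top log2(NUM_BANKS) = 3 bits */
}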
Sun UltraSPARC T2
8-bank L2 cache
Multi-porting Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Multi-porting
Summary of Advanced Cache Optimizations

                      ------ Miss rate ------
Cache optimization    Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Multi-level                                      +
Victim cache                          ~          +
Pseudo-assoc.                         +/~
Skew-assoc.                           +                         ~
Non-blocking                                     +                         ~
Critical-word-first                              +
Prefetching           +     -                    +
Multi-porting                                                   ~          +

Also see Figure 5.11 in H&P
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Cache Coherence Problem

[Figure: three processors P1, P2, P3, each with a private cache ($), connected by a bus to memory and I/O devices.
(1) P1 reads u=5 from memory. (2) P3 reads u=5. (3) P3 writes u=7, updating only its own cache.
(4) P1 reads u again and sees the stale 5 in its cache. (5) P2 reads u and gets the stale 5 from memory.]

Cores may see different values for u
With write-back caches, the value written back to memory depends on happenstance: which cache flushes or writes back its value, and when
Threads or processes accessing main memory may see a very stale value
Unacceptable for programming, and it's frequent!
Hardware Cache Coherence
Using Snooping
Hardware guarantees that loads from all
cores will return the value of the latest
write
Coherence mechanisms
Metadata to track state for cached data
Controller that snoops bus (or interconnect)
activity and reacts if needed to adjust the state
of the cache data
There needs to be a serialization point
Shared L3, memory controller, or memory bus
MSI: Simple Coherence Protocol for Write-Back Caches

Each cache line has an address tag plus state bits:
M: Modified
S: Shared
I: Invalid

[State diagram, showing the cache state in processor P1:
M: stays in M while P1 reads or writes
M → S: another processor reads; P1 writes the block back
M → I: another processor signals intent to write
S: stays in S on reads by any processor
S → I: another processor signals intent to write
I → S: read miss
(A write by P1 moves the line to M, invalidating copies in other caches.)]
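The transitions above can be written down as a small C state function. This is an illustrative sketch of the MSI policy as described on this slide, not a full controller (it omits data movement and bus arbitration):

#include <stdbool.h>

typedef enum { I, S, M } MSIState;
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_INTENT_TO_WRITE } Event;

/* Next state for one cache line, as seen by processor P1.
 * PR_* events come from our processor; BUS_* events are snooped. */
MSIState msi_next(MSIState st, Event ev, bool *writeback)
{
    *writeback = false;
    switch (st) {
    case M:                                  /* only copy, and it is dirty */
        if (ev == BUS_READ)            { *writeback = true; return S; }
        if (ev == BUS_INTENT_TO_WRITE) { *writeback = true; return I; }
        return M;                            /* own reads and writes stay in M */
    case S:                                  /* clean, possibly shared copy */
        if (ev == PR_WRITE)            return M;  /* broadcast intent to write */
        if (ev == BUS_INTENT_TO_WRITE) return I;
        return S;                            /* reads by any processor stay in S */
    default: /* I */
        if (ev == PR_READ)             return S;  /* read miss: fetch block */
        if (ev == PR_WRITE)            return M;  /* write miss: fetch + intent */
        return I;
    }
}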
Quick Questions
How many copies of a cache line can you have
in S state?
How many copies can you have in M state?
How does L2 inclusion help?
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Software-managed Memory
Caches are complex, hard to design, hard to
optimize, hard to analyze, hard to use well,
hard to keep coherent…
Private on-chip memory with its own address
space
Not implicitly backed by main memory
Also called “Local Store”, “Local Memory”,
“Scratchpad”, “Stream Register File”
Ubiquitous in embedded computing space
Local Stores in the wild
IBM Cell Processor
256KB LS per core
Shared by inst. and data!
Playstation 3!
Cache vis-à-vis Local Store
[Two-column comparison of Cache vs. Local Store; see the Pros and Cons slide later in this lecture.]
Local Stores: AMAT
AMAT = HitTime + MissRate * MissPenalty
MissRate = 0%!
Consequences?
Simpler performance analysis
Less motivation for out-of-order cores
Cell processor is in-order
High clock rate and low power
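With assumed numbers for illustration: a cache with a 1-cycle hit time, a 5% miss rate, and a 200-cycle miss penalty gives AMAT = 1 + 0.05 × 200 = 11 cycles, while a local store with the same 1-cycle access gives AMAT = 1 cycle, every time.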
Local Stores: Operation
LD/ST instructions to LS proceed normally
No LD/ST to non-LS memory
DMA transfers (Direct Memory Access) to
move data to/from main memory and LS
Bulk, like memcpy()
Asynchronous
dma(void *local_address, void *remote_address,
int size, int tag, boolean direction);
Stream Programming
[Figure: two timelines. Serial: get(a), do_something(a), get(b), do_something(b). Overlapped: get(a) and get(b) are issued early, so do_something(a) runs while get(b) is in flight, followed by do_something(b).]
Overlap communication with computation
Hide memory latency
“Macroscopic” software prefetching
No ugly prefetch instructions interlaced w/ your code
Doesn’t waste instruction issue bandwidth
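A sketch of this pattern as classic double buffering in C, using the dma() call from the Local Stores: Operation slide plus a hypothetical dma_wait(tag) completion primitive; the buffer size and direction encoding are assumptions:

#include <stddef.h>

#define CHUNK 4096                 /* bytes per DMA transfer (assumed) */
enum { GET = 0, PUT = 1 };         /* assumed encoding of 'direction' */

/* Provided by the platform: dma() from the earlier slide, plus an
 * assumed primitive that blocks until a tagged transfer completes. */
void dma(void *local_address, void *remote_address,
         int size, int tag, int direction);
void dma_wait(int tag);

void process(char *buf, int n);    /* the computation on one chunk */

/* Assumes total is a nonzero multiple of CHUNK. */
void stream(char *remote, size_t total)
{
    static char local[2][CHUNK];   /* two local-store buffers */
    size_t nchunks = total / CHUNK;

    dma(local[0], remote, CHUNK, 0, GET);          /* prime the pipeline */
    for (size_t i = 0; i < nchunks; i++) {
        int cur = i & 1, nxt = (i + 1) & 1;
        if (i + 1 < nchunks)                       /* start the next get... */
            dma(local[nxt], remote + (i + 1) * CHUNK, CHUNK, nxt, GET);
        dma_wait(cur);                             /* ...wait for this one */
        process(local[cur], CHUNK);                /* compute while next get flies */
    }
}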
Local Stores: Pros and Cons
Pros:
No coherence!
Simple to implement
Less overhead (no tags)
Predictable performance, great for in-order cores
Can potentially hide all memory latency

Cons:
No coherence…
Complex to program
Can't run existing SW
Unpredictable access patterns perform poorly
Pointer chasing difficult (linked lists, trees, etc.)
People resort to implementing set-associative caches in software…
Local Store Efficacy

                     ------ Miss rate ------
Cache optimization   Cold  Capacity  Conflict   Miss penalty   Hit time   Bandwidth
Local Store
SW Complexity
Today’s lecture: More Caches
Advanced cache optimizations
H&P Chapter 5
Cache coherence
H&P Chapter 4
Software-managed memories
Beyond processor caches
Everything is a Cache for Something Else

Level            Access Time         Capacity   Managed by
Registers        1 cycle             ~500B      Software/compiler
Level 1 Cache    1-3 cycles          ~64KB      Hardware
Level 2 Cache    5-10 cycles         1-10MB     Hardware
DRAM             ~100 cycles         ~10GB      Software/OS
Disk             10^6-10^7 cycles    ~1TB       Software/OS
(and below disk: tape, and "the Interwebs")
Example: File cache

Do files exhibit locality?
Prefetching? Microsoft "SuperFetch": load common programs at boot
Write back or write through? When should we write to disk?
Coherence? "Leases" in network filesystems
Associativity? Place arbitrarily and keep an index
Most disks have caches
Example: Browser cache

Do web pages you visit exhibit locality?
Write back or write through? No writes!
Coherence? Did the page change since I last checked? Relaxed coherence: the "If-Modified-Since" header
Replacement policy? Probably LRU
AMAT?
Caching is a ubiquitous tool
Same design issues in system design as in
processor design
Placement, lookup, write policies, replacement
policies, coherence
Same optimization dimensions
Size, associativity, granularity
Hit time, miss rate, miss penalty, bandwidth,
complexity
Next Lecture
DRAM (Main Memory)