Understand CPU Caching
Concepts
Concept of Caching
The need for a cache arises for two reasons:
The concept of locality of reference.
-> Roughly 5 percent of the data is accessed 95 percent of the time, so it
makes sense to cache that 5 percent.
The gap between CPU and main memory speeds.
-> By analogy with the producer-consumer problem, the CPU is the
consumer while RAM and hard disks act as producers. Slow
producers limit the performance of the consumer.
Locality of Reference
Spatial locality : If a particular memory location, say the nth location, is
referenced at a particular time, then it is likely that the (n+1)th memory
location will be referenced in the near future.
The actual piece of data that was requested is called the critical word,
and the surrounding group of bytes that gets fetched along with it is
called a cache line or cache block.
Temporal locality : If a particular memory location is referenced at some
time T, then it is likely that the same location will be referenced again
at time T + delta.
This is very similar to the concept of a working set, i.e., the set of pages
which the CPU accesses frequently.
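Both kinds of locality are easy to observe from software. The C sketch below (the matrix size and clock()-based timing are illustrative assumptions, not from the slides) sums a matrix twice: row by row, which walks consecutive addresses and exploits spatial locality, and column by column, which strides across rows and wastes most of each fetched cache line.

    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static volatile int m[N][N];  /* volatile keeps the loads from being optimized away */

    int main(void) {
        long sum = 0;
        clock_t t0;

        t0 = clock();
        for (int i = 0; i < N; i++)        /* row-major: consecutive addresses, */
            for (int j = 0; j < N; j++)    /* so each fetched cache line is     */
                sum += m[i][j];            /* fully used before moving on       */
        printf("row-major:    %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (int j = 0; j < N; j++)        /* column-major: each access strides */
            for (int i = 0; i < N; i++)    /* N*sizeof(int) bytes, touching a   */
                sum += m[i][j];            /* new cache line almost every time  */
        printf("column-major: %.3fs\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

        return (int)(sum & 1);
    }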
CPU Cache and its operation
A CPU cache is a smaller, faster memory which stores copies of the
data from the most frequently used main memory locations. The
concept of locality of reference drives caching: we cache the most
frequently used data and instructions for faster access.
A CPU cache may be a data cache or an instruction cache. Unlike RAM,
cache is not expandable.
The CPU first checks the L1 cache for data; if it does not find it in L1,
it moves on to L2 and finally L3. If the data is not found in L3 either,
RAM is searched next, followed by the hard drive.
If the CPU finds the requested data in a cache, it's a cache hit; if
not, it's a cache miss.
Levels of caching: speed and size comparisons
Level                      Access Time                Typical Size     Technology   Managed By
Level 1 Cache (on-chip)    2-8 ns                     8 KB - 128 KB    SRAM         Hardware
Level 2 Cache (off-chip)   5-12 ns                    0.5 MB - 8 MB    SRAM         Hardware
Main Memory                10-60 ns                   64 MB - 2 GB     DRAM         Operating System
Hard Disk                  3,000,000 - 10,000,000 ns  100 GB - 2 TB    Magnetic     Operating System
Cache organization
When the processor needs to read or write a location in main
memory, it first checks whether that memory location is in the cache.
This is accomplished by comparing the address of the memory location
to all tags in the cache that might contain that address.
If the processor finds that the memory location is in the cache, we say
that a cache hit has occurred; otherwise, we speak of a cache miss.
Cache Entry structure
Cache row entries usually have the following structure:
Tag    Index    Data blocks    Displacement    Valid bit
The data blocks (cache line) contain the actual data fetched from the main memory.
The memory address is split into the tag, the index and the displacement (offset),
while the valid bit denotes that this particular entry has valid data.
• The index length is ceil(log2(number of cache rows)) bits and describes which row the data has been put in.
• The displacement (offset) length is ceil(log2(block size in bytes)) bits and specifies which byte within the stored block we need.
• The tag length is address length − index length − displacement length.
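As a concrete sketch of the address split (the geometry is an illustrative assumption, not from the slides): with 32-bit addresses, 256 cache rows, and 64-byte blocks, the displacement takes ceil(log2(64)) = 6 bits, the index ceil(log2(256)) = 8 bits, and the tag the remaining 18 bits.

    #include <stdio.h>

    /* Splitting a 32-bit address into tag / index / displacement for an
     * assumed geometry: 256 rows (8 index bits), 64-byte blocks (6 offset
     * bits), leaving 32 - 8 - 6 = 18 tag bits. */
    #define OFFSET_BITS 6
    #define INDEX_BITS  8

    int main(void) {
        unsigned addr = 0xDEADBEEFu;

        unsigned offset = addr & ((1u << OFFSET_BITS) - 1);                 /* low 6 bits   */
        unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* next 8 bits  */
        unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* high 18 bits */

        printf("addr=0x%08X  tag=0x%05X  index=%u  offset=%u\n",
               addr, tag, index, offset);
        return 0;
    }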
Cache organization - 1
Cache is divided into blocks. The blocks form the basic unit of cache
organization. RAM is also organized into blocks of the same size as the
cache's blocks.
When the CPU requests a byte from a particular RAM block, it needs to
be able to determine three things very quickly:
1. Whether or not the needed block is actually in the cache
2. The location of the block within the cache
3. The location of the desired byte within the block
Mapping RAM blocks to cache blocks
Fully associative : Any RAM block can be stored in any available block
frame. The problem with this scheme is that if you want to retrieve a
specific block from the cache, you have to check the tag of every single
block frame in the entire cache, because the desired block could be in
any of the frames.
Direct mapping : In a direct-mapped cache, each block frame can
cache only a certain subset of the blocks in main memory, e.g. RAM
blocks whose block number modulo the number of cache blocks is 1 are
always stored in cache block 1. The problem with this approach is that
certain cache blocks could remain unused while others suffer frequent
eviction of their entries.
N-way set associative : RAM block X can be mapped to any of the N
block frames of one set, e.g. in a 2-way set-associative cache, block X
may be placed in either of two frames.
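A minimal sketch of the direct-mapped case (the frame count, tag arithmetic, and demo trace are illustrative assumptions): the block number modulo the number of frames selects the only frame a block may occupy, and the stored tag decides hit or miss. The trace also shows the eviction problem described above: blocks 1 and 9 map to the same frame and keep evicting each other.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_FRAMES 8

    struct frame { bool valid; unsigned tag; };
    static struct frame cache[NUM_FRAMES];

    /* Returns true on a hit; on a miss, installs the block (evicting
     * whatever occupied the frame). */
    static bool lookup(unsigned block_no) {
        unsigned index = block_no % NUM_FRAMES;   /* direct mapping rule      */
        unsigned tag   = block_no / NUM_FRAMES;   /* the rest identifies the  */
        struct frame *f = &cache[index];          /* block occupying a frame  */

        if (f->valid && f->tag == tag)
            return true;                          /* hit */

        f->valid = true;                          /* miss: incoming block evicts */
        f->tag   = tag;                           /* the previous occupant       */
        return false;
    }

    int main(void) {
        unsigned trace[] = { 1, 9, 1, 9 };        /* both map to frame 1 */
        for (int i = 0; i < 4; i++)
            printf("block %u: %s\n", trace[i], lookup(trace[i]) ? "hit" : "miss");
        return 0;   /* all four accesses miss: the two blocks thrash one frame */
    }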
Handling Cache Miss
In order to make room for the new entry on a cache miss, the cache has to evict
one of the existing entries.
The heuristic that it uses to choose the entry to evict is called the replacement
policy. The fundamental problem with any replacement policy is that it must
predict which existing cache entry is least likely to be used in the future. Some
of the replacement policies are :
Random Eviction: Removal of any cache entry by random choice.
LIFO: Evicting the most recently added cache entry.
FIFO: Evicting the oldest cache entry.
LRU: Evicting the least recently used cache entry.
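A minimal LRU sketch (the entry count, logical-clock timestamps, and trace are illustrative assumptions): every access ticks a logical clock, a hit refreshes the entry's timestamp, and a miss evicts the entry with the oldest one.

    #include <stdio.h>
    #include <stdint.h>

    #define ENTRIES 4

    struct entry { int valid; unsigned block; uint64_t last_used; };
    static struct entry cache[ENTRIES];
    static uint64_t now;                 /* logical clock: ticks once per access */

    static int access_block(unsigned block) {
        now++;
        for (int i = 0; i < ENTRIES; i++) {
            if (cache[i].valid && cache[i].block == block) {
                cache[i].last_used = now;          /* hit: refresh recency */
                return 1;
            }
        }
        int victim = 0;                            /* miss: find the LRU entry */
        for (int i = 1; i < ENTRIES; i++)          /* (empty slots carry stamp  */
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;                        /*  0, so they fill first)   */
        cache[victim].valid = 1;
        cache[victim].block = block;
        cache[victim].last_used = now;
        return 0;
    }

    int main(void) {
        unsigned trace[] = { 1, 2, 3, 4, 1, 5, 2 };
        for (int i = 0; i < 7; i++)
            printf("block %u: %s\n", trace[i],
                   access_block(trace[i]) ? "hit" : "miss");
        return 0;  /* accessing 5 evicts block 2 (oldest stamp), not block 1 */
    }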
Mirroring Cache to Main memory
If data are written to the cache, they must at some point be written to
main memory and to the higher-order caches as well. The timing of this
write is controlled by what is known as the write policy.
In a write-through cache, every write to the cache causes a write to main
memory and to higher-order caches like L2 and L3.
In a write-back (or copy-back) cache, writes are not immediately mirrored
to main memory. Instead, the cache tracks which locations have been
written to (marked dirty); such entries are written to main memory and
higher-order caches just before the cache entry is evicted.
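A minimal sketch contrasting the two policies over a single hypothetical cache line (the one-line "cache", memory array, and names are illustrative assumptions): with write-through, memory is updated on every store; with write-back, only a dirty bit is set and memory is updated when the line is evicted.

    #include <stdio.h>
    #include <stdbool.h>

    #define WRITE_BACK 1   /* set to 0 to model a write-through cache */

    static unsigned char memory[256];
    static struct { bool valid, dirty; unsigned char addr, data; } line;

    static void cache_write(unsigned char addr, unsigned char data) {
        if (line.valid && line.addr != addr && WRITE_BACK && line.dirty)
            memory[line.addr] = line.data;    /* evict: flush dirty data back */

        line.valid = true;
        line.addr  = addr;
        line.data  = data;
        if (WRITE_BACK)
            line.dirty = true;                /* defer the memory write        */
        else
            memory[addr] = data;              /* write-through: mirror at once */
    }

    int main(void) {
        cache_write(10, 42);
        printf("after store: memory[10] = %d\n", memory[10]); /* 0: stale under write-back */
        cache_write(11, 7);                                   /* conflict evicts addr 10   */
        printf("after evict: memory[10] = %d\n", memory[10]); /* now 42                    */
        return 0;
    }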
Stale data in cache
The data in main memory being cached may be changed by other entities
(e.g. peripherals using direct memory access, or another core in a
multi-core processor), in which case the copy in the cache may become
out-of-date or stale.
Conversely, when one core of a multi-core processor updates the data in
its cache, copies of the data in caches associated with other cores
become stale.
Communication protocols between the cache managers which keep the
data consistent are known as cache coherence protocols, e.g. snoopy-based,
directory-based, and token-based protocols.
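A heavily simplified sketch of the snoopy-based idea (two cores, one cached address each, a function call standing in for the shared bus; no MESI states are modeled, and all names are illustrative assumptions): when one core writes, every other core snooping the bus invalidates its stale copy.

    #include <stdio.h>
    #include <stdbool.h>

    #define CORES 2

    static struct { bool valid; unsigned char addr, data; } cache[CORES];

    /* Every other core "snoops" the write on the shared bus and drops
     * its now-stale copy of that address. */
    static void snoop_invalidate(int writer, unsigned char addr) {
        for (int c = 0; c < CORES; c++)
            if (c != writer && cache[c].valid && cache[c].addr == addr)
                cache[c].valid = false;
    }

    static void core_write(int core, unsigned char addr, unsigned char data) {
        cache[core].valid = true;
        cache[core].addr  = addr;
        cache[core].data  = data;
        snoop_invalidate(core, addr);   /* broadcast the write */
    }

    int main(void) {
        core_write(0, 5, 1);   /* core 0 caches address 5        */
        core_write(1, 5, 2);   /* core 1 writes the same address */
        printf("core 0 copy valid? %d\n", cache[0].valid);  /* 0: invalidated */
        return 0;
    }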
State of the Art today
Current research on cache design and cache coherence handling focuses
largely on multicore architectures.
References
Wikipedia : http://en.wikipedia.org/wiki/CPU_cache
ArsTechnica : http://arstechnica.com/
Intel : http://software.intel.com
What Every Programmer Should Know About Memory - Ulrich Drepper, Red Hat, Inc.
Q/A