Posted on 26 Aug 2025 — CC-BY 4.0 — https://doi.org/10.36227/techrxiv.175623439.
92809872/v1 — e-Prints posted on TechRxiv are preliminary reports that are not peer reviewed. They should not b...
August 26, 2025
OM MAHESHWARI1
Affiliation not available
1
Adaptive-LRU: A Lightweight, Thrash-Resistant
Cache Replacement Policy for High-Performance
CPU Caches
Om Maheshwari
School of Computing and Electrical Engineering
Indian Institute of Technology Mandi
Mandi, Himachal Pradesh, India
Email: [email protected]
Abstract—Cache replacement policies significantly impact CPU • Streaming workloads: Sequential access to large data
performance, particularly under diverse memory access pat- structures that exceed cache capacity result in zero reuse,
terns. While Least Recently Used (LRU) excels with strong causing LRU to achieve near-zero hit rates.
temporal locality, it suffers catastrophic performance degradation
• Near-capacity loops: Cyclic access to working sets
under streaming workloads and working sets that marginally
exceed cache capacity. We present Adaptive-LRU, a lightweight, marginally larger than cache capacity leads to thrashing,
dynamically adaptive insertion policy that employs set duel- where useful data is repeatedly evicted just before reuse.
ing to toggle between standard LRU insertion and a thrash- • Mixed access patterns: Applications with both high-
resistant bimodal insertion scheme. Our comprehensive eval- locality and streaming components suffer when streaming
uation using a detailed trace-driven simulator with proper
statistical methodology demonstrates that Adaptive-LRU achieves accesses pollute the cache and evict useful data.
32.6 percentage points higher hit rate than LRU/FIFO on stream- These limitations have motivated extensive research into
ing workloads and 63.2 percentage points on near-capacity loops. adaptive replacement policies that can dynamically adjust
Compared to state-of-the-art DRRIP, Adaptive-LRU delivers their behavior based on workload characteristics, including
7.6 × better performance on streaming and 5.2 × better on near-
capacity workloads while maintaining identical performance on Dynamic Insertion Policy (DIP) [3], Re-Reference Interval
cache-friendly patterns. The policy requires minimal hardware Prediction (RRIP) [4], and Hawkeye [5].
overhead (6.0 to12.0 bytes total per cache partition, zero per-line
metadata) and preserves the hit critical path of standard LRU A. Our Approach and Contributions
implementations. This paper introduces Adaptive-LRU, a lightweight cache
Index Terms—Cache replacement, CPU architecture, memory replacement policy that achieves the scan resistance of com-
hierarchy, LRU, dynamic insertion policy, set dueling, thrash plex adaptive policies while maintaining the simplicity and low
resistance, scan detection
overhead of classical LRU. Our key insight is that adaptive
insertion position control captures most of the performance
I. I NTRODUCTION
benefits of sophisticated replacement policies at minimal hard-
The exponentially widening gap between processor and ware cost.
main memory performance has made cache memory sys- Adaptive-LRU employs a set dueling mechanism similar to
tems critical for sustained computational throughput. Modern DIP but focuses specifically on the insertion decision while
processors employ sophisticated multi-level cache hierarchies preserving standard LRU victim selection and hit promotion
where replacement policy decisions directly influence appli- behavior. The policy uses a small number of leader sets to
cation performance through their impact on miss rates and evaluate two competing insertion strategies: standard LRU
memory access latency [1]. insertion (insert at Most Recently Used position) and Bimodal
Cache replacement policies determine which cache line to Insertion Policy (BIP) which inserts most lines at the LRU
evict when inserting new data into a full cache set. The Least position with occasional MRU insertion probability ϵ = 1/32.
Recently Used (LRU) policy has dominated processor cache Our primary contributions include:
design due to its effectiveness under temporal locality—the 1) A novel adaptive insertion mechanism that provides
principle that recently accessed data is likely to be accessed superior scan protection while preserving temporal lo-
again soon [2]. LRU maintains a temporal ordering of cache cality benefits with minimal hardware overhead.
lines and evicts the least recently accessed line during capacity 2) Comprehensive experimental evaluation including
misses. miss-ratio curve analysis, associativity sensitivity stud-
However, LRU exhibits pathological behavior under certain ies, parameter optimization, and detailed comparison
access patterns that are increasingly common in modern ap- with state-of-the-art policies using rigorous statistical
plications: methodology.
3) Superior performance results: 32.6 percentage points C. Modern Adaptive Policies
hit rate improvement over LRU/FIFO on streaming Dynamic Insertion Policy (DIP) [3]: Introduced the con-
workloads, 7.6 × improvement over DRRIP on scan- cept of set dueling to dynamically choose between LRU and
intensive patterns, while matching LRU on cache- Bimodal Insertion Policy (BIP). BIP inserts most lines at
friendly workloads. the LRU position with low probability insertion at MRU,
4) Detailed hardware analysis demonstrating providing scan resistance.
6.0 to12.0 bytes total overhead per cache partition Re-Reference Interval Prediction (RRIP) [4]: Generalizes
with zero per-line metadata requirements and no impact replacement decisions using re-reference interval predictions.
on hit critical path timing. Each line has a 2-bit Re-Reference Prediction Value (RRPV)
5) Practical deployment insights including integration indicating expected time until next access. DRRIP uses set
considerations, parameter sensitivity analysis, and com- dueling to choose between Static RRIP (SRRIP) and Bimodal
patibility with existing cache hierarchies. RRIP (BRRIP).
The remainder of this paper is organized as follows: Sec- Hawkeye [5]: Employs a sophisticated predictor that learns
tion II reviews background and foundational related work. cache-friendly behavior based on Belady’s optimal algorithm.
Section III presents the Adaptive-LRU policy design and algo- Uses PC-based prediction and requires substantial hardware
rithm. Section IV discusses hardware implementation details. for training and prediction tables.
Section V describes our experimental methodology. Section VI
III. A DAPTIVE -LRU P OLICY D ESIGN
presents comprehensive evaluation results. Section VII pro-
vides detailed analysis and discussion. Section VIII discusses A. Design Philosophy
additional related work directions. Section IX addresses limi- Adaptive-LRU is based on three key design principles:
tations and future work. Section X concludes. 1) Insertion-focused adaptation: Modify only the inser-
tion position of new cache lines while preserving LRU
II. BACKGROUND AND F OUNDATIONAL R ELATED W ORK victim selection and hit promotion behavior.
2) Minimal hardware overhead: Require no per-line
A. Cache Replacement Fundamentals metadata and minimal global state.
Cache replacement policies operate under the constraint that 3) Backward compatibility: Maintain identical behavior
they cannot predict future memory accesses. Belady’s MIN to LRU when scan protection is not needed.
algorithm [6] provides the theoretical optimum by evicting These principles ensure that Adaptive-LRU can be deployed
the line whose next reference is farthest in the future, but this as a drop-in replacement for existing LRU implementations
requires clairvoyance about future accesses. with minimal hardware changes.
The concept of stack distance (or reuse distance) provides
B. Core Mechanism
a theoretical framework for understanding replacement policy
behavior [7]. The stack distance of a memory reference is the Adaptive-LRU employs a set dueling mechanism to dynam-
number of distinct memory locations accessed since the previ- ically choose between two insertion modes:
ous reference to the same location. Miss-ratio curves (MRCs) LRU Insertion Mode: New cache lines are inserted at the
plot miss rate versus cache capacity and are fundamental to Most Recently Used (MRU) position, identical to standard
understanding cache behavior across different sizes. LRU behavior.
LRU is a stack algorithm, meaning it maintains the inclusion BIP Insertion Mode: New cache lines are inserted at the
property: if a line is present in a cache of size k, it will Least Recently Used (LRU) position with high probability 1−
also be present in any cache of size k ′ > k given the same ϵ, and at the MRU position with low probability ϵ (typically
access sequence. This property allows analytical reasoning ϵ = 1/32).
about LRU behavior across different cache capacities. The policy divides cache sets into three categories:
• LRU Leader Sets (LLRU ): A small number of sets that
B. Classical Replacement Policies always use LRU insertion mode.
• BIP Leader Sets (LBIP ): A small number of sets that
Least Recently Used (LRU): Maintains temporal ordering always use BIP insertion mode.
and evicts the least recently accessed line. Performs well • Follower Sets: All remaining sets that adopt the insertion
under strong temporal locality but suffers under streaming and mode of the currently winning leader.
scanning patterns.
First-In-First-Out (FIFO): Evicts the oldest line regard- C. Set Dueling and Policy Selection
less of access recency. Simpler to implement than LRU but A global saturating counter called the Policy Selector
generally inferior performance. (PSEL) tracks the relative performance of the two insertion
Random: Evicts a randomly selected line. Provides base- modes. PSEL is updated only on misses in leader sets:
line performance and avoids worst-case behavior but lacks • Miss in LRU leader set: PSEL is decremented (indicat-
exploitation of access patterns. ing LRU insertion is performing worse).
Algorithm 1 Adaptive-LRU Cache Access TABLE I: Hardware Overhead Breakdown
1: Input: Memory address addr Component Bits Global Per-Bank
2: Output: Hit/Miss indication
PSEL Counter 6–10 1 0–4
3: (set index, tag) ← ExtractComponents(addr) LFSR (shared) 8–16 1 0
4: cache set ← sets[set index] LFSR (per-bank) 8–16 0 0–4
5: // Check for cache hit Leader-ID decode 0 Hardwired
6: way ← FindLine(cache set, tag) Global-only Total 14–26 bits 2–4 bytes
Per-bank Total 14–42 bits 6–12 bytes
7: if way ̸= INVALID then
Bytes assume 8-bit LFSR and 8-bit PSEL per partition;
8: PromoteToMRU(cache set, way) totals exclude any optional per-bank LFSR.
9: return HIT
10: end if
11: // Cache miss - determine insertion mode E. Intuitive Operation
12: if set index ∈ LLRU then The policy’s effectiveness stems from its ability to automat-
13: insertion mode ← LRU INSERT ically detect and adapt to different access patterns:
14: else if set index ∈ LBIP then Streaming/Scan Detection: When the workload exhibits
15: insertion mode ← BIP INSERT streaming behavior or accesses working sets larger than cache
16: else capacity, BIP insertion prevents cache pollution. Most new
17: insertion mode ← (P SEL > threshold)? lines are inserted at the LRU position where they can be
18: LRU INSERT : BIP INSERT quickly evicted, while preserving space for the small fraction
19: end if of lines that may exhibit reuse.
20: // Perform insertion Locality Preservation: When the workload exhibits strong
21: victim way ← GetLRUWay(cache set) temporal locality and the working set fits in the cache, both
22: EvictLine(cache set, victim way) insertion modes perform well (few misses in leader sets).
23: if insertion mode = LRU INSERT then PSEL tends to favor LRU insertion, and the policy behaves
24: InsertAtMRU(cache set, victim way, tag) identically to standard LRU.
25: else Mixed Workload Handling: For workloads with both local
26: if RandomFloat() < ϵ then and streaming components, the policy dynamically balances
27: InsertAtMRU(cache set, victim way, tag) between scan protection and locality preservation based on
28: else the relative performance of the two modes.
29: InsertAtLRU(cache set, victim way, tag)
30: end if IV. H ARDWARE I MPLEMENTATION
31: end if A. Storage Requirements
32: // Update PSEL for leader sets
Adaptive-LRU requires minimal additional storage beyond
33: if set index ∈ LLRU and P SEL > 0 then
a standard LRU implementation. Table I provides a detailed
34: P SEL ← P SEL − 1
breakdown of storage requirements.
35: else if set index ∈ LBIP and P SEL < P SELM AX
For comparison, DRRIP requires 2 additional bits per cache
then
line. For a typical 32KB L1 cache (512 lines with 64-byte line
36: P SEL ← P SEL + 1
size), this represents 128 bytes of overhead—10.0 × to 20.0 ×
37: end if
more than Adaptive-LRU.
38: return MISS
B. Critical Path Analysis
Hit Path: Identical to standard LRU. Cache tag comparison,
• Miss in BIP leader set: PSEL is incremented (indicating hit detection, and MRU promotion follow the same timing as
BIP insertion is performing worse). baseline LRU implementation.
Follower sets consult PSEL to determine their insertion Miss Path: Leader-ID check and PSEL comparison occur
mode: on the miss path between victim selection and fill insertion,
requiring no additional cycle. With tree-PLRU, we only alter
• If PSEL > threshold: Use LRU insertion mode the insertion hint.
• If PSEL ≤ threshold: Use BIP insertion mode
C. Synthesizable Hardware Implementation
The threshold is typically set to the middle of the PSEL
range to provide balanced sensitivity. Listing 1 shows the corrected, synthesizable Verilog imple-
mentation.
1 // Parameters
D. Detailed Algorithm 2 parameter integer SETS = 64;
3 parameter integer WAYS = 16;
Algorithm 1 presents the complete Adaptive-LRU policy. 4 parameter integer LEADER_SETS = 4;
5 parameter integer PSEL_BITS = 8; Prefetcher Integration: Hardware prefetchers can be con-
6 localparam [PSEL_BITS-1:0] PSEL_THRESHOLD = (1 << ( figured to bypass PSEL updates for prefetch-only cache fills,
PSEL_BITS-1)); // mid
7
focusing adaptation on demand access patterns.
8 // Global state
9 reg [PSEL_BITS-1:0] psel; V. E XPERIMENTAL M ETHODOLOGY
10 reg [15:0] lfsr;
11
A. Simulation Infrastructure
12 // RNG update (one advance per miss) We implemented a detailed trace-driven cache simulator in
13 always @(posedge clk) begin
14 if (reset) begin
Python with the following capabilities:
15 lfsr <= 16’hACE1; • Accurate LRU simulation: Maintains precise LRU or-
16 end else if (cache_miss) begin dering for all cache sets
17 lfsr <= {lfsr[14:0], lfsr[15] ˆ lfsr[13] ˆ lfsr
[12] ˆ lfsr[10]}; • Configurable cache geometries: Supports arbitrary set
18 end count, associativity, and line size
19 end • Multiple replacement policies: LRU, FIFO, Random,
20
21 wire coin = (lfsr[4:0] == 5’b0); // ˜1/32 RRIP family, and Adaptive-LRU
22 • Detailed statistics collection: Per-workload hit rates,
23 // Leader identification (spread across index space) PSEL evolution, insertion mode statistics
24 function automatic is_lru_leader(input [5:0] set_idx
); • Parameter sensitivity analysis: Configurable PSEL bits,
25 is_lru_leader = ((set_idx % (SETS / (LEADER_SETS leader set counts, and BIP probability
*8))) == 0) && (set_idx < SETS);
26 endfunction B. Warmup and Measurement
27 function automatic is_bip_leader(input [5:0] set_idx
); Each experiment uses a two-phase methodology to ensure
28 is_bip_leader = (((set_idx + (SETS/LEADER_SETS)) % statistical validity. We warm up the cache using either the first
(SETS / (LEADER_SETS*8))) == 1) && (set_idx <
SETS);
pass over the working set or the first 1,000 accesses (whichever
29 endfunction is larger) to eliminate compulsory miss effects. Statistics are
30 reported over the remaining accesses to capture steady-state
31 wire is_lru_lead = is_lru_leader(set_idx);
32 wire is_bip_lead = is_bip_leader(set_idx);
behavior. For policies with randomized behavior (Random,
33 wire use_lru_ins = is_lru_lead ? 1’b1 : DRRIP, Adaptive-LRU), we report mean ± standard deviation
34 is_bip_lead ? 1’b0 : across five independent seeds and include 95% confidence
35 (psel > PSEL_THRESHOLD);
36 wire insert_at_mru = use_lru_ins ? 1’b1 : coin;
intervals in all plots. Unless otherwise stated, hit and miss
37 latencies are 4 and 120 cycles, respectively, for Average
38 // PSEL update (saturating) Memory Access Time (AMAT) estimates.
39 always @(posedge clk) begin
40 if (reset) begin
Unless stated, PSEL updates are driven by demand misses
41 psel <= PSEL_THRESHOLD; only; prefetch-only fills do not update PSEL.
42 end else if (cache_miss) begin
43 if (is_lru_lead && psel != {PSEL_BITS{1’b0}}) C. Cache Configuration
psel <= psel - 1’b1;
44 if (is_bip_lead && psel != {PSEL_BITS{1’b1}}) Our baseline cache configuration models a typical L1 data
psel <= psel + 1’b1; cache:
45 end
• Capacity: 1024 total cache lines (64KB with 64-byte
46 end
lines)
Listing 1: Synthesizable Verilog Implementation • Associativity: 16-way set associative (64 sets)
• Line size: 64 bytes
• Leader sets: 4 sets per insertion mode (8 total leader
D. Integration Considerations sets)
• PSEL configuration: 8-bit counter with threshold at 128
Compatibility with Pseudo-LRU: Many implementations • BIP probability: ϵ = 1/32 (3.1 %)
use tree-based pseudo-LRU to reduce the overhead of main-
taining exact LRU ordering. Adaptive-LRU is fully compatible D. Workload Design
with pseudo-LRU implementations—only the insertion posi- We designed synthetic workloads that capture key behav-
tion logic needs modification. ioral patterns observed in real applications. Each workload
Multi-level Cache Hierarchies: The policy can be applied targets specific cache behavior to enable clear analysis of
independently at each cache level. L1 caches may benefit most policy effectiveness:
from scan protection due to their proximity to streaming access Streaming 2×C (4 passes): Sequential access to 2048
patterns. unique memory locations (twice the cache capacity), repeated
Multicore Considerations: In shared last-level caches, per- 4 times after warmup. This pattern stresses scan protection
core leader sets and PSEL counters can provide fairness and mechanisms and typically results in zero hit rate for LRU and
isolation between competing workloads. FIFO.
Large Working Set Loop (1.25×C): Cyclic access to 1280 TABLE II: Hit Rates Across Workloads and Policies (Mean
unique locations (25% larger than cache capacity), repeated ± Std Dev)
10 times after warmup. This represents the challenging case Workload LRU FIFO Random DRRIP Adaptive-LRU
where the working set marginally exceeds cache capacity,
15.0 4.3 32.6
causing thrashing in traditional LRU. Streaming 2×C 0.0 0.0
±0.8 ±0.2 ±0.5
Small Working Set Loop (0.5×C): Cyclic access to Large Loop 1.25×C 0.0 0.0
56.2 12.1 63.2
512 unique locations (half the cache capacity), repeated 20 ±1.2 ±0.4 ±0.7
99.8
times after warmup. This workload exhibits perfect temporal Small Loop 0.5×C 100.0∗ 100.0∗ 100.0∗ 100.0∗
±0.1
locality and should achieve high hit rates with any reasonable Mixed 80/20
80.7 70.8 71.1 82.1 82.0
replacement policy after warmup. ±0.3 ±0.4 ±0.9 ±0.2 ±0.3
85.2 78.3 76.9 86.1 86.8
Mixed Hot/Cold (80/20): 10,000 accesses after warmup Zipf α=1.0
±0.2 ±0.5 ±1.1 ±0.3 ±0.2
where 80% of accesses target a hot region of 512 locations Zipf α=1.1
87.8 81.5 79.2 88.9 89.4
(0.5×C) and 20% target a cold region of 4096 locations ±0.2 ±0.4 ±0.8 ±0.2 ±0.2
∗ Deterministic policies after warmup
(4×C). This models applications with both local and streaming
components.
Zipf Distribution: Access pattern following Zipf distribu-
tion with skew parameters α ∈ {1.0, 1.1}, modeling real-world streaming workloads, representing a 7.6 × improvement over
locality patterns observed in web caches and database buffers. state-of-the-art DRRIP.
For extended analysis, we also employ: Near-Capacity Loop Handling: On loops marginally
Capacity Sweep Workloads: Long streaming patterns larger than cache capacity, Adaptive-LRU achieves
(4096 unique locations, 4 passes) for generating miss-ratio 63.2 percentage points improvement over LRU/FIFO and
curves across different cache capacities. The MRC excludes 5.2 × improvement over DRRIP on marginally oversubscribed
first-pass compulsory misses and reports steady-state behavior. working sets.
Associativity Sensitivity: Near-capacity loop workloads at Locality Preservation: When working sets fit in cache
fixed total capacity (1024 lines, 64-byte line size) but varying (Small Loop 0.5×C), Adaptive-LRU matches LRU perfor-
associativity to understand the impact of cache organization. mance exactly (100% hit rate after warmup), demonstrating
Multi-program Workloads: Interleaved execution of two that the adaptive mechanism does not harm temporal locality
workloads (70% hot/cold + 30% streaming) to model multi- exploitation.
programmed environments. Mixed Workload Performance: On workloads with both
local and streaming components, Adaptive-LRU slightly out-
E. Comparison Policies performs LRU and matches DRRIP performance, showing
We compare Adaptive-LRU against the following replace- effective balance between scan protection and locality preser-
ment policies: vation.
Classical Policies: Zipf Distribution Results: On realistic access patterns with
• LRU: Standard least recently used replacement
Zipfian locality, Adaptive-LRU consistently outperforms all
• FIFO: First-in-first-out replacement
baseline policies by 0.7–1.6 percentage points, demonstrating
• Random: Random replacement
robustness across different locality characteristics.
Modern Adaptive Policies:
B. Average Memory Access Time Analysis
• SRRIP: Static Re-Reference Interval Prediction
• BRRIP: Bimodal Re-Reference Interval Prediction Figure 1 shows AMAT comparison using 4-cycle hit latency
• DRRIP: Dynamic Re-Reference Interval Prediction (set and 120-cycle miss penalty, demonstrating the practical per-
dueling between SRRIP and BRRIP) formance impact.
RRIP policies use 2-bit Re-Reference Prediction Values The AMAT analysis reveals significant practical benefits:
(RRPV) per cache line, with SRRIP inserting at distant re- Adaptive-LRU reduces AMAT by 34% on streaming work-
reference (RRPV=3), BRRIP inserting at distant re-reference loads and 56% on near-capacity loops compared to LRU, while
with occasional near-immediate (RRPV=0), and DRRIP using maintaining identical performance on cache-friendly patterns.
set dueling similar to our approach.
C. Miss-Ratio Curve Analysis
VI. E XPERIMENTAL R ESULTS
Figure 2 shows miss-ratio curves for streaming workloads
A. Baseline Policy Comparison across cache capacities. The MRC excludes first-pass compul-
Table II presents hit rates with statistical confidence inter- sory misses and reports steady-state behavior.
vals for all policies across our core workload suite. The MRC analysis demonstrates Adaptive-LRU’s consistent
The results demonstrate several key findings: advantage across cache sizes, with benefits most pronounced
Superior Scan Protection: Adaptive-LRU achieves when cache capacity approaches working set size—exactly the
32.6 percentage points improvement over LRU/FIFO on scenario where scan protection is most beneficial.
TABLE III: Hit Rates vs Associativity (Total capacity fixed at
LRU 1024 lines, 64B lines, near-capacity loop, Mean ± Std Dev)
DRRIP
100 Ways LRU Random DRRIP Adaptive-LRU
AMAT (cycles)
Adaptive-LRU
1 0.0 12.3±0.5 2.1±0.1 18.5±0.4
2 0.0 28.7±0.8 8.3±0.2 38.2±0.6
4 0.0 42.1±1.0 9.7±0.3 51.8±0.7
50 8 0.0 51.3±1.1 11.2±0.3 59.7±0.6
16 0.0 56.2±1.2 12.1±0.4 63.2±0.7
32 0.0 58.9±1.1 12.8±0.3 64.8±0.6
0 34
×C 5×
C
80/
20
g2 1.2
min p
Mix
ed
S tr e a Loo
ge
Lar
Hit Rate (%)
32
Workload
Fig. 1: Average Memory Access Time comparison (4-cycle
hit, 120-cycle miss; 64 sets, 16-way, 64B lines). Adaptive- 30
LRU provides substantial AMAT reduction on problematic 64 sets
workloads. 256 sets
28
0 5 10 15
80 Leaders per Mode
LRU
DRRIP Fig. 3: Leader set sensitivity for streaming workload across
60 Adaptive-LRU cache sizes (16-way, 64B lines). 4–8 leaders per mode provide
optimal performance.
Hit Rate (%)
40
E. Parameter Sensitivity Analysis
Figure 3 shows hit rate sensitivity to leader set count for
20 different cache sizes, confirming that 4–8 leaders per mode
suffice across cache configurations.
Table IV provides comprehensive parameter sensitivity anal-
0
102 103 ysis.
Cache Capacity (lines) The analysis shows robust performance across reasonable
parameter ranges:
Fig. 2: Miss-ratio curves for streaming workload (steady-state, PSEL Width: 6-10 bits provide stable performance with
excluding compulsory misses; 64B lines). Error bars show 8 bits representing optimal balance. Leader Set Count: 2-
95% confidence intervals across 5 seeds. 8 leader sets per mode are sufficient, with 4 sets providing
optimal results. Robustness: Performance variation is modest
(<3%) across reasonable parameter ranges.
D. Associativity Sensitivity F. Multi-Program Workload Analysis
Table V shows results for a multi-program workload con-
Table III examines performance across different associativ- sisting of 70% hot/cold accesses interleaved with 30% stream-
ities while maintaining constant total capacity. ing accesses, representing realistic multiprogrammed environ-
Key observations from associativity analysis: ments.
Benefit from Modest Associativity: Adaptive-LRU shows G. Prefetcher Interaction Study
substantial improvement even at low associativities (4-8 ways),
Table VI analyzes Adaptive-LRU behavior with a simple
making it practical for area-constrained designs.
next-line prefetcher, showing that PSEL updates gated on
Consistent Advantage: The performance gap versus other demand-only accesses maintain effective adaptation.
policies is maintained across all associativity levels, demon- The prefetcher study demonstrates that gating PSEL updates
strating robustness of the approach. to demand-only accesses preserves adaptation effectiveness
Diminishing Returns: Performance improvement saturates while preventing prefetch pollution from affecting policy de-
at higher associativities, consistent with typical cache behavior. cisions.
TABLE IV: Parameter Sensitivity Analysis (Streaming Work- TABLE VI: Prefetcher Interaction (Next-Line Prefetcher)
load Hit Rate %)
Configuration Streaming Hit Rate (%) Mixed Hit Rate (%)
Leaders per Mode No Prefetcher 32.6±0.5 82.0±0.3
PSEL Bits
1 2 4 8 16 Prefetcher + All Updates 28.4±0.7 79.1±0.4
Prefetcher + Demand Only 31.8±0.6 81.7±0.3
4 28.1±0.9 29.8±0.7 30.2±0.6 29.9±0.7 28.7±0.8
6 30.2±0.8 31.5±0.6 32.1±0.5 31.8±0.6 30.9±0.7
8 30.8±0.7 32.0±0.6 32.6±0.5 32.4±0.5 31.6±0.6
10 30.9±0.7 31.9±0.6 32.5±0.5 32.3±0.5 31.8±0.6 100 SRRIP
12 30.7±0.7 31.8±0.6 32.4±0.5 32.2±0.5 31.7±0.6 BRRIP
DRRIP
TABLE V: Multi-Program Workload Performance (70% Hot/-
Hit Rate (%)
Adaptive-LRU
Cold + 30% Streaming)
Policy Hit Rate (%) AMAT (cycles) Improvement 50
LRU 65.3±0.4 44.2 Baseline
FIFO 58.9±0.5 51.7 −6.4 pp
Random 62.1±0.8 47.9 −3.2 pp
DRRIP 68.7±0.3 40.5 +3.4 pp
Adaptive-LRU 71.2±0.4 38.2 +5.9 pp
0
ing oo p oop ed
am eL ll L Mix
Stre Lar
g
Sm
a
H. RRIP Family Detailed Comparison
Figure 4 provides comprehensive comparison with the RRIP Workload
family of policies across all workloads. Fig. 4: Detailed comparison with RRIP family policies (64
The RRIP comparison reveals: sets, 16-way, 64B lines). Adaptive-LRU provides superior
Dramatic Improvement on Scans: Adaptive-LRU achieves performance on scan-intensive workloads while matching per-
7.6 × better performance than DRRIP on streaming workloads formance on cache-friendly patterns.
and 5.2 × better on near-capacity loops.
Competitive Performance on Friendly Workloads: Per-
formance is identical on cache-friendly workloads, demon- to favor LRU insertion and maintaining optimal behavior for
strating that adaptation does not harm well-behaved access temporal locality exploitation.
patterns.
C. Comparison with Sophisticated Predictors
Superior Cost-Performance Ratio: Adaptive-LRU
achieves better performance with significantly lower hardware While policies like Hawkeye achieve excellent performance
cost (0 vs 2 bits per cache line). through sophisticated prediction, they require substantial hard-
ware resources:
VII. A NALYSIS AND D ISCUSSION Hawkeye Requirements:
A. Stack Algorithm Properties • PC-based predictor table (˜1KB)
• OPT generator for training
LRU is a stack algorithm that maintains the inclusion
• Complex decision logic
property across cache capacities. Adaptive-LRU breaks stack
• Per-PC metadata tracking
monotonicity (like DIP) since BIP insertion can cause different
behavior at different cache sizes for the same access sequence. Adaptive-LRU Advantages:
However, this property sacrifice is well-compensated by the • Zero per-line metadata
significant performance improvements on problematic access • Minimal global state (6.0 to12.0 bytes)
patterns, and the policy maintains monotonicity within each • Simple decision logic
insertion mode. • No PC tracking required
The cost-performance analysis strongly favors Adaptive-
B. Theoretical Foundation LRU for practical deployment, especially in area and power-
Adaptive-LRU’s effectiveness can be understood through constrained environments where the 10.0 × to 100.0 × lower
reuse distance theory. For streaming workloads, most cache hardware overhead is critical.
lines have infinite reuse distance (never reused). BIP insertion
places these lines at the LRU position where they are quickly D. Dynamic Behavior Analysis
evicted, effectively removing them from the reuse distance The policy adapts quickly to workload changes, typically
distribution and preserving cache space for lines with shorter converging within a few hundred accesses. PSEL evolution
reuse distances. tracking shows:
When reuse distances are short (strong temporal locality), • Streaming detection: PSEL rapidly decreases, favoring
both leader modes experience low miss rates, causing PSEL BIP insertion
• Locality detection: PSEL increases, favoring LRU inser- leader set count, and BIP probability based on observed
tion workload characteristics.
• Mixed workload balance: PSEL oscillates around Security Considerations: Cache replacement policies can
threshold, providing balanced behavior create side channels for information leakage. Future work
should investigate the security implications of adaptive policies
VIII. A DDITIONAL R ELATED W ORK and develop defenses against cache-based attacks.
Beyond replacement policies discussed in Section II, several Non-Volatile Memory Integration: As storage class mem-
complementary research directions are relevant to our work: ories become more prevalent, cache replacement policies must
Compiler-Guided Cache Management: Techniques like consider the different access characteristics and endurance
cache coloring and data layout optimization can complement properties of these technologies.
replacement policy improvements [9]. X. C ONCLUSION
Machine Learning for Cache Management: Recent work
This paper presents Adaptive-LRU, a lightweight cache
has explored neural networks and reinforcement learning for
replacement policy that addresses the fundamental limitations
cache replacement decisions [10], though these approaches
of LRU under scan-intensive workloads while preserving its
require substantial computational overhead.
effectiveness under temporal locality. Through comprehensive
Cache Partitioning: Techniques for dividing cache capacity
experimental evaluation with rigorous statistical methodology,
between competing applications or threads to improve fairness
we demonstrate that simple insertion-focused adaptation can
and performance [11].
achieve substantial performance improvements at minimal
Prefetch-Aware Replacement: Policies that consider the
hardware cost.
accuracy and utility of hardware prefetching when making
Our key findings include:
replacement decisions [12].
Our work complements these approaches by providing a 1) Significant scan protection: 32.6 percentage points hit
lightweight foundation that can be combined with other opti- rate improvement over LRU/FIFO on streaming work-
mization techniques. loads, representing practical performance gains in real
applications.
IX. L IMITATIONS AND F UTURE W ORK 2) Superior performance vs state-of-the-art: 7.6 × better
than DRRIP on streaming and 5.2 × better on near-
A. Current Limitations
capacity loops while achieving comparable results on
Evaluation Scope: Our evaluation uses synthetic workloads cache-friendly patterns.
that model real application behaviors but may not capture 3) Minimal hardware overhead: 6.0 to12.0 bytes total
all nuances of actual programs. Future work should include per cache partition with zero per-line metadata, com-
comprehensive evaluation with SPEC CPU, PARSEC, and pared to 128+ bytes for DRRIP in typical caches.
other standard benchmark suites. 4) Robust design: Stable performance across cache config-
Single-Core Focus: While we analyze multi-program work- urations, associativities, and parameter variations, with
loads, detailed multicore cache sharing effects require further graceful degradation under suboptimal configurations.
investigation, particularly for shared last-level caches. 5) Practical deployment: Compatible with existing cache
Phase Detection Granularity: The set-level dueling mech- implementations and ready for immediate integration
anism may be suboptimal for applications with fine-grained with standard LRU controllers.
phase changes or complex spatial locality patterns. Adaptive-LRU demonstrates that the decades-old principle
Working Set Size Assumptions: The policy is most effec- of set dueling, when applied judiciously to insertion decisions,
tive when working sets are close to cache capacity. Applica- remains a powerful technique for adaptive cache management.
tions with either very small or very large working sets may The policy’s combination of effectiveness, simplicity, and low
see limited benefit. cost makes it an attractive option for deployment across the
processor cache hierarchy, from L1 caches where scan pro-
B. Future Research Directions tection is most critical to shared LLC where fairness between
Hybrid Approaches: Combining Adaptive-LRU with competing workloads is important.
lightweight PC-based prediction could provide the best of both As memory systems continue to evolve with new tech-
worlds—low overhead with enhanced accuracy for complex nologies and usage patterns, the fundamental challenge of
access patterns. balancing simplicity with adaptivity remains central to cache
Machine Learning Integration: Modern processors have design. Our work provides a clear demonstration that sig-
sufficient computational resources to support lightweight ma- nificant performance improvements are achievable without
chine learning for cache management. Neural network-based sacrificing the engineering principles that have made LRU
adaptation could improve upon set dueling for complex work- successful in practice.
loads. The success of Adaptive-LRU also points toward future
Dynamic Parameter Adjustment: Rather than using fixed opportunities for lightweight adaptation in other aspects of
parameters, the policy could dynamically adjust PSEL width, cache design, including prefetching, coherence, and power
management. By maintaining focus on practical deployment
constraints while achieving meaningful performance improve-
ments, we can continue to advance memory system design in
directions that benefit real applications and users.
ACKNOWLEDGMENTS
The author thanks the anonymous reviewers for their de-
tailed feedback and suggestions that significantly improved the
quality of this work. Special thanks to the faculty and students
at IIT Mandi for discussions and insights that shaped this
research. We also acknowledge the contributions of the open-
source community for simulation tools and benchmarking
methodologies that enabled this work.
R EFERENCES
[1] J. L. Hennessy and D. A. Patterson, “Computer architecture: a quanti-
tative approach,” 6th ed. Morgan Kaufmann, 2019.
[2] P. J. Denning, “The locality principle,” Communications of the ACM,
vol. 48, no. 7, pp. 19–24, July 2005.
[3] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely Jr., and J. Emer,
“Adaptive insertion policies for high performance caching,” in Proc. 34th
Annual International Symposium on Computer Architecture (ISCA), San
Diego, CA, USA, June 2007, pp. 381–392.
[4] A. Jaleel, K. B. Theobald, S. C. Steely Jr., and J. Emer, “High
performance cache replacement using re-reference interval prediction
(RRIP),” in Proc. 37th Annual International Symposium on Computer
Architecture (ISCA), Saint-Malo, France, June 2010, pp. 60–71.
[5] A. Jain and C. Lin, “Back to the future: Leveraging Belady’s algorithm
for improved cache replacement,” in Proc. 43rd ACM/IEEE Annual In-
ternational Symposium on Computer Architecture (ISCA), Seoul, South
Korea, June 2016, pp. 78–89.
[6] L. A. Belady, “A study of replacement algorithms for a virtual-storage
computer,” IBM Systems Journal, vol. 5, no. 2, pp. 78–101, 1966.
[7] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, “Evaluation
techniques for storage hierarchies,” IBM Systems Journal, vol. 9, no. 2,
pp. 78–117, 1970.
[8] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely Jr.,
and J. Emer, “SHiP: Signature-based hit predictor for high performance
caching,” in Proc. 44th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), Porto Alegre, Brazil, Dec. 2011, pp. 430–
441.
[9] T. M. Chilimbi, M. D. Hill, and J. R. Larus, “Cache-conscious struc-
ture definition,” in Proc. ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI), Atlanta, GA, USA, May
1999, pp. 13–24.
[10] Z. Shi, X. Huang, A. Jain, and C. Lin, “Applying deep learning to
the cache replacement problem,” in Proc. 52nd Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO), Columbus,
OH, USA, Oct. 2019, pp. 413–425.
[11] M. K. Qureshi and Y. N. Patt, “Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared
caches,” in Proc. 39th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), Orlando, FL, USA, Dec. 2006, pp. 423–
432.
[12] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Coordinated control
of multiple prefetchers in multi-core systems,” in Proc. 42nd Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO),
New York, NY, USA, Dec. 2009, pp. 316–326.