Exploring DRAM Cache Prefetching for Pooled Memory

Chandrahas Tirumalasetty and Narasimha Reddy Annapareddy
Dept. of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA
[email protected], [email protected]

arXiv:2406.14778v1 [cs.AR] 20 Jun 2024

Abstract—Hardware-based memory pooling enabled by interconnect standards like CXL has been gaining popularity amongst cloud providers and system integrators. While pooling memory resources has cost benefits, it comes at a penalty of increased memory access latency. With yet another addition to the memory hierarchy, local DRAM can be used as a block cache (DRAM cache) for fabric attached memory (FAM), and data prefetching techniques can be used to hide the FAM access latency. This paper proposes a system for prefetching sub-page blocks from FAM into a DRAM cache to improve data access latency and application performance. We further optimize our DRAM cache prefetch mechanism through enhancements that mitigate the performance degradation due to bandwidth contention at FAM. We consider the potential for providing additional functionality at the CXL-memory node through weighted fair queuing of demand and prefetch requests. We compare such a memory-node level approach to adapting the prefetch rate at the compute node based on observed latencies. We evaluate the proposed system in single-node and multi-node configurations with applications from the SPEC, PARSEC, Splash and GAP benchmark suites. Our evaluation suggests that DRAM cache prefetching results in a 7% IPC improvement, and both of the proposed optimizations can further increase IPC by 7-10%.

I. INTRODUCTION

Modern workloads are evolving rapidly. Heavy use of ML techniques, from the data center to client/mobile, is placing new, more stringent demands on system design across all platforms. Many of these ML techniques, from large language models [19], [34], [61], [62] to video processing [29], [32], [63] and others, rely on large amounts of data for training and sometimes for retrieval. In the data center, where a given workload may reference terabytes of data spread across many nodes, great demands on computer memory systems are becoming commonplace [22], [27].

In recent years, the performance gap between DRAM and disk has grown so large that system designers eschew using storage to extend DRAM entirely, in favor of over-provisioning DRAM [9], [23], [43], [49], [54], [66]. Many different techniques have been proposed to reduce the cost of data movement and page fault handling penalties. These range from employing a larger DRAM memory and running applications entirely in memory [49], [55], to prefetching data blocks [21], [30], [35], to employing remote memories [6], [26], [46].

Datacenter servers host applications with diverse memory requirements. Larger provisioned DRAM capacities can potentially ameliorate the performance needs of some applications while not being touched by the rest. Hence, memory under-utilization is rampant in today's datacenters. Trace analysis from production clusters at Google and Alibaba revealed that 45%-60% of the memory allocated to jobs is not utilized [57]. Untouched memory for virtual machine instances (VMs) in Azure servers averages about 25%, even while the full compute capacity is used [42].

Furthermore, the $/GB price of DDR memory has plateaued for the last few generations [4]. As a result, the cost to provision larger memory capacities in today's servers is steeply increasing with every generation. Memory contributes 37%-50% of the total cost of ownership (TCO) of a server fleet [3], [47]. Memory underutilization therefore incurs substantial costs at the scale of today's datacenters.

Memory disaggregation enables applications to obtain memory from a central resource on an ad-hoc basis, freeing memory from being tied up statically at the node level. Modern data center servers have been exploring the potential of disaggregated memories to provide a less expensive means of furnishing memory [33], [40], [57]. The disaggregated approaches have taken two parallel paths: (1) RDMA-based approaches that employ memory at another node as remote/far memory, accessed through operating system (OS) based paging mechanisms [6], [26], [46]; (2) approaches that employ CXL to provide a shared common pool of memory across multiple nodes [1], [5], [17], [67]. We refer to such a memory organization as Fabric Attached Memory (FAM). With either approach, it is expected that memory is used more efficiently across different workloads with divergent memory needs.

These architectural paths result in additional layers in the cache-memory hierarchy, with remote or shared common DRAM across an interconnect being the new layer beyond local DRAM. As new memory layers, including disaggregated memories, far memories and non-volatile memories, close the gap in speed between different layers of memory, it has become necessary to pursue lower-latency approaches for accessing data from these new layers [6], [41]. This paper pursues the approach of prefetching data between DRAM and the lower layers of memory, such as disaggregated memory (over a CXL-like interconnect), to this end.
This paper explores the potential of utilizing LLC misses that are visible at the root-complex level to build a prefetching mechanism between FAM and DRAM, utilizing a portion of local DRAM as a hardware-managed cache for FAM. We call this proposed cache the DRAM cache. We employ SPP [36] as an example prefetcher to demonstrate the performance gains with DRAM cache prefetching, but other prefetchers can be employed as well. Our DRAM cache prefetcher maintains the metadata for the cached FAM data. The root complex, equipped with a prefetcher, redirects requests for cached data to the DRAM cache. On a hit, the cached data sees DRAM latencies instead of FAM latencies. Unlike previous mechanisms that considered page-level transfers between DRAM and lower-layer memory [6], [41], [46], [64], we consider the potential for sub-page level prefetches at the hardware level.

Since multiple nodes can pool memory from FAM, it is imperative that the FAM bandwidth is utilized and shared across multiple nodes effectively. As previous work has shown, prefetch throttling [28], [60] is an effective mechanism to utilize the memory bandwidth well. We take inspiration from this earlier work and incorporate ideas for prefetch throttling to effectively manage the FAM bandwidth across demand and prefetch streams across multiple nodes. We take inspiration from network congestion algorithms [24] to develop bandwidth adaptation techniques at the source (compute node). Since CXL-connected memory devices can be enhanced with extra functionality, we evaluate the potential of employing Weighted Fair Queueing (WFQ) at the memory node and compare that approach with prefetch throttling at the source.

This paper makes the following significant contributions:
• Proposes a system architecture for caching and prefetching FAM data at local DRAM, with the cache managed at the granularity of sub-page blocks.
• Proposes an adaptive prefetching mechanism that throttles DRAM cache prefetches in response to congestion at FAM.
• Proposes a WFQ-enabled CXL-memory node and compares its performance against prefetch throttling at the source.
• Evaluates the proposed prefetch mechanism to demonstrate its efficacy in single-node and multi-node configurations.

II. BACKGROUND

A. CXL enabled memory pooling

Compute Express Link (CXL) [2] is a cache-coherent interconnect standard for processors to communicate with devices like accelerators and memory expanders. CXL builds upon the physical layer of PCIe and is electrically compatible with it. CXL offers 3 kinds of protocols: CXL.cache, CXL.mem, and CXL.io. Any device that connects to the host processor using CXL can use any or all of the aforementioned protocols. CXL identifies 3 types of devices that use one or more of these protocols. A Type-1 device, like a Network Interface Controller (NIC), has a cache hierarchy but no local memory, and uses CXL.cache. A Type-2 device, like a GPU or FPGA, comprises both caches and local memory, and uses CXL.cache and CXL.mem. A Type-3 device, like a memory expander, does not have a local cache hierarchy and uses CXL.mem.

Our discussion in this paper is based on systems that leverage the CXL.mem protocol for memory pooling. Fig. 1 shows compute nodes pooling memory resources from a shared memory node. We call the memory attached to the processor using CXL Fabric Attached Memory (FAM).

Fig. 1: CXL.mem enabled memory pooling.

Fig. 2 details our system architecture with CXL and FAM components. The CXL root complex comprises an agent that implements the CXL.mem protocol. The agent acts on behalf of the host (CPU), handling all communication and data transfers with the CXL end point. In our system, the CXL end point comprises the FAM device and the FAM Controller. The FAM Controller directly interfaces with the agent, translating CXL.mem commands into requests that can be understood by the FAM device (e.g., DDR commands).

Fig. 2: CXL & Fabric Attached Memory (FAM) architecture.

As illustrated, load misses and writebacks from the LLC are handled either by the local memory controller or the CXL root complex, based on the physical address. The address decoding is implemented in Host-managed Device Memory (HDM) decoders. During the device enumeration phase, HDM decoders are programmed for every CXL.mem device and their contribution to the flat address space of the processor.
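For concreteness, the sketch below shows one plausible way an HDM-style decoder could steer a physical address to either the local memory controller or a CXL.mem end point. The address ranges, the single-FAM-device setup and the helper names are illustrative assumptions for this paper's architecture, not the CXL specification's register layout.

# Minimal sketch (assumed ranges, not the actual HDM register format) of how a
# root complex could route an LLC miss or writeback using HDM-style decoders.
from dataclasses import dataclass

@dataclass
class HDMDecoder:
    base: int          # start of the range this CXL.mem device contributes
    size: int          # size of the range in bytes
    target: str        # identifier of the CXL.mem end point

    def matches(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size

LOCAL_DRAM_LIMIT = 16 << 30                      # assume the first 16 GB is local DRAM
decoders = [HDMDecoder(base=16 << 30, size=64 << 30, target="FAM-node-0")]

def route(addr: int) -> str:
    """Return which controller should handle a physical address."""
    if addr < LOCAL_DRAM_LIMIT:
        return "local-memory-controller"
    for d in decoders:                           # programmed during device enumeration
        if d.matches(addr):
            return d.target
    raise ValueError("address not mapped by any decoder")

print(route(0x1000))          # -> local-memory-controller
print(route(20 << 30))        # -> FAM-node-0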
B. Memory Prefetching & Signature Path Prefetcher (SPP)

Data prefetching techniques that hide memory access latency across different levels of the memory hierarchy are well studied in the literature. Prefetchers typically use a learning-based approach to predict future memory access addresses [8], [35].
The most common features used include address delta correlation, the program counter (PC) causing cache misses, and access history. Recent work has applied sophisticated mechanisms like neural networks [58] and reinforcement learning [11], [58] to prefetching.

1) Signature Path Prefetcher (SPP): In this work, we use SPP [35] as the base architecture for our DRAM cache prefetcher. SPP uses signatures to keep track of the memory access patterns of the application. Signatures are a compact representation of the history of memory access deltas of the program. Architecturally, SPP comprises 2 tables: a Signature Table and a Pattern Table. Fig. 3 shows the organization of SPP with these tables.

Fig. 3: Signature and Pattern tables of SPP.

On a cache miss, the physical page address of the miss is used to index into the signature table. The output from the signature table gives the last cache miss address (within the same physical page) and the current signature. With this state, we can calculate the delta and the updated signature per the formulas shown below.

delta = (Miss_Address_current − Miss_Address_previous)
signature = (signature << 4) ⊕ delta

The generated signature is then used to index into the pattern table. The pattern table maps a signature to the address deltas of future memory accesses. Each pattern table entry has the following fields:
1) Signature - Obtained from the signature table; serves as the index into this table.
2) Signature weight - Counts the number of times the corresponding signature has been accessed since the creation of the entry.
3) (delta, weight) × 4 - The address deltas observed under this signature and their corresponding access counts; 4 ordered pairs.

The obtained address delta can then be combined with the current signature to generate a speculative signature (using the aforementioned formulation). The speculative signature can further be used to index into the pattern table to generate another address delta. This recursive indexing into the pattern table can be continued a desired number of times, or until the pattern table is not able to provide any more deltas.
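To make the signature update and the recursive lookahead concrete, the following sketch implements an SPP-style training and prediction step. The signature width, the delta masking, and the in-page block addressing are illustrative simplifications, not the exact SPP microarchitecture.

# Minimal sketch of SPP-style signature training and pattern-table lookahead
# (assumed table sizes and delta encoding; weights are simple counters).
SIG_BITS = 12                                   # assumed signature width

def update_signature(sig, delta):
    # signature = (signature << 4) XOR delta, truncated to SIG_BITS
    return ((sig << 4) ^ (delta & 0x3F)) & ((1 << SIG_BITS) - 1)

def on_llc_miss(page, block, signature_table, pattern_table, depth=4):
    last_block, sig = signature_table.get(page, (block, 0))
    delta = block - last_block
    if delta != 0:
        # train: record the observed delta under the previous signature
        pattern_table.setdefault(sig, {}).setdefault(delta, 0)
        pattern_table[sig][delta] += 1
        sig = update_signature(sig, delta)
    signature_table[page] = (block, sig)

    # lookahead: recursively walk the pattern table to produce prefetch candidates
    prefetches, spec_sig, spec_block = [], sig, block
    for _ in range(depth):
        deltas = pattern_table.get(spec_sig)
        if not deltas:
            break
        best = max(deltas, key=deltas.get)      # highest-weight delta wins
        spec_block += best
        prefetches.append((page, spec_block))
        spec_sig = update_signature(spec_sig, best)
    return prefetches

sig_table, pat_table = {}, {}
for blk in range(0, 20, 2):                     # a stride-2 miss stream within one page
    preds = on_llc_miss(0xA000, blk, sig_table, pat_table)
print(preds)                                    # once trained, predicts the next stride-2 blocks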
On an access to the prefetcher (a cache miss to a certain page), the generated signature and the block address of the current access are used to update the state of SPP. Fig. 4 shows the state of SPP after an example memory access. Additionally, SPP maintains a global history table that bootstraps the learning of access history when the data access stream moves from one page to another.

Fig. 4: State of SPP after an access at address 0xA003.

We note that our proposals in this paper are not specific to the SPP prefetcher, and ideas from other prefetchers such as [12], [48] can be employed with suitable modifications in our system. Our focus in this work is to demonstrate the usefulness of sub-page level prefetching to hide FAM access latency and to present adaptive optimizations that cater to the shared nature of FAM.

III. SYSTEM ARCHITECTURE

In this section we describe the system architecture components that implement the DRAM prefetching/caching mechanism for FAM-bound requests. Through the rest of the section, the demands and prefetches we refer to are LLC misses and DRAM cache prefetches. DRAM cache prefetches should not be confused with prefetch requests issued by the per-core cache prefetcher (core prefetches). Our system architecture does not distinguish between the types of requests that miss in the LLC; hence, core prefetch requests that miss in the LLC are treated like demand misses and are subsequently used for training the DRAM cache prefetcher as well.

A. Enhanced Root Complex

DRAM caching/prefetching is implemented through enhancements to the root complex. We add a prefetcher and a prefetch queue to facilitate the issue of both prefetch and demand requests to FAM. Fig. 5 outlines the architecture of the enhanced root complex. We explain the significance of each component in detail below.

Fig. 5: System architecture of the root complex with DRAM cache and prefetcher.

1) Prefetcher: As mentioned before, our DRAM cache employs an SPP-based prefetcher. We make design changes to SPP to operate with sub-page blocks instead of 64-byte cache blocks. Our prefetcher trains on the node physical addresses of LLC misses. Based on observed patterns, the prefetcher generates addresses that are aligned with the sub-page block size. Note that, for a memory access to complete, node physical addresses need to be translated to FAM local addresses.
We assume that this translation is handled by the elements lower in the memory hierarchy. The storage overhead required to implement our prefetcher is 11 kB (2× that of SPP [36]).

2) Prefetch Queue: Along with the prefetcher, we add a fixed-length prefetch queue to the root complex. For every read miss in the LLC that is headed to FAM, the prefetcher generates at most a predefined number of prefetch requests; we call this number the prefetch degree. For each such prefetch request to be issued to FAM, it must have a vacant position in the prefetch queue. The prefetch request is held in the queue until the respective response is received. Since the prefetch queue houses the prefetch requests in progress, the queue provides an easy way to check whether a demand request address belongs to any prefetch in progress. In this sense, the prefetch queue's functionality is similar to MSHRs (Miss Status Handling Registers) in processor caches. When the prefetch queue is full, no further prefetches are issued until a prefetch response is received.

We should note that the prefetch queue itself could control the rate of prefetch requests issued to FAM, due to its fixed length. As we will show later, such static approaches work well for a few applications while leading to wasteful prefetching for several others. Our prefetch bandwidth optimizations adapt the prefetch issue rate beyond the fixed-length queue.

Prefetchers that fetch data into the on-chip processor caches share a queue with the demand requests in the MSHRs. In our design, the DRAM cache prefetcher cannot use the LLC MSHRs due to the difference in block size. While it is possible to use multiple entries in the LLC MSHRs, it is not a resource-efficient approach.

Prefetch requests leaving the prefetch queue are tagged. Architectural components in the fabric or at the FAM node can take advantage of this tagging to enforce priority/QoS schemes.

B. DRAM Cache

The DRAM cache is explicitly managed in hardware without intervention of the operating system (OS). The OS, specifically the memory allocator, only plays a role during the initialization phase. The memory allocator should partition the local memory physical address space and expose a contiguous physical address range to be used as the DRAM cache. We assume that such support in the OS already exists.

In this implementation, we manage the DRAM cache as a set-associative cache with an LRU replacement policy. The metadata to implement DRAM cache lookup and replacement is stored outside the DRAM cache, in the prefetcher state (SRAM buffers). The handling of DRAM cache metadata is discussed later in this section.

Fig. 6: DRAM cache metadata format and retrieval. Each entry holds the node physical block address as a tag along with ID, dirty, valid and LRU fields; the node physical address is hashed (e.g., Cuckoo hashing) into the metadata table kept in the prefetcher state.

Since the FAM address space can be large compared to the DRAM cache, we manage DRAM cache metadata by hashing the FAM addresses into a small number of slots. The number of slots can be higher than the number of available FAM blocks to reduce the probability of collision with another address, or it is possible to employ techniques such as Cuckoo hashing [50]. As in CPU caches, tag comparison ensures the correctness of hashing in the event of a collision. The allocated FAM block address is noted in this slot during allocation and on a cache hit. The format of the DRAM cache metadata and its retrieval are shown in Fig. 6. For example, when managed as a fully associative cache, a 16MB cache with a 256B block size would require approximately 450KB (64K entries × 7B) of metadata to cover a 48-bit physical address space, which is less than 5% of the DRAM cache size. Hence, managing DRAM cache metadata in this way is practical.
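As a concrete illustration of the lookup just described, the sketch below probes a metadata table with a hashed FAM block address and confirms the hit with a tag comparison. For brevity it keeps one entry per hashed slot and uses a simple modulo hash; the design described above is set-associative with LRU replacement and may use Cuckoo hashing, so the slot count, entry layout and hash are illustrative assumptions only.

# Minimal sketch of DRAM cache metadata lookup (assumed layout, not the exact
# hardware format): hash the FAM block address into a slot, then confirm the
# hit with a tag comparison, as in CPU caches.
BLOCK_SIZE = 256                      # sub-page block size in bytes
NUM_SLOTS  = 64 * 1024                # 16 MB cache / 256 B blocks

class MetaEntry:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None               # FAM block address stored as the tag
        self.lru = 0                  # recency information for replacement

metadata = [MetaEntry() for _ in range(NUM_SLOTS)]

def slot_of(fam_addr: int) -> int:
    block = fam_addr // BLOCK_SIZE
    return block % NUM_SLOTS          # stand-in for the real hash / Cuckoo hashing

def lookup(fam_addr: int, now: int):
    """Return the DRAM cache slot on a hit, or None on a miss."""
    entry = metadata[slot_of(fam_addr)]
    if entry.valid and entry.tag == fam_addr // BLOCK_SIZE:
        entry.lru = now               # update recency on a hit
        return slot_of(fam_addr)
    return None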
C. Demand & Prefetch Request Handling Mechanism

Fig. 7 explains the flow of demand requests with the DRAM cache and prefetcher. For every outgoing FAM demand request, the prefetcher is consulted to check whether the requested block is present in the DRAM cache. The prefetcher promptly checks the metadata for the requested block. If the demand block is present (DRAM cache hit), a new request with the DRAM cache block address (obtained from the metadata) is sent to the local memory controller. The FAM demand request waits for the response of this new request and returns with the response data. As said earlier, this FAM demand request can be a true demand request from the application or a core prefetch; either request type can be served by the DRAM cache. Subsequently, the corresponding LRU field in the metadata is updated. If the demand block is not present in the DRAM cache (DRAM cache miss), the demand request proceeds as usual to the FAM.

Fig. 7: FAM demand request flow with the prefetcher and DRAM cache.

Irrespective of the DRAM cache hit status, the prefetcher generates prefetch addresses for every outgoing FAM demand request. Before sending out the prefetch requests, the prefetch queue and DRAM cache metadata are checked to see whether the generated prefetch request is redundant. The prefetch continues to the issue stage once the queue and metadata checks clear. In the issue stage, a prefetch request can be dropped if the prefetch queue is full or at a predefined threshold (e.g., 95%). Past the issue stage, once the prefetch request's response is received, the prefetcher checks the metadata to see whether
Fig. 6: DRAM Cache Metadata format and retrieval
there is any vacancy in the DRAM cache. If there is a vacancy, 1.40 20.00
1.20

Geo mean IPC gain


the prefetch block would be directed to appropriate block of 16.00
1.00

FAM access latency


Geo mean relative
0.80 12.00
the DRAM cache directly. If there is no vacancy, prefetcher 0.60 8.00
issues an eviction for the LRU block first in DRAM cache 0.40
4.00
0.20
and then the corresponding position will be replaced by the 0.00 0.00
incoming prefetched block. 64 128 256 512 1024 2048 4096 64 128 256 512 1024 2048 4096
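The sketch below pulls the demand-lookup and prefetch-fill paths just described into one place. It reuses the illustrative metadata structure from the earlier sketch; the eviction hook and return values are assumptions made for readability, not the implementation evaluated in this paper.

# Minimal sketch of the request-handling flow (assumed interfaces): demand
# requests are redirected to local DRAM on a DRAM cache hit, and completed
# prefetches either fill a free slot or evict the current block first.
def handle_fam_demand(fam_addr: int, now: int):
    slot = lookup(fam_addr, now)              # from the metadata sketch above
    if slot is not None:
        return ("local_dram", slot * BLOCK_SIZE)   # served at DRAM latency
    return ("fam", fam_addr)                       # miss: proceed to FAM as usual

def handle_prefetch_response(fam_addr: int, now: int):
    slot = slot_of(fam_addr)
    entry = metadata[slot]
    if entry.valid:
        evict(entry)                          # write back the victim block first
    entry.valid, entry.dirty = True, False
    entry.tag = fam_addr // BLOCK_SIZE
    entry.lru = now                           # the newly filled block is most recent

def evict(entry):
    # placeholder: issue a writeback to FAM if the victim block is dirty
    pass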
We should note that the metadata cost for DRAM cache management increases linearly with the number of blocks. For a given DRAM cache size, the number of blocks decreases as the block size increases. Hence, an advantage of using larger block sizes for the DRAM cache is that the metadata overhead is decreased. On the flip side, using larger block sizes can cause increased latencies at FAM.
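A quick back-of-the-envelope calculation illustrates this trade-off for a fixed 16 MB DRAM cache, using the roughly 7-byte-per-block metadata entry from the example above (the exact entry size varies slightly with the tag width, so these numbers are approximate).

# Metadata footprint vs. DRAM cache block size for a fixed 16 MB cache,
# assuming ~7 bytes of metadata per block as in the sizing example above.
CACHE_SIZE = 16 * 1024 * 1024
ENTRY_BYTES = 7
for block_size in (128, 256, 512, 4096):
    blocks = CACHE_SIZE // block_size
    meta_kb = blocks * ENTRY_BYTES / 1024
    print(f"{block_size:5d} B blocks -> {blocks:6d} entries, ~{meta_kb:7.1f} KB metadata")
# Larger blocks shrink the metadata, but each FAM transfer becomes bigger,
# which is what drives up FAM latency for the largest block sizes.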
D. FAM Controller

The task of the FAM controller is to convert incoming CXL.mem protocol requests into DDR requests that can ultimately be handled by the FAM. In a real system, the FAM Controller could adopt a port-based design, with each port supporting 8× or 16× lanes of PCIe/CXL at the frontend, while supporting multiple channels of DDR4/DDR5 memory at the backend. Network-on-chip (NoC) like architectural structures might be present to route the incoming requests to the appropriate queues that feed into DDR channels [42]. Vertical scaling of such controllers might be essential to support a large number of PCIe lanes/memory channels, depending on the pool size and the desired number of memory channels.

We abstract the functionality of the FAM Controller as components that move incoming requests to DDR channels. All the incoming requests from multiple nodes are filled into the input queue. We assume that the FAM controller is aware of the maximum memory bandwidth across all its supported DDR channels. Hence, the controller scans the input queues at an appropriate rate and issues the requests to the respective DDR channel.

With the addition of prefetching to the compute node's root complex, the FAM controller now receives two classes of requests: demand and prefetch. In the baseline design, we implement a single input queue with FIFO scheduling. Requests (both demand and prefetch) from multiple nodes are dispatched to FAM in the order of their arrival. Later, we explain how the demand and prefetch requests can be given different treatment at the FAM through mechanisms such as Weighted Fair Queuing (WFQ).

E. Sub-page Block Size vs. Latency Trade-off Analysis

We performed an exploratory analysis to understand the block size vs. latency trade-off for sub-page block DRAM cache prefetching. We observe the IPC and the average FAM access latency while varying the DRAM cache block size. Fig. 8 shows our analysis.

Fig. 8: Subfigures A & B represent the geometric mean IPC gain and the relative FAM access latency across different prefetch block sizes (64B-4096B), both with respect to the baseline.

As the DRAM cache block size increases from 64B to 512B, the IPC gain stays mostly constant, with marginal improvements for 128B and 256B block sizes. Beyond 512B, the average IPC gain decreases due to the increase in relative FAM latency. Moreover, moving a FAM page to the DRAM cache on touch (4096B block size) results in about a 17× increase in relative FAM latency, causing a substantial IPC decrement. Based on this analysis, we consider 128, 256, and 512 bytes as block sizes for the DRAM cache. Using a multiple of the CPU cache block size (64B) for the DRAM cache amortizes the delays due to flit packaging/serialization at the fabric, as well as reducing the hardware overhead of metadata management.

IV. PREFETCH OPTIMIZATIONS

Our optimizations to the DRAM cache prefetcher are aimed at mitigating the interference between demand and prefetch requests, thereby enhancing the utility of FAM accesses. To achieve this we propose two approaches: Weighted Fair Queueing at the memory node, and prefetch bandwidth adaptation at the compute node. We describe the design and implementation of both optimizations below.

A. Weighted Fair Queueing (WFQ)

We consider WFQ as a generic means for providing priority to demand requests over prefetches at the FAM. By giving a higher weight to demand requests, we can provide a larger slice of FAM bandwidth to demand requests. Under congestion, WFQ ensures demands are served with priority over prefetches, thereby potentially mitigating the queueing delays that demands suffer due to prefetches.

We enhance the baseline queueing implementation of the FAM controller by replacing the single input queue with two input queues, one each for demands and prefetches. The double-queue implementation enables us to issue prefetches and demands at independent rates. We take advantage of the prefetch request tagging by the prefetcher at the compute node to place incoming requests into their respective queues. Both core prefetches and DRAM cache prefetches are placed in the prefetch queue.

We use a WFQ scheduler to issue requests from the two queues to the FAM. We use the work-conserving deficit weighted round-robin (DWRR) [59] algorithm to select the queue from which the request should be issued. The weight W indicates how much weightage demand requests are given relative to the prefetch requests. Pseudocode of our algorithm is shown in Alg. 1.
Function IssueRequests():
    current_round = (current_round + 1) % (W + 1)
    demand_queue_status = CheckDemandQueue()
    prefetch_queue_status = CheckPrefetchQueue()
    r = prefetch_block_size / demand_block_size
    if current_round != 0 then
        if demand_deficit < max_demand_deficit then
            demand_deficit += quantum
        if demand_queue_status and demand_deficit > 0 then
            IssueDemandRequests()
            demand_deficit = demand_deficit - 1
        else if prefetch_queue_status and prefetch_deficit > r then
            IssuePrefetchRequests()
            prefetch_deficit = prefetch_deficit - r
    else
        if prefetch_deficit < max_prefetch_deficit then
            prefetch_deficit += quantum
        if prefetch_queue_status and prefetch_deficit > r then
            IssuePrefetchRequests()
            prefetch_deficit = prefetch_deficit - r
        else if demand_queue_status and demand_deficit > 0 then
            IssueDemandRequests()
            demand_deficit = demand_deficit - 1
    end
Algorithm 1: Demand/Prefetch Issue Algorithm

For every issue cycle, the IssueRequests() function is called to see whether demand or prefetch requests can be issued. We use the current_round variable to track the round number within a (W+1)-round window, in which prefetches and demands are served in a 1:W ratio. Prefetch requests are preferred in only one round of the window; during the rest of the rounds, we prefer to issue demand requests. Since the scheduler is work-conserving, if the preferred type of request is not available, we try to issue the other type of request.

When it is the demand turn, if the demand deficit has not exceeded the maximum permissible deficit (max_demand_deficit), we increment the demand deficit. If the demand queue is non-empty and the demand deficit is greater than 0, we issue a demand request, which decrements the demand deficit by 1. If either the demand queue is empty or the demand side does not have enough deficit, we try to issue prefetch requests, which is again subject to prefetch queue non-emptiness and the status of the prefetch deficit. The reverse applies to the issue logic when it is the prefetch turn.

To account for the difference in prefetch and demand block sizes, we enforce that the prefetch deficit be at least r, the ratio of the prefetch block size to the demand block size, for a prefetch to be issued to the FAM. Our proposed DRAM prefetcher works in conjunction with the core prefetcher, so WFQ needs to handle both CPU-cache-block core prefetches and sub-page-block DRAM cache prefetches. In our implementation, when it is the prefetch turn, based on the available deficit we issue either a core prefetch or a DRAM cache prefetch. Block size is taken into account when updating the deficit post issue.
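As a small worked example of the deficit accounting under stated assumptions: with 64B demand blocks and 256B DRAM cache prefetch blocks, r = 4, so a sub-page prefetch must accumulate at least 4 units of deficit, and with W = 3 the scheduler prefers demands in three of every four rounds. The sketch below runs a few issue cycles of this loop; the quantum, deficit caps and always-full queues are illustrative choices, not the simulated configuration.

# Minimal sketch (illustrative parameters) of the DWRR issue loop from Alg. 1,
# run for 100 cycles against always-backlogged queues.
from collections import deque

W, QUANTUM = 3, 1
R = 256 // 64                               # prefetch block / demand block = 4
MAX_D_DEFICIT, MAX_P_DEFICIT = 8, 16

demand_q, prefetch_q = deque(range(100)), deque(range(100))
d_deficit = p_deficit = 0
current_round = 0
issued = {"demand": 0, "prefetch": 0}

for _ in range(100):
    current_round = (current_round + 1) % (W + 1)
    if current_round != 0:                  # demand-preferred rounds
        d_deficit = min(d_deficit + QUANTUM, MAX_D_DEFICIT)
        if demand_q and d_deficit > 0:
            demand_q.popleft(); d_deficit -= 1; issued["demand"] += 1
        elif prefetch_q and p_deficit > R:
            prefetch_q.popleft(); p_deficit -= R; issued["prefetch"] += 1
    else:                                   # the single prefetch-preferred round
        p_deficit = min(p_deficit + QUANTUM, MAX_P_DEFICIT)
        if prefetch_q and p_deficit > R:
            prefetch_q.popleft(); p_deficit -= R; issued["prefetch"] += 1
        elif demand_q and d_deficit > 0:
            demand_q.popleft(); d_deficit -= 1; issued["demand"] += 1

print(issued)   # issued request mix after 100 cycles; demands dominate under backlog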
B. Prefetch Bandwidth Adaptation

In the baseline prefetcher, we generate a fixed number of prefetch requests (the prefetch degree) for every LLC miss and issue those requests depending on prefetch queue availability. When the FAM device is saturated due to a high number of demand requests, issuing prefetch requests would increase demand latency and could hurt performance if the demand and prefetch requests are queued in a single queue. Under such conditions, it is better to dial back the prefetch request issue rate and wait until the FAM has enough bandwidth to accommodate prefetch requests. Hence, to incorporate such feedback, we implement prefetch bandwidth adaptation at the source.

We take a sampling-based approach to adapt the prefetch issue rate. To learn about the system state, we add event counters to the root complex's prefetcher state. Each counter stores two values: an instantaneous value and an average value. The instantaneous values of the counters are scanned and reset at the start of each sampling cycle. The average value of an event counter stores the exponential moving average of the respective instantaneous values. The event counters stored in the prefetcher state are described in Table I.

demand_requests_issued: demand requests issued to the FAM
demand_requests_returned: demand requests returned from FAM
demand_requests_total: total demand requests arriving at the prefetcher
prefetch_requests_issued: prefetch requests issued to FAM
minimum_demand_latency: minimum demand read latency in recent history
TABLE I: Description of event counters

Fig. 9: State diagram for the prefetch bandwidth adaptation algorithm. Each sampling cycle, the prefetcher computes prefetches per demand and demands per prefetch, checks whether demand latency is increasing (current demand latency > 1.30 × minimum_demand_latency), and decreases the prefetch rate if so, or increases it otherwise.

An abstract-level logical flowchart of our bandwidth adaptation algorithm is shown in Fig. 9. The algorithm is executed every sampling cycle. Our key idea is to reduce the prefetch issue rate when the FAM is experiencing congestion. Hence, we track demand request latency and decrement the prefetch issue rate whenever the latency starts growing. We compare the measured demand read latency with the minimum achievable demand read latency. The minimum achievable demand read latency is unknown, dynamically changing, and dependent on the fabric topology. We approximate the minimum
achievable demand read latency by the lowest average value observed in the recent past. By tuning how much history is kept, one can tweak the agility of prefetch throttling. If the latency is above 125% of the minimum demand read latency (above the noise level), it is likely because of congestion at FAM, and we proceed to decrease the prefetch issue rate. Otherwise, we increase the prefetch issue rate.

We employ Multiplicative Increase and Multiplicative Decrease (MIMD) [18] for adjusting the prefetch rate. In our implementation, we set the increase factor to 1.125 (12.5% over the previous value). The decrease factor is determined dynamically based on the observed behavior. The decrease factor is a function of prefetcher accuracy, with higher accuracy resulting in slower decreases. We expect this to result in more accurate prefetches being issued when multiple applications are competing for bandwidth at FAM. In addition, we mimic RED [14], [24] at the source and make the decrease factor linearly dependent on the difference between the observed latency and the minimum read latency when the latency is above the threshold of 125% of the minimum latency. The 25% margin is a heuristic we chose for the noise level.
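The sketch below is one way the per-sampling-cycle adaptation could be coded, assuming a normalized prefetch-rate knob in (0, 1], the fixed 1.125 increase factor, and a decrease factor that shrinks with prefetch accuracy and with the latency excess over the 1.25× threshold. The scaling constants and clamps are illustrative, not the tuned values used in the evaluation.

# Minimal sketch (assumed constants) of the per-sampling-cycle prefetch
# bandwidth adaptation: MIMD rate control with a RED-like, accuracy-aware
# multiplicative decrease.
INCREASE = 1.125          # +12.5% per cycle when latency looks healthy
THRESHOLD = 1.25          # latency above 1.25x the minimum is treated as congestion
EMA_ALPHA = 0.25          # weight for the exponential moving average

class PrefetchRateController:
    def __init__(self):
        self.rate = 1.0                 # fraction of generated prefetches actually issued
        self.min_latency = None         # lowest recent average demand latency
        self.avg_latency = None

    def sample(self, demand_latency, prefetch_accuracy):
        # exponential moving average of the observed demand read latency
        if self.avg_latency is None:
            self.avg_latency = demand_latency
        self.avg_latency += EMA_ALPHA * (demand_latency - self.avg_latency)
        # approximate the minimum achievable latency by the lowest recent average
        if self.min_latency is None or self.avg_latency < self.min_latency:
            self.min_latency = self.avg_latency

        excess = self.avg_latency / self.min_latency
        if excess > THRESHOLD:
            # RED-like: decrease more aggressively the further latency exceeds the
            # threshold, and more gently when prefetch accuracy is high
            decrease = 1.0 - 0.5 * (excess - THRESHOLD) * (1.0 - prefetch_accuracy)
            self.rate *= max(0.1, min(decrease, 0.95))
        else:
            self.rate = min(1.0, self.rate * INCREASE)
        return self.rate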
While both WFQ and bandwidth adaptation at the source address the same problem, we evaluate both schemes to understand their relative merits. CXL memory nodes could provide additional functionality, and our evaluation of WFQ is intended to understand the implications of augmenting a memory node with WFQ. We compare the two approaches to throttling prefetches.

V. EVALUATION

A. Methodology

We evaluate the DRAM cache prefetcher along with the optimizations using SST [52] simulation components. We used Ariel, a Pin-tool based processor front-end simulator, to simulate compute nodes. Ramulator [37] was used to model both local memory (DRAM) and FAM devices. We used Opal [38] to emulate the operating system's memory allocator and page fault handler. Opal allows us to configure the memory footprint between local DRAM and pooled FAM. The CXL network is simulated by the provided flit-based network model, with programmed delay and bandwidth.

We evaluated 19 memory-intensive workloads from benchmark suites including SPEC [16], PARSEC [15], GAP [10], Splash3 [56], and NPB [7]. Modern servers contain 64-128 processors per node equipped with 100s of GB of main memory. Simulating such a system is impractical given the simulation speeds. For realistic simulation schedules, we simulate a system with a scaled-down configuration that runs regions of interest (ROI) within each benchmark. We expect our simulator and the corresponding performance characteristics to scale to a larger configuration without issues. The simulated system configuration is detailed in Table II. The evaluated applications and their respective memory footprints are shown in Table III.

Processor: 8 out-of-order cores, 3.3 GHz clock, 6 issue/cycle, max pending transactions: 16
L1 cache: 32 KB, 4 ways, 4-cycle access latency
L2 cache: 256 KB, 8 ways, 12-cycle access latency
L2 cache prefetcher: Signature Path Prefetcher (SPP) [36]
L3 cache: 8 MB, 16 ways, 30-cycle access latency
Local memory: DDR4-3200, 2 channels, 2 ranks
Nodes: 1-4
CXL network: 256B flit size, min packet size: 28B, bandwidth: 128 GB/s/direction, min latency: 70 ns
Per-node prefetch queue size: 256
Pooled FAM: DDR4-2400, 2 channels, 2 ranks
TABLE II: Simulated system configuration

SPEC17: 603.bwaves_s (824 MB), 607.cactuBSSN_s (257 MB), 619.lbm_s (1.55 GB), 628.pop2_s (590 MB), 649.fotonik3d_s (587 MB), 654.roms_s (245 MB), 657.xz_s (561 MB)
Splash 3: LU (515 MB), FFT (625 MB)
GAP: bfs (864 MB), cc (802 MB), bc (593 MB), sssp (545 MB)
PARSEC: dedup (868 MB), facesim (188 MB), canneal (849 MB)
NPB: mg (431 MB), is (1 GB)
XSBench: XSBench (611 MB)
TABLE III: Benchmark configurations (suite, workload, FAM usage)

We simulated both single-node and multi-node systems accessing the FAM memory pool. For multi-node systems, we ran copies of the same application on different nodes, as well as different applications on different nodes. In multi-node systems, we expect the higher loads at FAM to result in tighter availability of bandwidth and hence possibly higher congestion. We evaluated 7 workload mixes for the 4-node system.

Below we define the figures of merit, terms, and configurations that we use in the discussion through the rest of the section.
1) Core prefetcher - Each node in our system comprises a multi-core CPU. Within each core, a prefetcher is present at the L2 cache level.
2) Baseline configuration - Workload running with neither core prefetching nor DRAM cache prefetching enabled. The core prefetcher is turned ON for all configurations except the baseline.
3) all-local configuration - Workload running with core prefetching, with the entirety of its memory footprint residing in local DRAM.
4) allocation ratio (X) - The workload's memory footprint is divided between FAM and DRAM in the ratio X:1, respectively.
5) IPC gain - Ratio of the IPC for a given workload configuration to that of the workload in the baseline configuration (higher is better).
6) Relative FAM latency - Ratio of the average FAM access latency for a workload in a given configuration to that of the workload running in the baseline configuration (lower is better).
7) Relative DRAM prefetch requests issued - Ratio of the DRAM cache prefetches issued for a given configuration to the DRAM cache prefetches issued with no optimizations (FIFO scheduling and no prefetch bandwidth adaptation).
8) Demand hit fraction - Fraction of demand requests that miss the LLC that hit in the DRAM cache.
9) Core prefetch hit fraction - Fraction of core prefetch requests that miss in the LLC that hit in the DRAM cache.

Fig. 10: Evaluation of the DRAM cache prefetcher with prefetch bandwidth adaptation, for 1, 2, and 4 node systems. Subfigure A: geomean IPC gain; B: relative FAM latency; C: relative DRAM prefetches issued with bandwidth adaptation; D: DRAM cache hit fraction (demand and core prefetch). Configurations: core prefetcher, core + DRAM cache prefetcher, and core + DRAM cache + prefetch bandwidth adaptation.

B. Performance Gain with Prefetch Bandwidth Adaptation

For this analysis, we run workloads in 1, 2, and 4 node system configurations. Each workload ran in 3 prefetch configurations: core prefetcher turned ON; core prefetcher + DRAM cache prefetcher turned ON; and core prefetcher + DRAM cache prefetcher + prefetch bandwidth adaptation turned ON. We call the second configuration non-adaptive DRAM prefetch. Each workload has a memory allocation ratio of 8.

Fig. 10 outlines the results of our experimentation. Fig. 10A shows the geomean IPC gain of all benchmarks, for each prefetch configuration, across the 3 node configurations. Across the board, DRAM prefetching improves overall performance compared to core prefetching. Core prefetching resulted in IPC gains of 1.20, 1.18, and 1.10 for 1, 2, and 4 node systems. With both core prefetching and DRAM cache prefetching turned ON, the same IPC gains increased to 1.26, 1.24, and 1.11 respectively. The performance improvement comes from the reduction in FAM access latency, as indicated in Fig. 10B. DRAM cache prefetching reduced the average FAM access latency by 29% and 34% for 1 and 2 node systems respectively.

Prefetch bandwidth adaptation further enhanced the performance of DRAM cache prefetching for 2 and 4 node systems. Bandwidth adaptation resulted in 4% and 8% IPC improvements over non-adaptive DRAM cache prefetching for 2 and 4 node systems. Non-adaptive DRAM cache prefetching performed poorly in the 4-node system configuration, with no IPC improvement over core prefetching, emphasising the importance of bandwidth adaptation in bandwidth-constrained systems. Bandwidth adaptation resulted in decrements of 7% and 13% in relative FAM latency over non-adaptive DRAM prefetch.

Fig. 10C presents the relative number of DRAM prefetch requests. The results indicate that adaptation caused 18% and 21% fewer DRAM cache prefetches to be issued to FAM in 2-node and 4-node systems respectively. The decreased DRAM prefetch issue rate resulted in decreased demand and core-prefetch hit rates, according to the analysis presented in Fig. 10D. For instance, bandwidth adaptation reduced the demand and core-prefetch hit fractions from 57% and 83% to 50% and 72% for the 4-node system. Performance improved despite the hit fraction decrement, which reveals that prefetch requests were causing considerable queuing delays at FAM.

Further, we analyze the IPC gain with prefetch bandwidth adaptation for the 4-node system across different benchmarks, as shown in Fig. 11.

Fig. 11: IPC gain due to bandwidth adaptation for the 4-node system, per benchmark. The geomean from this analysis is represented in "4 Nodes" in Fig. 10A.

DRAM cache prefetch significantly improves IPC for applications like dedup, LU, 628.pop2_s, mg, is, and facesim. Workloads like canneal, bfs, cc, and bc saw an IPC decrement with DRAM cache prefetch, possibly due to increased FAM latency. Bandwidth adaptation was able to improve the IPC substantially for these applications, except for cc. Bandwidth adaptation can mitigate congestion only when the DRAM prefetches are at least partly responsible for creating it, because our algorithm implementation throttles only the DRAM cache prefetch issue rate. Bandwidth adaptation would be of little help if core prefetches are responsible for the congestion. Future implementations of our prefetch throttling algorithm could relay the congestion occurrence to the CPU cache controllers, enabling dynamic throttling of CPU prefetch requests.

C. Performance Gain with WFQ Scheduling

We evaluate our WFQ scheduling algorithm with 3 weights: 1, 2, and 3 (a weight of 3 indicates that demands and prefetches are served in a 3:1 ratio). Each workload is run in 1, 2, and 4 node system configurations, with WFQ scheduling at the FAM controller, and with different weights. The performance of WFQ with different weights is compared to FIFO-scheduled (non-adaptive) prefetch. The core
prefetcher is active for all the configurations examined here. Fig. 12 shows our results.

Fig. 12: Evaluation of the DRAM cache prefetcher with WFQ scheduling, for FIFO and WFQ weights 1:1, 1:2, 1:3 across 1, 2, and 4 node systems. Subfigure A: IPC gain; B: relative FAM latency; C: relative DRAM prefetches issued; D: DRAM cache hit fraction (demand and core prefetch).

The geomean of IPC across all benchmarks for the WFQ algorithm, for a given weight and node configuration, is shown in Fig. 12A. WFQ improves the performance over the FIFO scheduler for 2 and 4 node systems. Weights 1, 2, and 3 improve the average IPC by 8% (3%), 9% (4%), and 9% (4%) over the FIFO scheduler for a 4 (2) node system. Again, the increase in IPC is due to the decrement in relative FAM latency. For a 4 (2) node system, the average relative FAM latency is reduced by 24% (10%). Given that bandwidth adaptation resulted in a 7% IPC improvement over the FIFO scheduler, WFQ performs marginally better.

WFQ also resulted in fewer DRAM prefetches being issued. For a 4-node system, WFQ with weights 1, 2, and 3 resulted in 17%, 31%, and 37% decrements in the average relative DRAM prefetches issued. Such behavior is expected because, as the weightage given to demands increases, prefetch request latency increases, filling the prefetch queue and subsequently causing fewer prefetches to be issued. Fig. 12D shows the demand and core-prefetch hit fractions with WFQ across different node configurations and weights.

Additionally, we analyze the IPC gain with WFQ for the 4-node system across different benchmarks, as shown in Fig. 13.

Fig. 13: IPC gain due to WFQ for the 4-node system, per benchmark, for FIFO and WFQ 1:1, 1:2, 1:3. The geomean from this analysis is represented in "4 Nodes" in Fig. 12A.

The set of workloads that benefited from bandwidth adaptation benefited from WFQ as well. Interestingly, the cc application sees an IPC improvement with WFQ but not with bandwidth adaptation. Because both core prefetch and DRAM cache prefetch requests are placed into the same prefetch queue, WFQ can potentially mitigate congestion due to either request type. For this reason, WFQ performs marginally better in comparison to bandwidth adaptation.

D. Performance Analysis of Multi-node Workload Mixes

Combining the methodology of §V-B and §V-C, we evaluated our 7 multi-workload mixes with a total of 5 prefetch configurations. Each mix is run on a 4-node system. Fig. 14 shows the IPC gain for each of the workload mixes.

Fig. 14: Performance gain with different configurations of DRAM prefetch across 7 multi-node workload mixes.

Bandwidth adaptation and WFQ provide equivalent IPC improvements for mix1, mix3, mix6 and mix7. Mix5 saw a slight IPC decrement with bandwidth adaptation, but a performance improvement with WFQ. Bandwidth adaptation outperformed WFQ by 16% for mix4. WFQ outperformed bandwidth adaptation by 5% for mix2. On average, bandwidth adaptation and WFQ resulted in 10% and 9% IPC gains over the non-adaptive prefetcher (FIFO scheduler) respectively.

This analysis reveals that both of these approaches are useful in resolving congestion at FAM. But the relative performance gain due to either technique depends not just on the workload alone, but also on the co-existing workloads that are accessing FAM.

E. Performance Improvement across Allocation Ratios

We analyze the impact of DRAM cache prefetch, along with the proposed optimizations, across different allocation ratios.
We considered 4 prefetch configurations for this experiment: core prefetcher ON, core + DRAM cache prefetch, core + DRAM + bandwidth adaptation, and core + DRAM + WFQ (1:2). We vary the allocation ratio from 1 to 8 and measure the IPC gain with respect to the all-local configuration for each benchmark, for a 4-node system. Fig. 15 shows the geometric mean of the IPC gains of all benchmarks running with a given allocation ratio and prefetcher configuration.

Fig. 15: IPC gain with respect to the performance with the entire memory in local DRAM, across FAM-DRAM allocation ratios of 1 to 8, for the core prefetcher, core + DRAM cache prefetch, core + DRAM + adaptive BW, and core + DRAM + WFQ-2 configurations.

Firstly, this analysis reveals that the utility of core prefetching decreases as FAM usage increases. With an allocation ratio of 1, workloads saw an average 10% IPC decrement, but with an allocation ratio of 8, workloads saw an average 28% IPC decrement. DRAM prefetch helps bridge the performance gap between the pooled memory configuration and the all-local configuration, improving the IPC by an average of 5%-6% across all the allocation ratios. The importance of bandwidth adaptation and WFQ is more evident at higher allocation ratios; non-optimized DRAM cache prefetching results in no IPC improvement for allocation ratios of 4, 6, and 8.

F. Sensitivity to DRAM Cache Size

To analyse the sensitivity of our design to DRAM cache size, we simulated a 4-node system, running the same copies of a given workload, with varying DRAM cache sizes. To potentially negate the effect of contention, we used the WFQ scheduling policy with a weight of 2. Fig. 16 shows the results of our analysis.

Fig. 16: DRAM cache size sensitivity analysis across different workloads, with DRAM cache sizes varied from 4 to 32 MB.

Benchmarks like 628.pop2_s, 654.roms_s, cc, bc, and XSBench showed sensitivity to DRAM cache size; their respective IPC gains increased with increases in DRAM cache size. On average, DRAM cache sizes of 4MB, 8MB, 16MB, and 32MB resulted in IPC gains of 1.17, 1.19, 1.20, and 1.22. IPC improved by 5% on average as the DRAM cache size increased from 8 to 32MB.

VI. RELATED WORK

Earlier work has considered feedback to control the prefetching rate from DRAM into the LLC [60]. Our work differs from this work in significant ways: it considers prefetching from FAM into DRAM, employs more sophisticated prefetchers, and employs separate queues and priorities for demand and prefetch requests. Separate queues for prefetch and demand requests and fair queuing have been studied earlier in the context of multiprocessor systems where resources are shared [20], [31]. We leverage this earlier work in a different context.

Blue [53] considers the timeliness of prefetches in order to issue prefetches sufficiently ahead. However, the timeliness measures in earlier work do not consider dynamic latencies at the memory system. Our approach here also separates prefetches and demands into separate queues. Criticality-aware prefetchers consider the criticality of prefetches in reducing stalls [51]. Our system operates outside the processor chip and doesn't have access to such information.

Prefetching from remote memories has received significant recent attention [6], [21], [26], [46], in order to reduce the cost of moving pages from one node to another while allowing memory to be shared across nodes on a network. Our work takes motivation from this work, but considers hardware-based prefetching in more tightly connected environments. Recent work [41] has shown that it is possible to avoid cache conflicts in fast memory when the fast memory is completely used as a cache for FAM. Our work employs only part of the fast memory as a cache.

Data placement and movement can have a significant impact in tiered memory systems, and recent work [42], [44], [45], [47], [65] has considered strategies for keeping hot or more frequently utilized data in higher performance tiers. These strategies are complementary to our approach of mitigating the latency when the slower memory has to be accessed. A few of the tiering approaches require intimate knowledge of the workload to create a hot/cold profile of the application's memory footprint, which might not be possible for every kind of system use case.

PreFAM [39] proposes prefetching at a distance from FAM into a local DRAM cache. While there might be similarities in the system architecture, PreFAM doesn't consider resource contention at FAM due to other participating nodes.

DirectCXL [25] implements a memory pooling solution leveraging the CXL.mem protocol that has an access latency around 200 ns. DRAM cache prefetching, along with the bandwidth optimizations proposed in this work, is agnostic to the CXL fabric latency.

Recent work [13] has suggested that the employment of large pages may not be universally beneficial in disaggregated memory systems. Our work here considers the movement of sub-pages of data between DRAM and FAM to reduce the access latencies at FAM.

VII. CONCLUSION

This paper proposed a prefetching mechanism for caching sub-page blocks from FAM in a portion of the DRAM. The prefetcher learns from LLC misses heading to FAM and issues prefetch requests to bring FAM data to DRAM to reduce the latency of future accesses. We considered two optimizations to mitigate congestion: a compute-node based issue rate management mechanism based on observed latencies, and a memory-node based weighted fair queuing mechanism. We show that both of these mechanisms can improve the IPC of non-adaptive
DRAM cache prefetch by 7-10%. Our evaluation reveals that [23] B. Fitzpatrick, “Distributed caching with memcached,” Linux J.,
both of the approaches are effective in improving performance vol. 2004, no. 124, pp. 5–, Aug. 2004. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1012889.1012894
in different system configurations and workloads. [24] S. Floyd and V. Jacobson, “Random early detection gateways for
congestion avoidance,” ACM/IEEE Trans. on Networking, 1993.
R EFERENCES [25] D. Gouk, S. Lee, M. Kwon, and M. Jung, “Direct access,
[1] Compute express link. Accessed 23-Aug-2022. [Online]. Available: High-Performance memory disaggregation with DirectCXL,” in 2022
https://www.computeexpresslink.org USENIX Annual Technical Conference (USENIX ATC 22). Carlsbad,
[2] Compute express link 3.0. Accessed 23-Aug-2022. [Online]. Available: CA: USENIX Association, Jul. 2022, pp. 287–294. [Online]. Available:
https://www.computeexpresslink.org/spec-landing https://www.usenix.org/conference/atc22/presentation/gouk
[3] Cxl and gen-z iron out a coherent interconnect strategy. Accessed [26] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin, “Efficient
May 2024. [Online]. Available: https://www.nextplatform.com/2020/04/ memory disaggregation with infiniswap,” Proc. of USENIX NSDI Conf.,
03/cxl-and-gen-z-iron-out-a-coherent-interconnect-strategy/ 2017.
[4] What do we do when compute and mem- [27] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Toward dark
ory stop getting cheaper? Accessed May 2024. silicon in servers,” IEEE Micro, vol. 31, no. 4, pp. 6 –15, july-aug. 2011.
[Online]. Available: https://www.nextplatform.com/2023/01/18/ [28] W. Heirman, K. DuBois, Y. Vandriessche, S. Eyerman, and I. Hur, “Near-
what-do-we-do-when-compute-and-memory-stop-getting-cheaper/ side prefetch throttling: Adaptive prefetching for high-performance
[5] N. Agarwal and T. Wenisch, “Thermostat: Application transparent page many-core processors,” ACM PACT Conf., 2018.
management for two-tiered main memory,” ACM SIGARCH Computer [29] C.-C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu,
Architecture News, 2017. P. Bahl, and M. Philipose, “Videoedge: Processing camera streams using
[6] E. amaro, C. Branner-Augmon, Z. Luo, A. Ousterhout, M. Aguilera, hierarchical clusters,” IEEE/ACM Symposium on Edge Computing, 2018.
A. Panda, S. Ratnasamy, and S. Shenker, “Can far memory improve job [30] A. Jain and C. Lin, “Linearizing irregular memory accesses for improved
throughput?” Proc. of ACM Eurosys Conf., Apr.2020. correlated prefetching.” in MICRO, 2013, pp. 247–259.
[7] D. Bailey, T. Harris, W. Saphir, R. Van Der Wijngaart, A. Woo, and [31] N. Jerger, E. Hill, and M. Lipasti, “Friendly fire: Understanding the
M. Yarrow, “The nas parallel benchmarks 2.0,” Technical Report NAS- effects of multiprocessor prefetches,” ISPASS, 2006.
95-020, NASA Ames Research Center, Tech. Rep., 1995. [32] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica,
[8] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi- “Chameleon: scalable adaptation of video analytics,” Proc. of ACM
Azad, “Bingo spatial data prefetcher,” Proc. of IEE HPCA Conf., 2019. SIGCOMM Conf., 2018.
[9] S. Bavikadi, P. R. Sutradhar, K. N. Khasawneh, A. Ganguly, and S. M. [33] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, “Heteroos: Os
Pudukotai Dinakarrao, “A review of in-memory computing architectures design for heterogeneous memory management in datacenter,” Proc. of
for machine learning applications,” in Proceedings of the 2020 on ACM ISCA Conf., 2017.
Great Lakes Symposium on VLSI, ser. GLSVLSI ’20. New York, NY, [34] O. Khattab and M. Zaharia, “ColBERT: Efficient and Effective
USA: Association for Computing Machinery, 2020, p. 89–94. [Online]. Passage Search via Contextualized Late Interaction over BERT,”
Available: https://doi.org/10.1145/3386263.3407649 arXiv:2004.12832, 2020.
[10] S. Beamer, K. Asanović, and D. Patterson, “The gap benchmark suite,”
[35] J. Kim, S. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and
2017.
Z. Chishti, “Path confidence based lookahead prefetching,” in The 49th
[11] R. Bera, K. Kanellopoulos, A. Nori, T. Shahroodi, S. Subramoney, and
Annual IEEE/ACM International Symposium on Microarchitecture, ser.
O. Mutlu, “Pythia: A customizable hardware prefetching framework
MICRO, 2016.
using online reinforcement learning,” Proc. of IEEE MICRO Conf., Oc.
[36] J. Kim, S. H. Pugsley, P. V. Gratz, A. N. Reddy, C. Wilkerson, and
2021.
Z. Chishti, “Path confidence based lookahead prefetching,” in 2016
[12] R. Bera, A. Nori, O. Mutlu, and S. Subramoney, “Dspatch: Dual spatial
49th Annual IEEE/ACM International Symposium on Microarchitecture
pattern prefetcher,” ACM ISCA, 2011.
(MICRO), 2016, pp. 1–12.
[13] S. Bergman, P. Faldu, B. Grot, L. Vilanova, and M. Silberstein, “Re-
considering os memory optimizations in the presence of disaggregated [37] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible
memory,” ACM Int. Symp. on Memory Management, June 2022. dram simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1,
[14] S. Bhandarkar, A. L. N. Reddy, Y. Zhang, and D. Loguinov, “Emulating pp. 45–49, 2016.
AQM from end hosts,” ACM Sigcomm Conf., Oct. 2007. [38] V. Kommareddy, C. Hughes, S. D. Hammond, and A. Awad, “Opal: A
[15] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, centralized memory manager for investigating disaggregated memory
Princeton University, January 2011. systems.” 8 2018. [Online]. Available: https://www.osti.gov/biblio/
[16] J. Bucek, K.-D. Lange, and J. v. Kistowski, “Spec cpu2017: Next- 1467164
generation compute benchmark,” in Companion of the 2018 ACM/SPEC [39] V. R. Kommareddy, J. Kotra, C. Hughes, S. D. Hammond, and A. Awad,
International Conference on Performance Engineering, ser. ICPE ’18. “Prefam: Understanding the impact of prefetching in fabric-attached
New York, NY, USA: Association for Computing Machinery, 2018, p. memory architectures,” in The International Symposium on Memory
41–42. [Online]. Available: https://doi.org/10.1145/3185768.3185771 Systems, ser. MEMSYS 2020. New York, NY, USA: Association
[17] I. Calciu, M. Imran, I. Puddu, S. Kashyap, H. Maruf, O. Mutlu, and for Computing Machinery, 2021, p. 323–334. [Online]. Available:
A. Kolli, “Rethinking software runtimes for disaggregated memory,” https://doi.org/10.1145/3422575.3422804
Proc. of ACM ASPLOS Conf., 2021. [40] A. Lagar-Cavilla, J. Ahn, S. Souhlal, N. Agarwal, R. Burny, S. Butt,
[18] D. M. Chiu and R. Jain, “Analysis of increase and decrease algorithms J. Chang, A. Chaugule, N. Deng, J. Shahid, G. Thelen, K. A. Yurt-
for congestion avoidance in computer networks,” Journal of Computer sever, Y. Zhao, and P. Ranganathan, “Software-defined far memory in
Networks and ISDN Systems, pp. 1–14, 1989. warehouse-scale computers,” Proc. of ACM ASPLOS Conf., 2019.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- [41] B. Lepers and W. Zwzenepoel, “Johnny Cache: the End of DRAM Cache
training of Deep Bidirectional Transformers for Language Understand- Conflicts (in Tiered Main Memory Systems),” USENIX OSDI, 2023.
ing,” arXiv:1810.04805, 2019. [42] H. Li, D. S. Berger, L. Hsu, D. Ernst, P. Zardoshti, S. Novakovic,
[20] E.Ebrahimi, C. K. Lee, O. Mutlu, and Y. Patt, “Prefetch-aware shared- M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura,
resource management for multi-core systems,” ACM ISCA, 2011. and R. Bianchini, “Pond: Cxl-based memory pooling systems for cloud
[21] V. Fedorov, J. Kim, M. Qin, A. L. N. Reddy, and P. Gratz, “Speculative platforms,” in Proceedings of the 28th ACM International Conference
paging for future nvm and ssd,” in Proceedings of the 2017 International on Architectural Support for Programming Languages and Operating
Symposium on Memory Systems, Oct. 2017. Systems, Volume 2, ser. ASPLOS 2023. New York, NY, USA:
[22] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevd- Association for Computing Machinery, 2023, p. 574–587. [Online].
jic, C. Kaynak, A. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the Available: https://doi.org/10.1145/3575693.3578835
clouds: a study of emerging scale-out workloads on modern hardware,” [43] M. Li, J. Tan, Y. Wang, L. Zhang, and V. Salapura, “Sparkbench:
in Proceedings of the seventeenth international conference on Archi- A comprehensive benchmarking suite for in memory data analytic
tectural Support for Programming Languages and Operating Systems. platform spark,” in Proceedings of the 12th ACM International
ACM, 2012, pp. 37–48. Conference on Computing Frontiers, ser. CF ’15. New York,
NY, USA: ACM, 2015, pp. 53:1–53:8. [Online]. Available: http: in Proceedings of the 9th USENIX Conference on Networked
//doi.acm.org/10.1145/2742854.2747283 Systems Design and Implementation, ser. NSDI’12. Berkeley, CA,
[44] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility- USA: USENIX Association, 2012, pp. 2–2. [Online]. Available:
based hybrid memory management,” Proc. of IEEE Int.l Conf. on Cluster http://dl.acm.org/citation.cfm?id=2228298.2228301
Computing (CLUSTER), 2017. [67] Q. Zhang, P. Bernstein, D. Berger, and B. Chandramouli, “Redy: remote
[45] M. Maas, D. Andersen, M. Isard, M. Mahdi, K. McKinley, and C. Raffel, dynamic memory cache,” Proc. of VLDB, Dec. 2021.
“Combining machine learning and lifetime-based resource management
for memory allocation and beyond,” Communications of ACM, 2022.
[46] H. A. Maruf and M. Chowdhury, “Effectively prefetching remote mem-
ory with leap,” Proc. of In USENIX ATC, 2020.
[47] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhat-
tacharya, C. Petersen, M. Chowdhury, S. Kanaujia, and P. Chauhan,
“TPP: Transparent page placement for cxl-enabled tiered memory,”
arxiv.org2206.0287v1, June 2022.
[48] A. Navarro-Torres, B. Panda, J. Alastruey-Benede, P. Ibanez, V. Vinals-
Yufera, and A. Ros, “Berti: an accurate local-delta data prefetcher,”
ACM/IEEE MICRO, 2022.
[49] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich,
D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, and M. Rosenblum,
“The case for ramclouds: scalable high-performance storage entirely in
dram,” ACM SIGOPS Operating Systems Review, vol. 43, no. 4, pp.
92–105, 2010.
[50] R. Pagh and F. F. Rodler, “Cuckoo hashing,” Algorithms, Lecture Notes
in CS, 2001.
[51] B. Panda, “Clip: Load criticality based data prefetching for bandwidth-
constrained many-core systems,” ACM/IEEE MICRO, 2023.
[52] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield,
M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and
B. Jacob, “The structural simulation toolkit,” SIGMETRICS Perform.
Eval. Rev., vol. 38, no. 4, p. 37–42, mar 2011. [Online]. Available:
https://doi.org/10.1145/1964218.1964225
[53] A. Ros, “Blue: A timely ip-based data prefetcher,” 1st ML-based Data
Prefetching Competition, June 2021.
[54] K. Roy, I. Chakraborty, M. Ali, A. Ankit, and A. Agrawal, “In-memory
computing in emerging memory technologies for machine learning: An
overview,” in 2020 57th ACM/IEEE Design Automation Conference
(DAC), 2020, pp. 1–6.
[55] S. Rumble, A. Kejriwal, and J. Ousterhout, “Log-structured memory for
dram-based storage,” Proc. of Usenix FAST Conf., Feb. 2014.
[56] C. Sakalis, C. Leonardsson, S. Kaxiras, and A. Ros, “Splash-3: A
properly synchronized benchmark suite for contemporary research,”
in 2016 IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS), 2016, pp. 101–111.
[57] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang, “Legoos: A disseminated,
distributed os for hardware resource disaggregation,” Proc. of USENIX
OSDI, 2018.
[58] Z. Shi, A. Jain, K. Swersky, M. Hashemi, P. Ranganathan, and C. Lin, “A
hierarchical neural model of data prefetching,” Proc. of ACM ASPLOS
Conf., Apr. 2021.
[59] M. Shreedhar and G. Varghese, “Efficient fair queuing using deficit
round-robin,” IEEE/ACM Transactions on Networking, vol. 4, no. 3,
pp. 375–385, 1996.
[60] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, “Feedback directed
prefetching: Improving the performance and bandwidth-efficiency of
hardware prefetchers,” in 2007 IEEE 13th International Symposium on
High Performance Computer Architecture, 2007, pp. 63–74.
[61] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez,
A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and Efficient
Foundation Language Models,” arXiv:2302.13971, 2023.
[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, “Attention is all you need,” Proc. of NIPS
Conf., 2017.
[63] S. Venkataraman, A. Panda, K. Ousterhout, M. Armbrust, A. Ghodsi,
M. J. Franklin, B. Recht, and I. Stoica, “Drizzle: Fast and adaptable
stream processing at scale,” Proc. of ACM SIGOPS Conf., 2017.
[64] W.P.Chen, A. Rudoff, and R. Agarwal, “Dynamic multilevel memory
system,” US Patent 20220229575, Mar. 2022.
[65] Z. Yan, D. Lustig, D. Nellans, and A. Bhattacharjee, “Nimble page
management for tiered memory systems,” Proc. of ACM ASPLOS Conf.,
2019.
[66] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley,
M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed
datasets: A fault-tolerant abstraction for in-memory cluster computing,”
