Exploring DRAM Cache Prefetching For Pooled Memory

1st Chandrahas Tirumalasetty, 2nd Narasimha Reddy Annapareddy
Dept. of Electrical & Computer Engineering, Texas A&M University, College Station, TX, USA
[email protected], [email protected]

arXiv:2406.14778v1 [cs.AR] 20 Jun 2024
Abstract—Hardware-based memory pooling enabled by interconnect standards like CXL has been gaining popularity amongst cloud providers and system integrators. While pooling memory resources has cost benefits, it comes at the penalty of increased memory access latency. With yet another addition to the memory hierarchy, local DRAM can potentially be used as a block cache (DRAM cache) for fabric attached memory (FAM), and data prefetching techniques can be used to hide the FAM access latency. This paper proposes a system for prefetching sub-page blocks from FAM into a DRAM cache to improve data access latency and application performance. We further optimize our DRAM cache prefetch mechanism through enhancements that mitigate the performance degradation due to bandwidth contention at FAM. We consider the potential for providing additional functionality at the CXL-memory node through weighted fair queuing of demand and prefetch requests. We compare such a memory-node level approach to adapting the prefetch rate at the compute node based on observed latencies. We evaluate the proposed system in single-node and multi-node configurations with applications from the SPEC, PARSEC, Splash, and GAP benchmark suites. Our evaluation suggests that DRAM cache prefetching results in a 7% IPC improvement and that both of the proposed optimizations can further increase IPC by 7-10%.
I. INTRODUCTION
Modern workloads are evolving rapidly. Heavy use of ML techniques, from the data center to client/mobile, is placing new, more stringent demands on system design across all platforms. Many of these ML techniques, from large language models [19], [34], [61], [62] to video processing [29], [32], [63] and others, rely on large amounts of data for training and sometimes for retrieval. In the data center, where a given workload may reference terabytes of data spread across many nodes, great demands on computer memory systems are becoming commonplace [22], [27].

In recent years, the performance gap between DRAM and disk has grown so large that it has led system designers to eschew using storage to extend DRAM entirely, in favor of over-provisioning DRAM [9], [23], [43], [49], [54], [66]. Many different techniques have been proposed to reduce the cost of data movement and page fault handling penalties. These range from employing a larger DRAM memory and running applications entirely in memory [49], [55], to prefetching data blocks [21], [30], [35], to employing remote memories [6], [26], [46].

Datacenter servers host applications with diverse memory requirements. Provisioning larger DRAM capacities can potentially ameliorate the performance needs of some applications while remaining untouched by the rest. Hence, memory underutilization is rampant in today's datacenters. Trace analysis from production clusters at Google and Alibaba revealed that 45%-60% of the memory allocated to jobs is not utilized [57]. Untouched memory for virtual machine instances (VMs) in Azure servers averages about 25%, even while the compute is fully used [42].

Furthermore, the $/GB price of DDR memory has plateaued for the last few generations [4]. As a result, the cost of provisioning larger memory capacities in today's servers is steeply increasing with every generation. Memory contributes 37%-50% of the total cost of ownership (TCO) of a server fleet [3], [47]. Memory underutilization therefore incurs substantial costs at the scale of today's datacenters.

Memory disaggregation enables applications to avail memory from a central resource on an ad-hoc basis, freeing the memory from being tied up statically at the node level. Modern data center servers have been exploring the potential of disaggregated memories to provide a less expensive means of furnishing memory [33], [40], [57]. The disaggregated approaches have taken two parallel paths: (1) RDMA-based approaches that employ memory at another node as remote/far memory, accessed through Operating System (OS) based paging mechanisms [6], [26], [46]; and (2) approaches that employ CXL to provide a shared common pool of memory across multiple nodes [1], [5], [17], [67]. We refer to such a memory organization as Fabric Attached Memory (FAM). With either approach, it is expected that memory is used more efficiently across different workloads with divergent memory needs.

These architectural paths result in additional layers in the cache-memory hierarchy, with remote or shared common DRAM across an interconnect being the new layer beyond local DRAM. As new memory layers, including disaggregated memories, far memories, and non-volatile memories, close the gap in speed between the different layers of memory, it has become necessary to pursue lower-latency approaches for accessing data from these new layers [6], [41]. This paper pursues the approach of prefetching data between DRAM and the lower layers of memory, such as disaggregated memory (over a CXL-like interconnect), to this end.
This paper explores the potential of utilizing the LLC misses that are visible at the root-complex level to build a prefetching mechanism between FAM and DRAM, utilizing a portion of local DRAM as a hardware-managed cache for FAM. We call this proposed cache the DRAM cache. We employ SPP [36] as an example prefetcher to demonstrate the performance gains with DRAM cache prefetching, but other prefetchers can be employed as well. Our DRAM cache prefetcher maintains the metadata for the cached FAM data. The root complex, equipped with a prefetcher, redirects requests to cached data to the DRAM cache. On a hit, the cached data sees DRAM latencies instead of FAM latencies. Unlike previous mechanisms that considered page-level transfers between DRAM and lower-layer memory [6], [41], [46], [64], we consider the potential for sub-page level prefetches at the hardware level.

Since multiple nodes can pool memory from FAM, it is imperative that the FAM bandwidth is utilized and shared across multiple nodes effectively. As previous work has shown, prefetch throttling [28], [60] is an effective mechanism for utilizing memory bandwidth well. We take inspiration from this earlier work and incorporate ideas for prefetch throttling to effectively manage the FAM bandwidth across demand and prefetch streams across multiple nodes. We also take inspiration from network congestion algorithms [24] to develop bandwidth adaptation techniques at the source (compute node). Since CXL-connected memory devices can be enhanced with extra functionality, we evaluate the potential of employing Weighted Fair Queueing (WFQ) at the memory node and compare that approach with prefetch throttling at the source.
This paper makes the following significant contributions:

• Proposes a system architecture for caching and prefetching FAM data in local DRAM, with the cache managed at the granularity of sub-page blocks.
• Proposes an adaptive prefetching mechanism that throttles DRAM cache prefetches in response to congestion at FAM.
• Proposes a WFQ-enabled CXL-memory node and compares its performance against prefetch throttling at the source.
• Evaluates the proposed prefetch mechanism to demonstrate its efficacy in single-node and multi-node configurations.
II. BACKGROUND

A. CXL enabled memory pooling

Compute Express Link (CXL) [2] is a cache-coherent interconnect standard for processors to communicate with devices like accelerators and memory expanders. CXL builds upon the physical layer of PCIe (it is electrically compatible). CXL offers 3 kinds of protocols - CXL.cache, CXL.mem, and CXL.io. Any device that connects to the host processor using CXL can use any or all of the aforementioned protocols. CXL identifies 3 types of devices that use one or more of these protocols. A Type-1 device, like a Network Interface Controller (NIC), has a cache hierarchy but no local memory, and uses CXL.cache. A Type-2 device, like a GPU or FPGA, which comprises both caches and local memory, uses CXL.cache and CXL.mem. A Type-3 device, like a memory expander, which does not have a local cache hierarchy, uses CXL.mem.

Our discussion in this paper is based on systems that leverage the CXL.mem protocol for memory pooling. Fig. 1 shows compute nodes pooling memory resources from a shared memory node. We call the memory attached to the processor using CXL Fabric Attached Memory (FAM). Fig. 2 details our system architecture with the CXL and FAM components.

Fig. 1: CXL.mem enabled memory pooling

The CXL root complex comprises an agent that implements the CXL.mem protocol. The agent acts on behalf of the host (CPU), handling all communication and data transfers with the CXL end point. In our system, the CXL end point comprises the FAM device and the FAM controller. The FAM controller directly interfaces with the agent, translating CXL.mem commands into requests that can be understood by the FAM device (e.g., DDR commands).

As illustrated, load misses and writebacks from the LLC are handled either by the local memory controller or by the CXL root complex, based on the physical address. The address decoding is implemented in Host-managed Device Memory (HDM) decoders. During the device enumeration phase, HDM decoders are programmed for every CXL.mem device and its contribution to the flat address space of the processor.

Fig. 2: CXL & Fabric Attached Memory (FAM) Architecture
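As a concrete illustration of this address decoding, the following minimal sketch shows how LLC misses might be routed either to the local memory controller or to a CXL.mem device based on HDM-programmed address ranges. It is our own example, not code from the paper; the HdmDecoder structure, the decode function, and the example address ranges are all hypothetical.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// Hypothetical HDM decoder entry: a contiguous host physical address range
// contributed by one CXL.mem device to the processor's flat address space.
struct HdmDecoder {
    uint64_t base;    // start of the range in host physical address space
    uint64_t size;    // size of the range in bytes
    int      device;  // CXL.mem device (FAM controller) behind this range
};

// Returns the CXL device owning the address, or std::nullopt for local DRAM.
std::optional<int> decode(const std::vector<HdmDecoder>& decoders, uint64_t paddr) {
    for (const auto& d : decoders)
        if (paddr >= d.base && paddr < d.base + d.size)
            return d.device;
    return std::nullopt;  // falls through to the local memory controller
}

int main() {
    // Example (assumed): 64 GB of local DRAM, one FAM range mapped behind it.
    std::vector<HdmDecoder> decoders = {{64ULL << 30, 256ULL << 30, /*device=*/0}};
    for (uint64_t addr : {1ULL << 30, 100ULL << 30}) {
        auto dev = decode(decoders, addr);
        std::printf("0x%llx -> %s\n", (unsigned long long)addr,
                    dev ? "CXL root complex (FAM)" : "local memory controller");
    }
    return 0;
}

In real hardware this check is performed by the HDM decoder logic in the root complex rather than in software; the sketch only illustrates the range-based routing decision.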
B. Memory Prefetching & Signature Path Prefetcher (SPP)

Data prefetching techniques that hide memory access latency across different levels of the memory hierarchy are well studied in the literature. Prefetchers typically use a learning-based approach to predict future memory access addresses [8], [35]. The most common features include address delta correlation, the program counter (PC) of instructions that cause cache misses, and access history. Recent work has applied sophisticated mechanisms like neural networks [58] and reinforcement learning [11], [58] to prefetching.

1) Signature Path Prefetcher (SPP): In this work, we use SPP [35] as the base architecture for our DRAM cache prefetcher. SPP uses signatures to keep track of the memory access patterns of the application. Signatures are a compact representation of the history of memory access deltas of the program. Architecturally, SPP comprises 2 tables - a Signature Table and a Pattern Table. Fig. 3 shows the organization of SPP with these tables.

Fig. 3: Signature and Pattern tables of SPP
On a cache miss, the physical page address of the cache miss is used to index into the signature table. The output from the signature table gives the last cache miss address (within the same physical page) and the current signature. With this state, we can calculate the delta and the updated signature as per the formulas shown below:

delta = MissAddress_current - MissAddress_previous
signature = (signature << 4) ⊕ delta
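A minimal sketch of this delta and signature bookkeeping, written for illustration only (the 16-bit signature width, the 4-bit folding of the delta, and the table names are assumptions; SPP's actual field widths and table organization differ), is shown below.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Per-page state kept in the signature table (assumed fields).
struct SignatureEntry {
    uint32_t last_block = 0;  // last accessed block offset within the page
    uint16_t signature = 0;   // compressed history of access deltas
};

// Signature update as per the formulas above: the new delta is folded into
// an assumed 16-bit signature with a 4-bit shift and XOR.
uint16_t update_signature(uint16_t signature, int32_t delta) {
    return static_cast<uint16_t>((signature << 4) ^ (delta & 0xF));
}

int main() {
    std::unordered_map<uint64_t, SignatureEntry> signature_table;

    // Simulate misses to blocks 0x0, 0x2, 0x4 of page 0xA000 (deltas of +2).
    uint64_t page = 0xA000;
    for (uint32_t block : {0u, 2u, 4u}) {
        auto& e = signature_table[page];
        int32_t delta = static_cast<int32_t>(block) - static_cast<int32_t>(e.last_block);
        e.signature = update_signature(e.signature, delta);
        e.last_block = block;
        std::printf("block %u: delta %+d, signature 0x%04x\n",
                    block, (int)delta, (unsigned)e.signature);
    }
    // The resulting signature would then index the pattern table, whose
    // (delta, weight) pairs drive the speculative lookahead described next.
    return 0;
}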
Now the generated signature is used to index into the pattern table. The pattern table maps a signature to the address deltas of future memory accesses. Each pattern table entry has the following fields:

1) Signature - Obtained from the signature table. Serves as the index into this table.
2) Signature weight - Counts the number of times the corresponding signature has been accessed since the creation of the entry.
3) (delta, weight) - The address deltas that comprise the signature and their corresponding access counts; 4 ordered pairs.

The obtained address delta can then be combined with the current signature to generate a speculative signature (using the aforementioned formulation). The speculative signature can further be used to index into the pattern table to generate another address delta. This recursive indexing into the pattern table can be continued a desired number of times, or until the pattern table is unable to provide any more deltas.

On an access to the prefetcher (a cache miss to a certain page), the generated signature and the block address of the current access are used to update the state of SPP. Fig. 4 shows the state of SPP after an example memory access. Additionally, SPP maintains a global history table that bootstraps the learning of access history when the data access stream moves from one page to another.

Fig. 4: State of SPP after access at an address 0xA003

We note that our proposals in this paper are not specific to the SPP prefetcher, and ideas from other prefetchers such as [12], [48] can be employed with suitable modifications in our system. Our focus in this work is to demonstrate the usefulness of sub-page level prefetching to hide FAM access latency and to present adaptive optimizations that cater to the shared nature of FAM.

III. SYSTEM ARCHITECTURE

In this section we describe the system architecture components that implement the DRAM prefetching/caching mechanism for FAM-bound requests. Through the rest of the section, the demands and prefetches that we refer to are LLC misses and DRAM cache prefetches. DRAM cache prefetches should not be confused with the prefetch requests issued by the per-core cache prefetcher (core prefetches). Our system architecture does not distinguish between the types of requests that miss in the LLC; hence core prefetch requests that miss in the LLC are treated like demand misses and are subsequently used for training the DRAM cache prefetcher as well.

A. Enhanced Root Complex

DRAM caching/prefetching is implemented through enhancements to the root complex. We add a prefetcher and a prefetch queue to facilitate the issue of both prefetch and demand requests to FAM. Fig. 5 outlines the architecture of the enhanced root complex. We explain the significance of each component in detail below.

Fig. 5: System architecture of root complex with DRAM cache and prefetcher

1) Prefetcher: As mentioned before, our DRAM cache employs an SPP-based prefetcher. We make design changes to SPP to operate with sub-page blocks instead of 64-byte cache blocks. Our prefetcher trains on the node physical addresses of LLC misses. Based on observed patterns, the prefetcher generates addresses that are aligned with the sub-page block size. Note that, for a memory access to complete, node physical addresses need to be translated to FAM local addresses. We assume that this translation is handled by the elements lower in the memory hierarchy. The storage overhead required to implement our prefetcher is 11 kB (2× that of SPP [36]).
2) Prefetch Queue: Along with the prefetcher, we added a fixed-length prefetch queue to the root complex. For every read miss in the LLC that is headed to FAM, the prefetcher generates at most a predefined number of prefetch requests; we call this number the prefetch degree. For each such prefetch request to be issued to FAM, there must be a vacant position to hold it in the prefetch queue. The prefetch request is held in the queue until the respective response is received. Since the prefetch queue houses the prefetch requests in progress, the queue provides an easy way to check whether a demand request address belongs to any prefetch in progress. In this sense, the prefetch queue's functionality is similar to the MSHRs (Miss Status Handling Registers) in processor caches. When the prefetch queue is full, no further prefetches are issued until a prefetch response is received.

We should note that the prefetch queue itself could control the rate of prefetch requests issued to FAM, due to its fixed length. As we will show later, such static approaches work well for a few applications while leading to wasteful prefetching for several others. The prefetch bandwidth optimizations adapt the prefetch issue rate beyond the fixed-length queue.

Prefetchers that fetch data into the on-chip processor caches share the queue with the demand requests in the MSHRs. In our design, the DRAM cache prefetcher cannot use the LLC MSHRs due to the difference in block size. While it is possible to use multiple entries in the LLC MSHR, that is not a resource-efficient approach.

Prefetch requests leaving the prefetch queue are tagged. Architectural components in the fabric or at the FAM node can likely take advantage of this to enforce priority/QoS schemes.
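The following sketch, with assumed sizes and names (PrefetchQueue, try_issue, and the capacity are ours, not the paper's), illustrates how a fixed-length queue of in-flight prefetches both bounds the prefetch issue rate and supports the MSHR-like check of whether a demand address matches a prefetch in progress.

#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Fixed-length queue of in-flight DRAM cache prefetches (sub-page block addresses).
class PrefetchQueue {
public:
    explicit PrefetchQueue(size_t capacity) : capacity_(capacity) {}

    // Issue a prefetch only if there is a vacant slot; otherwise drop it.
    bool try_issue(uint64_t block_addr) {
        if (in_flight_.size() >= capacity_) return false;   // queue full: drop
        return in_flight_.insert(block_addr).second;         // false if duplicate
    }

    // A FAM response releases the slot held by the prefetch.
    void on_response(uint64_t block_addr) { in_flight_.erase(block_addr); }

    // MSHR-like check: is this demand address covered by a prefetch in progress?
    bool matches_in_flight(uint64_t block_addr) const {
        return in_flight_.count(block_addr) != 0;
    }

private:
    size_t capacity_;
    std::unordered_set<uint64_t> in_flight_;
};

int main() {
    const uint64_t kBlock = 256;          // assumed sub-page block size in bytes
    PrefetchQueue q(/*capacity=*/4);

    // An LLC read miss at 0x10000 triggers prefetches of the next blocks
    // (a prefetch degree of 2 is assumed here).
    for (int i = 1; i <= 2; ++i) q.try_issue(0x10000 + i * kBlock);

    uint64_t demand = 0x10000 + kBlock;   // later demand to a prefetched block
    std::printf("demand matches in-flight prefetch: %s\n",
                q.matches_in_flight(demand) ? "yes" : "no");
    q.on_response(demand);                // response arrives, slot freed
    return 0;
}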
B. DRAM Cache

The DRAM cache is explicitly managed in hardware without intervention of the operating system (OS). The OS, specifically the memory allocator, only plays a role during the initialization phase. The memory allocator should partition the local memory physical address space and expose a contiguous physical address range to be used as the DRAM cache. We assume that such support in the OS already exists.

In this implementation, we manage the DRAM cache as a set-associative cache with an LRU replacement policy. The metadata that implements the DRAM cache lookup and replacement is stored outside the DRAM cache, in the prefetcher state (SRAM buffers). The handling of DRAM cache metadata is discussed later in this section. Since the FAM address space can be large compared to the DRAM cache, we manage the DRAM cache metadata by hashing the FAM addresses into a small number of slots. The number of slots can be higher than the number of available FAM blocks to reduce the probability of collision with another address, or it is possible to employ techniques such as Cuckoo hashing [50]. As in CPU caches, tag comparison ensures the correctness of hashing in the event of a collision. The allocated FAM block address is noted in this slot during allocation and on a cache hit. The format of the DRAM cache metadata and its retrieval is shown in Fig. 6. For example, when managed as a fully associative cache, a 16 MB cache with a 256 B block size would require approximately 450 KB (64K × 7 B) of metadata to cover a 48-bit physical address space, which is less than 5% of the DRAM cache size. Hence, managing the DRAM cache metadata is practical.

Fig. 6: DRAM Cache Metadata format and retrieval
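A simplified sketch of this metadata lookup, hashing a FAM block address into a slot and confirming the hit by tag comparison, is given below. The direct modulo hash (in place of Cuckoo hashing), the slot count, and the field layout are our assumptions rather than the paper's exact design.

#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

// One metadata slot, loosely following the fields shown in Fig. 6.
struct MetaSlot {
    bool     valid = false;
    bool     dirty = false;
    uint64_t fam_block = 0;   // tag: FAM block address cached in this slot
    uint64_t dram_block = 0;  // corresponding block in the DRAM cache range
    uint32_t lru = 0;         // LRU timestamp
};

class DramCacheMeta {
public:
    explicit DramCacheMeta(size_t slots) : table_(slots) {}

    // Hash + tag compare: returns the DRAM cache block on a hit.
    std::optional<uint64_t> lookup(uint64_t fam_block, uint32_t now) {
        MetaSlot& s = table_[hash(fam_block)];
        if (s.valid && s.fam_block == fam_block) { s.lru = now; return s.dram_block; }
        return std::nullopt;                       // miss: go to FAM as usual
    }

    // Record a newly filled block (real hardware would evict per LRU on conflict).
    void insert(uint64_t fam_block, uint64_t dram_block, uint32_t now) {
        table_[hash(fam_block)] = {true, false, fam_block, dram_block, now};
    }

private:
    size_t hash(uint64_t fam_block) const { return fam_block % table_.size(); }
    std::vector<MetaSlot> table_;
};

int main() {
    DramCacheMeta meta(/*slots=*/65536);           // e.g. 16 MB cache of 256 B blocks
    meta.insert(/*fam_block=*/0xABCDE, /*dram_block=*/42, /*now=*/1);
    auto hit = meta.lookup(0xABCDE, /*now=*/2);
    std::printf("lookup: %s\n", hit ? "hit, redirect to local DRAM" : "miss, go to FAM");
    return 0;
}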
C. Demand & Prefetch requests handling mechanism

Fig. 7 explains the flow of demand requests with the DRAM cache and prefetcher. For every outgoing FAM demand request, the prefetcher is consulted to check whether the requested block is present in the DRAM cache. The prefetcher promptly checks the metadata for the requested block. If the demand block is present (DRAM cache hit), a new request with the DRAM cache block address (obtained from the metadata) is sent to the local memory controller. The FAM demand request waits for the response of this new request and returns with the response data. As said earlier, this FAM demand request can be a true demand request issued by the application or a core prefetch; either request type can be served by the DRAM cache. Subsequently, the corresponding LRU field in the metadata is updated. If the demand block is not present in the DRAM cache (DRAM cache miss), the demand request proceeds as usual to the FAM.

Fig. 7: FAM demand request flow with prefetcher and DRAM cache

Irrespective of the DRAM cache hit status, the prefetcher generates prefetch addresses for every outgoing FAM demand request. Before sending out the prefetch requests, the prefetch queue and the DRAM cache metadata are checked to see whether a generated prefetch request is redundant. Prefetches continue to the issue stage once the queue and metadata checks clear. In the issue stage, a prefetch request can be dropped if the prefetch queue is full or at a pre-defined threshold (e.g., 95%). Past the issue stage, once the prefetch request's response is received, the prefetcher checks the metadata to see if there is any vacancy in the DRAM cache. If there is a vacancy, [...]
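Tying the pieces together, the sketch below mirrors the demand and prefetch handling flow of Fig. 7 using simplified stand-in structures of our own: a demand is redirected to local DRAM on a metadata hit, while each prefetch candidate is dropped if it is redundant or if the prefetch queue is at its threshold.

#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified stand-ins for the metadata table and prefetch queue.
std::unordered_map<uint64_t, uint64_t> metadata;   // FAM block -> DRAM cache block
std::unordered_set<uint64_t> prefetch_queue;       // in-flight prefetch block addresses
constexpr size_t kQueueCapacity = 32;
constexpr double kIssueThreshold = 0.95;           // assumed drop threshold (e.g. 95%)

// Demand path: redirect to the local memory controller on a DRAM cache hit.
void handle_demand(uint64_t fam_block) {
    auto it = metadata.find(fam_block);
    if (it != metadata.end())
        std::printf("demand 0x%llx: hit, read DRAM cache block %llu\n",
                    (unsigned long long)fam_block, (unsigned long long)it->second);
    else
        std::printf("demand 0x%llx: miss, forward to FAM\n", (unsigned long long)fam_block);
}

// Prefetch path: skip redundant candidates, drop when the queue is nearly full.
void handle_prefetch_candidates(const std::vector<uint64_t>& candidates) {
    for (uint64_t block : candidates) {
        bool redundant = metadata.count(block) || prefetch_queue.count(block);
        bool queue_busy = prefetch_queue.size() >= kQueueCapacity * kIssueThreshold;
        if (redundant || queue_busy) continue;     // dropped, never sent to FAM
        prefetch_queue.insert(block);              // issued; removed on response
        std::printf("prefetch 0x%llx issued to FAM\n", (unsigned long long)block);
    }
}

int main() {
    metadata[0x1000] = 7;                          // block 0x1000 already cached
    handle_demand(0x1000);                         // hit
    handle_demand(0x1001);                         // miss, goes to FAM
    handle_prefetch_candidates({0x1000, 0x1002});  // first is redundant, second issues
    return 0;
}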
[...] empty and the demand deficit is greater than 0, then we issue a demand request, resulting in a decrement of the demand deficit [...]

6) Relative FAM latency - Ratio of the average FAM access latency for a workload in a given configuration to that of the workload running in the baseline configuration (lower is better).

7) Relative DRAM prefetch requests issued - Ratio of the DRAM cache prefetches issued for a given configuration to the DRAM cache prefetches issued with no optimizations (FIFO scheduling and no prefetch BW adaptation).

8) Demand hit fraction - Fraction of demand requests that miss in the LLC and hit in the DRAM cache.

9) Core prefetch hit fraction - Fraction of core prefetch requests that miss in the LLC and hit in the DRAM cache.

B. Performance gain with Prefetch Bandwidth Adaptation

For this analysis, we run workloads in 1, 2, and 4 node system configurations. Each workload ran in 3 prefetch configurations - core prefetcher turned ON; core prefetcher + DRAM cache prefetcher turned ON; and core prefetcher + DRAM cache prefetcher + prefetch BW adaptation turned ON. We call the second configuration non-adaptive DRAM prefetch. Each workload has a memory allocation ratio of 8.

Fig. 10 outlines the results of our experiments. Fig. 10A shows the geomean IPC gain of all benchmarks for each prefetch configuration, across the 3 node configurations. Across the board, DRAM prefetching improves overall performance compared to core prefetching. Core prefetching resulted in IPC gains of 1.20, 1.18, and 1.10 for 1, 2, and 4 node systems. With both core prefetching and DRAM cache prefetching turned ON, the same IPC gains increased to 1.26, 1.24, and 1.11 respectively. The performance improvement comes from the reduction in FAM access latency, as indicated in Fig. 10B. DRAM cache prefetching reduced the average FAM access latency by 29% and 34% for 1 and 2 node systems respectively.

Fig. 10: Evaluation of DRAM cache prefetcher with prefetch bandwidth adaptation (A: geomean IPC gain; B: relative FAM latency; C: relative DRAM cache prefetches issued; D: demand and core-prefetch hit fractions in the DRAM cache)
Prefetch BW adaptation further enhanced the performance of DRAM cache prefetching for 2 and 4 node systems. BW adaptation resulted in 4% and 8% IPC improvements over non-adaptive DRAM cache prefetching for 2 and 4 node systems. Non-adaptive DRAM cache prefetching performed poorly in the 4 node configuration, with no IPC improvement over core prefetching, emphasising the importance of BW adaptation in bandwidth-constrained systems. BW adaptation resulted in decrements of 7% and 13% in relative FAM latency over non-adaptive DRAM prefetch.

Fig. 10C presents the relative number of DRAM prefetch requests. The results indicate that adaptation caused 18% and 21% fewer DRAM cache prefetches to be issued to FAM in 2-node and 4-node systems respectively. The decreased DRAM prefetch issue rate resulted in decreased demand and core-prefetch hit rates, as shown in Fig. 10D. For instance, BW adaptation reduced the demand and core-prefetch hit fractions from 57% and 83% to 50% and 72% for the 4-node system. Performance improved despite the hit fraction decrement, which reveals that prefetch requests were causing considerable queuing delays at FAM.

Further, we analyze the IPC gain with prefetch bandwidth adaptation for the 4-node system across individual benchmarks, as shown in Fig. 11. DRAM cache prefetch significantly improves IPC for applications like dedup, LU, 628.pop2_s, mg, is, and facesim. Workloads like canneal, bfs, cc, and bc saw an IPC decrement with DRAM cache prefetch, possibly due to increased FAM latency. BW adaptation was able to improve the IPC substantially for these applications, except for cc.

Fig. 11: IPC gain due to BW adaptation for the 4-node system. The geomean from this analysis is represented in "4 Nodes" in Fig. 10A.

BW adaptation can mitigate congestion only when the DRAM prefetches are in some proportion responsible for creating it, because our algorithm implementation throttles only the DRAM cache prefetch issue rate. BW adaptation would be of little help if core prefetches are responsible for the congestion. Future implementations of our prefetch throttling algorithm can relay the congestion occurrence to the CPU cache controllers, enabling dynamic throttling of CPU prefetch requests.
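As a purely hypothetical illustration of a source-side, latency-driven throttle in the spirit of the network congestion algorithms cited in the introduction, the sketch below raises the allowed prefetch issue rate while observed FAM latency stays low and cuts it back when latency rises. The thresholds and the additive-increase/multiplicative-decrease policy are our assumptions, not the paper's bandwidth adaptation mechanism.

#include <algorithm>
#include <cstdio>

// Hypothetical source-side throttle: adapts the number of DRAM cache
// prefetches allowed per window based on observed FAM access latency.
struct PrefetchRateAdapter {
    int    rate     = 8;       // prefetches allowed per adaptation window
    int    max_rate = 32;
    double low_ns   = 300.0;   // below this latency, ramp up (assumed threshold)
    double high_ns  = 600.0;   // above this latency, back off (assumed threshold)

    void observe_window(double avg_fam_latency_ns) {
        if (avg_fam_latency_ns > high_ns)
            rate = std::max(1, rate / 2);          // multiplicative decrease
        else if (avg_fam_latency_ns < low_ns)
            rate = std::min(max_rate, rate + 1);   // additive increase
        // otherwise hold the current rate
    }
};

int main() {
    PrefetchRateAdapter adapter;
    // Latency climbs as FAM becomes congested, then recovers.
    for (double lat : {250.0, 280.0, 700.0, 750.0, 400.0, 260.0}) {
        adapter.observe_window(lat);
        std::printf("avg FAM latency %.0f ns -> prefetch rate %d/window\n", lat, adapter.rate);
    }
    return 0;
}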
C. Performance gain with WFQ scheduling

We evaluate our WFQ scheduling algorithm with 3 weights - 1, 2, and 3 (a weight of 3 indicates that demands and prefetches are served in a 3:1 ratio). Each workload is run in 1, 2, and 4 node system configurations, with WFQ scheduling at the FAM controller and with different weights. The performance of WFQ with different weights is compared to FIFO-scheduled (non-adaptive) prefetch. The core prefetcher is active for all the configurations examined here. Fig. 12 shows our results.
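As an illustration of the scheduling policy being evaluated, the sketch below implements a simple weighted scheme of our own in which, for a weight of W, up to W demand requests are served for every prefetch request whenever both queues are non-empty. It is a stand-in for exposition, not the authors' FAM controller implementation.

#include <cstdint>
#include <cstdio>
#include <deque>

// Two queues at the FAM controller: demands (LLC misses) and tagged prefetches.
struct FamScheduler {
    std::deque<uint64_t> demands, prefetches;
    int weight;        // e.g. 3 => demands and prefetches served in a 3:1 ratio
    int credit = 0;    // demands served since the last prefetch

    explicit FamScheduler(int w) : weight(w) {}

    // Pick the next request to send to the FAM device.
    bool next(uint64_t& addr, bool& is_prefetch) {
        bool favor_prefetch = credit >= weight || demands.empty();
        if (favor_prefetch && !prefetches.empty()) {
            addr = prefetches.front(); prefetches.pop_front();
            is_prefetch = true; credit = 0; return true;
        }
        if (!demands.empty()) {
            addr = demands.front(); demands.pop_front();
            is_prefetch = false; ++credit; return true;
        }
        return false;  // nothing pending
    }
};

int main() {
    FamScheduler sched(/*weight=*/3);
    for (uint64_t a = 0; a < 6; ++a) sched.demands.push_back(0xD000 + a);
    for (uint64_t a = 0; a < 3; ++a) sched.prefetches.push_back(0xF000 + a);

    uint64_t addr; bool pf;
    while (sched.next(addr, pf))
        std::printf("%s 0x%llx\n", pf ? "prefetch" : "demand  ", (unsigned long long)addr);
    return 0;
}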
Fig. 12: Evaluation of DRAM cache prefetcher with WFQ scheduling (A: IPC gain; B: relative FAM latency; C: relative DRAM prefetches issued; D: DRAM cache hit fraction)

The geomean of IPC across all benchmarks for the WFQ algorithm, for a given weight and node configuration, is shown in Fig. 12A. WFQ improves performance over the FIFO scheduler for 2 and 4 node systems. Weights 1, 2, and 3 improve the average IPC by 8% (3%), 9% (4%), and 9% (4%) over the FIFO scheduler for a 4 (2) node system. Again, the increase in IPC is due to the decrement in relative FAM latency. For a 4 (2) node system, the average relative FAM latency is reduced by 24% (10%). Given that BW adaptation resulted in a 7% IPC improvement over the FIFO scheduler, WFQ performs marginally better.

WFQ also resulted in fewer DRAM prefetches being issued. For a 4 node system, WFQ with weights 1, 2, and 3 resulted in 17%, 31%, and 37% decrements in the average relative DRAM prefetches issued. Such behavior is expected because, as the weightage given to demands increases, prefetch request latency increases, filling the prefetch queue and subsequently causing fewer prefetches to be issued. Fig. 12D shows the demand and core-prefetch hit fractions with WFQ across different node configurations and weights.

Additionally, we analyze the IPC gain with WFQ for the 4-node system across individual benchmarks, as shown in Fig. 13. The set of workloads that benefited from BW adaptation [...] prefetch requests into the same queue. WFQ can potentially mitigate congestion due to either request type. For this reason, WFQ performs marginally better in comparison to bandwidth adaptation.

Fig. 13: IPC gain due to WFQ for the 4-node system. The geomean from this analysis is represented in "4 Nodes" in Fig. 12A.

D. Performance analysis of multi-node workload mixes

Combining the methodology of §V-B and §V-C, we evaluated our 7 multi-workload mixes with a total of 5 prefetch configurations. Each mix is run on a 4 node system. Fig. 14 shows the IPC gain for each of the workload mixes. BW adaptation and WFQ provide equivalent IPC improvements for mix1, mix3, mix6, and mix7. Mix5 saw a slight IPC decrement with BW adaptation, but a performance improvement with WFQ. BW adaptation outperformed WFQ by 16% for mix4. WFQ outperformed BW adaptation by 5% for mix2. On average, BW adaptation and WFQ resulted in 10% and 9% IPC gains over the non-adaptive prefetcher (FIFO scheduler) respectively.

Fig. 14: Performance gain with different configurations of DRAM prefetch across 7 multi-node workload mixes

This analysis reveals that both of these approaches are useful in resolving congestion at FAM. But the relative performance gain due to either technique depends not just on the workload alone, but also on the co-existing workloads that are accessing FAM.

E. Performance improvement across allocation ratios

We analyze the impact of DRAM cache prefetch along with the proposed optimizations across different allocation ratios. We considered 4 prefetch configurations for this experiment - core prefetcher ON, core + DRAM cache prefetch, [...]

Fig. 15: IPC gain with DRAM prefetch vs. FAM-DRAM allocation ratio, wrt performance with the entire memory in local DRAM [...]
(Figure: DRAM cache size sensitivity analysis across different workloads)

[...] requests and fair queuing have been studied earlier in the context of multiprocessor systems where resources are shared [20], [31]. We leverage this earlier work in a different context. Blue [53] considers timeliness of prefetches in order to [...] chip and doesn't have access to such information.