
MODeL: Memory Optimizations for Deep Learning

Benoit Steiner 1  Mostafa Elhoushi 2  Jacob Kahn 2  James Hegarty 3

Abstract

The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have proposed various techniques including spilling, recomputation, reduced precision training, model pruning, and so on. However, these approaches suffer from various limitations: they can increase training time, affect model accuracy, or require extensive manual modifications to the neural networks.

We present MODeL, an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks. Our method automatically reduces the memory usage of existing neural networks without any of the drawbacks of other techniques.

We formulate the problem as a joint integer linear program (ILP). We present several techniques to simplify the encoding of the problem, and enable our approach to scale to the size of state-of-the-art neural networks using an off-the-shelf ILP solver. We experimentally demonstrate that MODeL only takes seconds to allow the training of neural networks using 30% less memory on average. MODeL is an open-source project available at https://github.com/facebookresearch/model_opt.

Figure 1. The number of deep neural network parameters has increased by 100,000 fold over the last 10 years, starting to grow exponentially around 2016. The x-axis (parameter count, in millions) is plotted on a log scale.

1. Introduction

Scale is a major force behind the accuracy improvements of machine-learning-based solutions (Bubeck & Sellke, 2021), and both the depth and width of deep neural networks (DNN) are expanding exponentially (Sevilla et al., 2022) (Figure 1). This inflation in size increases the memory needed to store the weights of the neural network and the intermediate results (e.g., activations and gradients) generated during the training process. Compounding the problem, researchers are training neural networks on larger inputs, such as high-resolution images (Dong et al., 2016; Tai et al., 2017), video (Feichtenhofer et al., 2019), three dimensional point-clouds (Chen et al., 2017), long natural language sequences (Vaswani et al., 2017; Child et al., 2019; Devlin et al., 2018), and using larger batch sizes to increase efficiency (Smith et al., 2018).

Unfortunately, due to the slowing of Moore's law, the memory capacity of hardware has only increased linearly over the last decade (Figure 2). Thus, the amount of memory available on the hardware used to train DNNs has not kept pace with the needs of deep learning. Furthermore, features powered by machine learning, such as automatic speech recognition (Paulik et al., 2021) or keyboard suggestions (Hard et al., 2018), are being personalized by fine tuning models on-device. This means that model training is increasingly being pushed to even more memory constrained edge devices such as smartphones. As a result, memory is increasingly becoming a bottleneck that hinders progress, and researchers frequently mention memory scarcity as a limiting factor that impacts their work (Krizhevsky et al., 2012; He et al., 2016; Chen et al., 2015; Dai et al., 2019; Child et al., 2019).

1 Anthropic, San Francisco, USA. 2 Meta, FAIR, Menlo Park, USA. 3 Meta, Reality Labs, Seattle, USA. Correspondence to: Benoit Steiner <[email protected]>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Figure 2. The memory capacity of NVidia datacenter GPUs (in gigabytes) has only increased tenfold over the last decade, which has not kept pace with the rapidly increasing size of deep neural networks. The x-axis (memory, in GB) is plotted on a linear scale.

Popular deep learning frameworks such as PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016) do not fully utilize the limited memory available. Similar to traditional dynamic memory allocators such as tcmalloc (Google) and jemalloc (Evans), these frameworks maintain a pool of free blocks of memory at runtime. To serve memory requests, they look for a large enough memory block in the memory pool, or allocate it from the physical memory if none is available. This results in memory fragmentation when free memory blocks do not exactly match the size of an allocation request, which occurs frequently.

Furthermore, DNN frameworks do not optimize tensor lifetimes. PyTorch (Paszke et al., 2019) executes operations in the order in which they are defined in the program. TensorFlow (Abadi et al., 2016) keeps a queue of operators that are ready to run, and executes them on a first-come, first-served basis. As a result, tensors can be allocated earlier than required, or freed later than necessary, wasting valuable memory.

Our method overcomes these two limitations of existing deep learning frameworks. We model the computations performed to train a deep neural network as a dataflow graph of operations. We analyze this graph to find a topological ordering of the nodes that adjusts the lifetime of the tensors generated by these operations to minimize the peak amount of memory that needs to be allocated (Figure 3). Furthermore, we find an optimal packing of these tensors, which minimizes memory fragmentation (Figure 4). We encode these two objectives as an integer linear program (ILP) that can be solved quickly by commodity solvers, and present MODeL (Memory Optimizations for Deep Learning), our algorithm for memory-optimal training of neural networks.

In addition to significantly reducing memory usage, our solution has several key strengths. First, it does not impact the accuracy of the predictions of the neural networks. Second, it requires no modification to the neural network or the training procedure. Third, it doesn't increase training time. Fourth, it is orthogonal to and can be combined with other memory reduction techniques, such as the ones listed in section 2, to further reduce the memory needs of a neural network.

Our work makes the following novel contributions:

• We formulate the problem of finding the lifetime and memory location of tensors that minimizes the peak memory required to train neural networks as a joint integer linear program.
• We demonstrate how to leverage domain knowledge to simplify the ILP formulation, which enables off-the-shelf solvers to quickly reduce the memory usage of large DNNs.
• We study empirically the practicality and effectiveness of our solution on a wide variety of DNNs, which achieves average memory savings exceeding 30% in a median time of less than 7 seconds.
• We provide an open source implementation of MODeL at https://github.com/facebookresearch/model_opt.

2. Related Work

Various approaches, complementary to ours, have been proposed to break the "memory wall" and train larger networks. The first technique distributes the computation for a single forward-backward iteration over several hardware devices, thus making more memory available overall. However, this approach, known as model parallelism (Karakus et al., 2021), significantly increases the financial cost of training deep neural networks since it requires access to additional expensive compute accelerators and fast networks. Furthermore, partitioning a deep neural network efficiently to balance communication and computation remains an open problem still actively researched (Gholami et al., 2018; Jia et al., 2018; Mirhoseini et al., 2018).

In parallel, the research community has developed numerous solutions to reduce the memory footprint of neural networks:

• Novel neural network architectures reduce the number of parameters needed to achieve a given level of accuracy (Iandola et al., 2017; Tan & Le, 2019). Furthermore, automated search techniques known as neural architecture search (Tan et al., 2019) have been proposed to automatically design memory efficient models. The main drawbacks of these methods are that they are time consuming to deploy, and fail to match the result quality of state of the art DNNs.

• Model compression methods (Blalock et al., 2020) prune (Louizos et al., 2018; Frankle & Carbin, 2018; Molchanov et al., 2019; Elkerdawy et al., 2022; He et al., 2020) or share (Dehghani et al., 2019) weights to improve the efficiency of the model parameterization. However, the majority of these techniques require training the unpruned neural network first, and are therefore most useful for inference.

• Training using reduced precision arithmetic on 16-bit floating point or even quantized representations (Wang et al., 2018; Zhu et al., 2020; Kalamkar et al., 2019) significantly reduces memory (Fan et al., 2020; Lin et al., 2016). However, these techniques can compromise the accuracy of the neural networks, make training unstable, and require careful implementation to be deployed successfully (NVidia; Lin & Talathi, 2016).

Several efforts have looked at the problem from a systems perspective, and presented solutions to reduce pressure on the memory subsystem. These techniques encompass:

• In-memory tensor compression, which can result in minimal accuracy loss in many DNN applications (Chen et al., 2021; Jain et al., 2018). However, this comes with a runtime penalty, since the data must be compressed and uncompressed on the fly.

• Rematerialization, also known as checkpointing, discards activations in the forward pass to save memory, and recomputes those values as needed when computing the gradients. Numerous strategies to identify which activations to discard have been proposed (Jain et al., 2020; Zheng et al., 2020; Chen et al., 2016; Griewank & Walther, 2000; Shah et al., 2021). While effective at reducing memory usage, these techniques add extra computations, which increases the training time.

• Paging, aka spilling, consists of moving data between a small but high bandwidth and low latency memory pool, and a large but slow external memory. This has been demonstrated to effectively offload the data stored on a GPU device onto the host memory (Peng et al., 2020; Hildebrand et al., 2020; Meng et al., 2017), but again increases training time due to extra memory transfers.

• More recently, combining several of these techniques has been proposed to increase their effectiveness and mitigate their drawbacks (Beaumont et al., 2021; Patil et al., 2022) without fully eliminating them.

Additionally, some techniques developed primarily to increase execution speed are also beneficial for memory:

• Operator fusion can reduce memory footprint by avoiding the need to materialize large intermediate buffers and keep them around for backpropagation (Niu et al., 2021).

• Machine learning frameworks such as PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016) allow some of their operators to store the data they generate in one of their input tensors, thus avoiding the need to allocate an output tensor. This is known as in-place-update, and saves memory. However, users must manually modify their neural networks to leverage this capability, and it can lead to correctness issues if used indiscriminately (Paszke et al., 2017).

Optimizing the location of tensors in memory to reduce fragmentation, also known as the dynamic storage allocation problem, is NP-hard (Garey & Johnson, 1979). This problem has been studied in the context of deep learning by other researchers (Sekiyama et al., 2018) who proposed an exact formulation to minimize the memory fragmentation of deep neural networks. However, their approach scaled poorly and only succeeded in optimizing two small neural networks in inference mode. As a result, they ultimately advocated for a heuristics based approach.

Improving the lifetime of tensors has also been studied before. Liberis et al. (Liberis & Lane, 2020) and Serenity (Ahn et al., 2020) looked for a memory-optimal execution schedule by enumerating the topological orders of the DNN graph and calculating their peak memory usage. To speed things up, they both proposed dynamic programming based optimizations to prune the number of orderings they needed to consider. However, the complexity of their algorithms remains prohibitive at O(|V| · 2^|V|) in both cases, and they only managed to make them work for inference on tiny graphs. Lin et al. (Lin et al., 2022) also mentioned reordering computations as a way to enable operator fusion and reduce the peak memory footprint while training. Unfortunately, they didn't describe the algorithm they used to find a suitable node ordering.

3. Background

3.1. Representing Neural Networks as Dataflow Graphs

Deep neural networks can be represented using dataflow graphs, as pioneered by TensorFlow (Abadi et al., 2016).

The nodes of the graph encode the computations to be performed (e.g. matrix multiplications, convolutions, activation functions), while the edges represent the data (aka tensor or array) that is produced by an operation and transferred to consumer nodes.

Due to the producer-consumer relation between connected nodes, edges are oriented. Each edge has exactly one source, which is the operator that generated the corresponding tensor. Since a tensor can be consumed by more than one node, edges can have multiple sinks.

Operators can have multiple incoming (aka fanin) edges. Typically, one of these incoming edges will be the tensor generated by the previous layer, and another one will be a weight tensor. Similarly, operators can have multiple outgoing (aka fanout) edges: while most operations generate a single output tensor, some may create two or more. Operators with no fanout edges are used to model the final outputs of the neural network. Operators without fanin edges can model random number generators, constants, weights, or initial inputs to the neural network.

In the remainder of this paper, we assume that the graphs are acyclic. In practice, this is not a significant limitation since recurrent neural networks such as LSTM (Hochreiter & Schmidhuber, 1997) have been eclipsed by transformers (Vaswani et al., 2017). Furthermore, their loops can be unrolled to avoid the problem altogether.
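To make this notation concrete, the sketch below shows one possible in-memory representation of such a dataflow graph. The class, field, and helper names are our own illustration, not the data structures used by MODeL.

    from dataclasses import dataclass, field
    from typing import List, Optional

    # eq=False keeps the default identity-based hashing so that nodes and
    # edges can be used as dictionary keys in the sketches that follow.
    @dataclass(eq=False)
    class Edge:
        """A tensor produced by one operator and consumed by zero or more others."""
        name: str
        size: int                                   # size of the tensor, in bytes
        source: Optional["Node"] = None             # the unique producer (None for graph inputs)
        sinks: List["Node"] = field(default_factory=list)  # the consumers of the tensor

    @dataclass(eq=False)
    class Node:
        """An operator of the neural network (matmul, convolution, activation, ...)."""
        name: str
        fanin: List[Edge] = field(default_factory=list)    # input tensors
        fanout: List[Edge] = field(default_factory=list)   # output tensors

    def connect(src: Node, edge: Edge, *sinks: Node) -> None:
        """Register `edge` as an output of `src` and as an input of every sink."""
        edge.source = src
        src.fanout.append(edge)
        for sink in sinks:
            edge.sinks.append(sink)
            sink.fanin.append(edge)
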
3.2. Optimizing Tensor Lifetimes

For an operator to run, all its input tensors must be resident in memory, and its output tensors must have been allocated so that they can be written to while the node executes. Additionally, to avoid recomputing tensors, once a tensor is generated it must be preserved in memory until all its consumers have been run.

We define the resident set RS(s) at a given step s in the execution of a neural network as the set of tensors that need to be kept in memory at that point in time. It comprises the tensors in the fanin and fanout of the operator that is scheduled for execution at step s, as well as all the other tensors that were previously generated but need to be kept in memory to be able to run subsequent operators. The peak resident set is the largest resident set over the execution of the network.

The order in which nodes are executed impacts the lifetime of the tensors, and therefore the peak working set. Figure 3 illustrates a simple example in which changing the operator ordering noticeably improves memory usage.

Figure 3. Node execution orders can impact peak memory usage. Edges are annotated with the size of their corresponding tensors (e1: 10Mb, e2: 10Mb, e3: 20Mb, e4: 30Mb, e5: 5Mb, e6: 10Mb), and the two feasible node orders are annotated with the set of tensors resident in memory at each step. Running v3 before v2 is significantly more memory efficient.
Order #1, peak = 60Mb: v1 {e1, e2, e3} 40Mb; v2 {e2, e3, e4} 60Mb; v3 {e3, e4, e5} 55Mb; v4 {e4, e5, e6} 45Mb.
Order #2, peak = 45Mb: v1 {e1, e2, e3} 40Mb; v3 {e2, e3, e5} 35Mb; v2 {e2, e4, e5} 45Mb; v4 {e4, e5, e6} 45Mb.

Among all possible node orderings, those prioritizing the execution of nodes that free large amounts of data while generating little output data themselves, are likely to be more efficient. However, as demonstrated in prior works (Bernstein et al., 1989; Bruno & Sethi, 1976), finding an optimal scheduling for a generic DAG is an NP-complete problem, which cannot be solved with a simple greedy approach.
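The effect shown in Figure 3 follows directly from this definition: for a fixed execution order, the peak resident set is the maximum over all steps of the total size of the live tensors. The sketch below evaluates a given order using the hypothetical Edge/Node classes and connect helper from section 3.1; it illustrates the metric being optimized, not MODeL's optimization algorithm.

    def peak_memory(order, edges):
        """Return the peak resident-set size of a given execution order.

        `order` is a list of Node objects in execution order; `edges` lists all
        Edge objects. An edge without a source is a graph input and is assumed
        to be resident from the first step; an edge without sinks only needs to
        be resident at the step that produces it.
        """
        position = {node: step for step, node in enumerate(order)}
        peak = 0
        for step in range(len(order)):
            resident = 0
            for e in edges:
                start = position[e.source] if e.source is not None else 0
                end = max((position[s] for s in e.sinks), default=start)
                if start <= step <= end:
                    resident += e.size
            peak = max(peak, resident)
        return peak

    # Rebuilding the example of Figure 3 (sizes in Mb) reproduces its numbers.
    e1, e2 = Edge("e1", 10), Edge("e2", 10)
    e3, e4, e5, e6 = Edge("e3", 20), Edge("e4", 30), Edge("e5", 5), Edge("e6", 10)
    v1, v2, v3, v4 = Node("v1"), Node("v2"), Node("v3"), Node("v4")
    e1.sinks, v1.fanin = [v1], [e1]       # e1 and e2 are graph inputs
    e2.sinks, v2.fanin = [v2], [e2]
    connect(v1, e3, v3)
    connect(v2, e4, v4)
    connect(v3, e5, v4)
    connect(v4, e6)
    edges = [e1, e2, e3, e4, e5, e6]
    print(peak_memory([v1, v2, v3, v4], edges))  # 60, as in Order #1
    print(peak_memory([v1, v3, v2, v4], edges))  # 45, as in Order #2
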
3.3. Optimizing Tensor Locations in Memory

Similar to malloc-style memory allocators, the tensor allocation schemes used by typical deep learning frameworks operate online and suffer from fragmentation. Indeed, free memory is often segregated into small blocks and interspersed by memory allocated to live tensors. As a result, a significant fraction of the total memory is effectively unusable because it is divided into pieces that are not large enough to fit a tensor. Figure 4 illustrates this phenomenon and demonstrates how planning the location of each tensor ahead of time can significantly reduce the overall peak memory usage.

4. Formulation

We propose to take advantage of the predictability of neural network computations to proactively optimize the lifetime and location of tensors in memory.

We formulate the problem of optimizing the ordering of computations (which determines the tensor lifetimes) and the location of tensors in memory (which determines the amount of memory fragmentation) of generic data-flow graphs, including those used in neural network training. We encode the problem as an integer linear program (Wikipedia, 2023) and use an off-the-shelf ILP solver to find a solution that minimizes the peak memory required to run the dataflow graph.

We solve the ILP problem ahead of time, before the training process starts. This results in a small one-time initial cost, which is negligible compared to the time it takes to train a neural network (see section 5.3).

Figure 4. Memory fragmentation can significantly increase the memory needed to store tensors. A greedy allocator (top) would not leave any room between tensors A and B, thus making it impossible to reuse the space left once tensor A is freed to store tensor C. MODeL (bottom) leaves a gap between tensors A and B to enable the reuse of the memory freed by tensor A and fits all the tensors in less memory.

4.1. DNN Representation

As mentioned in section 3.1, we model a neural network as a directed acyclic graph G = (V, E) with n nodes V = v1, ..., vn that represent the operators of the neural network, and m edges E = e1, ..., em that encode the tensors exchanged by operators. The size in bytes of the tensor represented by edge ei is denoted as Si. The source vertex of edge e is denoted src(e). The set of sink vertices of edge e is denoted snks(e).

The set of edges in the fanout of a node v is denoted fo(v), while the set of edges in its fanin is represented as fi(v). We will also denote fi(e) the set of edges in the fanin of the source vertex of e. We represent by sib(e) the siblings to an edge e, that is the collection of edges that are driven by the same source vertex.

We model the execution of a neural network as a sequence of discrete steps S = s1, ..., sn. At least one operator is executed at each step, and therefore, we need at most n steps to schedule a graph of n operators.

4.2. Encoding Tensor Lifetimes

We track which tensors are allocated and which tensors are preserved in memory at each execution step. To do this, we use two sets of binary variables:

• A variable labeled C_{e,s} ∈ {0, 1} indicates whether or not the tensor e should be created (i.e. allocated) at step s by running its source vertex.
• A variable named P_{e,s} ∈ {0, 1} reflects whether tensor e needs to be preserved in memory at step s or whether it can be freed.

We leverage a set of linear constraints to ensure that the sequence of tensor creations and preservations reflects a valid execution sequence of the neural network corresponding to a feasible topological ordering of the DAG.

First, a tensor e can either be created or preserved at each step s, but not both (equation 1). Note that it's possible for both C_{e,s} and P_{e,s} to be false, which indicates that the tensor does not reside in memory at this point in time.

∀e ∈ E, ∀s ∈ S:   C_{e,s} + P_{e,s} ≤ 1        (1)

Second, a tensor e can be preserved in memory at step s if and only if it was created or preserved at the previous step (equation 2).

∀e ∈ E, ∀s ∈ S:   P_{e,s} ≤ P_{e,s−1} + C_{e,s−1}        (2)

Third, to ensure that the solver does not simply avoid running any of the operators, we force every tensor e to be created once through equation 3.

∀e ∈ E:   Σ_{s ∈ S} C_{e,s} = 1        (3)

Fourth, a tensor e can only be created by running its source operator v. In order to do so, all the tensors in the fanin of v must be present in memory (equation 4).

∀e ∈ E, ∀s ∈ S, ∀f ∈ fi(e):   C_{e,s} ≤ P_{f,s}        (4)

Last but not least, we also need to make sure that operators with multiple outputs create their output tensors at the same time. We achieve this by tying the values of the C_{f,s} variables for all the siblings f to a tensor e in equation 5.

∀e ∈ E, ∀s ∈ S, ∀f ∈ sib(e):   C_{f,s} = C_{e,s}        (5)

The combination of constraints 1 through 5 ensures that all the feasible solutions to the ILP correspond to valid schedules. They guarantee that the creation step of each tensor corresponds to a topologically feasible ordering of the vertices of the graph. Moreover, they force the preservation in memory of each tensor from the time it is generated until the last step in which it is consumed.
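Constraints (1) through (5) translate almost directly into solver code. The sketch below builds them with the gurobipy API (the solver family used in section 5.1). It reuses the hypothetical Edge/Node classes from section 3.1, treats edges without a producer node as created at the first step, and omits the simplifications of appendix A, so it is an illustration of the encoding rather than MODeL's actual implementation.

    import gurobipy as gp
    from gurobipy import GRB

    def build_lifetime_model(nodes, edges):
        """Build the C/P variables and constraints (1)-(5) for a dataflow graph."""
        # One extra step so that graph inputs, which have no producer node in
        # this sketch, can occupy the first step on their own.
        steps = range(len(nodes) + 1)
        m = gp.Model("model_lifetimes")

        C = {}  # C[e.name, s] == 1 iff tensor e is created (allocated) at step s
        P = {}  # P[e.name, s] == 1 iff tensor e is preserved in memory at step s
        for e in edges:
            for s in steps:
                C[e.name, s] = m.addVar(vtype=GRB.BINARY, name=f"C_{e.name}_{s}")
                P[e.name, s] = m.addVar(vtype=GRB.BINARY, name=f"P_{e.name}_{s}")

        for e in edges:
            fanin = e.source.fanin if e.source is not None else []       # fi(e)
            siblings = e.source.fanout if e.source is not None else [e]  # sib(e)
            # (3) every tensor is created exactly once.
            m.addConstr(gp.quicksum(C[e.name, s] for s in steps) == 1)
            # Graph inputs have no producer to schedule: pin them to step 0.
            if e.source is None:
                m.addConstr(C[e.name, 0] == 1)
            for s in steps:
                # (1) a tensor cannot be both created and preserved at the same step.
                m.addConstr(C[e.name, s] + P[e.name, s] <= 1)
                # (2) preservation requires creation or preservation at the previous step.
                if s > 0:
                    m.addConstr(P[e.name, s] <= P[e.name, s - 1] + C[e.name, s - 1])
                else:
                    m.addConstr(P[e.name, 0] == 0)  # nothing can be preserved before step 0
                # (4) creating e requires every tensor in the fanin of src(e) to be resident.
                for f in fanin:
                    m.addConstr(C[e.name, s] <= P[f.name, s])
                # (5) operators with multiple outputs create all of them at the same time.
                for f in siblings:
                    m.addConstr(C[f.name, s] == C[e.name, s])
        return m, C, P
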

4.3. Encoding Tensor Locations

To let our solver also optimize the placement of tensors in memory, we assign an integer variable A_e ∈ [0, M] to each tensor e that encodes its base address. Here, M = Σ_e S_e, which corresponds to the worst case scenario where all the tensors reside concurrently in memory.

We also introduce two binary variables a_{i,j} ∈ {0, 1} and b_{i,j} ∈ {0, 1} for each pair of tensors i and j. We constrain them through equation 6 in such a way that either a_{i,j} or b_{i,j} is equal to 1 if both tensors reside in memory concurrently at any point in time, but can be 0 otherwise.

∀(i, j) ∈ E²:   a_{i,j} + b_{i,j} ≤ 1
                a_{i,j} + b_{i,j} ≥ live_{i,t} + live_{j,t} − 1        (6)
                where live_{i,t} = C_{i,t} + P_{i,t} and live_{j,t} = C_{j,t} + P_{j,t}

We use these variables to prevent the overlap of tensors that reside in memory at the same time in equations 7a and 7b.

∀(i, j) ∈ E²:   A_i + S_i − A_j ≤ (1 − a_{i,j}) · M        (7a)
∀(i, j) ∈ E²:   A_i − A_j − S_j ≥ (b_{i,j} − 1) · M        (7b)

If a_{i,j} takes the value 1, equation 7a degenerates into A_i + S_i ≤ A_j. This forces tensor i to reside below tensor j in memory. Similarly, equation 7b degenerates into A_i ≥ A_j + S_j when b_{i,j} takes the value 1, which forces tensor i to be placed above tensor j. On the other hand, if a_{i,j} and b_{i,j} take the value 0, equations 7a and 7b hold for any value of A_i and A_j in the range [0, M]. In other words, they don't impose further restrictions on the location of e_i and e_j.

Put altogether, constraints 6, 7a, and 7b ensure that tensors can share the same memory space if and only if their lifetimes do not overlap.

4.4. Minimizing Peak Memory Usage

We track the peak memory usage by introducing a variable peak_mem that we constrain as follows:

∀e ∈ E:   A_e + S_e ≤ peak_mem        (8)

We find the schedule of operators and memory location of tensors that optimizes the memory usage of the neural network by feeding program 9 to an ILP solver.

arg min_{C,P,A} peak_mem
subject to (1), (2), (3), (4), (5), (6), (7a), (7b), (8)        (9)
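The placement variables and the objective of program (9) extend the same model. Continuing the previous sketch (again with gurobipy and without the pruning of appendix A), constraints (6), (7a), (7b) and (8) and the objective can be encoded as follows; a/b variables are created once per unordered pair of tensors.

    def add_placement(m, C, P, edges):
        """Add the address variables, constraints (6)-(8) and the objective of (9)."""
        step_ids = sorted({s for (_name, s) in C})
        M = sum(e.size for e in edges)  # worst case: every tensor resident at once

        # A[e]: base address (offset in bytes) of tensor e in the shared buffer.
        A = {e.name: m.addVar(lb=0, ub=M, vtype=GRB.INTEGER, name=f"A_{e.name}")
             for e in edges}
        peak_mem = m.addVar(lb=0, ub=M, vtype=GRB.INTEGER, name="peak_mem")

        for idx, i in enumerate(edges):
            # (8) every tensor must fit below the peak.
            m.addConstr(A[i.name] + i.size <= peak_mem)
            for j in edges[idx + 1:]:
                a = m.addVar(vtype=GRB.BINARY)  # 1 => tensor i placed below tensor j
                b = m.addVar(vtype=GRB.BINARY)  # 1 => tensor i placed above tensor j
                m.addConstr(a + b <= 1)
                # (6) tensors that are live at the same step must be ordered in memory.
                for s in step_ids:
                    live_i = C[i.name, s] + P[i.name, s]
                    live_j = C[j.name, s] + P[j.name, s]
                    m.addConstr(a + b >= live_i + live_j - 1)
                # (7a)/(7b) big-M constraints enforcing the chosen ordering.
                m.addConstr(A[i.name] + i.size - A[j.name] <= (1 - a) * M)
                m.addConstr(A[i.name] - A[j.name] - j.size >= (b - 1) * M)

        # (9) minimize the peak memory needed to run the dataflow graph.
        m.setObjective(peak_mem, GRB.MINIMIZE)
        return A, peak_mem
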
4.5. Decoding the ILP Result

Given a feasible solution to our ILP, we generate an optimized execution sequence of operations ES = (v1, ..., vn) for the neural network using function 1.

Function 1 GenerateExecutionSequence(C)
  ▷ Converts the output of the ILP into an optimized
  ▷ execution sequence of operations seq.
  seq = []
  for s in S do
    for e in E do
      if C_{e,s} = 1 and src(e) not in seq then
        add src(e) to seq
      end if
    end for
  end for
  return seq

Tensors are stored in a shared preallocated buffer B sized to accommodate the peak memory usage. The value of each A_e variable represents the offset location of tensor e in B. We can map memory allocation requests to addresses over multiple iterations of the training loop as follows. We'll assume that each operator generates a single output tensor for the sake of simplicity, but our approach generalizes to handle operators with multiple outputs. The kth memory allocation request corresponds to the tensor generated by the operator located at position k mod |V| in the execution sequence ES. This tensor e is to be located at address A_B + A_e, where A_B is the base address of buffer B. Memory deallocation requests are no-ops.
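A minimal sketch of this decoding step and of the allocation-request mapping is shown below. It assumes the model built in the previous sketches has been solved (so the .X attributes hold solution values) and returns plain offsets instead of performing real allocations; the function names are illustrative only.

    def decode_schedule(edges, C):
        """Function 1: turn the solved C variables into an execution sequence."""
        num_steps = 1 + max(s for (_name, s) in C)
        seq = []
        for s in range(num_steps):
            for e in edges:
                if e.source is not None and C[e.name, s].X > 0.5 and e.source not in seq:
                    seq.append(e.source)
        return seq

    def address_of_request(k, seq, A, buffer_base):
        """Map the k-th allocation request of the training loop to an address.

        Assumes, as in the text, that every operator produces a single output
        tensor, so request k corresponds to the operator at position k mod |V|.
        """
        op = seq[k % len(seq)]
        out = op.fanout[0]                       # the single output tensor of op
        return buffer_base + int(A[out.name].X)  # offset A_e inside buffer B
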

5. Experiments

We measured the impact of MODeL on the memory usage of DNN training. We tried to answer the following questions:

• How effective is our algorithm at reducing peak memory usage?
• How practical are our algorithms? Can they be applied to large neural networks in a reasonable amount of time?
• What are the respective contributions of our two strategies of node reordering and address generation to the overall memory reduction?

5.1. Experimental Setup

We implemented MODeL on top of PyTorch version 1.11 (Paszke et al., 2019) with torchtext 0.12 and torchvision 0.12. We leveraged torch.FX to convert neural networks into executable sequences of operator calls, and reconstructed the computation graphs from the operator arguments. We encoded and solved the memory optimization problem (equation 9) using Gurobi version 9.1.1 (Gurobi Optimization, LLC, 2022). We translated the Gurobi results into optimized execution sequences and memory locations as described in section 4.5.

We leveraged several techniques to optimize the implementation of our approach. We describe the most impactful optimizations in appendix A.

We ran all our experiments on a workstation featuring an Intel Xeon Gold 6138 CPU running at 2.0 GHz and an NVidia A100 GPU. We show how to integrate MODeL in a regular training flow in appendix B.

5.2. Methodology

We evaluated MODeL on a comprehensive set of neural networks. We included the ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017) models since they are ubiquitous and used in many downstream tasks: the former introduced the concept of residual connection, and the latter popularized the attention mechanism. We also included neural networks designed for specific tasks, such as computer vision (AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015)), video understanding (ResNet3D (Tran et al., 2018)), and large language models (BERT (Devlin et al., 2018), XLM-R (Conneau et al., 2019)).

In addition to these models that were designed to run on datacenter hardware, we also evaluated our approach on MobileNet (Howard et al., 2017). This neural network was tailored to run in resource constrained environments such as edge devices. Additionally, we trained the neural networks at batch size 1 and 32. Batch size 1 is commonly used when training a model on devices with limited memory capacity, while batch size 32 is often used when running in datacenters.

To be representative of the evolution of DNN designs over time, we made sure our models cover almost a decade of machine learning research, starting with AlexNet (Krizhevsky et al., 2012) which was published back in 2012 and ending with VIT (Dosovitskiy et al., 2020) which was released in 2020. We also tested our approach on MNASNet (Tan et al., 2019), a model designed by a computer using an automated process called neural architecture search (Elsken et al., 2019).

To validate the scalability of our solution, we tested it on neural networks as small as AlexNet (Krizhevsky et al., 2012) (118 operators) and as large as BERT (2116 operators).

5.3. Overall Memory Improvement

Optimizing for both operator ordering and the memory location of tensors using equation 9 results in a reduction in peak memory usage ranging from 8 to 45% at batch size 1, and 12 to 68% at batch size 32. The average saving was 31.4% for batch size 1 and 32.8% for batch size 32 (Figure 5).

Figure 5. Total reduction in peak memory usage (in %) during training at various batch sizes compared to PyTorch. The x-axis lists the benchmark networks of section 5.2.

The memory optimization process takes an average of 7.4 ± 0.7 seconds. In the worst case, our algorithm needs 18.1 seconds to run, and the best case is 100 milliseconds (Figure 6). This process is run only once before training the model. It introduces a negligible overhead to the total training time yet significantly reduces the peak memory usage.

Figure 6. Memory optimization times (in seconds) for training graphs at batch sizes 1 and 32.

5.4. Impact of Address Generation

We define the fragmentation of a memory allocator as the difference between the memory the allocator needs to reserve from the hardware MR and the size of the resident set RS. We measure it when MR reaches its peak value using the ratio (MR − RS)/MR.

We measured the ability of our memory optimizer to reduce memory fragmentation in two scenarios: first, when our optimizer is free to reorder operators, and second when it is forced to honor the PyTorch operator ordering. The second scenario is implemented by constraining the values of the C_{e,s} variables to be consistent with the PyTorch node ordering. In both cases, we found that our address generator was able to completely eliminate memory fragmentation on all the models of our benchmark suite. By contrast, PyTorch suffered from an average fragmentation of 7.8% at batch size 1, and 21.3% at batch size 32 (Figure 7). The PyTorch memory allocator uses a different strategy for small and large objects, which could explain why fragmentation is significantly worse for the larger batch size. However, it is unclear whether it could be modified to better handle large tensors without introducing other drawbacks.

Figure 7. PyTorch memory fragmentation (in %) during training at various batch sizes. Our method fully eliminates fragmentation.

5.5. Impact of Operator Reordering

To evaluate the impact of our tensor lifetime optimization on the overall result, we compared the ideal peak memory necessary to run various neural networks when using the PyTorch node ordering and the node ordering determined by our algorithm. For these measurements, we eliminated the impact of memory fragmentation by recording the peak memory PyTorch operators need to request from the system to run these models under both node orderings instead of the memory actually used to run the models.

We find that optimizing the order in which operators are run reduces peak memory usage by up to 38% compared to PyTorch (Figure 8). On average, our solution achieves a reduction of 23.9% at batch size 1 and 11.7% at batch size 32.

Figure 8. Reduction (in %) in ideal peak memory usage compared to PyTorch as a result of our node reordering. Ideal memory usage assumes that there is no fragmentation.

The activations generated during the forward pass are preserved in memory for the backward pass of the training. As a result, MODeL has limited ability to decrease the memory usage of the forward pass. On the other hand, the order of the computation and application of the gradients with respect to the weights offers a great deal of flexibility, which MODeL leverages to decrease the memory usage of the backward pass. However, the gradients with respect to the weights are roughly smaller than the activations by a factor of batch size. Therefore, at large batch sizes, a larger percentage of the total memory is used to store activations, while at smaller batch sizes these gradients represent a larger fraction of the total. As a result, operator reordering tends to be more effective at small batch sizes.

6. Limitations

Our approach suffers from three main limitations. First, the neural network we want to optimize must be representable using a dataflow graph. This is the case for all the major neural architectures, so we do not believe that this is a significant drawback in practice. Second, the sizes of all the tensors must be known ahead of time. In the case where sizes are variable (for example, in the case of a language model operating on sentences of variable length) one can optimize for the worst case scenario (e.g. the longest sentence). Third, our formulation assumes that tensors are stored in a single contiguous memory space. It can be extended to support multiple memory spaces (for example, when multiple devices are used to train a large neural network), but this is beyond the scope of this paper.

7. Conclusion

The limited memory capacity of the hardware used by deep learning practitioners is one of the main challenges to train state-of-the-art neural networks. This "memory wall" limits the size of the neural networks that can be trained, and ultimately impacts the quality of their predictions. Furthermore, as memory needs increase much faster than memory capacity, we expect this memory bottleneck to worsen over time.

To alleviate memory scarcity, we proposed to optimize both the lifetime and location of tensors in memory and we presented an ILP formulation of the problem.

We tested our solution, MODeL, on a wide variety of neural networks. We demonstrated experimentally that it can locate tensors optimally in memory, thus eliminating the problem of memory fragmentation. Furthermore, we showed that it further decreases the peak memory usage of deep neural networks by optimizing the lifetime of the tensors, and, by combining these 2 techniques, MODeL reduced peak memory usage by more than 30% on average.

We also emphasized the practicality of MODeL. We established empirically that it scales well and can handle large DNNs. We showed that it finds optimal memory plans in just a few seconds.

8. Acknowledgements

We would like to thank Ana Klimovic and Foteini Strati, whose insightful comments and feedback helped improve the paper. We are also grateful to the paper reviewers for their suggestions.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265–283, 2016.

Ahn, B. H., Lee, J., Lin, J. M., Cheng, H.-P., Hou, J., and Esmaeilzadeh, H. Ordering chaos: Memory-aware scheduling of irregularly wired neural networks for edge devices, 2020.

Baruch, Z. Scheduling algorithms for high-level synthesis. ACAM Scientific Journal, 5(1-2):48–57, 1996.

Beaumont, O., Eyraud-Dubois, L., and Shilova, A. Efficient Combination of Rematerialization and Offloading for Training DNNs. In NeurIPS 2021 - Thirty-fifth Conference on Neural Information Processing Systems, Virtual-only Conference, France, December 2021. URL https://hal.inria.fr/hal-03359793.

Bernstein, D., Rodeh, M., and Gertner, I. On the complexity of scheduling problems for parallel/pipelined machines. IEEE Trans. Computers, 38:1308–1313, 1989.

Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., and Guttag, J. What is the state of neural network pruning? In Dhillon, I., Papailiopoulos, D., and Sze, V. (eds.), Proceedings of Machine Learning and Systems, volume 2, pp. 129–146, 2020. URL https://proceedings.mlsys.org/paper/2020/file/d2ddea18f00665ce8623e36bd4e3c7c5-Paper.pdf.

Bruno, J. and Sethi, R. Code generation for a one-register machine. J. ACM, 23(3):502–510, jul 1976. ISSN 0004-5411. doi: 10.1145/321958.321971. URL https://doi.org/10.1145/321958.321971.

Bubeck, S. and Sellke, M. A universal law of robustness via isoperimetry. In NeurIPS 2021, December 2021. URL https://www.microsoft.com/en-us/research/publication/a-universal-law-of-robustness-via-isoperimetry/.

Chen, J., Zheng, L., Yao, Z., Wang, D., Stoica, I., Mahoney, M., and Gonzalez, J. Actnn: Reducing training memory footprint via 2-bit activation compressed training. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 1803–1813. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/chen21z.html.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. P., and Yuille, A. L. Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR, abs/1412.7062, 2015.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. ArXiv, abs/1604.06174, 2016.

Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. Multi-view 3d object detection network for autonomous driving. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6526–6534, 2017.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. ArXiv, abs/1904.10509, 2019.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116, 2019. URL http://arxiv.org/abs/1911.02116.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context, 2019. URL https://arxiv.org/abs/1901.02860.

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018. URL https://arxiv.org/abs/1810.04805.

Dong, C., Loy, C. C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:295–307, 2016.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. URL https://arxiv.org/abs/2010.11929.

Elkerdawy, S., Elhoushi, M., Zhang, H., and Ray, N. Fire together wire together: A dynamic pruning approach with self-supervised mask prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.

Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019. URL http://jmlr.org/papers/v20/18-598.html.

Evans, J. jemalloc. https://github.com/jemalloc/jemalloc.

Fan, A., Stock, P., Graham, B., Grave, E., Gribonval, R., Jegou, H., and Joulin, A. Training with quantization noise for extreme model compression. arXiv preprint arXiv:2004.07320, 2020.

Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2018. URL http://arxiv.org/abs/1803.03635.

Garey, M. R. and Johnson, D. S. Computers and intractability: A guide to the theory of np-completeness. Journal of Symbolic Logic, 1979.

Gholami, A., Azad, A., Jin, P. H., Keutzer, K., and Buluç, A. Integrated model, batch, and domain parallelism in training neural networks. Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, 2018.

Google. Tcmalloc. https://github.com/google/tcmalloc.

Griewank, A. and Walther, A. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Trans. Math. Softw., 26:19–45, 2000.

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL https://www.gurobi.com.

Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. ArXiv, abs/1811.03604, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.

He, Y., Ding, Y., Liu, P., Zhu, L., Zhang, H., and Yang, Y. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Hildebrand, M., Khan, J., Trika, S., Lowe-Power, J., and Akella, V. Autotm: Automatic tensor movement in heterogeneous memory systems using integer linear programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pp. 875–890, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371025. doi: 10.1145/3373376.3378465. URL https://doi.org/10.1145/3373376.3378465.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. ArXiv, abs/1704.04861, 2017.

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size, 2017. URL https://openreview.net/forum?id=S1xh5sYgx.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In Proceedings of the 45th Annual International Symposium on Computer Architecture, ISCA '18, pp. 776–789. IEEE Press, 2018. ISBN 9781538659847. doi: 10.1109/ISCA.2018.00070. URL https://doi.org/10.1109/ISCA.2018.00070.

Jain, P., Jain, A., Nrusimha, A., Gholami, A., Abbeel, P., Gonzalez, J., Keutzer, K., and Stoica, I. Checkmate: Breaking the memory wall with optimal tensor rematerialization. In Dhillon, I., Papailiopoulos, D., and Sze, V. (eds.), Proceedings of Machine Learning and Systems, volume 2, pp. 497–511, 2020. URL https://proceedings.mlsys.org/paper/2020/file/084b6fbb10729ed4da8c3d3f5a3ae7c9-Paper.pdf.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in accelerating convolutional neural networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2274–2283. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/jia18a.html.

Kalamkar, D. D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D. T., Jammalamadaka, N., Huang, J., Yuen, H., Yang, J., Park, J., Heinecke, A., Georganas, E., Srinivasan, S., Kundu, A., Smelyanskiy, M., Kaul, B., and Dubey, P. A study of BFLOAT16 for deep learning training. CoRR, abs/1905.12322, 2019. URL http://arxiv.org/abs/1905.12322.

Karakus, C., Huilgol, R., Wu, F., Subramanian, A., Daniel, C., Çavdar, D., Xu, T., Chen, H., Rahnama, A., and Quintela, L. Amazon sagemaker model parallelism: A general and flexible framework for large model training. CoRR, abs/2111.05972, 2021. URL https://arxiv.org/abs/2111.05972.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60:84–90, 2012.

Liberis, E. and Lane, N. D. Neural networks on microcontrollers: saving memory at inference via operator reordering, 2020.

Lin, D., Talathi, S., and Annapureddy, S. Fixed point quantization of deep convolutional networks. In International conference on machine learning, pp. 2849–2858. PMLR, 2016.

Lin, D. D. and Talathi, S. S. Overcoming challenges in fixed point training of deep convolutional networks. ArXiv, abs/1607.02241, 2016.

Lin, J., Zhu, L., Chen, W.-M., Wang, W.-C., Gan, C., and Han, S. On-device training under 256kb memory, 2022. URL https://arxiv.org/abs/2206.15472.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through l0 regularization, 2018.

Meng, C., Sun, M., Yang, J., Qiu, M., and Gu, Y. Training deeper models by gpu memory optimization on tensorflow. In NIPS 2017 Workshop on ML Systems, 2017. URL http://learningsys.org/nips17/assets/papers/paper_18.pdf.

Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q. V., and Dean, J. A hierarchical model for device placement. In ICLR, 2018.

Molchanov, P., Mallya, A., Tyree, S., Frosio, I., and Kautz, J. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Niu, W., Guan, J., Wang, Y., Agrawal, G., and Ren, B. Dnnfusion: Accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pp. 883–898, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383912. doi: 10.1145/3453483.3454083. URL https://doi.org/10.1145/3453483.3454083.

NVidia. Mixed precision training. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html. [Online; accessed 13-Oct-2022].

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS 2017 Workshop on Autodiff, 2017. URL https://openreview.net/forum?id=BJJsrmfCZ.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Patil, S. G., Jain, P., Dutta, P., Stoica, I., and Gonzalez, J. POET: Training neural networks on tiny devices with integrated rematerialization and paging. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 17573–17583. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/patil22b.html.

Paulik, M., Seigel, M., Mason, H., Telaar, D., Kluivers, J., van Dalen, R., Lau, C. W., Carlson, L., Granqvist, F., Vandevelde, C., Agarwal, S., Freudiger, J., Byde, A., Bhowmick, A., Kapoor, G., Beaumont, S., Cahill, A., Hughes, D., Javidbakht, O., Dong, F., Rishi, R., and Hung, S. Federated evaluation and tuning for on-device personalization: System design and applications, 2021. URL https://arxiv.org/abs/2102.08503.

Peng, X., Shi, X., Dai, H., Jin, H., Ma, W., Xiong, Q., Yang, F., and Qian, X. Capuchin: Tensor-based gpu memory management for deep learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pp. 891–905, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371025. doi: 10.1145/3373376.3378505. URL https://doi.org/10.1145/3373376.3378505.

Sekiyama, T., Imamichi, T., Imai, H., and Raymond, R. Profile-guided memory optimization for deep neural networks, 2018. URL https://arxiv.org/abs/1804.10001.

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., and Villalobos, P. Compute trends across three eras of machine learning. 2022 International Joint Conference on Neural Networks (IJCNN), Jul 2022. doi: 10.1109/ijcnn55064.2022.9891914. URL http://dx.doi.org/10.1109/IJCNN55064.2022.9891914.

Shah, A., Wu, C.-Y., Mohan, J., Chidambaram, V., and Kraehenbuehl, P. Memory optimization for deep networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bnY0jm4l59.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.

Smith, S. L., Kindermans, P.-J., and Le, Q. V. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1Yy1BxCZ.

Tai, Y., Yang, J., and Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Tan, M. and Le, Q. V. Efficientnet: Rethinking model scaling for convolutional neural networks. ArXiv, abs/1905.11946, 2019.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile, 2019.

Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6450–6459, 2018. doi: 10.1109/CVPR.2018.00675.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf.

Wikipedia. Integer programming. https://en.wikipedia.org/wiki/Integer_programming, 2023. [Online; accessed 15-Mar-2023].

Zheng, B., Vijaykumar, N., and Pekhimenko, G. Echo: Compiler-based gpu memory footprint reduction for lstm rnn training. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 1089–1102, 2020. doi: 10.1109/ISCA45697.2020.00092.

Zhu, F., Gong, R., Yu, F., Liu, X., Wang, Y., Li, Z., Yang, X., and Yan, J. Towards unified int8 training for convolutional neural network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

A. Scaling to Large Neural Networks


Our formulation requires 2 × |E| × |V| binary variables since we have one C and one P variable per tensor per step, as well as |E| integer variables to track tensor addresses. Additionally, we create O(|V| × |E|) constraints to encode tensor precedence and life cycle requirements, and O(|V| × |E|²) constraints to ensure that tensors never overlap in memory.
We develop the following techniques to reduce the complexity of the ILP formulation and enable our approach to scale well.
This permits MODeL to optimize the memory usage of neural networks with complex tensor computation graphs comprised
of thousands of vertices and edges.

A.1. Bounding Lifetime Ranges


All of the input tensors of a node must reside in memory for it to run at a given step. This means that all the operators in the immediate fanin of the node must have been run at least one step prior. As a result, we can identify the earliest step ASAP(v) ("as soon as possible") during which a node v can run. ASAP(v) is the longest distance from v to an input of the neural network, which is computed in linear time using a simple depth first search traversal of the graph (Baruch, 1996). Using the same approach, we can also identify the latest step ALAP(v) ("as late as possible") at which a node v can run, which is the longest distance from v to an output of the neural network.

A node v can only run within the span [ASAP(v), ALAP(v)]. Since tensors are created when their source node is run, a variable C_{e,s} will always be false outside the span of their source node (Equation 10).

SPAN(v) = [ASAP(v), ALAP(v)]
∀e ∈ E, ∀s ∉ SPAN(src(e)):   C_{e,s} = 0        (10)

Furthermore, a tensor only needs to be preserved in memory until all its sink operators have run. This enables us to define the Maximum Useful Lifetime (MUL) range of a tensor, and set the variable P_{e,s} for a tensor e to false outside of this range (Equation 11).

MUL(e) = [ASAP(src(e)), max_{f ∈ snks(e)} ALAP(f)]
∀e ∈ E, ∀s ∉ MUL(e):   P_{e,s} = 0        (11)

Additionally, tensors must be preserved in memory from the time they are created until their last sink node has run. Therefore, P_{e,s} must always be true from the last step at which e can be created until the earliest step at which its last sink can run (Equation 12).

PRES(e) = [ALAP(src(e)) + 1, max_{f ∈ snks(e)} ASAP(f)]
∀e ∈ E, ∀s ∈ PRES(e):   P_{e,s} = 1        (12)

This enables us to reduce the number of steps to track for each tensor. In the best case scenario, where a neural network is a linear sequence of operators, the span of each node v is reduced to a single step, and we can derive the values of all the C_{e,s} and P_{e,s} purely from the structure of the graph. However, in the opposite extreme case where a neural network consists exclusively of operators that can run in parallel, we cannot infer any of the values of the C_{e,s} and P_{e,s} variables. The structure of real neural networks lies somewhere between these two extremes.
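One way to compute these quantities is a longest-path sweep over a topological order of the graph. The sketch below (using the hypothetical Node/Edge classes from section 3.1) returns ASAP(v) and ALAP(v) as 0-indexed steps; it illustrates the analysis, not MODeL's implementation.

    from collections import deque

    def asap_alap(nodes):
        """Compute ASAP(v) and ALAP(v) with one forward and one backward sweep
        over a topological order of the DAG."""
        # Kahn's algorithm to obtain a topological order.
        preds = {v: {e.source for e in v.fanin if e.source is not None} for v in nodes}
        succs = {v: {s for e in v.fanout for s in e.sinks} for v in nodes}
        indeg = {v: len(preds[v]) for v in nodes}
        queue = deque(v for v in nodes if indeg[v] == 0)
        topo = []
        while queue:
            v = queue.popleft()
            topo.append(v)
            for w in succs[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)

        # Forward sweep: ASAP(v) is the longest distance from an input of the graph.
        asap = {v: 0 for v in nodes}
        for v in topo:
            for w in succs[v]:
                asap[w] = max(asap[w], asap[v] + 1)

        # Backward sweep: ALAP(v) is the latest step at which v can still run,
        # i.e. the last step minus the longest distance from v to an output.
        last_step = len(nodes) - 1
        alap = {v: last_step for v in nodes}
        for v in reversed(topo):
            for w in succs[v]:
                alap[v] = min(alap[v], alap[w] - 1)
        return asap, alap
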

A.2. Leveraging Precedence Constraints


We simplify our memory placement formulation by avoiding the need to create the variables and constraints from equations
6, 7a and 7b whenever we can determine that two tensors can never reside in memory at the same time. We exploit two
sufficient conditions to achieve this.
First, we leverage the Maximum Useful Lifetime ranges from our ASAP/ALAP analysis. If the MUL ranges of two tensors
do not overlap, they will never be present concurrently in memory.


Figure 9. Edge precedence: e1 ≺prec e2 since the sinks v3 and v4 of e1 are both in the transitive fanin of the source node of e2, and e1 and e2 have no vertex in common.

We complement this first condition with a precedence analysis. If a vertex v2 is reachable from another vertex v1 (i.e. if v1 is in the transitive fanin of v2), the corresponding operator v1 must be run before operator v2. Therefore, if all the sink vertices of an edge e1 are in the transitive fanin of the source vertex of an edge e2, e1 and e2 can only be present in memory at the same time if there is a vertex v such that e1 is one of the fanout edges of v and e2 is one of its fanin edges (Figure 9). We call this condition ≺prec, and if either e1 ≺prec e2 or e2 ≺prec e1 holds, e1 and e2 can never reside together in memory.

We use a simple depth-first search (Function 2) to determine whether a vertex v2 is reachable from a vertex v1. We leverage memoization to ensure that answering the query for a pair (v1, v2) makes all future queries (v, v2) that involve a vertex v on a path from v1 to v2 answerable in constant time.

Function 2 IsInTransitiveFanin(v1, v2, cache)
  ▷ Returns true iff v2 can be reached from v1.
  if (v1, v2) in cache then
    return cache[(v1, v2)]
  end if
  for f in fi(v2) do
    if src(f) = v1 then
      cache[(v1, v2)] ← true
      return true
    end if
    if IsInTransitiveFanin(v1, src(f), cache) then
      cache[(v1, v2)] ← true
      return true
    end if
  end for
  cache[(v1, v2)] ← false
  return false
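For reference, a direct Python port of Function 2, written against the hypothetical Node/Edge classes from section 3.1, is shown below. A pair of tensors then only needs the variables of equations 6, 7a and 7b if their MUL ranges overlap and neither precedence condition holds.

    def is_in_transitive_fanin(v1, v2, cache):
        """Return True iff v2 can be reached from v1, i.e. v1 is in the
        transitive fanin of v2. Memoized in `cache`, a dict keyed by (v1, v2)."""
        if (v1, v2) in cache:
            return cache[(v1, v2)]
        for f in v2.fanin:
            if f.source is v1 or (f.source is not None and
                                  is_in_transitive_fanin(v1, f.source, cache)):
                cache[(v1, v2)] = True
                return True
        cache[(v1, v2)] = False
        return False
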


B. Usage
MODeL can be integrated in a PyTorch training script with minimal effort. We demonstrate how to do this on a simple
example.
import model_opt
import torch, torchvision

def accuracy(output, target):
    # Computes the number of top-1 predictions that match the labels.
    _, pred = torch.topk(output, 1)
    correct = pred.eq(target)
    return torch.sum(correct)

model = torchvision.models.resnet50().cuda()
sample_input = torch.rand(1, 3, 224, 224).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

model_trainer = model_opt.optimize(model, sample_input,
                                   optimizer=optimizer, loss_fn=loss_fn)

train_loader = torch.utils.data.DataLoader(...)
correct = 0
for example, target in train_loader:
    # model_trainer encapsulates the forward pass, the backward pass,
    # and the weight update step
    output = model_trainer(example)
    correct += accuracy(output, target)

