MODeL: Memory Optimizations for Deep Learning

Abstract

The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have proposed various techniques including spilling, recomputation, reduced precision training, model pruning, and so on. However, these approaches suffer from various limitations: they can increase training time, affect model accuracy, …
Popular deep learning frameworks such as PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016) do not fully utilize the limited memory available. Similar to traditional dynamic memory allocators such as tcmalloc (Google) and jemalloc (Evans), these frameworks maintain a pool of free blocks of memory at runtime. To serve memory requests, they look for a large enough memory block in the memory pool, or allocate it from the physical memory if none is available. This results in memory fragmentation when free memory blocks do not exactly match the size of an allocation request, which occurs frequently.

Furthermore, DNN frameworks do not optimize tensor lifetimes. PyTorch (Paszke et al., 2019) executes operations in the order in which they are defined in the program. TensorFlow (Abadi et al., 2016) keeps a queue of operators that are ready to run, and executes them on a first-come, first-served basis. As a result, tensors can be allocated earlier than required, or freed later than necessary, wasting valuable memory.

Our method overcomes these two limitations of existing deep learning frameworks. We model the computations performed to train a deep neural network as a dataflow graph of operations. We analyze this graph to find a topological ordering of the nodes that adjusts the lifetime of the tensors generated by these operations to minimize the peak amount of memory that needs to be allocated (Figure 3). Furthermore, we find an optimal packing of these tensors, which reduces memory fragmentation (Figure 4).

Our work makes the following novel contributions:

• We study empirically the practicality and effectiveness of our solution on a wide variety of DNNs, which achieves average memory savings exceeding 30% in a median time of less than 7 seconds.

• We provide an open source implementation of MODeL at https://github.com/facebookresearch/model_opt.

2. Related Work

Various approaches, complementary to ours, have been proposed to break the "memory wall" and train larger networks. The first technique distributes the computation for a single forward-backward iteration over several hardware devices, thus making more memory available overall. However, this approach, known as model parallelism (Karakus et al., 2021), significantly increases the financial cost of training deep neural networks since it requires access to additional expensive compute accelerators and fast networks. Furthermore, partitioning a deep neural network efficiently to balance communication and computation remains an open problem still actively researched (Gholami et al., 2018; Jia et al., 2018; Mirhoseini et al., 2018).

In parallel, the research community has developed numerous solutions to reduce the memory footprint of neural networks:
• Novel neural network architectures reduce the number of parameters, and therefore the amount of memory, needed to reach a given level of accuracy (Iandola et al., 2017; Tan & Le, 2019). Furthermore, automated search techniques known as neural architecture search (Tan et al., 2019) have been proposed to automatically design memory efficient models. The main drawbacks of these methods are that they are time consuming to deploy, and fail to match the result quality of state-of-the-art DNNs.

• Model compression methods (Blalock et al., 2020) prune (Louizos et al., 2018; Frankle & Carbin, 2018; Molchanov et al., 2019; Elkerdawy et al., 2022; He et al., 2020) or share (Dehghani et al., 2019) weights to improve the efficiency of the model parameterization. However, the majority of these techniques require training the unpruned neural network first, and are therefore most useful for inference.

• Training using reduced precision arithmetic on 16-bit floating point or even quantized representations (Wang et al., 2018; Zhu et al., 2020; Kalamkar et al., 2019) significantly reduces memory (Fan et al., 2020; Lin et al., 2016). However, these techniques can compromise the accuracy of the neural networks, make training unstable, and require careful implementation to be deployed successfully (NVidia; Lin & Talathi, 2016).

Several efforts have looked at the problem from a systems perspective, and presented solutions to reduce pressure on the memory subsystem. These techniques encompass:

• In-memory tensor compression, which can result in minimal accuracy loss in many DNN applications (Chen et al., 2021; Jain et al., 2018). However, this comes with a runtime penalty, since the data must be compressed and uncompressed on the fly.

• Rematerialization, also known as checkpointing, discards activations in the forward pass to save memory, and recomputes those values as needed when computing the gradients. Numerous strategies to identify which activations to discard have been proposed (Jain et al., 2020; Zheng et al., 2020; Chen et al., 2016; Griewank & Walther, 2000; Shah et al., 2021). While effective at reducing memory usage, these techniques add extra computations, which increases the training time.

• Paging, aka spilling, consists of moving data between a small but high bandwidth and low latency memory pool, and a large but slow external memory. This has been demonstrated to effectively offload the data stored on a GPU device onto the host memory (Peng et al., 2020; Hildebrand et al., 2020; Meng et al., 2017), but again increases training time due to extra memory transfers.

• More recently, combining several of these techniques has been proposed to increase their effectiveness and mitigate their drawbacks (Beaumont et al., 2021; Patil et al., 2022) without fully eliminating them.

Additionally, some techniques developed primarily to increase execution speed are also beneficial for memory:

• Operator fusion can reduce memory footprint by avoiding the need to materialize large intermediate buffers and keep them around for backpropagation (Niu et al., 2021).

• Machine learning frameworks such as PyTorch (Paszke et al., 2019) and TensorFlow (Abadi et al., 2016) allow some of their operators to store the data they generate in one of their input tensors, thus avoiding the need to allocate an output tensor. This is known as in-place update, and saves memory. However, users must manually modify their neural networks to leverage this capability, and it can lead to correctness issues if used indiscriminately (Paszke et al., 2017).

Optimizing the location of tensors in memory to reduce fragmentation, also known as the dynamic storage allocation problem, is NP-hard (Garey & Johnson, 1979). This problem has been studied in the context of deep learning by other researchers (Sekiyama et al., 2018) who proposed an exact formulation to minimize the memory fragmentation of deep neural networks. However, their approach scaled poorly and only succeeded in optimizing two small neural networks in inference mode. As a result, they ultimately advocated for a heuristics-based approach.

Improving the lifetime of tensors has also been studied before. Liberis et al. (Liberis & Lane, 2020) and Serenity (Ahn et al., 2020) looked for a memory-optimal execution schedule by enumerating the topological orders of the DNN graph and calculating their peak memory usage. To speed things up, they both proposed dynamic programming based optimizations to prune the number of orderings they needed to consider. However, the complexity of their algorithms remains prohibitive at O(|V| · 2^|V|) in both cases, and they only managed to make them work for inference on tiny graphs. Lin et al. (Lin et al., 2022) also mentioned reordering computations as a way to enable operator fusion and reduce the peak memory footprint while training. Unfortunately, they did not describe the algorithm they used to find a suitable node ordering.

3. Background

3.1. Representing Neural Networks as Dataflow Graphs

Deep neural networks can be represented using dataflow graphs, as pioneered by TensorFlow (Abadi et al., 2016).
The nodes of the graph encode the computations to be performed (e.g. matrix multiplications, convolutions, activation functions), while the edges represent the data (aka tensor or array) that is produced by an operation and transferred to consumer nodes.

Due to the producer-consumer relation between connected nodes, edges are oriented. Each edge has exactly one source, which is the operator that generated the corresponding tensor. Since a tensor can be consumed by more than one node, edges can have multiple sinks.

Operators can have multiple incoming (aka fanin) edges. Typically, one of these incoming edges will be the tensor generated by the previous layer, and another one will be a weight tensor. Similarly, operators can have multiple outgoing (aka fanout) edges: while most operations generate a single output tensor, some may create two or more. Operators with no fanout edges are used to model the final outputs of the neural network. Operators without fanin edges can model random number generators, constants, weights, or initial inputs to the neural network.

In the remainder of this paper, we assume that the graphs are acyclic. In practice, this is not a significant limitation since recurrent neural networks such as LSTM (Hochreiter & Schmidhuber, 1997) have been eclipsed by transformers (Vaswani et al., 2017). Furthermore, their loops can be unrolled to avoid the problem altogether.

3.2. Optimizing Tensor Lifetimes

For an operator to run, all its input tensors must be resident in memory, and its output tensors must have been allocated so that they can be written to while the node executes. Additionally, to avoid recomputing tensors, once a tensor is generated it must be preserved in memory until all its consumers have been run.

We define the resident set RS(s) at a given step s in the execution of a neural network as the set of tensors that need to be kept in memory at that point in time. It comprises the tensors in the fanin and fanout of the operator that is scheduled for execution at step s, as well as all the other tensors that were previously generated but need to be kept in memory to be able to run subsequent operators. The peak resident set is the largest resident set over the execution of the network.

The order in which nodes are executed impacts the lifetime of the tensors, and therefore the peak working set. Figure 3 illustrates a simple example in which changing the operator ordering noticeably improves memory usage.

Figure 3. Node execution orders can impact peak memory usage. Edges are annotated with the size of their corresponding tensors, and the two feasible node orders are annotated with the set of tensors resident in memory at each step. Running v3 before v2 is significantly more memory efficient. [The example uses operators v1 to v4 and tensors e1:10Mb, e2:10Mb, e3:20Mb, e4:30Mb, e5:5Mb, e6:10Mb. Order #1 (v1, v2, v3, v4) has resident sets {e1,e2,e3}=40Mb, {e2,e3,e4}=60Mb, {e3,e4,e5}=55Mb, {e4,e5,e6}=45Mb, for a peak of 60Mb. Order #2 (v1, v3, v2, v4) has resident sets {e1,e2,e3}=40Mb, {e2,e3,e5}=35Mb, {e2,e4,e5}=45Mb, {e4,e5,e6}=45Mb, for a peak of 45Mb.]

Among all possible node orderings, those prioritizing the execution of nodes that free large amounts of data while generating little output data themselves are likely to be more efficient. However, as demonstrated in prior works (Bernstein et al., 1989; Bruno & Sethi, 1976), finding an optimal scheduling for a generic DAG is an NP-complete problem, which cannot be solved with a simple greedy approach.
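To make the resident-set computation concrete, the sketch below replays the example of Figure 3 in Python and reports the peak memory of the two orderings. The graph topology (e1 feeding v1, v1 producing e2 and e3, and so on) is our reading of the figure, and the helper names are illustrative rather than part of MODeL's implementation.

SIZES = {"e1": 10, "e2": 10, "e3": 20, "e4": 30, "e5": 5, "e6": 10}  # tensor sizes in Mb
OPS = {  # fanin and fanout of each operator, as read from Figure 3
    "v1": {"inputs": ["e1"], "outputs": ["e2", "e3"]},
    "v2": {"inputs": ["e2"], "outputs": ["e4"]},
    "v3": {"inputs": ["e3"], "outputs": ["e5"]},
    "v4": {"inputs": ["e4", "e5"], "outputs": ["e6"]},
}

def peak_memory(order):
    """Peak resident-set size (in Mb) when running the operators in the given order."""
    produced = {t for op in OPS.values() for t in op["outputs"]}
    # A tensor stays resident until its last consumer has run; final outputs are never freed.
    last_use = {t: len(order) for t in produced}
    for step, op in enumerate(order):
        for t in OPS[op]["inputs"]:
            last_use[t] = step
    # Tensors that no operator produces are the inputs of the graph; they start out resident.
    resident = {t for op in OPS.values() for t in op["inputs"] if t not in produced}
    peak = 0
    for step, op in enumerate(order):
        resident |= set(OPS[op]["outputs"])  # outputs are allocated before the operator runs
        peak = max(peak, sum(SIZES[t] for t in resident))
        resident = {t for t in resident if last_use[t] > step}  # free after the last consumer
    return peak

print(peak_memory(["v1", "v2", "v3", "v4"]))  # 60, matching Order #1 in Figure 3
print(peak_memory(["v1", "v3", "v2", "v4"]))  # 45, matching Order #2 in Figure 3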
3.3. Optimizing Tensor Locations in Memory

Similar to malloc-style memory allocators, the tensor allocation schemes used by typical deep learning frameworks operate online and suffer from fragmentation. Indeed, free memory is often segregated into small blocks interspersed with memory allocated to live tensors. As a result, a significant fraction of the total memory is effectively unusable because it is divided into pieces that are not large enough to fit a tensor. Figure 4 illustrates this phenomenon and demonstrates how planning the location of each tensor ahead of time can significantly reduce the overall peak memory usage.

Figure 4. Memory fragmentation can significantly increase the memory needed to store tensors. A greedy allocator (top) would not leave any room between tensors A and B, thus making it impossible to reuse the space left once tensor A is freed to store tensor C. MODeL (bottom) leaves a gap between tensors A and B to enable the reuse of the memory freed by tensor A and fits all the tensors in less memory.

4. Formulation

We propose to take advantage of the predictability of neural network computations to proactively optimize the lifetime and location of tensors in memory.

We formulate the problem of optimizing the ordering of computations (which determines the tensor lifetimes) and the location of tensors in memory (which determines the amount of memory fragmentation) of generic dataflow graphs, including those used in neural network training. We encode the problem as an integer linear program (Wikipedia, 2023) and use an off-the-shelf ILP solver to find a solution that minimizes the peak memory required to run the dataflow graph.

We solve the ILP problem ahead of time, before the training process starts. This results in a small one-time initial cost, which is negligible compared to the time it takes to train a neural network (see section 5.3).

We leverage a set of linear constraints to ensure that the sequence of tensor creations and preservations reflects a valid execution sequence of the neural network corresponding to a feasible topological ordering of the DAG.

First, a tensor e can either be created or preserved at each step s, but not both (equation 1). Note that it is possible for both C_{e,s} and P_{e,s} to be false, which indicates that the tensor does not reside in memory at this point in time.

∀e ∈ E, ∀s ∈ S    C_{e,s} + P_{e,s} ≤ 1    (1)

Second, a tensor e can be preserved in memory at step s if and only if it was created or preserved at the previous step (equation 2).

∀e ∈ E, ∀s ∈ S    P_{e,s} ≤ P_{e,s-1} + C_{e,s-1}    (2)
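For concreteness, here is how constraints (1) and (2) could be expressed with an ILP modeling API. The paper reports using Gurobi, so this sketch uses gurobipy; the function and variable names are illustrative, and the assumption that nothing can be preserved at the very first step is ours.

import gurobipy as gp
from gurobipy import GRB

def add_lifetime_constraints(tensors, num_steps):
    m = gp.Model("model_lifetimes")
    steps = range(num_steps)
    # C[e, s] = 1 iff tensor e is created at step s; P[e, s] = 1 iff it is preserved at step s.
    C = m.addVars(tensors, steps, vtype=GRB.BINARY, name="C")
    P = m.addVars(tensors, steps, vtype=GRB.BINARY, name="P")
    for e in tensors:
        # Assumption: nothing can be preserved before anything has been created.
        m.addConstr(P[e, 0] == 0)
        for s in steps:
            # Equation (1): a tensor is either created or preserved at a step, not both.
            m.addConstr(C[e, s] + P[e, s] <= 1)
            # Equation (2): preservation requires creation or preservation at the previous step.
            if s > 0:
                m.addConstr(P[e, s] <= P[e, s - 1] + C[e, s - 1])
    return m, C, P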
We use these variables to prevent the overlap of tensors that reside in memory at the same time in equations 7a and 7b.

∀(i, j) ∈ E²    A_i + S_i − A_j ≤ (1 − a_{i,j}) · M    (7a)

∀(i, j) ∈ E²    A_i − A_j − S_j ≥ (b_{i,j} − 1) · M    (7b)

If a_{i,j} takes the value 1, equation 7a degenerates into A_i + S_i ≤ A_j. This forces tensor i to reside below tensor j in memory. Similarly, equation 7b degenerates into A_i ≥ A_j + S_j when b_{i,j} takes the value 1, which forces tensor i to be placed above tensor j. On the other hand, if a_{i,j} and b_{i,j} take the value 0, equations 7a and 7b hold for any value of A_i and A_j in the range [0, M]. In other words, they do not impose further restrictions on the locations of e_i and e_j.

Taken together, constraints 6, 7a, and 7b ensure that tensors can share the same memory space if and only if their lifetimes do not overlap.

4.4. Minimizing Peak Memory Usage

We track the peak memory usage by introducing a variable peak_mem that we constrain as follows:

∀e ∈ E    A_e + S_e ≤ peak_mem    (8)

We find the schedule of operators and memory locations of tensors that optimizes the memory usage of the neural network by feeding program 9 to an ILP solver.

arg min_{C, P, A}  peak_mem
subject to  (1), (2), (3), (4), (5), (6), (7a), (7b), (8)    (9)
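The placement constraints and the objective of program (9) translate into a modeling API in the same way. The sketch below extends the hypothetical gurobipy model from the previous snippet and covers equations (7a), (7b), (8), and the objective; the coupling of the a_{i,j} and b_{i,j} indicators with the lifetime variables (equations 3 to 6, not reproduced in this excerpt) is only hinted at in the comments.

import gurobipy as gp
from gurobipy import GRB

def add_placement_constraints(m, sizes, max_memory):
    """Sketch of equations (7a), (7b), (8): `sizes` maps each tensor to S_e in bytes,
    and `max_memory` plays the role of the big-M constant M."""
    tensors = list(sizes)
    # A[e]: base offset of tensor e in the shared buffer.
    A = m.addVars(tensors, lb=0.0, ub=max_memory, name="A")
    peak_mem = m.addVar(lb=0.0, ub=max_memory, name="peak_mem")
    for i in tensors:
        for j in tensors:
            if i == j:
                continue
            # a[i,j] = 1 forces i below j, b[i,j] = 1 forces i above j. In the full
            # formulation (equation 6, not shown here), at least one of them must be
            # set whenever the lifetimes of i and j overlap.
            a = m.addVar(vtype=GRB.BINARY, name=f"a[{i},{j}]")
            b = m.addVar(vtype=GRB.BINARY, name=f"b[{i},{j}]")
            m.addConstr(A[i] + sizes[i] - A[j] <= (1 - a) * max_memory)   # (7a)
            m.addConstr(A[i] - A[j] - sizes[j] >= (b - 1) * max_memory)   # (7b)
    for e in tensors:
        m.addConstr(A[e] + sizes[e] <= peak_mem)                          # (8)
    m.setObjective(peak_mem, GRB.MINIMIZE)                                # objective of program (9)
    return A, peak_mem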
Tensors are stored in a shared preallocated buffer B sized to accommodate the peak memory usage. The value of each A_e variable represents the offset location of tensor e in B. We can map memory allocation requests to addresses over multiple iterations of the training loop as follows. We assume that each operator generates a single output tensor for the sake of simplicity, but our approach generalizes to handle operators with multiple outputs. The k-th memory allocation request corresponds to the tensor generated by the operator located at position k mod |V| in the execution sequence ES. This tensor e is to be located at address A_B + A_e, where A_B is the base address of buffer B. Memory deallocation requests are no-ops.
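A toy version of this request-to-address mapping, with illustrative names and the same single-output-per-operator simplification, could look as follows.

class PlannedAllocator:
    """`execution_sequence` lists the operators in their optimized order, `offsets` maps
    each operator's output tensor to the A_e offset chosen by the ILP, and `base_address`
    is A_B, the start of the preallocated buffer."""

    def __init__(self, execution_sequence, offsets, base_address):
        self.execution_sequence = execution_sequence
        self.offsets = offsets
        self.base_address = base_address
        self.num_requests = 0

    def malloc(self):
        # The k-th allocation request maps to the operator at position k mod |V| in the
        # execution sequence; its output tensor lives at A_B + A_e.
        op = self.execution_sequence[self.num_requests % len(self.execution_sequence)]
        self.num_requests += 1
        return self.base_address + self.offsets[op]

    def free(self, address):
        # Deallocation requests are no-ops: lifetimes were already planned ahead of time.
        pass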
5. Experiments

We measured the impact of MODeL on the memory usage of DNN training. We tried to answer the following questions:

• How effective is our algorithm at reducing peak memory usage?

• How practical are our algorithms? Can they be applied to large neural networks in a reasonable amount of time?

• What are the respective contributions of our two strategies of node reordering and address generation to the overall memory reduction?

5.1. Experimental Setup

We implemented MODeL on top of PyTorch version 1.11 (Paszke et al., 2019) with torchtext 0.12 and torchvision 0.12. We leveraged torch.FX to convert neural networks into executable sequences of operator calls, and reconstructed the computation graphs from the operator arguments. We encoded and solved the memory optimization problem (equation 9) using Gurobi version 9.1.1 (Gurobi Optimization, LLC, 2022). We translated the Gurobi results into optimized execution sequences and memory locations as described in section 4.5.

We leveraged several techniques to optimize the implementation of our approach. We describe the most impactful optimizations in appendix A.

We ran all our experiments on a workstation featuring an Intel Xeon Gold 6138 CPU running at 2.0 GHz and an NVidia A100 GPU. We show how to integrate MODeL in a regular training flow in appendix B.

5.2. Methodology

We evaluated MODeL on a comprehensive set of neural networks. We included the ResNet (He et al., 2016) and Transformer (Vaswani et al., 2017) models since they are ubiquitous and used in many downstream tasks: the former introduced the concept of residual connection, and the latter popularized the attention mechanism. We also included neural networks designed for specific tasks, such as computer vision (AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015)), video understanding (ResNet3D (Tran et al., 2018)), and large language models (BERT (Devlin et al., 2018), XLM-R (Conneau et al., 2019)).

In addition to these models that were designed to run on datacenter hardware, we also evaluated our approach on MobileNet (Howard et al., 2017). This neural network was tailored to run in resource constrained environments such as edge devices. Additionally, we trained the neural networks at batch sizes 1 and 32. Batch size 1 is commonly used when training a model on devices with limited memory.

5.3. Overall Memory Improvement

Optimizing for both operator ordering and the memory location of tensors using equation 9 results in a reduction in peak memory usage ranging from 8 to 45% at batch size 1, and 12 to 68% at batch size 32. The average saving was 31.4% for batch size 1 and 32.8% for batch size 32 (Figure 5).

Figure 5. Total reduction in peak memory usage (in %) during training at various batch sizes compared to PyTorch.

The memory optimization process takes an average of 7.4 ± 0.7 seconds. In the worst case, our algorithm needs 18.1 seconds to run, and the best case is 100 milliseconds (Figure 6). This process is run only once before training the model. It introduces a negligible overhead to the total training time yet significantly reduces the peak memory usage.

[Figure 6: time taken by the memory optimization process for each model at batch sizes 1 and 32.]
We measured the ability of our memory optimizer to reduce memory fragmentation in two scenarios: first, when our optimizer is free to reorder operators, and second when it is forced to honor the PyTorch operator ordering. The second scenario is implemented by constraining the values of the C_{e,s} variables to be consistent with the PyTorch node ordering. In both cases, we found that our address generator was able to completely eliminate memory fragmentation on all the models of our benchmark suite. By contrast, PyTorch suffered from an average fragmentation of 7.8% at batch size 1, and 21.3% at batch size 32 (Figure 7). The PyTorch memory allocator uses a different strategy for small and large objects, which could explain why fragmentation is significantly worse for the larger batch size. However, it is unclear whether it could be modified to better handle large tensors without introducing other drawbacks.

Figure 7. PyTorch memory fragmentation (in %) during training at various batch sizes. Our method fully eliminates fragmentation.
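Constraining the solver to the PyTorch ordering amounts to fixing the creation variables. A sketch, reusing the hypothetical gurobipy model and C variables from the earlier snippets (here `producer_step` is an assumed mapping from each tensor to the position of its producer in the PyTorch execution order):

def fix_to_pytorch_order(m, C, producer_step, num_steps):
    # Force each tensor to be created exactly at the step where PyTorch runs its producer,
    # so only the address generation (tensor placement) is left for the solver to optimize.
    for e, step in producer_step.items():
        for s in range(num_steps):
            m.addConstr(C[e, s] == (1 if s == step else 0))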
5.5. Impact of Operator Reordering

To evaluate the impact of our tensor lifetime optimization on the overall result, we compared the ideal peak memory necessary to run various neural networks when using the PyTorch node ordering and the node ordering determined by our algorithm. For these measurements, we eliminated the impact of memory fragmentation by recording the peak memory PyTorch operators need to request from the system to run these models under both node orderings, instead of the memory actually used to run the models.

We find that optimizing the order in which operators are run reduces peak memory usage by up to 38% compared to PyTorch (Figure 8). On average, our solution achieves a reduction of 23.9% at batch size 1 and 11.7% at batch size 32.

Figure 8. Reduction (in %) in ideal peak memory usage compared to PyTorch as a result of our node reordering. Ideal memory usage assumes that there is no fragmentation.

The activations generated during the forward pass are preserved in memory for the backward pass of the training. As a result, MODeL has limited ability to decrease the memory usage of the forward pass. On the other hand, the order of the computation and application of the gradients with respect to the weights offers a great deal of flexibility, which MODeL leverages to decrease the memory usage of the backward pass. However, the gradients with respect to the weights are smaller than the activations by roughly a factor of the batch size. Therefore, at large batch sizes, a larger percentage of the total memory is used to store activations, while at smaller batch sizes these gradients represent a larger fraction of the total. As a result, operator reordering tends to be more effective at small batch sizes.

6. Limitations

Our approach suffers from three main limitations. First, the neural network we want to optimize must be representable using a dataflow graph. This is the case for all the major neural architectures, so we do not believe that this is a significant drawback in practice. Second, the sizes of all the tensors must be known ahead of time. In the case where sizes are variable (for example, in the case of a language model operating on sentences of variable length) one can optimize for the worst case scenario (e.g. the longest sentence). Third, our formulation assumes that tensors are stored in a single contiguous memory space. It can be extended to support multiple memory spaces (for example, when multiple devices are used to train a large neural network), but this is beyond the scope of this paper.

7. Conclusion

The limited memory capacity of the hardware used by deep learning practitioners is one of the main challenges in training state-of-the-art neural networks. This "memory wall" limits the size of the neural networks that can be trained, and ultimately impacts the quality of their predictions. Furthermore, as memory needs increase much faster than memory capacity, we expect this memory bottleneck to worsen over time.
References

Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., and Guttag, J. What is the state of neural network pruning? In Dhillon, I., Papailiopoulos, D., and Sze, V. (eds.), Proceedings of Machine Learning and Systems, volume 2, pp. 129–146, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL https://arxiv.org/abs/1810.04805.
Dong, C., Loy, C. C., He, K., and Tang, X. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:295–307, 2016.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. URL https://arxiv.org/abs/2010.11929.

Elkerdawy, S., Elhoushi, M., Zhang, H., and Ray, N. Fire together wire together: A dynamic pruning approach with self-supervised mask prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.

Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019. URL http://jmlr.org/papers/v20/18-598.html.

Evans, J. jemalloc. https://github.com/jemalloc/jemalloc.

Fan, A., Stock, P., Graham, B., Grave, E., Gribonval, R., Jegou, H., and Joulin, A. Training with quantization noise for extreme model compression. arXiv preprint arXiv:2004.07320, 2020.

Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2018. URL http://arxiv.org/abs/1803.03635.

Garey, M. R. and Johnson, D. S. Computers and intractability: A guide to the theory of NP-completeness. Journal of Symbolic Logic, 1979.

Gholami, A., Azad, A., Jin, P. H., Keutzer, K., and Buluç, A. Integrated model, batch, and domain parallelism in training neural networks. Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, 2018.

Google. tcmalloc. https://github.com/google/tcmalloc.

Griewank, A. and Walther, A. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Trans. Math. Softw., 26:19–45, 2000.

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022. URL https://www.gurobi.com.

Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. ArXiv, abs/1811.03604, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.

He, Y., Ding, Y., Liu, P., Zhu, L., Zhang, H., and Yang, Y. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Hildebrand, M., Khan, J., Trika, S., Lowe-Power, J., and Akella, V. AutoTM: Automatic tensor movement in heterogeneous memory systems using integer linear programming. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pp. 875–890, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371025. doi: 10.1145/3373376.3378465. URL https://doi.org/10.1145/3373376.3378465.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. ArXiv, abs/1704.04861, 2017.

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, 2017. URL https://openreview.net/forum?id=S1xh5sYgx.

Jain, A., Phanishayee, A., Mars, J., Tang, L., and Pekhimenko, G. Gist: Efficient data encoding for deep neural network training. In Proceedings of the 45th Annual International Symposium on Computer Architecture, ISCA '18, pp. 776–789. IEEE Press, 2018. ISBN 9781538659847. doi: 10.1109/ISCA.2018.00070. URL https://doi.org/10.1109/ISCA.2018.00070.

Jain, P., Jain, A., Nrusimha, A., Gholami, A., Abbeel, P., Gonzalez, J., Keutzer, K., and Stoica, I. Checkmate: Breaking the memory wall with optimal tensor rematerialization. In Dhillon, I., Papailiopoulos, D., and Sze, V. (eds.), Proceedings of Machine Learning and Systems, volume 2, pp. 497–511, 2020. URL https://proceedings.mlsys.org/paper/2020/file/084b6fbb10729ed4da8c3d3f5a3ae7c9-Paper.pdf.
Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in accelerating convolutional neural networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2274–2283. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/jia18a.html.

Kalamkar, D. D., Mudigere, D., Mellempudi, N., Das, D., Banerjee, K., Avancha, S., Vooturi, D. T., Jammalamadaka, N., Huang, J., Yuen, H., Yang, J., Park, J., Heinecke, A., Georganas, E., Srinivasan, S., Kundu, A., Smelyanskiy, M., Kaul, B., and Dubey, P. A study of BFLOAT16 for deep learning training. CoRR, abs/1905.12322, 2019. URL http://arxiv.org/abs/1905.12322.

Karakus, C., Huilgol, R., Wu, F., Subramanian, A., Daniel, C., Çavdar, D., Xu, T., Chen, H., Rahnama, A., and Quintela, L. Amazon SageMaker model parallelism: A general and flexible framework for large model training. CoRR, abs/2111.05972, 2021. URL https://arxiv.org/abs/2111.05972.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60:84–90, 2012.

Liberis, E. and Lane, N. D. Neural networks on microcontrollers: saving memory at inference via operator reordering, 2020.

Lin, D., Talathi, S., and Annapureddy, S. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. PMLR, 2016.

Lin, D. D. and Talathi, S. S. Overcoming challenges in fixed point training of deep convolutional networks. ArXiv, abs/1607.02241, 2016.

Lin, J., Zhu, L., Chen, W.-M., Wang, W.-C., Gan, C., and Han, S. On-device training under 256KB memory, 2022. URL https://arxiv.org/abs/2206.15472.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization, 2018.

Meng, C., Sun, M., Yang, J., Qiu, M., and Gu, Y. Training deeper models by GPU memory optimization on TensorFlow. In NIPS 2017 Workshop on ML Systems, 2017. URL http://learningsys.org/nips17/assets/papers/paper_18.pdf.

Mirhoseini, A., Goldie, A., Pham, H., Steiner, B., Le, Q. V., and Dean, J. A hierarchical model for device placement. In ICLR, 2018.

Molchanov, P., Mallya, A., Tyree, S., Frosio, I., and Kautz, J. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

Niu, W., Guan, J., Wang, Y., Agrawal, G., and Ren, B. DNNFusion: Accelerating deep neural networks execution with advanced operator fusion. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2021, pp. 883–898, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383912. doi: 10.1145/3453483.3454083. URL https://doi.org/10.1145/3453483.3454083.

NVidia. Mixed precision training. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html. [Online; accessed 13-Oct-2022].

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff, 2017. URL https://openreview.net/forum?id=BJJsrmfCZ.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Patil, S. G., Jain, P., Dutta, P., Stoica, I., and Gonzalez, J. POET: Training neural networks on tiny devices with integrated rematerialization and paging. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 17573–17583. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/patil22b.html.

Paulik, M., Seigel, M., Mason, H., Telaar, D., Kluivers, J., van Dalen, R., Lau, C. W., Carlson, L., Granqvist, F., Vandevelde, C., Agarwal, S., Freudiger, J., Byde, A., Bhowmick, A., Kapoor, G., Beaumont, S., Cahill, A., Hughes, D., Javidbakht, O., Dong, F., Rishi, R., and Hung, S. Federated evaluation and tuning for on-device personalization: System design and applications, 2021. URL https://arxiv.org/abs/2102.08503.

Peng, X., Shi, X., Dai, H., Jin, H., Ma, W., Xiong, Q., Yang, F., and Qian, X. Capuchin: Tensor-based GPU memory management for deep learning. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pp. 891–905, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371025. doi: 10.1145/3373376.3378505. URL https://doi.org/10.1145/3373376.3378505.
Sekiyama, T., Imamichi, T., Imai, H., and Raymond, R. Profile-guided memory optimization for deep neural networks, 2018. URL https://arxiv.org/abs/1804.10001.

Sevilla, J., Heim, L., Ho, A., Besiroglu, T., Hobbhahn, M., and Villalobos, P. Compute trends across three eras of machine learning. 2022 International Joint Conference on Neural Networks (IJCNN), Jul 2022. doi: 10.1109/ijcnn55064.2022.9891914. URL http://dx.doi.org/10.1109/IJCNN55064.2022.9891914.

Shah, A., Wu, C.-Y., Mohan, J., Chidambaram, V., and Kraehenbuehl, P. Memory optimization for deep networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bnY0jm4l59.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.1556.

Smith, S. L., Kindermans, P.-J., and Le, Q. V. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1Yy1BxCZ.

Tai, Y., Yang, J., and Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. ArXiv, abs/1905.11946, 2019.

Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile, 2019.

Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6450–6459, 2018. doi: 10.1109/CVPR.2018.00675.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf.

Wikipedia. Integer programming. https://en.wikipedia.org/wiki/Integer_programming, 2023. [Online; accessed 15-Mar-2023].

Zheng, B., Vijaykumar, N., and Pekhimenko, G. Echo: Compiler-based GPU memory footprint reduction for LSTM RNN training. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 1089–1102, 2020. doi: 10.1109/ISCA45697.2020.00092.

Zhu, F., Gong, R., Yu, F., Liu, X., Wang, Y., Li, Z., Yang, X., and Yan, J. Towards unified int8 training for convolutional neural network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Furthermore, a tensor only needs to be preserved in memory until all its sink operators have run. This enables us to define the Maximum Useful Lifetime (MUL) range of a tensor, and set the variable P_{e,s} for a tensor e to false outside of this range (Equation 11).

Additionally, tensors must be preserved in memory from the time they are created until their last sink node has run. Therefore, P_{e,s} must always be true from the last step at which e can be created until the earliest step at which its last sink can run (Equation 12).

This enables us to reduce the number of steps to track for each tensor. In the best case scenario, where a neural network is a linear sequence of operators, the span of each node v is reduced to a single step, and we can derive the values of all the C_{e,s} and P_{e,s} variables purely from the structure of the graph. However, in the opposite extreme case where a neural network consists exclusively of operators that can run in parallel, we cannot infer any of the values of the C_{e,s} and P_{e,s} variables. The structure of real neural networks lies somewhere between these two extremes.
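One way to derive such step ranges, under the assumption that exactly one operator runs per step, is to bound each operator by its depth from the graph inputs and from the graph outputs. The sketch below is illustrative and not MODeL's actual implementation; `fanin[v]` and `fanout[v]` are assumed to list the predecessor and successor operators of v.

from collections import deque

def step_ranges(fanin, fanout, num_steps):
    """Earliest and latest feasible execution step for every operator of a DAG."""
    def longest_path_from_sources(succ, pred):
        depth = {v: 0 for v in succ}
        indegree = {v: len(pred[v]) for v in succ}
        ready = deque(v for v in succ if indegree[v] == 0)
        while ready:  # Kahn-style traversal computing longest path lengths
            v = ready.popleft()
            for w in succ[v]:
                depth[w] = max(depth[w], depth[v] + 1)
                indegree[w] -= 1
                if indegree[w] == 0:
                    ready.append(w)
        return depth

    earliest = longest_path_from_sources(fanout, fanin)
    # Same computation on the reversed graph gives the depth to the outputs.
    depth_to_sink = longest_path_from_sources(fanin, fanout)
    latest = {v: num_steps - 1 - depth_to_sink[v] for v in fanout}
    return earliest, latest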
Figure 9. Edge precedence: e1 ≺prec e2 since the sinks v3 and v4 of e1 are both in the transitive fanin of the source node of e2, and e1 and e2 have no vertex in common.

We complement this first condition with a precedence analysis. If a vertex v2 is reachable from another vertex v1 (i.e. if v1 is in the transitive fanin of v2), the corresponding operator v1 must be run before operator v2. Therefore, if all the sink vertices of an edge e1 are in the transitive fanin of the source vertex of an edge e2, e1 and e2 can only be present in memory if there is a vertex v such that e1 is one of the fanout edges of v and e2 is one of its fanin edges (Figure 9). We call this condition ≺prec, and if either condition e1 ≺prec e2 or e2 ≺prec e1 holds, e1 and e2 can never reside together in memory.

We use a simple depth-first search (Function 2) to determine whether a vertex v2 is reachable from a vertex v1. We leverage memoization to ensure that answering the query for a pair (v1, v2) yields constant time queries for all future queries (v, v2) that involve a vertex v on a path from v1 to v2.
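A memoized reachability check in this spirit could look like the following sketch; Function 2 itself is not reproduced in this excerpt, so the details here are ours. For an acyclic graph, the early cache write also doubles as a visited marker.

def make_reachability_oracle(fanout):
    """Return a function reachable(v, target) for a DAG described by `fanout[v]`,
    the list of direct successors of operator v."""
    cache = {}  # (v, target) -> bool, shared across queries

    def reachable(v, target):
        if v == target:
            return True
        if (v, target) in cache:
            return cache[(v, target)]
        cache[(v, target)] = False  # provisional value; final for DAGs once the search returns
        result = any(reachable(w, target) for w in fanout[v])
        cache[(v, target)] = result
        return result

    return reachable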
B. Usage

MODeL can be integrated in a PyTorch training script with minimal effort. We demonstrate how to do this on a simple example.

import model_opt
import torch, torchvision

model = torchvision.models.resnet50().cuda()
sample_input = torch.rand(1, 3, 224, 224).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
train_loader = torch.utils.data.DataLoader(...)

correct = 0
for example, target in train_loader:
    # model_trainer encapsulates the forward pass, the backward pass,
    # and the weight update step; it is produced by model_opt from the
    # model, loss function, and optimizer above (construction elided here).
    output = model_trainer(example)
    correct += accuracy(output, target)
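As mentioned in section 5.1, the operator graph itself can be recovered with torch.FX. The snippet below is a generic illustration of that step using the public torch.fx API, with shape propagation as one possible way to attach tensor shapes; it is not necessarily the exact mechanism used inside model_opt.

import torch
import torchvision
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

model = torchvision.models.resnet50()
traced = symbolic_trace(model)  # executable sequence of operator calls

# Propagate a sample input through the traced graph to attach shape and dtype metadata.
ShapeProp(traced).propagate(torch.rand(1, 3, 224, 224))

for node in traced.graph.nodes:  # nodes are listed in execution order
    meta = node.meta.get("tensor_meta")
    if meta is not None:
        print(node.op, node.name, tuple(meta.shape), meta.dtype)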