1+ """
2+ Dynamic Quantization
3+ ====================
4+
5+ --------------
6+
7+ In this recipe you will see how to take advantage of Dynamic
8+ Quantization to accelerate inference on an LSTM-style recurrent neural
9+ network. This reduces the size of the model weights and speeds up model
10+ execution.

**Introduction**

There are a number of trade-offs that can be made when designing neural
networks. During model development and training you can alter the number
of layers and the number of parameters in a recurrent neural network and
trade off accuracy against model size and/or model latency or
throughput. Such changes can take a lot of time and compute resources
because you are iterating over the model training. Quantization gives
you a way to make a similar trade-off between performance and model
accuracy with a known model after training is completed.

You can give it a try in a single session: you will certainly reduce
your model size significantly, and you may get a significant latency
reduction without losing much accuracy.

**What is dynamic quantization?**

Quantizing a network means converting it to use a reduced precision
integer representation for the weights and/or activations. This saves on
model size and allows the use of higher throughput math operations on
your CPU or GPU.

When converting from floating point to integer values you are
essentially multiplying the floating point value by some scale factor
and rounding the result to a whole number. The various quantization
approaches differ in how they determine that scale factor.
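
To make that concrete, here is a minimal sketch of the arithmetic (this
is purely illustrative, not PyTorch's internal implementation, and the
``scale`` value of 20.0 is just an assumption):

::

    import torch

    scale = 20.0                           # an assumed scale factor
    x = torch.tensor([0.12, -0.37, 1.01])  # FP32 values
    # multiply by the scale factor, round, and clamp to the INT8 range
    q = torch.clamp((x * scale).round(), -128, 127).to(torch.int8)
    print(q)                  # tensor([ 2, -7, 20], dtype=torch.int8)
    print(q.float() / scale)  # approximately recovers x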

The key idea with dynamic quantization as described here is that we are
going to determine the scale factor for activations dynamically based on
the data range observed at runtime. This ensures that the scale factor
is "tuned" so that as much signal as possible about each observed
dataset is preserved.
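
As an illustration, such a scale could be derived from the runtime data
range like this (a hedged sketch assuming a symmetric scheme;
``dynamic_scale`` is a hypothetical helper, not a PyTorch API):

::

    import torch

    def dynamic_scale(x):
        # hypothetical helper: pick the multiplier so the largest
        # magnitude observed at runtime maps to the INT8 limit of 127
        return 127.0 / x.abs().max().item()

    activations = torch.randn(20, 8)  # stand-in for runtime activations
    s = dynamic_scale(activations)
    q = torch.clamp((activations * s).round(), -128, 127).to(torch.int8)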

The model parameters, on the other hand, are known during model
conversion, and they are converted ahead of time and stored in INT8
form.

Arithmetic in the quantized model is done using vectorized INT8
instructions. Accumulation is typically done with INT16 or INT32 to
avoid overflow. This higher precision value is scaled back to INT8 if
the next layer is quantized, or converted to FP32 for output.
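
Here is a small illustrative sketch of why the wider accumulator
matters (again, this is not PyTorch internals): each INT8 product fits
in 16 bits, but a long sum of such products can overflow INT16, so the
running total is kept in INT32:

::

    import torch

    a = torch.randint(-128, 128, (1000,), dtype=torch.int8)
    b = torch.randint(-128, 128, (1000,), dtype=torch.int8)
    # widen to INT32 before multiplying and force the dot-product
    # accumulation to stay in INT32
    acc = (a.to(torch.int32) * b.to(torch.int32)).sum(dtype=torch.int32)
    print(acc)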

Dynamic quantization is relatively free of tuning parameters, which
makes it well suited to be added into production pipelines as a standard
part of converting LSTM models to deployment.

--------------

**Note: Limitations on the approach taken here**

This recipe provides a quick introduction to the dynamic quantization
features in PyTorch and the workflow for using them. Our focus is on
explaining the specific functions used to convert the model. We will
make a number of significant simplifications in the interest of brevity
and clarity:

1. You will start with a minimal LSTM network
2. You are simply going to initialize the network with a random hidden
   state
3. You are going to test the network with random inputs
4. You are not going to train the network in this tutorial
5. You will see that the quantized form of this network is smaller and
   runs faster than the floating point network we started with
6. You will see that the output values are generally in the same
   ballpark as the output of the FP32 network, but we are not
   demonstrating here the expected accuracy loss on a real trained
   network

You will see how dynamic quantization is done and be able to see
suggestive reductions in memory use and latency times. Providing a
demonstration that the technique can preserve high levels of model
accuracy on a trained LSTM is left to a more advanced tutorial. If you
want to move right away to that more rigorous treatment please proceed
to the `advanced dynamic quantization
tutorial <https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html>`__.

--------------

**Recipe Structure**

This recipe has 5 steps.

1. Set up

   Here you define a very simple LSTM, import modules, and establish
   some random input tensors.

2. Do the quantization

   Here you instantiate a floating point model and then create a
   quantized version of it.

3. Look at model size

   Here you show that the model size gets smaller.

4. Look at latency

   Here you run the two models and compare model runtime (latency).

5. Look at accuracy

   Here you run the two models and compare outputs.

**Step 1: set up**

This is a straightforward bit of code to set up for the rest of the
recipe.

The unique module we are importing here is ``torch.quantization``, which
includes PyTorch's quantized operators and conversion functions. We also
define a very simple LSTM model and set up some inputs.

"""

# import the modules used here in this recipe
import torch
import torch.quantization
import torch.nn as nn
import copy
import os
import time
# define a very, very simple LSTM for demonstration purposes
# in this case, we are wrapping nn.LSTM, one layer, no pre or post processing
# inspired by
# https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html, by Robert Guthrie
# and https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
class lstm_for_demonstration(nn.Module):
    """Elementary Long Short Term Memory style model which simply wraps nn.LSTM
       Not to be used for anything other than demonstration.
    """
    def __init__(self, in_dim, out_dim, depth):
        super(lstm_for_demonstration, self).__init__()
        self.lstm = nn.LSTM(in_dim, out_dim, depth)

    def forward(self, inputs, hidden):
        out, hidden = self.lstm(inputs, hidden)
        return out, hidden


torch.manual_seed(29592)  # set the seed for reproducibility

# shape parameters
model_dimension = 8
sequence_length = 20
batch_size = 1
lstm_depth = 1

# random data for input
inputs = torch.randn(sequence_length, batch_size, model_dimension)
# hidden is actually a tuple of the initial hidden state and the initial cell state
hidden = (torch.randn(lstm_depth, batch_size, model_dimension),
          torch.randn(lstm_depth, batch_size, model_dimension))


######################################################################
# **Step 2: Do the quantization**
#
# Now we get to the fun part. First we create an instance of the model
# called ``float_lstm``, and then we are going to quantize it. We're
# going to use the
#
# ::
#
#   torch.quantization.quantize_dynamic()
#
# function here (`see
# documentation <https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic>`__),
# which takes the model, then a list of the submodules which we want to
# have quantized if they appear, then the datatype we are targeting. This
# function returns a quantized version of the original model as a new
# module.
#
# That's all it takes.
#

# here is our floating point instance
float_lstm = lstm_for_demonstration(model_dimension, model_dimension, lstm_depth)

# this is the call that does the work
quantized_lstm = torch.quantization.quantize_dynamic(
    float_lstm, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

# show the changes that were made
print('Here is the floating point version of this module:')
print(float_lstm)
print('')
print('and now the quantized version:')
print(quantized_lstm)


######################################################################
# **Step 3: Look at the model size**
#
# Ok, so we've quantized the model. What does that get us? Well, the first
# benefit is that we've replaced the FP32 model parameters with INT8
# values (and some recorded scale factors). This means about 75% less data
# to store and move around. With the default values the reduction shown
# below will be less than 75%, but if you increase the model size above
# (for example, you can set model dimension to something like 80) the
# reduction will converge towards 4x smaller as the stored model size is
# dominated more and more by the parameter values.
#

def print_size_of_model(model, label=""):
    torch.save(model.state_dict(), "temp.p")
    size = os.path.getsize("temp.p")
    print("model: ", label, " \t ", "Size (KB):", size / 1e3)
    os.remove("temp.p")
    return size

# compare the sizes
f = print_size_of_model(float_lstm, "fp32")
q = print_size_of_model(quantized_lstm, "int8")
print("{0:.2f} times smaller".format(f / q))

# note that this value is wrong in PyTorch 1.4 due to https://github.com/pytorch/pytorch/issues/31468
# this will be fixed in 1.5 with https://github.com/pytorch/pytorch/pull/31540


######################################################################
# **Step 4: Look at latency**
#
# The second benefit is that the quantized model will typically run
# faster. This is due to a combination of effects including at least:
#
# 1. Less time spent moving parameter data in
# 2. Faster INT8 operations
#
# As you will see, the quantized version of this super-simple network runs
# faster. This will generally be true of more complex networks, but as
# they say "your mileage may vary" depending on a number of factors
# including the structure of the model and the hardware you are running
# on.
#

# compare the performance (plain wall-clock timing; in a notebook you
# could use the %timeit magic instead)
print("Floating point FP32")
start = time.perf_counter()
for _ in range(100):
    float_lstm(inputs, hidden)
print("{:.6f} sec per pass".format((time.perf_counter() - start) / 100))

print("Quantized INT8")
start = time.perf_counter()
for _ in range(100):
    quantized_lstm(inputs, hidden)
print("{:.6f} sec per pass".format((time.perf_counter() - start) / 100))


######################################################################
# **Step 5: Look at accuracy**
#
# We are not going to do a careful look at accuracy here because we are
# working with a randomly initialized network rather than a properly
# trained one. However, I think it is worth quickly showing that the
# quantized network does produce output tensors that are "in the same
# ballpark" as the original one.
#
# For a more detailed analysis please see the more advanced tutorials
# referenced at the end of this recipe.
#

# run the float model
out1, hidden1 = float_lstm(inputs, hidden)
mag1 = torch.mean(abs(out1)).item()
print('mean absolute value of output tensor values in the FP32 model is {0:.5f}'.format(mag1))

# run the quantized model
out2, hidden2 = quantized_lstm(inputs, hidden)
mag2 = torch.mean(abs(out2)).item()
print('mean absolute value of output tensor values in the INT8 model is {0:.5f}'.format(mag2))

# compare them
mag3 = torch.mean(abs(out1 - out2)).item()
print('mean absolute value of the difference between the output tensors is {0:.5f} or {1:.2f} percent'.format(mag3, mag3 / mag1 * 100))


######################################################################
# **Summary**
#
# We've explained what dynamic quantization is, what benefits it brings,
# and you have used the ``torch.quantization.quantize_dynamic()`` function
# to quickly quantize a simple LSTM model.
#
# **To continue learning about dynamic quantization**
#
# This was a fast and high level treatment of this material; for more
# detail please continue learning with:
#
# https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
#
# **Other resources:**
#
# Docs
#
# https://pytorch.org/docs/stable/quantization.html
#
# Tutorials
#
# https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
#
# https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html
#
# Blogs
#
# https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
#