 Profiling PyTorch RPC-Based Workloads
 ======================================
 
-In this recipe, you will learn:
+This tutorial has been deprecated.
 
-- An overview of the `Distributed RPC Framework`_
-- An overview of the `PyTorch Profiler`_
-- How to use the profiler to profile RPC-based workloads
+Redirecting to homepage...
 
-Requirements
-------------
+.. raw:: html
 
-- PyTorch 1.6
-
-The instructions for installing PyTorch are
-available at `pytorch.org`_.
-
-What is the Distributed RPC Framework?
---------------------------------------
-
-The **Distributed RPC Framework** provides mechanisms for multi-machine model
-training through a set of primitives to allow for remote communication, and a
-higher-level API to automatically differentiate models split across several machines.
-For this recipe, it would be helpful to be familiar with the `Distributed RPC Framework`_
-as well as the `RPC Tutorials`_.
-
-What is the PyTorch Profiler?
------------------------------
-The profiler is a context manager based API that allows for on-demand profiling of
-operators in a model's workload. The profiler can be used to analyze various aspects
-of a model including execution time, operators invoked, and memory consumption. For a
-detailed tutorial on using the profiler to profile a single-node model, please see the
-`Profiler Recipe`_.
-
-
-
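As context for the profiler description above, a minimal single-node sketch of this context-manager API might look as follows (assuming a working PyTorch install; the tensors and operators are illustrative, not from the recipe):

```python
import torch
import torch.autograd.profiler as profiler

x, y = torch.rand(3, 3), torch.rand(3, 3)

# Profile a couple of local operators under the context manager.
with profiler.profile() as prof:
    torch.add(x, y)
    torch.mul(x, y)

# Aggregate recorded events by name and print a summary table.
table = prof.key_averages().table(sort_by="cpu_time_total")
print(table)
```

The same `key_averages().table()` call is what the RPC examples below use; only the events it aggregates differ.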
-How to use the Profiler for RPC-based workloads
------------------------------------------------
-
-The profiler supports profiling of calls made over RPC and allows the user to have a
-detailed view into the operations that take place on different nodes. To demonstrate
-this, let's first set up the RPC framework. The code snippet below will initialize
-two RPC workers on the same host, named ``worker0`` and ``worker1`` respectively. The workers
-will be spawned as subprocesses, and we set some environment variables required for proper
-initialization.
-
-::
-
-    import torch
-    import torch.distributed.rpc as rpc
-    import torch.autograd.profiler as profiler
-    import torch.multiprocessing as mp
-    import os
-    import logging
-    import sys
-
-    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
-    logger = logging.getLogger()
-
-    def random_tensor():
-        return torch.rand((3, 3), requires_grad=True)
-
-
-    def worker(rank, world_size):
-        os.environ["MASTER_ADDR"] = "localhost"
-        os.environ["MASTER_PORT"] = "29500"
-        worker_name = f"worker{rank}"
-
-        # Initialize RPC framework.
-        rpc.init_rpc(
-            name=worker_name,
-            rank=rank,
-            world_size=world_size
-        )
-        logger.debug(f"{worker_name} successfully initialized RPC.")
-
-        pass  # to be continued below
-
-        logger.debug(f"Rank {rank} waiting for workers and shutting down RPC")
-        rpc.shutdown()
-        logger.debug(f"Rank {rank} shutdown RPC")
-
-
-    if __name__ == '__main__':
-        # Run 2 RPC workers.
-        world_size = 2
-        mp.spawn(worker, args=(world_size,), nprocs=world_size)
-
-Running the above program should present you with the following output:
-
-::
-
-    DEBUG:root:worker1 successfully initialized RPC.
-    DEBUG:root:worker0 successfully initialized RPC.
-    DEBUG:root:Rank 0 waiting for workers and shutting down RPC
-    DEBUG:root:Rank 1 waiting for workers and shutting down RPC
-    DEBUG:root:Rank 1 shutdown RPC
-    DEBUG:root:Rank 0 shutdown RPC
-
-Now that we have a skeleton setup of our RPC framework, we can move on to
-sending RPCs back and forth and using the profiler to obtain a view of what's
-happening under the hood. Let's add to the above ``worker`` function:
-
-::
-
-    def worker(rank, world_size):
-        # Above code omitted...
-        if rank == 0:
-            dst_worker_rank = (rank + 1) % world_size
-            dst_worker_name = f"worker{dst_worker_rank}"
-            t1, t2 = random_tensor(), random_tensor()
-            # Send and wait RPC completion under profiling scope.
-            with profiler.profile() as prof:
-                fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
-                fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
-                # RPCs must be awaited within profiling scope.
-                fut1.wait()
-                fut2.wait()
-
-            print(prof.key_averages().table())
-
-The above code creates two RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
-to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
-we are returned a ``torch.futures.Future`` object, which must be awaited for the result
-of the computation. Note that this wait must take place within the scope created by
-the profiling context manager in order for the RPC to be accurately profiled. Running
-the code with this new worker function should result in the following output:
-
-::
-
-    # Some columns are omitted for brevity, exact output subject to randomness
-    ----------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
-    Name                                                              Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
-    ----------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
-    rpc_async#aten::add(worker0 -> worker1)                           0.00%             0.000us         0            20.462ms   20.462ms      1                0
-    rpc_async#aten::mul(worker0 -> worker1)                           0.00%             0.000us         0            5.712ms    5.712ms       1                0
-    rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul            1.84%             206.864us       2.69%        302.162us  151.081us     2                1
-    rpc_async#aten::add(worker0 -> worker1)#remote_op: add            1.41%             158.501us       1.57%        176.924us  176.924us     1                1
-    rpc_async#aten::mul(worker0 -> worker1)#remote_op: output_nr      0.04%             4.980us         0.04%        4.980us    2.490us       2                1
-    rpc_async#aten::mul(worker0 -> worker1)#remote_op: is_leaf        0.07%             7.806us         0.07%        7.806us    1.952us       4                1
-    rpc_async#aten::add(worker0 -> worker1)#remote_op: empty          0.16%             18.423us        0.16%        18.423us   18.423us      1                1
-    rpc_async#aten::mul(worker0 -> worker1)#remote_op: empty          0.14%             15.712us        0.14%        15.712us   15.712us      1                1
-    ----------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
-    Self CPU time total: 11.237ms
-
-Here we can see that the profiler has profiled our ``rpc_async`` calls made to ``worker1``
-from ``worker0``. In particular, the first two entries in the table show details (such as
-the operator name, originating worker, and destination worker) about each RPC call made,
-and the ``CPU total`` column indicates the end-to-end latency of the RPC call.
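The point that futures must be awaited inside the profiling scope can be seen with a plain-Python timing analogy (using ``concurrent.futures`` rather than RPC; ``slow_add`` and the 0.2s latency are illustrative): the measured interval only covers the asynchronous work if the wait happens within it.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_add(a, b):
    # Stand-in for a remote operator with noticeable latency.
    time.sleep(0.2)
    return a + b

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.monotonic()
    fut = pool.submit(slow_add, 1, 2)
    result = fut.result()          # wait *inside* the measured scope
    inside = time.monotonic() - start

print(result)         # 3
print(inside >= 0.2)  # True: waiting inside pulled the async latency into the measurement
```

Had ``fut.result()`` been called after ``inside`` was computed, the measurement would have excluded almost all of the asynchronous work, which is exactly why the RPC waits above sit inside the ``profiler.profile()`` block.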
-
-We also have visibility into the actual operators invoked remotely on worker 1 due to this RPC.
-We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
-example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul``
-as a ``mul`` operation taking place on the remote node, as a result of the RPC sent to ``worker1``
-from ``worker0``, specifying ``worker1`` to run the builtin ``mul`` operator on the input tensors.
-Note that names of remote operations are prefixed with the name of the RPC event that resulted
-in them. For example, remote operations corresponding to the ``rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))``
-call are prefixed with ``rpc_async#aten::add(worker0 -> worker1)``.
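To make the naming convention concrete, here is a small, hypothetical helper (not part of PyTorch) that splits such an event name into the operator, source worker, destination worker, and optional remote-op suffix:

```python
import re

def parse_rpc_event(name):
    """Split a profiler event name of the form
    'rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul'
    into its components; returns None if the name doesn't match."""
    m = re.match(
        r"rpc_async#(?P<op>[^(]+)\((?P<src>\w+) -> (?P<dst>\w+)\)"
        r"(?:#remote_op: (?P<remote>\w+))?",
        name,
    )
    return m.groupdict() if m else None

evt = parse_rpc_event("rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul")
# evt == {'op': 'aten::mul', 'src': 'worker0', 'dst': 'worker1', 'remote': 'mul'}

top = parse_rpc_event("rpc_async#aten::add(worker0 -> worker1)")
# top-level RPC events have no '#remote_op:' suffix, so 'remote' is None
```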
-
-We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
-For example, let's add the following to the above ``worker`` function:
-
-::
-
-    # Define somewhere outside of worker() func.
-    def udf_with_ops():
-        import time
-        time.sleep(1)
-        t1, t2 = random_tensor(), random_tensor()
-        torch.add(t1, t2)
-        torch.mul(t1, t2)
-
-    def worker(rank, world_size):
-        # Above code omitted
-        with profiler.profile() as p:
-            fut = rpc.rpc_async(dst_worker_name, udf_with_ops, args=())
-            fut.wait()
-        print(p.key_averages().table())
-
-The above code creates a user-defined function that sleeps for 1 second, and then executes various
-operators. Similar to what we've done above, we send an RPC to the remote worker, specifying it to
-run our user-defined function. Running this code should result in the following output:
-
-::
-
-    # Exact output subject to randomness
-    --------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
-    Name                                                                  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
-    --------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
-    rpc_async#udf_with_ops(worker0 -> worker1)                            0.00%             0.000us         0            1.008s     1.008s        1                0
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: rand            12.58%            80.037us        47.09%       299.589us  149.795us     2                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: empty           15.40%            98.013us        15.40%       98.013us   24.503us      4                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: uniform_        22.85%            145.358us       23.87%       151.870us  75.935us      2                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_complex      1.02%             6.512us         1.02%        6.512us    3.256us       2                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: add             25.80%            164.179us       28.43%       180.867us  180.867us     1                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: mul             20.48%            130.293us       31.43%       199.949us  99.975us      2                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: output_nr       0.71%             4.506us         0.71%        4.506us    2.253us       2                1
-    rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_leaf         1.16%             7.367us         1.16%        7.367us    1.842us       4                1
-    --------------------------------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ---------------  -------
-
-Here we can see that the user-defined function has successfully been profiled with its name
-(``rpc_async#udf_with_ops(worker0 -> worker1)``), and has the CPU total time we would roughly
-expect (slightly greater than 1s given the ``sleep``). Similar to the above profiling output,
-we can see the remote operators that have been executed on worker 1 as part of executing this
-RPC request.
-
-Lastly, we can visualize remote execution using the tracing functionality provided by the profiler.
-Let's add the following code to the above ``worker`` function:
-
-::
-
-    def worker(rank, world_size):
-        # Above code omitted
-        # Will generate a trace for the above profiling output
-        trace_file = "/tmp/trace.json"
-        prof.export_chrome_trace(trace_file)
-        logger.debug(f"Wrote trace to {trace_file}")
-
-Now, we can load the trace file in Chrome (``chrome://tracing``). We should see output similar to
-the following:
-
-.. image:: ../_static/img/rpc_trace_img.png
-   :scale: 25 %
-
-As we can see, we have traced our RPC requests and can also visualize traces of the remote
-operations, in this case shown in the trace rows for ``node_id: 1``.
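The exported file follows the Chrome Trace Event JSON format, so it can also be inspected programmatically rather than only in ``chrome://tracing``. The sketch below fabricates two events in that shape (the field values are illustrative, not real profiler output, and the assumption that remote events carry a distinct process id mirrors how the trace groups nodes) and filters them by node:

```python
import json
import os
import tempfile

# Two hand-written events in Chrome Trace Event form: ph="X" denotes a
# complete event with a start timestamp (ts) and duration (dur) in microseconds.
events = [
    {"name": "rpc_async#aten::add(worker0 -> worker1)", "ph": "X",
     "ts": 100, "dur": 20462, "pid": 0, "tid": 0},
    {"name": "rpc_async#aten::add(worker0 -> worker1)#remote_op: add",
     "ph": "X", "ts": 150, "dur": 177, "pid": 1, "tid": 0},
]

trace_file = os.path.join(tempfile.gettempdir(), "example_trace.json")
with open(trace_file, "w") as f:
    json.dump(events, f)

# Load the trace back and pick out events attributed to the remote node.
with open(trace_file) as f:
    loaded = json.load(f)
remote_ops = [e["name"] for e in loaded if e["pid"] == 1]
print(remote_ops)
```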
-
-Putting it all together, we have the following code for this recipe:
-
-::
-
-    import torch
-    import torch.distributed.rpc as rpc
-    import torch.autograd.profiler as profiler
-    import torch.multiprocessing as mp
-    import os
-    import logging
-    import sys
-
-    logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
-    logger = logging.getLogger()
-
-    def random_tensor():
-        return torch.rand((3, 3), requires_grad=True)
-
-    def udf_with_ops():
-        import time
-        time.sleep(1)
-        t1, t2 = random_tensor(), random_tensor()
-        torch.add(t1, t2)
-        torch.mul(t1, t2)
-
-    def worker(rank, world_size):
-        os.environ["MASTER_ADDR"] = "localhost"
-        os.environ["MASTER_PORT"] = "29500"
-        worker_name = f"worker{rank}"
-
-        # Initialize RPC framework.
-        rpc.init_rpc(
-            name=worker_name,
-            rank=rank,
-            world_size=world_size
-        )
-        logger.debug(f"{worker_name} successfully initialized RPC.")
-
-        if rank == 0:
-            dst_worker_rank = (rank + 1) % world_size
-            dst_worker_name = f"worker{dst_worker_rank}"
-            t1, t2 = random_tensor(), random_tensor()
-            # Send and wait RPC completion under profiling scope.
-            with profiler.profile() as prof:
-                fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
-                fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
-                # RPCs must be awaited within profiling scope.
-                fut1.wait()
-                fut2.wait()
-            print(prof.key_averages().table())
-
-            with profiler.profile() as p:
-                fut = rpc.rpc_async(dst_worker_name, udf_with_ops, args=())
-                fut.wait()
-
-            print(p.key_averages().table())
-
-            trace_file = "/tmp/trace.json"
-            prof.export_chrome_trace(trace_file)
-            logger.debug(f"Wrote trace to {trace_file}")
-
-        logger.debug(f"Rank {rank} waiting for workers and shutting down RPC")
-        rpc.shutdown()
-        logger.debug(f"Rank {rank} shutdown RPC")
-
-
-    if __name__ == '__main__':
-        # Run 2 RPC workers.
-        world_size = 2
-        mp.spawn(worker, args=(world_size,), nprocs=world_size)
-
-
-Learn More
-----------
-
-- `pytorch.org`_ for installation instructions, and more documentation
-  and tutorials.
-- `Distributed RPC Framework`_ for RPC framework and API reference.
-- `Full profiler documentation`_ for profiler documentation.
-
-.. _pytorch.org: https://pytorch.org/
-.. _Full profiler documentation: https://pytorch.org/docs/stable/autograd.html#profiler
-.. _PyTorch Profiler: https://pytorch.org/docs/stable/autograd.html#profiler
-.. _Distributed RPC Framework: https://pytorch.org/docs/stable/rpc.html
-.. _RPC Tutorials: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html
-.. _Profiler Recipe: https://pytorch.org/tutorials/recipes/recipes/profiler.html
+   <meta http-equiv="Refresh" content="2; url='https://pytorch.org/tutorials'" />