@@ -3,14 +3,15 @@ Profiling PyTorch RPC-Based Workloads
33
44In this recipe, you will learn:
55
6- - An overview of the `Distributed RPC Framework `_
7- - An overview of the `PyTorch Profiler `_
8- - How to use the profiler to profile RPC-based workloads
6+ - An overview of the `Distributed RPC Framework `_.
7+ - An overview of the `PyTorch Profiler `_.
8+ - How to use the profiler to profile RPC-based workloads.
9+ - A short example showcasing how to use the profiler to tune RPC parameters.
910
1011Requirements
1112------------
1213
13- - PyTorch 1.6
14+ - PyTorch 1.6+
1415
1516The instructions for installing PyTorch are
1617available at `pytorch.org `_.
@@ -119,7 +120,7 @@ happening under the hood. Let's add to the above ``worker`` function:
119120
120121 print(prof.key_averages().table())
121122
122- The aformentioned code creates 2 RPCs, specifying ``torch.add `` and ``torch.mul ``, respectively,
123+ The aforementioned code creates 2 RPCs, specifying ``torch.add `` and ``torch.mul ``, respectively,
123124to be run with two random input tensors on worker 1. Since we use the ``rpc_async `` API,
124125we are returned a ``torch.futures.Future `` object, which must be awaited for the result
125126of the computation. Note that this wait must take place within the scope created by
@@ -148,7 +149,7 @@ from ``worker0``. In particular, the first 2 entries in the table show details (
148149the operator name, originating worker, and destination worker) about each RPC call made
149150and the ``CPU total `` column indicates the end-to-end latency of the RPC call.
150151
151- We also have visibility into the actual operators invoked remotely on worker 1 due RPC.
152+ We also have visibility into the actual operators invoked remotely on worker 1 due to RPC.
152153We can see operations that took place on ``worker1 `` by checking the ``Node ID `` column. For
153154example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul ``
154155as a ``mul `` operation taking place on the remote node, as a result of the RPC sent to ``worker1 ``
@@ -203,7 +204,7 @@ Here we can see that the user-defined function has successfully been profiled wi
203204(slightly greater than 1s given the ``sleep ``). Similar to the above profiling output, we can see the
204205remote operators that have been executed on worker 1 as part of executing this RPC request.
205206
206- Lastly , we can visualize remote execution using the tracing functionality provided by the profiler.
207+ In addition , we can visualize remote execution using the tracing functionality provided by the profiler.
207208Let's add the following code to the above ``worker `` function:
208209
209210::
@@ -224,6 +225,108 @@ the following:
224225As we can see, we have traced our RPC requests and can also visualize traces of the remote operations,
225226in this case, given in the trace row for ``node_id: 1 ``.
226227
228+
229+ Example: Using the profiler to tune RPC initialization parameters
230+ ------------------------------------------------------------------
231+
232+ The following exercise is intended as a simple example of how one can use the statistics and traces
233+ from the profiler to guide the tuning of RPC initialization parameters. In particular, we will focus on tuning
234+ the ``num_worker_threads `` parameter used during RPC initialization. First, we modify our ``rpc.init_rpc ``
235+ call to the following:
236+
237+ ::
238+
239+ # Initialize RPC framework.
240+ num_worker_threads = 1
241+ rpc.init_rpc(
242+ name=worker_name,
243+ rank=rank,
244+ world_size=world_size,
245+ rpc_backend_options=rpc.TensorPipeRpcBackendOptions(num_worker_threads=num_worker_threads)
246+ )
247+
248+ This will initialize the `TensorPipe RPC backend <https://pytorch.org/docs/stable/rpc.html#tensorpipe-backend>`_ with only one thread for processing RPC requests. Next, add
249+ the following function somewhere outside of the ``worker `` main function:
250+
251+ ::
252+
253+ def num_workers_udf_with_ops():
254+ t = torch.randn((100, 100))
255+ for i in range(10):
256+ t.mul(t)
257+ t.add(t)
258+ t = t.relu()
259+ t = t.sigmoid()
260+ return t
261+
262+ This function is intended as a dummy CPU-intensive function for demonstration purposes. Next, we add the
263+ following RPC and profiling code to our main ``worker `` function:
264+
265+ ::
266+
267+ with profiler.profile() as p:
268+ futs = []
269+ for i in range(4):
270+ fut = rpc.rpc_async(dst_worker_name, num_workers_udf_with_ops)
271+ futs.append(fut)
272+ for f in futs:
273+ f.wait()
274+
275+ print(p.key_averages().table())
276+
277+ trace_file = "/tmp/trace.json"
278+ # Export the trace.
279+ p.export_chrome_trace(trace_file)
280+ logger.debug(f"Wrote trace to {trace_file}")
281+
282+ Running the code should return the following profiling statistics (exact output subject to randomness):
283+
284+ ::
285+
286+ ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------
287+ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Node ID
288+ ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------
289+ aten::zeros 0.33% 143.557us 0.47% 203.125us 50.781us 4 0
290+ aten::empty 0.24% 101.487us 0.24% 101.487us 12.686us 8 0
291+ aten::zero_ 0.04% 17.758us 0.04% 17.758us 4.439us 4 0
292+ rpc_async#num_workers_udf_with_ops(worker0 -> worker... 0.00% 0.000us 0 189.757ms 47.439ms 4 0
293+ # additional columns omitted for brevity
294+ ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------
295+
296+ We can see that there were 4 RPC calls, as expected, taking a total of roughly 190ms. Let's now tune the ``num_worker_threads ``
297+ parameter we set earlier by changing it to ``num_worker_threads = 8 ``. Running the code with that change should return
298+ the following profiling statistics (exact output subject to randomness):
299+
300+ ::
301+
302+ ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------
303+ Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Node ID
304+ ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------
305+ aten::zeros 0.31% 127.320us 0.53% 217.203us 54.301us 4 0
306+ aten::empty 0.27% 113.529us 0.27% 113.529us 14.191us 8 0
307+ aten::zero_ 0.04% 18.032us 0.04% 18.032us 4.508us 4 0
308+ rpc_async#num_workers_udf_with_ops(worker0 -> worker... 0.00% 0.000us 0 94.776ms 23.694ms 4 0
309+
310+
311+ We see a clear ~2x speedup, and hypothesize that this is because the server can now process the RPC requests in
312+ parallel with the additional worker threads. However, how can we verify that the requests are actually being processed in parallel?
313+ Taking a look at the trace visualization helps with this. Below is the trace when we set ``num_worker_threads=1 ``:
314+
315+ .. image :: ../_static/img/oneworker.png
316+ :scale: 25 %
317+
318+ Focusing on the trace for ``node 1 ``, we can see that the RPCs are run serially on the server.
319+
320+ Next, here is the trace when we set ``num_worker_threads=8 ``:
321+
322+ .. image :: ../_static/img/8_workers.png
323+ :scale: 25 %
324+
325+ Based on the latter trace, we can see that ``node 1 `` was able to execute the RPCs in parallel on the server, thanks to the
326+ additional worker threads. To summarize, in this simple exercise we were able to use both the profiler's output table and its
327+ trace to pick an appropriate ``num_worker_threads `` parameter for RPC initialization.
328+
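As an optional sanity check that does not rely on the profiler output, one could also time the same batch of
RPCs end to end with a plain wall-clock measurement. The sketch below reuses the ``dst_worker_name `` and
``num_workers_udf_with_ops `` defined above:

::

    import time

    # Issue the same 4 concurrent RPCs and measure end-to-end wall-clock time.
    start = time.time()
    futs = [rpc.rpc_async(dst_worker_name, num_workers_udf_with_ops) for _ in range(4)]
    for f in futs:
        f.wait()
    elapsed = time.time() - start
    # With num_worker_threads=1 the server handles the requests one after another,
    # so this is roughly the sum of the per-RPC latencies; with num_worker_threads=8
    # the requests can overlap and the total time should drop accordingly.
    logger.debug(f"4 RPCs completed in {elapsed:.3f}s")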
329+
227330Putting it all together, we have the following code for this recipe:
228331
229332::
@@ -249,16 +352,27 @@ Putting it all together, we have the following code for this recipe:
249352 torch.add(t1, t2)
250353 torch.mul(t1, t2)
251354
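    # Dummy CPU-intensive function, executed remotely over RPC in the tuning example above.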
355+ def num_workers_udf_with_ops():
356+ t = torch.randn((100, 100))
357+ for i in range(10):
358+ t.mul(t)
359+ t.add(t)
360+ t = t.relu()
361+ t = t.sigmoid()
362+ return t
363+
252364 def worker(rank, world_size):
253365 os.environ["MASTER_ADDR"] = "localhost"
254366 os.environ["MASTER_PORT"] = "29500"
255367 worker_name = f"worker{rank}"
256368
257369 # Initialize RPC framework.
370+ num_worker_threads = 8
258371 rpc.init_rpc(
259- name=worker_name,
260- rank=rank,
261- world_size=world_size
372+ name=worker_name,
373+ rank=rank,
374+ world_size=world_size,
375+ rpc_backend_options=rpc.TensorPipeRpcBackendOptions(num_worker_threads=num_worker_threads),
262376 )
263377 logger.debug(f"{worker_name} successfully initialized RPC.")
264378
@@ -267,22 +381,27 @@ Putting it all together, we have the following code for this recipe:
267381 dst_worker_name = f"worker{dst_worker_rank}"
268382 t1, t2 = random_tensor(), random_tensor()
269383 # Send and wait RPC completion under profiling scope.
270- with profiler.profile() as prof:
271- fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
272- fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
273- # RPCs must be awaited within profiling scope.
274- fut1.wait()
275- fut2.wait()
384+ with profiler.profile() as prof:
385+ fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
386+ fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
387+ # RPCs must be awaited within profiling scope.
388+ fut1.wait()
389+ fut2.wait()
276390 print(prof.key_averages().table())
277391
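        # Profile 4 concurrent RPCs that run the dummy UDF on the remote worker.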
278392 with profiler.profile() as p:
279- fut = rpc.rpc_async(dst_worker_name, udf_with_ops)
280- fut.wait()
393+ futs = []
394+ for i in range(4):
395+ fut = rpc.rpc_async(dst_worker_name, num_workers_udf_with_ops)
396+ futs.append(fut)
397+ for f in futs:
398+ f.wait()
281399
282400 print(p.key_averages().table())
283401
284402 trace_file = "/tmp/trace.json"
285- prof.export_chrome_trace(trace_file)
403+ # Export the trace.
404+ p.export_chrome_trace(trace_file)
286405 logger.debug(f"Wrote trace to {trace_file}")
287406
288407