
Commit 203099a

Adding more detail to profiling recipe (#1320)

rohan-varma and brianjo authored

* Adding more detail to profiling recipe
* Fix
* Fix TP link
* convert all indents to spaces
* Address rest of the comments
* Spacing

Co-authored-by: Brian Johnson <[email protected]>

1 parent 7045082 commit 203099a

3 files changed: 138 additions & 19 deletions

File tree:

- _static/img/8_workers.png (319 KB)
- _static/img/oneworker.png (120 KB)
- recipes_source/distributed_rpc_profiling.rst (138 additions & 19 deletions)
@@ -3,14 +3,15 @@ Profiling PyTorch RPC-Based Workloads
 
 In this recipe, you will learn:
 
--  An overview of the `Distributed RPC Framework`_
--  An overview of the `PyTorch Profiler`_
--  How to use the profiler to profile RPC-based workloads
+-  An overview of the `Distributed RPC Framework`_.
+-  An overview of the `PyTorch Profiler`_.
+-  How to use the profiler to profile RPC-based workloads.
+-  A short example showcasing how to use the profiler to tune RPC parameters.
 
 Requirements
 ------------
 
--  PyTorch 1.6
+-  PyTorch 1.6+
 
 The instructions for installing PyTorch are
 available at `pytorch.org`_.
@@ -119,7 +120,7 @@ happening under the hood. Let's add to the above ``worker`` function:
 
     print(prof.key_averages().table())
 
-The aformentioned code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
+The aforementioned code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
 to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
 we are returned a ``torch.futures.Future`` object, which must be awaited for the result
 of the computation. Note that this wait must take place within the scope created by
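
For orientation, the profiling pattern this hunk describes (identical to the recipe code reproduced in full at the end of this diff) is:

::

    with profiler.profile() as prof:
        fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
        fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
        # The futures must be awaited while the profiler scope is still
        # active, otherwise the remote work is not captured.
        fut1.wait()
        fut2.wait()
    print(prof.key_averages().table())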
@@ -148,7 +149,7 @@ from ``worker0``. In particular, the first 2 entries in the table show details (
 the operator name, originating worker, and destination worker) about each RPC call made
 and the ``CPU total`` column indicates the end-to-end latency of the RPC call.
 
-We also have visibility into the actual operators invoked remotely on worker 1 due RPC.
+We also have visibility into the actual operators invoked remotely on worker 1 due to RPC.
 We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
 example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul``
 as a ``mul`` operation taking place on the remote node, as a result of the RPC sent to ``worker1``
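
To slice the profiler output by node programmatically rather than reading the table, a sketch along the following lines should work. Note that ``function_events`` and the per-event ``node_id`` field are assumptions about the profiler's Python API here, not anything this commit relies on:

::

    # Hypothetical: keep only the events that executed on worker 1.
    remote_events = [
        e for e in prof.function_events if getattr(e, "node_id", -1) == 1
    ]
    for e in remote_events:
        print(e.name, e.cpu_time_total)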
@@ -203,7 +204,7 @@ Here we can see that the user-defined function has successfully been profiled wi
 (slightly greater than 1s given the ``sleep``). Similar to the above profiling output, we can see the
 remote operators that have been executed on worker 1 as part of executing this RPC request.
 
-Lastly, we can visualize remote execution using the tracing functionality provided by the profiler.
+In addition, we can visualize remote execution using the tracing functionality provided by the profiler.
 Let's add the following code to the above ``worker`` function:
 
 ::
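
The literal block opened by this ``::`` is truncated at the hunk boundary; based on the full recipe code at the end of this diff, it is the trace export:

::

    trace_file = "/tmp/trace.json"
    prof.export_chrome_trace(trace_file)
    logger.debug(f"Wrote trace to {trace_file}")

The resulting JSON file can then be loaded in Chrome at ``chrome://tracing`` to view the timeline.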
@@ -224,6 +225,108 @@ the following:
 As we can see, we have traced our RPC requests and can also visualize traces of the remote operations,
 in this case, given in the trace row for ``node_id: 1``.
 
+
+Example: Using profiler to tune RPC initialization parameters
+--------------------------------------------------------------
+
+The following exercise is intended to be a simple example of how one can use statistics and traces
+from the profiler to guide tuning RPC initialization parameters. In particular, we will focus on tuning
+the ``num_worker_threads`` parameter used during RPC initialization. First, we modify our ``rpc.init_rpc``
+call to the following:
+
+::
+
+    # Initialize RPC framework.
+    num_worker_threads = 1
+    rpc.init_rpc(
+        name=worker_name,
+        rank=rank,
+        world_size=world_size,
+        rpc_backend_options=rpc.TensorPipeRpcBackendOptions(num_worker_threads=num_worker_threads),
+    )
+
+This will initialize the `TensorPipe RPC backend <https://pytorch.org/docs/stable/rpc.html#tensorpipe-backend>`__
+with only one thread for processing RPC requests. Next, add
+the following function somewhere outside of the ``worker`` main function:
+
+::
+
+    def num_workers_udf_with_ops():
+        t = torch.randn((100, 100))
+        for i in range(10):
+            t.mul(t)
+            t.add(t)
+            t = t.relu()
+            t = t.sigmoid()
+        return t
+
+This function is mainly intended to be a dummy CPU-intensive function for demonstration purposes. Next, we add the
+following RPC and profiling code to our main ``worker`` function:
+
+::
+
+    with profiler.profile() as p:
+        futs = []
+        for i in range(4):
+            fut = rpc.rpc_async(dst_worker_name, num_workers_udf_with_ops)
+            futs.append(fut)
+        for f in futs:
+            f.wait()
+
+    print(p.key_averages().table())
+
+    trace_file = "/tmp/trace.json"
+    # Export the trace.
+    p.export_chrome_trace(trace_file)
+    logger.debug(f"Wrote trace to {trace_file}")
+
+Running the code should return the following profiling statistics (exact output subject to randomness):
+
+::
+
+    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
+    Name                                                      Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  # of Calls    Node ID
+    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
+    aten::zeros                                               0.33%         143.557us     0.47%         203.125us     50.781us      4             0
+    aten::empty                                               0.24%         101.487us     0.24%         101.487us     12.686us      8             0
+    aten::zero_                                               0.04%         17.758us      0.04%         17.758us      4.439us       4             0
+    rpc_async#num_workers_udf_with_ops(worker0 -> worker...   0.00%         0.000us       0             189.757ms     47.439ms      4             0
+    # additional columns omitted for brevity
+    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
+
+We can see that there were 4 RPC calls as expected, taking a total of ~190ms. Let's now tune the ``num_worker_threads``
+parameter we set earlier, by changing it to ``num_worker_threads = 8``. Running the code with that change should return
+the following profiling statistics (exact output subject to randomness):
+
+::
+
+    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
+    Name                                                      Self CPU %    Self CPU      CPU total %   CPU total     CPU time avg  # of Calls    Node ID
+    -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
+    aten::zeros                                               0.31%         127.320us     0.53%         217.203us     54.301us      4             0
+    aten::empty                                               0.27%         113.529us     0.27%         113.529us     14.191us      8             0
+    aten::zero_                                               0.04%         18.032us      0.04%         18.032us      4.508us       4             0
+    rpc_async#num_workers_udf_with_ops(worker0 -> worker...   0.00%         0.000us       0             94.776ms      23.694ms      4             0
+
+We see a clear ~2x speedup, and hypothesize that it comes from exploiting parallelism on the server via the
+additional worker threads. However, how can we confirm that the speedup really is due to the extra threads?
+Taking a look at the trace visualization helps with this. Below is the trace when we set ``num_worker_threads=1``:
+
+.. image:: ../_static/img/oneworker.png
+   :scale: 25 %
+
+Focusing on the trace for ``node 1``, we can see that the RPCs are run serially on the server.
+
+Next, the following is the trace where we set ``num_worker_threads=8``:
+
+.. image:: ../_static/img/8_workers.png
+   :scale: 25 %
+
+Based on the latter trace, we can see that ``node 1`` was able to execute the RPCs in parallel on the server, due to having additional
+worker threads. To summarize, in this simple exercise we were able to use both the profiler's output report and its trace to pick an appropriate
+``num_worker_threads`` parameter for RPC initialization.
+
+
 Putting it all together, we have the following code for this recipe:
 
 ::
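
As a complement to the profiler tables above, one could also time the same batch of RPCs with a plain wall clock and rerun the script once per candidate ``num_worker_threads`` value (RPC can only be initialized once per process, so each setting needs a fresh run). This helper is a hypothetical illustration, not part of the recipe:

::

    import time

    def time_rpc_batch(dst_worker_name, num_rpcs=4):
        # Issue a batch of concurrent RPCs and measure end-to-end latency.
        start = time.time()
        futs = [
            rpc.rpc_async(dst_worker_name, num_workers_udf_with_ops)
            for _ in range(num_rpcs)
        ]
        for f in futs:
            f.wait()
        return time.time() - start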
@@ -249,16 +352,27 @@ Putting it all together, we have the following code for this recipe:
         torch.add(t1, t2)
         torch.mul(t1, t2)
 
+    def num_workers_udf_with_ops():
+        t = torch.randn((100, 100))
+        for i in range(10):
+            t.mul(t)
+            t.add(t)
+            t = t.relu()
+            t = t.sigmoid()
+        return t
+
     def worker(rank, world_size):
         os.environ["MASTER_ADDR"] = "localhost"
         os.environ["MASTER_PORT"] = "29500"
         worker_name = f"worker{rank}"
 
         # Initialize RPC framework.
+        num_worker_threads = 8
         rpc.init_rpc(
-            name=worker_name,
-            rank=rank,
-            world_size=world_size
+            name=worker_name,
+            rank=rank,
+            world_size=world_size,
+            rpc_backend_options=rpc.TensorPipeRpcBackendOptions(num_worker_threads=num_worker_threads),
         )
         logger.debug(f"{worker_name} successfully initialized RPC.")
 
@@ -267,22 +381,27 @@ Putting it all together, we have the following code for this recipe:
         dst_worker_name = f"worker{dst_worker_rank}"
         t1, t2 = random_tensor(), random_tensor()
         # Send and wait RPC completion under profiling scope.
-        with profiler.profile() as prof:
-            fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
-            fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
-            # RPCs must be awaited within profiling scope.
-            fut1.wait()
-            fut2.wait()
+        with profiler.profile() as prof:
+            fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
+            fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
+            # RPCs must be awaited within profiling scope.
+            fut1.wait()
+            fut2.wait()
         print(prof.key_averages().table())
 
         with profiler.profile() as p:
-            fut = rpc.rpc_async(dst_worker_name, udf_with_ops)
-            fut.wait()
+            futs = []
+            for i in range(4):
+                fut = rpc.rpc_async(dst_worker_name, num_workers_udf_with_ops)
+                futs.append(fut)
+            for f in futs:
+                f.wait()
 
         print(p.key_averages().table())
 
         trace_file = "/tmp/trace.json"
-        prof.export_chrome_trace(trace_file)
+        # Export the trace.
+        p.export_chrome_trace(trace_file)
         logger.debug(f"Wrote trace to {trace_file}")
 
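The diff does not show how the workers are launched; a typical launcher for a two-worker recipe like this one (an assumption based on standard PyTorch RPC examples, not part of this commit) looks like:

::

    import torch.multiprocessing as mp

    if __name__ == "__main__":
        world_size = 2
        # Spawn one process per worker; each process calls worker(rank, world_size).
        mp.spawn(worker, args=(world_size,), nprocs=world_size)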