Labels: high priority, module: rpc, module: tensorpipe, oncall: distributed, triaged
Description
🐛 Bug
test_ddp_under_dist_autograd is failing on the release/1.6 branch with no additional changes.
To Reproduce
Steps to reproduce the behavior:
Run:
python test/distributed/rpc/tensorpipe/test_ddp_under_dist_autograd.py
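To narrow the failure down, the failing case can likely be run on its own by passing the test name through unittest's command-line filtering (the class and method names below are taken from the traceback that follows; whether this script forwards argv to unittest is an assumption):

```
# Run only the failing test case via unittest name filtering
python test/distributed/rpc/tensorpipe/test_ddp_under_dist_autograd.py \
    TestDdpComparisonTensorPipe.test_ddp_dist_autograd_local_vs_remote_gpu
```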
Expected behavior
Passing test.
ERROR: test_ddp_dist_autograd_local_vs_remote_gpu (__main__.TestDdpComparisonTensorPipe)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
self._join_processes(fn)
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
self._check_return_codes(elapsed_time)
File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Processes 4 5 exited with error code 10
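For context on the error message: the common_distributed harness spawns one process per simulated worker, joins them, and raises if any exit with a nonzero code. Here is a minimal sketch of that pattern, not the actual common_distributed implementation; the worker count and which ranks fail are hypothetical, chosen to mirror the "Processes 4 5 exited with error code 10" message:

```python
import multiprocessing

def _worker(rank):
    # Each test worker runs in its own process; an assertion failure or
    # unexpected exception makes the process exit with a nonzero code.
    if rank in (4, 5):  # hypothetical failing ranks, as in the traceback
        raise SystemExit(10)

def join_and_check(world_size=8):
    procs = [multiprocessing.Process(target=_worker, args=(r,))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Collect ranks whose process exited nonzero and report them together,
    # analogous to _check_return_codes in the traceback above.
    failed = [(r, p.exitcode) for r, p in enumerate(procs) if p.exitcode != 0]
    if failed:
        ranks = " ".join(str(r) for r, _ in failed)
        code = failed[0][1]
        raise RuntimeError(f"Processes {ranks} exited with error code {code}")

if __name__ == "__main__":
    join_and_check()
```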
Environment
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
OS: Amazon Linux 2
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
CMake version: version 3.13.3
Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 440.33.01
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.18.5
[conda] blas 1.0 mkl
[conda] mkl 2020.0 166
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.15 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] numpy 1.18.1 py37h4f9e942_0
[conda] numpy-base 1.18.1 py37hde5b4d6_1
[conda] numpydoc 0.9.2 py_0
[conda] torch 1.6.0a0+cefb9e0 pypi_0 pypi
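The environment report above is the output of PyTorch's collect_env script, which can be re-run with:

```
python -m torch.utils.collect_env
```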
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @jjlilley @lw @beauby