distributed/rpc/tensorpipe/test_ddp_under_dist_autograd fails on release/1.6 branch #41365

@choidongyeon

Description

πŸ› Bug

test_ddp_under_dist_autograd is failing on the release/1.6 branch with no additional changes.

To Reproduce

Steps to reproduce the behavior:

Run:

python test/distributed/rpc/tensorpipe/test_ddp_under_dist_autograd.py 
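To narrow this down to the single failing case, the test id can be forwarded to unittest. A minimal sketch, assuming it is run from the repository root and that the file's run_tests() entry point passes unrecognized arguments through to unittest (as the common PyTorch test harness does); the paths mirror the command above and are not confirmed beyond that:

# Sketch: run only the failing test method instead of the whole file.
import runpy
import sys

# Forward the test id through the script's own argv handling.
sys.argv = [
    "test_ddp_under_dist_autograd.py",
    "TestDdpComparisonTensorPipe.test_ddp_dist_autograd_local_vs_remote_gpu",
]
runpy.run_path(
    "test/distributed/rpc/tensorpipe/test_ddp_under_dist_autograd.py",
    run_name="__main__",
)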

Expected behavior

The test passes. Instead, the run fails with:

ERROR: test_ddp_dist_autograd_local_vs_remote_gpu (__main__.TestDdpComparisonTensorPipe)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
    self._join_processes(fn)
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/home/ec2-user/anaconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Processes 4 5 exited with error code 10
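Since processes 4 and 5 exit before any assertion output reaches the parent, it helps to first confirm the GPU and backend preconditions this test depends on in the same environment. A quick sanity-check sketch using standard torch APIs; the exact GPU count the test needs is an assumption, not something stated in this report:

# Sanity check for the environment the failing GPU test runs in.
import torch
import torch.distributed as dist

print("torch version:  ", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("visible GPUs:   ", torch.cuda.device_count())
print("NCCL available: ", dist.is_nccl_available())
print("Gloo available: ", dist.is_gloo_available())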

Environment

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Amazon Linux 2
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6)
CMake version: version 3.13.3

Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 440.33.01
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.18.5
[conda] blas                      1.0                         mkl  
[conda] mkl                       2020.0                      166  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.15           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] numpy                     1.18.1           py37h4f9e942_0  
[conda] numpy-base                1.18.1           py37hde5b4d6_1  
[conda] numpydoc                  0.9.2                      py_0  
[conda] torch                     1.6.0a0+cefb9e0          pypi_0    pypi
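Note that collect_env reports the PyTorch fields as N/A above even though a 1.6.0a0 build is installed via pip. Re-running the collector from the interpreter that actually has torch on its path should populate those fields; a sketch using the standard torch.utils.collect_env module:

# Re-run the environment collector from the Python that has torch installed,
# so the "PyTorch version" / "Is CUDA available" fields get filled in.
from torch.utils import collect_env

collect_env.main()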

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @jjlilley @lw @beauby

Metadata

Assignees

No one assigned

Labels

high priority
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
module: tensorpipe (Related to Tensorpipe RPC Agent)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
