Closed
Description
We get a segfault when we destroy the current ncclComm:
#0 0x00007fdba8a61838 in transportDestroyProxy (comm=comm@entry=0x5604b2781850) from /opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/torch/lib/../../../../libnccl.so.2
#1 0x00007fdba8a5d82b in commDestroy (comm=0x5604b2781850) from /opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/torch/lib/../../../../libnccl.so.2
#2 ncclCommDestroy (comm=0x5604b2781850) at /tmp/tmpxft_0000756e_00000000-5_init.compute_70.cudafe1.cpp:1158
while (proxyState->pools != NULL) {
  struct ncclProxyPool *next = proxyState->pools->next;
  free(proxyState->pools);
  proxyState->pools = next;
}
attempting to dereference:
(gdb) p * & comm->proxyState->pools->next
Cannot access memory at address 0x21f0
ENV
Lenovo x3650 M5 (Intel x86_64)
48 cores / 377 GB
Ubuntu 18.04.2 LTS
uname -r: 4.15.0-45-generic
CUDA: 10.1
NCCL: 2.4.2-1
Note: the same code works fine with NCCL 2.3.5-5
(We built a version of the NCCL libraries with ENABLE_TRACE; however, we do not see any additional output beyond that enabled by "NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL".)