segfault at transportDestroyProxy line 241 #191

@ashuatibm

We get a segfault when we destroy the current ncclComm:

#0 0x00007fdba8a61838 in transportDestroyProxy (comm=comm@entry=0x5604b2781850) from /opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/torch/lib/../../../../libnccl.so.2
#1 0x00007fdba8a5d82b in commDestroy (comm=0x5604b2781850) from /opt/anaconda3/envs/dlipy2/lib/python2.7/site-packages/torch/lib/../../../../libnccl.so.2
#2 ncclCommDestroy (comm=0x5604b2781850) at /tmp/tmpxft_0000756e_00000000-5_init.compute_70.cudafe1.cpp:1158

The crash is in this loop in transportDestroyProxy:

  while (proxyState->pools != NULL) {
    struct ncclProxyPool *next = proxyState->pools->next;
    free(proxyState->pools);
    proxyState->pools = next;
  }

It faults while attempting to dereference proxyState->pools->next:

(gdb) p * & comm->proxyState->pools->next
Cannot access memory at address 0x21f0
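
For reference, the loop above has the usual shape of a singly linked list teardown. The following is a minimal standalone sketch of that pattern, not NCCL code: `struct pool`, `struct proxy_state`, `add_pool` and `destroy_pools` are hypothetical stand-ins invented for illustration. It assumes the head pointer is either NULL or a valid allocation whose chain ends in NULL, which is the precondition the loop relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not NCCL's real definitions. */
struct pool {
  struct pool *next;
  char payload[64];
};

struct proxy_state {
  struct pool *pools;  /* head of the list, NULL when empty */
};

/* Push a zero-initialised pool onto the front of the list.
 * Returns 0 on success, -1 if allocation fails. */
int add_pool(struct proxy_state *state) {
  struct pool *p = calloc(1, sizeof(*p));
  if (p == NULL) return -1;
  p->next = state->pools;
  state->pools = p;
  return 0;
}

/* Same loop shape as in the report. Returns the number of pools freed. */
int destroy_pools(struct proxy_state *state) {
  int freed = 0;
  if (state == NULL) return 0;
  while (state->pools != NULL) {
    struct pool *next = state->pools->next;  /* read next before freeing */
    free(state->pools);
    state->pools = next;
    freed++;
  }
  assert(state->pools == NULL);  /* head is left in a reusable state */
  return freed;
}
```

In this sketch a second call to `destroy_pools` is harmless because the head is left at NULL. If the head were never initialised, or the enclosing state had already been released, the first read of `->next` could fault in the way shown above; that is only one possible explanation and is not confirmed for NCCL 2.4.2.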

ENV
    Lenovo x3650 M5 (Intel x86_64)
        48 cores / 377 GB
    Ubuntu 18.04.2 LTS
        uname -r: 4.15.0-45-generic
    CUDA: 10.1
    NCCL: 2.4.2-1
        Note: the same code works fine with NCCL 2.3.5-5

(We built a version of the NCCL libraries with ENABLE_TRACE; however, we do not see any additional output beyond what "NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL" already produces.)
