Closed
Description
Hi NCCL team,
I'd like to report a bug: during NCCL initialization, running the RAS command can cause NCCL to crash.
Reproduction Steps:
1. Run the NCCL all-to-all test on two hosts.
2. Execute the RAS command while NCCL is still initializing.
3. Observe that NCCL crashes: one of the ranks (rtptest1621, PID 405884 in the output below) receives a segmentation fault.
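For anyone trying to hit the same timing window, here is a minimal sketch of one way to poke the RAS socket as soon as it opens. It is not the command used in this report (the report does not say which RAS command was run). The port 28028 and the `::1` address come from the "RAS client listening socket" lines in the output; the function name `poke_ras_during_init` and the `status` query text are assumptions made for illustration, so substitute whatever your RAS client actually sends.

```python
import socket
import time


def poke_ras_during_init(host="::1", port=28028, query=b"status\n",
                         attempts=200, delay=0.05, timeout=1.0):
    """Retry until the listening socket accepts, send one query, return the reply.

    Hypothetical helper: the default query text is an assumption, not a
    documented NCCL command. Returns None if the port never opened within
    the attempt budget.
    """
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.sendall(query)
                # Signal that no more request data will follow.
                sock.shutdown(socket.SHUT_WR)
                chunks = []
                while True:
                    try:
                        data = sock.recv(4096)
                    except (socket.timeout, ConnectionError):
                        break  # peer went quiet or dropped the connection
                    if not data:
                        break  # peer closed the connection
                    chunks.append(data)
                return b"".join(chunks)
        except OSError:
            # Socket not listening yet (or the send failed): wait and retry.
            time.sleep(delay)
    return None


if __name__ == "__main__":
    # Start this just before launching the test so the first successful
    # connection lands while NCCL is still initializing.
    print(poke_ras_during_init())
```

Running it in a tight loop alongside the job start is only meant to make the race easier to hit; it does not change what NCCL does once the query arrives.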
Output:
rtptest1621:405884:407462 [0] NCCL INFO NCCL_IB_FIFO_TC set by environment to 224.
rtptest1625:1886888:1890510 [7] NCCL INFO NET/IB: IbDev 4 Port 1 qpn 103534 set_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0}
rtptest1625:1886888:1890510 [7] NCCL INFO NET/IB: IbDev 4 Port 1 qpn 103554 set_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0}
rtptest1625:1886888:1890510 [7] NCCL INFO NET/IB: IbDev 4 Port 1 qpn 103555 set_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0}
rtptest1625:1886888:1890510 [7] NCCL INFO NET/IB: IbDev 4 Port 1 qpn 103556 set_ece={supported=1, vendor_id=0x15b3, options=0x0, comp_mask=0x0}
rtptest1620:935572:937467 [0] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1625:1886882:1890516 [6] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1625:1886888:1890510 [7] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405884:407462 [0] NCCL INFO RAS client listening socket at ::1<28028>
[rtptest1621:405884] *** Process received signal ***
[rtptest1621:405884] Signal: Segmentation fault (11)
[rtptest1621:405884] Signal code: Address not mapped (1)
[rtptest1621:405884] Failing at address: 0x10
[rtptest1621:405884] [ 0] /usr/local/XXX/platform010/lib/libc.so.6(+0x44560)[0x7f0bfb044560]
[rtptest1621:405884] [ 1] nccl_alltoall_perf_2_26[0x30f6a7]
[rtptest1621:405884] [ 2] nccl_alltoall_perf_2_26[0x30a079]
[rtptest1621:405884] [ 3] nccl_alltoall_perf_2_26[0x30ec44]
[rtptest1621:405884] [ 4] nccl_alltoall_perf_2_26[0x3062a4]
[rtptest1621:405884] [ 5] /usr/local/XXX/platform010/lib/libc.so.6(+0x9abc9)[0x7f0bfb09abc9]
[rtptest1621:405884] [ 6] /usr/local/XXX/platform010/lib/libc.so.6(+0x12ce4c)[0x7f0bfb12ce4c]
[rtptest1621:405884] *** End of error message ***
rtptest1621:405885:407432 [1] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405888:407469 [4] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405886:407470 [2] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405898:407433 [7] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405887:407352 [3] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405892:407460 [5] NCCL INFO RAS client listening socket at ::1<28028>
rtptest1621:405894:407471 [6] NCCL INFO RAS client listening socket at ::1<28028>
Thanks.