mxnet gets stuck on cudaMemGetInfo #6281
Description
Environment info
Operating System: CentOS with CUDA V8.0.61
Compiler: g++ 5.3.1
MXNet commit hash (git rev-parse HEAD): 3d545d7
Steps to reproduce
- cd cpp-package/example
- ./get_mnist.sh
- make mlp_gpu && ./mlp_gpu
Part of gdb backtrace:
#0 0x00007fff5d718990 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#1 0x00007fff5d718ac6 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#2 0x00007fff5d778e8a in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#3 0x00007fff5d71fecb in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#4 0x00007fff5d99becf in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#5 0x00007fff5d99bf39 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#6 0x00007fff5d5eed6d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#7 0x00007fff5d5f64f8 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#8 0x00007fff5dbf140d in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#9 0x00007fff5d5f9b94 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#10 0x00007fff5d5fb2e9 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#11 0x00007fff5d5f1abc in _cuda_CallJitEntryPoint ()
from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#12 0x00007fffc4bff582 in fatBinaryCtl_Compile ()
from /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.375.26
#13 0x00007fffd3625e42 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#14 0x00007fffd36269c3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#15 0x00007fffd357f35e in ?? () from /usr/lib64/nvidia/libcuda.so.1
#16 0x00007fffd357f640 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#17 0x00007fffe30dfa5d in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#18 0x00007fffe30d3e60 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#19 0x00007fffe30decc6 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#20 0x00007fffe30e3401 in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#21 0x00007fffe30d672e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#22 0x00007fffe30c3e8e in ?? () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#23 0x00007fffe30f417c in cudaMemGetInfo () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#24 0x00007fffe652aea5 in mxnet::storage::GPUPooledStorageManager::Alloc (this=0xa5fe80,
raw_size=401408) at src/storage/./pooled_storage_manager.h:77
#25 0x00007fffe652b3f9 in mxnet::StorageImpl::Alloc (this=0x7fff6c0052d0, size=401408, ctx=...)
at src/storage/storage.cc:86
#26 0x00007fffe6010bfa in mxnet::NDArray::Chunk::CheckAndAlloc (this=0xa6c790)
at include/mxnet/./ndarray.h:391
#27 0x00007fffe6010bb5 in mxnet::NDArray::Chunk::Chunk (this=0xa6c790, size=100352, ctx=...,
delay_alloc=false, dtype=0) at include/mxnet/./ndarray.h:386
It only gets stuck with CUDA 8.0.61. I tried another machine with CUDA 8.0.44 and it worked fine.
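
A note on reading the backtrace: frames #0 to #11 are inside libnvidia-ptxjitcompiler, reached from cudaMemGetInfo through libcudart and libcuda. One possible reading, which is an assumption and not confirmed anywhere in this report, is that the driver is JIT-compiling PTX on first CUDA use and is slow rather than deadlocked. The Python sketch below groups consecutive frames by shared library to make that layering easier to see. The `summarize` helper and the shortened `SAMPLE` text are hypothetical illustrations, not part of MXNet or gdb.

```python
import re
from itertools import groupby

FRAME_RE = re.compile(r"^#(\d+)\s")   # start of a gdb frame, e.g. "#12 0x0000..."
LIB_RE = re.compile(r"from\s+(\S+)")  # shared object path, e.g. "from /usr/lib64/.../libcuda.so.1"


def summarize(backtrace):
    """Group consecutive frames by shared library.

    Returns a list of (library_basename, first_frame, last_frame) in stack order.
    Frames with no "from <lib>" part (frames that have source info) are
    labelled "(no library)".
    """
    frames = []  # each entry: [frame_number, library_basename or None]
    for line in backtrace.splitlines():
        line = line.strip()
        start = FRAME_RE.match(line)
        if start:
            frames.append([int(start.group(1)), None])
        # gdb may wrap "from <lib>" onto the next line, so also check
        # continuation lines and attach the library to the latest frame.
        lib = LIB_RE.search(line)
        if lib and frames and frames[-1][1] is None:
            frames[-1][1] = lib.group(1).rsplit("/", 1)[-1]

    runs = []
    for lib, group in groupby(frames, key=lambda f: f[1]):
        numbers = [f[0] for f in group]
        runs.append((lib or "(no library)", numbers[0], numbers[-1]))
    return runs


# Shortened, illustrative excerpt in the same shape as the backtrace above.
SAMPLE = """\
#0 0x00007fff5d718990 in ?? () from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#1 0x00007fff5d5f1abc in _cuda_CallJitEntryPoint ()
from /usr/lib64/nvidia/libnvidia-ptxjitcompiler.so.375.26
#2 0x00007fffd3625e42 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3 0x00007fffe30f417c in cudaMemGetInfo () from /usr/local/cuda-8.0/lib64/libcudart.so.8.0
#4 0x00007fffe652b3f9 in mxnet::StorageImpl::Alloc (this=0x7fff6c0052d0, size=401408, ctx=...)
at src/storage/storage.cc:86
"""

if __name__ == "__main__":
    for lib, first, last in summarize(SAMPLE):
        print(f"#{first}-#{last}: {lib}")
```

To use it on the full trace, paste the gdb output into a string and pass it to `summarize`. If the innermost run is the PTX JIT compiler, it may be worth checking whether the process is still consuming CPU before treating it as a hang.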