Description
On a machine with 8 GPUs, training fails when I run on only 4 of them, selected with -gpu=4,5,6,7:
#caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7
Error message:
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
A workaround is to put "CUDA_VISIBLE_DEVICES=4,5,6,7" in front of "caffe train ...".
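The workaround above can be sketched as a small shell snippet. One assumption to flag: when CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible devices starting from 0, so this sketch assumes the -gpu list should use the renumbered ids (0,1,2,3) rather than the physical ones. The report does not say which -gpu value was used with the workaround. The variable names are illustrative, and the snippet only prints the command instead of running it.

```shell
# Physical GPU ids to expose to the process (taken from the report).
PHYSICAL_GPUS="4,5,6,7"

# Build the renumbered id list 0..N-1 that the process would see
# (assumption: -gpu refers to ids after CUDA_VISIBLE_DEVICES remapping).
LOGICAL_GPUS=""
i=0
old_ifs=$IFS
IFS=','
for id in $PHYSICAL_GPUS; do
  LOGICAL_GPUS="${LOGICAL_GPUS:+$LOGICAL_GPUS,}$i"
  i=$((i + 1))
done
IFS=$old_ifs

# Assemble the command line; print it for inspection rather than executing it.
CMD="CUDA_VISIBLE_DEVICES=$PHYSICAL_GPUS caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=$LOGICAL_GPUS"
echo "$CMD"
```

Copying the printed line into the shell runs the training on physical GPUs 4-7 while the process addresses them as devices 0-3.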
Note: this does not look like an out-of-memory problem, because the same run with "-gpu=0,1,2,3" works fine.
I hope someone can look into this issue. Thanks in advance.
Info:
NVIDIA Docker: Caffe:19.06
NVCaffe: 0.17.3
CuDNN: 7.6.0
NCCL: 2.4.7
Model: bvlc_googlenet
Batch size: 256
More logs:
I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
*** Check failure stack trace: ***
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8d0359052 caffe::Blob::CopyFrom()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
@ 0x7fa8cdcaf5d5 (unknown)
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8cd5686ba start_thread
@ 0x7fa8cdfcb41d clone
@ (nil) (unknown)