[blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered - Training by 4 of 8 GPUs will fail #574

@beanliao

I found that training fails if I use 4 of the 8 GPUs, selected with the -gpu flag:
#caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7
Error message:
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)

As a workaround, adding "CUDA_VISIBLE_DEVICES=4,5,6,7" before "caffe train ..." avoids the failure.
Note: I have checked that this is not an out-of-memory problem, because training works fine if I choose "-gpu=0,1,2,3".
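
The workaround above could be scripted roughly as follows. This is a minimal sketch, not something taken from the report: the helper name `masked_caffe_command` is hypothetical, and the post does not say which `-gpu` value was passed together with `CUDA_VISIBLE_DEVICES`. The sketch assumes the logical indices 0..3, because CUDA renumbers the visible devices from 0 inside the process.

```python
import shlex


def masked_caffe_command(physical_gpus, solver):
    """Build the environment override and argv for `caffe train` on a GPU subset.

    Hypothetical helper for illustration only; it is not part of NVCaffe.
    """
    # Physical device IDs as seen by the driver, e.g. "4,5,6,7".
    visible = ",".join(str(g) for g in physical_gpus)
    # Assumption: once the devices are masked, the process addresses them
    # as 0..n-1, so -gpu receives the renumbered (logical) indices.
    logical = ",".join(str(i) for i in range(len(physical_gpus)))
    env = {"CUDA_VISIBLE_DEVICES": visible}
    argv = ["caffe", "train", f"--solver={solver}", f"-gpu={logical}"]
    return env, argv


env, argv = masked_caffe_command(
    [4, 5, 6, 7], "models/bvlc_googlenet/solver_fp16_4.prototxt"
)
# Render the equivalent shell command line for inspection.
command_line = " ".join(
    [f"{key}={value}" for key, value in env.items()]
    + [shlex.quote(arg) for arg in argv]
)
print(command_line)
```

To launch the job rather than print it, the same values could be passed to `subprocess.run(argv, env={**os.environ, **env})`, which keeps the rest of the environment intact and only overrides the device mask.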

I hope someone can look into this issue. Thanks in advance.

Info:
NVIDIA Docker: Caffe:19.06
NVCaffe: 0.17.3
cuDNN: 7.6.0
NCCL: 2.4.7
Model: bvlc_googlenet
Batch size: 256

More logs:
I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
*** Check failure stack trace: ***
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8d0359052 caffe::Blob::CopyFrom()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
@ 0x7fa8cdcaf5d5 (unknown)
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8cd5686ba start_thread
@ 0x7fa8cdfcb41d clone
@ (nil) (unknown)
