failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED #18

@IWamelink

Description

Hi Reuben,

I am trying to get your U-HVED model to train, but I am running into an error that I cannot shake.
My node has an 80 GB GPU and 100 GB of RAM. My environment uses TensorFlow 1.12 and NiftyNet 0.5.0, as prescribed in requirements.txt.

Have you seen this error before?
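
In case it helps to narrow things down, this is the minimal check I plan to run next, outside NiftyNet: a single float32 matmul on the GPU so that cublasSgemm is exercised in isolation (a sketch against stock TensorFlow 1.x APIs, not the U-HVED code):

import tensorflow as tf

# Build one float32 matmul on the GPU; this dispatches to cublasSgemm,
# the same routine that fails in the log below.
with tf.device('/gpu:0'):
    a = tf.random_normal([1024, 1024], dtype=tf.float32)
    b = tf.random_normal([1024, 1024], dtype=tf.float32)
    c = tf.matmul(a, b)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c).sum())

If this small script fails with the same CUBLAS_STATUS_EXECUTION_FAILED, the problem is presumably in my CUDA/TensorFlow setup rather than in U-HVED.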

[Layer] VAE/ConvDecoderImg/final_conv_seg_4 [Trainable] conv_/w (16)
seeg
INFO:niftynet: Cross entropy loss function calls tf.nn.sparse_softmax_cross_entropy_with_logits which always performs a softmax internally.
output
Tensor("worker_0/concat:0", shape=(1, 112, 112, 112, 4), dtype=float32, device=/device:GPU:0)
output_seg
Tensor("worker_0/VAE/ConvDecoderImg/final_conv_seg_4/conv_/conv:0", shape=(1, 112, 112, 112, 4), dtype=float32, device=/device:GPU:0)
gt
Tensor("worker_0/train/Squeeze_1:0", shape=(1, 112, 112, 112, 1), dtype=float32, device=/device:GPU:0)
WARNING:niftynet: Tried to colocate op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/add/Const' (defined at /data/U-HVED/extensions/u_hved/application.py:253) having device '/device:CPU:0' with op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape' (defined at /data/U-HVED/extensions/u_hved/application.py:253) which had an incompatible device '/device:GPU:0'.

Node-device colocations active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/add/Const' creation:
with tf.colocate_with(worker_0/loss_function_1/map/while/range_1): </data/tfEnv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py:1004>
with tf.colocate_with(worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape): </data/tfEnv/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:80>
No device assignments were active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/add/Const' creation.

No node-device colocations were active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape' creation.
Device assignments active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape' creation:
with tf.device(/gpu:0): </data/tfEnv/lib/python3.6/site-packages/niftynet/engine/application_driver.py:267>
with tf.device(/cpu:0): </data/tfEnv/lib/python3.6/site-packages/niftynet/engine/application_driver.py:249>
WARNING:niftynet: Tried to colocate op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/add/f_acc' (defined at /data/U-HVED/extensions/u_hved/application.py:253) having device '/device:CPU:0' with op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape' (defined at /data/U-HVED/extensions/u_hved/application.py:253) which had an incompatible device '/device:GPU:0'.

Node-device colocations active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/add/f_acc' creation:
with tf.colocate_with(worker_0/loss_function_1/map/while/range_1): </data/tfEnv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py:1004>
with tf.colocate_with(worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape): </data/tfEnv/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:80>
No device assignments were active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/add/f_acc' creation.

No node-device colocations were active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape' creation.
Device assignments active during op 'worker_0/gradients/worker_0/loss_function_1/map/while/Mean_grad/Shape' creation:
with tf.device(/gpu:0): </data/tfEnv/lib/python3.6/site-packages/niftynet/engine/application_driver.py:267>
with tf.device(/cpu:0): </data/tfEnv/lib/python3.6/site-packages/niftynet/engine/application_driver.py:249>
2025-05-29 17:17:14.168112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2025-05-29 17:17:14.168215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2025-05-29 17:17:14.168230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2025-05-29 17:17:14.168240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2025-05-29 17:17:14.168392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 76618 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:48:00.0, compute capability: 8.0)
INFO:niftynet: Parameters from random initialisations ...
2025-05-29 17:18:25.202462: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 1 of 30
2025-05-29 17:18:25.431713: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
2025-05-29 17:18:25.725596: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
INFO:niftynet: cleaning up...
2025-05-29 17:18:26.074386: I tensorflow/stream_executor/stream.cc:2076] [stream=0x555ac32b1a80,impl=0x555aa752aab0] did not wait for [stream=0x555ac27f3f10,impl=0x555aa752a620]
2025-05-29 17:18:26.074480: I tensorflow/stream_executor/stream.cc:5011] [stream=0x555ac32b1a80,impl=0x555aa752aab0] did not memcpy device-to-host; source: 0x7fac8a293600
2025-05-29 17:18:26.074557: I tensorflow/stream_executor/stream.cc:2076] [stream=0x555ac32b1a80,impl=0x555aa752aab0] did not wait for [stream=0x555ac27f3f10,impl=0x555aa752a620]
2025-05-29 17:18:26.074500: I tensorflow/stream_executor/stream.cc:2076] [stream=0x555ac32b1a80,impl=0x555aa752aab0] did not wait for [stream=0x555ac27f3f10,impl=0x555aa752a620]
2025-05-29 17:18:26.074589: F tensorflow/core/common_runtime/gpu/gpu_util.cc:292] GPU->CPU Memcpy failed
2025-05-29 17:18:26.074607: I tensorflow/stream_executor/stream.cc:5011] [stream=0x555ac32b1a80,impl=0x555aa752aab0] did not memcpy device-to-host; source: 0x7fac8aee2200
Aborted (core dumped)
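
For what it is worth, the log shows the A100 being detected with compute capability 8.0, and as far as I know the prebuilt TensorFlow 1.12 wheels were built against CUDA 9.x, which predates Ampere; so I am wondering whether the cuBLAS failure on the very first matmul is an unsupported-GPU-architecture problem rather than anything specific to U-HVED. This is how I am inspecting what TensorFlow actually detects (a short sketch using the standard TF 1.x device listing):

from tensorflow.python.client import device_lib

# Lists the devices TensorFlow sees; physical_device_desc includes the
# GPU name and its compute capability.
for d in device_lib.list_local_devices():
    print(d.name, d.physical_device_desc)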
