Bug description
nvidia-smi memory polling introduced in #911 doesn't respect visibility masking (e.g. the output of tf.config.list_logical_devices("GPU")).
This can cause the auto-selection of GPUs to fail.
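For context, nvidia-smi talks to the driver via NVML and enumerates every physical GPU regardless of CUDA_VISIBLE_DEVICES, while TensorFlow only exposes the masked devices. A minimal sketch of the mismatch (illustrative only, not SLEAP's actual code):

```python
import os
import subprocess

import tensorflow as tf

# nvidia-smi uses NVML and ignores CUDA_VISIBLE_DEVICES, so it reports
# every physical GPU on the machine...
smi_out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
free_mem = [int(x) for x in smi_out.strip().splitlines()]
print("nvidia-smi sees", len(free_mem), "GPUs")  # 8 on the node in the log below

# ...while TensorFlow only exposes the masked devices.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("TF sees", len(tf.config.list_logical_devices("GPU")), "GPUs")  # 1 when the mask is "0"

# Taking argmax over the nvidia-smi list can therefore produce an index
# (here 1) that does not exist in TF's masked device list.
```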
Expected behavior
Auto-detected GPU picks from the masked gpus.
Actual behavior
Here, CUDA_VISIBLE_DEVICES is set to "0", but the auto-selection tries to pick GPU 1.
The IndexError is raised because index 1 doesn't exist in TF's list of logical devices.
(Note that if the auto-selection happens to pick GPU 0, training actually works.)
INFO:sleap.nn.training:
INFO:sleap.nn.training:Auto-selected GPU 1 with [81066, 81069, 81069, 81069, 81069, 81069, 81069, 81069] MiB of free memory.
Traceback (most recent call last):
File "/usr/local/bin/sleap-train", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/sleap/nn/training.py", line 1962, in main
sleap.nn.system.use_gpu(gpu_ind)
File "/usr/local/lib/python3.8/dist-packages/sleap/nn/system.py", line 59, in use_gpu
tf.config.set_visible_devices(gpus[device_ind], "GPU")
IndexError: list index out of range
Your personal set up
- Singularity start point
- nvcr.io/nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
- OS:
- ubuntu 20.04
- Version(s):
- SLEAP v1.2.8, python 3.8.10
- SLEAP installation method (listed here):
How to reproduce
On a computer with multiple GPUs, run the following:
export CUDA_VISIBLE_DEVICES=0
sleap-train multi_instance.json labels.slp
Potential Solutions
Basic workaround
If the mask exposes only one GPU, any of the following commands will work around the issue (each essentially disables the auto-selection):
sleap-train multi_instance.json labels.slp --first-gpu
sleap-train multi_instance.json labels.slp --last-gpu
sleap-train multi_instance.json labels.slp --gpu 0
Potential Fix
Check whether CUDA_VISIBLE_DEVICES is set, e.g. 'CUDA_VISIBLE_DEVICES' in os.environ.
If it is, pass its value to nvidia-smi via the -i option (i.e. add -i os.environ["CUDA_VISIBLE_DEVICES"] to the command) so that only the masked devices are polled.
Otherwise, run nvidia-smi as before.
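A minimal sketch of what this could look like (the function name and its placement are hypothetical, not SLEAP's actual internals):

```python
import os
import subprocess


def get_free_gpu_memory():
    """Poll nvidia-smi for free memory (MiB) per *visible* GPU.

    Hypothetical sketch of the fix: if CUDA_VISIBLE_DEVICES is set,
    restrict the query with nvidia-smi's -i option so the returned list
    lines up index-for-index with tf.config.list_logical_devices("GPU").
    """
    cmd = ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]

    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is not None:
        visible = visible.strip()
        if not visible:
            # An empty mask means no GPUs are visible at all.
            return []
        # Poll only the masked devices; -i accepts a comma-separated list
        # of indices or UUIDs, matching the CUDA_VISIBLE_DEVICES format.
        cmd += ["-i", visible]

    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [int(line) for line in out.strip().splitlines() if line.strip()]
```

With CUDA_VISIBLE_DEVICES=0 this returns a single-element list, so the auto-selected index can only be 0, which matches TF's logical device list.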