GPU Memory polling ignores GPU masking #1028

@SkepticRaven

Description

Bug description

The nvidia-smi memory polling introduced in #911 doesn't respect visibility masking (e.g., the output of tf.config.list_logical_devices("GPU")).
This can cause issues with the auto-selection of GPUs.

Expected behavior

Auto-selection picks a GPU from the masked (visible) devices.

Actual behavior

Here, CUDA_VISIBLE_DEVICES is set to "0", but auto-selection tries to pick GPU 1.
The IndexError occurs because index 1 doesn't exist in TF's list of logical devices.
(Note that if auto-selection happens to pick GPU 0, it actually works.)

INFO:sleap.nn.training:
INFO:sleap.nn.training:Auto-selected GPU 1 with [81066, 81069, 81069, 81069, 81069, 81069, 81069, 81069] MiB of free memory.
Traceback (most recent call last):
File "/usr/local/bin/sleap-train", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/sleap/nn/training.py", line 1962, in main
sleap.nn.system.use_gpu(gpu_ind)
File "/usr/local/lib/python3.8/dist-packages/sleap/nn/system.py", line 59, in use_gpu
tf.config.set_visible_devices(gpus[device_ind], "GPU")
IndexError: list index out of range
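The mismatch can be illustrated with the numbers from the log above: nvidia-smi reports free memory for all eight physical GPUs, while TensorFlow's logical device list contains only the single masked device. This is a hypothetical sketch of the selection logic, not SLEAP's actual code:

```python
# Free-memory readings from the log above; nvidia-smi sees all 8 physical GPUs,
# ignoring CUDA_VISIBLE_DEVICES.
free_mem = [81066, 81069, 81069, 81069, 81069, 81069, 81069, 81069]

# Auto-selection picks the index with the most free memory -> 1
# (the first of the tied maxima).
gpu_ind = max(range(len(free_mem)), key=lambda i: free_mem[i])

# But with CUDA_VISIBLE_DEVICES=0, TF's logical device list has length 1,
# so indexing it with gpu_ind raises IndexError: list index out of range.
visible_gpus = ["GPU:0"]
assert gpu_ind == 1 and gpu_ind >= len(visible_gpus)
```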

Your personal set up

  • Singularity start point
    • nvcr.io/nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04
  • OS:
    • Ubuntu 20.04
  • Version(s):
    • SLEAP v1.2.8, Python 3.8.10
  • SLEAP installation method (listed here):

How to reproduce

On a computer with multiple GPUs, run the following:

export CUDA_VISIBLE_DEVICES=0
sleap-train multi_instance.json labels.slp

Potential Solutions

Basic workaround

If the mask contains only one GPU, any of the following commands will work around the issue (they effectively disable auto-selection):

sleap-train multi_instance.json labels.slp --first-gpu
sleap-train multi_instance.json labels.slp --last-gpu
sleap-train multi_instance.json labels.slp --gpu 0

Potential Fix

Check whether CUDA_VISIBLE_DEVICES is set, e.g. 'CUDA_VISIBLE_DEVICES' in os.environ.
If it is, add -i ${CUDA_VISIBLE_DEVICES} to the nvidia-smi command so only the visible devices are polled.
Otherwise, run as normal.
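A minimal sketch of that fix, assuming the polling shells out to nvidia-smi (the helper name and query flags here are illustrative, not SLEAP's actual implementation in sleap.nn.system):

```python
import os

def build_memory_query_cmd(env=None):
    """Build an nvidia-smi command for polling per-GPU free memory.

    Hypothetical helper: if CUDA_VISIBLE_DEVICES is set, restrict the query
    with -i so the reported indices line up with TF's logical device list.
    """
    if env is None:
        env = os.environ
    cmd = ["nvidia-smi", "--query-gpu=memory.free",
           "--format=csv,noheader,nounits"]
    if "CUDA_VISIBLE_DEVICES" in env:
        # Only poll the masked devices, e.g. -i 0 or -i 0,2
        cmd += ["-i", env["CUDA_VISIBLE_DEVICES"]]
    return cmd
```

With CUDA_VISIBLE_DEVICES=0, this yields nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0, so auto-selection only ever sees the visible GPU.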

Labels

  • bug: Something isn't working
  • fixed in future release: Fix or feature is merged into develop and will be available in a future release.
  • good first issue: This issue is relatively self-contained.
