Describe the bug
The cloud-instance.sh utility script fails to run the install_rh_nvidia_drivers function when multiple kernel-core releases/versions are installed.
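For context, here is a minimal sketch of the parsing problem as suggested by the logs below (this mirrors the logged commands, not the verbatim script source): with more than one kernel-core package installed, dnf info prints one block per package, so the awk/tr pipelines return multi-line values and the resulting kernel-devel package name is not resolvable.

```bash
# Sketch of the version-detection step as reflected in the logs; not the exact script.
# With two kernel-core packages installed, each pipeline prints two lines,
# so KERNEL_VERSION becomes a multi-line string and
# "kernel-devel-${KERNEL_VERSION}" is not a valid package name.
RELEASE="$(dnf info --installed kernel-core | awk -F: '/^Release/{print $2}' | tr -d '[:blank:]')"
VERSION="$(dnf info --installed kernel-core | awk -F: '/^Version/{print $2}' | tr -d '[:blank:]')"
export KERNEL_VERSION="${VERSION}-${RELEASE}"
dnf install -y "kernel-devel-${KERNEL_VERSION}"   # fails: "Unable to find a match"
```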
To Reproduce
Steps to reproduce the behavior:
- Provision any Cloud/Bare metal server and install a Linux OS distribution.
- Run through the steps mentioned here.
- Make sure you install multiple kernel-core versions on the host node before you run the install_rh_nvidia_drivers step (see the sketch after this list).
- Notice the error in the console logs.
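For illustration only (the package NVRs below are examples, not the exact ones from this system), one way to end up with multiple kernel-core packages is to install an older build alongside the one pulled in by an update:

```bash
# kernel-core is an "installonly" package, so dnf keeps old builds
# alongside new ones instead of replacing them.
sudo dnf install -y kernel-core-5.14.0-412.el9   # example: an older build still available in the repos
sudo dnf upgrade -y kernel-core                  # example: pulls a newer build such as 5.14.0-547.el9
rpm -q kernel-core                               # now reports two installed kernel-core packages
```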
Expected behavior
We would expect the cloud-instance.sh and nvidia-setup.sh scripts to handle the case where multiple kernel-core versions are installed. The script partially handles this in this step, but it can fail even before that, over here.
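One possible way to make the version detection robust, as a sketch under the assumption that kernel-devel should match either the running kernel or the newest installed kernel-core (this is not the project's current code):

```bash
# Option 1: match the running kernel; uname -r yields a single
# "<version>-<release>.<arch>" string, so the package spec is unambiguous.
dnf install -y "kernel-devel-$(uname -r)"

# Option 2: if the newest installed kernel-core should be used instead,
# query rpm for one version-release per package and keep the latest.
KERNEL_VERSION="$(rpm -q kernel-core --qf '%{VERSION}-%{RELEASE}\n' | sort -V | tail -n1)"
dnf install -y "kernel-devel-${KERNEL_VERSION}"
```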
Logs
Dependencies resolved.
Nothing to do.
Complete!
+ '[' '' == '' ']'
++ dnf info --installed kernel-core
++ awk -F: '/^Release/{print $2}'
++ tr -d '[:blank:]'
+ RELEASE='412.el9
547.el9'
++ dnf info --installed kernel-core
++ awk -F: '/^Version/{print $2}'
++ tr -d '[:blank:]'
+ VERSION='5.14.0
5.14.0'
+ export 'KERNEL_VERSION=5.14.0
5.14.0-412.el9
547.el9'
+ KERNEL_VERSION='5.14.0
5.14.0-412.el9
547.el9'
+ dnf install -y 'kernel-devel-5.14.0
5.14.0-412.el9
547.el9'
Last metadata expiration check: 0:23:44 ago on Mon 27 Jan 2025 03:54:29 PM EST.
No match for argument: kernel-devel-5.14.0
5.14.0-412.el9
547.el9
Error: Unable to find a match: kernel-devel-5.14.0
5.14.0-412.el9
547.el9
+ CUDA_REPO_ARCH=
+ '[' '' == aarch64 ']'
+ cp -a /etc/dnf/dnf.conf /etc/dnf/dnf.conf.tmp
+ mv /etc/dnf/dnf.conf.tmp /etc/dnf/dnf.conf
+ dnf config-manager --best --nodocs --setopt=install_weak_deps=False --save
+ dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo
Adding repo from: https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo
Status code: 404 for https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo (IP: 23.205.107.71)
Error: Configuration of repo failed
+ systemctl daemon-reload
+ systemctl enable --now nvidia-toolkit-setup.service
Job for nvidia-toolkit-setup.service failed because the control process exited with error code.
See "systemctl status nvidia-toolkit-setup.service" and "journalctl -xeu nvidia-toolkit-setup.service" for details.
Device Info (please complete the following information):
- Hardware Specs: IBM Cloud - gx3-24x120x1l40s
- OS Version: CentOS Stream release 9 (Red Hat Enterprise Linux 9)
- Python Version: Python 3.11.11
- InstructLab Version: NA
Additional context
We would have to handle the situation where the right kernel-core version is the one initially used at login, then remove the additional kernel-core versions and re-install the NVIDIA drivers, or else use a pre-loaded image with everything already installed.
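If the cleanup route is taken, a rough sketch of pruning the extra kernels with standard dnf tooling (verify the list before removing; this is not part of the current scripts):

```bash
# List every installed "installonly" package (kernel-core, kernel-modules, ...)
# except the newest one, then remove the older copies.
# Note: dnf refuses to remove the currently running kernel by default.
old_kernels="$(dnf repoquery --installonly --latest-limit=-1 -q)"
if [ -n "${old_kernels}" ]; then
    dnf remove -y ${old_kernels}
fi

# Afterwards, re-run the NVIDIA driver setup against the remaining kernel,
# e.g. the install_rh_nvidia_drivers step from cloud-instance.sh.
```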