Describe the bug
The cloud-instance.sh utility script fails to run the install_rh_nvidia_drivers function when multiple kernel-core releases/versions are installed.
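For context, here is a minimal sketch of the parsing problem as suggested by the logs below (this mirrors the logged commands, not the verbatim script source): with more than one kernel-core package installed, dnf info prints one block per package, so the awk/tr pipelines return multi-line values and the resulting kernel-devel package name is not resolvable.

```bash
# Sketch of the version-detection step as reflected in the logs; not the exact script.
# With two kernel-core packages installed, each pipeline prints two lines,
# so KERNEL_VERSION becomes a multi-line string and
# "kernel-devel-${KERNEL_VERSION}" is not a valid package name.
RELEASE="$(dnf info --installed kernel-core | awk -F: '/^Release/{print $2}' | tr -d '[:blank:]')"
VERSION="$(dnf info --installed kernel-core | awk -F: '/^Version/{print $2}' | tr -d '[:blank:]')"
export KERNEL_VERSION="${VERSION}-${RELEASE}"
dnf install -y "kernel-devel-${KERNEL_VERSION}"   # fails: "Unable to find a match"
```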
To Reproduce
Steps to reproduce the behavior:
- Provision any Cloud/Bare metal server and install a Linux OS distribution.
- Run through the steps mentioned here.
- Make sure you install multiple kernel-core versions on the host node before you run the install_rh_nvidia_drivers step (see the sketch after this list).
- Notice the error in the console logs.
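For illustration only (the package NVRs below are examples, not the exact ones from this system), one way to end up with multiple kernel-core packages is to install an older build alongside the one pulled in by an update:

```bash
# kernel-core is an "installonly" package, so dnf keeps old builds
# alongside new ones instead of replacing them.
sudo dnf install -y kernel-core-5.14.0-412.el9   # example: an older build still available in the repos
sudo dnf upgrade -y kernel-core                  # example: pulls a newer build such as 5.14.0-547.el9
rpm -q kernel-core                               # now reports two installed kernel-core packages
```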
Expected behavior
We would expect the cloud-instance.sh and nvidia-setup.sh scripts to handle the case where multiple kernel-core versions are installed. The script partially handles this in this step, but it can fail even before that, over here.
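One possible way to make the version detection robust, as a sketch under the assumption that kernel-devel should match either the running kernel or the newest installed kernel-core (this is not the project's current code):

```bash
# Option 1: match the running kernel; uname -r yields a single
# "<version>-<release>.<arch>" string, so the package spec is unambiguous.
dnf install -y "kernel-devel-$(uname -r)"

# Option 2: if the newest installed kernel-core should be used instead,
# query rpm for one version-release per package and keep the latest.
KERNEL_VERSION="$(rpm -q kernel-core --qf '%{VERSION}-%{RELEASE}\n' | sort -V | tail -n1)"
dnf install -y "kernel-devel-${KERNEL_VERSION}"
```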
Logs
Dependencies resolved.
Nothing to do.
Complete!
+ '[' '' == '' ']'
++ dnf info --installed kernel-core
++ awk -F: '/^Release/{print $2}'
++ tr -d '[:blank:]'
+ RELEASE='412.el9
547.el9'
++ dnf info --installed kernel-core
++ awk -F: '/^Version/{print $2}'
++ tr -d '[:blank:]'
+ VERSION='5.14.0
5.14.0'
+ export 'KERNEL_VERSION=5.14.0
5.14.0-412.el9
547.el9'
+ KERNEL_VERSION='5.14.0
5.14.0-412.el9
547.el9'
+ dnf install -y 'kernel-devel-5.14.0
5.14.0-412.el9
547.el9'
Last metadata expiration check: 0:23:44 ago on Mon 27 Jan 2025 03:54:29 PM EST.
No match for argument: kernel-devel-5.14.0
5.14.0-412.el9
547.el9
Error: Unable to find a match: kernel-devel-5.14.0
5.14.0-412.el9
547.el9
+ CUDA_REPO_ARCH=
+ '[' '' == aarch64 ']'
+ cp -a /etc/dnf/dnf.conf /etc/dnf/dnf.conf.tmp
+ mv /etc/dnf/dnf.conf.tmp /etc/dnf/dnf.conf
+ dnf config-manager --best --nodocs --setopt=install_weak_deps=False --save
+ dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo
Adding repo from: https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo
Status code: 404 for https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo (IP: 23.205.107.71)
Error: Configuration of repo failed
+ systemctl daemon-reload
+ systemctl enable --now nvidia-toolkit-setup.service
Job for nvidia-toolkit-setup.service failed because the control process exited with error code.
See "systemctl status nvidia-toolkit-setup.service" and "journalctl -xeu nvidia-toolkit-setup.service" for details.
Device Info (please complete the following information):
- Hardware Specs: IBM Cloud - gx3-24x120x1l40s
- OS Version: CentOS Stream release 9 (Red Hat Enterprise Linux 9)
- Python Version: Python 3.11.11
- InstructLab Version: NA
Additional context
We would have to handle the situation where the right kernel-core version is the one initially used at login, then remove the additional kernel-core versions and re-install the NVIDIA drivers, or else use a pre-loaded image with everything already installed.
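If the cleanup route is taken, a rough sketch of pruning the extra kernels with standard dnf tooling (verify the list before removing; this is not part of the current scripts):

```bash
# List every installed "installonly" package (kernel-core, kernel-modules, ...)
# except the newest one, then remove the older copies.
# Note: dnf refuses to remove the currently running kernel by default.
old_kernels="$(dnf repoquery --installonly --latest-limit=-1 -q)"
if [ -n "${old_kernels}" ]; then
    dnf remove -y ${old_kernels}
fi

# Afterwards, re-run the NVIDIA driver setup against the remaining kernel,
# e.g. the install_rh_nvidia_drivers step from cloud-instance.sh.
```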