Thanks to visit codestin.com
Credit goes to github.com

Skip to content

cloud-instance.sh script fails to run install_rh_nvidia_drivers function when there are multiple kernel core releases/versions available #3016

@kami619

Description

@kami619

Describe the bug

The cloud-instance.sh utility script fails to run install_rh_nvidia_drivers function when there are multiple kernel core releases/versions available.

To Reproduce
Steps to reproduce the behavior:

  1. Provision any Cloud/Bare metal server and install a Linux OS distribution.
  2. Run through the steps mentioned here.
  3. Make sure, you install multiple kernel-core versions on the host node, before you run the install_rh_nvidia_drivers step.
  4. Notice the error in the console logs.

Expected behavior
We would expect and desire the cloud-instance.sh and the nvidia-setup.sh scripts to handle the situation when there are multiple kernel-core versions available. It kind of does in this step, but it could fail even before that over here.

Logs

Dependencies resolved.
Nothing to do.
Complete!
+ '[' '' == '' ']'
++ dnf info --installed kernel-core
++ awk -F: '/^Release/{print $2}'
++ tr -d '[:blank:]'
+ RELEASE='412.el9
547.el9'
++ dnf info --installed kernel-core
++ awk -F: '/^Version/{print $2}'
++ tr -d '[:blank:]'
+ VERSION='5.14.0
5.14.0'
+ export 'KERNEL_VERSION=5.14.0
5.14.0-412.el9
547.el9'
+ KERNEL_VERSION='5.14.0
5.14.0-412.el9
547.el9'
+ dnf install -y 'kernel-devel-5.14.0
5.14.0-412.el9
547.el9'
Last metadata expiration check: 0:23:44 ago on Mon 27 Jan 2025 03:54:29 PM EST.
No match for argument: kernel-devel-5.14.0
5.14.0-412.el9
547.el9
Error: Unable to find a match: kernel-devel-5.14.0
5.14.0-412.el9
547.el9
+ CUDA_REPO_ARCH=
+ '[' '' == aarch64 ']'
+ cp -a /etc/dnf/dnf.conf /etc/dnf/dnf.conf.tmp
+ mv /etc/dnf/dnf.conf.tmp /etc/dnf/dnf.conf
+ dnf config-manager --best --nodocs --setopt=install_weak_deps=False --save
+ dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo
Adding repo from: https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo
Status code: 404 for https://developer.download.nvidia.com/compute/cuda/repos/rhel//cuda-rhel.repo (IP: 23.205.107.71)
Error: Configuration of repo failed
+ systemctl daemon-reload
+ systemctl enable --now nvidia-toolkit-setup.service
Job for nvidia-toolkit-setup.service failed because the control process exited with error code.
See "systemctl status nvidia-toolkit-setup.service" and "journalctl -xeu nvidia-toolkit-setup.service" for details.

Device Info (please complete the following information):

  • Hardware Specs: IBM Cloud - gx3-24x120x1l40s
  • OS Version: CentOS Stream release 9 (Red Hat Enterprise Linux 9)
  • Python Version: Python 3.11.11
  • InstructLab Version: NA

Additional context
We would have to handle the situation where the right kernel-core version is initially used for the login and then handle the deletion of the additional kernel-core versions and then re-install the nvidia drivers or use a pre-loaded image with everything installed.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions