@ktdreyer ktdreyer commented Mar 4, 2025

This PR improves kernel and package handling for cloud-instance.sh.

  • Sort kernel versions by build time, so that we always build for the very latest kernel
  • Bail early on package install failures
  • Improve logging for instance name and ID

When choosing the latest KERNEL_VERSION to build the nvidia driver,
select the kernel package that has the most recent build time. This
integer is easy to sort, and we avoid kernel version string sorting
nuances.

Signed-off-by: Ken Dreyer <[email protected]>
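
As a rough illustration of the build-time sort described in the commit message above (a sketch, not the actual diff): the helper name latest_by_buildtime and the sample build times are hypothetical, and the rpm query shown in the comment is only one way the input lines could be produced.

```shell
# Pick the entry whose build time is the largest.
# Input:  lines of "<buildtime> <version>" on stdin
# Output: the <version> field of the line with the highest build time
latest_by_buildtime() {
    sort -n -k1,1 | tail -n 1 | cut -d' ' -f2
}

# Sample data (hypothetical build times) standing in for rpm output such as:
#   rpm -q kernel-core --queryformat '%{BUILDTIME} %{VERSION}-%{RELEASE}.%{ARCH}\n'
sample='1741000000 5.14.0-570.el9.x86_64
1739000000 5.14.0-565.el9.x86_64'

KERNEL_VERSION=$(printf '%s\n' "$sample" | latest_by_buildtime)
echo "$KERNEL_VERSION"
```

Because the build time is a plain integer (seconds since the epoch), a numeric sort is enough, and no version-string comparison rules are involved.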
@ktdreyer ktdreyer requested review from booxter and danmcp March 4, 2025 17:34
@mergify mergify bot added the ci-failure label (PR has at least one CI failure) Mar 4, 2025
ktdreyer commented Mar 4, 2025

Here is the set of commands I use to test this. I'm using EC2_AMI_ID="ami-09fc0e32ec75bfe3f", a slightly older CentOS Stream 9 image that ships kernel-5.14.0-565.el9, so I can run dnf update to install the newer kernel-5.14.0-570.el9.

./cloud-instance.sh ec2 launch
./cloud-instance.sh ec2 ssh sudo dnf -y update kernel-core
./cloud-instance.sh ec2 ssh rpm -q kernel-core
./cloud-instance.sh ec2 install-rh-nvidia-drivers
./cloud-instance.sh ec2 ssh sudo reboot
./cloud-instance.sh ec2 ssh uname -r
./cloud-instance.sh ec2 ssh systemctl status nvidia-toolkit-setup

When you specify multiple packages to DNF in RHEL 9, if some are
available and some are missing, DNF will print a warning and continue
on. As a result, the install-nvidia.sh script can proceed without
installing all dependencies, and when the user reboots, the nvidia
drivers may not function.

Reconfigure DNF to error on any missing packages. This helps us catch
problems sooner.

Combine the two dnf config-manager commands into one invocation at the
top of the script so the settings take effect for all subsequent DNF
operations.

Remove the "--best --nodocs" options, because they only affect dnf
install operations, not config-manager.

Signed-off-by: Ken Dreyer <[email protected]>
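
The commit above changes DNF's own configuration. A different, complementary way to catch the same failure mode is to check after the install step that every requested package is actually present. The sketch below shows that alternative check only; it is not what this PR does. The helper name missing_packages and the package names are hypothetical.

```shell
# Print each required package name that is absent from the installed list.
# Usage: missing_packages "<installed names, one per line>" pkg1 pkg2 ...
missing_packages() {
    installed=$1
    shift
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}

# Hypothetical data; on a real host the list could come from:
#   rpm -qa --queryformat '%{NAME}\n'
installed='kernel-core
gcc
make'

missing=$(missing_packages "$installed" gcc make kernel-devel)
if [ -n "$missing" ]; then
    # A real script would likely exit non-zero here to bail early.
    echo "Missing packages: $missing" >&2
fi
```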
@mergify mergify bot removed the ci-failure label (PR has at least one CI failure) Mar 4, 2025
@mergify mergify bot added the one-approval label (PR has one approval from a maintainer) Mar 4, 2025
ktdreyer added 2 commits March 4, 2025 15:28
This script can fail at various points. We want the user to have this
piece of debugging information if that happens.

Signed-off-by: Ken Dreyer <[email protected]>
This script handles multiple kernels transparently now, so we should not
uninstall them.

Advise users to update to the latest kernel instead.

Signed-off-by: Ken Dreyer <[email protected]>
We only want to compute KERNEL_VERSION if the user has not already set
that variable (in other words, it is not null/empty).

In Bash, we can use -z to test if a string is null/empty.

Signed-off-by: Ken Dreyer <[email protected]>
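
A minimal sketch of the -z pattern described in the commit message above. The functions detect_latest_kernel and resolve_kernel_version are hypothetical stand-ins for the script's real logic, and the version string is only sample data.

```shell
# Hypothetical stand-in for the script's real kernel detection.
detect_latest_kernel() {
    echo "5.14.0-570.el9.x86_64"
}

# Compute KERNEL_VERSION only when the caller has not already set it.
# ${KERNEL_VERSION:-} expands to an empty string when the variable is unset,
# which keeps the test safe under "set -u".
resolve_kernel_version() {
    if [ -z "${KERNEL_VERSION:-}" ]; then
        KERNEL_VERSION=$(detect_latest_kernel)
    fi
    echo "$KERNEL_VERSION"
}
```

With this shape, running the script with KERNEL_VERSION=5.14.0-565.el9.x86_64 in the environment keeps the caller's choice, while leaving it unset or empty falls back to detection.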
@mergify mergify bot merged commit 1db59e8 into instructlab:main Mar 4, 2025
8 checks passed
@mergify mergify bot removed the one-approval label (PR has one approval from a maintainer) Mar 4, 2025
@ktdreyer ktdreyer deleted the nvidia-kernels branch March 6, 2025 21:12