Conversation

@booxter
Contributor

@booxter booxter commented Feb 26, 2025

  • fix(infra): honor 2+ kernels when installing nvidia drivers
  • fix(infra): use -open nvidia module
  • fix(infra): fix nvidia installation due to missing package

The script was broken due to some changes on the nvidia rpm repo side.

In addition, I'm tackling the problem with the script not handling multiple
kernel versions installed at the same time. In this case, the script will pick
the latest version installed.
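
A minimal sketch of the "pick the latest installed kernel" idea, assuming GNU `sort -V` is available and that version strings arrive one per line. The function name `pick_latest_kernel` is hypothetical and is not taken from the script changed in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): read kernel version strings on
# stdin, one per line, and print the highest one.
pick_latest_kernel() {
    # sort -V (GNU coreutils) compares numeric fields as numbers, so
    # 5.14.0-570 sorts after both 5.14.0-565 and 5.14.0-70.
    sort -V | tail -n 1
}

# Fixed sample input so the sketch runs on any machine.
printf '%s\n' 5.14.0-570.el9.x86_64 5.14.0-565.el9.x86_64 | pick_latest_kernel
```

On an RPM-based host the input could come from something like `rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-core`; which package the real script queries is an assumption here.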

Resolves #3016

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@booxter
Contributor Author

booxter commented Feb 26, 2025

@ktdreyer FYI

(Sorry for stepping on your toes, I needed this script fixed myself, so I couldn't wait for your fix...)

@mergify mergify bot added the one-approval PR has one approval from a maintainer label Feb 26, 2025
@booxter booxter requested review from a team and danmcp February 26, 2025 17:32
&& dnf config-manager --best --nodocs --setopt=install_weak_deps=False --save \
&& dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel${OS_VERSION_MAJOR}/${CUDA_REPO_ARCH}/cuda-rhel${OS_VERSION_MAJOR}.repo \
&& dnf -y module enable nvidia-driver:${DRIVER_STREAM}/default \
&& dnf -y module enable nvidia-driver:${DRIVER_STREAM}-dkms/default \
Member

Is this the right change or should we be using a different version?

Contributor Author

I assume -open is the new driver: https://github.com/NVIDIA/open-gpu-kernel-modules and -dkms is the legacy / old / previous one. Correct? I assume we want to stick to the previous version we used which - I assume - is the -dkms one, and if -open is a better alternative now, then a separate change would be sent to do the switch. I don't know much about the nvidia drivers to make the decision either way, so sticking to the status quo here.

Does it make sense?

Member

AFAIU, the previous working one was 550 (not dkms or open):

sudo dnf module list nvidia-driver
Last metadata expiration check: 0:58:29 ago on Wed 26 Feb 2025 09:06:08 PM UTC.
cuda-rhel9-x86_64
Name                            Stream                         Profiles                                  Summary                                              
nvidia-driver                   latest                         default [d], fm, ks, src                  Nvidia driver for latest branch                      
nvidia-driver                   latest-dkms                    default [d], fm, ks                       Nvidia driver for latest-dkms branch                 
nvidia-driver                   open-dkms [d]                  default [d], fm, ks, src                  Nvidia driver for open-dkms branch                   
nvidia-driver                   515                            default [d], fm, ks, src                  Nvidia driver for 515 branch                         
nvidia-driver                   515-dkms                       default [d], fm, ks                       Nvidia driver for 515-dkms branch                    
nvidia-driver                   515-open                       default [d], fm, ks, src                  Nvidia driver for 515-open branch                    
nvidia-driver                   520                            default [d], fm, ks, src                  Nvidia driver for 520 branch                         
nvidia-driver                   520-dkms                       default [d], fm, ks                       Nvidia driver for 520-dkms branch                    
nvidia-driver                   520-open                       default [d], fm, ks, src                  Nvidia driver for 520-open branch                    
nvidia-driver                   525                            default [d], fm, ks, src                  Nvidia driver for 525 branch                         
nvidia-driver                   525-dkms                       default [d], fm, ks                       Nvidia driver for 525-dkms branch                    
nvidia-driver                   525-open                       default [d], fm, ks, src                  Nvidia driver for 525-open branch                    
nvidia-driver                   530                            default [d], fm, ks, src                  Nvidia driver for 530 branch                         
nvidia-driver                   530-dkms                       default [d], fm, ks                       Nvidia driver for 530-dkms branch                    
nvidia-driver                   530-open                       default [d], fm, ks, src                  Nvidia driver for 530-open branch                    
nvidia-driver                   535                            default [d], fm, ks, src                  Nvidia driver for 535 branch                         
nvidia-driver                   535-dkms                       default [d], fm, ks                       Nvidia driver for 535-dkms branch                    
nvidia-driver                   535-open                       default [d], fm, ks, src                  Nvidia driver for 535-open branch                    
nvidia-driver                   545                            default [d], fm, ks, src                  Nvidia driver for 545 branch                         
nvidia-driver                   545-dkms                       default [d], fm, ks                       Nvidia driver for 545-dkms branch                    
nvidia-driver                   545-open                       default [d], fm, ks, src                  Nvidia driver for 545-open branch                    
nvidia-driver                   550 [e]                        default [d], fm, ks, src                  Nvidia driver for 550 branch                         
nvidia-driver                   550-dkms                       default [d], fm, ks                       Nvidia driver for 550-dkms branch                    
nvidia-driver                   550-open                       default [d], fm, ks, src                  Nvidia driver for 550-open branch                    
nvidia-driver                   555                            default [d], fm, ks, src                  Nvidia driver for 555 branch                         
nvidia-driver                   555-dkms                       default [d], fm, ks                       Nvidia driver for 555-dkms branch                    
nvidia-driver                   555-open                       default [d], fm, ks, src                  Nvidia driver for 555-open branch                    
nvidia-driver                   560                            default [d], fm, ks, src                  Nvidia driver for 560 branch                         
nvidia-driver                   560-dkms                       default [d], fm, ks                       Nvidia driver for 560-dkms branch                    
nvidia-driver                   560-open                       default [d], fm, ks, src                  Nvidia driver for 560-open branch                    
nvidia-driver                   565                            default [d], fm, ks, src                  Nvidia driver for 565 branch                         
nvidia-driver                   565-dkms                       default [d], fm, ks                       Nvidia driver for 565-dkms branch                    
nvidia-driver                   565-open                       default [d], fm, ks, src                  Nvidia driver for 565-open branch                    
nvidia-driver                   570-dkms                       default [d], fm, ks                       Nvidia driver for 570-dkms branch                    
nvidia-driver                   570-open                       default [d], fm, ks, src                  Nvidia driver for 570-open branch    

I think up to this point we have always been using the "-null" stream.

Contributor Author

OK, but it was bumped to 570 at ecda29302, so do we want to revert this? I'm not sure I understand the implication.

Contributor Author

For the record, I have no idea why 570 doesn't have a null stream. But if we want to stick to this version, looks like we'll have to choose one of open or dkms.
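
To see which flavors a given driver branch offers, the `dnf module list nvidia-driver` output shown earlier can be filtered by stream name. The sketch below is only an illustration: `list_streams` is a hypothetical helper, and it assumes the stream name is the second whitespace-separated column, as in the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dnf module list nvidia-driver` output on stdin
# and print the streams that belong to the driver branch given as $1
# (the bare branch, e.g. "570", plus any suffixed form such as "570-open").
list_streams() {
    awk -v v="$1" '
        $1 == "nvidia-driver" && ($2 == v || index($2, v "-") == 1) { print $2 }
    '
}

# Sample lines copied from the listing in this thread.
printf '%s\n' \
    'nvidia-driver 565 default [d], fm, ks, src Nvidia driver for 565 branch' \
    'nvidia-driver 570-dkms default [d], fm, ks Nvidia driver for 570-dkms branch' \
    'nvidia-driver 570-open default [d], fm, ks, src Nvidia driver for 570-open branch' \
    | list_streams 570
```

With the sample lines, only `570-dkms` and `570-open` are printed, which matches the observation that 570 has no suffix-less stream. On a real host the input would be piped from `dnf module list nvidia-driver`.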

@danmcp danmcp requested a review from ktdreyer February 26, 2025 18:41
@danmcp
Member

danmcp commented Feb 26, 2025

@fabiendupont FYI

@ktdreyer
Contributor

I haven't used dkms in a decade, so I'm learning.

I've tested this manually on CentOS's ami-09fc0e32ec75bfe3f.

  • Running this code with a single kernel-5.14.0-565.el9 installed: works
  • Running this code with a single kernel-5.14.0-565.el9 installed, then running dnf update && reboot into 5.14.0-570.el9: the driver does not load. I thought dkms dynamically recompiled for us on kernel updates?
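
One rough way to check the second case is to look for a driver module built for the kernel that was booted. This is a sketch under assumptions: modules are expected under `/lib/modules/<kernel-version>/`, the main module file is assumed to be named `nvidia.ko` (possibly compressed), and `has_nvidia_module` with its `root` argument is a hypothetical helper added so the check can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if an nvidia kernel module file exists for
# the given kernel version under ROOT/lib/modules.
#   $1 = root directory (use "" or "/" for the real system)
#   $2 = kernel version, e.g. the output of `uname -r`
has_nvidia_module() {
    root="${1%/}"
    kver="$2"
    # Match nvidia.ko as well as compressed variants such as nvidia.ko.xz.
    find "$root/lib/modules/$kver" -name 'nvidia.ko*' 2>/dev/null | grep -q .
}

# Example call against the running system (commented out so that the
# sketch has no side effects when sourced):
# has_nvidia_module / "$(uname -r)" && echo "module present" || echo "module missing"
```

If the module exists only under the directory of the older kernel, the driver cannot load after rebooting into the newer one until it is rebuilt.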

@booxter
Contributor Author

booxter commented Feb 26, 2025

@ktdreyer did it work before? As I understand the script, we build a kmod RPM with rpmbuild and then install it from that RPM, not from Nvidia's repo. So I doubt it ever allowed updating the kernel after the fact without re-running the script.

Maybe now that we have a dkms dnf module for nvidia, we CAN avoid building an RPM package and could instead install:

dnf info kmod-nvidia-latest-dkms.x86_64
Last metadata expiration check: 0:04:34 ago on Wed 26 Feb 2025 10:08:58 PM UTC.
Available Packages
Name         : kmod-nvidia-latest-dkms
Epoch        : 3
Version      : 570.86.15
Release      : 1.el9
Architecture : x86_64
Size         : 69 M
Source       : kmod-nvidia-latest-dkms-570.86.15-1.el9.src.rpm
Repository   : cuda-rhel9-x86_64
Summary      : NVIDIA display driver kernel module
URL          : http://www.nvidia.com/object/unix.html
License      : NVIDIA License
Description  : This package provides the proprietary Nvidia kernel driver modules.
             : The modules are rebuilt through the DKMS system when a new kernel or
             : modules become available.

@booxter
Contributor Author

booxter commented Feb 26, 2025

@ktdreyer @danmcp I have a commit in my private repo that switches the script to the in-repo dkms package for the kmod: bb4ef06. I will send it after this PR is merged, since it's a different topic. (Confirmed the change works against a fresh centos9 node.)

@ktdreyer
Contributor

@danmcp do you want to use DKMS in this script?

@danmcp
Member

danmcp commented Feb 26, 2025

> @danmcp do you want to use DKMS in this script?

My concern would be consistency with downstream environments. At a minimum, I think we need the "-null" option to still be available.

@booxter
Contributor Author

booxter commented Feb 26, 2025

@danmcp what does 'the "-null" option to still be available' mean in a situation where the version doesn't seem to have -null available from the repo? Do you suggest we revert to an older version in the script that has a -null flavor? Alternatively, should we reach out to Nvidia to ask why there's no -null?

I'm trying to understand what the path forward here is. Note the script is currently completely broken, so something has to be changed.

@danmcp
Member

danmcp commented Feb 27, 2025

@booxter That would mean using an older version that still has a -null for now. But I am unclear on the original expectation on moving to 570.

@fabiendupont Can you clarify which module you were expecting to pick up with the move to 570?

@fabiendupont
Contributor

The move to driver 570 and CUDA 12.8 is motivated by support for NVIDIA Blackwell. For the driver, the 570 branch is a long-term support version, so it is what NVIDIA recommends to their customers.

Which version of RHEL is used? If RHEL 9.5, the 570-open stream will provide precompiled kernel modules. Otherwise, you'll need to install DKMS, which is not supported by RHEL. Another option is to download the .run installer and execute it in unattended mode.

@booxter
Contributor Author

booxter commented Feb 27, 2025

Thanks for this, @fabiendupont. I will then try to switch to -open.

@booxter booxter marked this pull request as draft February 27, 2025 14:07
@fabiendupont
Contributor

Never mind my comment about the RHEL version. The script is compiling the kernel modules from source, so you shouldn't have any issue. I then definitely recommend 570-open.

@booxter
Contributor Author

booxter commented Feb 27, 2025

@fabiendupont but should the script compile the driver if it's included in the -open?

@fabiendupont
Contributor

Yes. NVIDIA doesn't provide precompiled drivers for CentOS Stream, so you have to build them anyways, either with rpmbuild or DKMS. Downstream, we use rpmbuild, because DKMS is not available for RHEL.

@booxter
Contributor Author

booxter commented Feb 27, 2025

OK.

Wondering then: should the script detect whether it's running on RHEL and, if it is not, use the DKMS dnf module to spare the rpmbuild (and otherwise use -open)? It seems that neither the rpmbuild path nor the dkms path is the way you'd do it for RHEL, so if we can remove the rpmbuild code, that could be beneficial.
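
The detection asked about here could look roughly like the sketch below. It is an illustration only and nothing in this PR implements it. It assumes the distribution is identified by the `ID` field of an os-release file, and `choose_kmod_path` is a hypothetical name. Mapping RHEL to rpmbuild follows the earlier remark that DKMS is not available for RHEL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "rpmbuild" on RHEL and "dkms" elsewhere,
# based on the ID field of an os-release file.
#   $1 = path to the os-release file (default: /etc/os-release)
choose_kmod_path() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak out.
    (
        ID=""
        . "$os_release" || exit 1
        if [ "$ID" = "rhel" ]; then
            echo rpmbuild
        else
            echo dkms
        fi
    )
}

# Example call (commented out because /etc/os-release may not exist on
# every machine):
# choose_kmod_path
```

On CentOS Stream the os-release file is expected to report `ID="centos"`, so this sketch would print `dkms` there.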

Only -open and -dkms exist. There's no suffix-less module in the repo.

We are using -open here because we are building the kmodule with
rpmbuild anyway.

Signed-off-by: Ihar Hrachyshka <[email protected]>
@booxter
Contributor Author

booxter commented Feb 27, 2025

Switched to -open.

[ec2-user@ip-10-0-25-214 ~]$ dnf info kmod-nvidia-570.86.15-5.14.0-570.x86_64
Last metadata expiration check: 0:00:15 ago on Thu 27 Feb 2025 03:34:10 PM UTC.
Installed Packages
Name         : kmod-nvidia-570.86.15-5.14.0-570
Epoch        : 3
Version      : 570.86.15
Release      : 3.el9
Architecture : x86_64
Size         : 27 M
Source       : kmod-nvidia-570.86.15-5.14.0-570-570.86.15-3.el9.src.rpm
Repository   : @System
From repo    : @commandline
Summary      : NVIDIA graphics driver
URL          : http://www.nvidia.com/
License      : Nvidia
Description  : The NVidia 570.86.15 display driver kernel module for kernel 5.14.0-570.el9
[ec2-user@ip-10-0-25-214 ~]$ dnf module list | grep nvidia-driver
nvidia-driver latest        default [d], fm, ks, src                          Nvidia driver for latest branch
nvidia-driver latest-dkms   default [d], fm, ks                               Nvidia driver for latest-dkms branch
nvidia-driver open-dkms [d] default [d], fm, ks, src                          Nvidia driver for open-dkms branch
nvidia-driver 515           default [d], fm, ks, src                          Nvidia driver for 515 branch
nvidia-driver 515-dkms      default [d], fm, ks                               Nvidia driver for 515-dkms branch
nvidia-driver 515-open      default [d], fm, ks, src                          Nvidia driver for 515-open branch
nvidia-driver 520           default [d], fm, ks, src                          Nvidia driver for 520 branch
nvidia-driver 520-dkms      default [d], fm, ks                               Nvidia driver for 520-dkms branch
nvidia-driver 520-open      default [d], fm, ks, src                          Nvidia driver for 520-open branch
nvidia-driver 525           default [d], fm, ks, src                          Nvidia driver for 525 branch
nvidia-driver 525-dkms      default [d], fm, ks                               Nvidia driver for 525-dkms branch
nvidia-driver 525-open      default [d], fm, ks, src                          Nvidia driver for 525-open branch
nvidia-driver 530           default [d], fm, ks, src                          Nvidia driver for 530 branch
nvidia-driver 530-dkms      default [d], fm, ks                               Nvidia driver for 530-dkms branch
nvidia-driver 530-open      default [d], fm, ks, src                          Nvidia driver for 530-open branch
nvidia-driver 535           default [d], fm, ks, src                          Nvidia driver for 535 branch
nvidia-driver 535-dkms      default [d], fm, ks                               Nvidia driver for 535-dkms branch
nvidia-driver 535-open      default [d], fm, ks, src                          Nvidia driver for 535-open branch
nvidia-driver 545           default [d], fm, ks, src                          Nvidia driver for 545 branch
nvidia-driver 545-dkms      default [d], fm, ks                               Nvidia driver for 545-dkms branch
nvidia-driver 545-open      default [d], fm, ks, src                          Nvidia driver for 545-open branch
nvidia-driver 550           default [d], fm, ks, src                          Nvidia driver for 550 branch
nvidia-driver 550-dkms      default [d], fm, ks                               Nvidia driver for 550-dkms branch
nvidia-driver 550-open      default [d], fm, ks, src                          Nvidia driver for 550-open branch
nvidia-driver 555           default [d], fm, ks, src                          Nvidia driver for 555 branch
nvidia-driver 555-dkms      default [d], fm, ks                               Nvidia driver for 555-dkms branch
nvidia-driver 555-open      default [d], fm, ks, src                          Nvidia driver for 555-open branch
nvidia-driver 560           default [d], fm, ks, src                          Nvidia driver for 560 branch
nvidia-driver 560-dkms      default [d], fm, ks                               Nvidia driver for 560-dkms branch
nvidia-driver 560-open      default [d], fm, ks, src                          Nvidia driver for 560-open branch
nvidia-driver 565           default [d], fm, ks, src                          Nvidia driver for 565 branch
nvidia-driver 565-dkms      default [d], fm, ks                               Nvidia driver for 565-dkms branch
nvidia-driver 565-open      default [d], fm, ks, src                          Nvidia driver for 565-open branch
nvidia-driver 570-dkms      default [d], fm, ks                               Nvidia driver for 570-dkms branch
nvidia-driver 570-open [e]  default [d], fm, ks, src                          Nvidia driver for 570-open branch

@booxter booxter marked this pull request as ready for review February 27, 2025 15:36
@booxter booxter requested a review from danmcp February 27, 2025 16:21
Contributor

@ktdreyer ktdreyer left a comment

Would you please take booxter#1 into your branch? This will fix #3016 completely.

I've tested this a couple times now. It solves one of the bugs I was hitting (dnf module enable fails).

I have an improvement to the KERNEL_VERSION selection that I can post after we merge this.

We now sort them and choose the last one.

Signed-off-by: Ken Dreyer <[email protected]>
@mergify mergify bot removed the one-approval PR has one approval from a maintainer label Feb 27, 2025
@mergify mergify bot merged commit 284c220 into instructlab:main Feb 27, 2025
7 checks passed
Development

Successfully merging this pull request may close these issues.

cloud-instance.sh script fails to run install_rh_nvidia_drivers function when there are multiple kernel core releases/versions available

5 participants