fix(infra): fix nvidia drivers installation script #3190
Closes: instructlab#3016 Signed-off-by: Ihar Hrachyshka <[email protected]>
@ktdreyer FYI (Sorry for stepping on your toes, I needed this script fixed myself, so I couldn't wait for your fix...)
scripts/infra/nvidia-setup.sh
Outdated
  && dnf config-manager --best --nodocs --setopt=install_weak_deps=False --save \
  && dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel${OS_VERSION_MAJOR}/${CUDA_REPO_ARCH}/cuda-rhel${OS_VERSION_MAJOR}.repo \
- && dnf -y module enable nvidia-driver:${DRIVER_STREAM}/default \
+ && dnf -y module enable nvidia-driver:${DRIVER_STREAM}-dkms/default \
Is this the right change or should we be using a different version?
I assume -open is the new driver (https://github.com/NVIDIA/open-gpu-kernel-modules) and -dkms is the legacy / old / previous one. Correct? I assume we want to stick with the version we used previously, which I assume is the -dkms one; if -open is a better alternative now, a separate change can be sent to make the switch. I don't know enough about the NVIDIA drivers to make the decision either way, so I'm sticking to the status quo here.
Does it make sense?
AFAIU, the previous working one was 550 (not dkms or open):
sudo dnf module list nvidia-driver
Last metadata expiration check: 0:58:29 ago on Wed 26 Feb 2025 09:06:08 PM UTC.
cuda-rhel9-x86_64
Name           Stream         Profiles                  Summary
nvidia-driver  latest         default [d], fm, ks, src  Nvidia driver for latest branch
nvidia-driver  latest-dkms    default [d], fm, ks       Nvidia driver for latest-dkms branch
nvidia-driver  open-dkms [d]  default [d], fm, ks, src  Nvidia driver for open-dkms branch
nvidia-driver  515            default [d], fm, ks, src  Nvidia driver for 515 branch
nvidia-driver  515-dkms       default [d], fm, ks       Nvidia driver for 515-dkms branch
nvidia-driver  515-open       default [d], fm, ks, src  Nvidia driver for 515-open branch
nvidia-driver  520            default [d], fm, ks, src  Nvidia driver for 520 branch
nvidia-driver  520-dkms       default [d], fm, ks       Nvidia driver for 520-dkms branch
nvidia-driver  520-open       default [d], fm, ks, src  Nvidia driver for 520-open branch
nvidia-driver  525            default [d], fm, ks, src  Nvidia driver for 525 branch
nvidia-driver  525-dkms       default [d], fm, ks       Nvidia driver for 525-dkms branch
nvidia-driver  525-open       default [d], fm, ks, src  Nvidia driver for 525-open branch
nvidia-driver  530            default [d], fm, ks, src  Nvidia driver for 530 branch
nvidia-driver  530-dkms       default [d], fm, ks       Nvidia driver for 530-dkms branch
nvidia-driver  530-open       default [d], fm, ks, src  Nvidia driver for 530-open branch
nvidia-driver  535            default [d], fm, ks, src  Nvidia driver for 535 branch
nvidia-driver  535-dkms       default [d], fm, ks       Nvidia driver for 535-dkms branch
nvidia-driver  535-open       default [d], fm, ks, src  Nvidia driver for 535-open branch
nvidia-driver  545            default [d], fm, ks, src  Nvidia driver for 545 branch
nvidia-driver  545-dkms       default [d], fm, ks       Nvidia driver for 545-dkms branch
nvidia-driver  545-open       default [d], fm, ks, src  Nvidia driver for 545-open branch
nvidia-driver  550 [e]        default [d], fm, ks, src  Nvidia driver for 550 branch
nvidia-driver  550-dkms       default [d], fm, ks       Nvidia driver for 550-dkms branch
nvidia-driver  550-open       default [d], fm, ks, src  Nvidia driver for 550-open branch
nvidia-driver  555            default [d], fm, ks, src  Nvidia driver for 555 branch
nvidia-driver  555-dkms       default [d], fm, ks       Nvidia driver for 555-dkms branch
nvidia-driver  555-open       default [d], fm, ks, src  Nvidia driver for 555-open branch
nvidia-driver  560            default [d], fm, ks, src  Nvidia driver for 560 branch
nvidia-driver  560-dkms       default [d], fm, ks       Nvidia driver for 560-dkms branch
nvidia-driver  560-open       default [d], fm, ks, src  Nvidia driver for 560-open branch
nvidia-driver  565            default [d], fm, ks, src  Nvidia driver for 565 branch
nvidia-driver  565-dkms       default [d], fm, ks       Nvidia driver for 565-dkms branch
nvidia-driver  565-open       default [d], fm, ks, src  Nvidia driver for 565-open branch
nvidia-driver  570-dkms       default [d], fm, ks       Nvidia driver for 570-dkms branch
nvidia-driver  570-open       default [d], fm, ks, src  Nvidia driver for 570-open branch
I think up to this point we have always been using the "-null" stream.
OK, but it was bumped to 570 at ecda29302, so do we want to revert this? I'm not sure I understand the implication.
For the record, I have no idea why 570 doesn't have a null stream. But if we want to stick to this version, looks like we'll have to choose one of open or dkms.
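To make the stream choice discussed here concrete, the following is a minimal sketch of how a script could pick a stream from the `dnf module list nvidia-driver` output shown earlier. The `pick_stream` helper and its fallback order (suffix-less stream first, then `-open`, then `-dkms`) are assumptions for illustration only; they are not what `nvidia-setup.sh` does.

```shell
# Hypothetical helper, not part of nvidia-setup.sh.
# Reads `dnf module list nvidia-driver` output on stdin and prints the
# stream to enable for the driver branch given as $1.
# Assumed preference order: "<branch>", then "<branch>-open", then "<branch>-dkms".
pick_stream() {
    branch="$1"
    # Column 2 of every "nvidia-driver ..." row is the stream name.
    streams="$(awk '$1 == "nvidia-driver" { print $2 }')"
    for candidate in "$branch" "$branch-open" "$branch-dkms"; do
        # -x matches whole lines only, -F disables regex interpretation.
        if printf '%s\n' "$streams" | grep -qxF -- "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # No stream exists for this branch.
    return 1
}

# Possible usage (requires the CUDA repo to be configured):
#   stream="$(dnf module list nvidia-driver | pick_stream 570)"
#   dnf -y module enable "nvidia-driver:${stream}/default"
```

With the listing above, branch 570 would resolve to `570-open` because no suffix-less `570` stream is present, while branch 565 would still resolve to `565`.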
@fabiendupont FYI
I haven't used dkms in a decade, so I'm learning. I've tested this manually on CentOS's
@ktdreyer did it work before? As I understand the script, we are building a Maybe now that we have a dkms dnf module for nvidia, we CAN avoid building an RPM package and could instead install:
@danmcp do you want to use DKMS in this script?
My concern would be over consistency with downstream environments. At a minimum, I think we need the "-null" option to still be available.
@danmcp what does I'm trying to understand what the path forward here is. Note the script is currently completely broken, so something has to be changed.
@booxter That would mean using an older version that still has a -null for now. But I am unclear on the original expectation behind the move to 570. @fabiendupont Can you clarify which module you were expecting to pick up with the move to 570?
The move to driver 570 and CUDA 12.8 is motivated by support for NVIDIA Blackwell. For the driver, the 570 branch is a long-term support version, so it is what NVIDIA recommends to its customers. Which version of RHEL is used? If RHEL 9.5, the
Thanks for this, @fabiendupont. I will then try to switch to
Never mind my comment about the RHEL version. The script is compiling the kernel modules from source, so you shouldn't have any issue. I then definitely recommend
@fabiendupont but should the script compile the driver if it's included in the -open?
Yes. NVIDIA doesn't provide precompiled drivers for CentOS Stream, so you have to build them anyway, either with
OK. Wondering then: should the script detect if it's not RHEL and, if so, use the DKMS dnf module to spare the
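As a rough sketch of the detection being asked about here, a script could read the `ID` field from `/etc/os-release`. The `os_id` and `is_rhel` helpers are hypothetical names introduced for illustration, and treating only `ID=rhel` as RHEL is an assumption; the thread does not settle whether the script should branch on this at all.

```shell
# Hypothetical helpers, not part of nvidia-setup.sh.

# Print the ID field of an os-release file (default: /etc/os-release).
os_id() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$ID" )
}

# Succeed only when the distribution identifies itself as RHEL.
# Assumption: ID=rhel is the only value treated as RHEL; CentOS Stream
# reports ID="centos" and therefore counts as "not RHEL".
is_rhel() {
    [ "$(os_id "$1")" = "rhel" ]
}

# Possible usage:
#   if is_rhel; then
#       echo "RHEL detected"
#   else
#       echo "not RHEL: the driver has to be built locally (DKMS or rpmbuild)"
#   fi
```

The optional path argument exists only so the helpers can be exercised against a sample file instead of the host's own `/etc/os-release`.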
Only -open and -dkms exist. There is no suffix-less module in the repo. We are using -open here because we are building the kernel module with rpmbuild anyway. Signed-off-by: Ihar Hrachyshka <[email protected]>
-NVML- is now, apparently, libnvidia-ml. https://github.com/NVIDIA/yum-packaging-nvidia-driver/blob/d06f50a507eb6e053e2e7dd4fcd7781060f8ff06/nvidia-driver.spec#L117 Signed-off-by: Ihar Hrachyshka <[email protected]>
Switched to
ktdreyer
left a comment
We now sort them and choose the last one. Signed-off-by: Ken Dreyer <[email protected]>
The script was broken by changes on the NVIDIA RPM repo side.
In addition, I'm tackling the problem of the script not handling multiple
kernel versions installed at the same time. In this case, the script will pick
the latest version installed.
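The "sort them and choose the last one" approach can be sketched as follows. This is an illustration under assumptions, not the code from the commit: `latest_version` is a hypothetical helper, it relies on the GNU `sort -V` extension, and version sort is only an approximation of RPM's own version comparison.

```shell
# Hypothetical helper, not the exact code from nvidia-setup.sh.
# Reads one kernel version-release string per line on stdin and prints
# the highest one according to GNU sort's version ordering.
latest_version() {
    sort -V | tail -n 1
}

# Possible usage on a host with several kernels installed
# (assumes the kernel-core package name used on RHEL 9 / CentOS Stream 9):
#   rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-core | latest_version
```

Version sort matters here because a plain lexical sort would place `5.14.0-70.el9.x86_64` after `5.14.0-503.el9.x86_64`, which would select the older kernel.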
Resolves #3016
Checklist:
conventional commits.