Description
Steps to reproduce
We are trying to install the NVIDIA DKMS drivers into a Rocky 9 image that boots via Dracut. The target is a Dell RX760 server with 2x Tesla T4 GPUs. The driver installation and the image build both complete successfully, but when the node boots this image the init process hangs, eventually times out on nm-wait-online-initrd.service after the period set in #1850, and halts.
Very roughly, installing the drivers appears to interfere with the IP address handover from the DHCP lease the node receives during iPXE to the node IP it is supposed to receive during kernel init. Removing the drivers and rebuilding produces an image that boots fine. We have confirmed this is not a network speed issue related to image size: our test VM (albeit without any GPUs passed through) boots the full image successfully.
I have also installed the drivers manually after booting the image without them preinstalled, and they work fine, which rules out a hardware issue.
To reproduce:
- Pull the standard warewulf Rocky 9 image
- Enter the image shell, run:
dnf -y install https://github.com/warewulf/warewulf/releases/download/v4.6.4/warewulf-dracut-4.6.4-1.el9.noarch.rpm
dnf -y install dnf-plugins-core epel-release kernel-headers
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
dnf -y module install nvidia-driver:latest-dkms
dnf -y install datacenter-gpu-manager
dnf clean all
for dir in /usr/src/kernels/*; do dkms autoinstall --kernelver $(basename $dir); done
dkms status    # reports NVIDIA driver 580 as installed
exit 0
- Run
wwctl image exec rockylinux-9 -- /usr/bin/dracut --force --no-hostonly --add wwinit --regenerate-all
- Tell the node to boot from dracut:
wwctl node set node-name --tagadd IPXEMenuEntry=dracut
- Rebuild the overlay with
wwctl overlay build node-name
and reboot the target node
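For anyone reproducing this, it may help to confirm inside the image shell that DKMS built the module for every kernel before regenerating the initramfs. The following is only a sketch and not part of the original steps: the helper names has_nvidia_for_kernel and check_all_kernels are hypothetical, and it assumes each line of dkms status output contains the module name, then ", <kernel version>,", and ends with "installed".

```shell
#!/bin/sh
# Hypothetical helper (not part of warewulf or dkms).
# Reads `dkms status` output on stdin and succeeds if an nvidia module
# is reported as installed for the kernel version given as $1.
# Assumption: each status line looks like "<module>, <kver>, <arch>: installed".
has_nvidia_for_kernel() {
    kver=$1
    grep -i 'nvidia' | grep -F -- ", $kver," | grep -q 'installed'
}

# Hypothetical wrapper: checks every kernel that has headers in the image,
# mirroring the /usr/src/kernels loop used in the steps above.
check_all_kernels() {
    for dir in /usr/src/kernels/*; do
        kver=$(basename "$dir")
        if dkms status | has_nvidia_for_kernel "$kver"; then
            echo "ok: nvidia module installed for $kver"
        else
            echo "missing: no installed nvidia module for $kver"
        fi
    done
}
```

Calling check_all_kernels from the image shell, after the dkms autoinstall loop, prints one line per kernel. A "missing" line would suggest the module was not built for that kernel, though it would not by itself explain the boot hang.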
Error message
I've attached a screenshot from the console output of our server's iDRAC, which shows the timeout.
Information on your system
wwctl version:
wwctl version: 4.6.4-1
rpc version: apiPrefix:"rc1" apiVersion:"1" warewulfVersion:"4.6.4-1"
/etc/os-release:
NAME="AlmaLinux"
VERSION="9.6 (Sage Margay)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.6"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.6 (Sage Margay)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"
ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.6"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.6"
SUPPORT_END=2032-06-01
General information
- I have run wwctl version and reported the contents of /etc/os-release
- I have searched the issues of this repo and believe this is not a duplicate
- I have captured and reported relevant error messages and logs