Installing nvidia-driver:latest-dkms results in nm-wait-online-initrd.service timing out #2050

@dev-sda1

Description

Steps to reproduce

We're trying to install the NVIDIA DKMS drivers into a Rocky 9 image that boots via Dracut. The image will be booted on a Dell RX760 server with 2x Tesla T4 GPUs. The driver installation completes successfully and the image build finishes, but booting the image with the drivers installed leaves the init process stuck until nm-wait-online-initrd.service times out after the period set in #1850, at which point the boot halts.

Very roughly, installing the drivers seems to interfere with the IP address handover between the DHCP lease the node receives during iPXE and the actual node IP it is supposed to receive during the kernel init process. Removing the drivers and rebuilding produces an image that boots fine. We've confirmed this isn't a network speed issue related to image size: our test VM (albeit without any GPUs passed through) boots the full image successfully.
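
One way to narrow this down would be to check whether any GPU kernel modules ended up inside the regenerated initramfs. The sketch below assumes the listing comes from dracut's `lsinitrd` tool; `filter_gpu_modules` is a hypothetical helper name, and whether GPU modules in the initramfs are related to the timeout at all is an assumption, not something confirmed here.

```shell
# Hypothetical helper: reads a file listing on stdin (for example the
# output of `lsinitrd <initramfs image>`) and prints only the lines that
# name NVIDIA or nouveau kernel modules, compressed or not.
filter_gpu_modules() {
    # `|| true` keeps the exit status 0 when nothing matches.
    grep -E '(nvidia|nouveau)[^/]*\.ko(\.xz|\.zst|\.gz)?$' || true
}
```

Inside the image shell this could be used as `lsinitrd <path to initramfs image> | filter_gpu_modules`. Empty output would suggest the modules are not being loaded from the initramfs.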

I've also installed the drivers manually after booting the image without them preinstalled, and they function fine, which rules out issues with the hardware:
Image

To reproduce:

  1. Pull the standard warewulf Rocky 9 image
  2. Enter the image shell and run:

     dnf -y install https://github.com/warewulf/warewulf/releases/download/v4.6.4/warewulf-dracut-4.6.4-1.el9.noarch.rpm
     dnf -y install dnf-plugins-core epel-release kernel-headers
     dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
     dnf -y module install nvidia-driver:latest-dkms
     dnf -y install datacenter-gpu-manager
     dnf clean all
     for dir in /usr/src/kernels/*; do dkms autoinstall --kernelver "$(basename "$dir")"; done
     dkms status    # reports NVIDIA driver 580 as installed
     exit 0

  3. Run wwctl image exec rockylinux-9 -- /usr/bin/dracut --force --no-hostonly --add wwinit --regenerate-all
  4. Tell the node to boot from dracut: wwctl node set node-name --tagadd IPXEMenuEntry=dracut
  5. Rebuild the overlay with wwctl overlay build node-name and reboot the target node
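
For anyone re-running the reproduction, the DKMS loop from step 2 can be sketched as a small helper that skips non-directories and can be dry-run. This is a sketch only: the function names `list_kernel_versions` and `dkms_autoinstall_all` and the `DKMS_CMD` variable are hypothetical and not part of DKMS or Warewulf. The only real command it invokes is `dkms autoinstall --kernelver`, as used above.

```shell
# Print one kernel version per line for each directory under the given
# kernels directory (defaults to /usr/src/kernels).
list_kernel_versions() {
    kernels_dir="${1:-/usr/src/kernels}"
    for kdir in "$kernels_dir"/*/; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -d "$kdir" ] || continue
        basename "$kdir"
    done
}

# Run `dkms autoinstall` once per kernel version found.
# DKMS_CMD (hypothetical) can be set to `echo` to print the commands
# instead of running them.
dkms_autoinstall_all() {
    list_kernel_versions "$1" | while read -r kver; do
        ${DKMS_CMD:-dkms} autoinstall --kernelver "$kver"
    done
}
```

Running `DKMS_CMD=echo dkms_autoinstall_all` inside the image shell prints the commands that would be executed, which makes it easy to confirm which kernel versions the image actually carries before building.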

Error message

I've attached a screenshot from the console output of our server's iDRAC, which shows the timeout.

Image

Information on your system

wwctl version:

wwctl version: 4.6.4-1
rpc version: apiPrefix:"rc1" apiVersion:"1" warewulfVersion:"4.6.4-1"

/etc/os-release:

NAME="AlmaLinux"
VERSION="9.6 (Sage Margay)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.6"
PLATFORM_ID="platform:el9"
PRETTY_NAME="AlmaLinux 9.6 (Sage Margay)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:9::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-9"
ALMALINUX_MANTISBT_PROJECT_VERSION="9.6"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.6"
SUPPORT_END=2032-06-01

General information

  • I have run wwctl version and reported the contents of /etc/os-release
  • I have searched the issues of this repo and believe this is not a duplicate
  • I have captured and reported relevant error messages and logs
