Hi CRI-O Team,
We deploy the nvidia-container-toolkit as a DaemonSet, which overrides the CRI-O config TOML and adds a new runtime handler (defined as the nvidia runtime). Once the nvidia-container-toolkit is deployed, we spin up other components that use the newly declared runtime classes.
Our problems arise when these components are torn down/undeployed. As part of its termination routine, our nvidia-container-toolkit reverts all previously applied CRI-O config overrides (including the runtime handler declarations). We have found that this introduces a race condition: the other components that use the nvidia runtime get stuck indefinitely in the Terminating state with KillPodSandboxErrors, because the nvidia runtime handler they reference no longer exists.
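To make the ordering problem concrete, here is a minimal sketch in Python of the check we would want to hold before the handler is removed. This is not our actual toolkit code: the `Pod` type, its fields, and both function names are hypothetical and exist only for illustration. The idea is that the config revert is only safe once no pod sandbox still depends on the nvidia handler.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    # Hypothetical model of a pod sandbox, for illustration only.
    name: str
    runtime_handler: str   # handler the sandbox was created with, e.g. "nvidia"
    sandbox_removed: bool  # True once the sandbox has been fully torn down


def pods_blocking_handler_removal(pods: List[Pod], handler: str) -> List[str]:
    """Return names of pods whose sandboxes still need `handler` to be stopped."""
    return [
        p.name
        for p in pods
        if p.runtime_handler == handler and not p.sandbox_removed
    ]


def safe_to_revert_config(pods: List[Pod], handler: str = "nvidia") -> bool:
    """The handler declaration can be reverted only when nothing depends on it."""
    return not pods_blocking_handler_removal(pods, handler)
```

In our case the revert currently runs without any such check, so a pod in the position of the blocking one above is left with a sandbox that references a handler CRI-O no longer knows about.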
I've attached a screenshot of this error for your reference. Please let me know if you need more details or if anything in this issue description is unclear. Thank you!
