Change Request: crio should not block pod termination when the runtime class of the pod no longer exists

Hi CRI-O Team,

We deploy the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) as a daemonset, which overrides the CRI-O config TOML and adds a new runtime handler (defined as the `nvidia` runtime). Once the nvidia-container-toolkit is deployed, we spin up other components that use the newly declared runtime class.

Our problems arise when these components are torn down or undeployed. As part of its termination routine, our `nvidia-container-toolkit` reverts all previously applied CRI-O config overrides, including the runtime handler declarations. This introduces a race condition: if the toolkit reverts the config before the other components have finished terminating, the pods that use the `nvidia` runtime get stuck indefinitely in the Terminating state due to `KillPodSandboxErrors`. These `KillPodSandboxErrors` occur because the `nvidia` runtime that the pods reference no longer exists.
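To make the request concrete, here is a minimal Python sketch of the behavior we see and the behavior we are asking for. It is a toy model only, not CRI-O code: the function names, the `sandbox` dictionary, and the idea of falling back to a default handler named `runc` are all assumptions made for illustration. Whether a fallback is actually the right fix is for the maintainers to judge.

```python
class UnknownRuntimeHandler(Exception):
    """Raised when a sandbox references a handler that is no longer configured."""


def stop_sandbox_strict(handlers, sandbox):
    # Hypothetical model of what we observe today: a missing handler fails
    # the stop request, so the pod never leaves the Terminating state.
    name = sandbox["runtime_handler"]
    if name not in handlers:
        raise UnknownRuntimeHandler(name)
    return name


def stop_sandbox_tolerant(handlers, sandbox, default="runc"):
    # Hypothetical model of the requested behavior: if the handler is gone,
    # tear the sandbox down with the default handler instead of failing.
    name = sandbox["runtime_handler"]
    return name if name in handlers else default


# The toolkit registers the `nvidia` handler, a pod starts using it,
# and then the toolkit reverts its config before the pod is stopped.
handlers = {"runc", "nvidia"}
sandbox = {"id": "example-pod", "runtime_handler": "nvidia"}
handlers.discard("nvidia")

try:
    stop_sandbox_strict(handlers, sandbox)
    strict_outcome = "stopped"
except UnknownRuntimeHandler:
    strict_outcome = "stuck"

tolerant_outcome = stop_sandbox_tolerant(handlers, sandbox)
```

In this model the strict path leaves the pod stuck, while the tolerant path completes teardown with the default handler.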

I've attached a screenshot of this error for your reference. Please let me know if you need more details or if anything in this issue description is unclear. Thank you!


<img width="1591" height="526" alt="Image" src="https://github.com/user-attachments/assets/06060535-3cb2-4a23-a4a8-5de5032be0c4" />

Change Request: crio should not block pod termination when the runtime class of the pod no longer exists #9521
