Change Request: crio should not block pod termination when the runtime class of the pod no longer exists #9521

@tariq1890

Hi CRI-O Team,

We deploy the nvidia-container-toolkit as a DaemonSet that overrides the CRI-O config TOML and adds a new runtime handler (defined as the nvidia runtime). Once the nvidia-container-toolkit is deployed, we spin up other components that use the newly declared runtime classes.

Our problems arise during the teardown (undeployment) of these components. As part of its termination routine, the nvidia-container-toolkit reverts all previously applied CRI-O config overrides, including the runtime handler declarations. We have found that this introduces a race condition: other components that use the nvidia runtime get stuck indefinitely in the Terminating state due to KillPodSandboxErrors. These errors occur because the nvidia runtime they reference no longer exists.

I've attached a screenshot of this error for your reference. Please let me know if you need more details or if anything in this issue description is unclear. Thank you!

[Image: screenshot of the error]

Labels: kind/feature (categorizes issue or PR as related to a new feature)
