
CRI-O should be smarter about context timeouts #4221

@haircommander

Description

On nodes under load, we run into issues where CreateContainer and RunPodSandbox requests time out. The most common causes are i/o load, network creation taking a long time, and SELinux relabelling of volumes taking a while.

Right now, CreateContainer and RunPodSandbox run to completion, but check at the end whether the request timed out (sketched after the list below). If so, we remove the resource we just created and let the kubelet try again. We do this because the kubelet works as follows:

1. The kubelet makes a request.
2. CRI-O handles the request, reserving the resource name the kubelet asked for.
3. The kubelet's request times out, and it begins polling CRI-O, repeatedly asking it to create the same resource it first requested.
    - Note: it does not bump the attempt number, as that would cause CRI-O to try to create multiple copies of the same container simultaneously, further adding load to the already overloaded node.
4. Eventually, CRI-O learns the kubelet timed out, and cleans up after itself.
    - Note: we clean up here so the next request will pass. If we didn't, the kubelet would say "give me the same container" and CRI-O would say "I already created that container!", even though the kubelet's original request failed.
5. The kubelet finally asks for the same resource again, and eventually, assuming the node's load drops, CRI-O succeeds in creating the container and returns it to the kubelet.
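
To make the current pattern concrete, here is a minimal Go sketch of the run-to-completion-then-check behavior. All type and helper names (server, createSandbox, removeSandbox, and so on) are hypothetical stand-ins for illustration, not CRI-O's actual code.

```go
package sketch

import "context"

// Hypothetical stand-ins for CRI-O's real types; names are illustrative only.
type sandbox struct{ id string }
type runPodSandboxRequest struct{ name string }
type runPodSandboxResponse struct{ podSandboxID string }

type server struct{}

func (s *server) createSandbox(ctx context.Context, req *runPodSandboxRequest) (*sandbox, error) {
	// ... name reservation, network setup, SELinux relabelling, etc.
	// Under load this can outlive the kubelet's deadline.
	return &sandbox{id: req.name}, nil
}

func (s *server) removeSandbox(sb *sandbox) {
	// ... tear the sandbox back down ...
}

// RunPodSandbox mirrors today's behavior: run to completion, then notice the
// kubelet's context already expired, clean up, and return an error.
func (s *server) RunPodSandbox(ctx context.Context, req *runPodSandboxRequest) (*runPodSandboxResponse, error) {
	sb, err := s.createSandbox(ctx, req)
	if err != nil {
		return nil, err
	}
	if ctx.Err() == context.DeadlineExceeded {
		// The kubelet already gave up; remove the sandbox so its next poll
		// doesn't fail with "name is reserved", then report the timeout.
		s.removeSandbox(sb)
		return nil, ctx.Err()
	}
	return &runPodSandboxResponse{podSandboxID: sb.id}, nil
}
```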

Notice that the kubelet is asking for the same resource the whole time (name, attempt, sandbox, etc.), and yet CRI-O can end up creating it an unbounded number of times (if it keeps timing out on the request that is actually handling the creation), all while returning "name is reserved" errors to the polling requests.

we should do better

Instead of removing the resource upon finding out we timed out, CRI-O should put that newly created resource into a cache and report an error to the kubelet. When the kubelet eventually re-requests that resource (as it would have been doing during the timeout), we should pull that resource from the cache and return it, without creating it a second time.
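
Roughly, the tail of the handler from the sketch above would change like this. ResourceCache is a hypothetical type sketched in more detail further down, and the helper name here is again illustrative.

```go
// Revised tail of the RunPodSandbox sketch above: on timeout, stash the
// fully-created sandbox instead of removing it, but still report the error.
func (s *server) runPodSandboxTail(ctx context.Context, req *runPodSandboxRequest, sb *sandbox, cache *ResourceCache) (*runPodSandboxResponse, error) {
	if ctx.Err() == context.DeadlineExceeded {
		// The kubelet has already given up on this request; remember the
		// sandbox so its retry can claim it, rather than tearing it down.
		cache.Put(req.name, sb.id, req)
		return nil, ctx.Err()
	}
	return &runPodSandboxResponse{podSandboxID: sb.id}, nil
}
```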

There are a couple of gotchas we need to handle:

- What if the kubelet bumps the attempt number?
    - If the kubelet asks to create attempt 0, and then the api-server tells it to create a new version, CRI-O should abandon attempt 0.
- What if the kubelet gives up on that container?
    - CRI-O should abandon the old container, as it's no longer needed.
- What if CRI-O is restarted before the container is correctly returned?
    - CRI-O should be prepared to clean up that container if the kubelet doesn't ask for it.

So here's the proposed solution:
At the end of a CreateContainer or RunPodSandbox request, before marking that newly created resource as successfully created, CRI-O should check whether the request timed out. If it did, it should save the resource in a ResourceCache. There will be one ResourceCache for pods and one for containers. CRI-O should not report these resources as created (in a ListPodSandbox or ListContainers response), as the kubelet doesn't think that resource has been created yet.
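
Here is a rough idea of what such a cache could look like; all names are placeholders, not an actual CRI-O API.

```go
package sketch

import (
	"sync"
	"time"
)

// CachedResource holds a sandbox or container that was fully created, but
// whose RunPodSandbox/CreateContainer request had already timed out, so the
// kubelet never received it.
type CachedResource struct {
	ID        string
	Request   interface{} // the original CRI request, kept for deep comparison
	CreatedAt time.Time
}

// ResourceCache is a minimal sketch; one instance would exist for pod
// sandboxes and one for containers. Entries are deliberately not surfaced by
// ListPodSandbox/ListContainers, since the kubelet doesn't know about them.
type ResourceCache struct {
	mu    sync.Mutex
	items map[string]*CachedResource // keyed by the kubelet-chosen name
}

func NewResourceCache() *ResourceCache {
	return &ResourceCache{items: map[string]*CachedResource{}}
}

// Put stashes a resource whose creating request timed out after the work
// actually finished.
func (c *ResourceCache) Put(name, id string, req interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[name] = &CachedResource{ID: id, Request: req, CreatedAt: time.Now()}
}

// Get looks up a cached resource by name.
func (c *ResourceCache) Get(name string) (*CachedResource, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	r, ok := c.items[name]
	return r, ok
}

// Delete drops an entry, either because it was handed back to the kubelet or
// because it was abandoned.
func (c *ResourceCache) Delete(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, name)
}
```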

When the kubelet asks for the resource again, CRI-O should verify the request is the same (a deep comparison against the original Request object), and if so, mark the cached resource as created, move it into CRI-O's state, and remove it from the cache.
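
On the retried request, the handler would consult the cache before reserving the name again. A sketch continuing the hypothetical types above; addToState, removeContainer, and the request type are illustrative, and reflect.DeepEqual stands in for whatever request comparison we settle on.

```go
package sketch

import (
	"context"
	"reflect"
)

// Hypothetical request/response types for illustration.
type createContainerRequest struct {
	Name    string
	Attempt uint32
	// ... rest of the CRI request ...
}
type createContainerResponse struct{ ContainerID string }

type criServer struct {
	containerCache *ResourceCache
}

func (s *criServer) CreateContainer(ctx context.Context, req *createContainerRequest) (*createContainerResponse, error) {
	if cached, ok := s.containerCache.Get(req.Name); ok {
		if reflect.DeepEqual(cached.Request, req) {
			// Same request as the one that timed out: promote the cached
			// container into CRI-O's state and return it instead of
			// creating it a second time.
			s.addToState(cached)
			s.containerCache.Delete(req.Name)
			return &createContainerResponse{ContainerID: cached.ID}, nil
		}
		// Different request (for example, the attempt number was bumped):
		// the cached container is no longer wanted, so abandon it.
		s.removeContainer(cached.ID)
		s.containerCache.Delete(req.Name)
	}
	// ... fall through to the normal creation path ...
	return s.createContainerFresh(ctx, req)
}

func (s *criServer) addToState(r *CachedResource) { /* add to the server's state */ }
func (s *criServer) removeContainer(id string)    { /* delete the container */ }
func (s *criServer) createContainerFresh(ctx context.Context, req *createContainerRequest) (*createContainerResponse, error) {
	// ... the normal creation path, including the Put on timeout ...
	return &createContainerResponse{ContainerID: "placeholder-id"}, nil
}
```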

If the resource is not asked for within a certain amount of time (a couple of minutes?), CRI-O should clean up the resource and remove it from the cache.
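
A sketch of that expiry path, continuing the hypothetical ResourceCache above; the grace period and sweep cadence are placeholders.

```go
// ExpireStale removes entries older than maxAge and invokes cleanup on each,
// so a sandbox or container the kubelet never came back for is eventually
// torn down rather than leaked.
func (c *ResourceCache) ExpireStale(maxAge time.Duration, cleanup func(id string)) {
	c.mu.Lock()
	var expired []*CachedResource
	for name, r := range c.items {
		if time.Since(r.CreatedAt) > maxAge {
			expired = append(expired, r)
			delete(c.items, name)
		}
	}
	c.mu.Unlock()
	// Tear the underlying resources down outside the lock.
	for _, r := range expired {
		cleanup(r.ID)
	}
}
```

A background goroutine could call ExpireStale on both caches periodically, passing in the function that tears down the underlying sandbox or container.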

This issue is for discussing possible other gotchas and failure modes. This will be a pretty substantial change to CRI-O's state machine, so we need to make sure there are no cases where we confuse the kubelet.
