kubelet: do not rerun init containers if any main containers have status #96572
Conversation
/kind bug
/assign
/milestone v1.20
I need to work backwards to see if this has an impact on pod restarts.
Can we add a node e2e test that reproduces this?
We need to not regress the following:
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#pod-restart-reasons
We have always allowed init containers to be rerun. The key point is that we should not rerun them while the main containers are running. We must rerun them on reboot.
Could add this to only consider main containers if they are running
diff --git a/pkg/kubelet/kuberuntime/kuberuntime_container.go b/pkg/kubelet/kuberuntime/kuberuntime_container.go
index cb1ad6b306c..48bd0a9ddb2 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_container.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_container.go
@@ -754,6 +754,9 @@ func findNextInitContainerToRun(pod *v1.Pod, podStatus *kubecontainer.PodStatus)
if status == nil {
continue
}
+ if status.State != kubecontainer.ContainerStateRunning {
+ continue
+ }
return nil, nil, true
}
If the pod sandbox is lost, we guarantee that we rerun init containers today (IIRC).
@derekwaynecarr also, it would seem that we already don't adhere to our own documented behavior: #88886
Can we add a node e2e test that reproduces this?
Note to self: the test to verify this case would 1) start a pod with an init container, 2) while the main container is running, delete the exited init container at the runtime level, and 3) watch to make sure the init container does not run again.
Not currently sure how to do that, or whether node e2e tests have runtime-level access.
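For illustration, a rough sketch of what such a test could look like outside the e2e framework: it drives the pod through client-go and shells out to crictl for the runtime-level deletion. The pod spec, names, sleep-based waits, and the crictl approach are assumptions for the sketch, not how an eventual node e2e test would necessarily be written.

// Sketch only: not an actual e2e_node test. Assumes client-go is available and
// crictl is installed on the node; pod name, image, and timeouts are made up.
package repro

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ReproduceInitRerun 1) starts a pod with an init container, 2) deletes the exited
// init container at the runtime level while the main container runs, and 3) checks
// that the init container is not executed again.
func ReproduceInitRerun(ctx context.Context, cs kubernetes.Interface) error {
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "init-gc-repro", Namespace: "default"},
		Spec: v1.PodSpec{
			InitContainers: []v1.Container{{Name: "init", Image: "busybox", Command: []string{"sh", "-c", "true"}}},
			Containers:     []v1.Container{{Name: "main", Image: "busybox", Command: []string{"sh", "-c", "sleep 3600"}}},
		},
	}
	if _, err := cs.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		return err
	}
	if err := waitMainRunning(ctx, cs, pod.Namespace, pod.Name); err != nil {
		return err
	}

	// Runtime-level deletion of the exited init container. Shelling out to crictl is
	// the open question above; the name filter here is deliberately simplistic.
	out, err := exec.Command("crictl", "ps", "-a", "--name", "init", "-q").Output()
	if err != nil {
		return err
	}
	for _, id := range strings.Fields(string(out)) {
		if err := exec.Command("crictl", "rm", id).Run(); err != nil {
			return err
		}
	}

	// Give the kubelet a few sync loops, then verify the init container did not re-run.
	time.Sleep(30 * time.Second)
	got, err := cs.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, s := range got.Status.InitContainerStatuses {
		if s.RestartCount > 0 || s.State.Terminated == nil {
			return fmt.Errorf("init container %q appears to have re-run: %+v", s.Name, s.State)
		}
	}
	return nil
}

func waitMainRunning(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	for i := 0; i < 60; i++ {
		p, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		for _, s := range p.Status.ContainerStatuses {
			if s.Name == "main" && s.State.Running != nil {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("main container never started running")
}

On a kubelet without the fix, the final check would fail because the deleted init container gets re-executed; with the fix it should pass.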
@sjenning the new logic makes sense to me: check primary container status before checking the status of any init container.
We should also double-check whether kubelet-driven GC prefers not to GC containers whose pods are still active. We can't guarantee it won't happen or won't need to happen, but we should sanity-check that it isn't happening more often than needed.
/test pull-kubernetes-bazel-test
Force-pushed from b8a5c7a to dba23c0
I pushed a change to limit the check to only main containers that have status and are also Running. Still need to check the sandbox (re)creation path. There is some code there that does it that I need to think through: kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go, lines 523 to 546 in 3b2746c.
I might be very much off here.
This logic was added recently (#92614):
kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go, lines 531 to 538 in 6715318:
// We should not create a sandbox for a Pod if initialization is done and there is no container to start.
if len(containersToStart) == 0 {
	_, _, done := findNextInitContainerToRun(pod, podStatus)
	if done {
		changes.CreateSandbox = false
		return changes
	}
}
Looking at when createPodSandbox can be true, I wonder if one of the main containers may have a running state (the path that logs klog.V(2).Infof("Multiple sandboxes are ready for Pod %q. Need to reconcile them", format.Pod(pod))).
Some text correction advice. The current comment reads:

// If any of the main containers have status, then all init containers must
// after executed at some point in the past. However, they could be removed
// from the container runtime now, and if we proceed, it would appear as if they
// never ran and will re-execute improperly.

Suggested:

// If any of the main containers have status, then all init containers must
// have been executed at some point in the past. However, they could have been removed
// from the container runtime by now by GC, and if we proceed, it would appear as if they
// never ran and will re-execute improperly.
Hi 👋🏽 I'm from the Bug Triage team. We've crossed Code Freeze for the 1.20 release on 12th November. As this PR is tagged with the 1.20 milestone, I'm sending a final reminder to either move the milestone to 1.21 or clear the milestone.
/retest
/milestone v1.21
We've passed the Test Freeze and this issue, while good to fix, isn't release blocking. It can be cherry-picked back into 1.20.z going forward, so I've bumped it to the next release milestone. Thanks everyone 👋
/lgtm
/lgtm cancel
Force-pushed from dba23c0 to d0f859a
Force-pushed from d0f859a to c8d02f7
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: derekwaynecarr, sjenning
The full list of commands accepted by this bot can be found here. The pull request process is described here.
for i := range pod.Spec.Containers {
	container := &pod.Spec.Containers[i]
	status := podStatus.FindContainerStatusByName(container.Name)
	if status != nil && status.State == kubecontainer.ContainerStateRunning {
@SergeyKanzhelev I modified the check here for both container status and that a container is Running. Does this address your concern?
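For reference, piecing together the suggested comment text, the loop quoted above, and the return from the earlier diff, the guard at the top of findNextInitContainerToRun presumably ends up reading roughly like this (a reconstruction, not the exact merged code):

// If any of the main containers have status and are Running, then all init containers
// must have been executed at some point in the past. However, they could have been
// removed from the container runtime by now, and if we proceed, it would appear as if
// they never ran and would re-execute improperly.
for i := range pod.Spec.Containers {
	container := &pod.Spec.Containers[i]
	status := podStatus.FindContainerStatusByName(container.Name)
	if status != nil && status.State == kubecontainer.ContainerStateRunning {
		// Initialization is already complete; report done and start no init container.
		return nil, nil, true
	}
}

With the Running requirement, the cases where init containers legitimately must re-run (sandbox loss, node reboot) still behave as before, since no main container is running in those situations.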
/retest
@sjenning @derekwaynecarr is this the PR you mentioned today at the SIG Node meeting? I was trying to guess, so I could have a look too :)
// never ran and will re-execute improperly.
for i := range pod.Spec.Containers {
	container := &pod.Spec.Containers[i]
	status := podStatus.FindContainerStatusByName(container.Name)
@sjenning If instead of FindContainerStatusByName() we use podutil.GetContainerStatus() (from k8s.io/kubernetes/pkg/api/v1/pod), we get a type with more information: https://pkg.go.dev/k8s.io/api/core/v1#ContainerStatus. It seems confusing, but it is not really the same thing; we get it from pod.Status.ContainerStatuses or pod.Status.InitContainerStatuses.
So, if this loop iterated over pod.Spec.InitContainers instead of pod.Spec.Containers, couldn't we get the status showing that each init container was already run? Or is that lost for some reason too in the bug you are chasing, @sjenning?
IIUC this might be the case (untested, though), and it seems more robust to just ask "have we run this already?" and only run it if not. Otherwise, if GC happens while the init containers haven't finished yet, we might run some init containers twice with this approach, AFAIK. Also, this check (if it works as I hope :D) should work fine with the concern @derekwaynecarr had about pod restart reasons.
Am I missing something? Probably I am :)
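For illustration, a rough and untested sketch of the alternative being described here, keying off the API-reported init container statuses; the helper name initDone, the package name, and its placement are made up:

// Sketch of the suggested alternative: decide "init is done" from the statuses the
// kubelet reported to the apiserver, rather than from the runtime's view.
package example

import (
	v1 "k8s.io/api/core/v1"
	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
)

// initDone returns true only if every init container has a reported status that
// terminated with exit code 0; a missing status means that init container has not run.
func initDone(pod *v1.Pod) bool {
	for i := range pod.Spec.InitContainers {
		name := pod.Spec.InitContainers[i].Name
		status, ok := podutil.GetContainerStatus(pod.Status.InitContainerStatuses, name)
		if !ok {
			return false
		}
		if status.State.Terminated == nil || status.State.Terminated.ExitCode != 0 {
			return false
		}
	}
	return true
}

Whether this previously reported view is authoritative enough for the kubelet's sync path is exactly what the reply below addresses.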
Yes, it is confusing.
In this code path, we are acting on the authoritative container status from the runtime, i.e. kubecontainer.Status, not what the kubelet has previously reported to the apiserver, i.e. v1.ContainerStatus, which is what GetContainerStatus() would give you.
Additionally, there are cases in which we do want the init containers to re-run, so just because they have run in the past doesn't mean they shouldn't run again. This is what Derek was saying here: #96572 (comment)
That is why I expanded the main container check to not only check for status but also check that the main container is Running, since, in all situations where the init containers need to be run again, the main containers are not running.
Ohh, I didn't know we want to re-run initContainers in such a case, sorry! Then this makes total sense, that is what I was missing :)
xref https://bugzilla.redhat.com/show_bug.cgi?id=1770017
When init containers are GCed by the kubelet (or anything), the kubelet will re-execute them, even if the main containers are already running. This is a violation of the pod lifecycle state machine.
This PR changes the code that determines if an init container should be run to first check if any of the main containers have status. If so, the pod is beyond the init container phase of the pod lifecycle and thus all init containers will have already run, even if the container runtime no longer has an exited container reflecting that the init container(s) ran.
@derekwaynecarr @rphillips @joelsmith
/sig node