Conversation

@sjenning (Contributor) commented Nov 13, 2020

xref https://bugzilla.redhat.com/show_bug.cgi?id=1770017

When init containers are GCed by the kubelet (or anything else), the kubelet will re-execute them, even if the main containers are already running. This is a violation of the pod lifecycle state machine.

This PR changes the code that determines whether an init container should be run to first check whether any of the main containers have status. If so, the pod is beyond the init phase of the pod lifecycle, so all init containers must have already run, even if the container runtime no longer has an exited container reflecting that they ran.
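
For reference, a simplified sketch of the resulting check inside findNextInitContainerToRun (pkg/kubelet/kuberuntime/kuberuntime_container.go); the exact diff is discussed in the review comments below:

	for i := range pod.Spec.Containers {
		container := &pod.Spec.Containers[i]
		// podStatus is the runtime-derived kubecontainer.PodStatus, not the
		// v1.ContainerStatus previously reported to the apiserver.
		status := podStatus.FindContainerStatusByName(container.Name)
		if status != nil && status.State == kubecontainer.ContainerStateRunning {
			// A main container is running, so the pod is past the init phase;
			// report the init containers as done rather than re-running them.
			return nil, nil, true
		}
	}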

@derekwaynecarr @rphillips @joelsmith

/sig node

Release note: None

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 13, 2020
@sjenning (Contributor, Author):

/kind bug
/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 13, 2020
@derekwaynecarr (Member):

/assign

@derekwaynecarr (Member):

/milestone v1.20

@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 13, 2020
Member:

I need to work backwards to see if this has an impact on pod restarts.

Can we add a node e2e test that reproduces this?

Contributor:

We have always allowed init containers to be re-run. A key point is that we should not re-run them while the main containers are running. We must re-run them on reboot.

Contributor (author):

We could add this to only consider main containers if they are running:

diff --git a/pkg/kubelet/kuberuntime/kuberuntime_container.go b/pkg/kubelet/kuberuntime/kuberuntime_container.go
index cb1ad6b306c..48bd0a9ddb2 100644
--- a/pkg/kubelet/kuberuntime/kuberuntime_container.go
+++ b/pkg/kubelet/kuberuntime/kuberuntime_container.go
@@ -754,6 +754,9 @@ func findNextInitContainerToRun(pod *v1.Pod, podStatus *kubecontainer.PodStatus)
                if status == nil {
                        continue
                }
+               if status.State != kubecontainer.ContainerStateRunning {
+                       continue
+               }
                return nil, nil, true
        }

Contributor:

If the pod sandbox is lost, we guarantee that we re-run init containers today (IIRC).

Contributor (author):

@derekwaynecarr also, it would seem that we already don't adhere to our own documented behavior: #88886

Contributor (author):

> Can we add a node e2e test that reproduces this?

Note to self: the test to verify this case would (1) start a pod with an init container, (2) while the main container is running, call the runtime's delete-container operation on the init container, and (3) watch to make sure the init container does not run again.

I'm not currently sure how to do that, or whether node e2e tests have runtime-level access.
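
For illustration, a rough sketch (untested) of how such a node e2e test could be structured. The removeInitContainerViaCRI helper is hypothetical; the open question above is exactly how to get that runtime-level access (for example through the CRI RuntimeService client). Image names and timings are placeholders.

package e2enode

import (
	"context"
	"time"

	"github.com/onsi/ginkgo"
	"github.com/onsi/gomega"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/test/e2e/framework"
)

var _ = ginkgo.Describe("init container GC", func() {
	f := framework.NewDefaultFramework("init-container-gc")

	ginkgo.It("should not re-run init containers removed from the runtime while the main container is running", func() {
		pod := f.PodClient().CreateSync(&v1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: "init-gc-test"},
			Spec: v1.PodSpec{
				InitContainers: []v1.Container{{Name: "init", Image: "busybox", Command: []string{"true"}}},
				Containers:     []v1.Container{{Name: "main", Image: "busybox", Command: []string{"sleep", "3600"}}},
			},
		})

		// Simulate kubelet container GC by removing the exited init container
		// at the runtime level while the main container is still running.
		removeInitContainerViaCRI(pod, "init")

		// The init container should stay terminated and never be observed
		// running again for the duration of the check.
		gomega.Consistently(func() bool {
			p, err := f.PodClient().Get(context.TODO(), pod.Name, metav1.GetOptions{})
			framework.ExpectNoError(err)
			return p.Status.InitContainerStatuses[0].State.Running == nil
		}, 2*time.Minute, 10*time.Second).Should(gomega.BeTrue())
	})
})

// removeInitContainerViaCRI is a hypothetical placeholder: it would list the
// pod's containers through the CRI RuntimeService, find the exited init
// container, and remove it, mimicking kubelet container GC.
func removeInitContainerViaCRI(pod *v1.Pod, containerName string) {}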

Member:

@sjenning the new logic makes sense to me: check primary container status before checking the status of any init container.

@derekwaynecarr (Member):

We should also double-check that kubelet-driven image GC prefers not to GC containers whose pods are still active. We can't guarantee it won't happen or won't need to happen, but we should sanity-check that it's not happening more often than needed.

@dims (Member) commented Nov 14, 2020:

/test pull-kubernetes-bazel-test

@sjenning (Contributor, Author):

I pushed a change to limit the check to only main containers that have status and are also Running.

Still need to check the sandbox (re)creation path. There is some code here that does it that I need to think through:

	// Get the containers to start, excluding the ones that succeeded if RestartPolicy is OnFailure.
	var containersToStart []int
	for idx, c := range pod.Spec.Containers {
		if pod.Spec.RestartPolicy == v1.RestartPolicyOnFailure && containerSucceeded(&c, podStatus) {
			continue
		}
		containersToStart = append(containersToStart, idx)
	}
	// We should not create a sandbox for a Pod if initialization is done and there is no container to start.
	if len(containersToStart) == 0 {
		_, _, done := findNextInitContainerToRun(pod, podStatus)
		if done {
			changes.CreateSandbox = false
			return changes
		}
	}
	if len(pod.Spec.InitContainers) != 0 {
		// Pod has init containers, return the first one.
		changes.NextInitContainerToStart = &pod.Spec.InitContainers[0]
		return changes
	}
	changes.ContainersToStart = containersToStart
	return changes

Member:

I might be very much off here.

This logic was added recently (#92614):

	// We should not create a sandbox for a Pod if initialization is done and there is no container to start.
	if len(containersToStart) == 0 {
		_, _, done := findNextInitContainerToRun(pod, podStatus)
		if done {
			changes.CreateSandbox = false
			return changes
		}
	}

Looking at when createPodSandbox can be true, I wonder if one of the main containers may have a running state at this point:

	klog.V(2).Infof("Multiple sandboxes are ready for Pod %q. Need to reconcile them", format.Pod(pod))

If so, this PR will break the logic, as the sandbox will not be created.

Member (comment on lines 747 to 750):

Some text correction advice:

Suggested change, from:

// If any of the main containers have status, then all init containers must
// after executed at some point in the past. However, they could be removed
// from the container runtime now, and if we proceed, it would appear as if they
// never ran and will re-execute improperly.

to:

// If any of the main containers have status, then all init containers must
// have been executed at some point in the past. However, they could have been removed
// from the container runtime by now by GC, and if we proceed, it would appear as if they
// never ran and will re-execute improperly.

@sayanchowdhury (Member):

Hi 👋🏽 I'm from the Bug Triage team. We've crossed the Code Freeze for the 1.20 release on 12th November. As this PR is tagged with 1.20, I'm sending a final reminder to either move the milestone to 1.21 or clear it.

@joelsmith (Contributor):

/retest

@jeremyrickard (Contributor):

/milestone v1.21

We've passed the Test Freeze and this issue, while good to fix, isn't release-blocking. It can be cherry-picked back into 1.20.z going forward, so I've bumped it to the next release milestone.

Thanks everyone 👋

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.20, v1.21 Nov 24, 2020
@rphillips (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2020
@rphillips (Member):

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2020
@derekwaynecarr (Member):

/retest
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 3, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

	for i := range pod.Spec.Containers {
		container := &pod.Spec.Containers[i]
		status := podStatus.FindContainerStatusByName(container.Name)
		if status != nil && status.State == kubecontainer.ContainerStateRunning {
Contributor (author):

@SergeyKanzhelev I modified the check here to require both that a container has status and that it is Running. Does this address your concern?

@fejta-bot:

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@rata (Member) left a comment:

@sjenning @derekwaynecarr is this the PR you mentioned today at the SIG Node meeting? I was trying to guess so I could have a look too :)

	// never ran and will re-execute improperly.
	for i := range pod.Spec.Containers {
		container := &pod.Spec.Containers[i]
		status := podStatus.FindContainerStatusByName(container.Name)
@rata (Member), Dec 8, 2020:

@sjenning If instead of FindContainerStatusByName() we use podutil.GetContainerStatus() (from k8s.io/kubernetes/pkg/api/v1/pod), we get a type with more information: https://pkg.go.dev/k8s.io/api/core/v1#ContainerStatus. It seems confusing but it's not really the same thing; we get it from pod.Status.ContainerStatuses or pod.Status.InitContainerStatuses.

So, if this loop iterated over pod.Spec.InitContainers instead of pod.Spec.Containers, couldn't we get the status showing that an init container already ran? Or is that lost too for some reason in the bug you are chasing, @sjenning?

IIUC this might be the case (untested, though), and it seems more robust to just check "have we already run this?" and skip it if so. Otherwise, if the GC happens while the initContainers haven't finished, we might run some initContainers twice with this approach AFAIK. Also, this check (if it works as I hope :D) should work fine with the concern @derekwaynecarr had about pod restart reasons.

Am I missing something? Probably I am :)

Contributor (author):

Yes, it is confusing.

In this code path, we are acting on the authoritative container status from the runtime, i.e. kubecontainer.Status, not what the kubelet has previously reported to the apiserver, i.e. v1.ContainerStatus, which is what GetContainerStatus() would give you.

Additionally, there are cases in which we do want the init containers to re-run, so just because they have run in the past doesn't mean they shouldn't run again. This is what Derek was saying here: #96572 (comment)

That is why I expanded the main container check to not only check for status but also check that the main container is Running, since, in all situations where the init containers need to be run again, the main containers are not running.
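
To make the distinction concrete, a small illustrative sketch (not part of this PR; the helper names mainContainerRunning and initContainerFinishedPerAPI are made up here, while FindContainerStatusByName and podutil.GetContainerStatus are the calls discussed above):

package example

import (
	v1 "k8s.io/api/core/v1"
	podutil "k8s.io/kubernetes/pkg/api/v1/pod"
	kubecontainer "k8s.io/kubernetes/pkg/kubelet/container"
)

// mainContainerRunning consults the authoritative runtime-derived status used
// in this code path; a nil result means the runtime has no record of the
// container at all (for example, because it was GCed).
func mainContainerRunning(podStatus *kubecontainer.PodStatus, name string) bool {
	s := podStatus.FindContainerStatusByName(name)
	return s != nil && s.State == kubecontainer.ContainerStateRunning
}

// initContainerFinishedPerAPI consults what the kubelet last reported to the
// apiserver (v1.ContainerStatus); this can still show a terminated init
// container even after the runtime has GCed it, which is why it is not used here.
func initContainerFinishedPerAPI(pod *v1.Pod, name string) bool {
	status, ok := podutil.GetContainerStatus(pod.Status.InitContainerStatuses, name)
	return ok && status.State.Terminated != nil && status.State.Terminated.ExitCode == 0
}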

Member:

Ohh, I didn't know we want to re-run initContainers in such a case, sorry! Then this makes total sense, that is what I was missing :)
