Endpoints with TolerateUnready annotation should list Pods in state terminating #37093
|
Jenkins kops AWS e2e failed for commit c44e81f3a5e3bd7da9358790efeea8bc3d69168b. Full PR test history. The magic incantation to run this job again is |
|
Jenkins GCI GCE e2e failed for commit c44e81f3a5e3bd7da9358790efeea8bc3d69168b. Full PR test history. The magic incantation to run this job again is |
c44e81f to
d282f47
|
Jenkins GCE e2e failed for commit c44e81f3a5e3bd7da9358790efeea8bc3d69168b. Full PR test history. The magic incantation to run this job again is |
|
Jenkins GCE etcd3 e2e failed for commit c44e81f3a5e3bd7da9358790efeea8bc3d69168b. Full PR test history. The magic incantation to run this job again is |
|
I'm fine doing this, the annotation effectively means: keep all service entries and dns records for the pod around for as long as you possibly can (ignore readiness, ignore deletion grace etc). I don't think it can make 1.5 though, since we're well past the feature freeze date and this isn't a stabilization fix. Maybe we can just fold this behavior into #25283? We need to graduate the |
|
@bprashanth: tbh I just expected that behaviour when I wrote my Anyhow I won't be much around for the next weeks, @mattbates can you have a look at this PR |
|
@bprashanth just following up here. Re @simonswine's comment, can we progress this PR as a fix and merge, cherry-picking into 1.3 and 1.4 too ideally? |
bprashanth left a comment
I'm fine with the pr, it's low risk, I don't know if it's going to make 1.5 because of the timing. It's an alpha feature that we are with high probability going to remodel before beta (#25283), so I don't think we should be cherrypicking it into older releases.
The original annotation was to tolerate "unreadiness". This pr bends the definition of "unreadiness" from "ignore failing readiness probes" to "ignore failing readiness probes AND deletion timestamps". A slightly more correct way to do this would be to "ignore readiness AND deletion timestamps on not ready pods", but given that the feature is going to change soon, I'm not sure the distinction matters.
      continue
  }
- if pod.DeletionTimestamp != nil {
+ if !tolerateUnreadyEndpoints && pod.DeletionTimestamp != nil {
please augment the comment above the annotation definition with:
// Endpoints of Services bearing this annotation retain their DNS
// records and continue receiving traffic for the Service from the moment
// the kubelet starts all containers in the pod and marks it "Running", till the
// kubelet stops all containers and deletes the pod from the apiserver.
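For readers following the thread, the behaviour under review can be sketched roughly as below. This is a simplified illustration, not the controller's actual code: the `pod` struct, the `placement` type, and the helpers `classify`, `toleratesUnready`, and `terminating` are hypothetical stand-ins, and the annotation key is quoted from memory of the controller source of that era, so treat it as an assumption. Only the `!tolerateUnready && DeletionTimestamp != nil` check comes directly from the diff above.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Assumed annotation key; verify against the endpoints controller source.
const tolerateUnreadyAnnotation = "service.alpha.kubernetes.io/tolerate-unready-endpoints"

// pod is a hypothetical, stripped-down stand-in for the pod fields
// that matter to the decision discussed in this PR.
type pod struct {
	IP                string
	Ready             bool
	DeletionTimestamp *time.Time
}

// placement says where a pod ends up in the Endpoints object.
type placement int

const (
	skipped placement = iota
	readyAddress
	notReadyAddress
)

func (p placement) String() string {
	return [...]string{"skipped", "ready address", "not-ready address"}[p]
}

// toleratesUnready reports whether the Service annotations opt in.
// A missing or unparsable value counts as false.
func toleratesUnready(annotations map[string]string) bool {
	b, err := strconv.ParseBool(annotations[tolerateUnreadyAnnotation])
	return err == nil && b
}

// terminating returns a copy of p with a deletion timestamp set.
func terminating(p pod) pod {
	now := time.Now()
	p.DeletionTimestamp = &now
	return p
}

// classify mirrors the check in the diff: a terminating pod is dropped
// only when the Service does not tolerate unready endpoints.
func classify(p pod, tolerateUnready bool) placement {
	if p.IP == "" {
		return skipped
	}
	if !tolerateUnready && p.DeletionTimestamp != nil {
		return skipped
	}
	if tolerateUnready || p.Ready {
		return readyAddress
	}
	return notReadyAddress
}

func main() {
	svc := map[string]string{tolerateUnreadyAnnotation: "true"}
	p := terminating(pod{IP: "10.0.0.1", Ready: false})
	fmt.Println("with annotation:   ", classify(p, toleratesUnready(svc)))
	fmt.Println("without annotation:", classify(p, toleratesUnready(nil)))
}
```

With the annotation set, a pod running its preStop hook stays listed, which is what keeps its DNS record resolvable during shutdown.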
d282f47 to
e68f748
|
Please add a test and then this LGTM. You can add it to test/e2e/services.go under |
e68f748 to
a92a0d1
…n state terminating * Otherwise it prevents long-running tasks in a preStop hook that require DNS resolution from succeeding
a92a0d1 to
b44de1e
|
@smarterclayton thanks for your input on how to prevent test flakes. I've implemented it as you suggested by modifying the annotation of the service and waiting for the test |
|
LGTM thanks |
|
@k8s-bot test this |
|
Automatic merge from submit-queue (batch tested with PRs 39092, 39126, 37380, 37093, 39237) |
|
After reflecting on this a bit more, I think it should be possible for a consumer to request that terminating pods continue to be in the load balancer rotation independent of the annotation. Many applications that can control their shutdown (like one with a very long graceful shutdown period) want to keep taking traffic. I think that's orthogonal to the "ready immediately" setting. It should be possible in shutdown to control when traffic is diverted away; it's not automatically at the very end of termination, nor at the very beginning. I'll spawn an issue, but it may be that tolerate unready becomes a policy (EndpointInclusionPolicy) or a set of orthogonal flags. |
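To make the "set of orthogonal flags" idea concrete, here is one possible shape it could take. Everything in this sketch is hypothetical: `endpointInclusionPolicy`, its two fields, `podState`, and the `includes` method are illustrative names only and do not correspond to any existing or agreed Kubernetes API.

```go
package main

import "fmt"

// endpointInclusionPolicy is a hypothetical policy with two independent
// switches, one for startup and one for shutdown.
type endpointInclusionPolicy struct {
	IncludeUnready     bool // list pods whose readiness probe is not passing
	IncludeTerminating bool // keep listing pods that have a deletion timestamp
}

// podState is a hypothetical summary of the two pod conditions involved.
type podState struct {
	Ready       bool
	Terminating bool
}

// includes reports whether a pod in state s would be published under
// policy p. Each flag only lifts its own restriction.
func (p endpointInclusionPolicy) includes(s podState) bool {
	if s.Terminating && !p.IncludeTerminating {
		return false
	}
	if !s.Ready && !p.IncludeUnready {
		return false
	}
	return true
}

func main() {
	// An application that drains slowly but should not receive traffic
	// before it is ready.
	drainOnly := endpointInclusionPolicy{IncludeTerminating: true}
	for _, s := range []podState{
		{Ready: true, Terminating: false},
		{Ready: true, Terminating: true},
		{Ready: false, Terminating: false},
		{Ready: false, Terminating: true},
	} {
		fmt.Printf("%+v -> included=%v\n", s, drainOnly.includes(s))
	}
}
```

Under this sketch, the annotation as merged in this PR would correspond to setting both flags, while the long-graceful-shutdown case described above would set only `IncludeTerminating`.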
|
@smarterclayton: It sounds like the unready handling should have more states than just true and false before being promoted to a spec field in the Service object. Is there an issue to track this effort? And another thing: do you think this could be cherry-picked into 1.5, or is this not seen as a bugfix as such? If it can, I don't think I can initiate it, as I am not able to add and remove labels |
|
I did not open an issue yet. It's reasonable to backport, tagging. |
|
Removing label |
|
Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
// create a headless Service just for the StatefulSet, and clients shouldn't
// be using this Service for anything so unready endpoints don't matter.
// Endpoints of these Services retain their DNS records and continue
// receiving traffic for the Service from the moment the kubelet starts all
To make self-hosted etcd reliable (an important part of the self-hosted k8s effort), we want the DNS name to be resolvable from the pod initialization phase (init container) onward. The current implementation does not prevent us from doing that. Basically, I hope the DNS name is resolvable from the time the pod gets its IP until the Pod terminates.
What this PR does / why we need it:
We are using preStop lifecycle hooks to gracefully remove a node from a cluster. This hook is potentially long running, and once the preStop hook has fired, DNS resolution for the soon-to-be-stopped Pod fails, which causes the hook itself to fail.
Special notes for your reviewer:
Would be great to backport that to 1.4, 1.3
Release note:
@bprashanth