Deflake tests that need to grab metrics from controller-manager or scheduler #101960

knight42 · 2021-05-13T04:52:45Z

What type of PR is this?

/kind cleanup
/kind flake

What this PR does / why we need it:

According to the logs in this test https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/101960/pull-kubernetes-e2e-gce-ubuntu-containerd/1392808375145730048/:

I0513 12:14:41.521] I0513 12:14:37.277313   84013 metrics_proxy.go:150] [DEBUG] metrics-proxy nginx config: 
I0513 12:14:41.521] server {
I0513 12:14:41.521] 	listen 10257;
I0513 12:14:41.521] 	server_name _;
I0513 12:14:41.522] 	proxy_set_header Authorization "Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6InZDZEkxeTZLbjNKUGswNk5QeWdwa3ROT0p3M244NnFfRjBza3lpQWdxVncifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJtZXRyaWNzLXByb3h5LXRva2VuLWd0cDZoIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6Im1ldHJpY3MtcHJveHkiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiIyZmE0NDA5Yy0zMGE2LTRlMWQtODJiMS1jMzhlMTFiZjM2MjUiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06bWV0cmljcy1wcm94eSJ9.D28Lvn8lmqHH21r_vfWJMFSp0uRXDxhX0m-6mNA7RdX-2S202kL1GqU7RUIV7F3uqPJB_8de6yzLqDu3kaHmpNB1FN68LQcgTCsV5A-V8axwXWCwlaCzb-40ig0UvasyaxOx4wg2TMWK3HxnViSEDD3B2t0KOefbHugeexImbYdsmUJkp_6Nctj2DSK3EH67NwmONYR61Xphnyh9a_CCP3rI6qvmgGD3pbtCaGuiupoBcHW1QniwkM4SlvOoov_g5Sx8l8vLWfaFbLhjC71tRGmH2DEyoXjjvJxe8j4An7bHk5ByGyZAJ3w1kQJlRUU7N9rHUM3kag_Z8tfS4QBEjw";
I0513 12:14:41.522] 	proxy_ssl_verify off;
I0513 12:14:41.522] 	location /metrics {
I0513 12:14:41.523] 		proxy_pass https://10.40.0.2:10257;
I0513 12:14:41.523] 	}
I0513 12:14:41.523] }
I0513 12:14:41.523] 
I0513 12:14:41.523] I0513 12:14:41.473977   84013 metrics_proxy.go:198] Successfully setup metrics-proxy
...

I0513 12:19:27.432] [It] should grab all metrics from a Scheduler.
...

I0513 12:19:27.433] 
I0513 12:19:27.433] May 13 12:19:19.525: FAIL: Unexpected error:
I0513 12:19:27.433]     <*errors.StatusError | 0xc001a726e0>: {
I0513 12:19:27.434]         ErrStatus: {
I0513 12:19:27.434]             TypeMeta: {Kind: "", APIVersion: ""},
I0513 12:19:27.434]             ListMeta: {
I0513 12:19:27.434]                 SelfLink: "",
I0513 12:19:27.434]                 ResourceVersion: "",
I0513 12:19:27.434]                 Continue: "",
I0513 12:19:27.434]                 RemainingItemCount: nil,
I0513 12:19:27.434]             },
I0513 12:19:27.434]             Status: "Failure",
I0513 12:19:27.435]             Message: "the server is currently unable to handle the request (get pods metrics-proxy:10259)",
I0513 12:19:27.435]             Reason: "ServiceUnavailable",
I0513 12:19:27.435]             Details: {
I0513 12:19:27.435]                 Name: "metrics-proxy:10259",
I0513 12:19:27.435]                 Group: "",
I0513 12:19:27.435]                 Kind: "pods",
I0513 12:19:27.435]                 UID: "",
I0513 12:19:27.435]                 Causes: [
I0513 12:19:27.436]                     {
I0513 12:19:27.436]                         Type: "UnexpectedServerResponse",
I0513 12:19:27.436]                         Message: "unknown",
I0513 12:19:27.436]                         Field: "",
I0513 12:19:27.436]                     },
I0513 12:19:27.436]                 ],
I0513 12:19:27.436]                 RetryAfterSeconds: 0,
I0513 12:19:27.436]             },
I0513 12:19:27.436]             Code: 503,
I0513 12:19:27.437]         },
I0513 12:19:27.437]     }
I0513 12:19:27.437]     the server is currently unable to handle the request (get pods metrics-proxy:10259)

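It seems that the scheduler pod had not shown up yet when the e2e test suite was set up, so the nginx config for the forwarder pod was incomplete: it only contains a server block for port 10257 (controller-manager) and none for port 10259 (scheduler).

This PR makes the SetupMetricsProxy function wait for the desired component pods to show up first.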

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2021-05-13T04:52:52Z

@knight42: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

knight42 · 2021-05-13T04:53:15Z

/test pull-kubernetes-e2e-gce-csi-serial

knight42 · 2021-05-13T05:53:56Z

/test pull-kubernetes-e2e-gce-ubuntu-containerd

knight42 · 2021-05-13T11:46:07Z

/test pull-kubernetes-e2e-gce-ubuntu-containerd

knight42 · 2021-05-13T15:51:47Z

@liggitt @aojea Hi, I have updated this PR to make the SetupMetricsProxy function wait for the desired component pods to show up first. I think this should deflake the tests that need to grab metrics.

aojea · 2021-05-13T18:05:52Z

test/e2e/framework/metrics/metrics_proxy.go

this pod has a well known name, can we run it only when it is necessary?

if metricsProxyPod { return nil }

I'm afraid this approach sets a precedent: new components could ask to add more "helper" pods at the beginning of the e2e run, even though most tests don't need them.
i.e. a Conformance run will create this pod even though it doesn't really need it, IIUIC.

This pod is only required if the tests need to fetch metrics from the scheduler or controller-manager, but it is unclear to me how to know here whether the tests need to grab metrics.

I was assuming this was only used by the metrics grabber, so I assumed, maybe wrongly, that we can do it as part of the initialisation of the MetricsGrabber

// NewMetricsGrabber returns new metrics which are initialized.
func NewMetricsGrabber(c clientset.Interface, ec clientset.Interface, kubelets bool, scheduler bool, controllers bool, apiServer bool, clusterAutoscaler bool) (*Grabber, error) {
	SetupMetricsProxy(c)
}

and inside SetupMetricsProxy(), only create the pod if it doesn't exist

yeah, I think we could do that.
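
The suggestion in this thread (set up the proxy from the grabber's initialisation, and only create the pod if it does not exist) could look roughly like the sketch below. Everything here is an assumption made for illustration: `grabberOptions`, `podStore`, and `ensureMetricsProxy` are hypothetical, and an in-memory map stands in for the cluster's pod list.

```go
package main

import "sync"

// metricsProxyPodName mirrors the well-known pod name seen in the error messages.
const metricsProxyPodName = "metrics-proxy"

// grabberOptions is a hypothetical subset of the NewMetricsGrabber flags.
type grabberOptions struct {
	Scheduler   bool
	Controllers bool
}

// podStore is a hypothetical in-memory stand-in for the cluster's pod list.
type podStore struct {
	mu      sync.Mutex
	pods    map[string]bool
	creates int // number of pods actually created, kept for illustration
}

func newPodStore() *podStore {
	return &podStore{pods: make(map[string]bool)}
}

// ensureMetricsProxy creates the proxy pod only when scheduler or
// controller-manager metrics are requested and the pod does not exist yet.
// It reports whether this call created the pod.
func ensureMetricsProxy(s *podStore, o grabberOptions) bool {
	if !o.Scheduler && !o.Controllers {
		return false // nothing needs the proxy, e.g. a Conformance-only run
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pods[metricsProxyPodName] {
		return false // already set up by an earlier grabber
	}
	s.pods[metricsProxyPodName] = true
	s.creates++
	return true
}
```

Because the pod name is fixed, repeated grabber initialisations stay cheap: only the first one that actually needs scheduler or controller-manager metrics creates anything.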

aojea · 2021-05-13T18:08:51Z

test/e2e/framework/metrics/metrics_proxy.go

this doesn't seem to be used later

line 70 will log the components' name, but if you like, I could remove this field, since we could tell which component it is from the port number.

no no no, I missed that, sorry. I thought that wasn't used at all

jingxu97 · 2021-05-13T21:39:52Z

currently in-tree storage tests for Windows are failing. Seems related to this? https://testgrid.kubernetes.io/google-windows#gce-windows-2019-containerd-master

The error is

metrics_grabber.go:103] Did not receive an external client interface. Grabbing metrics from ClusterAutoscaler is disabled.
May 13 17:56:18.470: FAIL: Error getting c-m metrics : error waiting for controller manager pod to expose metrics: timed out waiting for the condition; the server is currently unable to handle the request (get pods metrics-proxy:10257)

pohly · 2021-05-14T09:03:47Z

/cc

knight42 · 2021-05-27T06:36:17Z

/reopen

k8s-ci-robot · 2021-05-27T06:36:28Z

@knight42: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pacoxu · 2021-05-27T06:37:26Z

/priority critical-urgent
/kind failing-test
/assign @pohly

Would you update the description with Fixes #101894 so that the issue is closed after this is merged?

pacoxu

/lgtm

pohly · 2021-05-27T07:45:07Z

/approve

Let's merge this as an interim improvement while discussing in #102050 what the long-term solution should look like.

k8s-ci-robot · 2021-05-27T07:45:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knight42, pacoxu, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/e2e/framework/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2021-05-27T07:46:09Z

Hmm, even with this PR there was a "MetricsGrabber should grab all metrics from a ControllerManager" failure in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/101960/pull-kubernetes-e2e-gce-ubuntu-containerd/1397803874319863808/

knight42 · 2021-05-27T07:57:22Z

@pohly According to the logs https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/101960/pull-kubernetes-e2e-gce-ubuntu-containerd/1397803874319863808/build-log.txt

I0527 07:04:45.055] I0527 07:04:30.446344   83656 metrics_proxy.go:70] Found components: [{kube-controller-manager-e2e-aec6e2e90c-a7d53-master 10257 10.40.0.2}]
I0527 07:04:45.055] I0527 07:04:35.612214   83656 metrics_proxy.go:69] Only 1 components found. Will retry.
I0527 07:04:45.055] I0527 07:04:35.612263   83656 metrics_proxy.go:70] Found components: [{kube-controller-manager-e2e-aec6e2e90c-a7d53-master 10257 10.40.0.2}]
I0527 07:04:45.055] I0527 07:04:40.612211   83656 metrics_proxy.go:80] Found components: [{kube-controller-manager-e2e-aec6e2e90c-a7d53-master 10257 10.40.0.2} {kube-scheduler-e2e-aec6e2e90c-a7d53-master 10259 }]
I0527 07:04:45.055] I0527 07:04:45.006129   83656 metrics_proxy.go:209] Successfully setup metrics-proxy

the scheduler pod showed up with an empty ip, so nginx failed to forward the request. It seems that we have to wait for the ips of both pods to be filled.

pacoxu · 2021-05-27T08:04:26Z

pod.Status.PodIP is not set correctly in the metrics-proxy.

pacoxu · 2021-05-27T08:05:52Z

/hold

aojea · 2021-05-27T08:15:30Z

the scheduler pod showed up with an empty ip, so nginx failed to forward the request. It seems that we have to wait for the ips of both pods to be filled.

IIRC the WaitForRunningAndReady helper will take care of it

pacoxu · 2021-05-27T10:07:43Z

/hold cancel
/lgtm
/retest

@pohly will update the fix in #102050. You may want to take the IP status into account as well (maybe check the pod's running status, as @aojea pointed out).

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 13, 2021

k8s-ci-robot requested review from oomichi and pohly May 13, 2021 04:53

knight42 mentioned this pull request May 13, 2021

e2e metrics grabber failing test #101953

Closed

knight42 force-pushed the fix/deflake-metrics-proxy branch from b9cfbaa to a1e1511 Compare May 13, 2021 15:44

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 13, 2021

knight42 changed the title ~~[WIP] investigating errors caused by metrics-proxy~~ Deflake tests that need to grab metrics from controller-manager or scheduler May 13, 2021

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2021

aojea reviewed May 13, 2021

View reviewed changes

k8s-ci-robot reopened this May 27, 2021

pacoxu reviewed May 27, 2021

View reviewed changes

k8s-ci-robot assigned pacoxu May 27, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2021

pacoxu approved these changes May 27, 2021

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2021

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2021

fix: skip pods with empty ip

781c65a

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2021

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 27, 2021

k8s-ci-robot merged commit c495744 into kubernetes:master May 27, 2021

k8s-ci-robot added this to the v1.22 milestone May 27, 2021

knight42 deleted the fix/deflake-metrics-proxy branch May 27, 2021 10:42

mauriciopoppe mentioned this pull request Jul 30, 2021

[sig-storage] tests are massively failing in upgrade tests #103822

Closed

pacoxu mentioned this pull request Apr 15, 2022

[sig-node] Summary API [NodeConformance] when querying /stats/summary ... networking info is nil on containerd #109082

Closed
