knight42
Member

@knight42 knight42 commented May 13, 2021

What type of PR is this?

/kind cleanup
/kind flake

What this PR does / why we need it:

According to the logs in this test https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/101960/pull-kubernetes-e2e-gce-ubuntu-containerd/1392808375145730048/:

I0513 12:14:41.521] I0513 12:14:37.277313   84013 metrics_proxy.go:150] [DEBUG] metrics-proxy nginx config: 
I0513 12:14:41.521] server {
I0513 12:14:41.521] 	listen 10257;
I0513 12:14:41.521] 	server_name _;
I0513 12:14:41.522] 	proxy_set_header Authorization "Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6InZDZEkxeTZLbjNKUGswNk5QeWdwa3ROT0p3M244NnFfRjBza3lpQWdxVncifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJtZXRyaWNzLXByb3h5LXRva2VuLWd0cDZoIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6Im1ldHJpY3MtcHJveHkiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiIyZmE0NDA5Yy0zMGE2LTRlMWQtODJiMS1jMzhlMTFiZjM2MjUiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZS1zeXN0ZW06bWV0cmljcy1wcm94eSJ9.D28Lvn8lmqHH21r_vfWJMFSp0uRXDxhX0m-6mNA7RdX-2S202kL1GqU7RUIV7F3uqPJB_8de6yzLqDu3kaHmpNB1FN68LQcgTCsV5A-V8axwXWCwlaCzb-40ig0UvasyaxOx4wg2TMWK3HxnViSEDD3B2t0KOefbHugeexImbYdsmUJkp_6Nctj2DSK3EH67NwmONYR61Xphnyh9a_CCP3rI6qvmgGD3pbtCaGuiupoBcHW1QniwkM4SlvOoov_g5Sx8l8vLWfaFbLhjC71tRGmH2DEyoXjjvJxe8j4An7bHk5ByGyZAJ3w1kQJlRUU7N9rHUM3kag_Z8tfS4QBEjw";
I0513 12:14:41.522] 	proxy_ssl_verify off;
I0513 12:14:41.522] 	location /metrics {
I0513 12:14:41.523] 		proxy_pass https://10.40.0.2:10257;
I0513 12:14:41.523] 	}
I0513 12:14:41.523] }
I0513 12:14:41.523] 
I0513 12:14:41.523] I0513 12:14:41.473977   84013 metrics_proxy.go:198] Successfully setup metrics-proxy
...

I0513 12:19:27.432] [It] should grab all metrics from a Scheduler.
...

I0513 12:19:27.433] 
I0513 12:19:27.433] May 13 12:19:19.525: FAIL: Unexpected error:
I0513 12:19:27.433]     <*errors.StatusError | 0xc001a726e0>: {
I0513 12:19:27.434]         ErrStatus: {
I0513 12:19:27.434]             TypeMeta: {Kind: "", APIVersion: ""},
I0513 12:19:27.434]             ListMeta: {
I0513 12:19:27.434]                 SelfLink: "",
I0513 12:19:27.434]                 ResourceVersion: "",
I0513 12:19:27.434]                 Continue: "",
I0513 12:19:27.434]                 RemainingItemCount: nil,
I0513 12:19:27.434]             },
I0513 12:19:27.434]             Status: "Failure",
I0513 12:19:27.435]             Message: "the server is currently unable to handle the request (get pods metrics-proxy:10259)",
I0513 12:19:27.435]             Reason: "ServiceUnavailable",
I0513 12:19:27.435]             Details: {
I0513 12:19:27.435]                 Name: "metrics-proxy:10259",
I0513 12:19:27.435]                 Group: "",
I0513 12:19:27.435]                 Kind: "pods",
I0513 12:19:27.435]                 UID: "",
I0513 12:19:27.435]                 Causes: [
I0513 12:19:27.436]                     {
I0513 12:19:27.436]                         Type: "UnexpectedServerResponse",
I0513 12:19:27.436]                         Message: "unknown",
I0513 12:19:27.436]                         Field: "",
I0513 12:19:27.436]                     },
I0513 12:19:27.436]                 ],
I0513 12:19:27.436]                 RetryAfterSeconds: 0,
I0513 12:19:27.436]             },
I0513 12:19:27.436]             Code: 503,
I0513 12:19:27.437]         },
I0513 12:19:27.437]     }
I0513 12:19:27.437]     the server is currently unable to handle the request (get pods metrics-proxy:10259)

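It seems that the scheduler pod had not shown up yet when we set up the e2e test suite, so the nginx config for the forwarder pod was incomplete: it contains a server block for port 10257 but none for port 10259, which is the port the failing scheduler test requests.

This PR makes the SetupMetricsProxy function wait for the desired component pods to show up first.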

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 13, 2021
@k8s-ci-robot
Contributor

@knight42: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 13, 2021
@knight42
Member Author

/test pull-kubernetes-e2e-gce-csi-serial

@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 13, 2021
@k8s-ci-robot k8s-ci-robot requested review from oomichi and pohly May 13, 2021 04:53
@knight42
Member Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

1 similar comment
@knight42
Member Author

/test pull-kubernetes-e2e-gce-ubuntu-containerd

@knight42 knight42 force-pushed the fix/deflake-metrics-proxy branch from b9cfbaa to a1e1511 Compare May 13, 2021 15:44
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 13, 2021
@knight42 knight42 changed the title [WIP] investigating errors caused by metrics-proxy Deflake tests that need to grab metrics from controller-manager or scheduler May 13, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 13, 2021
@knight42
Member Author

@liggitt @aojea Hi, I have updated this PR to make the SetupMetricsProxy function wait for the desired component pods to show up first. I think this should deflake the tests that need to grab metrics.

Member

@aojea aojea May 13, 2021

This pod has a well-known name; can we run it only when it is necessary?

if metricsProxyPod {
    return nil
}

I'm afraid this approach sets a precedent: new components could ask to add more "helper" pods at the beginning of the e2e run, even though most test runs don't need them.
For example, a Conformance run will create this pod even though it doesn't really need it, IIUC.

Member Author

This pod is only required if the tests need to fetch metrics from the scheduler or controller-manager, but it is unclear to me how to tell at this point whether the tests need to grab metrics.

Member

@aojea aojea May 14, 2021

I was assuming this was only used by the metrics grabber, so I (maybe wrongly) assumed that we could do it as part of the initialisation of the MetricsGrabber:

// NewMetricsGrabber returns new metrics which are initialized.
func NewMetricsGrabber(c clientset.Interface, ec clientset.Interface, kubelets bool, scheduler bool, controllers bool, apiServer bool, clusterAutoscaler bool) (*Grabber, error) {

   SetupMetricsProxy(c)

}

and inside SetupMetricsProxy(), only create the pod if it doesn't exist
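
A minimal sketch of the "only create the pod if it doesn't exist" step, under stated assumptions: `podClient`, `ensureProxyPod`, `errNotFound`, and `fakeClient` are hypothetical names invented for this example, and a small in-memory fake stands in for the real clientset.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for an API "not found" error.
var errNotFound = errors.New("not found")

// podClient is a hypothetical minimal interface; a real implementation
// would wrap the Kubernetes clientset.
type podClient interface {
	Get(name string) error
	Create(name string) error
}

// ensureProxyPod creates the named pod only if it does not exist yet.
// It reports whether this call created the pod.
func ensureProxyPod(c podClient, name string) (bool, error) {
	getErr := c.Get(name)
	if getErr == nil {
		return false, nil // already there, nothing to do
	}
	if !errors.Is(getErr, errNotFound) {
		return false, getErr // unexpected failure, do not try to create
	}
	if createErr := c.Create(name); createErr != nil {
		return false, createErr
	}
	return true, nil
}

// fakeClient keeps pods in memory so the sketch runs without a cluster.
type fakeClient struct{ pods map[string]bool }

func (f *fakeClient) Get(name string) error {
	if f.pods[name] {
		return nil
	}
	return errNotFound
}

func (f *fakeClient) Create(name string) error {
	f.pods[name] = true
	return nil
}

func main() {
	c := &fakeClient{pods: map[string]bool{}}
	first, _ := ensureProxyPod(c, "metrics-proxy")
	second, _ := ensureProxyPod(c, "metrics-proxy")
	fmt.Println(first, second)
}
```

With a check like this, calling the setup from every grabber construction would be idempotent, so runs that never build a grabber would not create the pod at all.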

Member Author

yeah, I think we could do that.

Member

this doesn't seem to be used later

Member Author

Line 70 logs the components' names, but if you like, I could remove this field, since we can tell which component it is from the port number.

Member

No no no, I missed that, sorry. I thought it wasn't used at all.

@jingxu97
Contributor

jingxu97 commented May 13, 2021

Currently the in-tree storage tests for Windows are failing. Could this be related? https://testgrid.kubernetes.io/google-windows#gce-windows-2019-containerd-master

The error is

metrics_grabber.go:103] Did not receive an external client interface. Grabbing metrics from ClusterAutoscaler is disabled.
May 13 17:56:18.470: FAIL: Error getting c-m metrics : error waiting for controller manager pod to expose metrics: timed out waiting for the condition; the server is currently unable to handle the request (get pods metrics-proxy:10257)

@pohly
Contributor

pohly commented May 14, 2021

/cc

@knight42
Member Author

/reopen

@k8s-ci-robot k8s-ci-robot reopened this May 27, 2021
@k8s-ci-robot
Contributor

@knight42: Reopened this PR.

In response to this:

/reopen

@pacoxu
Member

pacoxu commented May 27, 2021

/priority critical-urgent
/kind failing-test
/assign @pohly

Would you update the description with Fixes #101894 so that the issue is closed after this is merged?

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 27, 2021
Member

@pacoxu pacoxu left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2021
@pohly
Contributor

pohly commented May 27, 2021

/approve

Let's merge this as an interim improvement while discussing in #102050 what the long-term solution should look like.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knight42, pacoxu, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2021
@pohly
Contributor

pohly commented May 27, 2021

Hmm, even with this PR there was a "MetricsGrabber should grab all metrics from a ControllerManager" failure in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/101960/pull-kubernetes-e2e-gce-ubuntu-containerd/1397803874319863808/

@knight42
Member Author

knight42 commented May 27, 2021

@pohly According to the logs https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/101960/pull-kubernetes-e2e-gce-ubuntu-containerd/1397803874319863808/build-log.txt

I0527 07:04:45.055] I0527 07:04:30.446344   83656 metrics_proxy.go:70] Found components: [{kube-controller-manager-e2e-aec6e2e90c-a7d53-master 10257 10.40.0.2}]
I0527 07:04:45.055] I0527 07:04:35.612214   83656 metrics_proxy.go:69] Only 1 components found. Will retry.
I0527 07:04:45.055] I0527 07:04:35.612263   83656 metrics_proxy.go:70] Found components: [{kube-controller-manager-e2e-aec6e2e90c-a7d53-master 10257 10.40.0.2}]
I0527 07:04:45.055] I0527 07:04:40.612211   83656 metrics_proxy.go:80] Found components: [{kube-controller-manager-e2e-aec6e2e90c-a7d53-master 10257 10.40.0.2} {kube-scheduler-e2e-aec6e2e90c-a7d53-master 10259 }]
I0527 07:04:45.055] I0527 07:04:45.006129   83656 metrics_proxy.go:209] Successfully setup metrics-proxy

The scheduler pod showed up with an empty IP, so nginx failed to forward the request. It seems that we have to wait for the IPs of both pods to be filled in.

@pacoxu
Member

pacoxu commented May 27, 2021

pod.Status.PodIP is not set correctly in the metrics-proxy.

@pacoxu
Member

pacoxu commented May 27, 2021

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2021
@aojea
Member

aojea commented May 27, 2021

> the scheduler pod showed up with an empty ip, so nginx failed to forward the request. It seems that we have to wait for the ips of both pods to be filled.

IIRC the WaitForRunningAndReady helper will take care of that.

@pacoxu
Member

pacoxu commented May 27, 2021

/hold cancel
/lgtm
/retest

@pohly will update the fix in #102050. You may want to take the IP status into account as well (maybe check the pod's running status, as @aojea pointed out).

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 27, 2021
@k8s-ci-robot k8s-ci-robot merged commit c495744 into kubernetes:master May 27, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 27, 2021
@knight42 knight42 deleted the fix/deflake-metrics-proxy branch May 27, 2021 10:42