p0lyn0mial
Contributor

@p0lyn0mial p0lyn0mial commented Sep 23, 2020

What type of PR is this?
/kind cleanup

What this PR does / why we need it: It stops the following panic and stack trace from being written to the server's log:

E0921 07:09:22.814963       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
goroutine 1088 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2039b80, 0xc000552230)
	k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc00257dc98, 0x1, 0x1)
	k8s.io/apimachinery@<version>/pkg/util/runtime/runtime.go:48 +0x89
panic(0x2039b80, 0xc000552230)
	runtime/panic.go:969 +0x175
k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0026077c0, 0xc002444460)
	k8s.io/apiserver@<version>/pkg/server/filters/timeout.go:257 +0x1bc
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc0002df9a0, 0x26b7640, 0xc0005a7c00, 0xc00243ff00)
	k8s.io/apiserver@<version>/pkg/server/filters/timeout.go:141 +0x2f3
k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x26b7640, 0xc0005a7c00, 0xc00243fe00)
	k8s.io/apiserver@<version>/pkg/server/filters/waitgroup.go:59 +0x137
net/http.HandlerFunc.ServeHTTP(0xc00127d650, 0x26b7640, 0xc0005a7c00, 0xc00243fe00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x26b7640, 0xc0005a7c00, 0xc00243fd00)
	k8s.io/apiserver@<version>/pkg/endpoints/filters/requestinfo.go:39 +0x269
net/http.HandlerFunc.ServeHTTP(0xc00127d680, 0x26b7640, 0xc0005a7c00, 0xc00243fd00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1(0x26b7640, 0xc0005a7c00, 0xc00243fc00)
	k8s.io/apiserver@<version>/pkg/endpoints/filters/audit_annotations.go:37 +0x142
net/http.HandlerFunc.ServeHTTP(0xc0002df9c0, 0x26b7640, 0xc0005a7c00, 0xc00243fc00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1(0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	k8s.io/apiserver@<version>/pkg/endpoints/filters/warning.go:35 +0x1a7
net/http.HandlerFunc.ServeHTTP(0xc0002df9e0, 0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	k8s.io/apiserver@<version>/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc0002dfa00, 0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x26aa080, 0xc002394110, 0xc00243fa00)
	k8s.io/apiserver@<version>/pkg/server/httplog/httplog.go:91 +0x2f2
net/http.HandlerFunc.ServeHTTP(0xc0002dfa20, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x26aa080, 0xc002394110, 0xc00243fa00)
	k8s.io/apiserver@<version>/pkg/server/filters/wrap.go:51 +0xe6
net/http.HandlerFunc.ServeHTTP(0xc0002dfa40, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00127d710, 0x26aa080, 0xc002394110, 0xc00243fa00)
	k8s.io/apiserver@<version>/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc002255c00, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:2843 +0xa3
net/http.initALPNRequest.ServeHTTP(0x26bd780, 0xc001dc88d0, 0xc00163b500, 0xc002255c00, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:3415 +0x8d
golang.org/x/net/http2.(*serverConn).runHandler(0xc001cd1080, 0xc002394110, 0xc00243fa00, 0xc0026076e0)
	golang.org/x/net@<version>/http2/server.go:2147 +0x8b
created by golang.org/x/net/http2.(*serverConn).processHeaders
	golang.org/x/net@<version>/http2/server.go:1881 +0x505
E0921 07:09:22.815015       1 wrap.go:39] apiserver panic'd on GET /openapi/v2

From now on the timeout handler will panic with http.ErrAbortHandler when the timeout elapses after a response to the client has already been started. As a result, the connection is forcefully closed (the client sees EOF) and the abort is counted in requestAbortsTotal.

Previously the stack trace was written to the logs, which caused confusion and false alarms.

Additionally, a new metric requestAbortsTotal was defined to count aborted requests. The new metric allows for aggregation for each group, version, verb, resource, subresource and scope.
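
To illustrate what "aggregation for each group, version, verb, resource, subresource and scope" can look like, here is a rough stand-in for a labelled counter built only on the standard library. It is not the code added by this PR (which records aborts through `metrics.RecordRequestAbort`); the types `abortLabels` and `abortCounter`, their methods, and the label values are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// abortLabels is a hypothetical label set mirroring the dimensions
// named in the PR description.
type abortLabels struct {
	Group, Version, Verb, Resource, Subresource, Scope string
}

// abortCounter is a stand-in for a labelled counter vector.
type abortCounter struct {
	mu     sync.Mutex
	counts map[abortLabels]int
}

func newAbortCounter() *abortCounter {
	return &abortCounter{counts: make(map[abortLabels]int)}
}

// Inc adds one aborted request to the series identified by l.
func (c *abortCounter) Inc(l abortLabels) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[l]++
}

// Sum adds up every series for which match returns true,
// i.e. it aggregates over the labels that match ignores.
func (c *abortCounter) Sum(match func(abortLabels) bool) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	total := 0
	for l, n := range c.counts {
		if match(l) {
			total += n
		}
	}
	return total
}

func main() {
	c := newAbortCounter()
	pods := abortLabels{Version: "v1", Verb: "LIST", Resource: "pods", Scope: "namespace"}
	c.Inc(pods)
	c.Inc(pods)
	c.Inc(abortLabels{Group: "apps", Version: "v1", Verb: "GET", Resource: "deployments", Scope: "resource"})

	fmt.Println("LIST aborts:", c.Sum(func(l abortLabels) bool { return l.Verb == "LIST" }))
	fmt.Println("all aborts:", c.Sum(func(abortLabels) bool { return true }))
}
```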

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer: Initially I wanted to use errors.Is and errors.As, but that would require an additional function anyway, because the recovered value (err interface{}) has to be type-asserted to error first. It would also require coupling errConnKilled with http.ErrAbortHandler, which I didn't like.

Does this PR introduce a user-facing change?:

A new metric `requestAbortsTotal` has been introduced that counts aborted requests for each `group`, `version`, `verb`, `resource`, `subresource` and `scope`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 23, 2020
@p0lyn0mial
Contributor Author

/assign @sttts @mfojtik

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 23, 2020
@mfojtik
Contributor

mfojtik commented Sep 23, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2020
Contributor Author

maybe we should at least log why we dropped the request?

Contributor

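// Inside a deferred function: recover first, then inspect the recovered value.
if r := recover(); r != nil {
  if err, ok := r.(error); ok && errors.Is(err, http.ErrAbortHandler) {
    klog.Error(err)
    panic(http.ErrAbortHandler)
  }
  panic(r) // any other panic propagates unchanged
}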

Contributor Author

From the Go docs: "To abort a handler so the client sees an interrupted response but the server doesn't log an error, panic with the value ErrAbortHandler."

Maybe we should actually panic with ErrAbortHandler instead of errConnKilled?
On the client side that would be seen as: timeout_test.go:220: Get "http://127.0.0.1:61805": EOF

@p0lyn0mial p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 7f11683 to 06643be Compare September 24, 2020 08:19
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 24, 2020
Contributor Author

Panicking with http.ErrAbortHandler (well, any panic would do) will actually close the underlying connection.

Previously errConnKilled was captured and logged, and the connection was left intact, since calling WriteHeader multiple times is not supported by the standard library.

I think that before closing the connection (before panicking) we could actually try to send a message to the client.

Contributor Author

> I think that before closing the connection (before panicking) we could actually try to send a message to the client.

Even if we did, the message would be ignored by the (Go) client library: "On error, any Response can be ignored."
https://golang.org/src/net/http/client.go?s=20422:20474#L575

Contributor

Where did we catch errConnKilled?

@p0lyn0mial p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 06643be to 1cc75f8 Compare September 24, 2020 09:09
@fedebongio
Contributor

/assign @lavalamp

Contributor

I'd prefer a metric, or a rate limit on this log message like that in https://github.com/kubernetes/kubernetes/pull/88600/files

Contributor

+1 for metric.

What do we have now for errors like EOF or TLS errors with --v=2? They are also visible, aren't they? A warning here is definitely too heavy. We won't warn about EOFs either, will we?

Contributor

What will we see in --v=2 for timeouts in general?

@lavalamp
Contributor

/approve

I think my preference order is: metric, rate-limited log message, no log message, log message

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Nov 4, 2020
@p0lyn0mial p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 40761c2 to 5eeda83 Compare November 5, 2020 10:23
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 5, 2020
@p0lyn0mial
Contributor Author

/retest

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, p0lyn0mial

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 5, 2020
Contributor

nit: due to a timeout, for each group ...

Otherwise it reads that it timed out for every group.

Contributor

..., but we want to suppress an unhelpful stack trace in the logs.

@p0lyn0mial p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch 2 times, most recently from 3be9663 to 9001d67 Compare November 5, 2020 14:22
@p0lyn0mial
Contributor Author

/test pull-kubernetes-bazel-test

Aborted requests are the ones that were disrupted with http.ErrAbortHandler.
For example, the timeout handler will panic with http.ErrAbortHandler when a response to the client has already been sent
and the timeout has elapsed.

Additionally, a new metric requestAbortsTotal was defined to count aborted requests. The new metric allows for aggregation for each group, version, verb, resource, subresource and scope.
@p0lyn0mial p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 9001d67 to 057986e Compare November 9, 2020 08:24
metrics.RecordRequestAbort(req, nil)
return
}
metrics.RecordRequestAbort(req, info)
Contributor

I'd like to see a rate limited log message about this. We rate limit runtime.HandleError by default, so I'll add that here as a followup because @p0lyn0mial is in Europe and on vacation this week.

@deads2k
Contributor

deads2k commented Nov 10, 2020

/lgtm

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.