stop logging killing connection/stream because serving request timed out and response had been started #95002

p0lyn0mial · 2020-09-23T12:51:54Z

What type of PR is this?
/kind cleanup

What this PR does / why we need it: it stops persisting the following error in the server's log:

E0921 07:09:22.814963       1 runtime.go:78] Observed a panic: &errors.errorString{s:"killing connection/stream because serving request timed out and response had been started"} (killing connection/stream because serving request timed out and response had been started)
goroutine 1088 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x2039b80, 0xc000552230)
	k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa6
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0xc00257dc98, 0x1, 0x1)
	k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x89
panic(0x2039b80, 0xc000552230)
	runtime/panic.go:969 +0x175
k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc0026077c0, 0xc002444460)
	k8s.io/[email protected]/pkg/server/filters/timeout.go:257 +0x1bc
k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc0002df9a0, 0x26b7640, 0xc0005a7c00, 0xc00243ff00)
	k8s.io/[email protected]/pkg/server/filters/timeout.go:141 +0x2f3
k8s.io/apiserver/pkg/server/filters.WithWaitGroup.func1(0x26b7640, 0xc0005a7c00, 0xc00243fe00)
	k8s.io/[email protected]/pkg/server/filters/waitgroup.go:59 +0x137
net/http.HandlerFunc.ServeHTTP(0xc00127d650, 0x26b7640, 0xc0005a7c00, 0xc00243fe00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0x26b7640, 0xc0005a7c00, 0xc00243fd00)
	k8s.io/[email protected]/pkg/endpoints/filters/requestinfo.go:39 +0x269
net/http.HandlerFunc.ServeHTTP(0xc00127d680, 0x26b7640, 0xc0005a7c00, 0xc00243fd00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithAuditAnnotations.func1(0x26b7640, 0xc0005a7c00, 0xc00243fc00)
	k8s.io/[email protected]/pkg/endpoints/filters/audit_annotations.go:37 +0x142
net/http.HandlerFunc.ServeHTTP(0xc0002df9c0, 0x26b7640, 0xc0005a7c00, 0xc00243fc00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithWarningRecorder.func1(0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	k8s.io/[email protected]/pkg/endpoints/filters/warning.go:35 +0x1a7
net/http.HandlerFunc.ServeHTTP(0xc0002df9e0, 0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/endpoints/filters.WithCacheControl.func1(0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	k8s.io/[email protected]/pkg/endpoints/filters/cachecontrol.go:31 +0xa8
net/http.HandlerFunc.ServeHTTP(0xc0002dfa00, 0x26b7640, 0xc0005a7c00, 0xc00243fb00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/server/httplog.WithLogging.func1(0x26aa080, 0xc002394110, 0xc00243fa00)
	k8s.io/[email protected]/pkg/server/httplog/httplog.go:91 +0x2f2
net/http.HandlerFunc.ServeHTTP(0xc0002dfa20, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/server/filters.withPanicRecovery.func1(0x26aa080, 0xc002394110, 0xc00243fa00)
	k8s.io/[email protected]/pkg/server/filters/wrap.go:51 +0xe6
net/http.HandlerFunc.ServeHTTP(0xc0002dfa40, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:2042 +0x44
k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc00127d710, 0x26aa080, 0xc002394110, 0xc00243fa00)
	k8s.io/[email protected]/pkg/server/handler.go:189 +0x51
net/http.serverHandler.ServeHTTP(0xc002255c00, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:2843 +0xa3
net/http.initALPNRequest.ServeHTTP(0x26bd780, 0xc001dc88d0, 0xc00163b500, 0xc002255c00, 0x26aa080, 0xc002394110, 0xc00243fa00)
	net/http/server.go:3415 +0x8d
golang.org/x/net/http2.(*serverConn).runHandler(0xc001cd1080, 0xc002394110, 0xc00243fa00, 0xc0026076e0)
	golang.org/x/[email protected]/http2/server.go:2147 +0x8b
created by golang.org/x/net/http2.(*serverConn).processHeaders
	golang.org/x/[email protected]/http2/server.go:1881 +0x505
E0921 07:09:22.815015       1 wrap.go:39] apiserver panic'd on GET /openapi/v2

From now on, the timeout handler panics with http.ErrAbortHandler when the timeout elapses after a response to the client has already been started. As a result, the connection is forcefully closed (the client sees EOF) and the abort is counted in requestAbortsTotal.

Previously, the stack trace was persisted in the logs, which caused confusion and false alarms.

Additionally, a new metric, requestAbortsTotal, was defined to count aborted requests. It can be aggregated by group, version, verb, resource, subresource and scope.
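As an illustration of what counting per label combination means, the sketch below keeps an in-memory counter keyed by the six labels named above. It is an assumption-laden stand-in: `abortLabels`, `abortCounter` and the sample label values are hypothetical and do not reflect the real metrics package used by the apiserver.

```go
package main

import (
	"fmt"
	"sync"
)

// abortLabels mirrors the label names listed in the PR description.
type abortLabels struct {
	Group, Version, Verb, Resource, Subresource, Scope string
}

// abortCounter is a hypothetical in-memory counter standing in for a real
// metrics vector; it only illustrates the label-based aggregation.
type abortCounter struct {
	mu     sync.Mutex
	counts map[abortLabels]int
}

func newAbortCounter() *abortCounter {
	return &abortCounter{counts: make(map[abortLabels]int)}
}

// Inc records one aborted request for the given label combination.
func (c *abortCounter) Inc(l abortLabels) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[l]++
}

// TotalByVerb sums the counts for one verb across all other labels.
func (c *abortCounter) TotalByVerb(verb string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	total := 0
	for l, n := range c.counts {
		if l.Verb == verb {
			total += n
		}
	}
	return total
}

func main() {
	c := newAbortCounter()
	// Sample label values are made up for the example.
	c.Inc(abortLabels{Group: "apps", Version: "v1", Verb: "LIST", Resource: "deployments", Scope: "cluster"})
	c.Inc(abortLabels{Group: "apps", Version: "v1", Verb: "LIST", Resource: "deployments", Scope: "namespace"})
	c.Inc(abortLabels{Version: "v1", Verb: "GET", Resource: "pods", Subresource: "log", Scope: "resource"})
	fmt.Println(c.TotalByVerb("LIST"), c.TotalByVerb("GET")) // prints "2 1"
}
```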

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer: Initially I wanted to use errors.Is and errors.As, but that would still require an additional function because we first need to cast the err interface{} value. It would also require coupling errConnKilled with http.ErrAbortHandler, which I didn't like.

Does this PR introduce a user-facing change?:

A new metric `requestAbortsTotal` has been introduced that counts aborted requests for each `group`, `version`, `verb`, `resource`, `subresource` and `scope`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

p0lyn0mial · 2020-09-23T12:52:13Z

/assign @sttts @mfojtik

mfojtik · 2020-09-23T13:08:56Z

/lgtm

staging/src/k8s.io/apimachinery/pkg/util/net/http_test.go

staging/src/k8s.io/apimachinery/pkg/util/net/interface.go

staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go

p0lyn0mial · 2020-09-23T14:24:34Z

staging/src/k8s.io/apiserver/pkg/server/filters/wrap.go

maybe we should at least log why we dropped the request?

    if err := recover(); err == http.ErrAbortHandler {
        klog.Error(err)
        panic(http.ErrAbortHandler)
    }

From the Go docs: "To abort a handler so the client sees an interrupted response but the server doesn't log an error, panic with the value ErrAbortHandler."

Maybe we should actually panic with ErrAbortHandler instead of errConnKilled?
On the client side that would be seen as: timeout_test.go:220: Get "http://127.0.0.1:61805": EOF

p0lyn0mial · 2020-09-24T08:34:04Z

staging/src/k8s.io/apiserver/pkg/server/filters/wrap.go

Panicking with http.ErrAbortHandler (well, any panic would do) actually closes the underlying connection.

Previously errConnKilled was captured, logged and the connection was left intact since calling WriteHeader multiple times is not supported by the std library.

I think that before closing the connection (before panicking) we could actually try to send a message to the client.

> I think that before closing the connection (before panicking) we could actually try to send a message to the client.

Even if we did, the message would be ignored by the Go client library: "On error, any Response can be ignored"
https://golang.org/src/net/http/client.go?s=20422:20474#L575

Where did we catch errConnKilled ?

fedebongio · 2020-09-24T20:06:04Z

/assign @lavalamp

lavalamp · 2020-09-24T21:00:07Z

staging/src/k8s.io/apiserver/pkg/server/filters/wrap.go

I'd prefer a metric, or a rate limit on this log message like that in https://github.com/kubernetes/kubernetes/pull/88600/files

+1 for metric.

What do we have now for errors like EOF or tls errors with --v=2? They are also visible, aren't they? A warning here is definitely too heavy. We won't warn about EOFs either, will we?

What will we see in --v=2 for timeouts in general?

lavalamp · 2020-09-24T21:03:16Z

/approve

I think my preference order is: metric, rate-limited log message, no log message, log message

staging/src/k8s.io/apiserver/pkg/server/filters/timeout_test.go

staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go

p0lyn0mial · 2020-11-05T10:55:29Z

/retest

k8s-ci-robot · 2020-11-05T12:22:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, p0lyn0mial

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-scheduler/OWNERS~~ [lavalamp]
~~pkg/kubeapiserver/OWNERS~~ [lavalamp]
~~staging/src/k8s.io/apiserver/OWNERS~~ [lavalamp]
~~staging/src/k8s.io/controller-manager/OWNERS~~ [lavalamp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sttts · 2020-11-05T12:52:16Z

staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go

nit: due to a timeout, for each group ...

Otherwise it reads that it timed out for every group.

sttts · 2020-11-05T12:54:27Z

staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go

...., but we want to suppress an unhelpful stack trace in the logs.

p0lyn0mial · 2020-11-09T07:54:31Z

/test pull-kubernetes-bazel-test

Aborted requests are the ones that were disrupted with http.ErrAbortHandler. For example, the timeout handler will panic with http.ErrAbortHandler when a response to the client has been already sent and the timeout elapsed. Additionally, a new metric requestAbortsTotal was defined to count aborted requests. The new metric allows for aggregation for each group, version, verb, resource, subresource and scope.

deads2k · 2020-11-10T18:38:29Z

staging/src/k8s.io/apiserver/pkg/server/filters/wrap.go

+				metrics.RecordRequestAbort(req, nil)
+				return
+			}
+			metrics.RecordRequestAbort(req, info)


I'd like to see a rate limited log message about this. We rate limit runtime.HandleError by default, so I'll add that here as a followup because @p0lyn0mial is in Europe and on vacation this week.

deads2k · 2020-11-10T18:39:18Z

/lgtm

k8s-ci-robot assigned mfojtik and sttts Sep 23, 2020

k8s-ci-robot requested review from gmarek and lavalamp September 23, 2020 12:52

k8s-ci-robot added area/apiserver sig/api-machinery and removed needs-sig labels Sep 23, 2020

k8s-ci-robot added the lgtm label Sep 23, 2020

sttts reviewed Sep 23, 2020

staging/src/k8s.io/apimachinery/pkg/util/net/http_test.go

sttts reviewed Sep 23, 2020

staging/src/k8s.io/apimachinery/pkg/util/net/interface.go

sttts reviewed Sep 23, 2020

staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go

p0lyn0mial commented Sep 23, 2020

p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 7f11683 to 06643be Compare September 24, 2020 08:19

k8s-ci-robot added size/M and removed lgtm size/L labels Sep 24, 2020

p0lyn0mial commented Sep 24, 2020

p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 06643be to 1cc75f8 Compare September 24, 2020 09:09

k8s-ci-robot assigned lavalamp Sep 24, 2020

lavalamp reviewed Sep 24, 2020

sttts reviewed Sep 28, 2020

staging/src/k8s.io/apiserver/pkg/server/filters/timeout_test.go

sttts reviewed Sep 28, 2020

staging/src/k8s.io/apiserver/pkg/server/filters/timeout_test.go

k8s-ci-robot added needs-rebase sig/instrumentation labels Nov 4, 2020

p0lyn0mial commented Nov 4, 2020

staging/src/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go

p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 40761c2 to 5eeda83 Compare November 5, 2020 10:23

k8s-ci-robot removed the needs-rebase label Nov 5, 2020

sttts reviewed Nov 5, 2020

p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch 2 times, most recently from 3be9663 to 9001d67 Compare November 5, 2020 14:22

p0lyn0mial force-pushed the upstream-supress-err-conn-killed branch from 9001d67 to 057986e Compare November 9, 2020 08:24

deads2k reviewed Nov 10, 2020

k8s-ci-robot assigned deads2k Nov 10, 2020

k8s-ci-robot added the lgtm label Nov 10, 2020

deads2k added this to the v1.20 milestone Nov 10, 2020

deads2k mentioned this pull request Nov 10, 2020

add timeout message in addition to metric #96424

Merged

k8s-ci-robot merged commit 40ef0ad into kubernetes:master Nov 10, 2020

github-actions bot mentioned this pull request Nov 18, 2020

Week Ending November 15, 2020 dev-obs/actus#273

Open

openshift-ci-robot mentioned this pull request Jul 8, 2021

[WIP] [release-4.6] Rebase onto v1.19.12 openshift/kubernetes#850

Closed

openshift-ci-robot mentioned this pull request Sep 16, 2021

[release-4.6] Bug 2008266: Rebase 1.19.14 openshift/kubernetes#962

Merged
