Ensure frames are released after classification timeout #1598
Conversation
After a classification timeout in ClassifiedRetryFilter, an exception is raised on the classification future to cancel it. As a result, any frames that had been read by the classification future but not yet released when the future was cancelled would be dropped and never released. This would cause the flow control window to slowly fill up and eventually cause frames to stop being sent. Instead of using Future.raise, we now implement our own timeout logic in which, even when a timeout exception is returned, processing continues on the stream and ensures that all frames are released.
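The timeout logic described above can be sketched roughly as follows. This is a minimal sketch using twitter-util Futures; `classifyWithin`, `release`, and the timeout exception are illustrative stand-ins, not linkerd's actual API:

```scala
import com.twitter.util.{Future, Time, Timer}

// Race the frame read against the deadline, but never interrupt the read
// itself, so that the frame can always be captured and released.
def classifyWithin[T](frameF: Future[T], deadline: Time)(
  release: T => Unit
)(implicit timer: Timer): Future[T] =
  Future.selectIndex(IndexedSeq(Future.sleep(deadline - Time.now), frameF)).flatMap {
    case 0 =>
      // Deadline fired first: surface a timeout, but leave a continuation
      // on frameF so the frame is still released when the read completes.
      frameF.onSuccess(release)
      Future.exception(new Exception("classification timeout"))
    case _ =>
      // The read completed before the deadline; use it normally.
      frameF
  }
```

The key difference from `Future.raise` is that the losing future is never interrupted, so its result is never silently dropped.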
⭐️ Looks good to me, thank you for fixing this and for the amount of comments you've added.
// Cancelling the read or discarding the result of the read could result in reading a frame
// but never releasing it. If we initiate a read, we must always capture the resulting
// frame and ensure it is released.
Future.selectIndex(IndexedSeq(Future.sleep(deadline - now), frameF)).flatMap {
I feel like this is the only thing in this PR that isn't immediately clear to me, but I'm not sure if I can think of any clearer way to express it.
Future.selectIndex runs both Futures in parallel and returns the index of the one that completes first. The first Future is scheduled to return at the deadline. So if we hit the deadline first, we go to the first case, and if we read the final frame before hitting the deadline, we go to the second case. Does that help?
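For reference, a tiny illustration of the `selectIndex` semantics described here (twitter-util only, no linkerd code involved):

```scala
import com.twitter.util.{Await, Future, Promise}

// Future.selectIndex returns the index of the first future to complete,
// and does not interrupt the losers -- they keep running and their
// continuations still fire.
val slow = new Promise[Int]   // never completed in this example
val fast = Future.value(42)
val idx  = Await.result(Future.selectIndex(IndexedSeq(slow, fast)))
// idx is 1: `fast` won the race. `slow` is still pending, so any
// onSuccess callbacks attached to it would still run if it completes later.
```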
Yeah, I was able to figure out what was going on here after looking at it for a minute; I just meant that the control flow with selectIndex is a little hard to follow at a glance.
How would you feel about something like
val timeoutF =
  // Classification timeout. Immediately return a classificationTimeoutException but
  // ensure the final frame is eventually released.
  Future.sleep(deadline - now).flatMap { _ =>
    frameF.onSuccess { f => f.foreach(_.release()); () }
    Future.exception(classificationTimeoutException)
  }
timeoutF.or(frameF)
take it or leave it, of course.
I talked this over with @adleong and it looks like with this change gRPC streaming is suffering from a memory leak, either as a result of this change, or due to a different issue that didn't manifest until we fixed the flow control issue. I think we should track that down before merging.
Ok, I merged this branch with latest master following the Kubernetes API refactor branch merging, and the memory leak is still happening. I grabbed a heap dump, and it contains one leak suspect, as follows: Looks like one or more unbounded AsyncQueue objects. Am happy to share the heap dump if folks are interested in looking at it.
This is great, thanks @klingerf – I'll look into this in a moment!
@hawkw Yeah, as far as I can tell, this is a new leak introduced with this branch. It's a bit hard to be 100% sure, however, since without the fix from this branch, linkerd just stops processing all streaming requests after a short period of time.
This looks like a great stopgap to work around the issue as relates to buffered stream.
However, I think this solution is incomplete, as cancellations may theoretically be caused in other code paths. I think linkerd is probably okay for now, as I can't imagine how we'd cancel a frame read otherwise, but it seems worth noting somewhere (and probably opening an issue) to track a more fundamental fix that intercepts a canceled frame read and either restores it to the queue or returns the frame anyway or... whatever.
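One possible shape for that more fundamental fix (purely hypothetical; `read`, `frame.release()`, and `safeRead` here are illustrative stand-ins rather than the actual finagle-h2 types):

```scala
import com.twitter.util.{Future, Promise, Throw}

// Hypothetical sketch: wrap a frame read so that interrupting the caller's
// future does NOT cancel the underlying read. The read runs to completion,
// and if nobody is left to consume the frame, we release it ourselves.
def safeRead(read: () => Future[Frame]): Future[Frame] = {
  val readF = read()
  val p = new Promise[Frame]
  p.setInterruptHandler { case exn =>
    // Fail the caller's future, but let the read complete...
    p.updateIfEmpty(Throw(exn))
    // ...and release the frame that no one will consume.
    readF.onSuccess { frame => frame.release(); () }
  }
  // If no interrupt arrives, the frame flows through to the caller.
  readF.respond { r => p.updateIfEmpty(r); () }
  p
}
```

The idea is simply that cancellation severs the caller from the result without severing the result from cleanup.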
Thanks for doing so much digging to fix this!
## 1.2.0 2017-09-07

* **Breaking Change**: `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd admin, and namerd admin now serve on 127.0.0.1 by default (instead of 0.0.0.0).
* **Breaking Change**: Removed support for PKCS#1-formatted keys. PKCS#1 formatted keys must be converted to PKCS#8 format.
* Added experimental `io.l5d.dnssrv` namer for DNS SRV records (#1611)
* Kubernetes
  * Added an experimental `io.l5d.k8s.configMap` interpreter for reading dtabs from a Kubernetes ConfigMap (#1603). This interpreter will respond to changes in the ConfigMap, allowing for dynamic dtab updates without the need to run Namerd.
  * Made ingress controller's ingress class annotation configurable (#1584).
  * Fixed an issue where Linkerd would continue routing traffic to endpoints of a service after that service was removed (#1622).
  * Major refactoring and performance improvements to `io.l5d.k8s` and `io.l5d.k8s.ns` namers (#1603).
  * Ingress controller now checks all available ingress resources before using a default backend (#1607).
  * Ingress controller now correctly routes requests with host headers that contain ports (#1607).
* HTTP/2
  * Fixed an issue where long-running H2 streams would eventually hang (#1598).
  * Fixed a memory leak on long-running H2 streams (#1598)
  * Added a user-friendly error message when a HTTP/2 router receives a HTTP/1 request (#1618)
* HTTP/1
  * Removed spurious `ReaderDiscarded` exception logged on HTTP/1 retries (#1609)
* Consul
  * Added support for querying Consul by specific service health states (#1601)
  * Consul namers and Dtab store now fall back to a last known good state on Consul observation errors (#1597)
  * Improved log messages for Consul observation errors (#1597)
* TLS
  * Removed support for PKCS#1 keys (#1590)
  * Added validation to prevent incompatible `disableValidation: true` and `clientAuth` settings in TLS client configurations (#1621)
* Changed `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd admin, and namerd admin to serve on 127.0.0.1 by default (instead of 0.0.0.0) (#1366)
* Deprecated `io.l5d.statsd` telemeter.
Closes #1613.