Conversation

@adleong adleong (Member) commented Aug 18, 2017

After a classification timeout in ClassifiedRetryFilter, an exception is
raised on the classification future to cancel it. This has the effect that
any frames that had been read by the classification future but not yet released
when the future was cancelled would get dropped and remain un-released forever.
This would cause the flow control window to slowly fill up and eventually
cause frames to stop being sent.

We no longer use Future.raise; instead, we implement our own timeout logic
where, even when a timeout exception is returned, processing continues on the
stream and ensures that all frames are released.

Closes #1613.
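
For reference, a minimal sketch of the shape of this timeout logic (the names Frame, frameF, readWithTimeout, and classificationTimeoutException below are illustrative stand-ins, not necessarily the ones used in the filter):

```scala
import com.twitter.util.{Future, Time, Timer}

// Stand-in for the h2 frame type; the real one lives in the h2 package.
trait Frame { def release(): Future[Unit] }

def readWithTimeout(
  frameF: Future[Option[Frame]],
  deadline: Time,
  classificationTimeoutException: Exception
)(implicit timer: Timer): Future[Option[Frame]] =
  // Race the read against the deadline, but never interrupt the read itself.
  Future.selectIndex(IndexedSeq(Future.sleep(deadline - Time.now), frameF)).flatMap {
    case 0 =>
      // The deadline fired first: surface the timeout, but still release the
      // frame (returning its flow-control credit) whenever the read completes.
      frameF.onSuccess { f => f.foreach(_.release()); () }
      Future.exception(classificationTimeoutException)
    case _ =>
      // The frame arrived before the deadline; classification can continue.
      frameF
  }
```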

@adleong adleong self-assigned this Aug 18, 2017
@adleong adleong requested review from hawkw, olix0r and pcalcado August 18, 2017 22:13
@hawkw hawkw (Contributor) left a comment

⭐️ Looks good to me, thank you for fixing this and for all the comments you've added.

// Cancelling the read or discarding the result of the read could result in reading a frame
// but never releasing it. If we initiate a read, we must always capture the resulting
// frame and ensure it is released.
Future.selectIndex(IndexedSeq(Future.sleep(deadline - now), frameF)).flatMap {
@hawkw hawkw (Contributor)

I feel like this is the only thing in this PR that isn't immediately clear to me, but I'm not sure if I can think of any clearer way to express it.

@adleong adleong (Member, Author)

Future.selectIndex runs both Futures in parallel and returns the index of the one that completes first. The first Future is scheduled to return at the deadline. So if we hit the deadline first, we go to the first case, and if we read the final frame before hitting the deadline, we go to the second case. Does that help?
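
A tiny standalone example of that behaviour, using only twitter-util (the values here are purely illustrative):

```scala
import com.twitter.util.{Await, Duration, Future, JavaTimer}

implicit val timer: JavaTimer = new JavaTimer()

// Two futures racing: a timer that fires after one second, and a "read"
// that is already satisfied. selectIndex yields the index of the winner.
val timeout = Future.sleep(Duration.fromSeconds(1))
val read    = Future.value("frame")
val idx     = Await.result(Future.selectIndex(IndexedSeq(timeout, read)))
// idx == 1 here: the read won the race, so we take the "classify" case.
// Had the sleep completed first, idx would be 0: the timeout case.
timer.stop()
```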

@hawkw hawkw (Contributor) Aug 18, 2017

Yeah, I was able to figure out what was going on here after looking at it for a minute, I just meant that the control flow with selectIndex is a little hard to follow at a glance?

@hawkw hawkw (Contributor) Aug 18, 2017

How would you feel about something like

val timeoutF =
  // Classification timeout. Immediately return a classificationTimeoutException but
  // ensure the final frame is eventually released.
  Future.sleep(deadline - now).before {
    frameF.onSuccess { f => f.foreach(_.release()); () }
    Future.exception(classificationTimeoutException)
  }
timeoutF.or(frameF)

take it or leave it, of course.

@klingerf (Contributor)

I talked this over with @adleong, and it looks like gRPC streaming is suffering from a memory leak with this change, either as a result of the change itself or due to a different issue that didn't manifest until we fixed the flow control issue. I think we should track that down before merging.

@klingerf (Contributor)

Ok, I merged this branch with latest master now that the Kubernetes API refactor branch has been merged, and the memory leak is still happening. I grabbed a heap dump, and it contains one leak suspect, as follows:

[Screenshot: heap dump leak suspect, 2017-08-25]

Looks like one or more unbounded AsyncQueue objects. Am happy to share the heap dump if folks are interested in looking at it.
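
For anyone unfamiliar with the class: com.twitter.concurrent.AsyncQueue buffers every offer until something polls it, so a stream that keeps offering frames with no consumer retains all of them. A toy sketch of that shape (not the actual linkerd code):

```scala
import com.twitter.concurrent.AsyncQueue

// Toy illustration of the leak shape: an AsyncQueue with no consumer.
val q = new AsyncQueue[Array[Byte]]()

// If frames keep arriving but nothing ever polls the queue (say, because
// the reader has given up on the stream), every offer is buffered and
// retained, so the queue grows without bound.
(1 to 1000).foreach { _ => q.offer(new Array[Byte](16 * 1024)) }
println(q.size) // 1000 buffered entries that nothing will ever release
```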

@hawkw hawkw (Contributor) commented Aug 25, 2017

This is great, thanks @klingerf – I'll look into this in a moment.

@hawkw hawkw (Contributor) commented Aug 25, 2017

@klingerf just to confirm, do we know for sure that this leak did not exist before 9deb5f7?

@klingerf (Contributor)

@hawkw Yeah, as far as I can tell, this is a new leak introduced with this branch. It's a bit hard to be 100% sure, however, since without the fix from this branch, linkerd just stops processing all streaming requests after a short period of time.

@hawkw hawkw mentioned this pull request Aug 28, 2017
@hawkw hawkw (Contributor) commented Aug 31, 2017

@adleong, I've updated your PR summary to note that this fixes #1613 (I also fixed a typo 😛)

@olix0r olix0r (Member) left a comment

This looks like a great stopgap to work around the issue as it relates to buffered streams.

However, I think this solution is incomplete, as cancellations may theoretically be caused in other code paths. I think linkerd is probably okay for now, as I can't imagine how we'd cancel a frame read otherwise, but it seems worth noting somewhere (and probably opening an issue) to track a more fundamental fix that intercepts a canceled frame read and either restores it to the queue or returns the frame anyway or... whatever.
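
Purely as a sketch of one possible shape for that interception (illustrative names only, and it simply releases the frame on arrival rather than restoring it to the queue):

```scala
import com.twitter.util.{Future, Promise, Throw}

// Stand-in for the h2 frame type.
trait Frame { def release(): Future[Unit] }

// Wrap a frame read so that interrupting the returned future never strands
// the frame: if the caller cancels before the read completes, the frame is
// released as soon as it arrives instead of leaking flow-control credit.
def interceptCancel(read: Future[Frame]): Future[Frame] = {
  val p = new Promise[Frame]
  p.setInterruptHandler {
    case cause =>
      if (p.updateIfEmpty(Throw(cause))) {
        // The caller gave up first: keep listening to the underlying read
        // and release whatever it eventually produces.
        read.onSuccess { f => f.release(); () }
      }
  }
  // Otherwise, complete the caller-facing future from the read as usual.
  read.respond { r => p.updateIfEmpty(r); () }
  p
}
```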

Thanks for doing so much digging to fix this!

@adleong adleong merged commit ea506c6 into master Aug 31, 2017
@hawkw hawkw mentioned this pull request Aug 31, 2017
hawkw added a commit that referenced this pull request Aug 31, 2017
+ Kubernetes
  - Update default backend behavior and strip host ports (#1607)
+ H2
  - Ensure frames are released after classification timeout (#1598)
  - Fix memory leak in long-running H2 streams (#1613)
pcalcado pushed a commit that referenced this pull request Sep 6, 2017
After a classification timeout in ClassifiedRetryFilter, an exception is
raised on the classification future to cancel it.  This has the effect that
any frames that had been read by the classification future but not yet released
when the future was cancelled would get dropped and remain un-released forever.
This would cause the flow control window to slowly fill up and eventually
cause frames to stop being sent.

Instead, we no longer use Future.raise and instead implement our own timeout
logic where, even in the case that a timeout exception is returned, processing
continues on the stream and ensures that all frames are released.
pcalcado pushed a commit that referenced this pull request Sep 6, 2017
+ Kubernetes
  - Update default backend behavior and strip host ports (#1607)
+ H2
  - Ensure frames are released after classification timeout (#1598)
  - Fix memory leak in long-running H2 streams (#1613)
@hawkw hawkw mentioned this pull request Sep 7, 2017
hawkw added a commit that referenced this pull request Sep 7, 2017
## 1.2.0 2017-09-07

* **Breaking Change**: `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd
  admin, and namerd admin now serve on 127.0.0.1 by default (instead of
  0.0.0.0).
* **Breaking Change**: Removed support for PKCS#1-formatted keys. PKCS#1 formatted keys must be converted to PKCS#8 format.
* Added experimental `io.l5d.dnssrv` namer for DNS SRV records (#1611)
* Kubernetes
  * Added an experimental `io.l5d.k8s.configMap` interpreter for reading dtabs from a Kubernetes ConfigMap (#1603). This interpreter will respond to changes in the ConfigMap, allowing for dynamic dtab updates without the need to run Namerd.
  * Made ingress controller's ingress class annotation configurable (#1584).
  * Fixed an issue where Linkerd would continue routing traffic to endpoints of a service after that service was removed (#1622).
  * Major refactoring and performance improvements to `io.l5d.k8s` and `io.l5d.k8s.ns` namers (#1603).
  * Ingress controller now checks all available ingress resources before using a default backend (#1607).
  * Ingress controller now correctly routes requests with host headers that contain ports (#1607).
* HTTP/2
  * Fixed an issue where long-running H2 streams would eventually hang (#1598).
  * Fixed a memory leak on long-running H2 streams (#1598)
  * Added a user-friendly error message when an HTTP/2 router receives an HTTP/1 request (#1618)
* HTTP/1
  * Removed spurious `ReaderDiscarded` exception logged on HTTP/1 retries (#1609)
* Consul
  * Added support for querying Consul by specific service health states (#1601)
  * Consul namers and Dtab store now fall back to a last known good state on Consul observation errors (#1597)
  * Improved log messages for Consul observation errors (#1597)
* TLS
  * Removed support for PKCS#1 keys (#1590)
  * Added validation to prevent incompatible `disableValidation: true` and `clientAuth` settings in TLS client configurations (#1621)
* Changed `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd
  admin, and namerd admin to serve on 127.0.0.1 by default (instead of
  0.0.0.0) (#1366)
* Deprecated `io.l5d.statsd` telemeter.