Ensure frames are released after classification timeout #1598
Conversation
After a classification timeout in ClassifiedRetryFilter, an exception is raised on the classification future to cancel it. As a result, any frames that had been read by the classification future but not yet released when the future was cancelled would be dropped and never released. This would cause the flow control window to slowly fill up and eventually cause frames to stop being sent. Instead of using Future.raise, we now implement our own timeout logic in which, even when a timeout exception is returned, processing continues on the stream and ensures that all frames are released.
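The timeout logic described above can be sketched roughly as follows. This is a minimal sketch using twitter-util Futures; `classifyWithin`, `release`, and the timeout exception are illustrative stand-ins, not linkerd's actual API:

```scala
import com.twitter.util.{Future, Time, Timer}

// Race the frame read against the deadline, but never interrupt the read
// itself, so that the frame can always be captured and released.
def classifyWithin[T](frameF: Future[T], deadline: Time)(
  release: T => Unit
)(implicit timer: Timer): Future[T] =
  Future.selectIndex(IndexedSeq(Future.sleep(deadline - Time.now), frameF)).flatMap {
    case 0 =>
      // Deadline fired first: surface a timeout, but leave a continuation
      // on frameF so the frame is still released when the read completes.
      frameF.onSuccess(release)
      Future.exception(new Exception("classification timeout"))
    case _ =>
      // The read completed before the deadline; use it normally.
      frameF
  }
```

The key difference from `Future.raise` is that the losing future is never interrupted, so its result is never silently dropped.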
⭐️ Looks good to me, thank you for fixing this and for the amount of comments you've added.
// Cancelling the read or discarding the result of the read could result in reading a frame
// but never releasing it. If we initiate a read, we must always capture the resulting
// frame and ensure it is released.
Future.selectIndex(IndexedSeq(Future.sleep(deadline - now), frameF)).flatMap {
I feel like this is the only thing in this PR that isn't immediately clear to me, but I'm not sure if I can think of any clearer way to express it.
Future.selectIndex runs both Futures in parallel and returns the index of the one that completes first. The first Future is scheduled to return at the deadline. So if we hit the deadline first, we go to the first case, and if we read the final frame before hitting the deadline, we go to the second case. Does that help?
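For reference, a tiny illustration of the `selectIndex` semantics described here (twitter-util only, no linkerd code involved):

```scala
import com.twitter.util.{Await, Future, Promise}

// Future.selectIndex returns the index of the first future to complete,
// and does not interrupt the losers -- they keep running and their
// continuations still fire.
val slow = new Promise[Int]   // never completed in this example
val fast = Future.value(42)
val idx  = Await.result(Future.selectIndex(IndexedSeq(slow, fast)))
// idx is 1: `fast` won the race. `slow` is still pending, so any
// onSuccess callbacks attached to it would still run if it completes later.
```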
Yeah, I was able to figure out what was going on here after looking at it for a minute; I just meant that the control flow with selectIndex is a little hard to follow at a glance.
How would you feel about something like
val timeoutF =
  // Classification timeout. Immediately return a classificationTimeoutException but
  // ensure the final frame is eventually released.
  Future.sleep(deadline - now).flatMap { _ =>
    frameF.onSuccess { f => f.foreach(_.release()); () }
    Future.exception(classificationTimeoutException)
  }
timeoutF.or(frameF)
take it or leave it, of course.
I talked this over with @adleong and it looks like with this change gRPC streaming is suffering from a memory leak, either as a result of this change, or due to a different issue that didn't manifest until we fixed the flow control issue. I think we should track that down before merging.
Ok, I merged this branch with latest master following the Kubernetes API refactor branch merging, and the memory leak is still happening. I grabbed a heap dump, and it contains one leak suspect, as follows: Looks like one or more unbounded AsyncQueue objects. Am happy to share the heap dump if folks are interested in looking at it.
This is great, thanks @klingerf – I'll look into this in a moment!
@hawkw Yeah, as far as I can tell, this is a new leak introduced with this branch. It's a bit hard to be 100% sure, however, since without the fix from this branch, linkerd just stops processing all streaming requests after a short period of time.
This looks like a great stopgap to work around the issue as relates to buffered stream.
However, I think this solution is incomplete, as cancellations may theoretically be caused in other code paths. I think linkerd is probably okay for now, as I can't imagine how we'd cancel a frame read otherwise, but it seems worth noting somewhere (and probably opening an issue) to track a more fundamental fix that intercepts a canceled frame read and either restores it to the queue or returns the frame anyway or... whatever.
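One possible shape for that more fundamental fix (purely hypothetical; `read`, `frame.release()`, and `safeRead` here are illustrative stand-ins rather than the actual finagle-h2 types):

```scala
import com.twitter.util.{Future, Promise, Throw}

// Hypothetical sketch: wrap a frame read so that interrupting the caller's
// future does NOT cancel the underlying read. The read runs to completion,
// and if nobody is left to consume the frame, we release it ourselves.
def safeRead(read: () => Future[Frame]): Future[Frame] = {
  val readF = read()
  val p = new Promise[Frame]
  p.setInterruptHandler { case exn =>
    // Fail the caller's future, but let the read complete...
    p.updateIfEmpty(Throw(exn))
    // ...and release the frame that no one will consume.
    readF.onSuccess { frame => frame.release(); () }
  }
  // If no interrupt arrives, the frame flows through to the caller.
  readF.respond { r => p.updateIfEmpty(r); () }
  p
}
```

The idea is simply that cancellation severs the caller from the result without severing the result from cleanup.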
Thanks for doing so much digging to fix this!
## 1.2.0 2017-09-07

* **Breaking Change**: `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd admin, and namerd admin now serve on 127.0.0.1 by default (instead of 0.0.0.0).
* **Breaking Change**: Removed support for PKCS#1-formatted keys. PKCS#1 formatted keys must be converted to PKCS#8 format.
* Added experimental `io.l5d.dnssrv` namer for DNS SRV records (#1611)
* Kubernetes
  * Added an experimental `io.l5d.k8s.configMap` interpreter for reading dtabs from a Kubernetes ConfigMap (#1603). This interpreter will respond to changes in the ConfigMap, allowing for dynamic dtab updates without the need to run Namerd.
  * Made ingress controller's ingress class annotation configurable (#1584).
  * Fixed an issue where Linkerd would continue routing traffic to endpoints of a service after that service was removed (#1622).
  * Major refactoring and performance improvements to `io.l5d.k8s` and `io.l5d.k8s.ns` namers (#1603).
  * Ingress controller now checks all available ingress resources before using a default backend (#1607).
  * Ingress controller now correctly routes requests with host headers that contain ports (#1607).
* HTTP/2
  * Fixed an issue where long-running H2 streams would eventually hang (#1598).
  * Fixed a memory leak on long-running H2 streams (#1598)
  * Added a user-friendly error message when a HTTP/2 router receives a HTTP/1 request (#1618)
* HTTP/1
  * Removed spurious `ReaderDiscarded` exception logged on HTTP/1 retries (#1609)
* Consul
  * Added support for querying Consul by specific service health states (#1601)
  * Consul namers and Dtab store now fall back to a last known good state on Consul observation errors (#1597)
  * Improved log messages for Consul observation errors (#1597)
* TLS
  * Removed support for PKCS#1 keys (#1590)
  * Added validation to prevent incompatible `disableValidation: true` and `clientAuth` settings in TLS client configurations (#1621)
* Changed `io.l5d.mesh`, `io.l5d.thriftNameInterpreter`, linkerd admin, and namerd admin to serve on 127.0.0.1 by default (instead of 0.0.0.0) (#1366)
* Deprecated `io.l5d.statsd` telemeter.
Closes #1613.