Added support for overload manager #4597
Conversation
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4597      +/-   ##
==========================================
+ Coverage   76.08%   76.23%   +0.14%
==========================================
  Files         140      140
  Lines       13073    13147      +74
==========================================
+ Hits         9947    10023      +76
+ Misses       2872     2871       -1
+ Partials      254      253       -1
Force-pushed from cc38cf5 to 828bec9.
I think in cases where the heap is low, getting metrics is definitely less of a big deal than being OOM-killed. It seems like this is about as good a compromise as we're going to be able to do, sadly. It's unfortunate that we have to make the heap size a bootstrap cmdline param, but I don't see any other way to do it. I also think that this feature has to come with a bunch of warnings about being careful with your sizing, making sure that it matches up with any Pod requests and limits you've put on Envoy, and so on. I'll give the PR a more detailed review soon, sorry about the delay @tsaarni.
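To make that sizing concern concrete (hypothetical values, not taken from this PR): the configured maximum heap has to sit comfortably below the container memory limit, otherwise the kernel OOM killer can still fire before the overload actions do.

```yaml
# Hypothetical sizing sketch for the Envoy container: leave headroom between the
# overload manager's maximum heap and the Kubernetes memory limit, since Envoy
# also uses non-heap memory.
resources:
  requests:
    memory: 1Gi
  limits:
    memory: 1Gi      # exceeding this gets the process OOM-killed
# overload manager maximum heap: e.g. ~800Mi, well below the 1Gi limit
```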
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.
youngnick left a comment
I just had a better look at this PR; it looks pretty reasonable, but it seems like it's missing the documentation page? Did you intend to include that here @tsaarni?
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.
I will come back with some documentation shortly.
Signed-off-by: Tero Saarni <[email protected]>
Force-pushed from 828bec9 to 99b64f5.
Rebased and added the missing documentation.
Signed-off-by: Tero Saarni <[email protected]>
Force-pushed from e72c570 to 7f27f6d.
Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.
This PR is ready for review.
Sorry for the delay on this @tsaarni, planning to take a look soon!
sunjayBhatia left a comment
LGTM, just one comment on the documentation page!
We can do a follow-up issue to add this new bootstrap flag to the dynamic gateway provisioning method of deploying Contour: https://github.com/projectcontour/contour/blob/main/apis/projectcontour/v1alpha1/contourdeployment.go
Force-pushed from 7f27f6d to 83aec92.
Thanks @sunjayBhatia for the review!
Thanks for spotting this! I went the other direction and changed the bootstrap config to match the documentation, since the values came from the Envoy best practices document. I think I had no real reason to use 90% instead of 95%.
If there are no further questions, I'll merge this tomorrow.
skriss left a comment
Just a couple of tiny things, but this looks good to me. The issue with the admin endpoint is unfortunate, and maybe we can do some more thinking there about whether there's something else we can do, but I don't think it needs to block getting the initial PR in; it seems like a net improvement.
Name: "envoy.resource_monitors.fixed_heap",
TriggerOneof: &envoy_config_overload_v3.Trigger_Threshold{
    Threshold: &envoy_config_overload_v3.ThresholdTrigger{
        Value: 0.95,
I could see folks wanting to tune these thresholds via flag, but I'm fine leaving them statically defined for now since we can always add flags later if/when needed. So no action needed, just thinking out loud.
True, these could be exposed in the future, though I guess it means a few more command line switches again...
Signed-off-by: Tero Saarni <[email protected]>
@skriss Thank you for the review!
I agree. I could not figure out anything that could be done on the Contour side, except removing the proxy to the admin API, but that is there for a reason. The overload manager cannot be applied to named listeners only, or the other way around: it cannot be configured to ignore certain listeners...
This change adds minimal support for Envoy's overload manager to avoid cases where the Envoy process is terminated by the out-of-memory killer, which results in traffic disturbances.
This PR proposes that the administrator can (optionally) configure the maximum amount of heap that Envoy is allowed to reserve. It does not allow the overload actions to be added or configured in any way. Instead, it configures default actions that are set according to the example in the "Configuring Envoy as an edge proxy" best practices doc: the `shrink_heap` action is executed when 95% and the `stop_accepting_requests` action when 98% of the configured maximum heap is reached (see the illustrative bootstrap sketch below).

The configuration of the maximum heap is (unfortunately) again a new command line flag. This is the same for all other bootstrap parameters so far as well. The reasoning is that `contour bootstrap` is executed inside the Envoy init container, where we do not have the Contour config file or the capability to read the `ContourConfiguration` CR from the API server. Or at least we have not done this so far.

There is a major conflict between the overload manager and how we expose `/ready` and `/stats` by setting up a proxy to serve these endpoints! While the real admin API at `/admin/admin.sock` still works during overload, requests to the "proxied versions" of `/ready` and `/stats` served via the TCP socket will be rejected while `stop_accepting_requests` is active. As a result, Envoy will be removed from the endpoints of the service. Since the Envoy instance was not accepting new requests anyway, maybe this will not make the overload any worse for the other Envoys. However, another side effect is that the administrator cannot monitor the Envoy instance anymore, since the stats endpoint will not be served either. Especially the memory-related metrics would be of interest, since they would show that the `stop_accepting_requests` action is active and the heap numbers explaining why. When the admin sets the max heap too low, they will not be able to find that out by checking metrics, since metrics are not served due to the heap being low 🤔

The feature itself seems very useful as it can avoid the OOM killer, but I'd like to hear your opinion about the limitations.
Fixes #1794
Signed-off-by: Tero Saarni [email protected]
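For illustration, this is roughly what an overload manager section following the edge proxy best practices example looks like in an Envoy bootstrap; the exact output generated by this PR may differ, and the `max_heap_size_bytes` value below is just a placeholder for whatever maximum heap the administrator configures:

```yaml
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648   # placeholder: administrator-configured maximum heap (2 GiB)
  actions:
    - name: "envoy.overload_actions.shrink_heap"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95   # shrink_heap at 95% of the configured maximum heap
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.98   # stop_accepting_requests at 98%
```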
As a workaround, I found the following commands helpful to access the "real" admin API when the "proxied" admin API endpoints are rejecting requests due to overload:
These need to be executed on the worker node, or just on the dev host when running Kind.
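For illustration only, a minimal sketch of one alternative way to reach the Unix-socket admin API without going via the worker node. It assumes the `/admin/admin.sock` socket path mentioned above, the default `projectcontour` namespace and `envoy` daemonset/container names, and a `curl` binary being available inside the Envoy container image:

```sh
# Hypothetical sketch: query Envoy's real admin API over its Unix domain socket,
# bypassing the TCP-proxied /ready and /stats listeners that reject requests
# while stop_accepting_requests is active.
kubectl -n projectcontour exec ds/envoy -c envoy -- \
  curl -s --unix-socket /admin/admin.sock http://localhost/memory
```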
Envoy's `fixed_heap` monitor uses the tcmalloc metrics at `/memory` and the following formula to calculate the overload percentage:

(heap_size - pageheap_unmapped) / maximum_heap
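For example (illustrative numbers): with `maximum_heap` set to 2 GiB, a `heap_size` of 1.9 GiB and `pageheap_unmapped` of 0.1 GiB give (1.9 − 0.1) / 2.0 = 0.90, which is still below the 0.95 `shrink_heap` threshold.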