Added support for overload manager #4597

tsaarni · 2022-06-28T13:58:55Z

This change adds minimal support for Envoy's overload manager to avoid cases where the Envoy process is terminated by the out-of-memory killer, which results in traffic disturbances.

This PR proposes that the administrator can (optionally) configure the maximum amount of heap that Envoy is allowed to reserve. It does not allow the overload actions to be added or configured in any way. Instead, it configures default actions, set according to the example in the "Configuring Envoy as an edge proxy" best practices doc: the shrink_heap action is executed when 95% of the configured maximum heap is reached, and the stop_accepting_requests action when 98% is reached.
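To illustrate how the two default thresholds relate, here is a minimal Go sketch, not the code in this PR. The function name `activeActions` is hypothetical, the action names follow Envoy's `envoy.overload_actions.*` naming, and treating "at or above the threshold" as the trigger condition is an assumption about how Envoy evaluates threshold triggers.

```go
package main

import "fmt"

// Thresholds as stated in the PR description (fractions of the configured maximum heap).
const (
	shrinkHeapThreshold            = 0.95
	stopAcceptingRequestsThreshold = 0.98
)

// activeActions is a hypothetical helper that returns the overload actions
// expected to be active at the given heap pressure (0.0 to 1.0).
// Assumption: an action fires when pressure is at or above its threshold.
func activeActions(pressure float64) []string {
	var actions []string
	if pressure >= shrinkHeapThreshold {
		actions = append(actions, "envoy.overload_actions.shrink_heap")
	}
	if pressure >= stopAcceptingRequestsThreshold {
		actions = append(actions, "envoy.overload_actions.stop_accepting_requests")
	}
	return actions
}

func main() {
	// Illustrative pressure values only.
	for _, p := range []float64{0.90, 0.96, 0.99} {
		fmt.Printf("pressure %.2f -> %v\n", p, activeActions(p))
	}
}
```

Because the shrink_heap threshold is lower, Envoy gets a chance to release memory before it starts rejecting requests.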

The maximum heap is (unfortunately) configured with yet another new command line flag, as is the case for all other bootstrap parameters so far. The reasoning is that contour bootstrap is executed inside the Envoy init container, where we do not have the Contour config file or the capability to read the ContourConfiguration CR from the API server. Or at least we have not done this so far.

There is a major conflict between the overload manager and how we expose /ready and /stats by setting up a proxy to serve these endpoints!

While the real admin API at /admin/admin.sock still works during overload, requests to the "proxied versions" of /ready and /stats, served via a TCP socket, will be rejected when stop_accepting_requests is active. As a result, Envoy will be removed from the endpoints of the service. Since the Envoy instance was not accepting new requests anyway, maybe this will not make the overload any worse for the other Envoys. However, another side effect is that the administrator cannot monitor the Envoy instance anymore, since the stats endpoint will not be served either. The memory-related metrics would be of particular interest, since they show that the stop_accepting_requests action is active, along with the heap numbers that explain why. When an admin sets max heap too low, they will not be able to find that out by checking metrics, since metrics are not served when available heap is low 🤔

The feature itself seems very useful as it can avoid the OOM killer but I'd like to hear your opinion about the limitations.

Fixes #1794

Signed-off-by: Tero Saarni [email protected]

As a workaround, I found the following commands helpful for accessing the "real" admin API when the "proxied" admin API endpoints are rejecting requests due to overload:

sudo curl --silent --unix-socket /proc/$(pidof envoy)/root/admin/admin.sock http://localhost/stats | grep -E "^overload|^server.memory"
sudo curl --silent --unix-socket /proc/$(pidof envoy)/root/admin/admin.sock http://localhost/memory # tcmalloc metrics

These need to be executed on the worker node, or just on the dev host when running Kind.
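The same workaround can be sketched in Go, for cases where curl is not available on the node. This is only an illustration: the helper names `newUnixClient` and `filterStats` are made up for this example, and the socket path must be passed as an argument (for instance the /proc/<envoy-pid>/root/admin/admin.sock path used in the commands above). It needs the same privileges as the curl commands.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"strings"
	"time"
)

// newUnixClient returns an HTTP client that sends every request over the
// given Unix domain socket, ignoring the host in the request URL.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// filterStats keeps only the lines that start with one of the prefixes,
// similar to the grep in the commands above.
func filterStats(body string, prefixes ...string) []string {
	var out []string
	for _, line := range strings.Split(body, "\n") {
		for _, p := range prefixes {
			if strings.HasPrefix(line, p) {
				out = append(out, line)
				break
			}
		}
	}
	return out
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path to Envoy's admin.sock as the first argument")
		return
	}
	resp, err := newUnixClient(os.Args[1]).Get("http://localhost/stats")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	for _, line := range filterStats(string(body), "overload", "server.memory") {
		fmt.Println(line)
	}
}
```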

Envoy's fixed_heap monitor uses the tcmalloc metrics at /memory and the following formula to calculate the overload percentage: (heap_size - pageheap_unmapped) / maximum_heap
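As a worked example of that formula, here is a small Go sketch. The function name `heapPressure` and the input numbers are made up for illustration; the real values would come from the /memory output.

```go
package main

import (
	"errors"
	"fmt"
)

// heapPressure applies the formula quoted above:
// (heap_size - pageheap_unmapped) / maximum_heap.
// All inputs are in bytes. The name and error handling are illustrative only.
func heapPressure(heapSize, pageheapUnmapped, maximumHeap uint64) (float64, error) {
	if maximumHeap == 0 {
		return 0, errors.New("maximum heap must be greater than zero")
	}
	if pageheapUnmapped > heapSize {
		return 0, errors.New("pageheap_unmapped exceeds heap_size")
	}
	return float64(heapSize-pageheapUnmapped) / float64(maximumHeap), nil
}

func main() {
	// Made-up numbers: 1536 MiB heap, 512 MiB unmapped, 2 GiB maximum heap.
	p, err := heapPressure(1536<<20, 512<<20, 2<<30)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("heap pressure: %.2f\n", p)
}
```

With these inputs the pressure is 0.50, well below the 95% threshold at which shrink_heap would run.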

codecov · 2022-06-28T14:04:37Z

Codecov Report

Merging #4597 (f10ff6b) into main (d8553a8) will increase coverage by 0.14%.
The diff coverage is 97.50%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4597      +/-   ##
==========================================
+ Coverage   76.08%   76.23%   +0.14%     
==========================================
  Files         140      140              
  Lines       13073    13147      +74     
==========================================
+ Hits         9947    10023      +76     
+ Misses       2872     2871       -1     
+ Partials      254      253       -1

Impacted Files	Coverage Δ
cmd/contour/bootstrap.go	`0.00% <0.00%> (ø)`
internal/envoy/bootstrap.go	`55.88% <ø> (ø)`
internal/envoy/v3/bootstrap.go	`94.33% <100.00%> (+0.82%)`	⬆️
internal/dag/dag.go	`95.53% <0.00%> (ø)`
internal/envoy/v3/route.go	`73.95% <0.00%> (+0.21%)`	⬆️
internal/sorter/sorter.go	`98.79% <0.00%> (+0.60%)`	⬆️
internal/dag/httpproxy_processor.go	`92.80% <0.00%> (+0.65%)`	⬆️

youngnick · 2022-07-07T04:43:28Z

I think in cases where the heap is low, getting metrics is definitely less of a big deal than being oomkilled. It seems like this is about as good a compromise as we're going to be able to reach, sadly.

It's unfortunate that we have to make the heap size a bootstrap cmdline param, but I don't see any other way to do it.

I also think that this feature has to come with a bunch of warnings about being careful with your sizing, making sure that it matches up with any Pod requests and limits you've put on Envoy, and so on.

I'll give the PR a more detailed review soon, sorry about the delay @tsaarni.

github-actions · 2022-07-22T00:25:18Z

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

youngnick

I just had a better look at this PR, it looks pretty reasonable, but seems like it's missing the documentation page? Did you intend to include that here @tsaarni?

github-actions · 2022-08-10T00:20:33Z

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

tsaarni · 2022-08-10T04:00:27Z

I will come back with some documentation shortly.

Signed-off-by: Tero Saarni <[email protected]>

tsaarni · 2022-08-18T11:10:51Z

Rebased and added the missing documentation site/content/docs/main/config/overload-manager.md.

Signed-off-by: Tero Saarni <[email protected]>

github-actions · 2022-09-03T00:21:47Z

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

tsaarni · 2022-09-05T12:04:57Z

This PR is ready for review.

skriss · 2022-09-07T21:17:45Z

Sorry for the delay on this @tsaarni, planning to take a look soon!

sunjayBhatia

LGTM, just one comment on the documentation page!

We can do a follow up issue to add this new bootstrap flag to the dynamic gateway provisioning method of deploying Contour: https://github.com/projectcontour/contour/blob/main/apis/projectcontour/v1alpha1/contourdeployment.go

site/content/docs/main/config/overload-manager.md

Signed-off-by: Tero Saarni <[email protected]>

tsaarni · 2022-09-14T05:49:44Z

Thanks @sunjayBhatia for the review!

> this and the below might need an update to match the bootstrap config (looks like 90% and 98% for these two actions)

Thanks for spotting this! I went the other direction and changed the bootstrap config to match the documentation, since the values came from the Envoy best practices document. I think I had no real reason to use 90% instead of 95%.

tsaarni · 2022-09-15T12:14:07Z

If there are no further questions, I'll merge this tomorrow.

skriss

Just a couple tiny things but this looks good to me. The issue with the admin endpoint is unfortunate, and maybe we can do some more thinking there on if there's something else we can do, but I don't think it needs to block getting the initial PR in, seems like a net improvement.

changelogs/unreleased/4597-tsaarni-minor.md

skriss · 2022-09-15T20:10:14Z

internal/envoy/v3/bootstrap.go

+							Name: "envoy.resource_monitors.fixed_heap",
+							TriggerOneof: &envoy_config_overload_v3.Trigger_Threshold{
+								Threshold: &envoy_config_overload_v3.ThresholdTrigger{
+									Value: 0.95,


I could see folks wanting to tune these thresholds via flag, but I'm fine leaving them statically defined for now since we can always add flags later if/when needed. So no action needed, just thinking out loud.

True, these could be exposed in the future, though I guess it means a few more command line switches again...

site/content/docs/main/config/overload-manager.md

Signed-off-by: Tero Saarni <[email protected]>

tsaarni · 2022-09-16T16:00:32Z

@skriss Thank you for the review!

> Just a couple tiny things but this looks good to me. The issue with the admin endpoint is unfortunate, and maybe we can do some more thinking there on if there's something else we can do, but I don't think it needs to block getting the initial PR in, seems like a net improvement.

I agree. I could not figure out anything that could be done on the Contour side, except removing the proxy from the admin API, but that is there for a reason. The overload manager cannot be applied only to named listeners or, the other way around, configured to ignore certain listeners...

tsaarni requested a review from a team as a code owner June 28, 2022 13:58

tsaarni requested review from stevesloka and youngnick and removed request for a team June 28, 2022 13:58

tsaarni added the release-note/minor A minor change that needs about a paragraph of explanation in the release notes. label Jun 28, 2022

tsaarni force-pushed the overload-manager branch from cc38cf5 to 828bec9 Compare June 28, 2022 15:22

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2022

youngnick reviewed Jul 25, 2022

youngnick removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022

skriss assigned tsaarni Jul 26, 2022

skriss added this to the 1.23.0 milestone Jul 26, 2022

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022

tsaarni removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022

sunjayBhatia requested review from skriss and sunjayBhatia August 16, 2022 19:37

Added support for overload manager

cb2b80c

Signed-off-by: Tero Saarni <[email protected]>

tsaarni force-pushed the overload-manager branch from 828bec9 to 99b64f5 Compare August 18, 2022 11:09

Added missing documentation

e72c570

Signed-off-by: Tero Saarni <[email protected]>

tsaarni force-pushed the overload-manager branch 2 times, most recently from e72c570 to 7f27f6d Compare August 19, 2022 17:13

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2022

tsaarni removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2022

sunjayBhatia approved these changes Sep 12, 2022

sunjayBhatia mentioned this pull request Sep 12, 2022

Overload manager max heap flag should be configurable in instances of Contour created by Gateway Provisioner #4716

Closed

tsaarni added 2 commits September 14, 2022 06:59

Merge branch 'main' into overload-manager

62f1307

aligned shrink heap action with the envoy's best practices doc

83aec92

Signed-off-by: Tero Saarni <[email protected]>

tsaarni force-pushed the overload-manager branch from 7f27f6d to 83aec92 Compare September 14, 2022 04:56

skriss requested changes Sep 15, 2022

Corrected the % when stop accepting request, renamed changelog to major

f10ff6b

Signed-off-by: Tero Saarni <[email protected]>

tsaarni added release-note/major A major change that needs more than a paragraph of explanation in the release notes. and removed release-note/minor A minor change that needs about a paragraph of explanation in the release notes. labels Sep 16, 2022

skriss approved these changes Sep 16, 2022

tsaarni merged commit 9ed1dc4 into projectcontour:main Sep 16, 2022
