Conversation

@tsaarni
Member

@tsaarni tsaarni commented Jun 28, 2022

This change adds minimal support for Envoy's overload manager to avoid cases where the Envoy process is terminated by the out-of-memory killer, which results in traffic disturbances.

This PR proposes that an administrator can (optionally) configure the maximum amount of heap that Envoy is allowed to reserve. It does not allow the overload actions to be added or configured in any way. Instead, it configures default actions set according to the example in the "Configuring Envoy as an edge proxy" best practices doc: the shrink_heap action is executed when 95% of the configured maximum heap is reached, and the stop_accepting_requests action when 98% is reached.
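
To illustrate the two default thresholds, here is a minimal sketch in plain Go. It is not the code in this PR: the function name activeActions and the constants are hypothetical, and it assumes an action triggers once heap pressure is at or above its threshold.

```go
package main

import "fmt"

// Thresholds from the PR description, as fractions of the configured maximum heap.
const (
	shrinkHeapThreshold            = 0.95
	stopAcceptingRequestsThreshold = 0.98
)

// activeActions returns the overload actions that would be triggered at the
// given heap pressure (0.0 to 1.0). Treating the threshold value itself as
// triggering is an assumption of this sketch.
func activeActions(pressure float64) []string {
	var actions []string
	if pressure >= shrinkHeapThreshold {
		actions = append(actions, "envoy.overload_actions.shrink_heap")
	}
	if pressure >= stopAcceptingRequestsThreshold {
		actions = append(actions, "envoy.overload_actions.stop_accepting_requests")
	}
	return actions
}

func main() {
	for _, p := range []float64{0.50, 0.96, 0.99} {
		fmt.Printf("pressure %.2f -> %v\n", p, activeActions(p))
	}
}
```

Between 95% and 98% only shrink_heap is active, so Envoy tries to release memory before it starts rejecting requests.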

The configuration of maximum heap is (unfortunately) yet another command line flag, as with all other bootstrap parameters so far. The reasoning is that contour bootstrap is executed inside the Envoy init container, where we have neither the Contour config file nor the capability to read the ContourConfiguration CR from the API server. Or at least we have not done this so far.

There is a major conflict between the overload manager and how we expose /ready and /stats by setting up a proxy to serve these endpoints!

While the real admin API at /admin/admin.sock still works during overload, requests to the "proxied versions" of /ready and /stats served via the TCP socket will be rejected when stop_accepting_requests is active. As a result, Envoy will be removed from the endpoints of the service. Since the Envoy instance was not accepting new requests anyway, maybe this will not make the overload any worse for the other Envoys. However, another side effect is that the administrator cannot monitor the Envoy instance anymore, since the stats endpoint will not be served either. The memory-related metrics would be of particular interest, since they would show that the stop_accepting_requests action is active, along with the heap numbers explaining why. When an admin sets the max heap too low, they will not be able to find that out by checking metrics, since metrics are not served due to heap being low 🤔

The feature itself seems very useful as it can avoid the OOM killer but I'd like to hear your opinion about the limitations.

Fixes #1794

Signed-off-by: Tero Saarni [email protected]

As a workaround, I found the following commands helpful for accessing the "real" admin API when the "proxied" admin API endpoints are rejecting requests due to overload:

sudo curl --silent --unix-socket /proc/$(pidof envoy)/root/admin/admin.sock http://localhost/stats | grep -E "^overload|^server.memory"
sudo curl --silent --unix-socket /proc/$(pidof envoy)/root/admin/admin.sock http://localhost/memory # tcmalloc metrics

These need to be executed on the worker node, or just on the dev host when running Kind.

Envoy's fixed_heap monitor uses the tcmalloc metrics at /memory and the following formula to calculate the overload percentage: (heap_size - pageheap_unmapped) / maximum_heap
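
As a worked example of that formula, here is a small Go sketch. The type tcmallocStats and the function heapPressure are hypothetical names for illustration, the sample numbers are made up, and the guards against a zero maximum and underflow are additions of the sketch.

```go
package main

import "fmt"

// tcmallocStats holds the two /memory values used by the formula, in bytes.
type tcmallocStats struct {
	HeapSize         uint64
	PageheapUnmapped uint64
}

// heapPressure computes (heap_size - pageheap_unmapped) / maximum_heap.
// It returns 0 when maxHeap is 0 or when the unmapped value exceeds the
// heap size; both guards are assumptions made to keep the sketch safe.
func heapPressure(s tcmallocStats, maxHeap uint64) float64 {
	if maxHeap == 0 || s.PageheapUnmapped > s.HeapSize {
		return 0
	}
	return float64(s.HeapSize-s.PageheapUnmapped) / float64(maxHeap)
}

func main() {
	const mib = 1 << 20
	// Illustrative values only: 1900 MiB heap, 100 MiB unmapped, 2 GiB maximum.
	s := tcmallocStats{HeapSize: 1900 * mib, PageheapUnmapped: 100 * mib}
	fmt.Printf("heap pressure: %.3f\n", heapPressure(s, 2048*mib))
}
```

With these sample values the pressure stays below the 95% threshold, so neither overload action would be active.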

@tsaarni tsaarni requested a review from a team as a code owner June 28, 2022 13:58
@tsaarni tsaarni requested review from stevesloka and youngnick and removed request for a team June 28, 2022 13:58
@tsaarni tsaarni added the release-note/minor A minor change that needs about a paragraph of explanation in the release notes. label Jun 28, 2022
@codecov

codecov bot commented Jun 28, 2022

Codecov Report

Merging #4597 (f10ff6b) into main (d8553a8) will increase coverage by 0.14%.
The diff coverage is 97.50%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4597      +/-   ##
==========================================
+ Coverage   76.08%   76.23%   +0.14%     
==========================================
  Files         140      140              
  Lines       13073    13147      +74     
==========================================
+ Hits         9947    10023      +76     
+ Misses       2872     2871       -1     
+ Partials      254      253       -1     
Impacted Files                         Coverage Δ
cmd/contour/bootstrap.go               0.00% <0.00%> (ø)
internal/envoy/bootstrap.go            55.88% <ø> (ø)
internal/envoy/v3/bootstrap.go         94.33% <100.00%> (+0.82%) ⬆️
internal/dag/dag.go                    95.53% <0.00%> (ø)
internal/envoy/v3/route.go             73.95% <0.00%> (+0.21%) ⬆️
internal/sorter/sorter.go              98.79% <0.00%> (+0.60%) ⬆️
internal/dag/httpproxy_processor.go    92.80% <0.00%> (+0.65%) ⬆️

@tsaarni tsaarni force-pushed the overload-manager branch from cc38cf5 to 828bec9 Compare June 28, 2022 15:22
@youngnick
Member

I think in cases that the heap is low, getting metrics is definitely less of a big deal than being oomkilled. It seems like this is about as good a compromise as we're going to be able to do, sadly.

It's unfortunate that we have to make the heap size a bootstrap cmdline param, but I don't see any other way to do it.

I also think that this feature has to come with a bunch of warnings about being careful with your sizing, making sure that it matches up with any Pod requests and limits you've put on Envoy, and so on.

I'll give the PR a more detailed review soon, sorry about the delay @tsaarni.

@github-actions

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2022
Member

@youngnick youngnick left a comment

I just had a better look at this PR. It looks pretty reasonable, but it seems to be missing the documentation page. Did you intend to include that here @tsaarni?

@youngnick youngnick removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2022
@skriss skriss added this to the 1.23.0 milestone Jul 26, 2022
@github-actions

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@tsaarni tsaarni removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@tsaarni
Member Author

tsaarni commented Aug 10, 2022

I will come back with some documentation shortly.

@tsaarni
Member Author

tsaarni commented Aug 18, 2022

Rebased and added the missing documentation page site/content/docs/main/config/overload-manager.md.

Signed-off-by: Tero Saarni <[email protected]>
@tsaarni tsaarni force-pushed the overload-manager branch 2 times, most recently from e72c570 to 7f27f6d Compare August 19, 2022 17:13
@github-actions

github-actions bot commented Sep 3, 2022

Marking this PR stale since there has been no activity for 14 days. It will be closed if there is no activity for another 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2022
@tsaarni
Member Author

tsaarni commented Sep 5, 2022

This PR is ready for review.

@tsaarni tsaarni removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2022
@skriss
Member

skriss commented Sep 7, 2022

Sorry for the delay on this @tsaarni, planning to take a look soon!

Member

@sunjayBhatia sunjayBhatia left a comment

LGTM, just one comment on the documentation page!

We can do a follow up issue to add this new bootstrap flag to the dynamic gateway provisioning method of deploying Contour: https://github.com/projectcontour/contour/blob/main/apis/projectcontour/v1alpha1/contourdeployment.go

@tsaarni
Member Author

tsaarni commented Sep 14, 2022

Thanks @sunjayBhatia for the review!

this and the below might need an update to match the bootstrap config (looks like 90% and 98% for these two actions)

Thanks for spotting this! I went the other direction and changed the bootstrap config to match the documentation, since the values came from the Envoy best practices document. I think I had no real reason to use 90% instead of 95%.

@tsaarni
Member Author

tsaarni commented Sep 15, 2022

If there are no further questions, I'll merge this tomorrow.

Member

@skriss skriss left a comment

Just a couple tiny things but this looks good to me. The issue with the admin endpoint is unfortunate, and maybe we can do some more thinking there on if there's something else we can do, but I don't think it needs to block getting the initial PR in, seems like a net improvement.

Name: "envoy.resource_monitors.fixed_heap",
TriggerOneof: &envoy_config_overload_v3.Trigger_Threshold{
	Threshold: &envoy_config_overload_v3.ThresholdTrigger{
		Value: 0.95,
Member

@skriss skriss Sep 15, 2022

I could see folks wanting to tune these thresholds via flag, but I'm fine leaving them statically defined for now since we can always add flags later if/when needed. So no action needed, just thinking out loud.

Member Author

True, these could be exposed in the future, though I guess it means a few more command line switches again...

@tsaarni tsaarni added release-note/major A major change that needs more than a paragraph of explanation in the release notes. and removed release-note/minor A minor change that needs about a paragraph of explanation in the release notes. labels Sep 16, 2022
@tsaarni
Member Author

tsaarni commented Sep 16, 2022

@skriss Thank you for the review!

Just a couple tiny things but this looks good to me. The issue with the admin endpoint is unfortunate, and maybe we can do some more thinking there on if there's something else we can do, but I don't think it needs to block getting the initial PR in, seems like a net improvement.

I agree. I could not figure out anything that could be done on the Contour side, except removing the proxy from the admin API, but that is there for a reason. The overload manager cannot be applied only to named listeners, nor, the other way around, can it be configured to ignore certain listeners...

@tsaarni tsaarni merged commit 9ed1dc4 into projectcontour:main Sep 16, 2022

Labels

release-note/major A major change that needs more than a paragraph of explanation in the release notes.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Investigate enabling overload manager for Envoy

4 participants