Coder pods running out of memory #14881

Open · mambon2 opened this issue Sep 30, 2024 · 82 comments · Fixed by #15709

Labels: bug risk (Prone to bugs), observability (Issues related to observability: metrics, dashboards, alerts, opentelemetry), s2 (Broken use cases or features, with a workaround; only humans may set this)

Comments


mambon2 commented Sep 30, 2024

We've noticed over the past several months that the Coder pods gradually run out of memory, typically around the 30-day mark, though we can't say for sure. Here is our memory allocation for a 5-replica set in Kubernetes.

    resources:
      limits:
        cpu: 4000m
        memory: 8Gi
      requests:
        cpu: 4000m
        memory: 8Gi

and here is a Datadog screenshot of the memory usage around the 30-day mark

Screenshot 2024-09-23 at 7 15 33 PM

Coder v2.13.4
We don't keep the k8s logs, so we're not able to share those.

coder-labeler bot added the bug and observability labels Sep 30, 2024
dannykopping (Contributor) commented:

Hi @mambon2,

Can you please clarify a few things for me?

  1. What metric is being used here to calculate memory usage?
  2. Do you have the GOGC or GOMEMLIMIT environment variables set?
  3. There appear to be some marked jumps in memory usage in some of these processes; does this correlate to any events on your side?
  4. Do your pods OOM once they reach 100%? If so, do you experience any downtime?

It would be helpful if you could share the Prometheus metrics of one of these processes with high memory usage as well as a pprof heap profile; that'll help us understand the problem more concretely.
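
(For reference on question 2: if you do want to experiment with GOGC / GOMEMLIMIT, setting them on a Kubernetes deployment could look roughly like the sketch below; the deployment and namespace names are placeholders, and the values are purely illustrative, not a recommendation.)

    # Hypothetical example: give the Go runtime a soft memory limit a bit below
    # the 8Gi container limit, and make the default GC target explicit.
    kubectl set env deployment/coder -n coder GOMEMLIMIT=7GiB GOGC=100

    # Confirm the variables are visible to the coder process.
    kubectl exec -n coder <coder-pod> -- env | grep -E 'GOGC|GOMEMLIMIT'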

spikecurtis (Contributor) commented:

Also, can you look inside the pod and determine which processes are using memory? Coder server runs provisioners by default, which start additional processes as part of provisioning. It would be useful to understand whether it is the main server process whose memory use is increasing.


mambon2 commented Oct 8, 2024

@dannykopping

  1. The metric is kubernetes.memory.usage_pct.
  2. No, we don't have either one set.
  3. We have not seen any pattern; the memory increases appear to be incremental.
  4. Yes, all pods will be evicted when they reach their memory thresholds. However, that doesn't always happen; sometimes they stay in a kind of unusable state, and we're not sure why.
  5. We'll enable the Prometheus env var during our next upgrade.
  6. To get the pprof heap profile, do I just need to shell into the coder pod and run it from the command line?

dannykopping (Contributor) commented:

Thanks for the detail @mambon2

Regarding pprof, you'll need to enable CODER_PPROF_ENABLE and then you can navigate to the address specified in CODER_PPROF_ADDRESS (e.g. http://localhost:6060/debug/pprof/) to download the heap & goroutine profiles.
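
A rough sketch of that flow on Kubernetes (placeholder pod/namespace names; this assumes the default CODER_PPROF_ADDRESS of 127.0.0.1:6060 and that kubectl port-forward can reach it):

    # Enable pprof on the deployment; the pods will roll to pick up the change.
    kubectl set env deployment/coder -n coder CODER_PPROF_ENABLE=true

    # Forward the pprof port from one pod and download the profiles locally.
    kubectl port-forward -n coder <coder-pod> 6060:6060 &
    curl -o heap.out 'http://localhost:6060/debug/pprof/heap'
    curl -o goroutine.out 'http://localhost:6060/debug/pprof/goroutine?debug=2'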

matifali removed the bug label Oct 14, 2024

mambon2 commented Oct 21, 2024

I was not able to set the pprof env var yet, but I do have some metrics from the past 7 days showing what the memory looks like for each of the 5 pods.

Image
Image
Image
Image
Image


mambon2 commented Oct 21, 2024

As you can see from the graphs above, the top 2 did not have any spikes but are now trending upward, around 80%. The other 3 all have memory spikes exceeding 100%.


mambon2 commented Oct 21, 2024

> Thanks for the detail @mambon2
>
> Regarding pprof, you'll need to enable CODER_PPROF_ENABLE and then you can navigate to the address specified in CODER_PPROF_ADDRESS (e.g. http://localhost:6060/debug/pprof/) to download the heap & goroutine profiles.

@dannykopping do I use localhost or external DNS for this?


mambon2 commented Oct 21, 2024

Unfortunately, we don't expose the individual pod IP addresses, so we can't navigate there in a browser. Can we download the heap and goroutine profiles using kubectl exec instead?

I tried using the --output parameter, but it seemed to save the file inside the pod.

spikecurtis (Contributor) commented:

If your pod has curl, you should be able to run kubectl exec -n <namespace> <pod> -- curl http://localhost:6060/debug/pprof/heap > heap.out, and similarly for the goroutine profile (http://localhost:6060/debug/pprof/goroutine?debug=2).
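
Spelled out for both profiles, that might look like (placeholder names, assuming curl is present in the image):

    kubectl exec -n <namespace> <pod> -- \
      curl -s http://localhost:6060/debug/pprof/heap > heap.out
    kubectl exec -n <namespace> <pod> -- \
      curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' > goroutine.out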


mambon2 commented Oct 22, 2024

I have enabled pprof now; the pods have restarted and memory has gone back down. When do you want me to collect these? Anytime?

spikecurtis (Contributor) commented:

> I have enabled pprof now; the pods have restarted and memory has gone back down. When do you want me to collect these? Anytime?

It would be most helpful to collect them once memory is anomalously high again.


mambon2 commented Nov 6, 2024

I'm posting here the 6 memory spikes we had last week that went above the 8 GiB limit:
10:40 Thursday Oct 31
11:10 Thursday Oct 31
12:10 Friday Nov 1
12:40 Friday Nov 1
17:20 Friday Nov 1
18:00 Friday Nov 1

Image


mambon2 commented Nov 6, 2024

I'm attaching the 3 metrics in CSV format for Thursday Oct 31 - Friday Nov 1.
The last one (memory usage by container) is the graph above in my previous comment.

go_goroutines-data-2024-11-06 13_10_54.csv
process_resident_memory_bytes-data-2024-11-06 13_11_16.csv
extract-2024-11-06T17_53_51.413Z-Memory Usage by Container.csv

Metrics collection started around 10:48, so we missed the first spike.

spikecurtis (Contributor) commented:

Hmm, I thought the concern here was about the gradual increase in memory use over days, not short-term spikes. Are you still observing the gradual increase?

In terms of the spikes, do they correlate with anything else you can observe? Network I/O, workspace builds, automated jobs involving Coder, etc.? Is there any regular periodicity to them?


mambon2 commented Nov 7, 2024

@spikecurtis Yes, the main concern is the gradual memory increase. However, I asked the Coder team whether I should send these events as well, and Nick Spangler suggested that I should; hence yesterday's post.

To answer your question: no, we have not noticed any activity correlated with any of these spikes.


mambon2 commented Nov 7, 2024

OK, so this gave me an idea to check against the workspaces' memory graph, and it turns out they coincide. Here is the graph of workspace memory usage for Oct 31 - Nov 1.
Image

If this is indeed the case, then you have something to start looking at.


mambon2 commented Nov 7, 2024

Checking the past day, we had 4 workspace spikes last night, but they did not cause any spikes in the coderd pods. Here is the workspace memory graph for the past day, so this doesn't appear to be a consistent correlation.
Image

And this is the graph for the coderd pods over the same period. Ignore the gap; that was due to scheduled maintenance last night.
Image

spikecurtis (Contributor) commented:

A gradual increase in memory over days is likely not working as designed. That is to say, it sounds like a memory leak that we want to chase down, understand, and hopefully fix.

A short-term spike in memory use might be working as designed, or it might not, depending on what caused the spike. If coderd is getting a huge increase in legitimate load (API requests, proxy traffic, etc.), then we reasonably expect its memory use to go up. Have you checked API request rates and network I/O during these spikes?


mambon2 commented Nov 15, 2024

Here is the memory graph for the past month; the pattern is always trending upward, with intermittent spikes exceeding the 8 GiB limit.
Image


mambon2 commented Nov 15, 2024

Here is where we stand for the past 12 hours: 2 pods are high at 88% memory used. Let me know what other metrics, if any, you'd like to see for these 2 pods.

Image


mambon2 commented Nov 15, 2024

I'm attaching the goroutine and heap pprof output for one of the 2 pods here. I had to add a .txt extension to the files so I could attach them; rename them when you download.

goroutine.out.txt
heap.out.txt

dannykopping (Contributor) commented:

Thanks for the above, @mambon2.

These profiles don't seem to indicate any real problems to me, unfortunately.

Viewing the heap profile with go tool pprof -http=: heap.out:

Image

This only shows 90 MiB of data in use (i.e. allocated and not released).

Similarly for the goroutine profile:

Image

I'm only seeing ~2K goroutines, which is much lower than what I would expect given the memory usage you're demonstrating here.
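
(If it's easier to share text output next time, roughly the same summaries can be pulled on the command line; this sketch assumes the files are saved as heap.out and goroutine.out, with the goroutine profile taken at debug=2.)

    # Top in-use heap allocations, as plain text.
    go tool pprof -top -inuse_space heap.out

    # Count goroutines in the debug=2 text dump.
    grep -c '^goroutine ' goroutine.out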


Even looking at your metrics shared in #14881 (comment), the highest I can see the memory getting to is ~970 MiB at 2024-11-01 13:48:00, which is around 25% of the requests based on what you mentioned in #14881 (comment).


Were these profiles/metrics collected during a peak? Based on the above data, I'm not seeing much evidence of excessive memory use.

Are you able to pull all the Prometheus metrics for these pods from their peak and send them to us? Since we cannot see the queries behind these Datadog dashboards, it's pretty difficult for us to understand precisely what we're looking at.


mambon2 commented Nov 18, 2024

These were NOT collected during one of the peaks. I took them at the time I posted them here, and only for 1 of the 2 high-memory pods.

We are collecting the Prometheus metrics, but I thought you wanted to see them now, while the memory was averaging high, vs. during the memory spikes. Which metrics would you like to see, and are you interested in the latest data or in the memory spikes?

dannykopping (Contributor) commented:

@mambon2 to clarify:

In this message you said 2 pods were running high at 88% of their available memory. Are the profiles provided in this message related to one of those pods in this high memory state?


mambon2 commented Nov 18, 2024

> @mambon2 to clarify:
>
> In this message you said 2 pods were running high at 88% of their available memory. Are the profiles provided in this message related to one of those pods in this high memory state?

Yes

dannykopping (Contributor) commented:

OK. Is it currently in a similar state? If you could send us a full set of metrics collected from that pod, that'd be helpful.


mambon2 commented Nov 18, 2024

So the only metrics we're currently collecting are the ones Nick Spangler commented on earlier, except that the 3rd one comes from Datadog EKS metrics, not Prometheus. Are there additional ones you'd like to see, and for what span of time?

go_goroutines - a drastic increase in goroutines can use up a lot of memory
process_resident_memory_bytes - the resident size of memory in the process
container_memory_usage_bytes - the total memory used per container


mambon2 commented Dec 11, 2024

Thanks @dannykopping, please keep us posted.

stirby pushed a commit that referenced this issue Dec 11, 2024
fixes #14881

Our handlers for streaming logs don't read from the websocket. We don't allow the client to send us any data, but the websocket library we use requires reading from the websocket to properly handle pings and closing. Not doing so [can cause the websocket to hang on write](coder/websocket#405), leaking goroutines, which were noticed in #14881.

This fixes the issue, and in the process refactors our log streaming to an encoder/decoder package which provides generic types for sending JSON over websocket.

I'd also like for us to upgrade to the latest https://github.com/coder/websocket but we should also upgrade our tailscale fork before doing so to avoid including two copies of the websocket library.

(cherry picked from commit 148a5a3)

mambon2 commented Dec 12, 2024

@spikecurtis please keep this open, thanks.

spikecurtis reopened this Dec 16, 2024

mambon2 commented Dec 16, 2024

@dannykopping do you have any updates on the stable release?

stirby (Collaborator) commented Dec 16, 2024

@mambon2 #15709 was added to the latest stable release: v2.17.3.


mambon2 commented Dec 20, 2024

We deployed v2.17.3 last night. We'll keep an eye on memory, thanks.

dannykopping (Contributor) commented:

Thanks for the update @mambon2 👍
Appreciate your collaboration on this!


mambon2 commented Dec 23, 2024

No problem. Traffic will slow down for the remainder of 2024, but we'll close the loop in the 2nd week of January.

matifali removed the needs-triage label Jan 2, 2025

mambon2 commented Jan 6, 2025

Good morning. Here is the 3-week graph of memory usage since the upgrade on Dec 20th. I do see a general upward trend, but memory remains around 50%, with occasional spikes above 100%.

Image

dannykopping (Contributor) commented:

Thanks for the update @mambon2.
Is the traffic to / usage of your service similar to when the problem originally manifested, or is it still ramping back up after the holidays?


mambon2 commented Jan 6, 2025

Good question. No, the traffic has been lower during the holiday season. It will ramp back to normal this week.


mambon2 commented Jan 16, 2025

Well, memory still seems to be climbing. Here are the numbers this morning.

Image


mambon2 commented Jan 16, 2025

Here is a 1-month graph of one of the pods. Outside of the 2 spikes, it does not appear to give back memory.

Image

spikecurtis (Contributor) commented:

Can you capture goroutine and heap profiles from one of the pods that has persistently high memory?


mambon2 commented Jan 17, 2025

@spikecurtis Sorry, I had to restart the pods last night. Can I do this anytime? There doesn't seem to be any one event that causes this; rather, memory just increases over time without being released.

spikecurtis (Contributor) commented:

It appears to be a slow build up of memory over time, so we need the profiles from when the pod is using a lot of memory. It won't be helpful to see the profiles right after a restart.


mambon2 commented Jan 21, 2025

In that case, we'll have to wait until this builds up again. However, if you look at the graph, there are clear instances where the memory gets bumped up and doesn't go back down. Should we be looking at these events? They happen throughout the span of the graph, and I could probably provide metrics for those events.


mambon2 commented Jan 21, 2025

Here is a clear example of this from the past week:
Image


mambon2 commented Feb 7, 2025

Good morning. We noticed today that 2 pods hit over 90% memory, and I collected the pprof profiles before restarting them. Please take a look.

coder-64f8cdcf6b-rr8kz-goroutine.txt
coder-64f8cdcf6b-rr8kz-heap.txt

coder-64f8cdcf6b-vz84v-goroutine.txt
coder-64f8cdcf6b-vz84v-heap.txt


mambon2 commented Feb 7, 2025

Here is the 3-day graph for one of them (coder-64f8cdcf6b-rr8kz).
The last 2 spikes both happened at 3:40 AM EST. No correlations were found in Datadog or anywhere else.

Image

spikecurtis (Contributor) commented:

I'm looking at this now.

The profiles each show about 60 MB of in-use heap memory and approximately 1000 goroutines, neither of which strikes me as particularly high. Each Coder replica was talking to about 25 workspaces. Does that sound right to you? With 5 replicas, that would be roughly 125 active workspaces.

If that 60 MB heap size is accurate, I'm somewhat at a loss to explain where 90% (of 8 GiB) is going. Can you gather some metrics for one of the pods you sent the profiles for, covering the time period when those profiles were generated, and let me know the approximate time of the profile capture?

process_resident_memory_bytes
process_virtual_memory_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_sys_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
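
If the Prometheus endpoint is enabled (CODER_PROMETHEUS_ENABLE), one way to grab a point-in-time snapshot of just these series would be something like the sketch below; the pod/namespace names are placeholders and it assumes the default CODER_PROMETHEUS_ADDRESS of 127.0.0.1:2112.

    kubectl exec -n <namespace> <pod> -- curl -s http://localhost:2112/metrics \
      | grep -E '^(process_(resident|virtual)_memory_bytes|go_memstats_(heap_(alloc|idle|inuse|sys)|stack_(inuse|sys)|sys)_bytes)'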


mambon2 commented Feb 10, 2025

Curtis, I collected the pprof profiles around the time I sent them to you, right before I restarted those 2 pods; that was around 8:36 AM. 25 workspaces sounds possible, although at this time in the morning we only have about 70 workspaces running in total, and another 100 stopped. I took a look at the other metrics and did not see any correlations. What time period on Feb 7 would you like to see?


mambon2 commented Feb 10, 2025

I'm also attaching the memory usage for the same pod during that same window of time. I would reiterate that we need to focus on the events where memory goes up and does not get released. It seems fairly obvious from the graph that this is what is causing the issue.

extract-2025-02-10T14_26_53.501Z-Memory Usage by Pod.csv

Image


mambon2 commented Feb 10, 2025

The other 3 pods are all in the 80s today and will probably need to be restarted soon. Here is the 7-day graph of one of them, to serve as an example of the memory step-up events I'm referring to.

Image


mambon2 commented Feb 10, 2025

Here is the same graph from one of the new pods started last week. As you can see, it jumped from 9% to 26% today without releasing the memory.

Image
