Coder pods running out of memory #14881
Comments
Hi @mambon2, can you please clarify a few things for me?
It would be helpful if you could share the Prometheus metrics of one of these processes with high memory usage, as well as a pprof profile.
Also, can you look inside the pod and determine which processes are using memory? Coder server runs provisioners by default, which start additional processes as part of provisioning. It would be useful to understand whether it is the main server process whose memory use is increasing.
Thanks for the detail @mambon2. Regarding pprof, you'll need to enable it on the Coder server before the profiling endpoints are available.
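Roughly, enabling it looks something like this (untested sketch; the flag and env names here are from memory, so verify them against `coder server --help` for your version):

```sh
# Sketch: enable the pprof endpoint on coderd.
# Flag/env names are assumptions; confirm with `coder server --help`.
coder server --pprof-enable --pprof-address 127.0.0.1:6060

# Or, as environment variables on the coder container:
#   CODER_PPROF_ENABLE=true
#   CODER_PPROF_ADDRESS=127.0.0.1:6060
```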
As you can see from the above graph, the top 2 did not have any spikes but are trending upwards, around 80% now. The other 3 all have memory spikes exceeding 100%.
@dannykopping do I use localhost or external DNS for this?
Unfortunately, we don't expose the individual pod IP addresses, so we can't navigate to them through the browser. Can we download the heap and goroutine profiles using kubectl exec instead? I tried using the --output parameter but it seemed to save the file inside the pod.
If your pod has curl available, you can fetch the profiles from inside the pod with kubectl exec and then copy them out with kubectl cp.
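Something like the following should work (a sketch only, assuming pprof is listening on its default 127.0.0.1:6060 inside the pod, and that curl, plus tar for kubectl cp, are available in the container; the pod name is just an example):

```sh
POD=coder-64f8cdcf6b-rr8kz   # example pod name; substitute your own

# Write the profiles to a path inside the pod...
kubectl exec "$POD" -- curl -s -o /tmp/heap.pprof http://127.0.0.1:6060/debug/pprof/heap
kubectl exec "$POD" -- curl -s -o /tmp/goroutine.pprof http://127.0.0.1:6060/debug/pprof/goroutine
# (append ?debug=2 to the goroutine URL for a human-readable stack dump instead)

# ...then copy them out to your workstation.
kubectl cp "$POD:/tmp/heap.pprof" ./heap.pprof
kubectl cp "$POD:/tmp/goroutine.pprof" ./goroutine.pprof

# Alternatively, port-forward and fetch the profiles locally without exec:
# kubectl port-forward "$POD" 6060:6060
# curl -s -o heap.pprof http://127.0.0.1:6060/debug/pprof/heap
```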
I have enabled pprof now; the pods have restarted and memory has gone back down. When do you want me to collect these? Anytime?
It would be most helpful to collect them once memory is anomalously high again.
I'm attaching the 3 metrics in CSV format for Thursday Oct 31 - Friday Nov 1: go_goroutines-data-2024-11-06 13_10_54.csv. Metrics collection started around 10:48, so we missed the first spike.
Hmm, I thought the concern here was about the gradual increase in memory use over days, not short-term spikes. Are you still observing the gradual increase? In terms of the spikes, do they correlate with anything else you can observe? Network I/O, workspace builds, automated jobs involving Coder, etc.? Is there any regular periodicity to them?
@spikecurtis Yes, the main concern is about the gradual memory increase. However, I asked the Coder team if I should send these events as well, and Nick Spangler suggested that I should. Hence yesterday's post. To answer your question, no, I have not noticed any activity correlated with any of these spikes.
A gradual increase in memory over days is likely not working as designed. That is to say, it sounds like a memory leak that we want to chase down, understand, and hopefully fix. A short-term spike in memory use might be working as designed, or it might not, depending on what caused the spike. If coderd is getting a huge increase in legitimate load (API requests, proxy traffic, etc.), then we reasonably expect its memory use to go up. Have you checked API request rates and network I/O during these spikes?
I'm attaching goroutine and heap pprof output for one of the 2 pods here. I had to add a .txt extension to the files so I could attach them; rename them when you download.
Thanks for the above, @mambon2. These profiles don't seem to indicate any real problems to me, unfortunately. Viewing the heap profile with go tool pprof only shows 90MiB of data in use (i.e. allocated and not released). Similarly for the goroutine profile: I'm only seeing ~2K goroutines, much lower than what I would expect given the memory usage you're demonstrating here. Even looking at your metrics shared in #14881 (comment), the highest I can see the memory getting to is ~970MiB at 2024-11-01 13:48:00, which is around 25% of the memory request based on what you mentioned in #14881 (comment). Were these profiles/metrics collected during a peak? Based on the above data I'm not seeing much evidence of excessive memory use. Are you able to pull all the Prometheus metrics for these pods from their peak and send them to us? Since we cannot see the query behind these Datadog dashboards, it's pretty difficult for us to understand what precisely we're looking at.
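For reference, this is roughly how the attached profiles can be inspected locally (assumes a Go toolchain; the file names are placeholders, and the added .txt extension may need to be stripped first):

```sh
# In-use heap, largest contributors first
go tool pprof -inuse_space -top heap.pprof

# Interactive view (flame graph, call graph) in the browser
go tool pprof -http=:8081 heap.pprof

# Goroutine profile: count of goroutines per stack
go tool pprof -top goroutine.pprof
```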
These profiles were NOT taken during one of the peaks. I took them at the time I posted them here, for only 1 of the 2 high-memory pods. We are collecting the Prometheus metrics, but I thought you wanted to see them now, while the memory was averaging high, vs. during memory spikes. Which metrics would you like to see, and are you interested in the latest values or the memory spikes?
@mambon2 to clarify: in this message you said 2 pods were running high, at 88% of their available memory. Are the profiles provided in this message related to one of those pods in this high-memory state?
Yes
OK. Is it currently in a similar state? If you could send us a full set of metrics collected from that pod, that'd be helpful.
The only metrics we're currently collecting are the ones Nick Spangler commented on earlier, except the 3rd one comes from Datadog EKS metrics, not Prometheus. Are there additional ones you'd like to see, and for what span of time? go_goroutines - a drastic increase in goroutines can use up a lot of memory
Thanks @dannykopping, please keep us posted
fixes #14881 Our handlers for streaming logs don't read from the websocket. We don't allow the client to send us any data, but the websocket library we use requires reading from the websocket to properly handle pings and closing. Not doing so [can cause the websocket to hang on write](coder/websocket#405), leaking goroutines, which were noticed in #14881. This fixes the issue, and in the process refactors our log streaming to an encoder/decoder package which provides generic types for sending JSON over websocket. I'd also like for us to upgrade to the latest https://github.com/coder/websocket but we should also upgrade our tailscale fork before doing so to avoid including two copies of the websocket library. (cherry picked from commit 148a5a3)
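For reference, the read-loop pattern the fix relies on looks roughly like this, using the nhooyr.io/websocket-style API that coder/websocket is forked from (an illustrative sketch only, not the actual coderd code; names and handler shape are made up):

```go
// Package logstream is a sketch of a log-streaming websocket handler that
// drains the read side so control frames (pings, close) are handled.
package logstream

import (
	"net/http"

	"nhooyr.io/websocket"
	"nhooyr.io/websocket/wsjson"
)

// streamLogs writes log lines to the client as JSON. The client never sends
// data, but we still need to keep reading from the connection.
func streamLogs(w http.ResponseWriter, r *http.Request, logs <-chan string) {
	conn, err := websocket.Accept(w, r, nil)
	if err != nil {
		return
	}
	defer conn.Close(websocket.StatusNormalClosure, "")

	// CloseRead starts a goroutine that keeps reading from the connection so
	// pings and close frames are processed. Without this, a write can block
	// forever once the peer stops responding, leaking this goroutine, which is
	// the failure mode described above.
	ctx := conn.CloseRead(r.Context())

	for {
		select {
		case <-ctx.Done(): // client went away or sent a close frame
			return
		case line, ok := <-logs:
			if !ok {
				return
			}
			if err := wsjson.Write(ctx, conn, line); err != nil {
				return
			}
		}
	}
}
```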
@spikecurtis please keep this open, thanks.
@dannykopping do you have any updates on a stable release?
We deployed v2.17.3 last night. We'll keep an eye on memory, thanks.
Thanks for the update @mambon2 👍
np, traffic will slow down for the remainder of 2024, but we'll close the loop in the 2nd week of Jan.
Thanks for the update @mambon2.
Good question. No, the traffic has been lower during the holiday season. It will ramp back to normal this week.
Can you capture goroutine and heap profiles from one of the pods that has persistently high memory?
@spikecurtis sorry, I had to restart the pods last night. Can I do this anytime? There doesn't seem to be any one event that causes this; rather, memory just increases over time without being released.
It appears to be a slow build-up of memory over time, so we need the profiles from when the pod is using a lot of memory. It won't be helpful to see the profiles right after a restart.
In that case we'll have to wait until this builds up again. However, if you look at the graph, there are clear instances where the memory gets bumped up and doesn't go back down. Should we be looking at these events? They happen all throughout the span of the graph and I could probably provide metrics for those events.
Good morning. We noticed today 2 pods hit over 90% memory, and I collected the pprof profiles before restarting them. Please take a look. coder-64f8cdcf6b-rr8kz-goroutine.txt coder-64f8cdcf6b-vz84v-goroutine.txt
I'm looking at this now. The profiles each show about 60MB of in-use heap memory and approximately 1000 goroutines, neither of which strikes me as particularly high. Each Coder was talking to about 25 workspaces. Does that sound right to you, i.e. with 5 replicas that's 125-ish active workspaces? If that 60MB heap size is accurate, I'm sort of at a loss to explain where 90% (of 8GiB) is going. Can you gather some metrics for one of the pods you sent the profiles for, including the time period where those profiles were generated, and let me know the approximate time of the profile capture? Metrics of interest include process_resident_memory_bytes
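If it helps, here is a sketch of PromQL selectors covering the metrics discussed in this thread (the pod label and regex are assumptions; adjust to your scrape and relabel config):

```promql
process_resident_memory_bytes{pod=~"coder-.*"}        # RSS of the coderd process
go_memstats_heap_inuse_bytes{pod=~"coder-.*"}         # Go heap actually in use
go_memstats_heap_idle_bytes{pod=~"coder-.*"}          # heap held but not yet returned to the OS
go_goroutines{pod=~"coder-.*"}                        # goroutine count
container_memory_working_set_bytes{pod=~"coder-.*"}   # what the kubelet/OOM killer sees
```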
Curtis, I collected the pprof profiles around the time I sent them to you, right before I restarted those 2 pods; that was around 8:36AM. 25 workspaces sounds possible, although at this time in the morning we only have about 70 workspaces running in total and another 100 stopped. I took a look at the other metrics and did not see any correlations. What time period on Feb 7 would you like to see?
I'm also attaching the memory usage for the same pod during that same window of time. I would reiterate that we need to focus on the events where the memory goes up and does not get released to go back down. It seems fairly obvious from the graph that this is what is causing the issue. extract-2025-02-10T14_26_53.501Z-Memory Usage by Pod.csv
We've noticed over the past several months that the coder pods gradually run out of memory over time, typically around the 30-day mark, but can't really say for sure. Here is our memory allocation for a 5 replica set in k8s.
And here is a Datadog screenshot of the memory usage around the 30-day mark.
Coder v2.13.4
We don't keep the k8s logs, so we're not able to share those.