Bug: Goroutine leak in coderd.(*api).workspaceAgentTurn #1508

Closed
Tracked by #1939
mafredri opened this issue May 17, 2022 · 5 comments · Fixed by #1978

Comments

@mafredri
Member

mafredri commented May 17, 2022

There seems to be a goroutine leak in coderd.(*api).workspaceAgentTurn.

This could be seen as two bugs:

  1. Goroutine leak
  2. Use of (*http.Request).Context() after Hijack in more than one place

Steps to Reproduce

  1. Enable pprof for coder server
  2. coder ssh dev
  3. Ctrl+D
  4. Goto 1
  5. Check pprof (go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine)
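For a quick check without the pprof UI, the same kind of growth can be observed by comparing goroutine counts before and after repeating an operation. This is only a minimal sketch: `leakOne` and `goroutineGrowth` are hypothetical names, and the blocked goroutine merely stands in for a handler waiting on a context that never cancels. It is not code from coderd.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakOne starts a goroutine that blocks until stop is closed. It stands in
// for a handler goroutine stuck waiting on a context that never cancels.
func leakOne(stop <-chan struct{}) {
	go func() { <-stop }()
}

// goroutineGrowth runs fn n times and reports how many more goroutines exist
// afterwards than before.
func goroutineGrowth(n int, fn func()) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		fn()
	}
	// Give short-lived goroutines a moment to finish so only leaks remain.
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	stop := make(chan struct{})
	growth := goroutineGrowth(10, func() { leakOne(stop) })
	fmt.Println("goroutine growth after 10 iterations:", growth)
	close(stop) // release the blocked goroutines
}
```

A count that keeps rising with each connect/disconnect cycle points to a leak; the pprof goroutine profile then shows which stack the stuck goroutines share.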

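The leak is in part due to reliance on the http.Request context combined with the use of websockets. The underlying websocket library hijacks the connection (via http.Hijacker on the response writer), which disables context propagation: the request context is no longer canceled when the client disconnects.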

This happens here:

wsConn, err := websocket.Accept(rw, r, &websocket.AcceptOptions{

And the following contexts will not cancel until the http handler completes:

netConn := websocket.NetConn(r.Context(), wsConn, websocket.MessageBinary)

case <-r.Context().Done():

We must avoid using r.Context() after hijack, unless we are using it with the expectation that the http handler will exit (at which point the context will complete).

I'm unfamiliar with the pion/turn package, but another factor could be how it handles connection closure. Perhaps closure does not propagate as we expect, since we're not calling wsConn.Close() due to the reliance on the context?


Similar reliance on the request context after hijack occurs elsewhere, and we should rethink all of those uses. Example:

resource, err := api.Database.GetWorkspaceResourceByID(r.Context(), workspaceAgent.ResourceID)

@mafredri mafredri added bug 🐛 api Area: HTTP API labels May 17, 2022
@kylecarbs
Member

Absolutely awesome bug find!

@tjcran

tjcran commented May 30, 2022

@mafredri what is the potential impact of this bug?

@ketang
Contributor

ketang commented May 31, 2022

Very interested in how you found this. I'm sure there was more to it than pprof.

@mafredri
Member Author

mafredri commented May 31, 2022

@tjcran For the codepath I analyzed, I estimate ~0.25MB of memory leaked per SSH connection. For active, long-running servers this could mean significant memory consumption that is unrecoverable without a restart of the coder server. There could be other potential issues I haven't explored, like file descriptor exhaustion on the host.

Edit: There's also a similar (but smaller) memory increase on the workspace running coder agent.

@ketang Not terribly exciting, I'm afraid. I was simply curious as to what was causing slight CPU usage while coder server was idling, so I took a look and found this instead. 😄

@tjcran

tjcran commented Jun 1, 2022

> @tjcran For the codepath I analyzed, I estimate ~0.25MB of memory leaked per SSH connection. For active, long-running servers this could mean significant memory consumption that is unrecoverable without a restart of the coder server. There could be other potential issues I haven't explored, like file descriptor exhaustion on the host.

@mafredri this is significant enough that I'd like to include it in Community MVP.
