fix: close server pty connections on client disconnect #15201
Merged
Closes #15174
We originally noticed that `last_used_at` was constantly ticking upwards even after clients disconnected. After narrowing it down to the web terminal, I did some digging and found that `agentssh.Bicopy` was never exiting, even after the client disconnected. This meant our websocket connections on the server were held open forever, which continually counted as an open connection to the workspace and bumped `last_used_at` as a result.

https://github.com/coder/coder/blob/f0ssel/last_used_at_inc/coderd/workspaceapps/proxy.go#L702
This is due to a combination of our use of `agentssh.Bicopy` as the primary reader of the websocket and the lack of a timeout on our `websocket.Ping` attempt.

I've added a timeout equal to the loop interval for `httpapi.HeartbeatClose`, which can now catch the newly failing `websocket.Ping` call and cancel the entire request context, leading to `agentssh.Bicopy`
finally exiting and releasing the connection.

But what makes `last_used_at` get bumped forever?

Great question. I wasn't sure at first, since I was logging the inputs and outputs and clearly saw stats from the terminal session stop coming through on disconnect. When I logged the contents of the `workspaceapps.StatsCollector` over time, I noticed that the stats from the hung websocket connections were never getting cleaned up. After some digging I found that for a stat to be cleared out of the stats collector, a stat with a non-zero `ended_at` time value must be published for it. The `/pty` route sends this final stat in a `defer` block in the handler, but since the handler never returned, we never got the stat, so it stayed refreshing forever.

I think there are some improvements we can make to make this safer in the future when we take a look at how to finally merge agent session stats and workspace stats together.