Reduce DB load of autobuild #15082
Comments
My reservation here is that event-based systems can be tricky, and scheduling is also tricky 😅. So I'm definitely biased in favour of a simpler polling-based system. As an alternative, I was thinking about the possibility of storing the next autostart time alongside the cron schedule in the database. That would allow us to only iterate over the workspaces that we know are due to start in the next poll interval, as opposed to all of the workspaces that could potentially start ever.
Yeah, that's also a good call. Decoupling the workspace transition query logic from the workspace transition action logic would probably also help here.
A quick win is we could probably cache
To give an update on this: we've successfully reduced the false-positive rate of GetWorkspacesEligibleForTransition. As we can see from this graph of our dogfood instance, we have massively reduced the number of "last transition not valid" logs. These logs occur when Unfortunately there is still work to do on this, as we still return false positives for the "is eligible for autostart" transition. The benefit of reducing the false-positive rate of One of the 7 DB calls that we make is to We've also seen a reduction in the data transfer from the database, for two reasons. The first is a reduction in the number of workspaces returned. The second is that we no longer return the entire workspace row, only the ID and Name (as these were the only values we were using). The next step is to stop returning false positives for the "is eligible for autostart" transition. This should hopefully stop
Whilst writing some tests for #15594, we've discovered the tests require mocking time in Before making further progress on #15594, a change will need to be made to
…ositives (#15594) Relates to #15082 Further to #15429, this reduces the amount of false-positives returned by the 'is eligible for autostart' part of the query. We achieve this by calculating the 'next start at' time of the workspace, storing it in the database, and using it in our `GetWorkspacesEligibleForTransition` query. The prior implementation of the 'is eligible for autostart' query would return _all_ workspaces that at some point in the future _might_ be eligible for autostart. This now ensures we only return workspaces that _should_ be eligible for autostart. We also now pass `currentTick` instead of `t` to the `GetWorkspacesEligibleForTransition` query as otherwise we'll have one round of workspaces that are skipped by `isEligibleForTransition` due to `currentTick` being a truncated version of `t`.
Update: On our dogfood instance we've managed to reduce the number of "last transition not valid" logs to 0. Here is the per-minute count of "last transition not valid" logs for the past 30 days; notice the massive drop-off at the first fix (discussed in the previous update post). There is a second drop-off at the tail end of the graph, visible as a lack of data. Here is that same graph zoomed in to the past 2 days. This reduction has been achieved by tackling the final false-positive situation: "is eligible for autostart". We previously only stored the cron schedule in the database, which meant we couldn't figure out if a workspace was eligible for autostart in our To rectify this we added a To summarise: We've managed to stop the query from returning false positives, which should massively reduce the number of database calls our lifecycle executor makes (each false positive would trigger a further 7 database calls). This means our lifecycle executor should only be making 1 database call per tick (which defaults to 1 minute) unless any workspaces need to transition, instead of up to If we discover this polling method still causes an unnecessary amount of database load, we can investigate whether we need to consider a larger refactor.
Motivation
The `autobuild/` package periodically queries for workspaces that are eligible for a state transition via `GetWorkspacesEligibleForTransition`. This is run every `CODER_AUTOBUILD_POLL_INTERVAL`, which defaults to 1 minute.
This would probably be OK if that were all it did, but it also kicks off a bunch of other queries per workspace in the result set:
GetWorkspaceByID
GetUserByID
GetLatestWorkspaceBuildByWorkspaceID
GetProvisionerJobByID
GetTemplateByID
GetTemplateVersionByID
GetTemplateAccessControl
This can cause significant database load with multiple Coder instances and/or large numbers of workspaces.
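To get a feel for the scale, here is a back-of-the-envelope sketch. It assumes each replica runs the poll independently and issues all seven follow-up queries for every workspace returned; the replica and workspace counts are illustrative, not measurements.

```go
package main

import "fmt"

// perWorkspaceQueries is the number of follow-up queries listed above.
const perWorkspaceQueries = 7

// queriesPerTick estimates the database queries issued in one poll interval:
// one eligibility query per replica, plus seven per returned workspace.
func queriesPerTick(replicas, returnedWorkspaces int) int {
	return replicas * (1 + perWorkspaceQueries*returnedWorkspaces)
}

func main() {
	fmt.Println(queriesPerTick(1, 0))    // 1
	fmt.Println(queriesPerTick(3, 1000)) // 21003
}
```

Under these assumptions the cost grows linearly with the size of the result set, so shrinking that set matters more than shaving time off any single query.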
Solutions
a) Instead of periodically querying, move to an event-based approach based on the next time we know we need to start a workspace.
b) Keep the existing timer-based approach but optimize the query.
c) Only have one replica running the autobuild query at a time.