Reduce DB load of autobuild #15082

johnstcn · 2024-10-15T13:52:37Z

Motivation

The autobuild/ package periodically queries for workspaces that are eligible for a state transition via GetWorkspacesEligibleForTransition.

This is run every CODER_AUTOBUILD_POLL_INTERVAL, which defaults to 1 minute.

This would probably be OK if that were all it did, but it also kicks off a bunch of other queries per workspace in the result set:

GetWorkspaceByID
GetUserByID
GetLatestWorkspaceBuildByWorkspaceID
GetProvisionerJobByID
GetTemplateByID
GetTemplateVersionByID
GetTemplateAccessControl

This can cause significant database load with multiple Coder instances and/or large numbers of workspaces.

Solutions

a) Instead of periodically querying, move to an event-based approach based on the next time we know we need to start a workspace.

b) Keep the existing timer-based approach but optimize the query.

c) Only have one replica running the autobuild query at a time.


aaronlehmann · 2024-10-15T15:23:18Z

I like the event-based approach, since that will scale better to large number of workspaces. Even with < 1000 workspaces, we were seeing high DB load from the default 1 minute scan interval (multiplied by 3 server instances in an HA configuration).

If you keep the existing timer-based approach, I'd recommend spacing out the per-workspace checks over time instead of running them in a tight loop whenever the timer fires. After we adjusted the timer interval to 10 minutes, the tight loop shows up as clear load spikes on the database.
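A minimal sketch of the "space out the checks" idea, assuming the checks are simply spread evenly across one poll interval. The function name `spreadOffsets` and the workspace names are hypothetical, and this is not how the autobuild executor is written; it only shows the arithmetic behind turning one burst per tick into a steady trickle.

```go
package main

import (
	"fmt"
	"time"
)

// spreadOffsets returns how long after the tick each of n workspace checks
// would start if the checks were spaced evenly across one poll interval.
// (Hypothetical helper, not part of the autobuild package.)
func spreadOffsets(n int, interval time.Duration) []time.Duration {
	if n <= 0 {
		return nil
	}
	step := interval / time.Duration(n)
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = time.Duration(i) * step
	}
	return offsets
}

func main() {
	// Hypothetical workspace names; a real executor would use the query result.
	workspaces := []string{"ws-a", "ws-b", "ws-c", "ws-d"}
	for i, off := range spreadOffsets(len(workspaces), time.Minute) {
		fmt.Printf("%s: check at tick+%s\n", workspaces[i], off)
	}
}
```

A real loop would wait on a timer between checks and stop early when its context is cancelled, so that one slow tick cannot overlap the next.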

johnstcn · 2024-10-15T16:45:39Z

> I like the event-based approach, since that will scale better to large number of workspaces.

My reservation here is that event-based systems can be tricky, and scheduling is also tricky 😅. So I'm definitely biased in favour of a simpler polling-based system.

As an alternative, I was thinking about the possibility of storing the next autostart time alongside the cron schedule in the database. That would allow us to only iterate over the workspaces that we know are due to start in the next poll interval, as opposed to all of the workspaces that could potentially start ever.

> If you keep the existing timer-based approach, I'd recommend spacing out the per-workspace checks over time instead of running them in a tight loop whenever the timer fires.

Yeah, that's also a good call. Decoupling the workspace transition query logic from the workspace transition action logic would probably also help here.

Emyrk · 2024-10-29T16:53:50Z

A quick win is we could probably cache GetTemplateByID and GetTemplateVersionByID. A lot of workspaces are fetching the same data.
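As a rough illustration of the caching idea, here is a sketch of a per-tick memoising wrapper around a template lookup. The `Template` type, `templateCache`, and the `fetch` callback are all hypothetical stand-ins; the real `GetTemplateByID` signature and any invalidation rules are not shown in this thread.

```go
package main

import "fmt"

// Template is a hypothetical stand-in for the real template row.
type Template struct {
	ID   string
	Name string
}

// templateCache memoises lookups so that workspaces sharing a template
// trigger one fetch instead of one fetch each. It is meant to live for a
// single tick and is not safe for concurrent use.
type templateCache struct {
	fetch   func(id string) (Template, error)
	entries map[string]Template
}

func newTemplateCache(fetch func(id string) (Template, error)) *templateCache {
	return &templateCache{fetch: fetch, entries: make(map[string]Template)}
}

// Get returns the cached template, fetching it only on the first request.
func (c *templateCache) Get(id string) (Template, error) {
	if t, ok := c.entries[id]; ok {
		return t, nil
	}
	t, err := c.fetch(id)
	if err != nil {
		return Template{}, err
	}
	c.entries[id] = t
	return t, nil
}

func main() {
	fetches := 0
	cache := newTemplateCache(func(id string) (Template, error) {
		fetches++ // stands in for one database round trip
		return Template{ID: id, Name: "template-" + id}, nil
	})
	for _, id := range []string{"t1", "t1", "t2", "t1"} {
		if _, err := cache.Get(id); err != nil {
			fmt.Println("lookup failed:", err)
			return
		}
	}
	fmt.Printf("4 lookups, %d fetches\n", fetches)
}
```

Discarding the cache at the end of each tick keeps staleness bounded by the poll interval, at the cost of re-fetching every template once per tick.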

…ves (#15429)

Relates to #15082

The old implementation of `GetWorkspacesEligibleForTransition` returns many workspaces that are not actually eligible for transition. This new implementation reduces this number significantly (at least on our dogfood instance).

DanielleMaywood · 2024-11-15T14:53:26Z

To give an update on this:

We've successfully managed to reduce the false-positive rate of GetWorkspacesEligibleForTransition.

As we can see from this graph of our dogfood instance, we have massively reduced the number of "last transition not valid" logs. These logs occur when GetWorkspacesEligibleForTransition returns workspaces that are not eligible for transition.

Unfortunately there is still work to do on this as we still return false-positives for the "is eligible for autostart" transition.

The benefit of reducing the false-positive rate of GetWorkspacesEligibleForTransition is that we now make fewer calls to the database. This is because for each workspace returned from this query, we proceed to make another 7 database calls.

One of the 7 DB calls that we make is to GetWorkspaceByID, and we can see that this change has meant that this query now has a smaller impact on the database as we make far fewer calls to it.

We've also seen a reduction in the data transfer from the database, for two reasons. First, fewer workspaces are returned. Second, we no longer return the entire workspace row, only the ID and Name (as these were the only values we were using).

The next step of work is to stop returning false-positives for the "is eligible for autostart" transition. This should hopefully stop GetWorkspacesEligibleForTransition returning false-positives altogether. Once we've achieved that, we will evaluate whether any further steps are needed (e.g. investigating a different approach, such as an event-based system), or whether the load from this becomes insignificant.

DanielleMaywood · 2024-11-25T11:35:58Z

Whilst writing some tests for #15594, we've discovered the tests require mocking time in provisionerdserver. This already somewhat exists but is inaccessible to the autostart/autostop tests.

Before making further progress on #15594, a change will need to be made to provisionerdserver to use a quartz.Clock, as well as a change to coderd to pass its clock to provisionerdserver.
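For readers unfamiliar with clock injection, the sketch below shows the general pattern with a deliberately tiny, hypothetical `Clock` interface. It is not the quartz API (quartz.Clock is richer), and `isDue`, `fakeClock`, and `newFakeClock` are invented for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// Clock is a hypothetical minimal interface; quartz.Clock offers much more.
type Clock interface {
	Now() time.Time
}

// realClock is what production code would receive.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

var _ Clock = realClock{}

// fakeClock is what a test would pass in; time only moves when told to.
type fakeClock struct{ now time.Time }

func newFakeClock(unixSeconds int64) *fakeClock {
	return &fakeClock{now: time.Unix(unixSeconds, 0).UTC()}
}

func (f *fakeClock) Now() time.Time          { return f.now }
func (f *fakeClock) Advance(d time.Duration) { f.now = f.now.Add(d) }

// isDue reports whether the deadline has been reached according to c.
func isDue(c Clock, deadline time.Time) bool {
	return !c.Now().Before(deadline)
}

func main() {
	clock := newFakeClock(0)
	deadline := clock.Now().Add(90 * time.Second)
	fmt.Println("due at start:", isDue(clock, deadline))
	clock.Advance(2 * time.Minute)
	fmt.Println("due after advancing 2m:", isDue(clock, deadline))
}
```

Because the code under test only sees the interface, a test can step time forward deterministically instead of sleeping.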

…ositives (#15594)

Relates to #15082

Further to #15429, this reduces the number of false-positives returned by the 'is eligible for autostart' part of the query. We achieve this by calculating the 'next start at' time of the workspace, storing it in the database, and using it in our `GetWorkspacesEligibleForTransition` query.

The prior implementation of the 'is eligible for autostart' query would return _all_ workspaces that at some point in the future _might_ be eligible for autostart. This now ensures we only return workspaces that _should_ be eligible for autostart.

We also now pass `currentTick` instead of `t` to the `GetWorkspacesEligibleForTransition` query as otherwise we'll have one round of workspaces that are skipped by `isEligibleForTransition` due to `currentTick` being a truncated version of `t`.
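The `currentTick` versus `t` point in the commit message above can be illustrated with a small sketch. It assumes `currentTick` is `t` truncated to the minute, which the message implies but does not state. `dueAt`, `disagree`, and `at` are hypothetical helpers, not functions from the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// at builds a UTC timestamp on an arbitrary fixed day (illustration only).
func at(hour, minute, second int) time.Time {
	return time.Date(2024, time.November, 20, hour, minute, second, 0, time.UTC)
}

// dueAt reports whether a workspace whose next start is nextStartAt
// counts as due when evaluated at time now.
func dueAt(nextStartAt, now time.Time) bool {
	return !nextStartAt.After(now)
}

// disagree reports whether filtering with the raw time t and checking with
// the truncated tick reach different verdicts for the same workspace.
// Assumption: the tick is t truncated to the minute.
func disagree(nextStartAt, t time.Time) bool {
	currentTick := t.Truncate(time.Minute)
	return dueAt(nextStartAt, t) != dueAt(nextStartAt, currentTick)
}

func main() {
	t := at(10, 0, 30)
	nextStartAt := at(10, 0, 15)
	fmt.Println("selected using t:          ", dueAt(nextStartAt, t))
	fmt.Println("accepted using currentTick:", dueAt(nextStartAt, t.Truncate(time.Minute)))
	fmt.Println("verdicts disagree:         ", disagree(nextStartAt, t))
}
```

Any start time that falls between the truncated tick and `t` is selected by one comparison and rejected by the other, which is why using the same value on both sides avoids a wasted round.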

DanielleMaywood · 2024-12-04T15:05:57Z

Update:

On our dogfood instance we've managed to reduce the number of "last transition not valid" logs to 0.

Here is the per-minute count of "last transition not valid" logs for the past 30 days. Notice the massive drop-off at the first fix (discussed in the previous update post). There is a second drop-off at the tail end of the graph, visible as a lack of data.

Here is that same graph but zoomed into the past 2 days.

This reduction has been achieved by tackling the final false-positive situation: "is eligible for autostart". We previously only stored the cron schedule in the database, which meant we couldn't figure out if a workspace was eligible for autostart in our GetWorkspacesEligibleForTransition query. As a result, the query returned any workspace that would be eligible for autostart at any point in the future.

To rectify this we added a next_start_at column to the workspaces table so that we know when each workspace should next start. This means our query will now only return workspaces it knows should be valid for an autostart.
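To make the idea concrete, here is an in-memory sketch of "store the next start time, then compare it against the tick". It is heavily simplified: the schedule is assumed to be a fixed daily hour and minute in UTC rather than a cron expression, the comparison runs in Go rather than in the SQL query, and `nextDailyStart`, `eligibleForAutostart`, `workspace`, and `at` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// at builds a UTC timestamp in December 2024 (illustration only).
func at(day, hour, minute int) time.Time {
	return time.Date(2024, time.December, day, hour, minute, 0, 0, time.UTC)
}

// nextDailyStart returns the first occurrence of hour:minute (UTC) strictly
// after now. Real schedules are cron expressions with time zones; this is
// a simplification.
func nextDailyStart(now time.Time, hour, minute int) time.Time {
	now = now.UTC()
	next := time.Date(now.Year(), now.Month(), now.Day(), hour, minute, 0, 0, time.UTC)
	if !next.After(now) {
		next = next.AddDate(0, 0, 1)
	}
	return next
}

// workspace carries only the fields this sketch needs. A zero NextStartAt
// stands in for "no autostart scheduled".
type workspace struct {
	Name        string
	NextStartAt time.Time
}

// eligibleForAutostart keeps only workspaces whose stored next start time
// has been reached at the given tick.
func eligibleForAutostart(all []workspace, tick time.Time) []string {
	var due []string
	for _, w := range all {
		if !w.NextStartAt.IsZero() && !w.NextStartAt.After(tick) {
			due = append(due, w.Name)
		}
	}
	return due
}

func main() {
	tick := at(4, 15, 0)
	all := []workspace{
		{Name: "already-due", NextStartAt: at(4, 14, 59)},
		{Name: "tomorrow", NextStartAt: nextDailyStart(tick, 9, 0)},
		{Name: "no-schedule"},
	}
	fmt.Println(eligibleForAutostart(all, tick))
}
```

With only the schedule stored, both scheduled workspaces would have had to be returned and checked in Go; with the stored timestamp, the filter can run where the data lives.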

To summarise: We've managed to stop the query from returning false-positives, which should massively reduce the number of database calls our lifecycle executor makes (each false-positive would trigger a further 7 database calls). This means our lifecycle executor should make only 1 database call per tick (which defaults to 1 minute) unless a workspace needs to transition, instead of up to 1 + $NUMBER_OF_WORKSPACES * 7 per tick.
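The per-tick arithmetic above can be written out as a tiny sketch. `queriesPerTick` is a hypothetical helper, and the figures of 1000 workspaces and 3 replicas are illustrative round numbers borrowed from the scale mentioned earlier in the thread, not measurements.

```go
package main

import "fmt"

// followUpQueries is the number of extra database calls made for each
// workspace returned by the eligibility query, per the discussion above.
const followUpQueries = 7

// queriesPerTick returns the database calls for one tick on one replica:
// one eligibility query plus the follow-ups for each returned workspace.
func queriesPerTick(returnedWorkspaces int) int {
	return 1 + returnedWorkspaces*followUpQueries
}

func main() {
	const workspaces, replicas = 1000, 3 // illustrative figures only

	worstCase := queriesPerTick(workspaces)
	idle := queriesPerTick(0)

	fmt.Printf("every workspace returned: %d calls per replica, %d across replicas\n",
		worstCase, worstCase*replicas)
	fmt.Printf("nothing due: %d call per replica, %d across replicas\n",
		idle, idle*replicas)
}
```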

If we discover this polling method still causes an unnecessary amount of database load, we can investigate if we need to consider a larger refactor.

coder-labeler bot added the needs decision label Oct 15, 2024

johnstcn added need-backend and removed needs decision labels Oct 15, 2024

johnstcn added the needs-rfc label Oct 16, 2024

johnstcn assigned johnstcn and DanielleMaywood Nov 1, 2024

DanielleMaywood mentioned this issue Nov 7, 2024

fix: make GetWorkspacesEligibleForTransition return less false-positives #15429

Merged

DanielleMaywood mentioned this issue Nov 20, 2024

fix: make GetWorkspacesEligibleForTransition return even less false positives #15594

Merged

DanielleMaywood mentioned this issue Nov 25, 2024

refactor(coderd/provisionerdserver): use quartz.Clock instead of TimeNowFn #15642

Merged

DanielleMaywood closed this as completed Dec 4, 2024
