
Reduce DB load of autobuild #15082


Closed

johnstcn opened this issue Oct 15, 2024 · 6 comments
Assignees
Labels
need-backend Issues that need backend work needs-rfc Issues that needs an RFC due to an expansive scope and unclear implementation path.

Comments

@johnstcn
Member

johnstcn commented Oct 15, 2024

Motivation

The autobuild/ package periodically queries for workspaces that are eligible for a state transition via GetWorkspacesEligibleForTransition.

This is run every CODER_AUTOBUILD_POLL_INTERVAL, which defaults to 1 minute.

This would probably be OK if that were all it did, but it also kicks off a bunch of other queries per workspace in the result set:

  • GetWorkspaceByID
  • GetUserByID
  • GetLatestWorkspaceBuildByWorkspaceID
  • GetProvisionerJobByID
  • GetTemplateByID
  • GetTemplateVersionByID
  • GetTemplateAccessControl

This can cause significant database load with multiple Coder instances and/or large numbers of workspaces.
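
As a rough illustration of that fan-out (a sketch only: the method names follow the list above, but the signatures and fields are simplified assumptions, not the actual coder/coder code):

```go
package autobuildsketch

import (
	"context"
	"time"

	"github.com/coder/coder/v2/coderd/database"
)

// runOnce sketches one autobuild tick: a single eligibility query followed
// by several extra queries per returned workspace, so a poll can cost
// roughly 1 + 7*N round trips for N returned workspaces.
func runOnce(ctx context.Context, db database.Store, now time.Time) error {
	eligible, err := db.GetWorkspacesEligibleForTransition(ctx, now)
	if err != nil {
		return err
	}
	for _, row := range eligible {
		ws, err := db.GetWorkspaceByID(ctx, row.ID)
		if err != nil {
			continue
		}
		_, _ = db.GetUserByID(ctx, ws.OwnerID)
		build, err := db.GetLatestWorkspaceBuildByWorkspaceID(ctx, ws.ID)
		if err != nil {
			continue
		}
		_, _ = db.GetProvisionerJobByID(ctx, build.JobID)
		tpl, err := db.GetTemplateByID(ctx, ws.TemplateID)
		if err != nil {
			continue
		}
		_, _ = db.GetTemplateVersionByID(ctx, tpl.ActiveVersionID)
		db.GetTemplateAccessControl(ctx, tpl.ID) // access-control lookup, result elided
		// ...decide whether to enqueue a start/stop build for ws...
	}
	return nil
}
```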

Solutions

a) Instead of periodically querying, move to an event-based approach based on the next time we know we need to start a workspace.

b) Keep the existing timer-based approach but optimize the query.

c) Only have one replica running the autobuild query at a time.

@coder-labeler coder-labeler bot added the needs decision Needs a higher-level decision to be unblocked. label Oct 15, 2024
@johnstcn johnstcn added need-backend Issues that need backend work and removed needs decision Needs a higher-level decision to be unblocked. labels Oct 15, 2024
@aaronlehmann
Contributor

I like the event-based approach, since that will scale better to a large number of workspaces. Even with < 1000 workspaces, we were seeing high DB load from the default 1 minute scan interval (multiplied by 3 server instances in an HA configuration).

If you keep the existing timer-based approach, I'd recommend spacing out the per-workspace checks over time instead of running them in a tight loop whenever the timer fires. We adjusted the timer interval to 10 minutes, and this results in clear load spikes on the database:
Image

@johnstcn
Member Author

I like the event-based approach, since that will scale better to a large number of workspaces.

My reservation here is that event-based systems can be tricky, and scheduling is also tricky 😅. So I'm definitely biased in favour of a simpler polling-based system.

As an alternative, I was thinking about the possibility of storing the next autostart time alongside the cron schedule in the database. That would allow us to only iterate over the workspaces that we know are due to start in the next poll interval, as opposed to all of the workspaces that could potentially start ever.
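
A rough sketch of that idea, assuming a standard cron library (Coder has its own schedule-parsing package; github.com/robfig/cron/v3 is used here purely for illustration):

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

// nextAutostart computes the next time a workspace's autostart schedule
// fires after the given instant. Storing this value alongside the schedule
// would let the poller select only workspaces due in the next interval.
func nextAutostart(schedule string, after time.Time) (time.Time, error) {
	sched, err := cron.ParseStandard(schedule)
	if err != nil {
		return time.Time{}, err
	}
	return sched.Next(after), nil
}

func main() {
	// Hypothetical schedule: weekdays at 09:30 UTC.
	next, err := nextAutostart("CRON_TZ=UTC 30 9 * * 1-5", time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println("next autostart:", next)
}
```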

If you keep the existing timer-based approach, I'd recommend spacing out the per-workspace checks over time instead of running them in a tight loop whenever the timer fires.

Yeah, that's also a good call. Decoupling the workspace transition query logic from the workspace transition action logic would probably also help here.

@johnstcn johnstcn added the needs-rfc Issues that needs an RFC due to an expansive scope and unclear implementation path. label Oct 16, 2024
@Emyrk
Member

Emyrk commented Oct 29, 2024

A quick win is that we could probably cache GetTemplateByID and GetTemplateVersionByID; a lot of workspaces are fetching the same data.
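
A minimal sketch of that idea, memoising template lookups for the duration of one tick (illustrative only, not the actual coder/coder code; it holds no lock, so it assumes single-goroutine use within a tick):

```go
package autobuildsketch

import (
	"context"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database"
)

// templateCache memoises GetTemplateByID for the duration of one autobuild
// tick, so N workspaces sharing a template cost one query instead of N.
type templateCache struct {
	db   database.Store
	byID map[uuid.UUID]database.Template
}

func newTemplateCache(db database.Store) *templateCache {
	return &templateCache{db: db, byID: make(map[uuid.UUID]database.Template)}
}

func (c *templateCache) GetTemplateByID(ctx context.Context, id uuid.UUID) (database.Template, error) {
	if tpl, ok := c.byID[id]; ok {
		return tpl, nil // cache hit: no database round trip
	}
	tpl, err := c.db.GetTemplateByID(ctx, id)
	if err != nil {
		return database.Template{}, err
	}
	c.byID[id] = tpl
	return tpl, nil
}
```

The same pattern would apply to GetTemplateVersionByID.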

DanielleMaywood added a commit that referenced this issue Nov 13, 2024
…ves (#15429)

Relates to #15082

The old implementation of `GetWorkspacesEligibleForTransition` returns
many workspaces that are not actually eligible for transition. This new
implementation reduces this number significantly (at least on our
dogfood instance).
@DanielleMaywood
Contributor

To give an update on this:

We've successfully managed to reduce the false-positive rate of GetWorkspacesEligibleForTransition.

As we can see from this graph of our dogfood instance, we have massively reduced the number of "last transition not valid" logs. These logs occur when GetWorkspacesEligibleForTransition returns workspaces that are not eligible for transition.

Image

Unfortunately, there is still work to do here, as we still return false positives for the "is eligible for autostart" transition.

The benefit of reducing the false-positive rate of GetWorkspacesEligibleForTransition is that we now make fewer calls to the database, because for each workspace returned from this query we proceed to make another 7 database calls.

One of those 7 DB calls is GetWorkspaceByID, and we can see that this change has reduced its impact on the database, as we now make far fewer calls to it.

Image

We've also seen a reduction in data transfer from the database, for two reasons: fewer workspaces are returned, and we no longer return the entire workspace row, only the ID and Name (the only values we were using).

Image
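
For illustration, the narrowed query shape looks roughly like this (a hypothetical sketch, not the actual sqlc query in coder/coder):

```go
// Only the columns the lifecycle executor actually uses are returned,
// rather than the full workspace row.
const getWorkspacesEligibleForTransition = `
SELECT
	workspaces.id,
	workspaces.name
FROM
	workspaces
WHERE
	-- eligibility conditions elided for brevity
	...
`
```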

The next step is to stop returning false positives for the "is eligible for autostart" transition, which should stop GetWorkspacesEligibleForTransition returning false positives altogether. Once we've achieved that, we will evaluate whether any further steps are needed (e.g. investigating a different approach, such as an event-based system), or whether the remaining load is insignificant.

@DanielleMaywood
Contributor

Whilst writing some tests for #15594, we discovered that they require mocking time in provisionerdserver. Some support for this already exists, but it is inaccessible to the autostart/autostop tests.

Before making further progress on #15594, a change will need to be made to provisionerdserver to use a quartz.Clock, as well as a change to coderd to pass its clock to provisionerdserver.
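
A minimal sketch of that direction, assuming the github.com/coder/quartz package (the option and type names here are illustrative, not coder/coder's actual provisionerdserver API):

```go
package provisionerdsketch

import "github.com/coder/quartz"

// Options carries the server's dependencies; the clock defaults to the real
// wall clock when unset.
type Options struct {
	Clock quartz.Clock
}

type Server struct {
	clock quartz.Clock
}

// New constructs the server with an injectable clock so coderd can pass its
// own clock down and tests can substitute a mock.
func New(opts Options) *Server {
	if opts.Clock == nil {
		opts.Clock = quartz.NewReal()
	}
	return &Server{clock: opts.Clock}
}
```

A test could then construct the server with quartz.NewMock(t) and advance time deterministically instead of sleeping.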

DanielleMaywood added a commit that referenced this issue Dec 2, 2024
…ositives (#15594)

Relates to #15082

Further to #15429, this reduces the
amount of false-positives returned by the 'is eligible for autostart'
part of the query. We achieve this by calculating the 'next start at'
time of the workspace, storing it in the database, and using it in our
`GetWorkspacesEligibleForTransition` query.

The prior implementation of the 'is eligible for autostart' query would
return _all_ workspaces that at some point in the future _might_ be
eligible for autostart. This now ensures we only return workspaces that
_should_ be eligible for autostart.

We also now pass `currentTick` instead of `t` to the
`GetWorkspacesEligibleForTransition` query as otherwise we'll have one
round of workspaces that are skipped by `isEligibleForTransition` due to
`currentTick` being a truncated version of `t`.
@DanielleMaywood
Contributor

Update:

On our dogfood instance we've managed to reduce the number of "last transition not valid" logs to 0.

Here is the per-minute count of "last transition not valid" logs for the past 30 days. Notice the massive drop-off at the first fix (discussed in the previous update), and then a second drop-off at the tail end of the graph, visible as an absence of data.

Image

Here is that same graph but zoomed into the past 2 days.

Image

This reduction has been achieved by tackling the final false-positive case: "is eligible for autostart". We previously only stored the cron schedule in the database, which meant we couldn't determine in the GetWorkspacesEligibleForTransition query whether a workspace was actually eligible for autostart, so the query returned any workspace that would be eligible for autostart at any point in the future.

To rectify this, we added a next_start_at column to the workspaces table so we know when a workspace should next start. The query now only returns workspaces it knows should be valid for an autostart.
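
Roughly, the schema change and eligibility condition would look like the following sketch (hypothetical SQL, not the actual migration or query in coder/coder):

```go
// Hypothetical migration adding the column described above.
const addNextStartAt = `
ALTER TABLE workspaces ADD COLUMN next_start_at timestamptz;
`

// The autostart branch of the eligibility check can then compare against the
// current tick instead of matching every workspace that has a schedule.
const autostartEligible = `
workspaces.autostart_schedule IS NOT NULL
AND workspaces.next_start_at IS NOT NULL
AND workspaces.next_start_at <= $1
`
```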

To summarise: we've stopped the query from returning false positives, which should massively reduce the number of database calls the lifecycle executor makes (each false positive triggered a further 7 database calls). The lifecycle executor should now make only 1 database call per tick (which defaults to 1 minute) unless a workspace actually needs to transition, instead of up to 1 + $NUMBER_OF_WORKSPACES * 7 per tick.
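
For a sense of scale with hypothetical numbers: at 1,000 workspaces and the default 1-minute tick, the old worst case was up to 1 + 1,000 × 7 = 7,001 queries per tick, whereas now it is 1 query per tick plus 7 for each workspace that is actually due to transition.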

If we discover this polling method still causes an unnecessary amount of database load, we can investigate if we need to consider a larger refactor.
