feat: track resource replacements when claiming a prebuilt workspace #17571

Open
dannykopping wants to merge 14 commits into main from dk/logreplacements

dannykopping
Contributor

@dannykopping dannykopping commented Apr 25, 2025

Closes coder/internal#369

We can't know a priori whether a replacement (i.e. drift in Terraform state that forces a resource to be deleted and recreated) will take place; we can only detect it at plan time, because the provider decides whether a resource must be replaced, and that cannot be inferred through static analysis of the template.
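
As an illustration of plan-time detection (not necessarily how this PR implements it), here is a minimal sketch that scans Terraform's JSON plan representation, as printed by `terraform show -json`, for resources whose planned actions contain both a delete and a create. The helper name `findReplacements` and the sample input are assumptions made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan mirrors only the parts of Terraform's JSON plan representation
// that this sketch needs.
type plan struct {
	ResourceChanges []resourceChange `json:"resource_changes"`
}

type resourceChange struct {
	Address string `json:"address"`
	Change  struct {
		Actions []string `json:"actions"`
	} `json:"change"`
}

// findReplacements (hypothetical helper) returns the addresses of resources
// whose planned actions include both a delete and a create, which is how a
// replacement appears in the JSON plan.
func findReplacements(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("decode plan: %w", err)
	}
	var replaced []string
	for _, rc := range p.ResourceChanges {
		var del, create bool
		for _, action := range rc.Change.Actions {
			switch action {
			case "delete":
				del = true
			case "create":
				create = true
			}
		}
		if del && create {
			replaced = append(replaced, rc.Address)
		}
	}
	return replaced, nil
}

func main() {
	// Hand-written sample input, not output captured from a real plan.
	sample := []byte(`{"resource_changes":[
		{"address":"docker_container.workspace","change":{"actions":["delete","create"]}},
		{"address":"null_resource.example","change":{"actions":["no-op"]}}
	]}`)
	replaced, err := findReplacements(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(replaced)
}
```

The check only works on a computed plan, which matches the point above: the template alone does not say which resources the provider will decide to replace.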

This is likely to be the most common gotcha with using prebuilds, since it requires a slight template modification to use prebuilds effectively, so let's head this off before it's an issue for customers.

Drift details will now be logged in the workspace build logs:

image

Plus a notification will be sent to template admins when this situation arises:

image

A new metric - coderd_prebuilt_workspaces_resource_replacements_total - will also increment each time a workspace encounters replacements.

We only track that a resource replacement occurred, not how many. Just one is enough to ruin a prebuild, but we can't know a priori which replacement would cause this.
For example, say we have 2 replacements: a docker_container and a null_resource; we don't know which one might cause an issue (or indeed if either would), so we just track that a replacement happened.
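
The "occurred, not how many" rule can be sketched as follows. This is an assumption-laden illustration: the `replacement` type and `trackReplacements` function are hypothetical, and a plain atomic integer stands in for the real counter metric so the example has no external dependencies.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// replacement is a hypothetical stand-in for one replaced resource.
type replacement struct {
	Resource string // e.g. "docker_container.workspace"
}

// replacementsTotal stands in for the counter metric in this sketch.
var replacementsTotal atomic.Int64

// trackReplacements increments the counter once per claim that had any
// replacement, regardless of how many resources were affected.
func trackReplacements(replacements []replacement) {
	if len(replacements) == 0 {
		return
	}
	replacementsTotal.Add(1)
}

func main() {
	// Two replaced resources in one claim still count as a single increment.
	trackReplacements([]replacement{
		{Resource: "docker_container.workspace"},
		{Resource: "null_resource.example"},
	})
	// A claim without replacements leaves the counter untouched.
	trackReplacements(nil)
	fmt.Println(replacementsTotal.Load())
}
```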

If you’re using prebuilds to speed up provisioning, unexpected replacements will slow down
workspace startup—even when claiming a prebuilt environment.

For tips on preventing replacements and improving claim performance, see [this guide](https://coder.com/docs/TODO).
Contributor Author

@dannykopping dannykopping Apr 25, 2025

Relies on #17580

@github-actions github-actions bot added the stale This issue is like stale bread. label May 6, 2025
@dannykopping dannykopping force-pushed the dk/logreplacements branch 3 times, most recently from 4859410 to 634082b Compare May 7, 2025 21:06
@github-actions github-actions bot removed the stale This issue is like stale bread. label May 8, 2025
@dannykopping dannykopping force-pushed the dk/logreplacements branch from 0322146 to 13168f4 Compare May 8, 2025 12:38
@@ -75,6 +75,7 @@ message CompletedJob {
repeated provisioner.Resource resources = 2;
repeated provisioner.Timing timings = 3;
repeated provisioner.Module modules = 4;
repeated provisioner.ResourceReplacement resourceReplacements = 5;
Contributor Author

I believe this is backwards-compatible, since repeated fields are effectively optional: a field that was never set simply decodes as an empty (nil) list.
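
A small sketch of why an unset repeated field is harmless on the receiving side, assuming the generated Go type exposes the field as a slice. The types below are hand-written, hypothetical stand-ins, not the generated code.

```go
package main

import "fmt"

// Hypothetical, hand-written stand-ins for the generated protobuf types.
type resourceReplacement struct {
	Resource string
}

type completedJob struct {
	ResourceReplacements []*resourceReplacement
}

// replacedResources works whether or not the sender populated the field.
func replacedResources(job *completedJob) []string {
	var names []string
	// Ranging over a nil slice performs zero iterations, so no nil check is needed.
	for _, r := range job.ResourceReplacements {
		names = append(names, r.Resource)
	}
	return names
}

func main() {
	// oldSender never sets the field, as an older provisioner would not.
	oldSender := &completedJob{}
	newSender := &completedJob{
		ResourceReplacements: []*resourceReplacement{{Resource: "docker_container.workspace"}},
	}
	fmt.Println(len(replacedResources(oldSender)), len(replacedResources(newSender)))
}
```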

// nolint:gocritic // Necessary to query all the required data.
ctx = dbauthz.AsSystemRestricted(ctx)
// Since this may be called in a fire-and-forget fashion, we need to give up at some point.
trackCtx, trackCancel := context.WithTimeout(ctx, time.Minute)
Contributor Author

This is a best-effort attempt to warn operators of this situation; it's OK if it times out, because we'll still get a log to trace it with.

@@ -258,7 +258,7 @@ func getStateFilePath(workdir string) string {
}

// revive:disable-next-line:flag-parameter
-func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink, destroy bool) (*proto.PlanComplete, error) {
+func (e *executor) plan(ctx, killCtx context.Context, env, vars []string, logr logSink, metadata *proto.Metadata) (*proto.PlanComplete, error) {
Contributor Author

This is the meat of the feature; everything else is just plumbing between system and user eyeball.

level := proto.LogLevel_INFO

// Terraform indicates that a resource will be deleted and recreated by showing the change along with this substring.
if bytes.Contains(line, []byte("# forces replacement")) {
Contributor Author

A bit flimsy; open to other ideas.
In any case, this is just sugar. The fact that the plan, with all its drift details, is shown will be sufficient. Highlighting the lines is just a courtesy to the user.
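
A standalone sketch of the substring check, for illustration only. The `logLevel` type is a hypothetical stand-in for the provisioner's log level enum, and raising matching lines to a warning level is an assumption; the snippet above shows only the `INFO` default and the `bytes.Contains` test.

```go
package main

import (
	"bytes"
	"fmt"
)

// logLevel is a hypothetical stand-in for the provisioner's log level enum.
type logLevel int

const (
	levelInfo logLevel = iota
	levelWarn
)

// forcesReplacement is the marker Terraform prints next to a change that
// requires the resource to be deleted and recreated.
var forcesReplacement = []byte("# forces replacement")

// levelForLine returns levelWarn for plan lines carrying the marker (an
// assumed choice of level) and levelInfo for everything else.
func levelForLine(line []byte) logLevel {
	if bytes.Contains(line, forcesReplacement) {
		return levelWarn
	}
	return levelInfo
}

func main() {
	// Hand-written sample lines, not captured Terraform output.
	lines := [][]byte{
		[]byte(`  ~ name = "old" -> "new" # forces replacement`),
		[]byte(`  ~ memory = 512 -> 1024`),
	}
	for _, line := range lines {
		fmt.Printf("warn=%t %s\n", levelForLine(line) == levelWarn, line)
	}
}
```

Matching on human-readable output is sensitive to wording changes in Terraform, which is presumably why the comment calls it flimsy.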


// TrackResourceReplacement handles a pathological situation whereby a terraform resource is replaced due to drift,
// which can obviate the whole point of pre-provisioning a prebuilt workspace.
// See more detail at https://coder.com/docs/admin/templates/extending-templates/prebuilt-workspaces.md#preventing-resource-replacement.
Contributor Author

Depends on #17580

If you’re using prebuilds to speed up provisioning, unexpected replacements will slow down
workspace startup—even when claiming a prebuilt environment.

For tips on preventing replacements and improving claim performance, see [this guide](https://coder.com/docs/admin/templates/extending-templates/prebuilt-workspaces.md#preventing-resource-replacement).
Contributor Author

Depends on #17580

@@ -13,43 +16,65 @@ import (
"github.com/coder/coder/v2/coderd/prebuilds"
)

const (
Contributor Author

Muddies the purpose of the PR a bit, but it was a worthwhile drive-by refactor given that we're adding a new metric (MetricResourceReplacementsCount) and need to check its value in a test.

@dannykopping dannykopping changed the title WIP! feat: log resource replacements when claiming a prebuilt workspace feat: track resource replacements when claiming a prebuilt workspace May 8, 2025
Signed-off-by: Danny Kopping <[email protected]>
also a test that was broken from an earlier fix

Signed-off-by: Danny Kopping <[email protected]>
@dannykopping dannykopping force-pushed the dk/logreplacements branch from 13168f4 to 70f9a53 Compare May 8, 2025 12:48
@dannykopping dannykopping marked this pull request as ready for review May 8, 2025 12:59
@@ -42,6 +42,11 @@ FROM templates t
WHERE tvp.desired_instances IS NOT NULL -- Consider only presets that have a prebuild configuration.
AND (t.id = sqlc.narg('template_id')::uuid OR sqlc.narg('template_id') IS NULL);

-- name: GetTemplatePresetsByID :one
Contributor

  • should the name be singular instead of plural?
  • should it move to presets.sql?
  • there is a similar query, -- name: GetPresetByID :one, which you may consider reusing

Contributor Author

Thanks! I missed that.

@@ -75,6 +75,7 @@ message CompletedJob {
repeated provisioner.Resource resources = 2;
repeated provisioner.Timing timings = 3;
repeated provisioner.Module modules = 4;
repeated provisioner.ResourceReplacement resourceReplacements = 5;
Contributor

We discussed this on Slack, but these changes need to bump the minor version, unless it was already bumped since the last release. In that case, update the comment describing the version bump, but don't bump it twice in a single release.

Contributor Author

@dannykopping dannykopping May 9, 2025

There's also prebuild_claim_for_user_id in provisionersdk/proto/provisioner.proto, but I'm trying to see if I can remove it in favour of passing down whether the workspace-prebuilds experiment is in use, since it's effectively just being used as a control flag for that in provisioner/terraform/executor.go.

Development

Successfully merging this pull request may close these issues.

Inform template admins that resources will be replaced
3 participants