
feat: reinitialize agents when a prebuilt workspace is claimed #17475


Open: wants to merge 31 commits into main

Conversation

SasSwart (Contributor):

This pull request allows coder workspace agents to be reinitialized when a prebuilt workspace is claimed by a user. This facilitates the transfer of ownership between the anonymous prebuilds system user and the new owner of the workspace.

Only a single agent per prebuilt workspace is supported for now, but the plumbing is already in place to facilitate a seamless transition to multi-agent support.
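As a rough illustration of the single-agent flow this PR describes (all names here are hypothetical stand-ins, not the actual coder API): when a prebuilt workspace is claimed, the one running agent is sent a reinitialization event identifying the workspace and the reason.

```go
package main

import "fmt"

// ReinitEvent is an illustrative stand-in for the event delivered to a
// running agent when its prebuilt workspace is claimed by a user.
type ReinitEvent struct {
	WorkspaceID string
	Reason      string // e.g. "prebuild_claimed"
}

// notifyAgent models the single-agent case supported by this PR: the one
// agent in the workspace receives the reinitialization event.
func notifyAgent(ch chan<- ReinitEvent, workspaceID string) {
	ch <- ReinitEvent{WorkspaceID: workspaceID, Reason: "prebuild_claimed"}
}

func main() {
	events := make(chan ReinitEvent, 1)
	notifyAgent(events, "ws-123")
	ev := <-events
	fmt.Println(ev.Reason) // prints "prebuild_claimed"
}
```

Multi-agent support would extend this by fanning the event out to one channel per agent, which is the plumbing the description alludes to.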

@SasSwart SasSwart changed the title WIP: agent reinitialization feat: reinitialize agents when a prebuilt workspace is claimed Apr 21, 2025
@SasSwart SasSwart force-pushed the jjs/prebuilds-agent-reinit branch from 35e4bf8 to 18da76e Compare April 23, 2025 13:49
@evgeniy-scherbina evgeniy-scherbina force-pushed the yevhenii/512-claim-prebuild branch from fe569d4 to fcdbba8 Compare April 23, 2025 15:23
@SasSwart SasSwart force-pushed the jjs/prebuilds-agent-reinit branch from cc25406 to 26dbc3a Compare April 24, 2025 12:34
Base automatically changed from yevhenii/512-claim-prebuild to main April 24, 2025 13:39
@SasSwart SasSwart force-pushed the jjs/prebuilds-agent-reinit branch from ec9ed29 to 362db7c Compare April 25, 2025 08:32
Comment on lines 1836 to 1845
// Complete the job, optionally triggering workspace agent reinitialization:

completedJob := proto.CompletedJob{
	JobId: job.ID.String(),
	Type: &proto.CompletedJob_WorkspaceBuild_{
		WorkspaceBuild: &proto.CompletedJob_WorkspaceBuild{},
	},
}
_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)
Contributor:

Why is this necessary?

Contributor Author:

can you elaborate by defining 'this'?

Contributor:

Does CompleteJob need to occur for the test to work?

Contributor Author:

ah. CompleteJob is the thing that publishes to the channel. It's the entire point of this test, yes. If CompleteJob ever stops publishing, this test should fail.

}

go func() {
<-ctx.Done()
Contributor:

How come we wait for this to complete before cancelling? To protect messages from being passed to workspaceClaims after we close the channel.

Contributor Author:

we expect the user to defer cancel() where they call this. But we can't enforce that. As a defensive measure to protect against leaking the goroutine and pubsub connection, as well as the channel, we call cancel() ourselves when the context expires.

}

func StreamAgentReinitEvents(ctx context.Context, logger slog.Logger, rw http.ResponseWriter, r *http.Request, reinitEvents <-chan agentsdk.ReinitializationEvent) {
sseSendEvent, sseSenderClosed, err := httpapi.ServerSentEventSender(rw, r)
Contributor:

We can fix this later, but I think this might be inappropriate at this layer.
This seems like a detail of the HTTP API, and we aren't testing this directly so I'd argue this code could go back to coderd/workspaceagents.go.

Message string `json:"message"`
Reason ReinitializationReason `json:"reason"`
WorkspaceID uuid.UUID
UserID uuid.UUID
Contributor:

This bleeds the abstraction a bit because there's nothing prebuilds-specific in this func. The workspace's agent is what reinitializes, and the user should be irrelevant here. Do we need this?

@SasSwart SasSwart marked this pull request as ready for review May 1, 2025 19:34
@@ -294,7 +299,7 @@ message Metadata {
string workspace_owner_login_type = 18;
repeated Role workspace_owner_rbac_roles = 19;
bool is_prebuild = 20;
string running_workspace_agent_token = 21;
Contributor Author:

self-review: it concerns me that running_workspace_agent_token made it into main. I'd love to undo that before this is released but it might already be too late.

Contributor Author:

running_workspace_agent_token is defunct and meant to be replaced by running_agent_auth_tokens.
The former only supports a single agent. The latter supports multiple agents.
Nothing will break if the former stays, but it pollutes the API.

Contributor:

Looks like it was added to main in #16951 but hasn't been released, so I guess it's fine to change the type of this index. Still, as Cian mentioned, you need to increment the protocol version in provisionerd/proto/version.go.

Is this change actually backward-compatible, though? Like, what will an external provisionerd on an old version that doesn't understand this new field end up doing? I think it will fail to set the appropriate environment variable, and then the terraform provider will generate a new token for the agent, which might invalidate the prebuild or cause the existing agent to get disconnected. Not good.

Comment on lines +275 to +279
message RunningAgentAuthToken {
string agent_id = 1;
string token = 2;
}

Member:

When adding a new field, you need to increment proto.CurrentVersion. We tend to add a comment as well explaining the versioning history. See provisionerd/proto/version.go.

Comment on lines +276 to +286
tokens := metadata.GetRunningAgentAuthTokens()
if len(tokens) == 1 {
	env = append(env, provider.RunningAgentTokenEnvironmentVariable("")+"="+tokens[0].Token)
} else {
	// Not currently supported, but added for forward-compatibility.
	for _, t := range tokens {
		// If there are multiple agents, provide all the tokens to terraform
		// so that it can choose the correct one for each agent ID.
		env = append(env, provider.RunningAgentTokenEnvironmentVariable(t.AgentId)+"="+t.Token)
	}
}
Member:

Shouldn't changes here also introduce tests in provisioner_test.go?
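A test along the lines the reviewer suggests could extract the branching into a pure function and assert on its output. Everything below is an illustrative sketch: RunningAgentAuthToken is restated as a plain struct, and agentTokenEnv is a hypothetical stand-in for provider.RunningAgentTokenEnvironmentVariable (the real env-var naming scheme may differ).

```go
package main

import "fmt"

// RunningAgentAuthToken mirrors the proto message added in this PR.
type RunningAgentAuthToken struct {
	AgentID string
	Token   string
}

// agentTokenEnv is a hypothetical naming scheme used only for this sketch;
// the real name comes from provider.RunningAgentTokenEnvironmentVariable.
func agentTokenEnv(agentID string) string {
	if agentID == "" {
		return "CODER_RUNNING_WORKSPACE_AGENT_TOKEN"
	}
	return "CODER_RUNNING_WORKSPACE_AGENT_TOKEN_" + agentID
}

// buildTokenEnv reproduces the branching under review: a single agent gets
// the unsuffixed variable, multiple agents each get an ID-suffixed variable.
func buildTokenEnv(tokens []RunningAgentAuthToken) []string {
	var env []string
	if len(tokens) == 1 {
		env = append(env, agentTokenEnv("")+"="+tokens[0].Token)
	} else {
		for _, t := range tokens {
			env = append(env, agentTokenEnv(t.AgentID)+"="+t.Token)
		}
	}
	return env
}

func main() {
	fmt.Println(buildTokenEnv([]RunningAgentAuthToken{{AgentID: "a1", Token: "t1"}}))
}
```

A table-driven test over the zero, one, and many-agent cases would cover the behavior the reviewer is asking about.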

defer srv.Close()

requestCtx := testutil.Context(t, testutil.WaitShort)
req, err := http.NewRequestWithContext(requestCtx, "GET", srv.URL, nil)
Member:

Why not use srv.Client() here?

go func() {
for retrier := retry.New(100*time.Millisecond, 10*time.Second); retrier.Wait(ctx); {
logger.Debug(ctx, "waiting for agent reinitialization instructions")
reinitEvent, err := client.WaitForReinit(ctx)
Contributor:

We have this SSE endpoint that can stream multiple events, so why are we hanging up after the first event just to redial the endpoint?

Contributor Author:

Good point, but would you mind disregarding this given that we're going to replace this with the manifest stream by the next release?

Contributor:

fair enough

@@ -1733,6 +1744,21 @@ func (s *server) CompleteJob(ctx context.Context, completed *proto.CompletedJob)
if err != nil {
return nil, xerrors.Errorf("update workspace: %w", err)
}

if input.PrebuildClaimedByUser != uuid.Nil {
Contributor:

It feels narrow and fragile that we need to pass this imperative direction to provisionerdserver just so it sends a signal specifically for prebuilds. Knowing that a build completed for a given workspace seems like a generally useful signal, so why not always send it?

Even if we wanted to send the new owner ID in the message, there is no need to get it from the job input, we've already queried the workspace.

@spikecurtis (Contributor) left a comment:

The biggest remaining issue is around backward compatibility for old versions of the provisionerd server.

a[0].Scripts = []*proto.Script{
{
DisplayName: "Prebuild Test Script",
Script: fmt.Sprintf("sleep 5; printenv | grep 'CODER_AGENT_TOKEN' >> %s; echo '---\n' >> %s", tempAgentLog.Name(), tempAgentLog.Name()), // Make reinitialization take long enough to assert that it happened
Contributor:

Writing the token doesn't seem like a good test, since the token is not supposed to change between reinits. Can't we write the owner name or ID instead?

Valid: true,
Time: time.Now(),
},
})
Contributor:

Either you are running a provisioner daemon, in which case it should be responsible for completing the build job, or you are not, in which case you need something else to send the pubsub kick that the build is done.

_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)

select {
Contributor:

cf. testutil.RequireReceive

slog.F("workspace_id", workspace.ID), slog.Error(err))
}
for _, agent := range agents {
runningAgentAuthTokens = append(runningAgentAuthTokens, &sdkproto.RunningAgentAuthToken{
Contributor:

Missing a test for this functionality.

4 participants