feat: reinitialize agents when a prebuilt workspace is claimed #17475
Conversation
```go
// Complete the job, optionally triggering workspace agent reinitialization:
completedJob := proto.CompletedJob{
	JobId: job.ID.String(),
	Type: &proto.CompletedJob_WorkspaceBuild_{
		WorkspaceBuild: &proto.CompletedJob_WorkspaceBuild{},
	},
}
_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)
```
Why is this necessary?
Can you elaborate by defining 'this'?
Does CompleteJob need to occur for the test to work?
Ah, CompleteJob is the thing that publishes to the channel. It's the entire point of this test, yes. If CompleteJob ever stops publishing, this should fail.
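For context, a minimal sketch of the contract this test pins down, reusing the surrounding test's `ps` pubsub and `completedJob`; the channel name `"prebuild_claims"` is a stand-in, not the real constant:

```go
// Subscribe before completing the job; CompleteJob is expected to publish
// the claim event, so the listener below must fire.
claims := make(chan []byte, 1)
cancelSub, err := ps.Subscribe("prebuild_claims", func(_ context.Context, msg []byte) { // hypothetical channel name
	claims <- msg
})
require.NoError(t, err)
defer cancelSub()

_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)

// If CompleteJob ever stops publishing, this receive never fires and the test fails.
select {
case <-claims:
case <-ctx.Done():
	t.Fatal("no claim event was published")
}
```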
```go
}

go func() {
	<-ctx.Done()
```
How come we wait for this to complete before cancel-ing? To protect messages from being passed to workspaceClaims after we close the channel.
We expect the user to defer cancel() where they call this, but we can't enforce that. As a defensive measure to protect against leaking the goroutine and pubsub connection, as well as the channel, we call cancel() ourselves when the context expires.
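As a sketch of that defensive pattern (an assumed shape, not the exact code in this PR): the goroutine ties the subscription and channel lifetimes to the context, so resources are reclaimed even if the caller forgets to cancel. The `"prebuild_claims"` channel name is hypothetical:

```go
import (
	"context"

	"github.com/google/uuid"

	"github.com/coder/coder/v2/coderd/database/pubsub"
)

func watchClaims(ctx context.Context, ps pubsub.Pubsub) (<-chan uuid.UUID, func(), error) {
	ctx, cancel := context.WithCancel(ctx)
	claims := make(chan uuid.UUID, 1)
	cancelSub, err := ps.Subscribe("prebuild_claims", func(_ context.Context, msg []byte) { // hypothetical channel name
		id, err := uuid.ParseBytes(msg)
		if err != nil {
			return
		}
		select {
		case claims <- id:
		case <-ctx.Done():
		}
	})
	if err != nil {
		cancel()
		return nil, nil, err
	}
	go func() {
		// Fires on the caller's cancel() or when the parent context expires.
		<-ctx.Done()
		// Stop the subscription first (assumed to wait for in-flight
		// listeners) so nothing sends on a closed channel afterwards.
		cancelSub()
		close(claims)
	}()
	return claims, cancel, nil
}
```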
coderd/prebuilds/claim.go (outdated diff)
```go
}

func StreamAgentReinitEvents(ctx context.Context, logger slog.Logger, rw http.ResponseWriter, r *http.Request, reinitEvents <-chan agentsdk.ReinitializationEvent) {
	sseSendEvent, sseSenderClosed, err := httpapi.ServerSentEventSender(rw, r)
```
We can fix this later, but I think this might be inappropriate at this layer. This seems like a detail of the HTTP API, and since we aren't testing this directly, I'd argue this code could go back to coderd/workspaceagents.go.
```go
Message     string                 `json:"message"`
Reason      ReinitializationReason `json:"reason"`
WorkspaceID uuid.UUID
UserID      uuid.UUID
```
This bleeds the abstraction a bit because there's nothing prebuilds-specific in this func. The workspace's agent is what reinitializes, and the user should be irrelevant here. Do we need this?
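In other words, something closer to this would keep the event generic (a sketch of the suggestion above, not a final shape):

```go
// ReinitializationEvent tells an agent to reinitialize. Nothing here is
// prebuilds-specific: any workspace build could trigger it.
type ReinitializationEvent struct {
	Message     string                 `json:"message"`
	Reason      ReinitializationReason `json:"reason"`
	WorkspaceID uuid.UUID              `json:"workspace_id"`
	// UserID dropped: the agent reinitializes the same way regardless of
	// who claimed the workspace.
}
```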
```diff
@@ -294,7 +299,7 @@ message Metadata {
   string workspace_owner_login_type = 18;
   repeated Role workspace_owner_rbac_roles = 19;
   bool is_prebuild = 20;
   string running_workspace_agent_token = 21;
```
Self-review: it concerns me that running_workspace_agent_token made it into main. I'd love to undo that before this is released, but it might already be too late.
running_workspace_agent_token is defunct and meant to be replaced by running_agent_auth_tokens. The former only supports a single agent; the latter supports multiple agents. Nothing will break if the former stays, but it pollutes the API.
Looks like it was added to main in #16951 but hasn't been released, so I guess it's fine to change the type of this index. Still, as Cian mentioned, you need to increment the protocol version in provisionerd/proto/version.go.

Is this change actually backwards-compatible, though? What will an external provisionerd on an old version that doesn't understand this new field end up doing? I think it will fail to set the appropriate environment variable, and then the terraform provider will generate a new token for the agent, which might invalidate the prebuild or cause the existing agent to get disconnected. Not good.
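One hedged way to bridge the transition would be to populate both fields during provisioning, so an older external provisionerd keeps reading the deprecated single-token field while newer ones prefer the repeated field (a sketch; Go field names follow from the proto definitions above):

```go
// Sketch: feed both fields. Old provisioners only understand the
// deprecated single-token field; new ones read the repeated field.
metadata := &sdkproto.Metadata{
	// ... other fields elided ...
	RunningAgentAuthTokens: runningAgentAuthTokens,
}
if len(runningAgentAuthTokens) == 1 {
	//nolint:staticcheck // Deprecated field kept for backwards compatibility.
	metadata.RunningWorkspaceAgentToken = runningAgentAuthTokens[0].Token
}
```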
```proto
message RunningAgentAuthToken {
  string agent_id = 1;
  string token = 2;
}
```
When adding a new field, you need to increment proto.CurrentVersion. We tend to add a comment as well explaining the versioning history. See provisionerd/proto/version.go.
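For reference, that file follows roughly this shape (a sketch from memory; check the file for the real constants and history, and the history entry below is hypothetical):

```go
package proto

import "github.com/coder/coder/v2/apiversion"

// Version history:
//
// API v1.4:
//   - Add RunningAgentAuthTokens to provisioner job metadata
//     (hypothetical entry for this PR).
const (
	CurrentMajor = 1
	CurrentMinor = 4
)

var CurrentVersion = apiversion.New(CurrentMajor, CurrentMinor)
```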
```go
tokens := metadata.GetRunningAgentAuthTokens()
if len(tokens) == 1 {
	env = append(env, provider.RunningAgentTokenEnvironmentVariable("")+"="+tokens[0].Token)
} else {
	// Not currently supported, but added for forward compatibility.
	// If there are multiple agents, provide all the tokens to terraform so
	// that it can choose the correct one for each agent ID.
	for _, t := range tokens {
		env = append(env, provider.RunningAgentTokenEnvironmentVariable(t.AgentId)+"="+t.Token)
	}
}
```
Shouldn't changes here also introduce tests in provisioner_test.go?
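Something along these lines could pin the behavior down (a sketch; `provisionEnv` is a hypothetical helper wrapping the code above, not an existing function):

```go
import (
	"testing"

	"github.com/stretchr/testify/require"

	"github.com/coder/coder/v2/provisionersdk/proto"
	"github.com/coder/terraform-provider-coder/v2/provider"
)

func TestRunningAgentAuthTokenEnv(t *testing.T) {
	t.Parallel()

	// Single agent: the token should be exported under the un-suffixed variable.
	md := &proto.Metadata{RunningAgentAuthTokens: []*proto.RunningAgentAuthToken{
		{AgentId: "agent-1", Token: "tok-1"},
	}}
	env := provisionEnv(md) // hypothetical helper wrapping the snippet above
	require.Contains(t, env, provider.RunningAgentTokenEnvironmentVariable("")+"=tok-1")

	// Multiple agents: each token should be keyed by its agent ID.
	md.RunningAgentAuthTokens = append(md.RunningAgentAuthTokens,
		&proto.RunningAgentAuthToken{AgentId: "agent-2", Token: "tok-2"})
	env = provisionEnv(md)
	require.Contains(t, env, provider.RunningAgentTokenEnvironmentVariable("agent-1")+"=tok-1")
	require.Contains(t, env, provider.RunningAgentTokenEnvironmentVariable("agent-2")+"=tok-2")
}
```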
```go
defer srv.Close()

requestCtx := testutil.Context(t, testutil.WaitShort)
req, err := http.NewRequestWithContext(requestCtx, "GET", srv.URL, nil)
```
Why not use srv.Client() here?
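For reference, httptest's built-in client is already configured for the test server (including its TLS certificates when using httptest.NewTLSServer); this is standard-library behavior, not PR-specific:

```go
req, err := http.NewRequestWithContext(requestCtx, http.MethodGet, srv.URL, nil)
require.NoError(t, err)
// srv.Client() returns an *http.Client wired to trust the test server,
// so no hand-rolled client or transport is needed.
resp, err := srv.Client().Do(req)
require.NoError(t, err)
defer resp.Body.Close()
```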
```go
go func() {
	for retrier := retry.New(100*time.Millisecond, 10*time.Second); retrier.Wait(ctx); {
		logger.Debug(ctx, "waiting for agent reinitialization instructions")
		reinitEvent, err := client.WaitForReinit(ctx)
```
We have this SSE endpoint that can stream multiple events, so why are we hanging up after the first event just to redial the endpoint?
Good point, but would you mind disregarding this given that we're going to replace this with the manifest stream by the next release?
Fair enough.
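For the record, the suggested shape would keep one connection open and drain every event from it, rather than redialing per event. A sketch with an assumed streaming API; WaitForReinitStream does not exist in this PR and stands in for such a variant:

```go
import (
	"context"

	"github.com/coder/coder/v2/codersdk/agentsdk"
)

func consumeReinitEvents(ctx context.Context, client *agentsdk.Client, handle func(agentsdk.ReinitializationEvent)) error {
	// Hypothetical streaming variant of WaitForReinit: one SSE connection,
	// every event delivered on a channel until the context is cancelled.
	events, closer, err := client.WaitForReinitStream(ctx)
	if err != nil {
		return err
	}
	defer closer.Close()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-events:
			if !ok {
				return nil // server closed the stream
			}
			handle(ev)
		}
	}
}
```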
```diff
@@ -1733,6 +1744,21 @@ func (s *server) CompleteJob(ctx context.Context, completed *proto.CompletedJob)
 	if err != nil {
 		return nil, xerrors.Errorf("update workspace: %w", err)
 	}
+
+	if input.PrebuildClaimedByUser != uuid.Nil {
```
It feels narrow and fragile that we need to pass this imperative direction to provisionerdserver to send a signal specifically for prebuilds. It seems like a generally useful signal to know that a build completed for a given workspace, so why not just always send it?

Even if we wanted to send the new owner ID in the message, there's no need to get it from the job input; we've already queried the workspace.
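A sketch of the "always send it" variant; the payload type and channel helper are stand-in names, not code from this PR:

```go
// Inside CompleteJob, after the workspace row has been updated: publish
// unconditionally, and let interested subscribers (prebuilds included)
// filter on the payload.
payload, err := json.Marshal(workspaceBuildCompleted{ // hypothetical payload type
	WorkspaceID: workspace.ID,
	OwnerID:     workspace.OwnerID, // already queried; no need to read the job input
})
if err != nil {
	return nil, xerrors.Errorf("marshal build-completed event: %w", err)
}
if err := s.Pubsub.Publish(workspaceBuildCompletedChannel(workspace.ID), payload); err != nil { // hypothetical channel helper
	return nil, xerrors.Errorf("publish build-completed event: %w", err)
}
```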
The biggest remaining issue is around backwards compatibility for old versions of the provisionerd server.
```go
a[0].Scripts = []*proto.Script{
	{
		DisplayName: "Prebuild Test Script",
		// Make reinitialization take long enough to assert that it happened.
		Script: fmt.Sprintf("sleep 5; printenv | grep 'CODER_AGENT_TOKEN' >> %s; echo '---\n' >> %s", tempAgentLog.Name(), tempAgentLog.Name()),
```
Writing the token doesn't seem like a good test since the token is not supposed to change between reinits. Can't we write the owner name or ID?
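i.e. something like the following, assuming the template exposes an owner variable to the agent environment (CODER_WORKSPACE_OWNER is illustrative here, not a confirmed variable name):

```go
// Record a value that actually changes on claim: the workspace owner.
Script: fmt.Sprintf("sleep 5; printenv | grep 'CODER_WORKSPACE_OWNER' >> %s; echo '---' >> %s",
	tempAgentLog.Name(), tempAgentLog.Name()),
```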
```go
		Valid: true,
		Time:  time.Now(),
	},
})
```
Either you are running a provisioner daemon, in which case it should be responsible for completing the build job, or you are not, in which case you need something else to send the pubsub kick that the build is done.
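In a test without a provisioner daemon, that kick could be sent by hand (a sketch; the channel helper is a stand-in for whatever coderd actually subscribes to):

```go
// No provisioner daemon in this test, so emulate the "build done" signal
// that CompleteJob would otherwise produce.
err := ps.Publish(prebuildClaimChannel(workspace.ID), []byte(user.ID.String())) // hypothetical channel helper
require.NoError(t, err)
```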
```go
_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)

select {
```
cf. testutil.RequireReceive.
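i.e. the select block collapses to a one-liner (the signature is assumed from the testutil package's generic receive helper):

```go
// Equivalent to the hand-rolled select, with a uniform timeout and
// failure message supplied by the helper.
got := testutil.RequireReceive(ctx, t, workspaceClaims)
_ = got // assert on the received value as needed
```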
```go
	slog.F("workspace_id", workspace.ID), slog.Error(err))
}
for _, agent := range agents {
	runningAgentAuthTokens = append(runningAgentAuthTokens, &sdkproto.RunningAgentAuthToken{
```
Missing a test for this functionality.
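A minimal assertion could look like this (a sketch; the accessor path on the acquired job is an assumption, and the setup is whatever the surrounding test file already uses):

```go
// After building a prebuilt workspace with one agent, the job metadata
// handed to the provisioner should carry that agent's auth token.
md := acquiredJob.GetWorkspaceBuild().GetMetadata() // assumed accessor path
tokens := md.GetRunningAgentAuthTokens()
require.Len(t, tokens, 1)
require.Equal(t, agent.ID.String(), tokens[0].AgentId)
require.NotEmpty(t, tokens[0].Token)
```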
This pull request allows Coder workspace agents to be reinitialized when a prebuilt workspace is claimed by a user. This facilitates the transfer of ownership from the anonymous prebuilds system user to the new owner of the workspace.

Only a single agent per prebuilt workspace is supported for now, but the plumbing has already been done to facilitate a seamless transition to multi-agent support.