feat: Allow running standalone provisioner daemons #3563
This Pull Request is becoming stale. In order to minimize WIP, prevent merge conflicts and keep the tracker readable, I'm going to close this PR in 3 days if there isn't more activity.
Awesome job on this PR! In general I think it looks really good, despite the number of comments I left.
return exitErr
},
}
defaultCacheDir := filepath.Join(os.TempDir(), "coder-cache")
Suggestion (optional): We could consider fixing #2534 (partially) here too? Something like adrg/xdg and using xdg.CacheHome could work (I only took a quick look, and it seems pretty fully-featured).
Then again, perhaps we should do a more thorough fix all at once (i.e. respect XDG elsewhere too, like config).
Great suggestion, but I think this changeset is already kind of big and sprawling as it is (it fixes four separate issues) so I would lean toward doing that in a separate PR.
require.ErrorIs(t, err, context.Canceled, "provisioner command terminated with error")
}()

ctx, cancelFunc := context.WithCancel(context.Background())
Consider allowing tests to time out individually, e.g.:
- ctx, cancelFunc := context.WithCancel(context.Background())
+ ctx, cancelFunc := context.WithTimeout(context.Background(), testutil.WaitLong)
@@ -203,7 +203,8 @@ CREATE TABLE provisioner_daemons (
 created_at timestamp with time zone NOT NULL,
 updated_at timestamp with time zone,
 name character varying(64) NOT NULL,
- provisioners provisioner_type[] NOT NULL
+ provisioners provisioner_type[] NOT NULL,
+ auth_token uuid
What's the use-case for allowing auth_token to be NULL? Do we want to be able to revoke an auth token without deleting the provisioner daemon? Maybe we want to look up what provisioner daemon had a specific auth token at some point? In that case it could make sense to have a deleted bool field instead of a nullable auth token.
Maybe auth tokens should be renewable too? That could also work as a delete/create using the same name (would also leave a "trace" about which auth token has produced what builds).
Another alternative would be to have provisioner_daemon_auth_tokens with fields like daemon_id, token, granted, revoked.
Just putting out some ideas, since I wasn't sure of the purpose of the nullability.
Whoops, this could have used a better explanation. It doesn't have anything to do with changing or revoking auth tokens.
With this change, we support both in-process and out-of-process provisioners, both of which are persisted in the database. An in-process provisioner has a NULL auth_token to represent the fact that external connections are not allowed to "become" that provisioner. That seemed cleaner than assigning a token that would never be used.
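A rough sketch of that rule, in case it helps the discussion. The provisionerDaemon struct and acceptsToken method below are simplified, hypothetical stand-ins (a string pointer instead of a UUID type), not the types in this PR.

```go
package main

import "fmt"

// provisionerDaemon is a simplified, hypothetical model of a row in
// provisioner_daemons. A nil AuthToken marks an in-process provisioner.
type provisionerDaemon struct {
	Name      string
	AuthToken *string
}

// acceptsToken reports whether an external connection presenting token
// may act as this daemon. In-process daemons never accept any token.
func (d provisionerDaemon) acceptsToken(token string) bool {
	return d.AuthToken != nil && *d.AuthToken == token
}

func main() {
	tok := "11111111-1111-4111-8111-111111111111"
	external := provisionerDaemon{Name: "external", AuthToken: &tok}
	inProcess := provisionerDaemon{Name: "in-process"}

	fmt.Println(external.acceptsToken(tok))  // true
	fmt.Println(inProcess.acceptsToken(tok)) // false
	fmt.Println(inProcess.acceptsToken(""))  // false
}
```

The point of the nil check is that no value an external client sends, including an empty one, can match an in-process daemon.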
api.websocketWaitMutex.Unlock()
defer api.websocketWaitGroup.Done()

conn, err := websocket.Accept(rw, r, &websocket.AcceptOptions{
Do consider that r.Context() is invalid from this point onwards (due to http Hijack). So if it's relied upon for connection closure / cancellation, it will not work.
You can consider using func websocketNetConn to rewire the context below instead of websocket.NetConn(...).
}

errCh := make(chan error, 1)
provisionerDaemon, err := newProvisionerDaemon(ctx, client.ListenProvisionerDaemon, logger, cacheDir, errCh, useEchoProvisioner)
I'm not saying it has to happen here, but since the provisioner has a name, consider using filepath.Join(cacheDir, "provisionerd", name). Perhaps in newProvisionerDaemon.
This will allow multiple provisioners to run on the same machine without potentially breaking terraform init.
return user
}

// ExtractWorkspaceAgent requires authentication using a valid provisioner token.
- // ExtractWorkspaceAgent requires authentication using a valid provisioner token.
+ // ExtractProvisionerDaemon requires authentication using a valid provisioner token.
}
token, err := uuid.Parse(cookie.Value)
if err != nil {
httpapi.Write(rw, http.StatusUnauthorized, codersdk.Response{
I'd consider this a bad request, but perhaps there's a reason it's unauthorized?
- httpapi.Write(rw, http.StatusUnauthorized, codersdk.Response{
+ httpapi.Write(rw, http.StatusBadRequest, codersdk.Response{
This is mostly just for consistency with ExtractWorkspaceAgent, but I think a 401 error is reasonable here.
From a client's point of view, a token is an opaque string, and our implementation happens to generate tokens that look like UUIDs. Since we return a 401 error if the cookie is a valid UUID but isn't a token that exists in the DB, it makes sense to return the same error if it's not a valid UUID (and therefore can't be a valid token).
if err != nil {
if errors.Is(err, sql.ErrNoRows) {
httpapi.Write(rw, http.StatusUnauthorized, codersdk.Response{
Message: "Provisioner token is invalid.",
This message is a bit misleading: the token is valid (in format), but not registered, or revoked. Simply saying "Forbidden." could work.
Like my other comment, this is basically just mimicking the way we handle agent tokens. I think if we get a token that isn't equal to the auth_token of a valid provisioner, we should generate the same error message regardless of whether it happens to be formatted like a UUID.
But I'm totally open to changing the wording if there's a better way to describe that situation than "invalid".
In the context of auth, I think "invalid" encompasses both format and non-format problems with the credential. Wording here is fine IMO
@spikecurtis I agree from a security perspective (don't reveal too much), but from a usability perspective I think it could be more helpful to the user. But I'm fine with either or.
}
conn, res, err := websocket.Dial(ctx, serverURL.String(), &websocket.DialOptions{
HTTPClient: httpClient,
// Need to disable compression to avoid a data-race.
This isn't your code, but I wonder if anyone knows what the data race is with compression enabled? 😄
return err
}

if provisionerDaemon.AuthToken == nil {
Kind of related to my other question about nullability of the auth_token field, but why would this ever be allowed to happen during create?
Yeah, it shouldn't be possible. A null auth_token would indicate that the provisioner was incorrectly registered as "in-process" rather than "out-of-process". But if that does somehow happen, this is just a sanity check so that we generate a meaningful error message rather than panicking.
Could we make it so the API errors instead (and doesn't return a nullable UUID)? (I think that would be nicer for consumers in general.)
Frontend ✅
Just a general question. If coderd or the provisionerd go down, do they reconnect?
}

func (api *API) postProvisionerDaemon(rw http.ResponseWriter, r *http.Request) {
// Create the user on the site.
Incorrect comment:
// Create the user on the site.
@@ -141,6 +141,9 @@ func (p *Server) connect(ctx context.Context) {
if p.isClosed() {
return
}

p.sendConnectRequest(ctx)
Can this fail? If it does, is it ok to just go to the for loop?
Tried it and it worked like a charm (both as a separate process and on a different machine). Some feedback/questions. Don't have to be addressed in this PR.
Wondering if we should show another message instead of:
codertester@coder-v2:/tmp/docker$ coder templates create
> Create and upload "/tmp/docker"? (yes/no) yes
⧗ Queued
Is there a way to "assign" a workspace to a provisioner daemon? Some use cases in mind
Can we add a
This PR adds support for running out-of-process provisioner daemon instances, authenticated by tokens (basically the same way we authenticate workspace agents).
Example usage
Other notable things about this change:
- ProvisionerDaemon API resources now have an auth_token field, which is null for in-process provisioners. External connections must include a non-null token in the session_token cookie. To avoid leaking tokens, we now prevent non-owner users from listing provisioner daemons through the API.
- [...] coder server using ./scripts/develop.sh -- <flags...>.
- [...] echo provisioner when running tests, but not when running in production, the provisioner daemon now performs a Connect RPC as its first action on each connection to coderd. This message is used to register the set of supported provisioner names and store them in the provisioner_daemons.provisioners database field.
- The provisioner_daemons.updated_at field, which was previously never set, is now updated on every provisioner daemon connection.

Fixes #1391, fixes #1392, fixes #1393, fixes #1605