-
Notifications
You must be signed in to change notification settings - Fork 887
feat: integrate Acquirer for provisioner jobs #9717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Spike Curtis <[email protected]>
Signed-off-by: Spike Curtis <[email protected]>
Signed-off-by: Spike Curtis <[email protected]>
…-pt2-integrate-acquirer
@@ -57,400 +64,411 @@ func testUserQuietHoursScheduleStore() *atomic.Pointer[schedule.UserQuietHoursSc | |||
return ptr | |||
} | |||
|
|||
func TestAcquireJob(t *testing.T) { | |||
func TestAcquireJob_LongPoll(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To review this file, I strongly suggest "Hide Whitespace"
Signed-off-by: Spike Curtis <[email protected]>
Signed-off-by: Spike Curtis <[email protected]>
CI testing has made me aware of a race condition --- sometimes wsbuilder is called from inside a transaction, so it's not safe to publish the job posting from inside wsbuilder, since then the publish could arrive before the outer transaction commits. This issue would be solved with triggers, but that's a much bigger refactor of our database & pubsub interfaces. For now I'm going to unwind the changes to wsbuilder and have the callers post the job. |
Signed-off-by: Spike Curtis <[email protected]>
logger.Error(streamCtx, "recv error and failed to cancel acquire job", slog.Error(recvErr)) | ||
// Well, this is awkward. We hit an error receiving from the stream, but didn't cancel before we locked a job | ||
// in the database. We need to mark this job as failed so the end user can retry if they want to. | ||
err := s.Database.UpdateProvisionerJobWithCompleteByID( | ||
context.Background(), | ||
database.UpdateProvisionerJobWithCompleteByIDParams{ | ||
ID: je.job.ID, | ||
CompletedAt: sql.NullTime{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up idea: If this starts happening frequently it could be symptomatic of an underlying issue. Would it make sense to expose this as its own metric in Prometheus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Especially as part of a larger effort to add relevant provisioner metrics.
codersdk/deployment.go
Outdated
{ | ||
Name: "Poll Interval", | ||
Description: "Time to wait before polling for a new job.", | ||
Flag: "provisioner-daemon-poll-interval", | ||
Env: "CODER_PROVISIONER_DAEMON_POLL_INTERVAL", | ||
Default: time.Second.String(), | ||
Value: &c.Provisioner.DaemonPollInterval, | ||
Group: &deploymentGroupProvisioning, | ||
YAML: "daemonPollInterval", | ||
}, | ||
{ | ||
Name: "Poll Jitter", | ||
Description: "Random jitter added to the poll interval.", | ||
Flag: "provisioner-daemon-poll-jitter", | ||
Env: "CODER_PROVISIONER_DAEMON_POLL_JITTER", | ||
Default: (100 * time.Millisecond).String(), | ||
Value: &c.Provisioner.DaemonPollJitter, | ||
Group: &deploymentGroupProvisioning, | ||
YAML: "daemonPollJitter", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove these from the deployment config then it becomes an error to specify them at all. Could we keep them around for a couple releases but mark them as deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have deprecation for options that are replaced by something else. In this case there is no replacement -- these options are just not needed any more.
I don't feel like waiting a few releases to remove these options will do us much good. We don't have a good mechanism to get coder server
CLI warnings in front of operator eyeballs. It just drops some logs at start of day and they'll get buried. We're just as likely to break people when we remove them later.
I could just change the Descriptions to "This flag is deprecated and unused" and leave them in. I don't think there will ever be a better time than now to delete them.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking on this some more, this will probably only break folks who are specifying the arguments via CLI flags. I think if they're specified via environment variables they'll just get ignored. And we don't specify CLI args in our Helm chart, so the blast radius doesn't appear to be as large. And folks who are running Coder via the installation script would probably be specifying configuration knobs as environment variables in /etc/coder.d/coder.env
.
So maybe it's OK to just remove them? 🤷
Thinking outside the scope of the PR, we could have a "Deprecated options" section in the deployment healthcheck JSON that lights up red if you have deprecated options specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking outside the scope of the PR, we could have a "Deprecated options" section in the deployment healthcheck JSON that lights up red if you have deprecated options specified.
That's a good idea
Signed-off-by: Spike Curtis <[email protected]>
Signed-off-by: Spike Curtis <[email protected]>
Signed-off-by: Spike Curtis <[email protected]>
Signed-off-by: Spike Curtis <[email protected]>
…-pt2-integrate-acquirer
Fixes #9428
Some notable things to be aware of
Adds a new "AcquireJobWithCancel" RPC to replace AcquireJob. It allows asynchronous canceling of acquisition to support graceful shutdown. AcquireJob is deprecated, but kept around for deployed external provisioners. In a few versions we could consider removing it.
Refactors Provisionerd to remove polling loops.
Changes coderdtest to gracefully shut down the provisionerd. I added some new error logs that were getting triggered by non-graceful shutdowns.