
feat: integrate Acquirer for provisioner jobs #9717


Merged

spikecurtis merged 12 commits into main from spike/9428-pt2-integrate-acquirer on Sep 19, 2023

Conversation

spikecurtis (Contributor)

Fixes #9428

Some notable things to be aware of:

  • Adds a new "AcquireJobWithCancel" RPC to replace AcquireJob. It allows asynchronous cancellation of an acquisition to support graceful shutdown (a rough sketch follows this list). AcquireJob is deprecated but kept around for already-deployed external provisioners; in a few versions we could consider removing it.

  • Refactors provisionerd to remove polling loops.

  • Changes coderdtest to gracefully shut down the provisionerd, since I added some new error logs that were getting triggered by non-graceful shutdowns.
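
Below is a minimal sketch of the pattern behind the first bullet, assuming a bidirectional stream on which the daemon can send a cancel message while it waits for a job. The stream and message types (acquireStream, acquiredJob, cancelAcquire) are hypothetical stand-ins, not the actual generated provisionerd types.

```go
package provisionersketch

import (
	"context"
	"errors"
)

// Hypothetical stand-ins for the generated stream client and message types.
type acquiredJob struct{ ID string }
type cancelAcquire struct{}

type acquireStream interface {
	Send(*cancelAcquire) error
	Recv() (*acquiredJob, error)
}

var errShutdown = errors.New("provisioner daemon is shutting down")

// acquireWithCancel parks on Recv until the server assigns a job. If shutdown
// starts first, it sends a cancel message instead of holding the acquisition
// open, which is what lets the daemon stop gracefully without a polling loop.
func acquireWithCancel(ctx context.Context, stream acquireStream, shutdown <-chan struct{}) (*acquiredJob, error) {
	jobs := make(chan *acquiredJob, 1)
	errs := make(chan error, 1)
	go func() {
		job, err := stream.Recv() // blocks until a job is assigned or the stream fails
		if err != nil {
			errs <- err
			return
		}
		jobs <- job
	}()
	select {
	case <-shutdown:
		// Ask the server to abandon the acquisition rather than leaving a
		// job locked for a daemon that is about to exit. (In a fuller version
		// the stream is then closed, which unblocks the Recv above.)
		_ = stream.Send(&cancelAcquire{})
		return nil, errShutdown
	case <-ctx.Done():
		return nil, ctx.Err()
	case err := <-errs:
		return nil, err
	case job := <-jobs:
		return job, nil
	}
}
```

With the old AcquireJob RPC the only options were to block until a job arrived or to poll on an interval; the cancel message removes the need for the poll interval and jitter options discussed later in this thread.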

@@ -57,400 +64,411 @@ func testUserQuietHoursScheduleStore() *atomic.Pointer[schedule.UserQuietHoursSc
return ptr
}

-func TestAcquireJob(t *testing.T) {
+func TestAcquireJob_LongPoll(t *testing.T) {
spikecurtis (Contributor, Author)

To review this file, I strongly suggest "Hide Whitespace"

spikecurtis (Contributor, Author)

CI testing has made me aware of a race condition: sometimes wsbuilder is called from inside a transaction, so it's not safe to publish the job posting from inside wsbuilder, since the publish could arrive before the outer transaction commits.

This issue would be solved with triggers, but that's a much bigger refactor of our database & pubsub interfaces.

For now I'm going to unwind the changes to wsbuilder and have the callers post the job.
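
To illustrate the ordering problem (a rough sketch with made-up table and channel names, not the actual wsbuilder code): if the notification is published inside the transaction, a provisioner can react to it and query for pending jobs before the INSERT commits, find nothing, and go back to sleep.

```go
package jobrace

import (
	"context"
	"database/sql"
)

// Pubsub is a minimal stand-in for a Postgres NOTIFY-style pubsub.
type Pubsub interface {
	Publish(channel string, message []byte) error
}

// createJobRacy publishes from inside the transaction: a listener that reacts
// immediately can SELECT for pending jobs before Commit returns and miss the
// new row, leaving the job unnoticed until something else wakes the listener.
func createJobRacy(ctx context.Context, db *sql.DB, ps Pubsub) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()
	if _, err := tx.ExecContext(ctx, `INSERT INTO provisioner_jobs DEFAULT VALUES`); err != nil {
		return err
	}
	if err := ps.Publish("provisioner_jobs", nil); err != nil { // too early
		return err
	}
	return tx.Commit()
}

// createJobSafe publishes only after the transaction has committed, so by the
// time any listener reacts, the new row is visible.
func createJobSafe(ctx context.Context, db *sql.DB, ps Pubsub) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()
	if _, err := tx.ExecContext(ctx, `INSERT INTO provisioner_jobs DEFAULT VALUES`); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}
	return ps.Publish("provisioner_jobs", nil)
}
```

Having the callers publish after their enclosing transactions commit, as described above, follows the shape of createJobSafe.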

Comment on lines +262 to +269
	logger.Error(streamCtx, "recv error and failed to cancel acquire job", slog.Error(recvErr))
	// Well, this is awkward. We hit an error receiving from the stream, but didn't cancel before we locked a job
	// in the database. We need to mark this job as failed so the end user can retry if they want to.
	err := s.Database.UpdateProvisionerJobWithCompleteByID(
		context.Background(),
		database.UpdateProvisionerJobWithCompleteByIDParams{
			ID: je.job.ID,
			CompletedAt: sql.NullTime{
Member

Follow-up idea: If this starts happening frequently it could be symptomatic of an underlying issue. Would it make sense to expose this as its own metric in Prometheus?

spikecurtis (Contributor, Author)

That's a good idea. Especially as part of a larger effort to add relevant provisioner metrics.
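
For reference, such a counter could look roughly like the sketch below, using prometheus/client_golang; the metric name, help text, and registration are assumptions for illustration, not something this PR adds.

```go
package provisionermetrics

import "github.com/prometheus/client_golang/prometheus"

// acquireAfterRecvErrors is a hypothetical counter for the failure path in the
// review comment above: the acquire stream errored, but a job had already been
// locked in the database, so the job had to be marked failed.
var acquireAfterRecvErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "coderd",
	Subsystem: "provisionerd",
	Name:      "acquire_after_recv_error_total",
	Help:      "Jobs marked failed because the acquire stream errored after the job was locked.",
})

func init() {
	prometheus.MustRegister(acquireAfterRecvErrors)
}

// recordAcquireAfterRecvError would be called from the error branch shown in
// the snippet above, right after the job is marked complete/failed.
func recordAcquireAfterRecvError() {
	acquireAfterRecvErrors.Inc()
}
```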

Comment on lines 1273 to 1292
	{
		Name:        "Poll Interval",
		Description: "Time to wait before polling for a new job.",
		Flag:        "provisioner-daemon-poll-interval",
		Env:         "CODER_PROVISIONER_DAEMON_POLL_INTERVAL",
		Default:     time.Second.String(),
		Value:       &c.Provisioner.DaemonPollInterval,
		Group:       &deploymentGroupProvisioning,
		YAML:        "daemonPollInterval",
	},
	{
		Name:        "Poll Jitter",
		Description: "Random jitter added to the poll interval.",
		Flag:        "provisioner-daemon-poll-jitter",
		Env:         "CODER_PROVISIONER_DAEMON_POLL_JITTER",
		Default:     (100 * time.Millisecond).String(),
		Value:       &c.Provisioner.DaemonPollJitter,
		Group:       &deploymentGroupProvisioning,
		YAML:        "daemonPollJitter",
	},
Member

If we remove these from the deployment config then it becomes an error to specify them at all. Could we keep them around for a couple releases but mark them as deprecated?

spikecurtis (Contributor, Author)

We have deprecation for options that are replaced by something else. In this case there is no replacement; these options are just not needed any more.

I don't think waiting a few releases to remove these options will do us much good. We don't have a good mechanism to get coder server CLI warnings in front of operator eyeballs: it just drops some logs at startup and they'll get buried. We're just as likely to break people when we remove them later.

I could just change the Descriptions to "This flag is deprecated and unused" and leave them in, but I don't think there will ever be a better time than now to delete them.

WDYT?

Member

Thinking on this some more, this will probably only break folks who are specifying the arguments via CLI flags. I think if they're specified via environment variables they'll just get ignored. And we don't specify CLI args in our Helm chart, so the blast radius doesn't appear to be as large. And folks who are running Coder via the installation script would probably be specifying configuration knobs as environment variables in /etc/coder.d/coder.env.

So maybe it's OK to just remove them? 🤷

Thinking outside the scope of the PR, we could have a "Deprecated options" section in the deployment healthcheck JSON that lights up red if you have deprecated options specified.

spikecurtis (Contributor, Author)

> Thinking outside the scope of the PR, we could have a "Deprecated options" section in the deployment healthcheck JSON that lights up red if you have deprecated options specified.

That's a good idea
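
A sketch of that follow-up idea; the report shape and field names are assumptions for illustration, not Coder's actual healthcheck JSON.

```go
package healthsketch

// DeprecatedOptionsReport is a hypothetical healthcheck section that turns
// unhealthy when any deprecated option is still set in the deployment config.
type DeprecatedOptionsReport struct {
	Healthy  bool     `json:"healthy"`
	Warnings []string `json:"warnings"`
}

// checkDeprecatedOptions takes the names of options the operator actually set
// and the list of deprecated names, and reports any overlap.
func checkDeprecatedOptions(configured map[string]bool, deprecated []string) DeprecatedOptionsReport {
	report := DeprecatedOptionsReport{Healthy: true}
	for _, name := range deprecated {
		if configured[name] {
			report.Healthy = false
			report.Warnings = append(report.Warnings,
				name+" is deprecated and no longer has any effect")
		}
	}
	return report
}
```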

spikecurtis merged commit 375c70d into main on Sep 19, 2023
spikecurtis deleted the spike/9428-pt2-integrate-acquirer branch on Sep 19, 2023 at 06:25
The github-actions bot locked and limited conversation to collaborators on Sep 19, 2023
Successfully merging this pull request may close this issue: don't rely on global state to rate-limit AcquireJob requests (#9428)