Simplify issuance backoff handling & improve error messages #35

munnerz · 2022-09-09T14:59:18Z

This PR changes the internals of how the NodePublishVolume calls wait for an initial issuance of a Certificate.

Previously, a volume would be registered for management in the NodePublishVolume call, and then a wait function from the driver package was used to wait until the manager had successfully completed an issuance of the certificate.

This meant users took longer to see any errors (it always took at least 30s for an error to surface), and it has led to poor handling of "backing off" in cases where requests continuously fail (see #30).

Instead, the driver will now attempt one issuance prior to beginning the long-running renewal loop in the manager, by using the new ManageVolumeImmediate method. This method:

- registers the volume as under management (to prevent any other NodePublishVolume calls running at the same time for the same volume from duplicating work or interfering), but does NOT start the renewal loop
- makes a call to issue to attempt a single issuance
  - if this fails, it returns so the driver can call unmanage and clean up volume data on disk
  - if it succeeds, it begins the renewal goroutine as usual
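
The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's implementation: the `manager` type, its fields, the `errAlreadyManaged` and `errIssuance` values, and the lowercase method names are all hypothetical stand-ins. The real `ManageVolumeImmediate` signature, and how it treats a volume that is already managed, may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical errors used only by this sketch.
var (
	errAlreadyManaged = errors.New("volume is already under management")
	errIssuance       = errors.New("issuer rejected the request")
)

// manager is a hypothetical stand-in for the real manager type.
type manager struct {
	mu       sync.Mutex
	managed  map[string]bool             // volumes registered as under management
	renewing map[string]bool             // volumes whose renewal loop has started
	issue    func(volumeID string) error // performs a single issuance attempt
}

func newManager(issue func(string) error) *manager {
	return &manager{managed: map[string]bool{}, renewing: map[string]bool{}, issue: issue}
}

// manageVolumeImmediate registers the volume, attempts one issuance, and only
// marks the renewal loop as started if that attempt succeeds.
func (m *manager) manageVolumeImmediate(volumeID string) error {
	m.mu.Lock()
	if m.managed[volumeID] {
		m.mu.Unlock()
		return errAlreadyManaged // assumption: a concurrent call owns this volume
	}
	m.managed[volumeID] = true
	m.mu.Unlock()

	if err := m.issue(volumeID); err != nil {
		// Cleanup is left to the caller, which is expected to call unmanage.
		return fmt.Errorf("initial issuance for volume %q failed: %w", volumeID, err)
	}

	m.mu.Lock()
	m.renewing[volumeID] = true // a real manager would start a goroutine here
	m.mu.Unlock()
	return nil
}

// unmanage forgets the volume entirely.
func (m *manager) unmanage(volumeID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.managed, volumeID)
	delete(m.renewing, volumeID)
}

// publish mimics the driver side: if the single issuance attempt fails, it
// unmanages the volume and returns the error so the caller can retry later.
func publish(m *manager, volumeID string) error {
	if err := m.manageVolumeImmediate(volumeID); err != nil {
		if !errors.Is(err, errAlreadyManaged) {
			m.unmanage(volumeID)
		}
		return err
	}
	return nil
}

func main() {
	failing := newManager(func(string) error { return errIssuance })
	fmt.Println(publish(failing, "vol-1"), failing.managed["vol-1"])

	ok := newManager(func(string) error { return nil })
	fmt.Println(publish(ok, "vol-2"), ok.renewing["vol-2"])
}
```

Because a failed first attempt leaves nothing registered, the next NodePublishVolume call from the kubelet starts from a clean state.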

This allows us to lean on the kubelet/CSI driver consumer to define the retry behaviour if a pod isn't able to start.

In cases where a renewal is failing, the same exponential backoff behaviour we had before will be applied (and the exponent will no longer be thrown away every 30s, as described in #30).

I've also tuned the exponential backoff parameters so the initial 'base' backoff in these cases is 30s, with a factor of 2 and a max cap of 5 minutes. This leads to a pattern of waiting T seconds after each attempt:

T=0s
T=30s
T=60s
T=120s
T=240s
T=300s (5 minutes, the cap)
T=300s (5 minutes, the cap)
...
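
The arithmetic behind this sequence can be reproduced with a few lines of Go. This is only a sketch of the capped doubling described above: `backoffSchedule` is a hypothetical helper, and the manager itself is configured through a `wait.Backoff` value, whose behaviour (for example any jitter) is not modelled here.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the wait applied after each of the first n attempts:
// no wait after the first, then base, multiplied by factor each time and
// clamped to maxDelay.
func backoffSchedule(base, maxDelay time.Duration, factor float64, n int) []time.Duration {
	if n <= 0 {
		return nil
	}
	delays := make([]time.Duration, 0, n)
	next := base
	for i := 0; i < n; i++ {
		if i == 0 {
			delays = append(delays, 0)
			continue
		}
		delays = append(delays, next)
		next = time.Duration(float64(next) * factor)
		if next > maxDelay {
			next = maxDelay
		}
	}
	return delays
}

func main() {
	// Base of 30s, factor of 2, cap of 5 minutes, as described in this PR.
	for i, d := range backoffSchedule(30*time.Second, 5*time.Minute, 2, 7) {
		fmt.Printf("attempt %d: T=%v\n", i+1, d)
	}
}
```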

This backoff behaviour/configuration can now also be customised by library consumers to their liking.

As part of this, I've also improved the error messages returned to users in cases where their certificate requests are not being approved or issued (instead of just showing "timed out waiting for the condition").

cc @7ing @JoshVanL

jetstack-bot · 2022-09-09T14:59:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: munnerz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [munnerz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

munnerz · 2022-09-09T15:00:56Z

To note as well, the kubelet's standard backoff handling also appears to be exponential. We will rely on the kubelet to handle/time retries if a pod is in a ContainerCreating state - our own exponential backoff only comes into effect for renewals.

We may also want to consider surfacing failed renewals via metrics in the CSI driver, to make it easier for operators to identify 'stuck' renewals.

munnerz · 2022-09-09T16:32:07Z

driver/nodeserver.go

 	meta := metadata.FromNodePublishVolumeRequest(req)
 	log := loggerForMetadata(ns.log, meta)
-	ctx, _ = context.WithTimeout(ctx, time.Second*30)
+	ctx, _ = context.WithTimeout(ctx, time.Second*60)


I've increased this because we now return errors from approvers/issuers as soon as they happen. That makes it easier for us to tolerate issuers that take longer, as a longer deadline no longer comes at the cost of user experience: when a failure is happening, users no longer have to wait until this deadline before seeing an error.
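
The reasoning can be illustrated with a small sketch: when the wait returns on the first reported error, a longer deadline only affects requests that are still genuinely pending. The names `waitForIssuance`, `publishWithDeadline` and `errDenied` are hypothetical and do not come from this PR. The sketch also keeps the cancel function returned by `context.WithTimeout` and defers it, which is the usual Go idiom.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errDenied is a hypothetical terminal error reported by an approver.
var errDenied = errors.New("certificate request was denied")

// waitForIssuance returns as soon as a result arrives, or when ctx expires,
// whichever happens first.
func waitForIssuance(ctx context.Context, results <-chan error) error {
	select {
	case err := <-results:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// publishWithDeadline bounds the wait with a timeout and releases the
// context's resources once the wait is over.
func publishWithDeadline(timeout time.Duration, results <-chan error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitForIssuance(ctx, results)
}

func main() {
	results := make(chan error, 1)
	results <- errDenied // the failure is already known

	start := time.Now()
	err := publishWithDeadline(60*time.Second, results)
	fmt.Printf("returned %q after %v\n", err, time.Since(start).Round(time.Second))
}
```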

JoshVanL

Thanks @munnerz, couple comments. I have walked through the state machine and the changes make sense to me.

JoshVanL · 2022-09-12T09:28:15Z

manager/manager.go

 			cond := apiutil.GetCertificateRequestCondition(updatedReq, cmapi.CertificateRequestConditionDenied)
+			// if a CR has been explicitly denied, we DO stop execution.
+			// there may be a case to be made that we could continue anyway even if the issuer ignores the approval
+			// status, however these cases are likely few and far between and this makes denial more responsive.
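
As a rough illustration of the behaviour described in the comment above, the check can be thought of as a small classification step that stops immediately on denial. The `condition` and `request` types, the `getCondition` helper, and the treatment of a "Ready" condition below are simplified assumptions for this sketch, not the cert-manager API types used in manager.go.

```go
package main

import (
	"errors"
	"fmt"
)

// condition and request are hypothetical, simplified stand-ins for the
// CertificateRequest API types.
type condition struct {
	Type    string
	Status  string
	Message string
}

type request struct {
	Conditions []condition
}

var errDenied = errors.New("certificate request has been denied")

// getCondition returns the condition with the given type, or nil if absent.
func getCondition(req request, condType string) *condition {
	for i := range req.Conditions {
		if req.Conditions[i].Type == condType {
			return &req.Conditions[i]
		}
	}
	return nil
}

// checkRequest reports whether issuance is complete. An explicit denial is
// returned as an error so the caller stops waiting instead of timing out.
func checkRequest(req request) (done bool, err error) {
	if c := getCondition(req, "Denied"); c != nil && c.Status == "True" {
		return false, fmt.Errorf("%w: %s", errDenied, c.Message)
	}
	if c := getCondition(req, "Ready"); c != nil && c.Status == "True" {
		return true, nil
	}
	return false, nil // still pending, keep waiting
}

func main() {
	denied := request{Conditions: []condition{{Type: "Denied", Status: "True", Message: "blocked by policy"}}}
	fmt.Println(checkRequest(denied))
	fmt.Println(checkRequest(request{}))
}
```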


JoshVanL · 2022-09-12T18:00:37Z

/lgtm

7ing · 2022-09-13T22:23:31Z

manager/manager.go

 	maxRequestsPerVolume int
+
+	// backoffConfig configures the exponential backoff applied to certificate renewal failures.
+	backoffConfig wait.Backoff


Minor: should this be renamed to renewalBackoffConfig?

munnerz force-pushed the simplify-backoff-handling branch from 66f6f3e to 98c11f1 on September 9, 2022 15:01

jetstack-bot added the dco-signoff: no label (at least one commit in this pull request is missing the DCO sign-off message) and removed the dco-signoff: yes label (all commits in the pull request have the valid DCO sign-off message) on Sep 9, 2022

munnerz added 7 commits September 9, 2022 17:29

upgrade apiVersions used for example deploy manifests

9baa172

Signed-off-by: James Munnelly <[email protected]>

refactor nodeserver to only attempt issuance once in a single NodePublishVolume call

4d748d6

Signed-off-by: James Munnelly <[email protected]>

Improve failure messages when CertificateRequests are left pending

0923d01

Signed-off-by: James Munnelly <[email protected]>

Don't automatically begin management of volumes that have not been issued before

e80da6f

Signed-off-by: James Munnelly <[email protected]>

Properly resume management of volumes on startup when continueOnNotReady is enabled

fcf8439

Signed-off-by: James Munnelly <[email protected]>

Allow setting a custom wait.Backoff config for renewal

f1e5d1d

Signed-off-by: James Munnelly <[email protected]>

Increase NodePublishVolume context deadline to 60s to allow for slower issuers now other areas are more responsive

10aa8bf

Signed-off-by: James Munnelly <[email protected]>

munnerz force-pushed the simplify-backoff-handling branch from 1fb5657 to 10aa8bf on September 9, 2022 16:29

jetstack-bot added the dco-signoff: yes label (all commits in the pull request have the valid DCO sign-off message) and removed the dco-signoff: no label (at least one commit in this pull request is missing the DCO sign-off message) on Sep 9, 2022

munnerz commented Sep 9, 2022

JoshVanL requested changes Sep 12, 2022

munnerz force-pushed the simplify-backoff-handling branch from 3c27c74 to 940ee59 on September 12, 2022 12:57

Address review feedback

8aaca43

Signed-off-by: James Munnelly <[email protected]>

munnerz force-pushed the simplify-backoff-handling branch from 940ee59 to 8aaca43 on September 12, 2022 13:01

now->epoch

ccb95da

Signed-off-by: James Munnelly <[email protected]>

jetstack-bot assigned JoshVanL Sep 12, 2022

jetstack-bot added the lgtm label (indicates that a PR is ready to be merged) on Sep 12, 2022

jetstack-bot merged commit 6ca41bd into cert-manager:main Sep 12, 2022

munnerz deleted the simplify-backoff-handling branch September 12, 2022 18:01

munnerz mentioned this pull request Sep 13, 2022

Default renewal time to 2/3rds through certificates lifetime #36

Merged

7ing reviewed Sep 14, 2022

7ing mentioned this pull request Oct 11, 2022

Expect ManageVolumeImmediate() is a blocking call that waits for certificate to be issued #41

Closed

7ing mentioned this pull request Feb 24, 2023

Exponential backoff handling does not apply to certificate renewal in pending phase #45

Closed
