@munnerz
Member

@munnerz munnerz commented Sep 9, 2022

This PR changes how NodePublishVolume calls wait for the initial issuance of a certificate.

Previously, a volume would be registered for management in the NodePublishVolume call, and a wait function from the driver package was then used to wait until the manager had successfully completed an issuance of the certificate.

This meant users had to wait longer to see any errors (it always took a minimum of 30s for an error to surface), and it has led to poor handling of "backing off" in cases where requests continuously fail (see #30).

Instead, the driver will now attempt one issuance prior to beginning the long-running renewal loop in the manager, by using the new ManageVolumeImmediate method. This method:

  1. registers the volume as under management (to prevent concurrent NodePublishVolume calls for the same volume from duplicating work or interfering) but does NOT start the renewal loop
  2. makes a call to issue to attempt a single issuance
  3. if this fails, it returns so the driver can call unmanage and clean up volume data on disk
  4. if it succeeds, it begins the renewal goroutine like usual
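
The four steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the `manager` type, its fields, `unmanage`, and the `failingIssue`/`okIssue` helpers are hypothetical stand-ins, and the renewal loop is represented by a flag rather than a real goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errIssuance is a placeholder failure used by the example issuers below.
var errIssuance = errors.New("issuance failed")

// manager is a hypothetical, simplified stand-in for the real volume manager.
type manager struct {
	mu       sync.Mutex
	managed  map[string]bool             // volumes registered for management
	renewing map[string]bool             // volumes whose renewal loop has been started
	issue    func(volumeID string) error // performs exactly one issuance attempt
}

func newManager(issue func(string) error) *manager {
	return &manager{
		managed:  map[string]bool{},
		renewing: map[string]bool{},
		issue:    issue,
	}
}

// manageVolumeImmediate follows the four steps above. The bool reports whether
// this call registered the volume (false means another call already had).
func (m *manager) manageVolumeImmediate(volumeID string) (bool, error) {
	// 1. Register the volume, but do not start the renewal loop yet.
	m.mu.Lock()
	if m.managed[volumeID] {
		m.mu.Unlock()
		return false, nil
	}
	m.managed[volumeID] = true
	m.mu.Unlock()

	// 2. Attempt a single issuance.
	if err := m.issue(volumeID); err != nil {
		// 3. Return the error; the caller is expected to unmanage and clean up.
		return true, fmt.Errorf("initial issuance for volume %q: %w", volumeID, err)
	}

	// 4. Success: a real manager would start the renewal goroutine here.
	m.mu.Lock()
	m.renewing[volumeID] = true
	m.mu.Unlock()
	return true, nil
}

// unmanage removes the volume from management.
func (m *manager) unmanage(volumeID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.managed, volumeID)
	delete(m.renewing, volumeID)
}

func failingIssue(string) error { return errIssuance }
func okIssue(string) error      { return nil }

func main() {
	m := newManager(failingIssue)
	if _, err := m.manageVolumeImmediate("vol-1"); err != nil {
		m.unmanage("vol-1") // driver-side cleanup after a failed first attempt
		fmt.Println("publish failed:", err)
	}
}
```

Because the error is returned straight away, the caller (and ultimately the kubelet) decides when to retry, rather than the manager retrying internally.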

This allows us to lean on the kubelet/CSI driver consumer to define the retry behaviour if a pod isn't able to start.

In cases where a renewal is failing, the same exponential backoff behaviour we had before will be applied (and the exponent will no longer be thrown away every 30s, as described in #30).

I've also tuned the exponential backoff parameters so the initial 'base' backoff in these cases is 30s, with a factor of 2 and a cap of 5 minutes. This leads to a pattern of waiting T seconds after each attempt:

T=0s
T=30s
T=60s
T=120s
T=240s
T=300s (5 minutes, the cap)
T=300s (5 minutes, the cap)
...

This backoff behaviour/configuration can now also be customised by library consumers to their liking.

As part of this, I've also improved the error messages returned to users when their certificate requests are not being approved or issued (instead of just showing "timed out waiting for the condition").
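
To illustrate the kind of check involved, here is a rough sketch of turning a denied request into a descriptive error instead of waiting for a timeout. It is not the code in this PR: `condition`, `findCondition`, `checkRequest`, and `errDenied` are simplified, hypothetical stand-ins for the cert-manager API types and helpers, and only the Denied and Ready conditions are considered.

```go
package main

import (
	"errors"
	"fmt"
)

// condition is a simplified, hypothetical version of a CertificateRequest condition.
type condition struct {
	Type    string
	Status  string
	Message string
}

// errDenied marks a request that was explicitly denied by an approver.
var errDenied = errors.New("certificate request denied")

// findCondition returns the first condition of the given type, or nil.
func findCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// checkRequest reports whether the request is ready. An explicit denial is
// returned as an error immediately, so the caller can stop waiting.
func checkRequest(conds []condition) (bool, error) {
	if c := findCondition(conds, "Denied"); c != nil && c.Status == "True" {
		return false, fmt.Errorf("%w: %s", errDenied, c.Message)
	}
	if c := findCondition(conds, "Ready"); c != nil && c.Status == "True" {
		return true, nil
	}
	return false, nil // neither denied nor ready yet: keep waiting
}

func main() {
	_, err := checkRequest([]condition{{Type: "Denied", Status: "True", Message: "rejected by policy"}})
	fmt.Println(err)
}
```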

cc @7ing @JoshVanL

@jetstack-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: munnerz

@jetstack-bot added the dco-signoff: yes, approved, and size/L labels on Sep 9, 2022
@munnerz
Member Author

munnerz commented Sep 9, 2022

Also worth noting: the kubelet's standard backoff handling appears to be exponential too. We will rely on the kubelet to handle/time retries while a pod is in the ContainerCreating state; our own exponential backoff only comes into effect for renewals.

We may also want to consider surfacing failed renewals via metrics in the CSI driver (or similar), to make it easier for operators to identify 'stuck' renewals.

@munnerz force-pushed the simplify-backoff-handling branch from 66f6f3e to 98c11f1 on September 9, 2022 15:01
@jetstack-bot added the dco-signoff: no label and removed the dco-signoff: yes label on Sep 9, 2022
@munnerz force-pushed the simplify-backoff-handling branch from 1fb5657 to 10aa8bf on September 9, 2022 16:29
@jetstack-bot added the dco-signoff: yes label and removed the dco-signoff: no label on Sep 9, 2022
meta := metadata.FromNodePublishVolumeRequest(req)
log := loggerForMetadata(ns.log, meta)
- ctx, _ = context.WithTimeout(ctx, time.Second*30)
+ ctx, _ = context.WithTimeout(ctx, time.Second*60)
Member Author

I've increased this because we now return errors from approvers/issuers as soon as they happen. That makes it easier for us to tolerate issuers that take longer, as it no longer comes at the cost of user experience: when a failure happens, users no longer have to wait until this timeout before seeing an error.

Contributor

@JoshVanL JoshVanL left a comment

Thanks @munnerz, a couple of comments. I have walked through the state machine and the changes make sense to me.

cond := apiutil.GetCertificateRequestCondition(updatedReq, cmapi.CertificateRequestConditionDenied)
// if a CR has been explicitly denied, we DO stop execution.
// there may be a case to be made that we could continue anyway even if the issuer ignores the approval
// status, however these cases are likely few and far between and this makes denial more responsive.
Contributor

👍

@munnerz force-pushed the simplify-backoff-handling branch from 3c27c74 to 940ee59 on September 12, 2022 12:57
Signed-off-by: James Munnelly <[email protected]>
@munnerz force-pushed the simplify-backoff-handling branch from 940ee59 to 8aaca43 on September 12, 2022 13:01
Signed-off-by: James Munnelly <[email protected]>
@JoshVanL
Contributor

/lgtm

@jetstack-bot added the lgtm label on Sep 12, 2022
@jetstack-bot merged commit 6ca41bd into cert-manager:main on Sep 12, 2022
@munnerz deleted the simplify-backoff-handling branch on September 12, 2022 18:01
maxRequestsPerVolume int

// backoffConfig configures the exponential backoff applied to certificate renewal failures.
backoffConfig wait.Backoff
Contributor

Minor: should we rename this to renewalBackoffConfig?
