Simplify issuance backoff handling & improve error messages #35
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: munnerz
To note as well: the kubelet's standard backoff handling also appears to be exponential. We will rely on the kubelet to handle/time retries if a pod is in a ContainerCreating state; our own exponential backoff only comes into effect for renewals. We may also want to consider surfacing failed renewals via metrics in the CSI driver, to make it easier for operators to identify 'stuck' renewals.
66f6f3e to 98c11f1
Signed-off-by: James Munnelly <[email protected]>
…lishVolume call Signed-off-by: James Munnelly <[email protected]>
Signed-off-by: James Munnelly <[email protected]>
…sued before Signed-off-by: James Munnelly <[email protected]>
…ady is enabled Signed-off-by: James Munnelly <[email protected]>
Signed-off-by: James Munnelly <[email protected]>
…r issuers now other areas are more responsive Signed-off-by: James Munnelly <[email protected]>
1fb5657 to 10aa8bf
driver/nodeserver.go
Outdated
  meta := metadata.FromNodePublishVolumeRequest(req)
  log := loggerForMetadata(ns.log, meta)
- ctx, _ = context.WithTimeout(ctx, time.Second*30)
+ ctx, _ = context.WithTimeout(ctx, time.Second*60)
I've increased this because we now return errors from approvers/issuers as soon as they happen. That makes it easier to tolerate issuers that take longer, since a longer timeout no longer hurts the user experience during a failure (previously users had to wait for this cap before seeing an error).
Thanks @munnerz, couple comments. I have walked through the state machine and the changes make sense to me.
  cond := apiutil.GetCertificateRequestCondition(updatedReq, cmapi.CertificateRequestConditionDenied)
  // if a CR has been explicitly denied, we DO stop execution.
  // there may be a case to be made that we could continue anyway even if the issuer ignores the approval
  // status, however these cases are likely few and far between and this makes denial more responsive.
👍
3c27c74 to 940ee59
Signed-off-by: James Munnelly <[email protected]>
940ee59 to 8aaca43
Signed-off-by: James Munnelly <[email protected]>
/lgtm
  maxRequestsPerVolume int

  // backoffConfig configures the exponential backoff applied to certificate renewal failures.
  backoffConfig wait.Backoff
Minor: should this be renamed to renewalBackoffConfig?
This PR changes the internals of how the NodePublishVolume calls wait for an initial issuance of a Certificate.
Previously, a volume would be registered for management in the NodePublishVolume call, and then a `wait` function from the driver package was used to wait until the manager had successfully completed an issuance of the certificate.

This made it take longer for users to see any errors (it would always take a minimum of 30s to get them), and it has led to poor handling of "backing off" in cases where requests continuously fail (see #30).
Instead, the driver will now attempt one issuance prior to beginning the long-running renewal loop in the manager, by using the new `ManageVolumeImmediate` method. This method uses `issue` to attempt a single issuance.

This allows us to lean on the kubelet/CSI driver consumer to define the retry behaviour if a pod isn't able to start.
In cases where a renewal is failing, the same exponential backoff behaviour we had before will be applied (and the exponent will no longer be thrown away every 30s, as described in #30).
I've also tuned the exponential backoff parameters so the initial 'base' backoff in these cases is 30s, with a factor of 2 and a max cap of 5 minutes. This leads to a pattern of waiting T seconds after each attempt:
This backoff behaviour/configuration can now also be customised by library consumers to their liking.
As part of this, I've also improved the error messages returned to users in cases where their certificate requests are not being approved or issued (instead of just showing `timed out waiting for the condition`).

cc @7ing @JoshVanL