Conversation

Contributor

@tkashem tkashem commented Jan 13, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:
The APF post-startup config provider should ensure that the bootstrap spec of the suggested priority level configuration and flow schema objects is saved even when a given object already exists.
If a new version of the apiserver changes the spec of any suggested object in the bootstrap configuration, the updated spec is not stored: when an object already exists in the cluster, we do not update it to match the spec of the bootstrap object.

We recently merged #95259, which changed the spec of the suggested configuration in bootstrap. It went into 1.20. If a cluster upgrades from 1.19 to 1.20, the updated spec in the bootstrap configuration won't be reflected in the stored objects.

This PR takes the following approach:
On a fresh install, all configuration objects (both mandatory and suggested) will have auto-update enabled via the following annotation:
apf.kubernetes.io/autoupdate-spec: 'true'

  • The post-start hook for APF tries for 30s to initialize the bootstrap configuration; the apiserver will fail to start if it cannot finish the initialization within 30s.
  • After the initialization is complete, the kube-apiserver periodically checks the bootstrap configuration objects on the cluster and applies updates if necessary.

kube-apiserver enforces an 'always auto-update' policy for the mandatory configuration object(s). This implies:

  • the auto-update annotation key is added with a value of 'true' if it is missing.
  • the auto-update annotation key is set to 'true' if its current value is a boolean false or has an invalid boolean representation (if the cluster operator sets it to 'false' it will be stomped)
  • any changes to the spec made by the cluster operator will be stomped.

The kube-apiserver will apply updates to the suggested configuration if:

  • the cluster operator has enabled auto-update by setting the annotation (apf.kubernetes.io/autoupdate-spec: 'true') or
  • the annotation key is missing but the generation is 1

If the suggested configuration object is missing the annotation key, kube-apiserver will update the annotation appropriately:

  • it is set to 'true' if generation of the object is '1' which usually indicates that the spec of the object has not been changed.
  • it is set to 'false' if generation of the object is greater than 1.

The above approach for suggested configuration ensures that we don't squash changes made by an operator. Please note, we can't protect the changes made by the operator in the following scenario:

  • the user changes the spec and then deletes and recreates the same object. (generation resets to 1)
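
The rules above for suggested objects reduce to a small decision function. The following is a sketch under stated assumptions, not the code in this PR: the function names are hypothetical, and treating an unparsable annotation value as "disabled" for suggested objects is an assumption, since the description only defines that case for mandatory objects.

```go
package main

import (
	"fmt"
	"strconv"
)

// autoUpdateKey is the annotation key quoted in the PR description.
const autoUpdateKey = "apf.kubernetes.io/autoupdate-spec"

// shouldUpdateSuggested (hypothetical name) reports whether the spec of a
// suggested object may be overwritten with the bootstrap spec.
func shouldUpdateSuggested(annotations map[string]string, generation int64) bool {
	value, found := annotations[autoUpdateKey]
	if !found {
		// Missing key: generation 1 usually means the spec was never edited.
		return generation == 1
	}
	// Assumption: an unparsable value counts as "disabled" here.
	enabled, err := strconv.ParseBool(value)
	return err == nil && enabled
}

// annotationForMissingKey (hypothetical name) returns the value to store when
// a suggested object has no auto-update annotation yet.
func annotationForMissingKey(generation int64) string {
	return strconv.FormatBool(generation == 1)
}

func main() {
	edited := map[string]string{autoUpdateKey: "false"}
	fmt.Println(shouldUpdateSuggested(nil, 1))    // missing key, untouched spec
	fmt.Println(shouldUpdateSuggested(nil, 4))    // missing key, edited spec
	fmt.Println(shouldUpdateSuggested(edited, 1)) // operator opted out
	fmt.Println(annotationForMissingKey(4))
}
```

The delete-and-recreate caveat shows up directly in this sketch: a recreated object has generation 1 and no annotation, so it is indistinguishable from an untouched one.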

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note-none, do-not-merge/work-in-progress, kind/bug, size/M, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, and needs-priority labels Jan 13, 2021
Contributor Author

@tkashem tkashem Jan 13, 2021

there is probably a reason to have shouldEnsureSuggested in place. are there any risks if we ensure all objects both suggested and mandatory without having shouldEnsureSuggested?

Member

Yes. We explicitly and deliberately made a distinction between required and suggested configuration, and we explicitly and deliberately made the suggested configuration be something that the cluster operators can supplement, delete, and/or modify in any way they want.

Yes, that means the operators have to think and potentially do something when migrating to a new release, in cases where they will want different configuration in the new release. I do not think automatically smashing in the suggested config is a better solution.

Contributor Author

tkashem commented Jan 13, 2021

/assign @MikeSpreitzer

@tkashem tkashem changed the title from "[WIP] fix apf post startup config provider logic works on upgrade" to "[WIP] fix apf post startup config provider logic to work when desired spec has changes" Jan 13, 2021
Contributor Author

tkashem commented Jan 13, 2021

/retest

Member

How about making that %v into a %w?

@MikeSpreitzer
Member

@kubernetes/sig-api-machinery-bugs
(well, I think it's not really a bug, it's a prompt for more design work)

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 14, 2021
@fedebongio
Contributor

/cc @lavalamp @wojtek-t
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 14, 2021
@tkashem tkashem changed the title from "[WIP] fix apf post startup config provider logic to work when desired spec has changes" to "[WIP] apf post startup config provider: desired spec in the bootstrap definition should match the stored configuration" Jan 25, 2021
@tkashem tkashem force-pushed the apf-post-startup-fix branch from fb56ce7 to 92ec9c5 Compare February 1, 2021 18:12
@k8s-ci-robot k8s-ci-robot added the size/L and kind/api-change labels and removed the size/M label Feb 1, 2021
@tkashem tkashem force-pushed the apf-post-startup-fix branch from 92ec9c5 to a8c4fe6 Compare February 6, 2021 05:00
Contributor

What value is this abstraction providing? Looks like it's just a way to break type safety

Contributor Author

We want to avoid duplicating the ensure/delete strategy for each type and for each kind of configuration (suggested or mandatory). As a trade-off, we have boilerplate code in the wrapper.

Contributor Author

The top-level functions accept typed objects, so we retain some type safety. If you look at the main function that drives it, the calls are all typed.
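
The trade-off described in these replies can be pictured roughly as follows. This is a simplified sketch, not the code under review: every type and function name here is hypothetical. One shared, untyped strategy is written once against a small interface, and thin typed entry points wrap the concrete objects before calling it.

```go
package main

import "fmt"

// Hypothetical stand-ins for the two API types.
type FlowSchema struct{ Name, Spec string }
type PriorityLevel struct{ Name, Spec string }

// configObject is the untyped view that the shared strategy works against.
type configObject interface {
	name() string
	spec() string
}

// The wrappers are the boilerplate: one small adapter per concrete type.
type flowSchemaWrapper struct{ obj *FlowSchema }

func (w flowSchemaWrapper) name() string { return w.obj.Name }
func (w flowSchemaWrapper) spec() string { return w.obj.Spec }

type priorityLevelWrapper struct{ obj *PriorityLevel }

func (w priorityLevelWrapper) name() string { return w.obj.Name }
func (w priorityLevelWrapper) spec() string { return w.obj.Spec }

// needsUpdate is the shared strategy, written once for all types.
func needsUpdate(current, bootstrap configObject) bool {
	return current.spec() != bootstrap.spec()
}

// Typed entry points: callers cannot mix a FlowSchema with a PriorityLevel.
func flowSchemaNeedsUpdate(current, bootstrap *FlowSchema) bool {
	return needsUpdate(flowSchemaWrapper{current}, flowSchemaWrapper{bootstrap})
}

func priorityLevelNeedsUpdate(current, bootstrap *PriorityLevel) bool {
	return needsUpdate(priorityLevelWrapper{current}, priorityLevelWrapper{bootstrap})
}

func main() {
	stored := &FlowSchema{Name: "probes", Spec: "v1"}
	wanted := &FlowSchema{Name: "probes", Spec: "v2"}
	fmt.Println(flowSchemaNeedsUpdate(stored, wanted))
	fmt.Println(priorityLevelNeedsUpdate(&PriorityLevel{"exempt", "a"}, &PriorityLevel{"exempt", "a"}))
}
```

The reviewer's concern applies to `needsUpdate` itself, which would accept mismatched wrappers; the typed entry points are what keep that from happening at call sites.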

Contributor

same here.

@tkashem tkashem force-pushed the apf-post-startup-fix branch from 3da6fb4 to e2c7279 Compare May 6, 2021 21:50
Contributor Author

tkashem commented May 6, 2021

> I'm not convinced if 30s isn't too low, but other than that the pattern looks ok.

@wojtek-t I have set 5m as the timeout for bootstrap initialization. Let me know if this looks good.

Member

@wojtek-t wojtek-t left a comment

Just two very minor nits - will approve once fixed.

Member

return nil

Member

[technically it doesn't matter - but it's misleading to the user]

Contributor Author

oops, it's a copy paste error, fixed.

Member

same here

Contributor Author

fixed

@tkashem tkashem force-pushed the apf-post-startup-fix branch from e2c7279 to b50fa59 Compare May 7, 2021 13:11
Contributor Author

tkashem commented May 7, 2021

> Just two very minor nits - will approve once fixed.

@wojtek-t fixed, please take a look.

Member

wojtek-t commented May 7, 2021

/lgtm
/approve

/assign @lavalamp - for API approval

@k8s-ci-robot
Contributor

@wojtek-t: GitHub didn't allow me to assign the following users: -, for, API, approval.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/lgtm
/approve

/assign @lavalamp - for API approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2021
Contributor Author

tkashem commented May 7, 2021

/test pull-kubernetes-e2e-kind

Contributor Author

tkashem commented May 7, 2021

/test pull-kubernetes-integration

Contributor

@lavalamp lavalamp left a comment

/approve

API changes look good, thanks for the giant comment. Just two typos.

Contributor

Suggested change
- // objects on the cluster and applies update if necessary.
+ // objects on the cluster and applies updates if necessary.

Contributor

Suggested change
- // The kube-apiserver will apply update on the suggested configuration if:
+ // The kube-apiserver will apply updates on the suggested configuration if:

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, tkashem, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2021
Take the following approach:
On a fresh install, all bootstrap configuration objects will
have auto update enabled via the following annotation :
`apf.kubernetes.io/autoupdate: 'true'`

The kube-apiserver periodically checks the bootstrap configuration
objects on the cluster and applies update if necessary.

We enforce an 'always auto-update' policy for the mandatory
configuration object(s).

We update the suggested configuration objects when:
- auto update is enabled (`apf.kubernetes.io/autoupdate: 'true'`) or
- auto update annotation key is missing but `generation` is `1`

If the configuration object is missing the annotation key, we add
it appropriately:
it is set to `true` if `generation` is `1`, `false` otherwise.

The above approach ensures that we don't squash changes made by an
operator. Please note, we can't protect the changes made by the
operator in the following scenario:
- the user changes the spec and then deletes and recreates
  the same object. (generation resets to 1)

remove using a marker
@tkashem tkashem force-pushed the apf-post-startup-fix branch from b50fa59 to 759a641 Compare May 7, 2021 18:23
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2021
Contributor

lavalamp commented May 7, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0a46301 into kubernetes:master May 7, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone May 7, 2021
Contributor Author

tkashem commented May 11, 2021

@MikeSpreitzer regarding your comment #98028 (comment)

> note that before this PR, the behavior at apiserver startup is to try to inject the suggested config objects if and only if the exempt priority level config object does not exist. And that is the last config object created. So its absence indicates incomplete cluster initialization. Its presence means the suggested config objects have been created. If at some later time one of the suggested config objects is missing then it must be because a cluster admin deleted that object. We made the apiservers deliberately respect that deletion and not try to restore the suggested object. This PR should be revised to keep that behavior.

I just realized that we have an issue with this approach. Take, for example, the suggested flow schema probes (which exempts all probes), added in 1.22 by #100678.

  • Fresh install of 1.22: the probes flow schema will be created along with the other configuration objects.
  • Upgrade from 1.21 to 1.22: since the exempt mandatory priority level config object already exists on the cluster, the auto updater will not create the probes flow schema (it thinks that the cluster operator explicitly deleted the probes object) even though probes is being introduced in 1.22. We don't tag a newly introduced bootstrap configuration.

There are two options:

  • A: When we add a new suggested configuration object, we tag it so that we know it is brand new, and we remove that tag in the following release. This is hard to maintain and error-prone.
  • B: We always recreate missing suggested configuration objects. This means we no longer allow the cluster operator to delete any suggested configuration. At the same time, the cluster operator can achieve the same goal (render the suggested configuration useless) by creating a new "similar" FlowSchema object with a logically higher matching precedence.

I propose Solution B. Let me know if you have any other ideas.
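
To make the upgrade problem concrete, here is a small sketch contrasting the earlier rule quoted above with option B. It is an illustration only: the function names are hypothetical, objects are reduced to names, and `some-other-suggested` is a made-up object name (only `probes` and the exempt priority level come from this thread).

```go
package main

import "fmt"

// missing returns the bootstrap names that are absent from the cluster,
// in bootstrap order.
func missing(bootstrap []string, existing map[string]bool) []string {
	var out []string
	for _, name := range bootstrap {
		if !existing[name] {
			out = append(out, name)
		}
	}
	return out
}

// createOnFreshInstallOnly models the earlier rule: suggested objects are
// injected only while the exempt priority level does not exist yet.
func createOnFreshInstallOnly(bootstrap []string, existing map[string]bool, exemptExists bool) []string {
	if exemptExists {
		return nil
	}
	return missing(bootstrap, existing)
}

// alwaysRecreate models option B: any missing suggested object is created.
func alwaysRecreate(bootstrap []string, existing map[string]bool) []string {
	return missing(bootstrap, existing)
}

func main() {
	// Upgrade from 1.21 to 1.22: "probes" is new in the bootstrap set.
	bootstrap := []string{"some-other-suggested", "probes"}
	existing := map[string]bool{"some-other-suggested": true}

	fmt.Println(createOnFreshInstallOnly(bootstrap, existing, true))
	fmt.Println(alwaysRecreate(bootstrap, existing))
}
```

Under the earlier rule the upgraded cluster never receives `probes`; under option B it does, at the cost of also restoring any suggested object an operator deleted on purpose.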

cc @wojtek-t

Member

wojtek-t commented May 12, 2021

> We always recreate missing suggested configuration objects. This means we no longer allow the cluster operator to delete any suggested configuration. At the same time, the cluster operator can achieve the same goal (render the suggested configuration useless) by creating a new "similar" FlowSchema object with a logically higher matching precedence.

Alternatively, they can modify the "auto-update" annotation and update the FlowSchema to match other (or no) requests.
I think that the suggested configuration is so basic that, while I can imagine operators changing the shares or adjusting which requests an object should match (which they will be able to do by modifying the auto-update annotation), I don't think removing these objects is a reasonable operator action. So option (B) sounds good to me.

@MikeSpreitzer
Member

I think the whole set of suggested configuration objects is ill-designed and am disappointed that we have not yet explored a different set (along the lines of what was originally proposed). I continue to think that deleting suggested config objects is a reasonable operator action.

There is another possible approach. Let the last mandatory object have an annotation that identifies the vintage of suggested config objects that was injected. That would let the config producer controller tell the difference between an old set of objects vs. an explicitly edited new set of objects.

Another possible approach eschews an explicit tag and simply looks at the existing objects.

We actually discussed this issue of recently introduced suggested config objects in a meeting and decided that we would handle it with a release note explaining that there is a new config object that the operator may want to consider. The difficult scenario is what to do after the operator has made some change --- does that mean the newly defined suggested config object should be eschewed, or injected anyway? It depends on what change(s) the operator made and why. There is no way that the config producing controller can tell.

		return false, nil
	}, hookContext.StopCh)
	if err != nil {
		klog.ErrorS(err, "APF bootstrap ensurer is exiting")
Member

this always logs an error on server shutdown... this should be info, at best

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this confused debugging on #103512

Labels
approved, cncf-cla: yes, kind/api-change, kind/bug, lgtm, needs-priority, release-note-none, sig/api-machinery, size/XXL, triage/accepted
9 participants