Conversation

mikesmithgh
Collaborator

@mikesmithgh mikesmithgh commented Oct 1, 2021

The goal of this PR is to avoid returning an error under the following conditions, so that we don't requeue and retry K8s API calls that will fail.

  • Delete: ignore 404 not found since the object does not exist
  • Update: ignore 404 not found since the object does not exist
  • Update: ignore 409 conflict, which can occur when the object has been modified and no longer matches what is in the cache (e.g., resourceVersion changed)
  • Create: ignore 409 already exists since the object has already been created
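
The four cases above can be sketched as a single filter that decides whether an API error is worth returning (and therefore requeueing). This is a stdlib-only illustration, not the code from this PR: `statusError`, `op`, and `ignoreExpected` are hypothetical names, and the HTTP status codes stand in for the checks that real client-go code would typically make with `IsNotFound`, `IsConflict`, and `IsAlreadyExists` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// statusError is a hypothetical stand-in for a Kubernetes API status error.
type statusError struct {
	code int
	msg  string
}

func (e *statusError) Error() string { return fmt.Sprintf("%d: %s", e.code, e.msg) }

// op names the API call that produced the error (hypothetical type).
type op int

const (
	opCreate op = iota
	opUpdate
	opDelete
)

// ignoreExpected returns nil for errors that a retry cannot fix, and the
// original error otherwise so that the caller can requeue.
func ignoreExpected(o op, err error) error {
	var se *statusError
	if !errors.As(err, &se) {
		return err // nil, or not an API status error: pass through unchanged
	}
	switch {
	case o == opDelete && se.code == http.StatusNotFound:
		return nil // nothing left to delete
	case o == opUpdate && (se.code == http.StatusNotFound || se.code == http.StatusConflict):
		return nil // object is gone, or our copy is stale
	case o == opCreate && se.code == http.StatusConflict:
		return nil // object already exists
	}
	return err
}

func main() {
	gone := &statusError{code: http.StatusNotFound, msg: "not found"}
	fmt.Println(ignoreExpected(opDelete, gone) == nil) // true
	fmt.Println(ignoreExpected(opCreate, gone) == nil) // false
}
```

Keeping the decision in one place makes it easy to see that every other error (for example a 500) is still returned and retried.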

@mikesmithgh mikesmithgh force-pushed the ignore-isnotfound-err branch from 1c24b53 to d672fc9 Compare October 1, 2021 02:47
@codecov

codecov bot commented Oct 1, 2021

Codecov Report

Merging #376 (a9968b0) into master (81ac1ad) will increase coverage by 3.20%.
The diff coverage is 93.66%.

@@            Coverage Diff             @@
##           master     #376      +/-   ##
==========================================
+ Coverage   46.52%   49.72%   +3.20%     
==========================================
  Files          49       55       +6     
  Lines        3196     3306     +110     
==========================================
+ Hits         1487     1644     +157     
+ Misses       1461     1420      -41     
+ Partials      248      242       -6     
Flag Coverage Δ
integration 41.94% <12.72%> (-0.94%) ⬇️
unit 29.43% <92.95%> (+13.50%) ⬆️

Impacted Files Coverage Δ
pkg/internal/testutils/hooks/hooks.go 75.00% <75.00%> (ø)
pkg/controller/composite/controller.go 55.42% <81.81%> (+4.96%) ⬆️
pkg/controller/decorator/controller.go 57.41% <90.00%> (+2.07%) ⬆️
pkg/controller/common/manage_children.go 68.05% <100.00%> (+24.64%) ⬆️
pkg/dynamic/clientset/clientset.go 57.14% <100.00%> (+1.42%) ⬆️
pkg/internal/testutils/common/metav1.go 100.00% <100.00%> (ø)
pkg/internal/testutils/common/rest.go 100.00% <100.00%> (ø)
pkg/internal/testutils/common/unstructured.go 100.00% <100.00%> (ø)
pkg/internal/testutils/common/util.go 100.00% <100.00%> (ø)
.../internal/testutils/dynamic/clientset/clientset.go 100.00% <100.00%> (ø)
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81ac1ad...a9968b0. Read the comment docs.

@grzesuav
Contributor

grzesuav commented Oct 3, 2021

Hi @mjsmith1028, I will take a look tomorrow. Thanks for the PR.

Contributor

@grzesuav grzesuav left a comment

I checked, and it seems that deletion of the parent object is handled, but in a slightly different manner. If you encountered this issue, could you post the logs or a stack trace?

@mikesmithgh
Collaborator Author

Hi @grzesuav, I still haven't had time to fully dig into this. But, I played around a little today and I think I see what is happening at least in one case.

  1. https://github.com/metacontroller/metacontroller/blob/master/pkg/controller/composite/controller.go#L543 - removes the finalizer
  2. k8s deletes the object
  3. https://github.com/metacontroller/metacontroller/blob/master/pkg/controller/composite/controller.go#L598 - attempts to update the object but it no longer exists

The reason I started seeing this is that I bumped the client-go QPS/burst up to 150/300 and the number of workers to 50, so we are processing workloads very fast. In this case, the object is deleted so quickly that it can occasionally trigger this scenario.

I added a time.Sleep(1 * time.Minute) before updateParentStatus to reproduce this.

2021-10-19T00:56:39.728Z	ERROR	failed to sync IndexedJob 'default/print-index': can't update status for IndexedJob default/print-index: indexedjobs.ctl.enisoc.com "print-index" not found

metacontroller/pkg/controller/composite.(*parentController).worker
	/go/src/metacontroller/pkg/controller/composite/controller.go:287
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90
metacontroller/pkg/controller/composite.(*parentController).Start.func1.1
	/go/src/metacontroller/pkg/controller/composite/controller.go:263

I'll try to get more concrete details on this. But, at least in my local branch I have added checks for 404 on deletes, 409 on updates, and 403 on creates. I'll update the PR with some more details later this week.

@grzesuav
Contributor

@mjsmith1028 yes, it can be related to the fact that client-go uses a cached state of the cluster, so as not to overload the API server, which means it can see a state that is not up to date. discovery-interval is set to 30 seconds by default.

@grzesuav
Contributor

To be more exact, informers (which trigger reconciles) use the k8s client's cached view of the cluster, but operations in the controller interact with the live cluster.

@mikesmithgh mikesmithgh force-pushed the ignore-isnotfound-err branch from d672fc9 to 1583cab Compare November 5, 2021 17:43
@mikesmithgh mikesmithgh changed the title from "fix(controller): Do not add to queue if not found" to "fix(controller): Ignore 404/409 error responses" Nov 5, 2021
@mikesmithgh mikesmithgh force-pushed the ignore-isnotfound-err branch 2 times, most recently from 7b3f255 to 10abb3a Compare November 11, 2021 17:07
@mikesmithgh mikesmithgh force-pushed the ignore-isnotfound-err branch from 10abb3a to a9968b0 Compare November 11, 2021 17:23
Contributor

@grzesuav grzesuav left a comment

I just have doubts about whether we can safely swallow the error in those two places. Could you comment?

@mikesmithgh
Collaborator Author

I just have doubts about whether we can safely swallow the error in those two places. Could you comment?

Hi @grzesuav , I added some comments on why I think it should be safe.

One thing to note is that my metacontroller setup has 50 workers, 150 client-go qps and 300 client-go-burst. So, it is processing sync requests pretty fast and the cluster is pretty active with other work outside of my controller which may put some stress on K8s control plane causing occasional slowness in K8s API responses. I think that is why these edge cases showed up.

Please let me know if the logic seems correct in my comments. Thanks!

@grzesuav grzesuav merged commit 5c983a4 into metacontroller:master Nov 25, 2021
@grzesuav
Contributor

🎉 This PR is included in version 2.0.15 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

2 participants