fix race in aggregated discovery controller #115302
Conversation
This was the problematic line of code. If the APIService was removed (for instance, during a long-running fetchFreshDiscoveryForService), this would then erroneously add it back.
Is lastReconciled ever set now?
It has been removed. It was only used for tests and was replaced by AllServicesSynced.
ah, great.
I think that's good, thanks!
/lgtm
LGTM label has been added. Git tree hash: aaf23b9733607a1ea9a2d1394dcc19a0ce4d0541
/assign @lavalamp
staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery.go
if it is for testing can you make it not exported and/or move it to live in the _test.go file?
That would require making the tests white-box. Wasn't sure of the correct side of the tradeoff to take. I'll change it over, then.
that's unfortunate but I don't think you want external callers of this as it is.
I'd rather have the test use an internal method than a function exported for the sake of an external test?
Correct. I'm moving it over.
I've removed AllServicesSynced and replaced it with an internal test function.
Caused by the following sequence:
1. Add APIService to map
2. Begin async fetch
3. Remove APIService from map
4. Finish async fetch and store the APIService back in the map to update its lastReconciled time

Fixed by removing lastReconciled (only used for testing) and switching tests to just wait until the dirty queue is empty.
Force-pushed from 08cb6d3 to cff4d07.
/retest
PTAL again
func newDiscoveryManager(rm discoveryendpoint.ResourceManager) *discoveryManager {
	return NewDiscoveryManager(rm).(*discoveryManager)
}
Something to think about for a followup: why does this package need both an interface and an implementation struct? Why not just an exported implementation struct?
/lgtm
LGTM label has been added. Git tree hash: 366bcfee710586f5919056696d681f1416554868
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alexzielenski, lavalamp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
/retest
/retest
/retest
/retest
/retest
/triage accepted
Thanks for the fix ... the test is still flaking on release-1.26 (https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestRemoveAPIService) ... can we backport the fix if only to stabilize CI?
Discovered via the flaking TestRemoveAPIService. Caused by the following sequence:
1. Add APIService to map
2. Begin async fetch
3. Remove APIService from map
4. Finish async fetch and store the APIService back in the map to update its lastReconciled time

Fixed by removing lastReconciled (it was only used for testing), so it no longer needs to be updated, and by switching the tests to just wait until the dirty queue is empty.
Had to put sleeps in TestRemoveAPIService to be able to trigger the bug, as well as in fetchFreshDiscoveryForService to simulate a long-running request, since stress could not repro it on my system.

What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Flaky test initially discovered in #115230 (comment), but it's also a bug outside of tests.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: