Deflake tests in staging/src/k8s.io/kube-aggregator/pkg/apiserver
#115859
Conversation
In a super quick look, I can observe that you have getters to the controller cache; maybe that can give you a way to confirm that all items were processed?
While this blocks until all items in the queue have synced, I don't think it is the right general solution. waitForEmptyQueue needs to be modified to wait for the queue to stop processing; otherwise the other tests that use waitForEmptyQueue will still be affected.
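For context, a minimal sketch of the race being described here, using the generic client-go workqueue (names like syncAPIService are illustrative, not the kube-aggregator code): a worker removes an item with Get() before it starts syncing it, so a waiter that only polls Len() == 0 can return while the sync is still in flight.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// syncAPIService stands in for the real per-item sync; it is hypothetical.
func syncAPIService(item interface{}) {
	time.Sleep(100 * time.Millisecond) // simulate slow reconciliation
}

func main() {
	q := workqueue.New()
	q.Add("apiservice-a")

	go func() {
		item, _ := q.Get()   // the item leaves the queue here...
		syncAPIService(item) // ...but its sync has not finished yet
		q.Done(item)
	}()

	time.Sleep(10 * time.Millisecond)
	// Likely prints 0 even though syncAPIService is still running,
	// which is why polling Len() == 0 alone is not enough.
	fmt.Println("queue length:", q.Len())
}
```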
I think so too. I am trying to solve this in a more general way. See #115859 (comment)
I think the cache-based way does not look clear for now. Actually, I am struggling to solve this issue in a more general way (the current fix looks like just a hack to me). I feel that there are a few workarounds,
Option 3 looks like the better approach to me... what do you think?
Applied 3. One concern is that I am not sure it is OK to add a method to
There are some other implementations for
Added completerWorkQueue to check if the workqueue is complete for testing purposes.
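For readers following along, a rough sketch of the completer-workqueue idea under discussion (assumed shape only; the implementation in the PR may differ): wrap the work queue, remember every added item, and only report completion once Done has been called for all of them.

```go
package apiserver

import (
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// completerWorkqueue wraps a workqueue and tracks every added item until it
// has been marked Done, so a test can ask whether all work has truly finished.
// This is a sketch; the merged version may wrap a rate-limiting queue instead.
type completerWorkqueue struct {
	workqueue.Interface

	lock       sync.Mutex
	processing map[interface{}]struct{}
}

func newCompleterWorkqueue(q workqueue.Interface) *completerWorkqueue {
	return &completerWorkqueue{Interface: q, processing: map[interface{}]struct{}{}}
}

func (q *completerWorkqueue) Add(item interface{}) {
	q.lock.Lock()
	q.processing[item] = struct{}{}
	q.lock.Unlock()
	q.Interface.Add(item)
}

func (q *completerWorkqueue) Done(item interface{}) {
	q.lock.Lock()
	delete(q.processing, item)
	q.lock.Unlock()
	q.Interface.Done(item)
}

// isComplete reports whether every item that was ever added has been Done'd.
func (q *completerWorkqueue) isComplete() bool {
	q.lock.Lock()
	defer q.lock.Unlock()
	return len(q.processing) == 0
}
```

Tracking items from Add rather than from Get avoids the window in which an item has already left the queue but its sync is still running.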
Force-pushed from f2d6989 to 8a2bc2a: waitForQueueComplete to check if the workqueue is complete
Added the test that you suggested and changed the commit message.
/lgtm
LGTM label has been added. Git tree hash: 815e55710d8c07d51424a4e0f0fc2b08200cb486
diff --git a/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go b/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go
index 9bd48972404..c07993779d7 100644
--- a/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go
+++ b/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go
@@ -18,6 +18,7 @@ package apiserver
import (
"context"
"net/http"
"net/http/httptest"
"strconv"
@@ -44,12 +45,14 @@ func newDiscoveryManager(rm discoveryendpoint.ResourceManager) *discoveryManager
return NewDiscoveryManager(rm).(*discoveryManager)
}
-// Returns true if the queue of services to sync empty this means everything has
-// been reconciled and placed into merged document
-func waitForEmptyQueue(stopCh <-chan struct{}, dm *discoveryManager) bool {
+// Returns true when it reaches the number of the cached services
+func waitForCacheNumber(stopCh <-chan struct{}, n int, dm *discoveryManager) bool {
return cache.WaitForCacheSync(stopCh, func() bool {
// Once items have successfully synced they are removed from queue.
- return dm.dirtyAPIServiceQueue.Len() == 0
+ dm.resultsLock.Lock()
+ defer dm.resultsLock.Unlock()
+ return len(dm.cachedResults) == n
})
}
@@ -63,7 +66,6 @@ func TestBasic(t *testing.T) {
service2.SetGroups(apiGroup2.Items)
aggregatedResourceManager := discoveryendpoint.NewResourceManager()
aggregatedManager := newDiscoveryManager(aggregatedResourceManager)
-
for _, g := range apiGroup1.Items {
for _, v := range g.Versions {
aggregatedManager.AddAPIService(&apiregistrationv1.APIService{
@@ -103,7 +105,7 @@ func TestBasic(t *testing.T) {
go aggregatedManager.Run(testCtx.Done())
- require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+ require.True(t, waitForCacheNumber(testCtx.Done(), 2, aggregatedManager))
response, _, parsed := fetchPath(aggregatedResourceManager, "")
if response.StatusCode != 200 {
@@ -159,7 +161,9 @@ func TestDirty(t *testing.T) {
defer cancel()
go aggregatedManager.Run(testCtx.Done())
- require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+ require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
+ // time.Sleep(1 * time.Second)
+ fmt.Println("POST", aggregatedManager.apiServices, aggregatedManager.cachedResults)
// immediately check for ping, since Run() should block for local services
if !pinged.Load() {
@@ -211,7 +215,7 @@ func TestRemoveAPIService(t *testing.T) {
aggregatedManager.RemoveAPIService(s.Name)
}
- require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+ require.True(t, waitForCacheNumber(testCtx.Done(), 0, aggregatedManager))
response, _, parsed := fetchPath(aggyService, "")
if response.StatusCode != 200 {
@@ -293,7 +297,7 @@ func TestLegacyFallback(t *testing.T) {
defer cancel()
go aggregatedManager.Run(testCtx.Done())
- require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+ require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
// At this point external services have synced. Check if discovery document
// includes the legacy resources
@@ -362,7 +366,7 @@ func TestNotModified(t *testing.T) {
// Important to wait here to ensure we prime the cache with the initial list
// of documents in order to exercise 304 Not Modified
- require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+ require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
// Now add all groups. We excluded one group before so that AllServicesSynced
// could include it in this round. Now, if AllServicesSynced ever returns
@@ -373,7 +377,7 @@ func TestNotModified(t *testing.T) {
}
// This would wait the full timeout on 1.26.0.
- require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+ require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
}

I tried the cache approach and it looks easier than the queue.
@aojea while that is a good approach, I think that if we in the future added a test for updating a pre-existing entry (which I'm now likely to add very soon), your alternative would not be sufficient, if I'm reading correctly?
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alexzielenski, gjkim42, liggitt
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Looks like there are verify errors to resolve.
/hold
This uses atomic.Bool as updating and reading a boolean-type variable concurrently is not thread-safe.
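As a hedged illustration of that fix (the variable and handler names here only stand in for the test's, they are not copied from it), the pattern looks like this:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

func main() {
	// The handler runs on the test server's goroutine while the test reads
	// the flag from its own goroutine; a plain bool would be a data race,
	// atomic.Bool makes both the write and the read safe.
	var pinged atomic.Bool

	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		pinged.Store(true)
		w.WriteHeader(http.StatusOK)
	}))
	defer server.Close()

	resp, err := http.Get(server.URL)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	fmt.Println("pinged:", pinged.Load()) // read without a race
}
```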
`waitForEmptyQueue` cannot guarantee that all items in the queue have been synced completely; it only guarantees that all items have been started. This adds `waitForQueueComplete` and implements `completerWorkqueue` to check if the workqueue is complete, to deflake the tests in staging/src/k8s.io/kube-aggregator/pkg/apiserver.
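A sketch of how such a helper could be wired up on top of a completer queue (assumed wiring; the field name dirtyAPIServiceQueue comes from the diff above, and the type assertion is hypothetical):

```go
// Inside the apiserver test package; cache is k8s.io/client-go/tools/cache.
// waitForQueueComplete polls until everything added to the queue has also
// been marked Done, instead of merely waiting for the queue to look empty.
func waitForQueueComplete(stopCh <-chan struct{}, dm *discoveryManager) bool {
	return cache.WaitForCacheSync(stopCh, func() bool {
		return dm.dirtyAPIServiceQueue.(*completerWorkqueue).isComplete()
	})
}
```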
Force-pushed from c40fa3e to e24e3de
That is because #115770 has been merged, which introduced a new test to fix. I have now rebased this PR and fixed the newly added test as well.
Could you re-LGTM this?
/lgtm
LGTM label has been added. Git tree hash: 50f28cfc357d258afd1c1b1a552e24aa905c1da1
/hold cancel
@alexzielenski it was just an alternative because I saw your comment about the workqueue completion, but I don't have a strong opinion; the current option LGTM too.
What type of PR is this?
/kind failing-test
/kind flake
What this PR does / why we need it:
This PR addresses two issues in `handler_discovery_test.go`: (1) it uses `atomic.Bool` where a boolean is updated and read concurrently, and (2) it ensures that `discoveryManager` completes the sync on all items in the queue. More details on issue 2 below:
`waitForEmptyQueue` cannot guarantee that `discoveryManager` completes the sync on all items in the queue; it only guarantees that all items have started to sync. This PR adds `waitForQueueComplete` and implements `completerWorkqueue` to check if the workqueue is complete.
Which issue(s) this PR fixes:
Fixes #115858
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: