gjkim42
Member

@gjkim42 gjkim42 commented Feb 17, 2023

What type of PR is this?

/kind failing-test
/kind flake

What this PR does / why we need it:

This PR addresses two issues in handler_discovery_test.go.

  1. The bool variable is updated and read concurrently, which is not thread-safe. This one is simple to fix.
  2. (This is quite tricky to address.) It is hard to know when discoveryManager has finished syncing all items in the queue.
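
As an illustration of issue 1, here is a minimal sketch of sharing a flag between a goroutine and the test body with `atomic.Bool` (assuming Go 1.19 or later). `pingAndReport` is a hypothetical name chosen for this example and is not code from the PR:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pingAndReport is a hypothetical stand-in for a test in which a handler
// goroutine records that it was called while the test goroutine reads the flag.
func pingAndReport() bool {
	var pinged atomic.Bool // the zero value is false

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Store is safe to call concurrently with Load.
		pinged.Store(true)
	}()
	wg.Wait()

	return pinged.Load()
}

func main() {
	fmt.Println("pinged:", pingAndReport())
}
```

With a plain `bool`, the same write and read from different goroutines without synchronization would be a data race, which `go test -race` is designed to report.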

More details on issue 2 below:
waitForEmptyQueue cannot guarantee that discoveryManager has finished syncing all items in the queue. It only guarantees that syncing has started for every item.

This PR adds waitForQueueComplete and implements completerWorkqueue to check if the workqueue is complete.
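
The difference between "empty" and "complete" can be sketched with a simplified queue. Everything below (`trackingQueue`, `IsComplete`, the item name) is hypothetical and written only for illustration; it is neither the client-go workqueue nor the PR's `completerWorkqueue`:

```go
package main

import (
	"fmt"
	"sync"
)

// trackingQueue is a hypothetical, simplified queue that remembers which
// items have been handed to a worker but not yet marked as done.
type trackingQueue struct {
	mu         sync.Mutex
	pending    []string
	processing map[string]struct{}
}

func newTrackingQueue() *trackingQueue {
	return &trackingQueue{processing: make(map[string]struct{})}
}

// Add enqueues an item.
func (q *trackingQueue) Add(item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, item)
}

// Get removes the next pending item and marks it as being processed.
// The second result is false when nothing is pending.
func (q *trackingQueue) Get() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pending) == 0 {
		return "", false
	}
	item := q.pending[0]
	q.pending = q.pending[1:]
	q.processing[item] = struct{}{}
	return item, true
}

// Done marks an item as fully processed.
func (q *trackingQueue) Done(item string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.processing, item)
}

// Len counts only pending items, so it drops to zero as soon as work starts.
func (q *trackingQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending)
}

// IsComplete is true only when nothing is pending and nothing is in flight.
func (q *trackingQueue) IsComplete() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending) == 0 && len(q.processing) == 0
}

func main() {
	q := newTrackingQueue()
	q.Add("v1.example.com")

	item, _ := q.Get()                        // a worker picks the item up
	fmt.Println(q.Len() == 0, q.IsComplete()) // true false

	q.Done(item)                              // the worker finishes the sync
	fmt.Println(q.Len() == 0, q.IsComplete()) // true true
}
```

Between `Get` and `Done`, the length check already passes even though the sync is still running, which is the window in which a test that only waits for an empty queue can read stale state.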

Which issue(s) this PR fixes:

Fixes #115858

$ stress ./apiserver.test                    
5s: 0 runs so far, 0 failures       
10s: 12 runs so far, 0 failures     
15s: 24 runs so far, 0 failures     
...
3m50s: 445 runs so far, 0 failures
3m55s: 456 runs so far, 0 failures
4m0s: 461 runs so far, 0 failures

Special notes for your reviewer:

Does this PR introduce a user-facing change?

none

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 17, 2023
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 17, 2023
@aojea
Member

aojea commented Feb 17, 2023

From a very quick look, I can see that you have getters for the controller cache. Maybe that can give you a way to confirm that all items were processed?

Member

While this blocks until all items in the queue have synced, I don't think it is the right general solution. waitForEmptyQueue needs to be modified to wait for the queue to stop processing; otherwise the other tests that use waitForEmptyQueue will still be affected.

Member Author

I think so too. I am trying to solve this in a more general way. See #115859 (comment)

@gjkim42
Member Author

gjkim42 commented Feb 17, 2023

@aojea

From a very quick look, I can see that you have getters for the controller cache. Maybe that can give you a way to confirm that all items were processed?

I don't think the cache-based approach is clear yet.

Actually, I am struggling to solve this issue in a more general way (the current fix looks like just a hack to me).

I see a few possible workarounds:

  1. waiting for the desired state with a timeout in tests
  2. adding a processingCount to the controller
  3. adding an IsEmpty method (Len is 0 and nothing is being processed) to queue.Interface

Option 3 looks like the better approach to me. What do you think?
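
For comparison, option 1 (waiting for the desired state with a timeout) could look roughly like the sketch below. `waitFor` is a hypothetical helper written for this example; it is not an existing function in the test file or in client-go:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls cond every interval until cond returns true or timeout
// elapses. It reports whether cond became true in time.
func waitFor(cond func() bool, interval, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	var synced atomic.Bool
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated sync work
		synced.Store(true)
	}()

	// The test asserts on the state it actually needs instead of on the queue.
	fmt.Println(waitFor(synced.Load, 5*time.Millisecond, time.Second))
}
```

The trade-off is that each test has to express its own "desired state" predicate, whereas a queue-level check can be shared by every test that uses the same manager.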

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 17, 2023
@gjkim42
Member Author

gjkim42 commented Feb 17, 2023

  3. adding an IsEmpty method (Len is 0 and nothing is being processed) to queue.Interface

Applied option 3.

One concern: I am not sure it is OK to add a method to queue.Interface.

There are some other implementations of queue.Interface...
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/115859/pull-kubernetes-verify-govet-levee/1626622414869762048

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 17, 2023
@gjkim42
Member Author

gjkim42 commented Feb 17, 2023

Added completerWorkqueue to check whether the workqueue is complete, for testing purposes.

@gjkim42 gjkim42 force-pushed the deflake-TestDirty branch 2 times, most recently from f2d6989 to 8a2bc2a Compare February 17, 2023 18:12
@gjkim42 gjkim42 changed the title from "WIP: Deflake TestDirty" to "Add waitForQueueComplete to check if the workqueue is complete" Feb 17, 2023
@gjkim42
Member Author

gjkim42 commented Feb 24, 2023

@alexzielenski can you take a final look at the most recent updates?

added a test that you suggested and changed the commit message.
Could you PTAL?

@alexzielenski
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 24, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 815e55710d8c07d51424a4e0f0fc2b08200cb486

@gjkim42
Member Author

gjkim42 commented Feb 27, 2023

@liggitt
Could you take a look?

I think this test-only change will also need to be back ported to 1.26. My change to patch a race #115302 introduced this flake.

And we think that this needs to be backported.

@aojea
Member

aojea commented Feb 28, 2023

From a very quick look, I can see that you have getters for the controller cache. Maybe that can give you a way to confirm that all items were processed?

diff --git a/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go b/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go
index 9bd48972404..c07993779d7 100644
--- a/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go
+++ b/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_discovery_test.go
@@ -18,6 +18,7 @@ package apiserver
 
 import (
        "context"
+       "fmt"
        "net/http"
        "net/http/httptest"
        "strconv"
@@ -44,12 +45,14 @@ func newDiscoveryManager(rm discoveryendpoint.ResourceManager) *discoveryManager
        return NewDiscoveryManager(rm).(*discoveryManager)
 }
 
-// Returns true if the queue of services to sync empty this means everything has
-// been reconciled and placed into merged document
-func waitForEmptyQueue(stopCh <-chan struct{}, dm *discoveryManager) bool {
+// Returns true when the number of cached services reaches n
+func waitForCacheNumber(stopCh <-chan struct{}, n int, dm *discoveryManager) bool {
        return cache.WaitForCacheSync(stopCh, func() bool {
                // Once items have successfully synced they are removed from queue.
-               return dm.dirtyAPIServiceQueue.Len() == 0
+               dm.resultsLock.Lock()
+               defer dm.resultsLock.Unlock()
+               return len(dm.cachedResults) == n
        })
 }
 
@@ -63,7 +66,6 @@ func TestBasic(t *testing.T) {
        service2.SetGroups(apiGroup2.Items)
        aggregatedResourceManager := discoveryendpoint.NewResourceManager()
        aggregatedManager := newDiscoveryManager(aggregatedResourceManager)
-
        for _, g := range apiGroup1.Items {
                for _, v := range g.Versions {
                        aggregatedManager.AddAPIService(&apiregistrationv1.APIService{
@@ -103,7 +105,7 @@ func TestBasic(t *testing.T) {
 
        go aggregatedManager.Run(testCtx.Done())
 
-       require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+       require.True(t, waitForCacheNumber(testCtx.Done(), 2, aggregatedManager))
 
        response, _, parsed := fetchPath(aggregatedResourceManager, "")
        if response.StatusCode != 200 {
@@ -159,7 +161,9 @@ func TestDirty(t *testing.T) {
        defer cancel()
 
        go aggregatedManager.Run(testCtx.Done())
-       require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+       require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
+       // time.Sleep(1 * time.Second)
+       fmt.Println("POST", aggregatedManager.apiServices, aggregatedManager.cachedResults)
 
        // immediately check for ping, since Run() should block for local services
        if !pinged.Load() {
@@ -211,7 +215,7 @@ func TestRemoveAPIService(t *testing.T) {
                aggregatedManager.RemoveAPIService(s.Name)
        }
 
-       require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+       require.True(t, waitForCacheNumber(testCtx.Done(), 0, aggregatedManager))
 
        response, _, parsed := fetchPath(aggyService, "")
        if response.StatusCode != 200 {
@@ -293,7 +297,7 @@ func TestLegacyFallback(t *testing.T) {
        defer cancel()
 
        go aggregatedManager.Run(testCtx.Done())
-       require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+       require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
 
        // At this point external services have synced. Check if discovery document
        // includes the legacy resources
@@ -362,7 +366,7 @@ func TestNotModified(t *testing.T) {
 
        // Important to wait here to ensure we prime the cache with the initial list
        // of documents in order to exercise 304 Not Modified
-       require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+       require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
 
        // Now add all groups. We excluded one group before so that AllServicesSynced
        // could include it in this round. Now, if AllServicesSynced ever returns
@@ -373,7 +377,7 @@ func TestNotModified(t *testing.T) {
        }
 
        // This would wait the full timeout on 1.26.0.
-       require.True(t, waitForEmptyQueue(testCtx.Done(), aggregatedManager))
+       require.True(t, waitForCacheNumber(testCtx.Done(), 1, aggregatedManager))
 }

I tried the cache approach, and it looks easier than the queue approach.

stress ./apiserver.test -test.run TestDirty
5s: 999 runs so far, 0 failures
10s: 2071 runs so far, 0 failures
15s: 3137 runs so far, 0 failures
20s: 4173 runs so far, 0 failures
25s: 5219 runs so far, 0 failures
30s: 6275 runs so far, 0 failures
35s: 7326 runs so far, 0 failures
40s: 8383 runs so far, 0 failures
45s: 9470 runs so far, 0 failures
50s: 10549 runs so far, 0 failures

@alexzielenski
Member

alexzielenski commented Feb 28, 2023

@aojea while that is a good approach, I think that if we added a test in the future for updating a pre-existing entry (which I'm now likely to add very soon), your alternative would not be sufficient, if I'm reading it correctly?
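
The concern about updates can be sketched as follows, under the assumption that the wait condition only compares the number of cached entries. `resultCache` and its methods are hypothetical names for this example and do not mirror the real discoveryManager fields:

```go
package main

import (
	"fmt"
	"sync"
)

// resultCache is a hypothetical lock-protected cache keyed by service name.
type resultCache struct {
	mu      sync.Mutex
	results map[string]string
}

func newResultCache() *resultCache {
	return &resultCache{results: make(map[string]string)}
}

// set stores or overwrites the cached document for a service.
func (c *resultCache) set(name, doc string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.results[name] = doc
}

// count returns the number of cached entries.
func (c *resultCache) count() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.results)
}

func main() {
	c := newResultCache()
	c.set("v1.stable.example.com", "discovery-v1")
	before := c.count()

	// Updating a pre-existing entry does not change the entry count, so a
	// condition such as "count == 1" holds both before and after the update.
	c.set("v1.stable.example.com", "discovery-v2")
	after := c.count()

	fmt.Println(before, after) // 1 1
}
```

A count-based wait therefore cannot tell whether the update has been applied yet, while a check on in-flight queue items does not depend on how the cache contents change.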

@liggitt
Member

liggitt commented Mar 1, 2023

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexzielenski, gjkim42, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2023
@liggitt
Member

liggitt commented Mar 1, 2023

looks like there are verify errors to resolve

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2023
gjkim42 added 2 commits March 1, 2023 13:23
This uses atomic.Bool as updating and reading a boolean-type variable
concurrently is not thread-safe.

`waitForEmptyQueue` cannot guarantee that all items in the queue have
been synced completely; it only guarantees that all items have been started.

This adds `waitForQueueComplete` and implements `completerWorkqueue` to
check if the workqueue is complete to deflake the tests in
staging/src/k8s.io/kube-aggregator/pkg/apiserver.
@gjkim42 gjkim42 force-pushed the deflake-TestDirty branch from c40fa3e to e24e3de Compare March 1, 2023 04:29
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2023
@gjkim42
Member Author

gjkim42 commented Mar 1, 2023

looks like there are verify errors to resolve

That is because #115770 has been merged, which introduced a new test that also needs this fix.

I have now rebased this PR and fixed the newly added test as well.

$ stress ./apiserver.test
5s: 0 runs so far, 0 failures       
10s: 12 runs so far, 0 failures     
15s: 12 runs so far, 0 failures
...
1h0m0s: 4582 runs so far, 0 failures

@alexzielenski @liggitt

Could you re-LGTM on this?

@liggitt
Member

liggitt commented Mar 1, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 50f28cfc357d258afd1c1b1a552e24aa905c1da1

@liggitt
Member

liggitt commented Mar 1, 2023

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2023
@aojea
Member

aojea commented Mar 1, 2023

@aojea while that is a good approach, I think that if we added a test in the future for updating a pre-existing entry (which I'm now likely to add very soon), your alternative would not be sufficient, if I'm reading it correctly?

@alexzielenski it was just an alternative, because I saw your comment on the workqueue-complete approach, but I don't have a strong opinion; the current option LGTM too.

@k8s-ci-robot k8s-ci-robot merged commit e2fff53 into kubernetes:master Mar 1, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Mar 1, 2023
@gjkim42 gjkim42 deleted the deflake-TestDirty branch March 1, 2023 22:28
k8s-ci-robot added a commit that referenced this pull request Mar 11, 2023
…f-#115302-upstream-release-1.26

Cherry pick of Aggregated Discovery Patches: #115302 #115770 #115998 #115859
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Development

Successfully merging this pull request may close these issues.

[Flaking test] pull-kubernetes-unit - TestDirty fails sometimes
6 participants