Fix goroutine leak in federation service controller #33359
Can a kubernetes member verify that this patch is reasonable to test? If so, please reply with "@k8s-bot ok to test" on its own line. Regular contributors should join the org to skip this step. While we transition away from the Jenkins GitHub PR Builder plugin, "ok to test" commenters will need to be on the admin list defined in this file. |
|
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
|
|
I signed it! |
|
CLAs look good, thanks! |
madhusudancs left a comment:
This PR drives home the urgency to move the federated service controller to the new FederatedInformer architecture.
| func (sc *ServiceController) clusterEndpointWorker() { | ||
| fedClient := sc.federationClient | ||
| // process all pending events in endpointWorkerDoneChan | ||
| eventPending := true |
What's the point of this variable? It's not really useful. Just get rid of it and change the loop to
for {
...
}
The break statement in the default case only breaks out of the select statement, so the goroutine would stay stuck in the for loop forever. We need this variable to exit the for loop.
Oops, you are right. This isn't a common idiom, which is why it threw me off guard.
What you have is fine, but if you want to stay with a commonly used idiom (a labeled break), please see https://golang.org/ref/spec#Break_statements.
| } | ||
|
|
||
| for clusterName, cache := range sc.clusterCache.clientMap { | ||
| workerExist, keyFound := sc.endpointWorkerMap[clusterName] |
Change keyFound to found.
| if keyFound && workerExist { | ||
| continue | ||
| } | ||
| sc.endpointWorkerMap[clusterName] = true |
Set this after starting the goroutine, just to be safe.
| fedClient := sc.federationClient | ||
| // process all pending events in serviceWorkerDoneChan | ||
| eventPending := true | ||
| for eventPending { |
Same comment as above.
| } | ||
|
|
||
| for clusterName, cache := range sc.clusterCache.clientMap { | ||
| workerExist, keyFound := sc.serviceWorkerMap[clusterName] |
Same as above. Change to found.
| go func(cache *clusterCache, clusterName string) { | ||
| fedClient := sc.federationClient | ||
| for { | ||
| func() { |
This function is important here.
Notice the defer statement inside this block. This func() exists for that sole reason. At the end of each loop, we need to say we are done for that key, so we defer cache.endpointQueue.Done(key). If you remove the func() {} wrapping, defer will be run at the end of the enclosing goroutine and the keys won't be removed from the queue until then. So I don't think you should be removing the func() {} wrapping.
Agreed, I missed that. Will change.
| if err != nil { | ||
| glog.Errorf("Failed to sync service: %+v", err) | ||
| } | ||
| }() |
Same comment as above.
| KubeAPIQPS = 20.0 | ||
| KubeAPIBurst = 30 | ||
|
|
||
| maxNoOfClusters = 256 |
We have informally talked about this before but we have never established this formally. We plan to support 100 clusters initially. 256 seems too high to start with.
Sure, will change that to 100.
|
@madhusudancs I handled the review comments in the second commit, please check. |
|
@k8s-bot federation gce e2e test this |
|
Jenkins Federation GCE e2e failed for commit 690a06b. Full PR test history. The magic incantation to run this job again is |
|
Federation tests that are failing are known failures. These changes seem to not have caused any regressions, so LGTM'ing the PR. @shashidharatd thanks for these changes! |
|
1.4.0 train has already left, so this must be cherry-picked to v1.4.1+. |
|
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
|
Jenkins GKE smoke e2e failed for commit 690a06b. Full PR test history. The magic incantation to run this job again is |
|
Automatic merge from submit-queue |
|
Thanks @madhusudancs |
|
Added cherrypick-candidate label. As @madhusudancs mentions above, this can wait for v1.4.1. Not sure what labelling/milestone convention we're using to represent that. |
|
@shashidharatd can you open the PR to cherry-pick this into the release-1.4 branch? |
|
@jessfraz are we not doing automated cherry-picks any more? |
#33163-#33227-#33359-#33605-#33967-#33977-#34158-origin-release-1.4
Automatic merge from submit-queue
Automated cherry pick of #32914 #33163 #33227 #33359 #33605 #33967 #33977 #34158 origin release 1.4
Cherry pick of #32914 #33163 #33227 #33359 #33605 #33967 #33977 #34158 on release-1.4.
#32914: Limit the number of names per image reported in the node
#33163: fix the appending bug
#33227: remove cpu limits for dns pod. The current limits are not
#33359: Fix goroutine leak in federation service controller
#33605: Add periodic ingress reconciliations.
#33967: scheduler: cache.delete deletes the pod from node specified
#33977: Heal the namespaceless ingresses in federation e2e.
#34158: Add missing argument to log message in federated ingress
|
Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
What this PR does / why we need it: Fixes a memory leak
Which issue this PR fixes (optional, in fixes #<issue number>(, #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #33186
Special notes for your reviewer: New goroutines are created every second and block waiting for the lock in the event queue. Only one worker gets the lock when there are events to process, so all the other goroutines created every second wait for the lock forever, which causes the memory/goroutine leak.
As a fix, a new worker is created only when no worker exists. There is only one worker per cluster, and it either waits for events or processes all pending events and then exits.
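The "one worker per cluster" guard described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the PR tracks workers in maps such as `endpointWorkerMap` and signals exit over a done channel, whereas this sketch uses a hypothetical `workerPool` type with a mutex and a `sync.WaitGroup`.

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool is a hypothetical type that allows at most one running
// worker per cluster name.
type workerPool struct {
	mu      sync.Mutex
	running map[string]bool
	wg      sync.WaitGroup
}

func newWorkerPool() *workerPool {
	return &workerPool{running: make(map[string]bool)}
}

// ensureWorker starts work in a new goroutine only if no worker is
// currently running for cluster. It reports whether a worker was started.
func (p *workerPool) ensureWorker(cluster string, work func()) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.running[cluster] {
		return false // a worker already exists, so do not spawn another
	}
	p.running[cluster] = true
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		work()
		// Clear the flag so a later call can start a fresh worker.
		p.mu.Lock()
		delete(p.running, cluster)
		p.mu.Unlock()
	}()
	return true
}

func main() {
	p := newWorkerPool()
	release := make(chan struct{})
	fmt.Println(p.ensureWorker("cluster-a", func() { <-release })) // started
	fmt.Println(p.ensureWorker("cluster-a", func() {}))            // skipped
	close(release)
	p.wg.Wait()
	fmt.Println(p.ensureWorker("cluster-a", func() {})) // started again
	p.wg.Wait()
}
```

Calling `ensureWorker` once per second, as the controller's periodic loop does, would then leave at most one goroutine per cluster instead of accumulating blocked ones.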