Fix goroutine leak in federation service controller #33359

Merged
k8s-github-robot merged 2 commits into kubernetes:master from shashidharatd:federation
Sep 24, 2016

@shashidharatd shashidharatd commented Sep 23, 2016

What this PR does / why we need it: Fixes a memory leak

Which issue this PR fixes (optional, in fixes #<issue number>(, #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #33186

Special notes for your reviewer: A new goroutine is created every second, and each one blocks waiting for the lock in the event queue. Only one worker gets the lock when there are events to process, so the goroutines created every second wait for the lock forever, which causes the memory/goroutine leak.

As a fix, a new worker is created only when no worker exists. There is only one worker per cluster, and it either waits for an event or processes all pending events and then exits.

Fixes memory/goroutine leak in Federation Service controller.

@k8s-ci-robot
Contributor

Can a kubernetes member verify that this patch is reasonable to test? If so, please reply with "@k8s-bot ok to test" on its own line.

Regular contributors should join the org to skip this step.

While we transition away from the Jenkins GitHub PR Builder plugin, "ok to test" commenters will need to be on the admin list defined in this file.

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Sep 23, 2016
@shashidharatd
Author

I signed it!

@googlebot

CLAs look good, thanks!

Contributor

@madhusudancs madhusudancs left a comment

This PR drives home the urgency to move the federated service controller to the new FederatedInformer architecture.

func (sc *ServiceController) clusterEndpointWorker() {
	fedClient := sc.federationClient
	// process all pending events in endpointWorkerDoneChan
	eventPending := true
Contributor

What's the point of this variable? It's not really useful. Just get rid of it and change the loop to

for {
    ...
}

Author

The break statement in the default case only breaks out of the select statement, so the goroutine would stay stuck in the for loop forever. We need this variable to exit the for loop.

Contributor

Oops, you are right. This isn't a common idiom, which is why it threw me off guard.

What you have is fine, but if you want to stay with a commonly used idiom, please see https://golang.org/ref/spec#Break_statements.
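
The section of the Go spec linked above covers labeled break statements. A minimal sketch of that idiom for a drain loop, assuming a hypothetical `drainPending` helper and a plain buffered channel rather than the controller's real queue:

```go
package main

import "fmt"

// drainPending consumes every event already buffered in ch and returns
// how many it handled. It returns as soon as nothing is ready.
// Hypothetical helper for illustration only.
func drainPending(ch <-chan string) int {
	handled := 0
drain:
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				// Channel closed: stop instead of spinning on zero values.
				break drain
			}
			handled++
		default:
			// A bare "break" would only leave the select statement;
			// the label makes it leave the enclosing for loop.
			break drain
		}
	}
	return handled
}

func main() {
	events := make(chan string, 3)
	events <- "a"
	events <- "b"
	fmt.Println(drainPending(events))
}
```

The label replaces a flag variable such as `eventPending`; both forms leave the loop once no event is ready, so the choice is a matter of style.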

}

for clusterName, cache := range sc.clusterCache.clientMap {
	workerExist, keyFound := sc.endpointWorkerMap[clusterName]
Contributor

Change keyFound to found.

if keyFound && workerExist {
	continue
}
sc.endpointWorkerMap[clusterName] = true
Contributor

Set this after starting the goroutine, just to be safe.

fedClient := sc.federationClient
// process all pending events in serviceWorkerDoneChan
eventPending := true
for eventPending {
Contributor

Same comment as above.

}

for clusterName, cache := range sc.clusterCache.clientMap {
	workerExist, keyFound := sc.serviceWorkerMap[clusterName]
Contributor

Same as above. Change to found.

go func(cache *clusterCache, clusterName string) {
	fedClient := sc.federationClient
	for {
		func() {
Contributor

This function is important here.

Notice the defer statement inside this block. This func() exists for that sole reason. At the end of each loop, we need to say we are done for that key, so we defer cache.endpointQueue.Done(key). If you remove the func() {} wrapping, defer will be run at the end of the enclosing goroutine and the keys won't be removed from the queue until then. So I don't think you should be removing the func() {} wrapping.

Author

agreed, missed that, will change
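
The defer scoping discussed in this thread can be seen in a small standalone sketch. The function names and the string log are hypothetical stand-ins; in the controller the deferred call is `cache.endpointQueue.Done(key)`.

```go
package main

import "fmt"

// syncWithWrapper mimics the worker loop: the anonymous function gives
// defer a per-iteration scope, so each key is marked done before the
// next key is picked up.
func syncWithWrapper(keys []string) (events []string) {
	for _, key := range keys {
		func() {
			defer func() { events = append(events, "done "+key) }()
			events = append(events, "sync "+key)
		}()
	}
	return events
}

// syncWithoutWrapper shows what happens when the wrapper is removed:
// every deferred call waits until the whole function returns, and the
// calls then run in reverse order.
func syncWithoutWrapper(keys []string) (events []string) {
	for _, key := range keys {
		defer func(k string) { events = append(events, "done "+k) }(key)
		events = append(events, "sync "+key)
	}
	return events
}

func main() {
	keys := []string{"a", "b"}
	fmt.Println(syncWithWrapper(keys))    // [sync a done a sync b done b]
	fmt.Println(syncWithoutWrapper(keys)) // [sync a sync b done b done a]
}
```

In a worker goroutine whose loop never returns, the second form would never run the deferred calls at all, which is why the wrapper matters there.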

	if err != nil {
		glog.Errorf("Failed to sync service: %+v", err)
	}
}()
Contributor

Same comment as above.

KubeAPIQPS = 20.0
KubeAPIBurst = 30

maxNoOfClusters = 256
Contributor

We have informally talked about this before but we have never established this formally. We plan to support 100 clusters initially. 256 seems too high to start with.

Author

sure will change that to 100

@k8s-github-robot k8s-github-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 23, 2016
@shashidharatd
Author

@madhusudancs I handled the review comments in the second commit, please check.

@madhusudancs
Contributor

@k8s-bot federation gce e2e test this

@k8s-ci-robot
Contributor

Jenkins Federation GCE e2e failed for commit 690a06b. Full PR test history.

The magic incantation to run this job again is @k8s-bot federation gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@madhusudancs
Contributor

Federation tests that are failing are known failures. These changes seem to not have caused any regressions, so LGTM'ing the PR.

@shashidharatd thanks for these changes!

@madhusudancs madhusudancs added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Sep 24, 2016
@madhusudancs madhusudancs added this to the v1.4 milestone Sep 24, 2016
@madhusudancs
Contributor

1.4.0 train has already left, so this must be cherry-picked to v1.4.1+.

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Contributor

Jenkins GKE smoke e2e failed for commit 690a06b. Full PR test history.

The magic incantation to run this job again is @k8s-bot gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 46c36fc into kubernetes:master Sep 24, 2016
@shashidharatd
Author

Thanks @madhusudancs

@ghost ghost added the cherrypick-candidate label Sep 24, 2016
@ghost

ghost commented Sep 24, 2016

Added cherrypick-candidate label. As @madhusudancs mentions above, this can wait for v1.4.1. Not sure what labelling/milestone convention we're using to represent that.

@jessfraz jessfraz added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Oct 6, 2016
@jessfraz
Contributor

jessfraz commented Oct 6, 2016

@shashidharatd can you open a PR to cherry-pick this into the release-1.4 branch?

@madhusudancs
Contributor

@jessfraz are we not doing automated cherry-picks any more?

k8s-github-robot pushed a commit that referenced this pull request Oct 6, 2016
#33163-#33227-#33359-#33605-#33967-#33977-#34158-origin-release-1.4

Automatic merge from submit-queue

Automated cherry pick of #32914 #33163 #33227 #33359 #33605 #33967 #33977 #34158 origin release 1.4

Cherry pick of #32914 #33163 #33227 #33359 #33605 #33967 #33977 #34158 on release-1.4.

#32914: Limit the number of names per image reported in the node
#33163: fix the appending bug
#33227: remove cpu limits for dns pod. The current limits are not
#33359: Fix goroutine leak in federation service controller
#33605: Add periodic ingress reconciliations.
#33967: scheduler: cache.delete deletes the pod from node specified
#33977: Heal the namespaceless ingresses in federation e2e.
#34158: Add missing argument to log message in federated ingress
@k8s-cherrypick-bot

Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked.

@shashidharatd shashidharatd deleted the federation branch October 19, 2016 13:17
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ck-of-#32914-kubernetes#33163-kubernetes#33227-kubernetes#33359-kubernetes#33605-kubernetes#33967-kubernetes#33977-kubernetes#34158-origin-release-1.4

Labels

cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
release-note Denotes a PR that will be considered when it comes time to generate release notes.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

goroutine leak in federation service controller

7 participants