@haircommander
Member

Now that exec sync requests are run by conmon, there are more processes in the mix and more opportunities for zombies.

This commit adds a zombie monitor to the defunct process metrics collection flow. It is a little clunky, but it would be weird
to have two different /proc parsers for very similar uses.

Signed-off-by: Peter Hunt [email protected]

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add a zombie monitor to clean up CRI-O's defunct children

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 27, 2021
@openshift-ci openshift-ci bot requested a review from rhatdan August 27, 2021 16:28
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2021
@codecov

codecov bot commented Aug 27, 2021

Codecov Report

Merging #5260 (d2d5608) into master (35d1be8) will increase coverage by 0.09%.
The diff coverage is 67.92%.

❗ Current head d2d5608 differs from pull request most recent head 4bdcabb. Consider uploading reports for the commit 4bdcabb to get more accurate results

@@            Coverage Diff             @@
##           master    #5260      +/-   ##
==========================================
+ Coverage   43.76%   43.86%   +0.09%     
==========================================
  Files         118      119       +1     
  Lines       11721    11757      +36     
==========================================
+ Hits         5130     5157      +27     
- Misses       6104     6112       +8     
- Partials      487      488       +1     

// ParseDefunctProcessesForPath retrieves the number of zombie processes from
// a specific process filesystem, as well as the number of defunct children of the current running process.
func ParseDefunctProcessesForPathAndParent(path string, parent int) (defunctCount uint, defunctChildren []int, retErr error) {
defunctChildren = make([]int, 0)
Member

This line seems not required

Suggested change
defunctChildren = make([]int, 0)

// It fails if one has already been created.
func InitializeZombieMonitor() error {
if monitor != nil {
return errors.New("zombie monitor already initialized")
Member

I know this will probably not happen, but we could return the existing monitor here.
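
A minimal sketch of the "return the existing monitor" idea, assuming a package-level singleton guarded by `sync.Once`. The names `ZombieMonitor`, `monitorOnce`, and `GetOrCreateZombieMonitor` are hypothetical and not taken from the PR; the PR's own initializer returns an error instead.

```go
package main

import (
	"fmt"
	"sync"
)

// ZombieMonitor is a hypothetical stand-in for the monitor type in the PR.
type ZombieMonitor struct {
	closeChan chan struct{}
}

var (
	monitor     *ZombieMonitor
	monitorOnce sync.Once
)

// GetOrCreateZombieMonitor returns the process-wide monitor, creating it on
// the first call. Later calls return the same instance instead of failing.
func GetOrCreateZombieMonitor() *ZombieMonitor {
	monitorOnce.Do(func() {
		monitor = &ZombieMonitor{closeChan: make(chan struct{})}
	})
	return monitor
}

func main() {
	a := GetOrCreateZombieMonitor()
	b := GetOrCreateZombieMonitor()
	fmt.Println(a == b) // both calls yield the same pointer
}
```

With this shape, callers never need to handle an "already initialized" error, at the cost of hiding accidental double initialization.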

}
logrus.Warn(err)
return 0
return float64(process.TotalDefunctProcesses())
Member

This makes the "return the current state of the system" call time-wise dependent on the zombie monitor (runs every 5 seconds for now). I'm not sure if we wanna have this coupling, how about separating both?

Member Author

yeah I was unsure but didn't want to duplicate work. I will split it up, which will allow us to clean up zombies less frequently

Comment on lines 54 to 58
count, zombieChildren, err := ParseDefunctProcesses()
if err != nil {
log.Warnf(context.Background(), "Failed to get defunct process information: %v", err)
}
atomic.StoreUint32(&zm.zombieCount, uint32(count))
Member

Should this run after the case <-time.After(time.Second * 5): below?

Member Author

dropped now

case <-zm.closeChan:
// Since the process will soon shutdown, and its children will be reparented, no need to delay the shutdown to cleanup.
return
case <-time.After(time.Second * 5):
Member

Is this too often? How about giving the whole cleanup more time, e.g. 15 minutes?

Member Author

how about 5 minutes?

Member

Sounds good

}
for _, child := range zombieChildren {
if _, err := syscall.Wait4(child, nil, syscall.WNOHANG, nil); err != nil {
log.Errorf(context.Background(), "Failed to reap child process %d: %v", child, err)
Member

Let's use logrus if we do not have a context to add.

Member Author

done!

@haircommander
Member Author

comments addressed!

Member

@saschagrunert left a comment

LGTM

@TomSweeneyRedHat
Contributor

LGTM
but a few tests need some help

@nee1esh

nee1esh commented Sep 13, 2021

/retest-required

// a specific process filesystem.
func DefunctProcessesForPath(path string) (defunctCount uint, retErr error) {
// ParseDefunctProcessesForPath retrieves the number of zombie processes from
// a specific process filesystem, as well as the number of defunct children of the current running process.
Collaborator

s/of the current running process/of a given parent/?

Comment on lines 99 to 100
ppidBegin := stateByte + 2
endOfPPid := strings.IndexRune(data[ppidBegin:], ' ')
Collaborator

@kolyshkin kolyshkin Sep 13, 2021

Maybe ppidBegin / ppidEnd?

// can have parentheses, so look for the last ')'.
i := strings.LastIndexByte(data, ')')
if i <= 2 || i >= len(data)-1 {
endOfComm := strings.LastIndexByte(data, ')')
Collaborator

commEnd?


// the fourth field is PPid, and we can start looking after the space after State
ppidBegin := stateByte + 2
endOfPPid := strings.IndexRune(data[ppidBegin:], ' ')
Collaborator

Please use IndexByte here.

return nil, errors.Errorf("invalid stat data (no comm): %q", data)
}

stateByte := endOfComm + 2
Collaborator

stateIdx?
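
Pulling the parsing suggestions in this review together, a sketch of the State and PPid extraction might look like the following. It is not the PR's final code: `parseStateAndPPid` is a hypothetical function, it uses `fmt.Errorf` from the standard library instead of `errors.Errorf`, and it adopts the reviewer's proposed names (`commEnd`, `stateIdx`, `ppidBegin`) and `IndexByte`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStateAndPPid extracts the State (third field) and PPid (fourth field)
// from the contents of /proc/[pid]/stat.
func parseStateAndPPid(data string) (state byte, ppid int, err error) {
	// The comm field is wrapped in parentheses and may itself contain
	// spaces and parentheses, so look for the last ')'.
	commEnd := strings.LastIndexByte(data, ')')
	if commEnd <= 2 || commEnd >= len(data)-1 {
		return 0, 0, fmt.Errorf("invalid stat data (no comm): %q", data)
	}

	// State is a single byte after ") "; it must be followed by a space
	// and at least one more byte for PPid.
	stateIdx := commEnd + 2
	if stateIdx+2 >= len(data) || data[stateIdx+1] != ' ' {
		return 0, 0, fmt.Errorf("invalid stat data (no state or ppid): %q", data)
	}

	// PPid starts after the space that follows State and runs to the next
	// space, or to the end of the data if there is none.
	ppidBegin := stateIdx + 2
	ppidLen := strings.IndexByte(data[ppidBegin:], ' ')
	if ppidLen < 0 {
		ppidLen = len(data) - ppidBegin
	}
	ppid, err = strconv.Atoi(data[ppidBegin : ppidBegin+ppidLen])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid stat data (bad ppid): %w", err)
	}
	return data[stateIdx], ppid, nil
}

func main() {
	// Made-up sample line with parentheses inside comm.
	state, ppid, err := parseStateAndPPid("42 (my (odd) name) Z 7 42 42 0 -1")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("state=%c ppid=%d\n", state, ppid)
}
```

A process counts as defunct when the returned state is `Z`, and as a defunct child of a given parent when the returned PPid also matches that parent.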

Comment on lines 12 to 13
// 1: it is responsible for reporting the number of defunct PIDs currently on the node
// 2: it is responsible for cleaning up zombies that are children of the currently running process
Collaborator

s/it is responsible for//

@kolyshkin
Collaborator

@haircommander I think I missed the discussion on where those zombies are coming from, can you point me out to any details on that?

@haircommander
Member Author

> @haircommander I think I missed the discussion on where those zombies are coming from, can you point me out to any details on that?

we've had a number of reports that there are conmon zombies that are leaking (https://bugzilla.redhat.com/show_bug.cgi?id=1952137 is one). We initially thought it was a bug in the exec implementation, but I am now thinking it was actually leaks that were closed in #5283.

Part of me is less convinced we need this anymore, though it may not be a bad idea to sweep for them periodically

@openshift-ci
Contributor

openshift-ci bot commented Sep 15, 2021

@haircommander: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/integration_crun_cgroupv2 | 4bdcabb | link | false |
| ci/openshift-jenkins/integration_crun | 4bdcabb | link | true |
| ci/openshift-jenkins/e2e_crun_cgroupv2 | 4bdcabb | link | false |
| ci/prow/e2e-gcp | 4bdcabb | link | /test e2e-gcp |


// Stat represents status information of a process from /proc/[pid]/stat.
type Stat struct {
// Pid is the PID of the process
Collaborator

Nit: missing period at the end of sentence (feel free to ignore).

// State is the state of the process.
State string

// PPid is the parent PID of the process
Collaborator

ditto

Collaborator

@kolyshkin kolyshkin left a comment

Looks good overall, I'm just not sure we need a zombie reaper if there are no bugs in the code (and if they are, they need to be fixed).

@openshift-ci
Contributor

openshift-ci bot commented Sep 16, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kolyshkin, saschagrunert

Details Needs approval from an approver in each of these files:
  • OWNERS [haircommander,kolyshkin,saschagrunert]

@haircommander
Member Author

haircommander commented Sep 16, 2021

yeah I am leaning with you as well, especially because we've recently found a leak which is possibly the cause of some of the cases we've found. Closing for now
