internal/process: add functionality to clean up zombie children #5260
ec664c1 to
dc74324
Compare
Codecov Report
@@ Coverage Diff @@
## master #5260 +/- ##
==========================================
+ Coverage 43.76% 43.86% +0.09%
==========================================
Files 118 119 +1
Lines 11721 11757 +36
==========================================
+ Hits 5130 5157 +27
- Misses 6104 6112 +8
- Partials 487 488 +1
dc74324 to
5e6311f
Compare
| // ParseDefunctProcessesForPath retrieves the number of zombie processes from | ||
| // a specific process filesystem, as well as the number of defunct children of the current running process. | ||
| func ParseDefunctProcessesForPathAndParent(path string, parent int) (defunctCount uint, defunctChildren []int, retErr error) { | ||
| defunctChildren = make([]int, 0) |
This line seems not required
| defunctChildren = make([]int, 0) |
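A minimal sketch of why the `make` call can be dropped: a named slice result starts out nil, and `append` allocates on first use. The function name `collectEven` is hypothetical and only illustrates the pattern; it is not part of the PR.

```go
package main

import "fmt"

// collectEven returns the even values in nums. The named result evens starts
// as a nil slice; append allocates on first use, so no make call is needed.
func collectEven(nums []int) (evens []int) {
	for _, n := range nums {
		if n%2 == 0 {
			evens = append(evens, n)
		}
	}
	return evens
}

func main() {
	fmt.Println(collectEven([]int{1, 2, 3, 4})) // [2 4]
	fmt.Println(collectEven(nil) == nil)        // true: nothing was appended
}
```

One caveat: a nil slice and an empty slice behave the same for `len`, `range`, and `append`, but `encoding/json` marshals them differently (`null` versus `[]`), so the explicit `make` only matters if the result is serialized that way.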
internal/process/zombie_monitor.go
Outdated
| // It fails if one has already been created. | ||
| func InitializeZombieMonitor() error { | ||
| if monitor != nil { | ||
| return errors.New("zombie monitor already initialized") |
I know this will probably not happen, but we could return the existing monitor here.
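One way the "return the existing monitor" suggestion could look is sketched below, using `sync.Once` so that concurrent callers also get the same instance. The type `zombieMonitor` and the function `getOrCreateMonitor` are hypothetical stand-ins, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// zombieMonitor is a hypothetical stand-in for the PR's monitor type.
type zombieMonitor struct {
	closeChan chan struct{}
}

var (
	monitor     *zombieMonitor
	monitorOnce sync.Once
)

// getOrCreateMonitor creates the process-wide monitor on the first call and
// returns the same instance on every later call instead of failing.
func getOrCreateMonitor() *zombieMonitor {
	monitorOnce.Do(func() {
		monitor = &zombieMonitor{closeChan: make(chan struct{})}
	})
	return monitor
}

func main() {
	fmt.Println(getOrCreateMonitor() == getOrCreateMonitor()) // true
}
```

The trade-off is that a second initialization silently succeeds, which hides accidental double setup; returning an error, as the PR does, makes that mistake visible.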
server/metrics/metrics.go
Outdated
| } | ||
| logrus.Warn(err) | ||
| return 0 | ||
| return float64(process.TotalDefunctProcesses()) |
This makes the "return the current state of the system" call time-wise dependent on the zombie monitor (runs every 5 seconds for now). I'm not sure if we wanna have this coupling, how about separating both?
yeah I was unsure but didn't want to duplicate work. I will split it up, which will allow us to clean up zombies less frequently
internal/process/zombie_monitor.go
Outdated
| count, zombieChildren, err := ParseDefunctProcesses() | ||
| if err != nil { | ||
| log.Warnf(context.Background(), "Failed to get defunct process information: %v", err) | ||
| } | ||
| atomic.StoreUint32(&zm.zombieCount, uint32(count)) |
Should this run after the case <-time.After(time.Second * 5): below?
dropped now
internal/process/zombie_monitor.go
Outdated
| case <-zm.closeChan: | ||
| // Since the process will soon shutdown, and its children will be reparented, no need to delay the shutdown to cleanup. | ||
| return | ||
| case <-time.After(time.Second * 5): |
Is this too often? How about giving the whole cleanup more time, e.g. 15 minutes?
how about 5 minutes?
Sounds good
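For reference, the loop under discussion (wait for either a close signal or the next interval) can be sketched roughly as follows. This is an illustration only: `periodicTask` and `startPeriodicTask` are hypothetical names, and it uses `time.NewTicker` rather than the `time.After` call in the diff, which avoids allocating a new timer on every iteration.

```go
package main

import (
	"fmt"
	"time"
)

// periodicTask runs a work function on a fixed interval until Close is called.
type periodicTask struct {
	closeChan chan struct{}
	done      chan struct{}
}

// startPeriodicTask launches the loop in a goroutine. The interval is a
// parameter, so 5 seconds versus 5 minutes is a caller decision.
func startPeriodicTask(interval time.Duration, work func()) *periodicTask {
	p := &periodicTask{
		closeChan: make(chan struct{}),
		done:      make(chan struct{}),
	}
	go func() {
		defer close(p.done)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-p.closeChan:
				// Shutting down: skip any final cleanup pass.
				return
			case <-ticker.C:
				work()
			}
		}
	}()
	return p
}

// Close stops the loop and waits for the goroutine to exit.
// It must be called at most once.
func (p *periodicTask) Close() {
	close(p.closeChan)
	<-p.done
}

func main() {
	ticks := make(chan struct{}, 16)
	task := startPeriodicTask(10*time.Millisecond, func() {
		select {
		case ticks <- struct{}{}:
		default: // drop the signal if nobody is reading
		}
	})
	<-ticks
	<-ticks
	task.Close()
	fmt.Println("work ran at least twice, then the loop stopped")
}
```

Because the first tick only fires after one full interval, the work does not run at startup, which matches the outcome of the earlier thread where the initial pass before the loop was dropped.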
internal/process/zombie_monitor.go
Outdated
| } | ||
| for _, child := range zombieChildren { | ||
| if _, err := syscall.Wait4(child, nil, syscall.WNOHANG, nil); err != nil { | ||
| log.Errorf(context.Background(), "Failed to reap child process %d: %v", child, err) |
Let's use logrus if we do not have a context to add.
done!
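The reaping step in this thread relies on `wait4` with `WNOHANG`, which collects an exited child without blocking. A minimal sketch of that idea follows, assuming a Unix-like platform where `syscall.Wait4` is available; the helper name `reapChildren` and its return shape are hypothetical, and it uses `fmt` in place of the project's loggers.

```go
package main

import (
	"fmt"
	"syscall"
)

// reapChildren makes one non-blocking wait4 call per PID. It returns the PIDs
// that were reaped and a map of PIDs whose wait4 call failed, for example with
// ECHILD when the PID is not a child of the calling process.
func reapChildren(pids []int) (reaped []int, failed map[int]error) {
	failed = make(map[int]error)
	for _, pid := range pids {
		wpid, err := syscall.Wait4(pid, nil, syscall.WNOHANG, nil)
		switch {
		case err != nil:
			failed[pid] = err
		case wpid == pid:
			reaped = append(reaped, pid)
		}
		// wpid == 0 with a nil error means the child exists but has not
		// exited yet, so there is nothing to reap.
	}
	return reaped, failed
}

func main() {
	// PID 1 is never a child of this program, so wait4 reports an error.
	reaped, failed := reapChildren([]int{1})
	fmt.Println("reaped:", reaped)
	for pid, err := range failed {
		fmt.Printf("failed to reap child process %d: %v\n", pid, err)
	}
}
```

Note that only the direct parent can reap a zombie, which is why the PR filters defunct processes by parent PID before calling `Wait4`.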
5e6311f to
92d80c2
Compare
comments addressed!
saschagrunert
left a comment
LGTM
LGTM
/retest-required
| // a specific process filesystem. | ||
| func DefunctProcessesForPath(path string) (defunctCount uint, retErr error) { | ||
| // ParseDefunctProcessesForPath retrieves the number of zombie processes from | ||
| // a specific process filesystem, as well as the number of defunct children of the current running process. |
s/of the current running process/of a given parent/?
| ppidBegin := stateByte + 2 | ||
| endOfPPid := strings.IndexRune(data[ppidBegin:], ' ') |
Maybe ppidBegin / ppidEnd?
| // can have parentheses, so look for the last ')'. | ||
| i := strings.LastIndexByte(data, ')') | ||
| if i <= 2 || i >= len(data)-1 { | ||
| endOfComm := strings.LastIndexByte(data, ')') |
commEnd?
|
|
||
| // the fourth field is PPid, and we can start looking after the space after State | ||
| ppidBegin := stateByte + 2 | ||
| endOfPPid := strings.IndexRune(data[ppidBegin:], ' ') |
Please use IndexByte here.
| return nil, errors.Errorf("invalid stat data (no comm): %q", data) | ||
| } | ||
|
|
||
| stateByte := endOfComm + 2 |
stateIdx?
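Putting the parsing comments from these threads together, the approach (find the last `)` to end comm, read the state two bytes later, then read PPid up to the next space) can be sketched as below. This is a simplified illustration using the suggested names `commEnd`, `stateIdx`, and `ppidBegin`; the lowercase `stat` type and `parseStat` function are hypothetical and do not reproduce the PR's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stat holds the first four fields of /proc/[pid]/stat.
type stat struct {
	Pid   int
	Comm  string
	State string
	PPid  int
}

// parseStat parses the leading fields of a stat line such as
// "1234 (name) Z 1 ...".
func parseStat(data string) (*stat, error) {
	commBegin := strings.IndexByte(data, '(')
	// comm can itself contain parentheses, so look for the last ')'.
	commEnd := strings.LastIndexByte(data, ')')
	if commBegin < 1 || commEnd < commBegin || commEnd+4 >= len(data) {
		return nil, fmt.Errorf("invalid stat data (no comm): %q", data)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(data[:commBegin]))
	if err != nil {
		return nil, fmt.Errorf("invalid pid in stat data: %w", err)
	}
	// The state is a single byte, one space after the closing ')'.
	stateIdx := commEnd + 2
	// The fourth field is PPid, which starts after the space after State.
	ppidBegin := stateIdx + 2
	// IndexByte returns an offset relative to the slice, so it is a length.
	ppidLen := strings.IndexByte(data[ppidBegin:], ' ')
	if ppidLen < 0 {
		ppidLen = len(data) - ppidBegin
	}
	ppid, err := strconv.Atoi(data[ppidBegin : ppidBegin+ppidLen])
	if err != nil {
		return nil, fmt.Errorf("invalid ppid in stat data: %w", err)
	}
	return &stat{
		Pid:   pid,
		Comm:  data[commBegin+1 : commEnd],
		State: data[stateIdx : stateIdx+1],
		PPid:  ppid,
	}, nil
}

func main() {
	st, err := parseStat("1234 (my (odd) proc) Z 1 1234 1234 0 -1")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", *st)
}
```

A process is defunct when `State` is `Z`, and it is a candidate for reaping by the caller only when `PPid` equals the caller's own PID.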
internal/process/zombie_monitor.go
Outdated
| // 1: it is responsible for reporting the number of defunct PIDs currently on the node | ||
| // 2: it is responsible for cleaning up zombies that are children of the currently running process |
s/it is responsible for//
@haircommander I think I missed the discussion on where those zombies are coming from. Can you point me to any details on that?
we've had a number of reports that there are [...] Part of me is less convinced we need this anymore, though it may not be a bad idea to sweep for them periodically
Now that exec sync requests are run by conmon, there are more processes in the mix and more possibility for zombies.
This commit adds a zombie monitor to the defunct process metrics collection flow. It is a little clunky, but it would be weird to have two different /proc parsers for very similar uses.
Signed-off-by: Peter Hunt <[email protected]>
92d80c2 to
4bdcabb
Compare
|
@haircommander: The following tests failed, say
|
|
||
| // Stat represents status information of a process from /proc/[pid]/stat. | ||
| type Stat struct { | ||
| // Pid is the PID of the process |
Nit: missing period at the end of sentence (feel free to ignore).
| // State is the state of the process. | ||
| State string | ||
|
|
||
| // PPid is the parent PID of the process |
ditto
kolyshkin
left a comment
Looks good overall, I'm just not sure we need a zombie reaper if there are no bugs in the code (and if they are, they need to be fixed).
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: haircommander, kolyshkin, saschagrunert
|
yeah I am leaning with you as well, especially because we've recently found a leak which is possibly the cause of some of the cases we've found. Closing for now
Now that exec sync requests are run by conmon, there are more processes in the mix and more possibility for zombies.
This commit adds a zombie monitor to the defunct process metrics collection flow. It is a little clunky, but it would be weird
to have two different /proc parsers for very similar uses.
Signed-off-by: Peter Hunt [email protected]
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?