[1.18] oci: repeat $runtime state calls if appropriate #3866
Logging "Exec'd..." for every exec sync is too much, especially at Info level.
If we fail to decode state, we should log why the `$runtime state` command failed.
Signed-off-by: Peter Hunt <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED.
This pull request has been approved by: haircommander
Force-pushed from 8dc2348 to e5d62c6.
Codecov Report
@@ Coverage Diff @@
## release-1.18 #3866 +/- ##
================================================
- Coverage 40.47% 40.43% -0.04%
================================================
Files 105 105
Lines 8636 8660 +24
================================================
+ Hits 3495 3502 +7
- Misses 4829 4845 +16
- Partials 312 313 +1
internal/oci/container.go (outdated)
        return false
    }

    out, err := exec.Command("kill", "-0", strconv.Itoa(pid)).CombinedOutput()
Let's make a syscall instead of spawning here.
fixed
Force-pushed from e5d62c6 to c8603ed.
/retest
internal/oci/runtime_oci.go (outdated)
    if err != nil {
        logrus.Errorf("%v: attempt %d", err, i)
    }
    if out != nil {
Should this be in the else block of the above if statement?
fixed the order of these
We have seen cases where `$runtime state` calls fail spuriously but succeed later. This is not great, but we shouldn't incorrectly label pods when it happens.
We now retry state calls up to three times if we determine the container is still running (by calling kill on its pid).
Signed-off-by: Peter Hunt <[email protected]>
Force-pushed from c8603ed to e85e2cc.
    // went away we do not error out stopping kubernetes to recover.
    // We always populate the fields below so kube can restart/reschedule
    // containers failing.
    if out == nil {
Should err be checked here? The loop above may exit after the third attempt, and by that time err may still be non-nil.
I guess the assumption is that if out is nil, then err is populated. We only need to check one of them, and which one doesn't really matter.
/retest
    // it is used to check a container state when we don't want (or don't trust) a `$runtime state` call
    func (c *Container) IsRunning() bool {
        pid := c.state.Pid
        process, err := os.FindProcess(pid)
Is this necessary? We could directly make the syscall since we already know the pid.
I suppose not, though I personally would rather use the standard library than call syscall directly. I am worried about having to handle things that the library takes care of (this isn't a great example, but as of Go 1.14, longer syscalls need EINTR handling: https://golang.org/doc/go1.14#runtime). My inclination is to leave it as is.
os.FindProcess(pid) doesn't add any extra cost on Unix, as it just populates a struct with the specified PID.
/test e2e-aws
    if err == nil {
        break
    }
    logrus.Errorf("%v: attempt %d", err, i)
Could we add more context to this error?
The context comes from the called function:
    "failed to update container state for %s: %v", c.id, err
I couldn't think of anything else to add out here that wouldn't be redundant. Is there anything you'd like to see added?
/hold
We have identified the underlying issue in runc. It's best not to hide flakes like this PR does.
What type of PR is this?
/kind bug
What this PR does / why we need it:
We have seen cases where `$runtime state` calls fail spuriously but succeed later. This is not great, but we shouldn't incorrectly label pods when it happens.
We now retry state calls up to three times if we determine the container is still running (by calling kill on its pid).
also carry #3853
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?