@haircommander
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

We have seen cases where `$runtime state` calls fail spuriously but succeed later.
This is not great, but we shouldn't incorrectly label pods when it happens.

We now retry state calls up to three times if we determine the container is still running (by calling kill on its pid)
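The retry described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `retryState`, `maxStateAttempts`, `errSpurious`, and both callbacks are hypothetical names, with `queryState` standing in for the `$runtime state` call and `isRunning` for the liveness check on the container's pid.

```go
package main

import (
	"errors"
	"fmt"
)

// maxStateAttempts mirrors the "up to three times" limit from the description.
const maxStateAttempts = 3

// errSpurious is a hypothetical placeholder for a transient state failure.
var errSpurious = errors.New("spurious state failure")

// retryState calls queryState until it succeeds, the attempts run out, or
// isRunning reports that the container process is gone. Both callbacks are
// stand-ins for illustration, not CRI-O functions.
func retryState(queryState func() ([]byte, error), isRunning func() bool) ([]byte, error) {
	var lastErr error
	for attempt := 1; attempt <= maxStateAttempts; attempt++ {
		out, err := queryState()
		if err == nil {
			return out, nil
		}
		lastErr = fmt.Errorf("state attempt %d of %d: %w", attempt, maxStateAttempts, err)
		if !isRunning() {
			// The process is gone, so the failure is probably genuine.
			break
		}
	}
	return nil, lastErr
}

func main() {
	calls := 0
	flaky := func() ([]byte, error) {
		calls++
		if calls < maxStateAttempts {
			return nil, errSpurious
		}
		return []byte(`{"status":"running"}`), nil
	}
	out, err := retryState(flaky, func() bool { return true })
	fmt.Println(string(out), err, calls)
}
```

Returning the last error explicitly means the caller does not have to infer failure from a nil output.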

also carry #3853

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


logging "Exec'd..." for every exec sync is too much, especially on Info
if we fail to decode state, we should log why the `$runtime state` command failed

Signed-off-by: Peter Hunt <[email protected]>
@openshift-ci-robot openshift-ci-robot added the dco-signoff: yes Indicates the PR's author has DCO signed all their commits. label Jun 10, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2020
@codecov

codecov bot commented Jun 10, 2020

Codecov Report

Merging #3866 into release-1.18 will decrease coverage by 0.03%.
The diff coverage is 22.58%.

@@               Coverage Diff                @@
##           release-1.18    #3866      +/-   ##
================================================
- Coverage         40.47%   40.43%   -0.04%     
================================================
  Files               105      105              
  Lines              8636     8660      +24     
================================================
+ Hits               3495     3502       +7     
- Misses             4829     4845      +16     
- Partials            312      313       +1     

return false
}

out, err := exec.Command("kill", "-0", strconv.Itoa(pid)).CombinedOutput()
Member

Let's make a syscall instead of spawning here.

Member Author

fixed
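For reference, a direct syscall version of the `kill -0` check might look like the sketch below. It assumes a Unix target; `pidAlive` is a hypothetical name, and treating `EPERM` as "alive" is a choice made for this illustration rather than something taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// pidAlive reports whether a process with the given pid exists, using
// kill(pid, 0): signal 0 runs the existence and permission checks
// without delivering a signal. Unix only.
func pidAlive(pid int) bool {
	if pid <= 0 {
		// Zero and negative values address process groups, not one process.
		return false
	}
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but we are not allowed to signal it.
	return err == nil || errors.Is(err, syscall.EPERM)
}

func main() {
	// The current process is certainly alive.
	fmt.Println(pidAlive(os.Getpid()))
}
```

This avoids forking a `kill` binary for every check, which is the cost the review comment is pointing at.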

@haircommander
Member Author

/retest

@haircommander
Member Author


# time="2020-06-10 19:57:30.837507507Z" level=debug msg="Request: &ListContainersRequest{Filter:&ContainerFilter{Id:,State:&ContainerStateValue{State:CONTAINER_RUNNING,},PodSandboxId:,LabelSelector:map[string]string{},},}" file="go-grpc-middleware/chain.go:25" id=eab6ca6e-7c6b-458b-b269-24b0e6513979 name=/runtime.v1alpha2.RuntimeService/ListContainers

/retest

if err != nil {
logrus.Errorf("%v: attempt %d", err, i)
}
if out != nil {
Contributor

Should this be in the else block of the above if statement ?

Member Author

fixed the order of these

@haircommander
Member Author

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support readOnly directory specified in the volumeMount
flaky test perhaps?
/test e2e_cgroupv2

We have seen cases where `$runtime state` calls fail spuriously but succeed later.
This is not great, but we shouldn't incorrectly label pods when it happens.

We now retry state calls up to three times if we determine the container is still running (by calling kill on its pid)

Signed-off-by: Peter Hunt <[email protected]>
// went away we do not error out stopping kubernetes to recover.
// We always populate the fields below so kube can restart/reschedule
// containers failing.
if out == nil {
Contributor

Should err be checked here ? The loop above may exit after the 3rd attempt and by that time, err may still be non-nil.

Member Author

I guess the assumption is that if out is nil, then err is populated. We only need to check one of them, so which one doesn't really matter.

@haircommander
Member Author

/retest

// it is used to check a container state when we don't want (or don't trust) a `$runtime state` call
func (c *Container) IsRunning() bool {
pid := c.state.Pid
process, err := os.FindProcess(pid)
Member

Is this necessary? We could directly make the syscall since we already know the pid.

Member Author

I suppose not, though I personally would rather use the standard library than call syscall directly. I am worried about having to handle things that the library takes care of (this isn't a great example, but as of Go 1.14, longer syscalls need EINTR handling: https://golang.org/doc/go1.14#runtime). My inclination is to leave it as is.

Member

os.FindProcess(pid) doesn't add any extra cost on unix as it just populates a struct with the specified PID.
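The standard-library route discussed in this thread can be sketched as below. This is an illustration under assumptions, not the PR's `IsRunning` method: `isRunning` is a hypothetical free function taking a bare pid, it targets Unix only, and counting a permission error as "running" is a choice made for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isRunning is a standalone sketch of an IsRunning-style check for a bare pid.
func isRunning(pid int) bool {
	if pid <= 0 {
		return false
	}
	// Per the os package docs, FindProcess always succeeds on Unix,
	// whether or not the process exists, so the error check is defensive.
	process, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	defer process.Release()
	// Signal 0 asks the kernel whether the pid can be signalled
	// without actually delivering anything.
	err = process.Signal(syscall.Signal(0))
	// A permission error still proves that the process exists.
	return err == nil || errors.Is(err, os.ErrPermission)
}

func main() {
	fmt.Println(isRunning(os.Getpid()))
}
```

Compared with calling `syscall.Kill` directly, this leaves any platform-specific handling to the `os` package, which is the trade-off the author describes.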

@haircommander
Member Author

/test e2e-aws

if err == nil {
break
}
logrus.Errorf("%v: attempt %d", err, i)
Member

could we add more context to this error?

Member Author

The context comes from the called function:
"failed to update container state for %s: %v", c.id, err
I couldn't think of anything else to add out here that wouldn't be redundant. Is there anything you'd like to see added?
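To make the layering concrete, here is a small sketch of how the callee's context and the caller's attempt counter combine into one log line. The names `updateState`, `attemptMessage`, and `errStateFailed` are hypothetical, the standard `log` package stands in for logrus, and `%w` is used where the quoted format string uses `%v`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errStateFailed is a hypothetical low-level failure from the runtime.
var errStateFailed = errors.New("exit status 1")

// updateState is a hypothetical callee that wraps the low-level error
// with the container ID, in the style of the message quoted above.
func updateState(id string, runtimeState func() error) error {
	if err := runtimeState(); err != nil {
		return fmt.Errorf("failed to update container state for %s: %w", id, err)
	}
	return nil
}

// attemptMessage formats what the caller logs: the callee's error,
// followed by the attempt number.
func attemptMessage(err error, attempt int) string {
	return fmt.Sprintf("%v: attempt %d", err, attempt)
}

func main() {
	err := updateState("abc123", func() error { return errStateFailed })
	if err != nil {
		log.Print(attemptMessage(err, 1))
	}
}
```

With this split, the container ID appears once, from the callee, and the caller contributes only the attempt number.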

@haircommander
Member Author

/hold
entirely possible we can get away without this

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2020
@haircommander
Member Author

we have identified the underlying issue in runc. it's best not to hide flakes like this pr does.

@haircommander haircommander deleted the log-updates-1.18-2 branch September 27, 2021 16:54