wip: try to get openshift ci suite to pass #4582

haircommander · 2021-02-17T15:39:33Z

What type of PR is this?

/kind api-change
/kind bug
/kind ci
/kind cleanup
/kind dependency-change
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake
/kind other

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Signed-off-by: Urvashi Mohnani <[email protected]>

[1.20] Point to k8s release-1.20 branch for tests

so we don't leak a goroutine Signed-off-by: Peter Hunt <[email protected]>

…pick-4412-to-release-1.20 [release-1.20] image pull: close progress chan

Signed-off-by: Urvashi Mohnani <[email protected]>

Signed-off-by: Skyler Clark <[email protected]>

This saves some time (a few seconds per test at least) as we avoid running setup_test (and cleanup_test has nothing to clean up), and removes some code duplication. Signed-off-by: Kir Kolyshkin <[email protected]>

...and add a status check to one case where we use run, to make it more obvious that `run` is really needed here. Signed-off-by: Kir Kolyshkin <[email protected]>

Using /dev/loop-control is problematic since it is not supposed to be read from or written to. Use /dev/kmsg, and actually enable the write test. Signed-off-by: Kir Kolyshkin <[email protected]>

…pick-4402-to-release-1.20 [release-1.20] convert shmsize annotation to handler_allowed

…pick-4369-to-release-1.20 [release-1.20] test/devices.bats: nits

[1.20] Bump version to v1.20.0-rc.1

which changed after a k8s release-notes package bump Signed-off-by: Peter Hunt <[email protected]>

This test case sometimes fails like this:

> not ok 40 ctr execsync
> # (in test file ./ctr.bats, line 429)
> # `[[ "$output" == *"command timed out"* ]]' failed
> .....
> # time="2020-12-02T23:28:58Z" level=fatal msg="connect: connect endpoint 'unix:///tmp/tmp.ncaFI2tZzn/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"

This happens because our CI might be slow and this 1 second timeout also applies to connect. Increase the timeout and the sleep accordingly to fix the flake.

Signed-off-by: Kir Kolyshkin <[email protected]>

[1.20] release-notes: fix flags

…pick-4411-to-release-1.20 [release-1.20] test/ctr.bats: fix a "ctr execsync" flake

ResourceCache is a structure that keeps track of partially created Pods and Containers. Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

Previously, when a client's request for a RunPodSandbox or ContainerCreate timed out, CRI-O would clean up the resource. However, these requests usually fail when the node is under load. In these cases, it would be better to hold onto the progress rather than discard it.

This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run. When a duplicate name is detected, before erroring, the server checks the ResourceCache to see if we've already successfully created that resource. If so, we return it as if we'd just created it.

It also moves the SetCreated call to after the resource is deemed not to have timed out.

Hopefully, this reduces the load on already overloaded nodes.

Signed-off-by: Peter Hunt <[email protected]>

Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved. This is bad UX, and we're capable of improving it.

Add a watcher idiom to the resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to wait for a resource to become available.

The key point is that if the resource becomes available while we're watching for it, *we still need to error on this request*. Otherwise we could get the resource from the cache and remove it (meaning it won't be cleaned up), the kubelet's request could time out, and it could try again. This would cause us to leak a resource.

This way, if we get into this situation, there need to be three requests:
- the first, which times out
- the second, which discovers the resource is ready, but still errors
- the third, which actually retrieves that resource and returns it

This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).

Signed-off-by: Peter Hunt <[email protected]>

Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow us to reuse even the longest of CNI creation times. However, this leads to the chance that the CNI plugin runs forever, which is not ideal. Instead, give the sandbox network creation 5 minutes (a minute more than the full request), to improve the odds that we have a completed sandbox that can be reused, rather than thrown away. Signed-off-by: Peter Hunt <[email protected]>

timeout.bats is a test suite that tests different scenarios regarding timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option. Signed-off-by: Peter Hunt <[email protected]>

An older version of this code used to have a goroutine for each resource, which is no longer the case, so remove the obsolete part of the doc. How a resource becomes stale and is removed is already described elsewhere. Signed-off-by: Kir Kolyshkin <[email protected]>

Signed-off-by: Kir Kolyshkin <[email protected]>

The 10s timeout is sometimes not enough to finish container or pod creation. Increase it to 30s to fix occasional flakes, and move the wait to a separate function, wait_crio.

While at it:
- increase conmon sleep and crictl create/runp cancel timeout to 3s;
- move create_conmon to setup;
- fix ID checks (we're looking for a string, not a substring);
- change a 3m timeout to 150s.

Not critical, just nits.

Signed-off-by: Kir Kolyshkin <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

[1.20] improve timeout handling and fix flakes

Signed-off-by: Peter Hunt <[email protected]>

bump to v1.20.0

This is mostly to have containers/storage@83150e3 which should fix the authentication failure during bootstrap node install. BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1907770 Signed-off-by: Qi Wang <[email protected]>

[release-1.20] Bump containers image to v5.10.1

…1.20 [1.20] ResourceStore: fix segfault and update tests

as well as remove any file that exists in that spot, and create an nsmgr package to mirror the release-1.21 structure (only holding the NSType right now). Signed-off-by: Peter Hunt <[email protected]>

[1.20] config: pre-create pinns directories

We want systemd to only send SIGTERM to the main process on stop and let that process forward the signal. Today, we have seen issues where the main process forwards the signal in addition to systemd sending it directly, leading to failures in graceful shutdown. Signed-off-by: Mrunal Patel <[email protected]>

[1.20] Set systemd property KillMode to mixed for containers

conmon forwards the SIGTERM to the main container process. This is undesirable during node shutdown as it results in the container receiving 2 SIGTERMs. With this change, conmon ignores the SIGPIPE and will exit after the main container process dies. If it doesn't for some reason, then it gets SIGKILL from systemd after 90 seconds. We need a follow-on to adjust the 90s to a value higher than the termination grace period. This doesn't matter today as we don't retain container state on reboot. Signed-off-by: Mrunal Patel <[email protected]>

[1.20] Set conmon scope KillSignal to SIGPIPE

This reverts commit 8164338. Signed-off-by: Mrunal Patel <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

[1.20] Revert "Set systemd property KillMode to mixed for containers"

Signed-off-by: Mrunal Patel <[email protected]>

[1.20] bump protobuf to 1.3.2

[1.20] Log container stop timeout

openshift-ci-robot · 2021-02-17T15:39:34Z

@haircommander: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2021-02-17T15:39:36Z

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

98b2d90 Merge pull request [1.20] Point to k8s release-1.20 branch for tests #4410 from umohnani8/1.20
9cfbe39 Merge pull request [release-1.20] image pull: close progress chan #4413 from openshift-cherrypick-robot/cherry-pick-4412-to-release-1.20
20bf416 Merge pull request [release-1.20] convert shmsize annotation to handler_allowed #4417 from openshift-cherrypick-robot/cherry-pick-4402-to-release-1.20
0551b1c Merge pull request [release-1.20] test/devices.bats: nits #4418 from openshift-cherrypick-robot/cherry-pick-4369-to-release-1.20
9f324ed Merge pull request [1.20] Bump version to v1.20.0-rc.1 #4416 from umohnani8/1.20
dd82db6 Merge pull request [1.20] release-notes: fix flags #4426 from haircommander/release-notes-fix-1.20
209938c Merge pull request [release-1.20] test/ctr.bats: fix a "ctr execsync" flake #4427 from openshift-cherrypick-robot/cherry-pick-4411-to-release-1.20
62afeec Merge pull request [1.20] improve timeout handling and fix flakes #4430 from haircommander/handle-timeout-1.20
d388528 Merge pull request bump to v1.20.0 #4433 from haircommander/bump-1.20.0
845747f Merge pull request [release-1.20] update containers/storage to v1.24.4 #4462 from QiWang19/release-1.20
82b3148 Merge pull request [1.20] Provide functionality to start infra containers on the specified set of CPUs #4469 from haircommander/provide_infra_cpu_flags-1.20
d9f17c8 Merge pull request add regitries deprecation notice #4477 from vrothberg/1.20-registries-deprecation-notice
a1ab08a Merge pull request [release-1.20] runtime_vm: set finished time when containers stop #4496 from openshift-cherrypick-robot/cherry-pick-4468-to-release-1.20
ce4f759 Merge pull request [1.20] Support pprof profile over unix socket #4520 from mrunalp/profile_unix_socket_1.20
e838816 Merge pull request [release-1.20] Bump containers image to v5.10.1 #4531 from QiWang19/update1.20-c/image
4a37b4e Merge pull request [1.20] ResourceStore: fix segfault and update tests #4535 from haircommander/resourcestore-fixes-1.20
18f4ba0 Merge pull request [1.20] config: pre-create pinns directories #4538 from haircommander/pinns-create-dir-1.20
2ef6415 Merge pull request [1.20] Set systemd property KillMode to mixed for containers #4539 from mrunalp/systemd_killmode_mixed
78527db Merge pull request [1.20] Set conmon scope KillSignal to SIGPIPE #4546 from mrunalp/conmon_killsignal_pipe
4de839c Merge pull request [1.20] Revert "Set systemd property KillMode to mixed for containers" #4547 from mrunalp/revert_mixed_mode
8921e00 Merge pull request [1.20] bump protobuf to 1.3.2 #4553 from haircommander/proto-bump-1.20
fdbdf43 Merge pull request [1.20] Log container stop timeout #4554 from mrunalp/log_stop_timeout_1.20

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci-robot · 2021-02-17T15:39:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2021-02-17T15:39:40Z

@haircommander: PR needs rebase.

codecov · 2021-02-17T15:41:54Z

Codecov Report

Merging #4582 (fdbdf43) into master (b03c34a) will increase coverage by 0.13%.
The diff coverage is 48.61%.

@@            Coverage Diff             @@
##           master    #4582      +/-   ##
==========================================
+ Coverage   40.49%   40.62%   +0.13%     
==========================================
  Files         116      117       +1     
  Lines        9329     9449     +120     
==========================================
+ Hits         3778     3839      +61     
- Misses       5125     5177      +52     
- Partials      426      433       +7

openshift-ci · 2021-02-17T15:42:54Z

@haircommander: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-agnostic	`fdbdf43`	link	`/test e2e-agnostic`
ci/prow/images	`fdbdf43`	link	`/test images`
ci/prow/e2e-gcp	`fdbdf43`	link	`/test e2e-gcp`
ci/openshift-jenkins/e2e_features_rhel	`fdbdf43`	link	`/test e2e_features_rhel`
ci/openshift-jenkins/integration_rhel	`fdbdf43`	link	`/test integration_rhel`
ci/openshift-jenkins/e2e_features_fedora	`fdbdf43`	link	`/test e2e_features_fedora`
ci/openshift-jenkins/critest_rhel	`fdbdf43`	link	`/test critest_rhel`
ci/openshift-jenkins/e2e_crun	`fdbdf43`	link	`/test e2e_crun`
ci/openshift-jenkins/e2e_rhel	`fdbdf43`	link	`/test e2e_rhel`
ci/openshift-jenkins/critest_fedora	`fdbdf43`	link	`/test critest_fedora`
ci/openshift-jenkins/integration_crun_cgroupv2	`fdbdf43`	link	`/test integration_cgroupv2`
ci/openshift-jenkins/e2e_fedora	`fdbdf43`	link	`/test e2e_fedora`
ci/openshift-jenkins/e2e_crun_cgroupv2	`fdbdf43`	link	`/test e2e_cgroupv2`
ci/openshift-jenkins/integration_crun	`fdbdf43`	link	`/test integration_crun`
ci/openshift-jenkins/integration_fedora	`fdbdf43`	link	`/test integration_fedora`

Full PR test history. Your PR dashboard.

umohnani8 and others added 30 commits December 7, 2020 13:55

Point to k8s release-1.20 branch for tests

3278fc8

Signed-off-by: Urvashi Mohnani <[email protected]>

Merge pull request cri-o#4410 from umohnani8/1.20

98b2d90

[1.20] Point to k8s release-1.20 branch for tests

image pull: close progress chan

13368a5

so we don't leak a goroutine Signed-off-by: Peter Hunt <[email protected]>

Merge pull request cri-o#4413 from openshift-cherrypick-robot/cherry-…

9cfbe39

…pick-4412-to-release-1.20 [release-1.20] image pull: close progress chan

Bump version to v1.20.0-rc.1

9e0c091

Signed-off-by: Urvashi Mohnani <[email protected]>

moves shmsize to a handler allowed annotation

1650a5e

Signed-off-by: Skyler Clark <[email protected]>

test/devices.bats: skip earlier

4a09622

This saves some time (a few seconds per test at least) as we avoid running setup_test (and cleanup_test has nothing to clean up), and removes some code duplication. Signed-off-by: Kir Kolyshkin <[email protected]>

test/devices.bats: rm unneeded run

89f932f

...and add a status check to one case where we use run, to make it more obvious that `run` is really needed here. Signed-off-by: Kir Kolyshkin <[email protected]>

test/devices.bats: fix "additional device permissions" case

9550c39

Using /dev/loop-control is problematic since it is not supposed to be read from or written to. Use /dev/kmsg, and actually enable the write test. Signed-off-by: Kir Kolyshkin <[email protected]>

Merge pull request cri-o#4417 from openshift-cherrypick-robot/cherry-…

20bf416

…pick-4402-to-release-1.20 [release-1.20] convert shmsize annotation to handler_allowed

Merge pull request cri-o#4418 from openshift-cherrypick-robot/cherry-…

0551b1c

…pick-4369-to-release-1.20 [release-1.20] test/devices.bats: nits

Merge pull request cri-o#4416 from umohnani8/1.20

9f324ed

[1.20] Bump version to v1.20.0-rc.1

release-notes: fix flags

d38d767

which changed after a k8s release-notes package bump Signed-off-by: Peter Hunt <[email protected]>

Merge pull request cri-o#4426 from haircommander/release-notes-fix-1.20

dd82db6

[1.20] release-notes: fix flags

Merge pull request cri-o#4427 from openshift-cherrypick-robot/cherry-…

209938c

…pick-4411-to-release-1.20 [release-1.20] test/ctr.bats: fix a "ctr execsync" flake

Add unit tests for ResourceCache

e62cb75

Signed-off-by: Peter Hunt <[email protected]>

test: add timeout.bats

7853a1b

timeout.bats is a test suite that tests different scenarios regarding timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option. Signed-off-by: Peter Hunt <[email protected]>

test/timeout.bats: fix comments

7ed9d89

Signed-off-by: Kir Kolyshkin <[email protected]>

fix docs

e3dc7e7

Signed-off-by: Peter Hunt <[email protected]>

Merge pull request cri-o#4430 from haircommander/handle-timeout-1.20

62afeec

[1.20] improve timeout handling and fix flakes

bump to v1.20.0

1deae4b

Signed-off-by: Peter Hunt <[email protected]>

Merge pull request cri-o#4433 from haircommander/bump-1.20.0

d388528

bump to v1.20.0

update containers/storage to v1.24.4

455492d

This is mostly to have containers/storage@83150e3 which should fix the authentication failure during bootstrap node install. BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1907770 Signed-off-by: Qi Wang <[email protected]>

openshift-merge-robot and others added 14 commits February 2, 2021 15:49

Merge pull request cri-o#4531 from QiWang19/update1.20-c/image

e838816

[release-1.20] Bump containers image to v5.10.1

Merge pull request cri-o#4535 from haircommander/resourcestore-fixes-…

4a37b4e

…1.20 [1.20] ResourceStore: fix segfault and update tests

config: pre-create pinns directories

69c9211

as well as remove any file that exists in that spot, and create an nsmgr package to mirror the release-1.21 structure (only holding the NSType right now). Signed-off-by: Peter Hunt <[email protected]>

Merge pull request cri-o#4538 from haircommander/pinns-create-dir-1.20

18f4ba0

[1.20] config: pre-create pinns directories

Merge pull request cri-o#4539 from mrunalp/systemd_killmode_mixed

2ef6415

[1.20] Set systemd property KillMode to mixed for containers

Merge pull request cri-o#4546 from mrunalp/conmon_killsignal_pipe

78527db

[1.20] Set conmon scope KillSignal to SIGPIPE

Revert "Set systemd property KillMode to mixed for containers"

2c9c713

This reverts commit 8164338. Signed-off-by: Mrunal Patel <[email protected]>

bump protobuf to 1.3.2

c15ec58

Signed-off-by: Peter Hunt <[email protected]>

Merge pull request cri-o#4547 from mrunalp/revert_mixed_mode

4de839c

[1.20] Revert "Set systemd property KillMode to mixed for containers"

Log container stop timeout

37a8420

Signed-off-by: Mrunal Patel <[email protected]>

Merge pull request cri-o#4553 from haircommander/proto-bump-1.20

8921e00

[1.20] bump protobuf to 1.3.2

Merge pull request cri-o#4554 from mrunalp/log_stop_timeout_1.20

fdbdf43

[1.20] Log container stop timeout

haircommander requested review from mrunalp and runcom as code owners February 17, 2021 15:39

openshift-ci-robot requested review from sboeuf and umohnani8 February 17, 2021 15:39

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2021

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2021

haircommander closed this Feb 17, 2021
