OCPNODE-2482: Enable CRI-O internal repair feature as the default #4519

kwilczynski · 2024-08-09T15:05:05Z

- What I did

For a while now, CRI-O has been able to attempt a repair of the storage directory on start-up following an unclean shutdown, such as a node crash or unexpected restart. This allows CRI-O to recover from storage directory corruption, avoiding potential crashes or termination with a fatal error once it starts back up.

However, this internal repair feature has to be enabled to take effect, so it's currently an opt-in solution. This feature has matured and can be turned on as the new default when deploying OpenShift clusters. As a result, our customers will benefit from improved cluster resilience.

Thus, turn the internal repair feature on as the new default.
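
As a rough illustration of what the opt-in flag changes, here is a minimal Python sketch of the start-up decision. It is a simplified model, not CRI-O's actual Go code: the function name `should_repair` and its parameters are hypothetical, and it assumes that a missing clean shutdown marker (such as /var/lib/crio/clean.shutdown) is what signals an unclean shutdown.

```python
import tempfile
from pathlib import Path


def should_repair(clean_shutdown_file: Path, internal_repair: bool) -> bool:
    """Hypothetical model: repair only if the feature is enabled and the marker is missing."""
    unclean_shutdown = not clean_shutdown_file.exists()
    return internal_repair and unclean_shutdown


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for /var/lib/crio/clean.shutdown (assumed marker location).
        marker = Path(tmp) / "clean.shutdown"

        # Marker missing: treated as an unclean shutdown.
        print(should_repair(marker, internal_repair=False))  # opt-in left off
        print(should_repair(marker, internal_repair=True))   # enabled by default

        # Marker present: clean shutdown, so no repair in either case.
        marker.touch()
        print(should_repair(marker, internal_repair=True))
```

With the feature left as opt-in, the unclean-shutdown case goes unrepaired unless an administrator enables it. Making it the default covers that case without extra configuration.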

Related:

Use custom storage check options for CRI-O internal wipe cri-o/cri-o#8417

make use of c/storage Check() and Repair() functions cri-o/cri-o#7177

- How to verify it

Deploy the updated configuration manually or using the Machine Config Operator.

Then, to verify that the internal repair feature works, follow these steps (assuming the test is performed on an OpenShift cluster with privileged access available):

1. Stop the kubelet and CRI-O services.
2. Navigate to the /var/lib/containers directory.
3. Remove a random layer from any of the available container images (to corrupt the storage directory).
4. Remove the /var/lib/crio/clean.shutdown file to simulate an unclean CRI-O shutdown.
5. Start the CRI-O and kubelet services.
6. Verify that both services are working correctly.
7. Verify that CRI-O ran the check and repair to fix the corrupted storage directory on start-up (see the service logs).
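
The manual steps above could be sketched as a dry-run command plan. This is only an illustration under stated assumptions: the systemd unit names `crio` and `kubelet`, the helper name `verification_plan`, and the layer path placeholder are not taken from this PR, and the script only prints the commands rather than running them.

```python
import shlex
from typing import List

# Marker path as given in the verification steps.
CLEAN_SHUTDOWN_FILE = "/var/lib/crio/clean.shutdown"


def verification_plan(layer_path: str) -> List[List[str]]:
    """Return the commands for the manual test, in order (hypothetical helper)."""
    return [
        ["systemctl", "stop", "kubelet"],
        ["systemctl", "stop", "crio"],
        ["rm", "-rf", layer_path],            # corrupt the storage directory
        ["rm", "-f", CLEAN_SHUTDOWN_FILE],    # simulate an unclean shutdown
        ["systemctl", "start", "crio"],
        ["systemctl", "start", "kubelet"],
        ["systemctl", "is-active", "crio", "kubelet"],
        ["journalctl", "-u", "crio", "--no-pager"],  # look for check/repair messages
    ]


if __name__ == "__main__":
    # Placeholder only: pick a real layer directory under /var/lib/containers.
    placeholder = "/var/lib/containers/storage/overlay/<layer-id>"
    for command in verification_plan(placeholder):
        print(shlex.join(command))
```

Printing the plan instead of executing it keeps the sketch safe to run anywhere. On a disposable test node, the commands would need to be run as root.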

- Description for the changelog

Change the internal repair option to be enabled by default.

openshift-ci-robot · 2024-08-09T15:05:10Z

@kwilczynski: This pull request references OCPNODE-1784 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.

Details

In response to this:

- What I did

For a while now, CRI-O has been able to attempt a repair of the storage directory on start-up following an unclean shutdown, such as a node crash or unexpected restart. This allows CRI-O to recover from storage directory corruption, avoiding potential crashes or termination with a fatal error once it starts back up.

However, this internal repair feature has to be enabled to take effect, so it's currently an opt-in solution. This feature has matured and can be turned on as the new default when deploying OpenShift clusters. As a result, our customers will benefit from improved cluster resilience.

Thus, turn the internal repair feature on as the new default.

Related:

Use custom storage check options for CRI-O internal wipe cri-o/cri-o#8417

make use of c/storage Check() and Repair() functions cri-o/cri-o#7177

- How to verify it

Deploy the updated configuration manually or using the Machine Config Operator.

Then, to verify that the internal repair feature works, follow these steps (assuming the test is performed on an OpenShift cluster with privileged access available):

1. Stop the kubelet and CRI-O services.
2. Navigate to the /var/lib/containers directory.
3. Remove a random layer from any of the available container images (to corrupt the storage directory).
4. Remove the /var/lib/crio/clean.shutdown file to simulate an unclean CRI-O shutdown.
5. Start the CRI-O and kubelet services.
6. Verify that both services are working correctly.
7. Verify that CRI-O ran the check and repair to fix the corrupted storage directory on start-up (see the service logs).

- Description for the changelog
Change the internal repair option to be enabled by default.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

kwilczynski · 2024-08-09T15:05:19Z

/assign kwilczynski

kwilczynski · 2024-08-09T15:06:52Z

/cherry-pick release-4.17

openshift-cherrypick-robot · 2024-08-09T15:06:55Z

@kwilczynski: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kwilczynski · 2024-08-09T15:07:36Z

/approve

haircommander · 2024-08-09T16:22:59Z

/lgtm
/approve

haircommander · 2024-08-09T17:01:44Z

/retest

openshift-ci-robot · 2024-08-09T17:04:13Z

@kwilczynski: This pull request references OCPNODE-1784 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.

openshift-ci-robot · 2024-08-09T17:04:22Z

@kwilczynski: This pull request references OCPNODE-2482 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

kwilczynski · 2024-08-09T17:31:54Z

/retest

haircommander · 2024-08-09T19:31:28Z

/skip

openshift-ci-robot · 2024-08-09T19:50:44Z

/retest-required

Remaining retests: 0 against base HEAD da94f1b and 2 for PR HEAD 1ba88c3 in total

haircommander · 2024-08-09T20:30:55Z

/skip

haircommander · 2024-08-09T22:13:29Z

/skip

haircommander · 2024-08-10T02:08:34Z

/retest

kwilczynski · 2024-08-10T06:52:33Z

/test e2e-aws-ovn-upgrade

haircommander · 2024-08-10T15:34:37Z

/skip
/retest

haircommander · 2024-08-10T22:21:41Z

/retest

kwilczynski · 2024-08-12T08:37:15Z

@cri-o/cri-o-maintainers, please have a look. Thank you!

sohankunkerkar · 2024-08-12T08:39:51Z

/retest

openshift-ci-robot · 2024-08-12T14:14:34Z

/retest-required

Remaining retests: 0 against base HEAD 6caab8b and 1 for PR HEAD 1ba88c3 in total

openshift-ci-robot · 2024-08-12T19:31:56Z

/retest-required

Remaining retests: 0 against base HEAD 16d393f and 0 for PR HEAD 1ba88c3 in total

openshift-ci-robot · 2024-08-13T02:31:41Z

/hold

Revision 1ba88c3 was retested 3 times: holding

kwilczynski · 2024-08-13T10:06:38Z

/unhold

haircommander · 2024-08-14T14:35:28Z

/skip
/retest

haircommander · 2024-08-14T14:35:49Z

/lgtm

openshift-ci · 2024-08-14T14:49:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kwilczynski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~templates/master/01-master-container-runtime/OWNERS~~ [haircommander]
~~templates/worker/01-worker-container-runtime/OWNERS~~ [haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-08-14T15:24:28Z

/retest-required

Remaining retests: 0 against base HEAD d1a18e7 and 2 for PR HEAD f8db91e in total

openshift-ci · 2024-08-14T20:57:35Z

@kwilczynski: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-cherrypick-robot · 2024-08-14T21:02:45Z

@kwilczynski: new pull request created: #4535

Details

In response to this:

/cherry-pick release-4.17

openshift-bot · 2024-08-14T23:48:04Z

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.18.0-202408142211.p0.g0c688ef.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci-robot added the jira/valid-reference label Aug 9, 2024

openshift-ci bot assigned kwilczynski Aug 9, 2024

openshift-ci bot requested review from QiWang19 and sohankunkerkar August 9, 2024 15:05

openshift-ci bot assigned haircommander Aug 9, 2024

openshift-ci bot added the lgtm and approved labels Aug 9, 2024

kwilczynski changed the title ~~OCPNODE-1784: Enable CRI-O internal repair feature as the default~~ OCPNODE-2482: Enable CRI-O internal repair feature as the default Aug 9, 2024

openshift-ci bot added the do-not-merge/hold label Aug 13, 2024

OCPNODE-1784: Enable CRI-O internal repair feature as the default

f8db91e

Signed-off-by: Krzysztof Wilczyński <[email protected]>

kwilczynski force-pushed the feature/enable-internal-repair branch from 1ba88c3 to f8db91e Compare August 13, 2024 10:06

openshift-ci bot removed the lgtm label Aug 13, 2024

openshift-ci bot removed the do-not-merge/hold label Aug 13, 2024

openshift-ci bot added the lgtm label Aug 14, 2024

openshift-merge-bot bot merged commit 0c688ef into openshift:master Aug 14, 2024

openshift-cherrypick-robot mentioned this pull request Aug 14, 2024

[release-4.17] OCPNODE-2482: Enable CRI-O internal repair feature as the default #4535

Merged

kwilczynski deleted the feature/enable-internal-repair branch October 9, 2024 16:48
