@kwilczynski kwilczynski commented Aug 9, 2024

- What I did

For a while now, CRI-O has been able to attempt a repair of the storage directory on start-up following an unclean shutdown, such as a node crash or unexpected restart. This allows CRI-O to recover from storage directory corruption, avoiding potential crashes or termination with a fatal error once it has started back up.

However, this internal repair feature has to be enabled to take effect, so it's currently an opt-in solution. This feature has matured and can be turned on as the new default when deploying OpenShift clusters. As a result, our customers will benefit from improved cluster resilience.

Thus, turn the internal repair feature on as the new default.
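
As a rough illustration of what enabling the option amounts to on a node, the sketch below renders a CRI-O configuration fragment. It assumes the option is exposed as `internal_repair` in the `[crio]` table, as in upstream CRI-O's configuration file. The helper name `render_internal_repair_dropin` is hypothetical, and this is not the template changed by this PR.

```python
def render_internal_repair_dropin(enabled: bool = True) -> str:
    """Render a TOML fragment that toggles CRI-O's internal repair option.

    Assumption: the option lives in the [crio] table under the name
    `internal_repair`, as in upstream CRI-O's crio.conf.
    """
    # TOML booleans are lowercase, unlike Python's True/False.
    value = "true" if enabled else "false"
    return f"[crio]\ninternal_repair = {value}\n"


if __name__ == "__main__":
    # Print the fragment that opts in to the repair behaviour.
    print(render_internal_repair_dropin(), end="")
```

Passing `enabled=False` renders the previous opt-in default, which may be handy when comparing node behaviour before and after the change.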

Related:

- How to verify it

Deploy the updated configuration manually or using the Machine Config Operator.

Then, to verify that the internal repair feature works, follow these steps:

(assuming that the test will be performed using an OpenShift cluster with privileged access available)

  1. Stop the kubelet and CRI-O services
  2. Navigate to the /var/lib/containers directory
  3. Remove a random layer from any of the available container images (to corrupt the storage directory)
  4. Remove the /var/lib/crio/clean.shutdown file to simulate an unclean CRI-O shutdown
  5. Start the CRI-O and kubelet services
  6. Verify that both services are working correctly
  7. Verify that CRI-O ran the check and repair to fix the corrupted storage directory on start-up (see the service logs)
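
The steps above can be sketched as a dry-run plan that only builds the commands and never executes them. This is a minimal sketch under assumptions: the systemd unit names `kubelet` and `crio`, the helper name `build_verification_plan`, and the example layer path are not taken from this PR, and the layer to remove still has to be picked by hand under /var/lib/containers.

```python
import os
import shlex
from typing import List

STORAGE_ROOT = "/var/lib/containers"                  # directory from step 2
CLEAN_SHUTDOWN_FILE = "/var/lib/crio/clean.shutdown"  # file from step 4


def build_verification_plan(layer_path: str) -> List[List[str]]:
    """Return the commands for the verification steps, in order.

    Nothing is executed here; the caller decides how to run each command.
    Unit names `kubelet` and `crio` are assumptions about the node.
    """
    normalized = os.path.normpath(layer_path)
    # Refuse anything outside the storage root, and the root itself.
    if not normalized.startswith(STORAGE_ROOT + "/"):
        raise ValueError(f"{layer_path!r} is not inside {STORAGE_ROOT}")
    return [
        ["systemctl", "stop", "kubelet"],                # step 1
        ["systemctl", "stop", "crio"],                   # step 1
        ["rm", "-rf", "--", normalized],                 # step 3
        ["rm", "-f", "--", CLEAN_SHUTDOWN_FILE],         # step 4
        ["systemctl", "start", "crio"],                  # step 5
        ["systemctl", "start", "kubelet"],               # step 5
        ["systemctl", "is-active", "crio", "kubelet"],   # step 6
        ["journalctl", "--unit", "crio", "--boot", "--no-pager"],  # step 7
    ]


if __name__ == "__main__":
    # Hypothetical layer path, shown only to illustrate the output format.
    example = "/var/lib/containers/storage/overlay/<layer-id>"
    for command in build_verification_plan(example):
        print(shlex.join(command))
```

Printing the plan instead of running it keeps the destructive steps under manual control, which seems prudent given that step 3 deliberately corrupts the storage directory.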

- Description for the changelog

Change the internal repair option to be enabled by default.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Aug 9, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 9, 2024

@kwilczynski: This pull request references OCPNODE-1784 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@kwilczynski
Author

/assign kwilczynski

@kwilczynski
Author

/cherry-pick release-4.17

@openshift-cherrypick-robot

@kwilczynski: once the present PR merges, I will cherry-pick it on top of release-4.17 in a new PR and assign it to you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kwilczynski
Author

/approve

@haircommander
Member

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm (indicates that a PR is ready to be merged) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels Aug 9, 2024
@haircommander
Member

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 9, 2024

@kwilczynski: This pull request references OCPNODE-1784 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.

@kwilczynski kwilczynski changed the title from "OCPNODE-1784: Enable CRI-O internal repair feature as the default" to "OCPNODE-2482: Enable CRI-O internal repair feature as the default" Aug 9, 2024
@openshift-ci-robot
Contributor

openshift-ci-robot commented Aug 9, 2024

@kwilczynski: This pull request references OCPNODE-2482 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

@kwilczynski
Author

/retest

@haircommander
Member

/skip

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD da94f1b and 2 for PR HEAD 1ba88c3 in total

@haircommander
Member

/skip

@haircommander
Member

/skip

@haircommander
Member

/retest

@kwilczynski
Author

/test e2e-aws-ovn-upgrade

@haircommander
Member

/skip
/retest

@haircommander
Member

/retest

@kwilczynski
Author

@cri-o/cri-o-maintainers, please have a look. Thank you!

@sohankunkerkar
Member

/retest

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 6caab8b and 1 for PR HEAD 1ba88c3 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 16d393f and 0 for PR HEAD 1ba88c3 in total

@openshift-ci-robot
Contributor

/hold

Revision 1ba88c3 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Aug 13, 2024
@kwilczynski kwilczynski force-pushed the feature/enable-internal-repair branch from 1ba88c3 to f8db91e August 13, 2024 10:06
@openshift-ci openshift-ci bot removed the lgtm label (indicates that a PR is ready to be merged) Aug 13, 2024
@kwilczynski
Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Aug 13, 2024
@haircommander
Member

/skip
/retest

@haircommander
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Aug 14, 2024
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kwilczynski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d1a18e7 and 2 for PR HEAD f8db91e in total

@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2024

@kwilczynski: all tests passed!

Full PR test history. Your PR dashboard.

@openshift-cherrypick-robot

@kwilczynski: new pull request created: #4535

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.18.0-202408142211.p0.g0c688ef.assembly.stream.el9.
All builds following this will include this PR.

@kwilczynski kwilczynski deleted the feature/enable-internal-repair branch October 9, 2024 16:48