
Conversation

@kwilczynski
Contributor

@kwilczynski kwilczynski commented Jul 25, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

When enabled, the internal repair feature uses the storage.CheckEverything() helper function to configure what storage.Check() checks for when scanning the storage directory. The following options are enabled:

func CheckEverything() *CheckOptions {
	return &CheckOptions{
		LayerDigests:   true,
		LayerMountable: true,
		LayerContents:  true,
		LayerData:      true,
		ImageData:      true,
		ContainerData:  true,
	}
}

Two of these options, LayerDigests and LayerContents, are known to be I/O and CPU intensive. Both perform storage integrity checks that require reading the contents of the images, generating checksums, and comparing the current state against the desired state. As such, depending on the number of images, the number of layers in each image, and the number of files within each layer, a storage scan with both of these options enabled can take a considerable amount of time to complete.

Even though CRI-O only performs storage checks and repairs following an unclean shutdown, with the two options mentioned earlier enabled, it can take CRI-O some time to finish storage repairs. During this time, it won't be able to serve API requests. This can lead to the kubelet not getting responses to its requests in time, eventually assuming that CRI-O is not responsive, and marking this specific node NotReady, which is undesirable.

Thus, to mitigate the problem with storage scans potentially taking a long time to complete, we are going to switch to a custom configuration for storage.Check() that turns off these expensive checks and only turns on checks that allow CRI-O to perform a "quick" scan of the storage directory.
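A rough sketch of what such a "quick" configuration could look like is shown below. The CheckOptions type here is a local stand-in that mirrors only the fields listed in the CheckEverything() snippet above, so the example runs without the storage library. The helper name quickCheckOptions is hypothetical, and leaving the four remaining checks enabled is an assumption based on the description, not a copy of the code in this PR.

```go
package main

import "fmt"

// CheckOptions is a local stand-in mirroring the fields shown in the
// CheckEverything() snippet; it is NOT imported from the storage library.
type CheckOptions struct {
	LayerDigests   bool
	LayerMountable bool
	LayerContents  bool
	LayerData      bool
	ImageData      bool
	ContainerData  bool
}

// quickCheckOptions (hypothetical name) disables the two checks described
// as I/O and CPU intensive and keeps the rest enabled. Which of the cheaper
// checks stay on is an assumption made for this sketch.
func quickCheckOptions() *CheckOptions {
	return &CheckOptions{
		LayerDigests:   false, // expensive: requires generating checksums
		LayerMountable: true,
		LayerContents:  false, // expensive: requires reading layer contents
		LayerData:      true,
		ImageData:      true,
		ContainerData:  true,
	}
}

func main() {
	fmt.Printf("%+v\n", *quickCheckOptions())
}
```

With the real library, a value shaped like this would take the place of the result of storage.CheckEverything() when calling storage.Check().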

This quick scan should cover most of the common types of storage failures and take significantly less time to complete. However, this comes at the expense of container image integrity verification, where files that are missing from the image or files that have their content altered will no longer be detected.

The goal of the internal repair is to ensure that CRI-O can recover from issues surrounding a corrupted storage directory on start-up, with the added option of removing only the corrupted images and containers when attempting a repair, rather than removing the entire storage directory content. Hence, trading some correctness for a shorter scan run time is acceptable.

While at it, change the internal repair option to be enabled by default and add a new crio check sub-command, which allows checking the storage directory for errors and repairing it, if needed.

Related:

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

None

Does this PR introduce a user-facing change?

Update the type of checks the internal repair feature performs on CRI-O's start-up following an unclean shutdown, enable the internal repair option by default, and add a new `crio check` sub-command.

@kwilczynski kwilczynski requested a review from mrunalp as a code owner July 25, 2024 15:35
@openshift-ci openshift-ci bot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jul 25, 2024
@openshift-ci openshift-ci bot requested review from QiWang19 and hasan4791 July 25, 2024 15:36
@kwilczynski
Contributor Author

/assign kwilczynski

@kwilczynski kwilczynski changed the title Use custom storage check options for CRI-O internal wipe [WIP] Use custom storage check options for CRI-O internal wipe Jul 25, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2024
@kwilczynski
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 25, 2024
@codecov

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 10.00000% with 117 lines in your changes missing coverage. Please review.

Project coverage is 49.19%. Comparing base (174ee6e) to head (aa6d034).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8417      +/-   ##
==========================================
- Coverage   49.41%   49.19%   -0.23%     
==========================================
  Files         153      154       +1     
  Lines       17151    17267     +116     
==========================================
+ Hits         8475     8494      +19     
- Misses       7607     7704      +97     
  Partials     1069     1069              

@kwilczynski kwilczynski force-pushed the feature/internal-repair-changes branch 2 times, most recently from aa6b4a2 to 7379453 Compare July 26, 2024 04:07
@kwilczynski
Contributor Author

/retest

@kwilczynski kwilczynski force-pushed the feature/internal-repair-changes branch 2 times, most recently from 25f5ce7 to 4de322b Compare July 26, 2024 13:10
@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jul 26, 2024
@kwilczynski kwilczynski changed the title [WIP] Use custom storage check options for CRI-O internal wipe Use custom storage check options for CRI-O internal wipe Jul 26, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 26, 2024
@kwilczynski
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 26, 2024
@kwilczynski
Contributor Author

/retest

@kwilczynski
Contributor Author

@cri-o/cri-o-maintainers, please have a look. Thank you!

@kwilczynski kwilczynski force-pushed the feature/internal-repair-changes branch from 4de322b to 02b5ac0 Compare July 26, 2024 17:07
@kwilczynski
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2024

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-cgroupv2-e2e-crun, ci/prow/ci-crun-e2e

Details

In response to this:

/lgtm
/approve

/override ci/prow/ci-cgroupv2-e2e-crun
/override ci/prow/ci-crun-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kwilczynski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2024
@haircommander
Member

not ok 32 recover from badly corrupted storage directory
# (in test file crio-wipe.bats, line 363)
#   `umount -R -l -f "$TESTDIR"/crio/overlay' failed
# time="2024-08-09 12:22:23.605946349Z" level=info msg="Updating config from single file: /etc/crio/crio.conf"
# time="2024-08-09 12:22:23.605989403Z" level=info msg="Updating config from drop-in file: /etc/crio/crio.conf"
# time="2024-08-09 12:22:23.606935825Z" level=info msg="Updating config from path: /etc/crio/crio.conf.d"
# time="2024-08-09 12:22:23.606997365Z" level=info msg="Updating config from drop-in file: /etc/crio/crio.conf.d/01-ns-lifecycle.conf"
# time="2024-08-09 12:22:23.607093229Z" level=info msg="Updating config from drop-in file: /etc/crio/crio.conf.d/01-overlay.conf"
# time="2024-08-09 12:22:23.607125756Z" level=info msg="Updating config from drop-in file: /etc/crio/crio.conf.d/99-log-level.conf"
# level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
# time="2024-08-09 12:22:23.643382454Z" level=info msg="Updating config from path: "
# level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
# time="2024-08-09T12:22:23Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///tmp/tmp.pLNidEVRyE/crio.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /tmp/tmp.pLNidEVRyE/crio.sock: connect: no such file or directory\""
# 1c2460bcafe6c2d536875edefccd6c0da43fa18caa2422d568ebbe5232b593b5
# umount: /tmp/tmp.pLNidEVRyE/crio/overlay: not mounted

kata failure. @kwilczynski can you override the test for now and open an issue so we can track the fix?

@kwilczynski kwilczynski force-pushed the feature/internal-repair-changes branch from dff04ff to beebf5e Compare August 9, 2024 14:11
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2024
@kwilczynski
Contributor Author

[...]

# umount: /tmp/tmp.pLNidEVRyE/crio/overlay: not mounted

kata failure. @kwilczynski can you override the test for now and open an issue so we can track the fix?

@haircommander, should be resolved with a simple fix.

I am not entirely sure how Kata uses c/storage.

If there are any more issues, I will exclude this test from being run under Kata, and have a look some more.

@haircommander
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2024
Krzysztof Wilczyński added 2 commits August 10, 2024 01:07
@kwilczynski kwilczynski force-pushed the feature/internal-repair-changes branch from beebf5e to aa6d034 Compare August 9, 2024 16:07
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2024
@haircommander
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2024
@haircommander
Member

/override ci/prow/ci-cgroupv2-e2e-crun
/override ci/prow/ci-crun-e2e

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2024

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-cgroupv2-e2e-crun, ci/prow/ci-crun-e2e

Details

In response to this:

/override ci/prow/ci-cgroupv2-e2e-crun
/override ci/prow/ci-crun-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kwilczynski
Contributor Author

[...]

If there are any more issues, I will exclude this test from being run under Kata, and have a look some more.

@haircommander, there is something else going on that makes these tests (for crio check and crio wipe) behave differently when run on Kata Containers per:

# umount: /tmp/tmp.r6XWDPgPOM/crio/overlay: not mounted
# rm: cannot remove '/tmp/tmp.r6XWDPgPOM/crio/overlay/49596f1dcc172a24792e553ad4ae1c39f7a7ddf2519d13f666be930c2581f060/merged': Device or resource busy
# rm: cannot remove '/tmp/tmp.r6XWDPgPOM/crio/overlay/7c35e529e22d16f1997139e990e44bc3263476c0eb397912061a29679b2515cd/merged': Device or resource busy

I excluded both tests from being run on Kata Containers for now.

@kwilczynski
Contributor Author

/test ci-crun-e2e

@kwilczynski
Contributor Author

/test ci-cgroupv2-e2e-crun

@haircommander
Member

/override ci/prow/ci-cgroupv2-e2e-crun
/override ci/prow/ci-crun-e2e

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2024

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-cgroupv2-e2e-crun, ci/prow/ci-crun-e2e

Details

In response to this:

/override ci/prow/ci-cgroupv2-e2e-crun
/override ci/prow/ci-crun-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@haircommander
Member

/retest

@haircommander
Member

/override ci/prow/ci-crun-e2e

@openshift-ci
Contributor

openshift-ci bot commented Aug 9, 2024

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-crun-e2e

Details

In response to this:

/override ci/prow/ci-crun-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.

3 participants