Use custom storage check options for CRI-O internal wipe #8417
|
/assign kwilczynski |
|
/hold |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8417      +/-   ##
==========================================
- Coverage   49.41%   49.19%   -0.23%
==========================================
  Files         153      154       +1
  Lines       17151    17267     +116
==========================================
+ Hits         8475     8494      +19
- Misses       7607     7704      +97
  Partials     1069     1069 |
aa6b4a2 to
7379453
Compare
|
/retest |
25f5ce7 to
4de322b
Compare
|
/unhold |
|
/retest |
|
@cri-o/cri-o-maintainers, please have a look. Thank you! |
4de322b to
02b5ac0
Compare
|
/retest |
|
@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-cgroupv2-e2e-crun, ci/prow/ci-crun-e2e
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: haircommander, kwilczynski |
kata failure. @kwilczynski can you override the test for now and open an issue so we can track the fix? |
dff04ff to
beebf5e
Compare
|
[...]
@haircommander, this should be resolved with a simple fix. I am not entirely sure how Kata uses c/storage. If there are any more issues, I will exclude this test from being run under Kata and investigate further. |
|
/lgtm |
Signed-off-by: Krzysztof Wilczyński <[email protected]>
Signed-off-by: Krzysztof Wilczyński <[email protected]>
beebf5e to
aa6d034
Compare
|
/lgtm |
|
/override ci/prow/ci-cgroupv2-e2e-crun |
|
@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-cgroupv2-e2e-crun, ci/prow/ci-crun-e2e |
|
[...]
@haircommander, there is something else going on that makes these tests (for
# umount: /tmp/tmp.r6XWDPgPOM/crio/overlay: not mounted
# rm: cannot remove '/tmp/tmp.r6XWDPgPOM/crio/overlay/49596f1dcc172a24792e553ad4ae1c39f7a7ddf2519d13f666be930c2581f060/merged': Device or resource busy
# rm: cannot remove '/tmp/tmp.r6XWDPgPOM/crio/overlay/7c35e529e22d16f1997139e990e44bc3263476c0eb397912061a29679b2515cd/merged': Device or resource busy
I excluded both tests from being run on Kata Containers for now. |
|
/test ci-crun-e2e |
|
/test ci-cgroupv2-e2e-crun |
|
/override ci/prow/ci-cgroupv2-e2e-crun |
|
@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-cgroupv2-e2e-crun, ci/prow/ci-crun-e2e |
|
/retest |
|
/override ci/prow/ci-crun-e2e |
|
@haircommander: Overrode contexts on behalf of haircommander: ci/prow/ci-crun-e2e |
What type of PR is this?
/kind feature
What this PR does / why we need it:
When enabled, the internal repair feature uses the storage.CheckEverything() helper function to configure what storage.Check() should check for when scanning the storage directory. The following options are enabled:
Two of these options, LayerDigests and LayerContents, are known to be I/O and CPU intensive. Both are used to perform storage integrity checks and require the contents of the images to be read, checksums generated, and the current state compared against the desired state. As such, depending on the number of images, the layers each image has, and the number of files within each layer, it can take a considerable amount of time before a storage scan with both of these options enabled completes.
Even though CRI-O only performs storage checks and repairs following an unclean shutdown, with the two options mentioned earlier enabled, it can take CRI-O some time to finish storage repairs. During this time, it won't be able to serve API requests. This can lead to the kubelet not getting responses to its requests in time, eventually assuming that CRI-O is not responsive, and marking this specific node NotReady, which is undesirable.
Thus, to mitigate the problem with storage scans potentially taking a long time to complete, we are going to switch to a custom configuration for storage.Check() that turns off these expensive checks and only turns on checks that allow CRI-O to perform a "quick" scan of the storage directory.
This quick scan should cover most of the common types of storage failures and take significantly less time to complete. However, this comes at the expense of container image integrity verification: files that are missing from the image or files that have their content altered will no longer be detected.
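The difference between the full and the quick configuration can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: checkOptions is a local, hypothetical stand-in for the options accepted by storage.Check(), and every field name other than LayerDigests and LayerContents is an assumption made for the example.

```go
package main

import "fmt"

// checkOptions is a hypothetical, local stand-in for the options passed to
// storage.Check(). Only LayerDigests and LayerContents are named in the PR
// description; the other field names are assumptions for illustration.
type checkOptions struct {
	LayerDigests   bool // expensive: needs checksums of layer contents
	LayerContents  bool // expensive: needs the files in each layer to be read
	LayerMountable bool // assumed inexpensive check
	LayerData      bool // assumed inexpensive check
	ImageData      bool // assumed inexpensive check
	ContainerData  bool // assumed inexpensive check
}

// checkEverything mimics a "check everything" helper: all checks enabled.
func checkEverything() checkOptions {
	return checkOptions{
		LayerDigests:   true,
		LayerContents:  true,
		LayerMountable: true,
		LayerData:      true,
		ImageData:      true,
		ContainerData:  true,
	}
}

// checkQuick starts from the full configuration and switches off only the
// two checks that require reading image contents.
func checkQuick() checkOptions {
	opts := checkEverything()
	opts.LayerDigests = false
	opts.LayerContents = false
	return opts
}

// readsLayerContents reports whether a scan with these options has to read
// layer contents, which is what makes a scan slow on large image stores.
func (o checkOptions) readsLayerContents() bool {
	return o.LayerDigests || o.LayerContents
}

func main() {
	fmt.Println("full scan reads layer contents: ", checkEverything().readsLayerContents())
	fmt.Println("quick scan reads layer contents:", checkQuick().readsLayerContents())
}
```

Deriving the quick configuration from the full one, rather than listing the inexpensive checks by hand, means that only the two expensive checks are ever turned off in this sketch.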
The goal of the internal repair is to ensure that CRI-O can recover from issues surrounding a corrupted storage directory on start-up, with the added option of removing only corrupted images and containers when attempting a repair, rather than removing the entire storage directory content. Hence, trading some correctness for a shorter scan run time is acceptable.
While at it, change the internal repair option to be enabled by default and add a new crio check sub-command, which allows checking the storage directory for errors and repairing it, if needed.
Related:
Which issue(s) this PR fixes:
None
Special notes for your reviewer:
None
Does this PR introduce a user-facing change?