Provide support for checkpoint and restore #4199

adrianreber · 2020-09-15T17:01:45Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Provide support for checkpoint and restore

This implements minimal container checkpoint and restore support in cri-o.

The minimal implementation offers the possibility to checkpoint and restore a container. Currently, checkpoints do not survive a reboot; supporting that would be a nice additional feature. Because 'bundlePath' is on a tmpfs (/run), the checkpoint is gone after a reboot. It would be possible to store the checkpoint somewhere else so that it survives a reboot, but restoring a container also requires config.json.
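To illustrate the tmpfs point, here is a minimal sketch (not part of this PR) that decides which filesystem type backs a given path by reading a mount table in the `/proc/mounts` format. The function name `fs_type_for`, the sample mount table, and the example paths are assumptions made up for illustration; mount points containing escaped spaces are not handled.

```python
def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        mount_point, fs_type = fields[1], fields[2]
        # A mount point contains the path if it equals it or is a parent directory.
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) > len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table in /proc/mounts format (illustration only).
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
/dev/sda2 /var ext4 rw,relatime 0 0
"""

# A checkpoint below /run lives on tmpfs and is lost on reboot;
# one below /var is on persistent storage in this sample table.
print(fs_type_for("/run/containers/example-bundle", SAMPLE_MOUNTS))
print(fs_type_for("/var/lib/example-checkpoints", SAMPLE_MOUNTS))
```

On a real host the second argument would be the contents of `/proc/mounts`.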

Similar to Podman, the plan is to support exporting checkpoints that include all information necessary to restore a container after a reboot or on another system, but this is not part of this initial implementation.

Using 'crictl', this looks like:

# crictl checkpoint CTR-ID
# crictl restore CTR-ID

In addition to the simple checkpoint and restore support this implementation also offers the possibility to restore a container into another sandbox.

# crictl checkpoint CTR-ID
# crictl restore -p POD-ID CTR-ID # POD-ID being the destination pod

This requires the latest runc version (opencontainers/runc#2583) so that runc can tell CRIU to join the namespaces of the infrastructure container.

This is based on the Podman checkpoint/restore support.

Special notes for your reviewer:

As mentioned, this is only the initial, minimal change to get checkpoint/restore working. The corresponding API changes are not yet submitted to the corresponding upstream repository as I am not yet sure how that works. This is definitely, for now, a draft to get feedback about this implementation/approach.

Does this PR introduce a user-facing change?

This adds minimal checkpoint/restore support to cri-o. The minimal support allows checkpointing a container and restoring it again. Optionally, it can be restored into another pod.

openshift-ci-robot · 2020-09-15T17:01:53Z

Hi @adrianreber. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

haircommander · 2020-09-15T17:48:02Z

/ok-to-test

thanks for taking this on @adrianreber ! I am curious what the motivation is to add this to cri-o. is there desire for checkpoint/restore to be added to k8s?

haircommander · 2020-09-15T17:48:49Z

ah I just have looked at the issues, nevermind 😆

haircommander · 2020-09-15T17:50:35Z

you likely will have to submit a KEP for sig-node to discuss liability, use-cases, etc to get full support from the community and everything merged

adrianreber · 2020-09-17T07:03:44Z

@haircommander Thanks for the KEP pointer. I submitted one here: kubernetes/enhancements#1990

openshift-ci-robot · 2020-12-10T08:42:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adrianreber
To complete the pull request process, please assign fidencio after the PR has been reviewed.
You can assign the PR to them by writing /assign @fidencio in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yimin-zhao · 2020-12-10T09:14:49Z

Hi @adrianreber, I ran into an error when trying to reproduce your work in my local environment. Here is the symptom:
I followed the crio tutorial and created a redis server in a pod sandbox. Then I ran checkpoint, which returned successfully, but when I ran restore, the command line blocked. I checked the processes in the background and redis had been brought back to life, but I found a few error lines in restore.log:

(00.123143) Running post-restore scripts
(00.123149)     RPC
(00.126661) mnt: Switching to new ns to clean ghosts
(00.142560) Unlock network
(00.142568) Running network-unlock scripts
(00.142572)     RPC
iptables-restore: line 5 failed
(00.146070) Error (criu/util.c:631): exited, status=1
ip6tables-restore: line 5 failed
(00.148616) Error (criu/util.c:631): exited, status=1
(00.148890) pie: 11: seccomp: mode 0 on tid 11
(00.148984) pie: 10: seccomp: mode 0 on tid 10
(00.149063) pie: 9: seccomp: mode 0 on tid 9
(00.149137) pie: 1: seccomp: mode 0 on tid 1
....
(00.150002) 22099 (native) is going to execute the syscall 15, required is 15
(00.150021) 22099 was stopped
(00.150054) 22099 was trapped
(00.150060) 22099 (native) is going to execute the syscall 11, required is 11
(00.150115) 22099 was stopped
(00.150126) Running pre-resume scripts
(00.150130)     RPC
(00.169641) Restore finished successfully. Tasks resumed.
(00.169652) Writing stats
(00.169737) Running post-resume scripts
(00.169741)     RPC

Does this error matter? To me, it looks like something goes wrong when returning the final status.
I rebased your code onto cri-o's master branch and also built crictl from your crictl PR.

Thanks
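For readers triaging a similar restore.log, the following is a small sketch that pulls the `Error (...)` entries out of a CRIU log and checks whether the restore still reported success. The line format is inferred only from the excerpt above, not from a CRIU format specification, and the helper names `criu_errors` and `restore_succeeded` are hypothetical.

```python
import re

# Pattern inferred from the excerpt: "(SS.UUUUUU) Error (file.c:LINE): message"
ERROR_RE = re.compile(r"^\((\d+\.\d+)\)\s+Error \(([^:)]+):(\d+)\):\s*(.*)$")


def criu_errors(log_text):
    """Return (timestamp, source_file, source_line, message) for each Error line."""
    errors = []
    for line in log_text.splitlines():
        m = ERROR_RE.match(line.strip())
        if m:
            ts, src, lineno, msg = m.groups()
            errors.append((float(ts), src, int(lineno), msg))
    return errors


def restore_succeeded(log_text):
    """True if the log contains the success marker seen in the excerpt."""
    return "Restore finished successfully" in log_text


# Lines copied from the restore.log excerpt in the comment above.
SAMPLE_LOG = """\
(00.142568) Running network-unlock scripts
iptables-restore: line 5 failed
(00.146070) Error (criu/util.c:631): exited, status=1
ip6tables-restore: line 5 failed
(00.148616) Error (criu/util.c:631): exited, status=1
(00.169641) Restore finished successfully. Tasks resumed.
"""

for ts, src, lineno, msg in criu_errors(SAMPLE_LOG):
    print(f"{ts:.6f} {src}:{lineno} {msg}")
print("restore succeeded:", restore_succeeded(SAMPLE_LOG))
```

In this excerpt both errors come from the iptables-restore and ip6tables-restore helpers during network unlock, while the restore itself is still reported as finished.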

adrianreber · 2020-12-10T12:58:13Z

To be able to present a working implementation of kubectl drain 127.0.0.1 --checkpoint for kubernetes/enhancements#1990 I extended this PR from a proof of concept to a complete implementation of container and pod checkpoint/restore. With this PR I can checkpoint pods, reboot and restore those pods:

# crictl ps -a
CONTAINER           IMAGE                               CREATED             STATE               NAME                ATTEMPT             POD ID
1969cedbab7fc       quay.io/adrianreber/counter         2 seconds ago       Running             counter             0                   06e959aa6430e
b0793d84aeb18       quay.io/adrianreber/wildfly-hello   2 seconds ago       Running             wildfly             0                   06e959aa6430e
# curl `crictl inspect 1969cedbab7fc | jq -r '.info.runtimeSpec.annotations["io.kubernetes.cri-o.IP.0"]'`:8088
counter: 0
# curl `crictl inspect 1969cedbab7fc | jq -r '.info.runtimeSpec.annotations["io.kubernetes.cri-o.IP.0"]'`:8088
counter: 1
# curl `crictl inspect b0793d84aeb18 | jq -r '.info.runtimeSpec.annotations["io.kubernetes.cri-o.IP.0"]'`:8080/helloworld/
0
# curl `crictl inspect b0793d84aeb18 | jq -r '.info.runtimeSpec.annotations["io.kubernetes.cri-o.IP.0"]'`:8080/helloworld/
1
# crictl checkpoint --export=/tmp/cp.tar 06e959aa6430e
06e959aa6430e
# reboot
...
# crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
# crictl restore --import=/tmp/cp.tar
6f304b70f09d5d9cf85dbbdc448aa794369e4b955e6cd284de08cea04c8dc77f
# crictl ps -a
CONTAINER           IMAGE                                      CREATED             STATE               NAME                ATTEMPT             POD ID
6f304b70f09d5       quay.io/adrianreber/wildfly-hello:latest   2 seconds ago       Running             wildfly             0                   160cdcd98301f
cd5a64efdb752       quay.io/adrianreber/counter:latest         3 seconds ago       Running             counter             0                   160cdcd98301f
# curl `crictl inspect 6f304b70f09d5 | jq -r '.info.runtimeSpec.annotations["io.kubernetes.cri-o.IP.0"]'`:8080/helloworld/
2
# curl `crictl inspect cd5a64efdb752 | jq -r '.info.runtimeSpec.annotations["io.kubernetes.cri-o.IP.0"]'`:8088
counter: 2

Upon restore, a new Pod is created and the checkpointed containers are restored into it. Stateful applications (I am using a Python server (counter) and a Java server (wildfly)) continue to run from their previous state.
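The `jq` expression used in the transcript above reads the pod IP from an annotation in the `crictl inspect` output. As a rough Python equivalent, the sketch below does the same lookup on a JSON string. The function name `pod_ip_from_inspect` and the sample IP address are assumptions for illustration; only the annotation key and the JSON path are taken from the commands shown.

```python
import json

# Annotation key taken from the jq expression in the transcript.
IP_ANNOTATION = "io.kubernetes.cri-o.IP.0"


def pod_ip_from_inspect(inspect_json):
    """Extract the pod IP annotation from `crictl inspect` JSON output."""
    data = json.loads(inspect_json)
    return data["info"]["runtimeSpec"]["annotations"][IP_ANNOTATION]


# Hypothetical, heavily trimmed inspect output (the IP address is made up).
SAMPLE_INSPECT = json.dumps(
    {"info": {"runtimeSpec": {"annotations": {IP_ANNOTATION: "10.85.0.7"}}}}
)

print(pod_ip_from_inspect(SAMPLE_INSPECT))
```

On a real node the JSON string could come from `subprocess.run(["crictl", "inspect", ctr_id], capture_output=True, text=True).stdout`, where `ctr_id` is the container ID.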

This is all based on my Podman checkpoint/restore implementation.

adrianreber · 2020-12-10T13:09:31Z

@yimin-zhao Nice to see interest in this PR. I am always checkpointing and restoring a Wildfly and a Python based server.

I never tried redis before, but using the steps from the tutorial with this latest PR it works for me to checkpoint and restore the redis server. Can you retry with my latest changes?

adrianreber · 2022-07-19T14:25:24Z

Seems make bundle fails because I am using a git hash as version for cri-tools and it seems make bundle requires a released version. As long as there is no cri-tools release with checkpoint support this PR is stuck.

openshift-ci · 2022-07-22T01:45:40Z

@adrianreber: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/openshift-jenkins/integration_crun_cgroupv2	`f90f22c`	link	false	`/test integration_cgroupv2`
ci/openshift-jenkins/e2e_fedora	`f90f22c`	link	true	`/test e2e_fedora`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

This implements checkpoint and restore in cri-o. Everything is based on the Podman implementation and copied from Podman wherever possible (with small changes to fit into cri-o).

Currently the functionality is implemented as defined in the "Forensic Container Checkpointing" KEP. Only containers can be checkpointed, not pods. The container will always continue to run after the checkpoint.

Using checkpoint via crictl could look something like this:

# crictl checkpoint --export=archive.tar CTR_ID

The checkpointed container can be restored in a new pod using:

# crictl create
# crictl start CTR_ID

Signed-off-by: Adrian Reber <[email protected]>

adrianreber · 2022-08-26T16:46:19Z

Applied all suggestions and force pushed.

haircommander · 2022-08-26T17:50:19Z

/approve

LGTM

@mrunalp @saschagrunert for the final lgtm?

openshift-ci · 2022-08-29T06:48:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adrianreber, haircommander, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot requested review from saschagrunert and umohnani8 September 15, 2020 17:01

openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 15, 2020

This was referenced Sep 15, 2020

Implement checkpoint/restore commands kubernetes-sigs/cri-tools#662

Closed

Pod lifecycle checkpointing kubernetes/kubernetes#3949

Open

openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 15, 2020

adrianreber mentioned this pull request Sep 17, 2020

Add Forensic Container Checkpointing KEP kubernetes/enhancements#1990

Merged

schrej reviewed Sep 21, 2020

vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.proto (outdated, resolved)

ashish-billore mentioned this pull request Oct 23, 2020

criu with k8s, how to? checkpoint-restore/criu#1242

Closed

adrianreber force-pushed the checkpoint-restore-support branch from 982a48c to a1cbaed Compare December 10, 2020 08:42

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2020

adrianreber force-pushed the checkpoint-restore-support branch from a1cbaed to 9897b14 Compare December 10, 2020 10:08

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2020

adrianreber mentioned this pull request Dec 10, 2020

[WIP] Add --checkpoint to drain kubernetes/kubernetes#97194

Closed

adrianreber force-pushed the checkpoint-restore-support branch 2 times, most recently from 682a227 to 637d0a9 Compare December 12, 2020 15:51

adrianreber force-pushed the checkpoint-restore-support branch from 2cd1af4 to f9ec4ad Compare July 19, 2022 11:34

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2022

adrianreber force-pushed the checkpoint-restore-support branch from f9ec4ad to f90f22c Compare July 19, 2022 11:47

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 23, 2022

haircommander reviewed Aug 25, 2022

internal/lib/restore.go (outdated, resolved)

adrianreber force-pushed the checkpoint-restore-support branch from f90f22c to 6e5880d Compare August 26, 2022 13:53

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 26, 2022

adrianreber force-pushed the checkpoint-restore-support branch 2 times, most recently from 4d5c8d1 to 056679b Compare August 26, 2022 14:47

haircommander reviewed Aug 26, 2022

server/container_restore.go (outdated, resolved)

haircommander reviewed Aug 26, 2022

scripts/github-actions-packages (resolved)

haircommander reviewed Aug 26, 2022

pkg/config/config.go (outdated, resolved)

adrianreber added 3 commits August 26, 2022 16:45

test: do not hard code CNI location

b033570

Allow CNI plugins location to be set via environment variable. Signed-off-by: Adrian Reber <[email protected]>

test: add checkpoint/restore tests

f253c4b

Signed-off-by: Adrian Reber <[email protected]>

adrianreber force-pushed the checkpoint-restore-support branch from 056679b to f253c4b Compare August 26, 2022 16:45

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2022

saschagrunert approved these changes Aug 29, 2022

openshift-ci bot assigned saschagrunert Aug 29, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2022

openshift-merge-robot merged commit f975faa into cri-o:main Aug 29, 2022

jrvaldes mentioned this pull request Jan 10, 2023

[release-1.25] Fix GitHub actions CI #6505

Merged

adrianreber mentioned this pull request Jan 24, 2023

REQUEST: New organization membership for adrianreber #6562

Closed

4 tasks

rst0git mentioned this pull request Oct 9, 2023

checkpoint: clean-up checkpoint dir after export #7355

Merged

yeazelm mentioned this pull request Feb 29, 2024

Checkpoint/Restart or Live Motion bottlerocket-os/bottlerocket#3803

Open
