Provide support for checkpoint and restore #4199
|
Hi @adrianreber. Thanks for your PR. I'm waiting for a cri-o member to verify that this patch is reasonable to test. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test Thanks for taking this on, @adrianreber! I am curious what the motivation is for adding this to cri-o. Is there a desire for checkpoint/restore to be added to k8s? |
|
ah I just have looked at the issues, nevermind 😆 |
|
you likely will have to submit a KEP for sig-node to discuss liability, use-cases, etc to get full support from the community and everything merged |
|
@haircommander Thanks for the KEP pointer. I submitted one here: kubernetes/enhancements#1990 |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: adrianreber. |
|
Hi @adrianreber, I hit an error when trying to reproduce your work in my local environment. Here is the symptom: Does this error matter? To me, it looks like something goes wrong when returning the final status. Thanks |
|
To be able to present a working implementation of [...]. Upon restore, a new Pod is created and the checkpointed containers are restored into it; stateful applications (I am using a Python server (counter) and a Java server (wildfly)) continue to run from their previous state. This is all based on my Podman checkpoint/restore implementation. |
|
@yimin-zhao Nice to see interest in this PR. I am always checkpointing and restoring a Wildfly and a Python based server. I never tried redis before, but using the steps from the tutorial with this latest PR it works for me to checkpoint and restore the redis server. Can you retry with my latest changes? |
|
Seems |
|
@adrianreber: The following tests failed. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This implements checkpoint and restore in cri-o. Everything is based on the Podman implementation and copied from Podman wherever possible (with small changes to fit into cri-o).

Currently the functionality is implemented as defined in the "Forensic Container Checkpointing" KEP. Only containers can be checkpointed, not pods. The container always continues to run after the checkpoint.

Using checkpoint via crictl could look something like this:

    # crictl checkpoint --export=archive.tar CTR_ID

The checkpointed container can be restored in a new pod using:

    # crictl create
    # crictl start CTR_ID

Signed-off-by: Adrian Reber <[email protected]>
Allow CNI plugins location to be set via environment variable. Signed-off-by: Adrian Reber <[email protected]>
Signed-off-by: Adrian Reber <[email protected]>
|
Applied all suggestions and force pushed. |
|
/approve LGTM @mrunalp @saschagrunert for the final lgtm? |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: adrianreber, haircommander, saschagrunert. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Provide support for checkpoint and restore
This implements minimal container checkpoint and restore support in cri-o.
The minimal implementation makes it possible to checkpoint and restore a container. Currently checkpoints do not survive a reboot; making them persistent would be a nice additional feature. Because 'bundlePath' is on a tmpfs (/run), the checkpoint is gone after a reboot. The checkpoint could be stored somewhere else to survive a reboot, but restoring a container also requires its config.json.
Similar to Podman, the plan is to support exporting checkpoints that include all the information needed to restore a container after a reboot or on another system, but this is not part of this initial implementation. Using 'crictl' this looks like:
In addition to the simple checkpoint and restore support this implementation also offers the possibility to restore a container into another sandbox.
This requires a runc version that includes opencontainers/runc#2583, so that runc can tell CRIU to join the namespaces of the infrastructure container.
This is based on the Podman checkpoint/restore support.
Special notes for your reviewer:
As mentioned, this is only the initial, minimal change to get checkpoint/restore working. The corresponding API changes have not yet been submitted to the upstream repository, as I am not yet sure how that works. For now, this is definitely a draft to get feedback on the implementation and approach.
Does this PR introduce a user-facing change?