Handle freeze on paused VMs during snapshot #15001
Hi @noamasu. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
0xFelix
left a comment
Thanks for working on this, I've added some comments.
| } else { | ||
| return err | ||
| } |
| } else { | |
| return err | |
| } | |
| } | |
| return err |
nit: additional else not required
why do we need to check for not found and do a new error? why not return the error which is not found anyways?
| } else { | ||
| return err | ||
| } |
Same
tests/storage/snapshot.go
Outdated
| } else { | ||
| vmi = libvmifact.NewAlpine(libnet.WithMasqueradeNetworking()) | ||
| } | ||
| vmi.Namespace = testsuite.GetTestNamespace(nil) |
There's also libvmi.WithNamespace. Can you put both libvmi Options into an opts slice?
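A minimal sketch of the suggested shape: collect the options into one slice and hand it to whichever factory applies. The types and constructors below are simplified, hypothetical stand-ins written for this example; they only mimic the functional-option style of libvmi and libvmifact and are not the real KubeVirt definitions.

```go
package main

import "fmt"

// VMI is a hypothetical stand-in for a VirtualMachineInstance object.
type VMI struct {
	Image      string
	Namespace  string
	Networking string
}

// Option mutates a VMI while it is being built (functional-option pattern).
type Option func(*VMI)

// WithNamespace and WithMasqueradeNetworking are illustrative options only.
func WithNamespace(ns string) Option   { return func(v *VMI) { v.Namespace = ns } }
func WithMasqueradeNetworking() Option { return func(v *VMI) { v.Networking = "masquerade" } }

// newVMI applies every option in order to a fresh VMI.
func newVMI(image string, opts ...Option) *VMI {
	v := &VMI{Image: image}
	for _, opt := range opts {
		opt(v)
	}
	return v
}

// NewFedora and NewAlpine stand in for the two image factories.
func NewFedora(opts ...Option) *VMI { return newVMI("fedora", opts...) }
func NewAlpine(opts ...Option) *VMI { return newVMI("alpine", opts...) }

// buildVMI shares one option slice between both branches, so the namespace
// no longer has to be assigned on the object after construction.
func buildVMI(withGuestAgent bool, namespace string) *VMI {
	opts := []Option{WithMasqueradeNetworking(), WithNamespace(namespace)}
	if withGuestAgent {
		return NewFedora(opts...)
	}
	return NewAlpine(opts...)
}

func main() {
	fmt.Printf("%+v\n", *buildVMI(true, "test-ns"))
}
```

With the options in one slice, adding a further option later touches a single line instead of both branches.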
tests/storage/snapshot.go
Outdated
| vm.Spec.Template.Spec.Domain.Devices.Disks = append(vm.Spec.Template.Spec.Domain.Devices.Disks, v1.Disk{ | ||
| Name: "blank", | ||
| DiskDevice: v1.DiskDevice{ | ||
| Disk: &v1.DiskTarget{ | ||
| Bus: v1.DiskBusVirtio, | ||
| }, | ||
| }, | ||
| }) | ||
| vm.Spec.Template.Spec.Volumes = append(vm.Spec.Template.Spec.Volumes, v1.Volume{ | ||
| Name: "blank", | ||
| VolumeSource: v1.VolumeSource{ | ||
| DataVolume: &v1.DataVolumeSource{ | ||
| Name: "dv-" + vm.Name, | ||
| }, | ||
| }, | ||
| }) |
Add that DV to the vmi first (with libvmi) and then create a VM from it?
tests/storage/snapshot.go
Outdated
| }, | ||
| }) | ||
|
|
||
| vm = libvmi.NewVirtualMachine(vmi) |
You call this func twice, also in L577
tests/storage/snapshot.go
Outdated
| Entry("[test_id:7000] with guest-agent", true), | ||
| Entry("[test_id:7001] no guest-agent", false), |
Provide the libvmifact function here instead of a boolean?
+1
@0xFelix, I tried to do:
Entry("[test_id:7000] with guest-agent", libvmifact.NewFedora),
Entry("[test_id:7001] without guest-agent", libvmifact.NewAlpine),
instead of a boolean,
but because I need this condition to wait for the guest agent:
if imageFunc == libvmifact.NewFedora {
Eventually(matcher.ThisVMI(vmi), 12*time.Minute, 2*time.Second).
Should(matcher.HaveConditionTrue(v1.VirtualMachineInstanceAgentConnected))
}
I cannot compare a function like imageFunc == libvmifact.NewFedora, so it seems I have to pass a boolean anyway?
I'm wondering how to handle this if statement without the boolean.
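For context, Go only allows a function value to be compared against nil, so the comparison above does not compile. One possible way around it, sketched here with hypothetical stand-in types rather than the real libvmifact factories or Ginkgo entries, is to let each table entry carry both the factory and an explicit flag:

```go
package main

import "fmt"

// VMI is a hypothetical stand-in for the object returned by the factories.
type VMI struct{ Image string }

// factory mirrors the idea of passing libvmifact.NewFedora / NewAlpine around.
type factory func() *VMI

func newFedora() *VMI { return &VMI{Image: "fedora"} }
func newAlpine() *VMI { return &VMI{Image: "alpine"} }

// snapshotCase pairs the factory with the fact the test needs to branch on.
// The flag is required because Go function values cannot be compared
// with == (except against nil).
type snapshotCase struct {
	name         string
	newVMI       factory
	waitForAgent bool
}

// cases lists the table entries; test IDs are copied from the thread.
func cases() []snapshotCase {
	return []snapshotCase{
		{name: "[test_id:7000] with guest-agent", newVMI: newFedora, waitForAgent: true},
		{name: "[test_id:7001] without guest-agent", newVMI: newAlpine, waitForAgent: false},
	}
}

// run builds the VMI and reports which path the test body would take.
func run(c snapshotCase) string {
	vmi := c.newVMI()
	if c.waitForAgent {
		return vmi.Image + ": wait for agent"
	}
	return vmi.Image + ": no wait"
}

func main() {
	for _, c := range cases() {
		fmt.Println(c.name, "->", run(c))
	}
}
```

This keeps the factory as the reviewer asked while still giving the test body something it can branch on.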
| // GET_AGENT - According to libvirt engineers this command shouldn't be used | ||
| // by KubeVirt, because it provides irrelevant information (version and supported commands). | ||
| func executeAgentCommands(commands []AgentCommand, agentPoller *AgentPoller) { | ||
| dom, err := agentPoller.Connection.LookupDomainByName(agentPoller.domainName) |
Is it correct that these changes are there because virt-launcher's agent poller is spamming qemu-guest-agent command failures when the VM is paused? If so, can you put them into a separate commit please?
Is there a way to test this behavior?
Yes, i will make a separate commit for this, and add relevant tests to it :)
nice addition to the PR, thanks
ShellyKa13
left a comment
Looks good! A couple of comments. I would also appreciate it if you divided the PR into several commits:
1. snapshot source change
2. manager freeze/unfreeze changes + ut
3. lifecycle change + ut
4. extra change for running agent commands + ut
5. functional test
|
|
||
| } | ||
|
|
||
| func (app *SubresourceAPIApp) UnpauseVMIRequestHandler(request *restful.Request, response *restful.Response) { |
should add this case to the subresource unit test
Don't we also need to prevent pausing during a snapshot? It might fail, but should we even pass it along if it will just fail?
The more I think about it, the more conflicted I am about doing anything special in the pause/unpause subresource, since a VMI is not only paused/unpaused by users; it may also happen when an internal error occurs, like a disk I/O error.
@mhenriks I think we should at least handle cases of unpause done by users..
Yeah I guess unpause is fine
pkg/virt-api/rest/lifecycle.go
Outdated
| writeError(statusErr, response) | ||
| return | ||
| } | ||
| if vm.Status.SnapshotInProgress != nil { |
I wonder if we want to add such a check in every state-change subresource. I assume it is possible that the subresource will be used directly without modifying the VM YAML (where we deny with the admitter). @mhenriks wdyt?
Hmm, as far as I know, the subresources usually update something in the status and then a controller applies the change to the spec. When that happens the webhook will be invoked, no?
I verified with Cursor :) and, as I thought, in these cases the subresource call itself doesn't change the VM/VMI, so the webhook is not called; the subresource passes the request directly to virt-handler, which passes it right away through the gRPC connection to virt-launcher and on to libvirt.
far be it from me to question the AI but I'm pretty sure not "every state change subresource" is processed like that. That may be the case for pause because that is fundamentally an operation on the VMI
@mhenriks technically any VM state change can come from virtctl (and I would assume the UI also), bypassing the VM/VMI YAML change by using the subresource directly. I think in such a case the cases mentioned in the commit should be avoided if possible. We already do that at the webhook level, so why not also at the subresource level?
Can you be more specific about what subresource calls? I know that VM start/stop subresource updates the VM spec/status which should invoke the validating webhook
right I see you are right, they call the subresource directly and the subresource patches the vm with start and stop. So I guess only pause/unpause does not go through the yaml
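Since pause/unpause is the path that bypasses the webhook, the guard under discussion has to live in the subresource handler itself. The sketch below shows only the shape of such a check. `VM`, `VMStatus`, and `checkUnpauseAllowed` are hypothetical names invented for this example, and the pointer-to-string type of `SnapshotInProgress` is an assumption; the real handler writes its error through the REST response rather than returning it.

```go
package main

import "fmt"

// VMStatus and VM are simplified, hypothetical stand-ins. The only field that
// matters here is SnapshotInProgress, assumed to be nil when no snapshot runs.
type VMStatus struct {
	SnapshotInProgress *string
}

type VM struct {
	Name   string
	Status VMStatus
}

// checkUnpauseAllowed is a hypothetical helper: it rejects the request while
// a snapshot is in progress and allows it otherwise.
func checkUnpauseAllowed(vm *VM) error {
	if vm.Status.SnapshotInProgress != nil {
		return fmt.Errorf("cannot unpause VM %q: snapshot %q is in progress",
			vm.Name, *vm.Status.SnapshotInProgress)
	}
	return nil
}

func main() {
	snap := "snap-1"
	busy := &VM{Name: "vm1", Status: VMStatus{SnapshotInProgress: &snap}}
	idle := &VM{Name: "vm2"}

	fmt.Println(checkUnpauseAllowed(busy))
	fmt.Println(checkUnpauseAllowed(idle))
}
```

Keeping the check in a small function like this would also make it easy to cover in the subresource unit test requested earlier in the thread.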
| if paused, err := util.DomainIsPaused(dom); err != nil { | ||
| log.Log.Errorf("cannot determine domain state: %v", err) | ||
| return | ||
| } else if paused { |
no need for the else
just do if paused
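One detail worth spelling out: in the snippet under review, `paused` is declared in the `if` statement's init clause, so it is only in scope inside that if/else chain. Dropping the `else` therefore means hoisting the declaration first. A minimal sketch, where `domainIsPaused` is a hypothetical stand-in for `util.DomainIsPaused` that takes a state string instead of a libvirt domain:

```go
package main

import (
	"errors"
	"fmt"
)

// domainIsPaused is a hypothetical stand-in for util.DomainIsPaused.
func domainIsPaused(state string) (bool, error) {
	switch state {
	case "paused":
		return true, nil
	case "running":
		return false, nil
	default:
		return false, errors.New("unknown domain state: " + state)
	}
}

// shouldPoll uses early returns instead of an if/else-if chain.
func shouldPoll(state string) bool {
	// Declared outside the if statement so that it stays in scope below.
	paused, err := domainIsPaused(state)
	if err != nil {
		fmt.Printf("cannot determine domain state: %v\n", err)
		return false
	}
	if paused {
		// Guest agent commands cannot run while the domain is paused.
		return false
	}
	return true
}

func main() {
	for _, s := range []string{"running", "paused", "bogus"} {
		fmt.Println(s, shouldPoll(s))
	}
}
```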
| } | ||
| defer domain.Free() | ||
|
|
||
| if paused, err := util.DomainIsPaused(domain); err != nil { |
Since you added this function, you can also use it in other places that check whether the domain is paused, like:
UnpauseVMI
pkg/storage/snapshot/source.go
Outdated
|
|
||
| online := exists | ||
|
|
||
| running := exists && s.vm.Status.PrintableStatus == kubevirtv1.VirtualMachineStatusRunning |
@mhenriks what do you think about using PrintableStatus? It's not being relied upon for anything big, just whether or not to print the error message, but usually we prefer not to rely on it.
It's probably okay for Running but I wouldn't rely on it for anything else
maybe it would be best to just check the VMI conditions instead
| } | ||
| defer dom.Free() | ||
|
|
||
| if paused, err := util.DomainIsPaused(dom); err != nil { |
Why did you add these changes only to executeAgentCommands()? Because of the relevant QEMU commands in this function?
Thanks for the response!
I added it because the agent poller's executeAgentCommands spams the logs with QEMU guest agent command failures when the VM is paused (the QEMU guest agent cannot execute commands if the VM is not running). Do you think we need to add it to other places?
There is also fetchAndStoreGuestInfo() that is using libvirt API to get relevant info instead of QEMU directly as an abstraction above it.
It seems that since fetchAndStoreGuestInfo() doesn't rely on QEMU guest agent commands, it's acceptable to retrieve information via the libvirt API even when the VM is paused.
actually, i take it back, you are right.
domain.GetGuestInfo(infoTypes, 0) will also fail when the VM is paused, as these messages are also being spammed
{"component":"virt-launcher","level":"info","msg":"Polling API operations: 1","pos":"agent_poller.go:417","timestamp":"2025-06-29T18:45:50.316563Z"}
{"component":"virt-launcher","level":"error","msg":"Fetching guest info failed: virError(Code=55, Domain=10, Message='Requested operation is not valid: domain is not running')","pos":"agent_poller.go:432","times
i will work on a fix that will cover both fetchAndStoreGuestInfo and executeAgentCommands
sorry for the confusion :)
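One way a single fix could cover both fetchAndStoreGuestInfo and executeAgentCommands is to gate them behind a shared paused check. This is only a sketch under stated assumptions: `runUnlessPaused` and `pausedChecker` are hypothetical names made up for this example, and the closures stand in for the two real polling functions; the actual change in the PR may be structured differently.

```go
package main

import "fmt"

// pausedChecker reports whether the domain is currently paused. It is a
// hypothetical abstraction over a helper such as util.DomainIsPaused.
type pausedChecker func() (bool, error)

// runUnlessPaused runs every operation only when the domain is known to be
// not paused. It returns whether the operations ran.
func runUnlessPaused(isPaused pausedChecker, ops ...func()) (bool, error) {
	paused, err := isPaused()
	if err != nil {
		return false, fmt.Errorf("cannot determine domain state: %w", err)
	}
	if paused {
		// Skip quietly: guest agent calls would fail with
		// "domain is not running" and fill the log.
		return false, nil
	}
	for _, op := range ops {
		op()
	}
	return true, nil
}

func main() {
	fetchGuestInfo := func() { fmt.Println("fetching guest info") }
	execCommands := func() { fmt.Println("executing agent commands") }

	alwaysPaused := func() (bool, error) { return true, nil }
	ran, err := runUnlessPaused(alwaysPaused, fetchGuestInfo, execCommands)
	fmt.Println(ran, err)
}
```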
c0cf59f to
680daf9
Compare
@noamasu can you supply some documentation for this? I don't think buffers are flushed when a VM is paused. Consider that a VM may be paused automatically by an I/O error |
pkg/storage/snapshot/source.go
Outdated
|
|
||
| s.state = &sourceState{ | ||
| online: online, | ||
| running: running, |
can we just have online be false if the VM is paused and not add a new state?
asked another way, why can't we simply treat a paused VM as offline?
|
Hey @mhenriks, thank you for the feedback.
after looking into this - sorry for this inaccuracy, you are right, pause will keep memory unflushed - risking a non-stable snapshot of the filesystem. looks like just flushing the suspended vm is also not possible since you need a running VM to do so. indeed snapshot of a paused VM is not safe, much like a running VM (memory not flushed to disk)... knowing that's the case what do you think is the best way to deal with this as I see 2 options here:
|
@noamasu I think this is the better option, may even want to create a new indication for this case |
680daf9 to
44c254b
Compare
ec76470 to
d0da165
Compare
|
I think this is looking pretty good, what are your thoughts @ShellyKa13? |
ShellyKa13
left a comment
Couple of suggestions
pkg/storage/snapshot/snapshot.go
Outdated
| } | ||
|
|
||
| // Check for paused VM | ||
| if source.Paused() { |
What do you say about checking this before the guest agent and, in the paused case, not adding the guest agent indication since it doesn't affect the snapshot? i.e.
if source.Paused() {
...
} else if source.GuestAgent() {
...
} else {
...
}
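The ordering proposed above can be sketched as a small pure function. This illustrates the reviewer's suggestion only, not necessarily the code that was merged. The indication strings are the ones visible in the diff; `indicationFor` and the shortened constant names are hypothetical.

```go
package main

import "fmt"

// Indication mirrors the string-typed indication seen in the diff.
type Indication string

const (
	NoGuestAgentIndication Indication = "NoGuestAgent"
	GuestAgentIndication   Indication = "GuestAgent"
	PausedIndication       Indication = "Paused"
)

// indicationFor checks the paused state first: when the VM is paused, the
// guest agent plays no part in the snapshot, so no agent indication is added.
func indicationFor(paused, guestAgent bool) Indication {
	switch {
	case paused:
		return PausedIndication
	case guestAgent:
		return GuestAgentIndication
	default:
		return NoGuestAgentIndication
	}
}

func main() {
	fmt.Println(indicationFor(true, true))
	fmt.Println(indicationFor(false, true))
	fmt.Println(indicationFor(false, false))
}
```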
pkg/storage/snapshot/source.go
Outdated
| return nil | ||
| } | ||
|
|
||
| if s.Paused() { |
why not add it to the first if:
if !s.Locked() || !s.GuestAgent() || s.Paused() {
| VMSnapshotNoGuestAgentIndication Indication = "NoGuestAgent" | ||
| VMSnapshotGuestAgentIndication Indication = "GuestAgent" | ||
| VMSnapshotQuiesceFailedIndication Indication = "QuiesceFailed" | ||
| VMSnapshotPausedIndication Indication = "Paused" |
This should probably be in the previous commit, or move the indications part from the previous commit to this commit.
35e2247 to
ef98e10
Compare
pkg/storage/snapshot/source.go
Outdated
| return nil | ||
| } | ||
|
|
||
| if s.Paused() { |
probably should first check the paused condition before the guestagent one
tests/storage/snapshot.go
Outdated
| return updatedVMI.Status.FSFreezeStatus | ||
| }, 30*time.Second, 2*time.Second).Should(BeEmpty()) | ||
| }) | ||
| DescribeTable("should succeed snapshot when VM is paused with Paused indication", |
missing new line before this block
tests/storage/snapshot.go
Outdated
| checkOnlineSnapshotExpectedContentSource(vm, contentName, true) | ||
| }, | ||
|
|
||
| Entry("[test_id:7000] with guest-agent", libvmifact.NewFedora, true), |
not sure if we really need to run this test twice for with or without guest agent.. I think with guest agent is enough
|
|
||
| if source.GuestAgent() { | ||
| if source.Paused() { | ||
| indications = sets.Insert(indications, snapshotv1.VMSnapshotPausedIndication) |
what about a unit test for this?
pkg/virt-api/rest/lifecycle.go
Outdated
| writeError(statusErr, response) | ||
| return | ||
| } | ||
| if vm.Status.SnapshotInProgress != nil { |
I don't think we can do this in unfreeze, since the snapshot is still in progress when you call unfreeze to complete the snapshot. There is also an automatic unfreeze mechanism that we use to prevent the VM from staying frozen if something happens to the snapshot and for some reason it doesn't unfreeze.
pkg/virt-api/rest/lifecycle.go
Outdated
| } | ||
| } | ||
| _, err := app.fetchVirtualMachine(name, namespace) | ||
| vm, err := app.fetchVirtualMachine(name, namespace) |
Also in this case I'm not sure we want to prevent the migrate VM call. I believe the snapshot will just fail if the VM was migrated in the process, and if not, I don't think it will cause data change anyway.
ef98e10 to
4d6d98b
Compare
|
/lgtm |
|
Pull requests that are marked with After that period the bot marks them with the label /label needs-approver-review |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mhenriks The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Required labels detected, running phase 2 presubmits: |
|
/remove-label needs-approver-review |
|
I think we should backport it; the relevant issue #10759 was opened several versions ago |
|
/cherry-pick release-1.6 |
|
@noamasu: only kubevirt org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-1.6 |
|
@ShellyKa13: new pull request created: #15385 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-1.5 |
|
@ShellyKa13: #15001 failed to apply on top of branch "release-1.5": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Manual backport of kubevirt#15001

Validates vm.Status.SnapshotInProgress and refuses operations if it's non-nil, preventing VMs from being modified mid-snapshot. This ensures snapshot consistency by preventing state changes during the snapshot process, which could lead to undefined behavior or inconsistent snapshots.

Operations protected:
- UnpauseVMIRequestHandler

Signed-off-by: Noam Assouline <[email protected]>
Manual backport of kubevirt#15001

Introduces VMSnapshotPausedIndication to provide programmatic indication when snapshots are taken of paused VMs. This allows users and tooling to identify snapshots that may have consistency issues due to unflushed memory buffers.

The indication appears in the VirtualMachineSnapshot status, consistent with existing indications like QuiesceFailed and NoGuestAgent.

Includes functional tests to verify the indication is properly set.

Signed-off-by: Noam Assouline <[email protected]>
VM snapshot with qemu-guest-agent fails when the domain is paused, because virt-controller sends a freeze/unfreeze request during snapshotting. This freeze/unfreeze is handled by virt-launcher, but it currently fails on paused VMs. Since paused VMs have a consistent filesystem state (similar to frozen), this change makes virt-launcher treat freeze as successful when the domain is paused. This allows snapshotting to continue safely.
What this PR does
Before this PR:
After this PR:
vm.Status.SnapshotInProgress is not nil.
References
Fixes #10759
Why we need it and why it was done in this way
The following tradeoffs were made:
The following alternatives were considered:
Links to places where the discussion took place:
Special notes for your reviewer
@ShellyKa13
@codingben (check out the change for agent_poller.go)
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note