Add VM in unhealthy state alerts runbooks #300

Merged
sradco merged 1 commit into kubevirt:main from sradco:add_runbook_for_vm_stuck_in_status_alerts
Aug 24, 2025

sradco commented Jul 18, 2025

What this PR does / why we need it:
This PR adds alert runbooks for the VirtualMachineStuckInUnhealthyState and VirtualMachineStuckOnNode alerts,
which were added in kubevirt/kubevirt#15227.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes #
https://issues.redhat.com/browse/CNV-49530

Special notes for your reviewer:

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:


kubevirt-bot added the "dco-signoff: yes" label (indicates the PR's author has DCO signed all their commits) on Jul 18, 2025

sradco commented Jul 20, 2025

@vladikr @enp0s3 @machadovilaca @avlitman please review the following runbooks and see if they are accurate.

sradco force-pushed the add_runbook_for_vm_stuck_in_status_alerts branch from df79bf0 to 88456dd on July 27, 2025 08:36

sradco commented Jul 28, 2025

Hi @0xFelix, thank you for the review. I updated the runbooks based on your comments. Can you please have another look?

0xFelix (Member) left a comment

Thank you for your efforts. I appreciate that you used LLMs to create these documents, but in my opinion, they are quite verbose, unspecific, and even incorrect in several places.

### 3. Check Image Availability (for containerDisk)
```bash
# If using containerDisk, test image accessibility
kubectl run test-pull --image=<vm-disk-image> --dry-run=client
```

Member

That looks wrong to me, what is it supposed to do exactly?

1. **Clear image cache** on the problematic node:
```bash
# On the node (requires node access):
crictl rmi <problematic-image>
```

Member

Same, not sure that always applies

Collaborator Author

Please review the updated version

sradco force-pushed the add_runbook_for_vm_stuck_in_status_alerts branch from 88456dd to 6cb0ab4 on August 10, 2025 09:02
sradco force-pushed the add_runbook_for_vm_stuck_in_status_alerts branch 5 times, most recently from 27b1776 to e2cf4cc on August 11, 2025 07:23
Comment on lines 86 to 97
# If using containerDisk, test image accessibility
docker pull <vm-disk-image>
Member

Checks it only locally though

Member

yeah, there could be local node issues.
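
One way to act on the two comments above is to attempt the pull on the affected node itself rather than on a workstation. The following is a minimal sketch and is not part of the PR: it assumes shell access to the node and that `crictl` is installed and configured there, and `check_image_pull` is a hypothetical helper name.

```bash
# Hypothetical helper (not from the PR): try to pull an image through the
# node's container runtime and report the result.
# Assumption: this runs on the affected node, where crictl is available.
check_image_pull() {
  img_ref="$1"
  if crictl pull "$img_ref" >/dev/null 2>&1; then
    echo "pull ok: $img_ref"
  else
    echo "pull failed: $img_ref"
    return 1
  fi
}
```

Usage on the node: `check_image_pull <vm-disk-image>`. A failure here while the same pull succeeds elsewhere would point at a node-local problem, which is the case raised in the comments.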


### 5. Review VM Specification
```bash
# Validate VM spec for common issues
```

Member

How do I know common issues?

kubectl get vm <vm-name> -n <namespace> \
-o jsonpath='{.spec.template.spec}'

# Check for resource requests/limits
Member

Why?


### 4. Verify KubeVirt Configuration
```bash
# Check KubeVirt installation status
```

Member

Maybe check if CR status is Available/Ready instead?
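
A minimal sketch of what such a check could look like, not taken from the PR: it assumes the KubeVirt CR is named `kubevirt` in the `kubevirt` namespace (both vary by installation) and that the CR reports an `Available` condition; `kubevirt_available` is a hypothetical helper name.

```bash
# Hypothetical helper (not from the PR): succeed only if the KubeVirt CR
# reports the Available condition with status "True".
# Assumptions: the CR name and namespace both default to "kubevirt".
kubevirt_available() {
  kv_ns="${1:-kubevirt}"
  kv_name="${2:-kubevirt}"
  kv_status="$(kubectl get kubevirt "$kv_name" -n "$kv_ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')"
  [ "$kv_status" = "True" ]
}
```

Usage: `kubevirt_available <namespace> <cr-name> && echo "KubeVirt is available"`. Checking a condition on the CR gives a single yes/no answer instead of requiring the reader to interpret a list of pods.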

vladikr commented Aug 12, 2025

Overall, this is a very nice, well-organized PR! I just had a few comments.

sradco force-pushed the add_runbook_for_vm_stuck_in_status_alerts branch from e2cf4cc to 0c24031 on August 17, 2025 09:58

sradco commented Aug 17, 2025

@vladikr @0xFelix @nunnatsa Thank you for your review.
I updated the runbooks. Please review again.

enp0s3 left a comment

@sradco Thanks for the detailed work! I have a general question: which user role is the alert intended for? I ask because I think that not all of the mitigation or prevention steps can be performed by a role lower than cluster-admin.

sradco force-pushed the add_runbook_for_vm_stuck_in_status_alerts branch from 0c24031 to 811476a on August 20, 2025 12:55

sradco commented Aug 20, 2025

Hi @enp0s3, @vladikr, @nunnatsa, @0xFelix, thank you again for the reviews. I updated the PR based on your suggestions.

sradco commented Aug 24, 2025

> @sradco Thanks for the detailed work! I have a general question: which user role is the alert intended for? I ask because I think that not all of the mitigation or prevention steps can be performed by a role lower than cluster-admin.

Thanks @enp0s3. Based on your question, I added a line to the Escalation section: "- You don't have enough permissions to run the diagnosis and/or mitigation steps."

sradco commented Aug 24, 2025

Thank you all. Your reviews were very important and much appreciated in making the runbooks accurate and meaningful!
I will go ahead and merge it to unblock the alert.
Please feel free to add additional comments or questions, and I will address them in a separate PR.

sradco merged commit 735c738 into kubevirt:main on Aug 24, 2025
2 checks passed
github-actions bot pushed a commit that referenced this pull request Aug 24, 2025
…vm_stuck_in_status_alerts

Add VM in unhealthy state alerts runbooks

Labels: dco-signoff: yes, size/XL

6 participants