
@machadovilaca
Member

@machadovilaca machadovilaca commented Jun 6, 2025

What this PR does

Before this PR:

No GuestAgentInfo metrics reported

After this PR:

GuestAgentInfo CPU load metrics reported

# HELP kubevirt_vmi_guest_load_15m Guest system load average over 15 minutes as reported by the guest agent. Load is defined as the number of processes in the runqueue or waiting for disk I/O.
# TYPE kubevirt_vmi_guest_load_15m gauge
kubevirt_vmi_guest_load_15m{kubernetes_vmi_label_kubevirt_io_domain="vm1",kubernetes_vmi_label_kubevirt_io_nodeName="minikube",name="vm1",namespace="default",node="minikube"} 0

# HELP kubevirt_vmi_guest_load_1m Guest system load average over 1 minute as reported by the guest agent. Load is defined as the number of processes in the runqueue or waiting for disk I/O.
# TYPE kubevirt_vmi_guest_load_1m gauge
kubevirt_vmi_guest_load_1m{kubernetes_vmi_label_kubevirt_io_domain="vm1",kubernetes_vmi_label_kubevirt_io_nodeName="minikube",name="vm1",namespace="default",node="minikube"} 0.0029296875

# HELP kubevirt_vmi_guest_load_5m Guest system load average over 5 minutes as reported by the guest agent. Load is defined as the number of processes in the runqueue or waiting for disk I/O.
# TYPE kubevirt_vmi_guest_load_5m gauge
kubevirt_vmi_guest_load_5m{kubernetes_vmi_label_kubevirt_io_domain="vm1",kubernetes_vmi_label_kubevirt_io_nodeName="minikube",name="vm1",namespace="default",node="minikube"} 0.01611328125
  • Fixes #

kubevirt/enhancements#67

jira-ticket: https://issues.redhat.com/browse/CNV-50883

References

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Add GuestAgentInfo info metrics

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jun 6, 2025
@kubevirt-bot kubevirt-bot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute sig/observability Denotes an issue or PR that relates to observability. size/XXL labels Jun 6, 2025
@sourcery-ai sourcery-ai bot left a comment

Hey @machadovilaca - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@machadovilaca machadovilaca force-pushed the add-guestloadinfo branch 2 times, most recently from 02231c6 to 9b1e901 Compare June 9, 2025 09:18
@machadovilaca
Member Author

/cc @enp0s3 @sradco

@kubevirt-bot kubevirt-bot requested review from enp0s3 and sradco June 16, 2025 10:10
@sradco
Contributor

sradco commented Jun 16, 2025

Hi, where did this request come from?
Is there a Jira item?

Guest hostname reported by the guest agent. The value is always 1. Type: Gauge.

### kubevirt_vmi_guest_load_15m
Guest system load average over 15 minutes as reported by the guest agent. Type: Gauge.
Contributor

What does 'load' mean in this case?

Member Author

average length of the CPU queue over the given period of time

Contributor

Please update this info in the metrics description

Member Author

guest system load is the more accurate description, as used by the kernel
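
To make the three load values concrete: on a Linux guest the kernel exposes the 1, 5, and 15 minute load averages as the first three fields of /proc/loadavg. The sketch below only illustrates what those values are; it is not how the guest agent or KubeVirt obtains them, and `LoadAverages` and `parseLoadAvg` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// LoadAverages holds the three load averages in the order Linux reports them.
// Hypothetical type, used only for this illustration.
type LoadAverages struct {
	Load1m, Load5m, Load15m float64
}

// parseLoadAvg parses the first three fields of a /proc/loadavg-style line,
// for example "0.25 0.50 0.75 1/123 4567".
func parseLoadAvg(line string) (LoadAverages, error) {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return LoadAverages{}, fmt.Errorf("expected at least 3 fields, got %d", len(fields))
	}
	var vals [3]float64
	for i := 0; i < 3; i++ {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return LoadAverages{}, fmt.Errorf("field %d: %w", i, err)
		}
		vals[i] = v
	}
	return LoadAverages{Load1m: vals[0], Load5m: vals[1], Load15m: vals[2]}, nil
}

func main() {
	la, err := parseLoadAvg("0.25 0.50 0.75 1/123 4567")
	if err != nil {
		panic(err)
	}
	fmt.Printf("1m=%.2f 5m=%.2f 15m=%.2f\n", la.Load1m, la.Load5m, la.Load15m)
}
```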

@machadovilaca
Member Author

Hi, where did this request come from? Is there a Jira item?

It was hidden in the comment section of the PR description again; I moved it out.

### kubevirt_vmi_filesystem_used_bytes
Used VM filesystem capacity in bytes. Type: Gauge.

### kubevirt_vmi_guest_hostname_info
Contributor

Why should this be a separate metric instead of being added to the kubevirt_vm/i_info metrics?

@kubevirt-bot kubevirt-bot added lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 17, 2025
@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 20, 2025
}

guestInfo := vmiReport.vmiStats.GuestAgentInfo

Contributor

Here we also need to check that guestInfo.Load isn't nil, since it's a pointer:

	// Load contains the system load averages (1M, 5M, 15M) from the guest agent
	Load *VirtualMachineInstanceGuestOSLoad `json:"load,omitempty"`
}

Member Author

Member

L87 checks if vmiReport.vmiStats.GuestAgentInfo is not nil, but it doesn't check if vmiReport.vmiStats.GuestAgentInfo.Load is not nil.

Member Author

you are right, missed it, updated
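
A minimal sketch of the check the reviewers asked for, assuming simplified stand-in types (the real KubeVirt structs have more fields, and `loadSamples` is a hypothetical helper): both the guest agent info and its `Load` pointer are verified before any value is read.

```go
package main

import "fmt"

// Hypothetical stand-ins for the types discussed in the review.
type GuestOSLoad struct {
	Load1m, Load5m, Load15m float64
}

type GuestAgentInfo struct {
	Hostname string
	Load     *GuestOSLoad // pointer: nil when the agent reports no load
}

type VMIStats struct {
	GuestAgentInfo *GuestAgentInfo
}

// loadSamples returns the load values to report, or false when either the
// guest agent info or its Load field is missing.
func loadSamples(stats *VMIStats) ([3]float64, bool) {
	if stats == nil || stats.GuestAgentInfo == nil || stats.GuestAgentInfo.Load == nil {
		return [3]float64{}, false
	}
	l := stats.GuestAgentInfo.Load
	return [3]float64{l.Load1m, l.Load5m, l.Load15m}, true
}

func main() {
	// GuestAgentInfo is set but Load is nil, so nothing should be reported.
	_, ok := loadSamples(&VMIStats{GuestAgentInfo: &GuestAgentInfo{Hostname: "vm1"}})
	fmt.Println("report load metrics:", ok)
}
```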

@enp0s3
Contributor

enp0s3 commented Jul 28, 2025

/approve
/hold

Let's add the tests and fix the Load field validation, then feel free to unhold

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 28, 2025
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enp0s3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2025
})

Context("Deep copy behavior", func() {
It("should store a deep copy of guest info in cache", func() {
Member

I'm not sure about the value of this test. DeepCopy is generated and could be tested without going through the cache, but I don't think we need to do that.

Member Author

removed

Comment on lines 65 to 70
scraper.mutex.Lock()
scraper.cache[socketFile] = &guestAgentInfoCache{
	timestamp: time.Now(),
	info:      guestInfo.DeepCopy(),
}
scraper.mutex.Unlock()
Member

nit: You could extract blocks like this one, which take the mutex, into helpers

Member Author

added

@machadovilaca machadovilaca force-pushed the add-guestloadinfo branch 2 times, most recently from 05987f1 to 9b02250 Compare July 28, 2025 13:24
Member

@0xFelix 0xFelix left a comment

Are the failures in lane sig-compute related?



By("checking if entry is within timeout")
cached, exists := getCacheEntry(socketFile)
withinTimeout := exists && time.Since(cached.timestamp) < cacheTimeout
Member

This does not verify controller logic? I'd like to see a test that verifies that the controller doesn't update if the cached value is still within the timeout range.

Member Author

I understand what you mean. The test was making the requests to the scraper cache,
but the validation had to be local because of the way the functions were built.
I refactored the scraper to allow us to test each function individually.


By("checking if entry is outside timeout")
cached, exists := getCacheEntry(socketFile)
withinTimeout := exists && time.Since(cached.timestamp) < cacheTimeout
Member

Same, this doesn't test controller logic, but rather a condition in the test code. Was this generated by AI?

Member Author

Comment on lines 145 to 98
It("should not clean up non-expired cache entries", func() {
	By("adding a fresh cache entry")
	addCacheEntry(socketFile, time.Now(), guestInfo)

	By("calling cleanup immediately")
	scraper.cleanupExpiredCache()

	By("verifying cache entry still exists")
	exists := cacheEntryExists(socketFile)
	Expect(exists).To(BeTrue())
})
Member

This case is a duplicate of the case above?

Member Author

removed

// Load average over 1 minute
Load1m float64 `json:"load1m,omitempty"`
// Load5mSet indicates whether the 5 minute load average is set
Load5mSet bool `json:"load5mSet,omitempty"`
Member

What does set mean? Why can't the actual field be a ptr instead?

Member Author

We use the same format as libvirt.DomainGuestInfoLoad, which exposes it that way.

We could still convert it to a pointer, but every other domain stat with this pattern
uses the Set flag, so I think it is better to keep it consistent.

Member

Can you point to an example?

Member Author

Member

That's an internal API, though, which brings me back to my concern that we shouldn't litter the external API too much. Didn't we want to add this to the metrics endpoint provided by virt-launcher?

Member

I see, this struct isn't used in the VMI's status.
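
The two representations discussed in this thread carry the same information: a value with a companion "Set" flag, or a pointer that is nil when unset. Either is needed because a load of 0 is a valid reading. The sketch below shows the conversion; `loadWithSetFlags` and `optional` are hypothetical names, and the struct only loosely mirrors the fields quoted above.

```go
package main

import "fmt"

// loadWithSetFlags mirrors the value-plus-flag layout discussed above.
// Hypothetical and trimmed to two load values for brevity.
type loadWithSetFlags struct {
	Load1mSet bool
	Load1m    float64
	Load5mSet bool
	Load5m    float64
}

// optional converts a value/flag pair into a pointer that is nil when unset.
func optional(v float64, set bool) *float64 {
	if !set {
		return nil
	}
	return &v
}

func main() {
	l := loadWithSetFlags{Load1mSet: true, Load1m: 0}

	// Load1m is set even though its value is 0; Load5m was never set.
	fmt.Println("1m reported:", optional(l.Load1m, l.Load1mSet) != nil)
	fmt.Println("5m reported:", optional(l.Load5m, l.Load5mSet) != nil)
}
```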

@machadovilaca machadovilaca force-pushed the add-guestloadinfo branch 2 times, most recently from 566d98f to 9d407d7 Compare July 30, 2025 17:49
@machadovilaca
Member Author

/retest

@machadovilaca
Member Author

Are the failures in lane sig-compute related?

@0xFelix it was related, fixed

func (d *GuestAgentInfoScraper) getCacheEntry(socketFile string) (*VirtualMachineInstanceStats, time.Time, bool) {
now := time.Now()

d.mutex.RLock()
Member

You call RLock but then you write to d.cache?

Member Author

true, it does not make sense to have a RW mutex I think

By("checking if entry is outside timeout")
_, timestamp, exists := scraper.getCacheEntry(socketFile)
Expect(exists).To(BeFalse())
Expect(time.Since(timestamp)).To(BeNumerically(">", cacheTimeout))
Member

timestamp should be an empty time.Time{}?

Member Author

can you clarify?

Member Author

got it, updated

Comment on lines 119 to 121
if vmStats.GuestAgentInfo != nil && vmStats.GuestAgentInfo.Hostname != "" {
	d.addCacheEntry(socketFile, time.Now(), vmStats.GuestAgentInfo)
}
Member

Can you add a unit test for this behavior?

Member Author

added


Signed-off-by: João Vilaça <[email protected]>
@kubevirt-bot
Contributor

@machadovilaca: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubevirt-verify-rpms | 92b67bb | link | false | /test pull-kubevirt-verify-rpms |
| pull-kubevirt-check-tests-for-flakes | 1f0dfc1 | link | false | /test pull-kubevirt-check-tests-for-flakes |


Member

@0xFelix 0xFelix left a comment

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 1, 2025
@kubevirt-commenter-bot

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-k8s-1.31-windows2016
/test pull-kubevirt-e2e-kind-1.33-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.33-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.31-sig-network
/test pull-kubevirt-e2e-k8s-1.31-sig-storage
/test pull-kubevirt-e2e-k8s-1.31-sig-compute
/test pull-kubevirt-e2e-k8s-1.31-sig-operator
/test pull-kubevirt-e2e-k8s-1.32-sig-network
/test pull-kubevirt-e2e-k8s-1.32-sig-storage
/test pull-kubevirt-e2e-k8s-1.32-sig-compute
/test pull-kubevirt-e2e-k8s-1.32-sig-operator

@machadovilaca
Member Author

/unhold

/cc @enp0s3

@kubevirt-bot kubevirt-bot requested a review from enp0s3 August 1, 2025 13:43
@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 1, 2025
@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@kubevirt-bot kubevirt-bot merged commit 544a8a2 into kubevirt:main Aug 1, 2025
41 of 42 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/launcher area/monitoring dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute sig/observability Denotes an issue or PR that relates to observability. size/XL


9 participants