
Conversation

@mhenriks
Member

@mhenriks mhenriks commented May 19, 2025

A VM with <= 2G memory will get 8 total PCI ports by default and at least 3 guaranteed free hotplug ports

A VM with > 2G memory will get 16 total PCI ports by default and at least 6 guaranteed free hotplug ports

See #14754 (comment) for a detailed rationale for this scheme, stated much more clearly than I ever could
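
For illustration only, here is a minimal sketch of the rule described above; the constant and function names are hypothetical and are not the identifiers used in this PR:

const (
	smallVMTotalPorts    = 8
	smallVMMinFreePorts  = 3
	largeVMTotalPorts    = 16
	largeVMMinFreePorts  = 6
	largeMemoryThreshold = 2 * 1024 * 1024 * 1024 // 2Gi, in bytes
)

// freeHotplugPorts sketches how many free hotplug ports a VM ends up with:
// take the default total for its size class, subtract the cold-plugged
// devices, and never go below the guaranteed minimum.
func freeHotplugPorts(memoryBytes int64, coldPluggedDevices int) int {
	total, minFree := smallVMTotalPorts, smallVMMinFreePorts
	if memoryBytes > largeMemoryThreshold {
		total, minFree = largeVMTotalPorts, largeVMMinFreePorts
	}
	free := total - coldPluggedDevices
	if free < minFree {
		free = minFree
	}
	return free
}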

What this PR does

Before this PR:

After this PR:

Fixes # https://issues.redhat.com/browse/CNV-57873

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Allocate more PCI ports for hotplug 

@kubevirt-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels May 19, 2025
@kubevirt-bot kubevirt-bot requested a review from Barakmor1 May 19, 2025 02:41
@kubevirt-bot kubevirt-bot requested a review from fossedihelm May 19, 2025 02:41
@mhenriks
Member Author

cc @EdDev

@Acedus
Contributor

Acedus commented May 19, 2025

/cc

@kubevirt-bot kubevirt-bot requested a review from Acedus May 19, 2025 14:35
}

func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
interfacesDecleared := len(vmi.Spec.Domain.Devices.Interfaces)
Member

nit: typo

@stefanha stefanha left a comment

@berrange ^ you might have thoughts on PCI hotplug port policies.

}

func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
interfacesDecleared := len(vmi.Spec.Domain.Devices.Interfaces)


s/Decleared/Declared/

logger := log.Log.Object(vmi)
// old behavior for small VMs
portsToAllocate := getHotplugNetworkInterfaceCount(vmi)
pciDevsOnRoot := vmi.Annotations[v1.PlacePCIDevicesOnRootComplex] == "true"
Member

Who is supposed to set this annotation? @EdDev do you know?

Member

Yeah, it's the user

Member

#3054 is the backstory

Member

Well, then my question is whether we should move it to a proper API or deprecate it? That's out of scope for this PR, though.

Member

I would say that if we could rationalize all of this and craft a proper API, it would be preferred.

})

var _ = Describe("domain network interfaces resources", func() {
var _ = PDescribe("domain network interfaces resources", func() {
Member

Why skipping?

Member Author

Broken; will fix before taking this out of draft mode, once the implementation is OK'd

@alicefr
Member

alicefr commented May 20, 2025

@mhenriks unit tests would be a good idea

@berrange
Contributor

berrange commented May 20, 2025

| If VM has less than 2G memory, there is no difference in behavior, up to 3 additional ports will be added

| For VMs with more than 2G, the number of "cold plug" ports will be computed (usually around 5 or 6 for simple VMs) and that number will be subtracted from 32 to get the number of hotplug ports to be allocated. So if your VM is using 5 ports, 27 ports will be allocated

Mixing an "additional port count" with a "total port count" rule is going to lead to surprising behaviour

Consider two VMs

  • One with 32 cold plugged devices and 1.8GB of RAM
  • One with 32 cold plugged devices and 2.1GB of RAM

The former will get given 3 additional ports, while the latter won't get given any. This is unhelpful behaviour IMHO.

In the Q35 / PCIe world, there is nothing particularly special about the number 32, because libvirt should be using multi-function when adding pcie-root-ports - it'll fill up slot 1 with 7 functions, then fill up slot 2 with 7 functions, etc, etc. So you get 217 pcie-root-ports before it even thinks about needing to add extra buses.

The practical limit is more about performance of QEMU, EDK2 and Linux when mapping devices into memory.

I would suggest you could write a rule in a different way, such as:

  • Small VMs: 8 PCI root ports by default, with a minimum of 3 free
    • => VM with 4 cold plugged devices would get 8 ports
    • => VM with 11 cold plugged devices would get 13 ports
  • Large VMs: 32 PCI root ports by default, with a minimum of 6 free
    • => VM with 20 cold plugged devices would get 32 ports
    • => VM with 40 cold plugged devices would get 46 ports

Or there are many other variations you could come up with which don't trigger weird configurations as you cross the cliff-edge between small & large VM size.

| 32 was chosen to be compatible with legacy machine types

Is this referring to the old i440fx "PCI" based machine type ? If so, note that in practice the effective device limit was 30, because port 0 is reserved and port 1 is hardcoded to a built-in device.

IMHO you don't particularly need to go as high as matching 32 by default. I would say a rule of "16 by default with at least 6 free" might be a better tradeoff.

@berrange
Contributor

| If VM has less than 2G memory, there is no difference in behavior, up to 3 additional ports will be added

Also how was that 2 GB memory threshold decided upon ? Is there some benchmark / formula used to come up with that ? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.

@mhenriks
Member Author

Thanks for the feedback @berrange!

| IMHO you don't particularly need to go as high as matching 32 by default. I would say a rule of "16 by default with at least 6 free" might be a better tradeoff.

I like this

| Also how was that 2 GB memory threshold decided upon ? Is there some benchmark / formula used to come up with that ? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.

Pretty arbitrary. What do you think about 1G as the lower limit?

@berrange
Contributor

| Also how was that 2 GB memory threshold decided upon ? Is there some benchmark / formula used to come up with that ? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.

| Pretty arbitrary. What do you think about 1G as the lower limit?

For the sake of public record, here is an image & info I shared privately via an earlier email. It shows a QEMU guest with 2 GB of RAM, comparing the i440fx machine type against q35 with a varying number of pcie-root-ports attached.

[Image: q35-usage — chart of memory overhead vs. number of pcie-root-ports]

NB, two distinct y-scales there.

The yellow line shows QEMU memory overhead before the guest CPUS start, on right hand y-axis.

You can see a linear increase in QEMU resident memory as the number of pre-added pcie-root-port devices is increased. It works out at about 700 KB memory overhead per pcie-root-port before vCPUs start. This is the fixed overhead inside QEMU just for configuring the device.

IOW adding 32 pcie-root-ports consumes about 20 MB of extra host RAM

NB This overhead is extra QEMU memory overhead, not falling under the guest RAM size allocation.

20 MB is not very significant when you consider even "small" guest sizes to run a useful workload are measured in multiples of GB.

The purple and blue lines show the memory overhead after the guest has been running for 2 minutes, on the left hand y-axis.

IOW, this shows the extra dynamic runtime overhead in QEMU, the runtime overhead in EDK2, and the runtime overhead in the guest Linux kernel.

I observed that the peak overhead is worse than the steady state overhead, presumably some initialization tasks in linux/edk2 trigger some peaks, before releasing memory and getting back to a more steady state. So I wouldn't worry about the peak overhead, just the average (steady state) overhead.

Some of this extra runtime memory overhead will fall under the guest RAM size allocation. IOW it would not impose extra overhead on the host RAM utilization, but would instead reduce the memory that the guest OS can use for userspace apps. This is showing about 2.8 MB of overhead per pcie-root-port. I can't distinguish what amount falls under the guest RAM vs what amount is extra host RAM overhead, but I suspect most is attributable to the guest. IOW, of that 2.8 MB, let's assume 2 MB is guest overhead and 800 KB is host overhead.

So overall, in terms of host RAM utilization, you can bank on 20 MB usage per guest, with 32 pcie-root-ports pre-created. The guest will perhaps see 64 MB memory overhead in the guest kernel.

You can call this wasteful, but is it significant waste ? For a 1 GB guest, this 64 MB is a 6% "waste". For a 4 GB guest, this is a 1.5% "waste"

I picked 32 pcie-root-ports as this gives a match in terms of hotplug functionality between q35 and the old i440fx machine type. You don't have to pick 32 for kubevirt.

Bear in mind that every guest will already have some number of pcie-root-ports to support the existing cold plugged PCI devices. You could easily be using 5 (virtio-scsi, virtio-net, virtio-balloon, virtio-vga, virtio-rng) in a generic guest. So rather than consider the overhead of adding 32, consider the delta overhead from the current starting point (32 - 5 == 27).
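
As a quick back-of-the-envelope aid (purely illustrative; the per-port constants are just the approximate figures quoted above, not new measurements):

// Rough totals per the figures above: ~700 KB host-side QEMU overhead and
// an assumed ~2 MB guest-side overhead per pcie-root-port.
const (
	hostOverheadPerPortKB  = 700
	guestOverheadPerPortKB = 2048
)

func estimateOverheadMB(ports int) (hostMB, guestMB float64) {
	hostMB = float64(ports*hostOverheadPerPortKB) / 1024   // 32 ports -> ~22 MB host RAM
	guestMB = float64(ports*guestOverheadPerPortKB) / 1024 // 32 ports -> ~64 MB guest RAM
	return
}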

Enough of the background info, back to the question at hand for this PR.

If we consider the more conservative suggested rule:

  • Small VMs: 8 PCI root ports by default, with a minimum of 3 free
  • Large VMs: 16 PCI root ports by default, with a minimum of 6 free

Let's say 95% of guests will probably just have 5 devices present by default, so

  • Small VM rule -> 3 extra free slots
  • Large VM rule -> 11 extra free slots

Now consider the threshold >= 2GB (as this patch does)

So we have the following behaviour

  • For a 0.5 GB guest (small) -> ~1% guest RAM overhead for the 3 free slots
  • For a 1.0 GB guest (small) -> ~0.6% guest RAM overhead for the 3 free slots
  • For a 1.9 GB guest (small) -> ~0.3% guest RAM overhead for the 3 free slots
  • For a 2.0 GB guest (large) -> ~1% guest RAM overhead for the 11 free slots
  • For a 4.0 GB guest (large) -> ~0.5% guest RAM overhead for the 11 free slots

Now consider the lower >= 1 GB threshold

  • For a 0.5 GB guest (small) -> ~1% guest RAM overhead for the 3 free slots
  • For a 0.9 GB guest (small) -> ~0.6% guest RAM overhead for the 3 free slots
  • For a 1.0 GB guest (large) -> ~2% guest RAM overhead for the 11 free slots
  • For a 2.0 GB guest (large) -> ~1% guest RAM overhead for the 11 free slots
  • For a 4.0 GB guest (large) -> ~0.5% guest RAM overhead for the 11 free slots
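
The percentages in the two lists above follow from the assumed ~2 MB of guest-side overhead per free slot; a tiny illustrative sketch of that calculation:

// guestOverheadPercent reproduces the estimates above: free slots times the
// assumed ~2 MB guest-side overhead per slot, as a share of guest RAM.
func guestOverheadPercent(freeSlots int, guestRAMGB float64) float64 {
	overheadMB := float64(freeSlots) * 2.0
	return overheadMB / (guestRAMGB * 1024) * 100 // e.g. 11 slots, 1 GB guest -> ~2.1%
}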

There is obviously a jump in guest RAM usage % at the threshold, and the 2GB vs 1GB decision only affects VM sizes near the threshold.

Given that guests that are small are probably unlikely to have many extra PCI devices, I'd be inclined to stick with the 2 GB threshold and NOT the lower 1 GB.

I would also be inclined to make the threshold "> 2GB", not ">= 2GB".

@mhenriks mhenriks force-pushed the more-hotplug-ports branch from ab3019a to 9066417 Compare June 9, 2025 01:34
@mhenriks mhenriks changed the title Allocate up to 32 PCI ports for VirtIO hotplug (network/disk) Allocate up to more PCI ports for hotplug (network/disk) Jun 9, 2025
@mhenriks mhenriks changed the title Allocate up to more PCI ports for hotplug (network/disk) Allocate more PCI ports for hotplug network/disk Jun 9, 2025
@mhenriks mhenriks changed the title Allocate more PCI ports for hotplug network/disk Allocate more PCI ports for hotplug Jun 9, 2025
@mhenriks mhenriks marked this pull request as ready for review June 9, 2025 01:46
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2025
@kubevirt-bot kubevirt-bot requested a review from 0xFelix June 9, 2025 01:46
@sourcery-ai sourcery-ai bot left a comment

Hey @mhenriks - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


func PlacePCIDevicesOnRootComplex(spec *api.DomainSpec) (err error) {
assigner := newRootSlotAssigner()
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {

suggestion: Document semantics of iteratePCIAddresses

Please add a brief comment explaining how the callback filter works (handling nil, empty, and PCI-only cases) and the iteration order.

Suggested change
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {
/*
iteratePCIAddresses iterates over all device interfaces in the given DomainSpec in their defined order.
For each interface, it invokes the provided callback only if the address is nil, has an empty Type, or is of Type PCI.
Other address types are skipped and left unchanged.
The callback can modify and return a new address, or return an error to stop iteration.
*/
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {
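
As an aside (not part of the suggested change above), a hypothetical caller of such an iterator might look like the sketch below; the helper name and the "pci" type string are assumptions for illustration only:

// countAssignedPCIAddresses counts devices that already carry an explicit PCI
// address, leaving every address unchanged. Hypothetical usage sketch only.
func countAssignedPCIAddresses(spec *api.DomainSpec) (int, error) {
	count := 0
	err := iteratePCIAddresses(spec, func(address *api.Address) (*api.Address, error) {
		// The callback is invoked for nil, empty-type, or PCI-type addresses,
		// so filter for the populated PCI case here.
		if address != nil && address.Type == "pci" {
			count++
		}
		return address, nil // returning the address unmodified leaves the spec as-is
	})
	return count, err
}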

@mhenriks mhenriks marked this pull request as draft June 9, 2025 21:16
@mhenriks mhenriks force-pushed the more-hotplug-ports branch from 14b4a4f to e09ba52 Compare June 13, 2025 19:43
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 13, 2025
@Acedus
Contributor

Acedus commented Jun 15, 2025

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2025
@mhenriks
Member Author

/retest-required

@mhenriks
Member Author

@vladikr can you take another look?

defaultTotalPorts := hotplugDefaultTotalPorts
minFreePorts := hotplugMinRequiredFreePorts

if domainSpec.Memory.Value > hotplugLargeMemoryThreshold {
Member

I hope domainSpec.Memory can't be nil...

Member Author

It's not a pointer

@vladikr
Member

vladikr commented Jun 23, 2025

/approve
Thank you @mhenriks !

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2025
@kubevirt-commenter-bot

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.30-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.32-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.30-sig-network
/test pull-kubevirt-e2e-k8s-1.30-sig-storage
/test pull-kubevirt-e2e-k8s-1.30-sig-compute
/test pull-kubevirt-e2e-k8s-1.30-sig-operator
/test pull-kubevirt-e2e-k8s-1.31-sig-network
/test pull-kubevirt-e2e-k8s-1.31-sig-storage
/test pull-kubevirt-e2e-k8s-1.31-sig-compute
/test pull-kubevirt-e2e-k8s-1.31-sig-operator

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment

@kubevirt-commenter-bot

✋🧢

/hold

Dear @mhenriks

⚠️ this pull request exceeds the number of retests that are allowed per individual commit.

🔎 Please check that the changes you committed are fine and that there are no infrastructure issues present!

Details Checklist:

💬 How we calculate the number of retests: The number of retest comments are the number of /test or /retest comments after the latest commit only.

👌 After all issues have been resolved, you can remove the hold on this pull request by commenting /unhold on it.

🙇 Thank you, your friendly referee automation, on behalf of the @sig-buildsystem and the KubeVirt community!

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2025
@mhenriks
Member Author

/unhold

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@mhenriks
Member Author

/retest-required

@kubevirt-commenter-bot

✋🧢

/hold

Dear @mhenriks

⚠️ this pull request exceeds the number of retests that are allowed per individual commit.

🔎 Please check that the changes you committed are fine and that there are no infrastructure issues present!

Details Checklist:

💬 How we calculate the number of retests: The number of retest comments are the number of /test or /retest comments after the latest commit only.

👌 After all issues have been resolved, you can remove the hold on this pull request by commenting /unhold on it.

🙇 Thank you, your friendly referee automation, on behalf of the @sig-buildsystem and the KubeVirt community!

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@mhenriks
Member Author

/test pull-kubevirt-goveralls

@mhenriks
Member Author

/unhold

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@kubevirt-bot kubevirt-bot merged commit 5d6771a into kubevirt:main Jun 24, 2025
40 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/launcher dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/compute size/L
