
Conversation

@mhenriks
Member

@mhenriks mhenriks commented May 19, 2025

A VM with <= 2G memory will get 8 total PCI ports by default and at least 3 guaranteed free hotplug ports

A VM with > 2G memory will get 16 total PCI ports by default and at least 6 guaranteed free hotplug ports

See #14754 (comment) for a detailed rationale for this scheme, stated much more clearly than I ever could
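
For illustration only, here is a minimal sketch of the rule described above; the constant and function names are hypothetical and are not the identifiers used in this PR:

const (
	smallVMTotalPorts    = 8
	smallVMMinFreePorts  = 3
	largeVMTotalPorts    = 16
	largeVMMinFreePorts  = 6
	largeMemoryThreshold = 2 * 1024 * 1024 * 1024 // 2Gi, in bytes
)

// freeHotplugPorts sketches how many free hotplug ports a VM ends up with:
// take the default total for its size class, subtract the cold-plugged
// devices, and never go below the guaranteed minimum.
func freeHotplugPorts(memoryBytes int64, coldPluggedDevices int) int {
	total, minFree := smallVMTotalPorts, smallVMMinFreePorts
	if memoryBytes > largeMemoryThreshold {
		total, minFree = largeVMTotalPorts, largeVMMinFreePorts
	}
	free := total - coldPluggedDevices
	if free < minFree {
		free = minFree
	}
	return free
}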

What this PR does

Before this PR:

After this PR:

Fixes # https://issues.redhat.com/browse/CNV-57873

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

Allocate more PCI ports for hotplug 

@kubevirt-bot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels May 19, 2025
@kubevirt-bot kubevirt-bot requested a review from Barakmor1 May 19, 2025 02:41
@kubevirt-bot kubevirt-bot requested a review from fossedihelm May 19, 2025 02:41
@mhenriks
Member Author

cc @EdDev

@Acedus
Contributor

Acedus commented May 19, 2025

/cc

@kubevirt-bot kubevirt-bot requested a review from Acedus May 19, 2025 14:35
}

func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
interfacesDecleared := len(vmi.Spec.Domain.Devices.Interfaces)
Member

nit: typo

@stefanha stefanha left a comment

@berrange ^ you might have thoughts on PCI hotplug port policies.

}

func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
interfacesDecleared := len(vmi.Spec.Domain.Devices.Interfaces)


s/Decleared/Declared/

logger := log.Log.Object(vmi)
// old behavior for small VMs
portsToAllocate := getHotplugNetworkInterfaceCount(vmi)
pciDevsOnRoot := vmi.Annotations[v1.PlacePCIDevicesOnRootComplex] == "true"
Member

Who is supposed to set this annotation? @EdDev do you know?

Member

Yeah, it's the user

Member

#3054 is the backstory

Member

Well, then my question is whether we should move it to a proper API or deprecate it? That's out of scope for this PR, though.

Member

I would say that if we could rationalize all of this and craft a proper API, it would be preferred.

})

var _ = Describe("domain network interfaces resources", func() {
var _ = PDescribe("domain network interfaces resources", func() {
Member

Why skipping?

Member Author

Broken; will fix before taking this out of draft mode, once the implementation is OK'd

@alicefr
Member

alicefr commented May 20, 2025

@mhenriks unit tests would be a good idea

@berrange
Contributor

berrange commented May 20, 2025

| If VM has less than 2G memory, there is no difference in behavior, up to 3 additional ports will be added

| For VMs with more than 2G, the number of "cold plug" ports will be computed (usually around 5 or 6 for simple VMs) and that number will be subtracted from 32 to get the number of hotplug ports to be allocated. So if your VM is using 5 ports, 27 ports will be allocated

Mixing an "additional port count" with a "total port count" rule is going to lead to surprising behaviour

Consider two VMs

  • One with 32 cold plugged devices and 1.8GB of RAM
  • One with 32 cold plugged devices and 2.1GB of RAM

The former will get given 3 additional ports, while the latter won't get given any. This is unhelpful behaviour IMHO.

In the Q35 / PCIe world, there is nothing particularly special about the number 32, because libvirt should be using multi-function when adding pcie-root-ports - it'll fill up slot 1 with 7 functions, then fill up slot 2 with 7 functions, etc, etc. So you get 217 pcie-root-ports before it even thinks about needing to add extra buses.

The practical limit is more about performance of QEMU, EDK2 and Linux when mapping devices into memory.

I would suggest you could write a rule in a different way, such as:

  • Small VMs: 8 PCI root ports by default, with a minimum of 3 free
    • => VM with 4 cold plugged devices would get 8 ports
    • => VM with 11 cold plugged devices would get 13 ports
  • Large VMs: 32 PCI root ports by default, with a minimum of 6 free
    • => VM with 20 cold plugged devices would get 32 ports
    • => VM with 40 cold plugged devices would get 46 ports

Or there are many other variations you could come up with which don't trigger weird configurations as you cross the cliff-edge between small & large VM size.

| 32 was chosen to be compatible with legacy machine types

Is this referring to the old i440fx "PCI" based machine type ? If so, note that in practice the effective device limit was 30, because port 0 is reserved and port 1 is hardcoded to a built-in device.

IMHO you don't particularly need to go as high as matching 32 by default. I would say a rule of "16 by default with at least 6 free" might be a better tradeoff.

@berrange
Contributor

| If VM has less than 2G memory, there is no difference in behavior, up to 3 additional ports will be added

Also how was that 2 GB memory threshold decided upon ? Is there some benchmark / formula used to come up with that ? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.

@mhenriks
Member Author

Thanks for the feedback @berrange!

| IMHO you don't particularly need to go as high as matching 32 by default. I would say a rule of "16 by default with at least 6 free" might be a better tradeoff.

I like this

| Also how was that 2 GB memory threshold decided upon ? Is there some benchmark / formula used to come up with that ? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.

Pretty arbitrary. What do you think about 1G as the lower limit?

@berrange
Contributor

| Also how was that 2 GB memory threshold decided upon ? Is there some benchmark / formula used to come up with that ? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.

| Pretty arbitrary. What do you think about 1G as the lower limit?

For the sake of public record, here is an image & info I shared privately via an earlier email. It shows a QEMU guest with 2 GB of RAM, comparing the i440fx machine type against q35 with a varying number of pcie-root-ports attached.

[Image: q35-usage — chart of memory overhead vs. number of pcie-root-ports]

NB, two distinct y-scales there.

The yellow line shows QEMU memory overhead before the guest CPUS start, on right hand y-axis.

You can see a linear increase in QEMU resident memory as the number of pre-added pcie-root-port devices is increased. It works out at about 700 KB memory overhead per pcie-root-port before vCPUs start. This is the fixed overhead inside QEMU just for configuring the device.

IOW adding 32 pcie-root-ports consumes about 20 MB of extra host RAM

NB This overhead is extra QEMU memory overhead, not falling under the guest RAM size allocation.

20 MB is not very significant when you consider even "small" guest sizes to run a useful workload are measured in multiples of GB.

The purple and blue lines show the memory overhead after the guest has been running for 2 minutes, on the left hand y-axis.

IOW, this shows the extra dynamic runtime overhead in QEMU, the runtime overhead in EDK2, and the runtime overhead in the guest Linux kernel.

I observed that the peak overhead is worse than the steady state overhead, presumably some initialization tasks in linux/edk2 trigger some peaks, before releasing memory and getting back to a more steady state. So I wouldn't worry about the peak overhead, just the average (steady state) overhead.

Some of this extra runtime memory overhead will fall under the guest RAM size allocation. IOW it would not impose extra overhead on the host RAM utilization, but would instead reduce the memory that the guest OS can use for userspace apps. This is showing about 2.8 MB of overhead per pcie-root-port. I can't distinguish what amount falls under the guest RAM vs what amount is extra host RAM overhead, but I suspect most is attributable to the guest. IOW, of that 2.8 MB, let's assume 2 MB is guest overhead and 800 KB is host overhead.

So overall, in terms of host RAM utilization, you can bank on 20 MB usage per guest, with 32 pcie-root-ports pre-created. The guest will perhaps see 64 MB memory overhead in the guest kernel.

You can call this wasteful, but is it significant waste ? For a 1 GB guest, this 64 MB is a 6% "waste". For a 4 GB guest, this is a 1.5% "waste"

I picked 32 pcie-root-ports as this gives a match in terms of hotplug functionality between q35 and the old i440fx machine type. You don't have to pick 32 for kubevirt.

Bear in mind that every guest will already have some number of pcie-root-ports to support the existing cold plugged PCI devices. You could easily be using 5 (virtio-scsi, virtio-net, virtio-balloon, virtio-vga, virtio-rng) in a generic guest. So rather than consider the overhead of adding 32, consider the delta overhead from the current starting point (32 - 5 == 27).
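
As a quick back-of-the-envelope aid (purely illustrative; the per-port constants are just the approximate figures quoted above, not new measurements):

// Rough totals per the figures above: ~700 KB host-side QEMU overhead and
// an assumed ~2 MB guest-side overhead per pcie-root-port.
const (
	hostOverheadPerPortKB  = 700
	guestOverheadPerPortKB = 2048
)

func estimateOverheadMB(ports int) (hostMB, guestMB float64) {
	hostMB = float64(ports*hostOverheadPerPortKB) / 1024   // 32 ports -> ~22 MB host RAM
	guestMB = float64(ports*guestOverheadPerPortKB) / 1024 // 32 ports -> ~64 MB guest RAM
	return
}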

Enough of the background info, back to the question at hand for this PR.

If we consider the more conservative suggested rule:

  • Small VMs: 8 PCI root ports by default, with a minimum of 3 free
  • Large VMs: 16 PCI root ports by default, with a minimum of 6 free

Let's say 95% of guests will probably just have 5 devices present by default, so

  • Small VM rule -> 3 extra free slots
  • Large VM rule -> 11 extra free slots

Now consider the threshold >= 2GB (as this patch does)

So we have the following behaviour

  • For a 0.5 GB guest (small) -> ~1% guest RAM overhead for the 3 free slots
  • For a 1.0 GB guest (small) -> ~0.6% guest RAM overhead for the 3 free slots
  • For a 1.9 GB guest (small) -> ~0.3% guest RAM overhead for the 3 free slots
  • For a 2.0 GB guest (large) -> ~1% guest RAM overhead for the 11 free slots
  • For a 4.0 GB guest (large) -> ~0.5% guest RAM overhead for the 11 free slots

Now consider the lower >= 1 GB threshold

  • For a 0.5 GB guest (small) -> ~1% guest RAM overhead for the 3 free slots
  • For a 0.9 GB guest (small) -> ~0.6% guest RAM overhead for the 3 free slots
  • For a 1.0 GB guest (large) -> ~2% guest RAM overhead for the 11 free slots
  • For a 2.0 GB guest (large) -> ~1% guest RAM overhead for the 11 free slots
  • For a 4.0 GB guest (large) -> ~0.5% guest RAM overhead for the 11 free slots
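
The percentages in the two lists above follow from the assumed ~2 MB of guest-side overhead per free slot; a tiny illustrative sketch of that calculation:

// guestOverheadPercent reproduces the estimates above: free slots times the
// assumed ~2 MB guest-side overhead per slot, as a share of guest RAM.
func guestOverheadPercent(freeSlots int, guestRAMGB float64) float64 {
	overheadMB := float64(freeSlots) * 2.0
	return overheadMB / (guestRAMGB * 1024) * 100 // e.g. 11 slots, 1 GB guest -> ~2.1%
}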

There is obviously a jump in guest RAM usage % at the threshold, and the 2GB vs 1GB decision only affects VM sizes near the threshold.

Given that guests that are small are probably unlikely to have many extra PCI devices, I'd be inclined to stick with the 2 GB threshold and NOT the lower 1 GB.

I would also be inclined to make the threshold "> 2GB", not ">= 2GB".

@mhenriks mhenriks force-pushed the more-hotplug-ports branch from ab3019a to 9066417 Compare June 9, 2025 01:34
@mhenriks mhenriks changed the title Allocate up to 32 PCI ports for VirtIO hotplug (network/disk) Allocate up to more PCI ports for hotplug (network/disk) Jun 9, 2025
@mhenriks mhenriks changed the title Allocate up to more PCI ports for hotplug (network/disk) Allocate more PCI ports for hotplug network/disk Jun 9, 2025
@mhenriks mhenriks changed the title Allocate more PCI ports for hotplug network/disk Allocate more PCI ports for hotplug Jun 9, 2025
@mhenriks mhenriks marked this pull request as ready for review June 9, 2025 01:46
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2025
@kubevirt-bot kubevirt-bot requested a review from 0xFelix June 9, 2025 01:46
@sourcery-ai sourcery-ai bot left a comment

Hey @mhenriks - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


func PlacePCIDevicesOnRootComplex(spec *api.DomainSpec) (err error) {
assigner := newRootSlotAssigner()
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {

suggestion: Document semantics of iteratePCIAddresses

Please add a brief comment explaining how the callback filter works (handling nil, empty, and PCI-only cases) and the iteration order.

Suggested change
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {
/*
iteratePCIAddresses iterates over all device interfaces in the given DomainSpec in their defined order.
For each interface, it invokes the provided callback only if the address is nil, has an empty Type, or is of Type PCI.
Other address types are skipped and left unchanged.
The callback can modify and return a new address, or return an error to stop iteration.
*/
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {
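
As an aside (not part of the suggested change above), a hypothetical caller of such an iterator might look like the sketch below; the helper name and the "pci" type string are assumptions for illustration only:

// countAssignedPCIAddresses counts devices that already carry an explicit PCI
// address, leaving every address unchanged. Hypothetical usage sketch only.
func countAssignedPCIAddresses(spec *api.DomainSpec) (int, error) {
	count := 0
	err := iteratePCIAddresses(spec, func(address *api.Address) (*api.Address, error) {
		// The callback is invoked for nil, empty-type, or PCI-type addresses,
		// so filter for the populated PCI case here.
		if address != nil && address.Type == "pci" {
			count++
		}
		return address, nil // returning the address unmodified leaves the spec as-is
	})
	return count, err
}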

@mhenriks mhenriks marked this pull request as draft June 9, 2025 21:16
@mhenriks mhenriks force-pushed the more-hotplug-ports branch from 14b4a4f to e09ba52 Compare June 13, 2025 19:43
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 13, 2025
@Acedus
Contributor

Acedus commented Jun 15, 2025

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2025
@mhenriks
Member Author

/retest-required

@mhenriks
Member Author

@vladikr can you take another look?

defaultTotalPorts := hotplugDefaultTotalPorts
minFreePorts := hotplugMinRequiredFreePorts

if domainSpec.Memory.Value > hotplugLargeMemoryThreshold {
Member

I hope domainSpec.Memory can't be nil...

Member Author

It's not a pointer

@vladikr
Member

vladikr commented Jun 23, 2025

/approve
Thank you @mhenriks !

@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2025
@kubevirt-commenter-bot

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.30-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.32-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.30-sig-network
/test pull-kubevirt-e2e-k8s-1.30-sig-storage
/test pull-kubevirt-e2e-k8s-1.30-sig-compute
/test pull-kubevirt-e2e-k8s-1.30-sig-operator
/test pull-kubevirt-e2e-k8s-1.31-sig-network
/test pull-kubevirt-e2e-k8s-1.31-sig-storage
/test pull-kubevirt-e2e-k8s-1.31-sig-compute
/test pull-kubevirt-e2e-k8s-1.31-sig-operator

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment

@kubevirt-commenter-bot

✋🧢

/hold

Dear @mhenriks

⚠️ this pull request exceeds the number of retests that are allowed per individual commit.

🔎 Please check that the changes you committed are fine and that there are no infrastructure issues present!

Details Checklist:

💬 How we calculate the number of retests: The number of retest comments are the number of /test or /retest comments after the latest commit only.

👌 After all issues have been resolved, you can remove the hold on this pull request by commenting /unhold on it.

🙇 Thank you, your friendly referee automation, on behalf of the @sig-buildsystem and the KubeVirt community!

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2025
@mhenriks
Member Author

/unhold

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@mhenriks
Member Author

/retest-required

@kubevirt-commenter-bot

✋🧢

/hold

Dear @mhenriks

⚠️ this pull request exceeds the number of retests that are allowed per individual commit.

🔎 Please check that the changes you committed are fine and that there are no infrastructure issues present!

Details Checklist:

💬 How we calculate the number of retests: The number of retest comments are the number of /test or /retest comments after the latest commit only.

👌 After all issues have been resolved, you can remove the hold on this pull request by commenting /unhold on it.

🙇 Thank you, your friendly referee automation, on behalf of the @sig-buildsystem and the KubeVirt community!

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@mhenriks
Member Author

/test pull-kubevirt-goveralls

@mhenriks
Member Author

/unhold

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@kubevirt-bot kubevirt-bot merged commit 5d6771a into kubevirt:main Jun 24, 2025
40 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/launcher dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/compute size/L
