Allocate more PCI ports for hotplug #14754
Conversation
Skipping CI for Draft Pull Request.

cc @EdDev

/cc
}

func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
	interfacesDecleared := len(vmi.Spec.Domain.Devices.Interfaces)
nit: typo
stefanha left a comment
@berrange ^ you might have thoughts on PCI hotplug port policies.
}

func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
	interfacesDecleared := len(vmi.Spec.Domain.Devices.Interfaces)
s/Decleared/Declared/
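For clarity, the visible fragment with the spelling fixed would read as below; the rest of the function body falls outside this diff hunk, so the return shown here is only a placeholder assumption:

```go
func getHotplugNetworkInterfaceCount(vmi *v1.VirtualMachineInstance) int {
	interfacesDeclared := len(vmi.Spec.Domain.Devices.Interfaces)
	// placeholder: the real function continues beyond this hunk
	return interfacesDeclared
}
```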
	logger := log.Log.Object(vmi)
	// old behavior for small VMs
	portsToAllocate := getHotplugNetworkInterfaceCount(vmi)
	pciDevsOnRoot := vmi.Annotations[v1.PlacePCIDevicesOnRootComplex] == "true"
Who is supposed to set this annotation? @EdDev do you know?
Pretty sure it's the user: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
Yeah, it's the user
#3054 is the backstory
Well, then my question is whether we should move it to a proper API or deprecate it? Although that's out of scope for this PR.
I would say that if we could rationalize all of this and craft a proper API, it would be preferred.
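For reference, a minimal sketch in Go of how a user opts in today via the annotation under discussion; the VMI name and the omitted spec fields are placeholders:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	v1 "kubevirt.io/api/core/v1"
)

// newRootComplexVMI returns a VMI that requests all PCI devices be placed
// on the root complex via the user-set annotation discussed above.
func newRootComplexVMI() *v1.VirtualMachineInstance {
	return &v1.VirtualMachineInstance{
		ObjectMeta: metav1.ObjectMeta{
			Name: "example-vmi", // placeholder name
			Annotations: map[string]string{
				v1.PlacePCIDevicesOnRootComplex: "true",
			},
		},
	}
}
```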
})

-var _ = Describe("domain network interfaces resources", func() {
+var _ = PDescribe("domain network interfaces resources", func() {
Why skipping?
Broken; will fix before taking this out of draft mode, once the implementation is OK'd.
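For readers unfamiliar with the convention: in Ginkgo, the P prefix marks a container as pending, so its specs are registered but skipped. A minimal sketch (the spec text inside is illustrative):

```go
import (
	. "github.com/onsi/ginkgo/v2"
)

// PDescribe marks the whole container as pending: its specs are
// registered but skipped, and the suite reports them as pending.
var _ = PDescribe("domain network interfaces resources", func() {
	It("allocates extra root ports", func() {
		// temporarily skipped while the implementation settles
	})
})
```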
@mhenriks unit tests would be a good idea
Mixing an "additional port count" with a "total port count" rule is going to lead to surprising behaviour. Consider two VMs:

The former will get given 3 additional ports, while the latter won't get given any. This is unhelpful behaviour IMHO.

In the Q35 / PCIe world, there is nothing particularly special about the number 32, because libvirt should be using multi-function when adding pcie-root-ports - it'll fill up slot 1 with 7 functions, then fill up slot 2 with 7 functions, etc, etc. So you get 217 pcie-root-ports before it even thinks about needing to add extra buses. The practical limit is more about the performance of QEMU, EDK2 and Linux when mapping devices into memory.

I would suggest you could write the rule in a different way, such as:

Or there are many other variations you could come up with, which don't trigger weird configurations as you cross the cliff-edge between small & large VM size.
Is this referring to the old i440fx "PCI" based machine type? If so, note that in practice the effective device limit was 30, because port 0 is reserved and port 1 is hardcoded to a built-in device. IMHO you don't particularly need to go as high as matching 32 by default. I would say a rule of "16 by default with at least 6 free" might be a better tradeoff.
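A sketch of that shape of rule in Go, using the figures quoted above (16 total by default, at least 6 free); the function and constant names are hypothetical:

```go
package main

import "fmt"

const (
	defaultTotalPorts = 16 // default total pcie-root-ports
	minFreePorts      = 6  // guaranteed free ports for hotplug
)

// portsToAllocate returns the number of pcie-root-ports to create for a VM
// that already needs `used` ports for cold-plugged PCI devices.
func portsToAllocate(used int) int {
	if used+minFreePorts > defaultTotalPorts {
		// grow past the default so the free-port guarantee still holds
		return used + minFreePorts
	}
	return defaultTotalPorts
}

func main() {
	fmt.Println(portsToAllocate(5))  // 16: default covers 5 used + 6 free
	fmt.Println(portsToAllocate(14)) // 20: the guarantee forces growth
}
```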
If VM has less than 2G memory, there is no difference in behavior, up to 3 additional ports will be added

Also, how was that 2 GB memory threshold decided upon? Is there some benchmark / formula used to come up with that? It would be wise to include this info in the commit for the future maintainers 5 years later trying to understand why this threshold exists.
Thanks for the feedback @berrange!
I like this
Pretty arbitrary. What do you think about 1G as the lower limit?
For the sake of public record, here is an image & info I shared privately via an earlier email. It shows a QEMU guest with 2 GB of RAM, comparing the i440fx machine type against q35 with a varying number of pcie-root-ports attached. NB, two distinct y-scales there.

The yellow line shows QEMU memory overhead before the guest CPUs start, on the right hand y-axis. You can see a linear increase in QEMU resident memory as the number of pre-added pcie-root-port devices is increased. It works out at about 700 KB memory overhead per pcie-root-port; IOW adding 32 of them costs roughly 20 MB. NB this overhead is extra QEMU memory overhead, not falling under the guest RAM size allocation. 20 MB is not very significant when you consider even "small" guest sizes to run a useful workload are measured in multiples of GB.

The purple and blue lines show the memory overhead after the guest has been running for 2 minutes, on the left hand y-axis. IOW, this shows the extra dynamic runtime overhead once the guest OS has booted. I observed that the peak overhead is worse than the steady state overhead, presumably because some initialization tasks in the guest kernel temporarily consume extra memory. Some of this extra runtime memory overhead will be falling under the guest RAM size allocation; IOW it would not be imposing extra overhead on the host RAM utilization, but instead reducing memory that the guest OS can use for userspace apps. This is showing about 2.8 MB of overhead per pcie-root-port.

So overall, in terms of host RAM utilization, you can bank on 20 MB usage per guest, with 32 pcie-root-ports pre-created. The guest will perhaps see 64 MB memory overhead in the guest kernel. You can call this wasteful, but is it significant waste? For a 1 GB guest, this 64 MB is a 6% "waste". For a 4 GB guest, this is a 1.5% "waste".

I picked 32 pcie-root-ports as the worst case for these measurements. Bear in mind that every guest will already have some number of pcie-root-ports to support the existing cold-plugged PCI devices. You could easily be using 5 of them by default.

Enough of the background info, back to the question at hand for this PR. If we consider the more conservative suggested rule:
Let's consider that 95% of guests will probably just have 5 devices present by default, so the default of 16 total ports already leaves 11 free for hotplug.
Now consider the threshold >= 2GB (as this patch does). So we have the following behaviour:

Now consider the lower >= 1 GB threshold:
There is obviously a jump in guest RAM usage % at the threshold, and the 2GB vs 1GB decision only affects VM sizes near the threshold. Given that small guests are probably unlikely to have many extra PCI devices, I'd be inclined to stick with the 2 GB threshold and NOT the lower 1 GB. I would also be inclined to make the threshold "> 2GB", not ">= 2GB".
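To quantify the cliff-edge: assuming the guest-kernel overhead scales linearly at roughly 2 MB per port (from the 64 MB for 32 ports figure above), 16 ports cost about 32 MB, so the worst-case jump at each candidate threshold is:

$$16 \times 2\,\mathrm{MB} \approx 32\,\mathrm{MB}, \qquad \frac{32\,\mathrm{MB}}{2\,\mathrm{GB}} \approx 1.6\%\ \text{(just above a 2 GB threshold)}, \qquad \frac{32\,\mathrm{MB}}{1\,\mathrm{GB}} \approx 3.1\%\ \text{(just above a 1 GB threshold)}$$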
Force-pushed ab3019a to 9066417
Hey @mhenriks - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
func PlacePCIDevicesOnRootComplex(spec *api.DomainSpec) (err error) {
	assigner := newRootSlotAssigner()

func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {
suggestion: Document semantics of iteratePCIAddresses
Please add a brief comment explaining how the callback filter works (handling nil, empty, and PCI-only cases) and the iteration order.
Suggested change:

/*
iteratePCIAddresses iterates over all device interfaces in the given DomainSpec in their defined order.
For each interface, it invokes the provided callback only if the address is nil, has an empty Type, or is of Type PCI.
Other address types are skipped and left unchanged.
The callback can modify and return a new address, or return an error to stop iteration.
*/
func iteratePCIAddresses(spec *api.DomainSpec, callback func(address *api.Address) (*api.Address, error)) (err error) {
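For illustration, a hypothetical in-package caller matching the documented semantics above; the helper name is made up, and returning the address unchanged means nothing is modified:

```go
// countPCIAddressableDevices counts devices whose address is nil, empty,
// or of Type PCI, leaving every address untouched.
func countPCIAddressableDevices(spec *api.DomainSpec) (int, error) {
	count := 0
	err := iteratePCIAddresses(spec, func(address *api.Address) (*api.Address, error) {
		count++
		return address, nil // returned unchanged, so nothing is modified
	})
	return count, err
}
```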
Signed-off-by: Michael Henriksen <[email protected]>
Signed-off-by: Michael Henriksen <[email protected]>
Force-pushed 14b4a4f to e09ba52
/lgtm

/retest-required

@vladikr can you take another look?
	defaultTotalPorts := hotplugDefaultTotalPorts
	minFreePorts := hotplugMinRequiredFreePorts

	if domainSpec.Memory.Value > hotplugLargeMemoryThreshold {
I hope domainSpec.Memory can't be nil...
It's not a pointer
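Putting the hunk in context, here is a sketch of the selection logic, using the constant names from the diff plus the values implied by the PR description (8 total / 3 free at or below the threshold, 16 / 6 above); the two "Large" constant names and the exact values are assumptions, not the PR's definitions:

```go
const (
	hotplugDefaultTotalPorts    = 8
	hotplugMinRequiredFreePorts = 3
	hotplugLargeTotalPorts      = 16                     // assumed name and value
	hotplugLargeMinFreePorts    = 6                      // assumed name and value
	hotplugLargeMemoryThreshold = 2 * 1024 * 1024 * 1024 // 2 GiB in bytes
)

// portsForDomain picks the port budget from the memory threshold, then
// grows it if needed so the free-port guarantee survives cold-plugged devices.
func portsForDomain(memoryBytes uint64, coldPluggedPCIDevs int) int {
	defaultTotalPorts := hotplugDefaultTotalPorts
	minFreePorts := hotplugMinRequiredFreePorts
	if memoryBytes > hotplugLargeMemoryThreshold {
		defaultTotalPorts = hotplugLargeTotalPorts
		minFreePorts = hotplugLargeMinFreePorts
	}
	if coldPluggedPCIDevs+minFreePorts > defaultTotalPorts {
		return coldPluggedPCIDevs + minFreePorts
	}
	return defaultTotalPorts
}
```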
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vladikr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Required labels detected, running phase 2 presubmits:
/retest-required

1 similar comment

/retest-required
✋🧢 /hold

Dear @mhenriks, 🔎 please check that the changes you committed are fine and that there are no infrastructure issues present!

👌 After all issues have been resolved, you can remove the hold on this pull request by commenting /unhold.

🙇 Thank you, your friendly referee automation, on behalf of the @sig-buildsystem and the KubeVirt community!
/unhold

/retest-required
✋🧢 /hold

Dear @mhenriks, 🔎 please check that the changes you committed are fine and that there are no infrastructure issues present!

👌 After all issues have been resolved, you can remove the hold on this pull request by commenting /unhold.

🙇 Thank you, your friendly referee automation, on behalf of the @sig-buildsystem and the KubeVirt community!
/test pull-kubevirt-goveralls

/unhold
A VM with <= 2G memory will get 8 total PCI ports by default and at least 3 guaranteed free hotplug ports.
A VM with > 2G memory will get 16 total PCI ports by default and at least 6 guaranteed free hotplug ports.
See #14754 (comment) for a detailed rationale for this scheme, stated much more clearly than I ever could.
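A worked example, assuming 5 cold-plugged PCI devices (the typical default count mentioned in the review) and that the total grows only when needed to preserve the free-port guarantee:

$$\max(8,\ 5+3) = 8\ \text{ports for a}\ \le 2\,\mathrm{G}\ \text{VM}, \qquad \max(16,\ 5+6) = 16\ \text{ports for a}\ > 2\,\mathrm{G}\ \text{VM}$$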
What this PR does
Before this PR:
After this PR:
Fixes: https://issues.redhat.com/browse/CNV-57873
Why we need it and why it was done in this way
The following tradeoffs were made:
The following alternatives were considered:
Links to places where the discussion took place:
Special notes for your reviewer
Checklist
This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.
Release note