feat: Use PCI devices as the base for discovery #194
Conversation
aojea left a comment
I do not understand this proposal well or what problem it solves; virtual interfaces are in scope and should keep working as they do today
I prefer to use the device filesystem to detect the physical interfaces, since the path is stable across namespaces and allows dranet to avoid going into the network namespaces. Consider that with the OCI spec changes, runc can move network interfaces, and we may drop all root capabilities for dranet. I did a test on a local machine and it works fine.
my apologies, I didn't realize PCI was tried in #183 (comment); I commented there, but I'm bringing the comment to this PR for visibility. I think we do not need to overcomplicate this:
We want physical devices to be stable and not disappear from the ResourceSlice, so we need to discover them from the PCI bus. A gemini query tells me we can use the class for that; I double-checked and indeed 0x02 is for network controllers: https://sigops.acm.illinois.edu/old/roll_your_own/7.c.1.html. In my example it works: it finds the 3 network devices, even though one of them is already in a namespace. The Go code should be something like the sketch below. This can also be used to detect the InfiniBand devices @vsoch is suggesting in the other PR. The key is to deduplicate these devices from the ones obtained via the netlink list, but that sounds simple: we already know from an interface name whether it is physical, so we just don't process it in that loop.
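The Go snippet referenced above was not captured in this thread; as a rough sketch of the idea (assuming only the standard sysfs layout, nothing dranet-specific), scanning the PCI bus for network controllers could look like this:

```go
// Sketch: find PCI network controllers (base class 0x02) via sysfs.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	const pciDevices = "/sys/bus/pci/devices"
	entries, err := os.ReadDir(pciDevices)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(pciDevices, e.Name(), "class"))
		if err != nil {
			continue
		}
		// The class file holds e.g. "0x020000": base class 0x02
		// (network controller), then subclass and prog-if.
		class, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 0, 32)
		if err != nil {
			continue
		}
		if class>>16 == 0x02 {
			// e.Name() is the PCI address, e.g. "0000:00:04.0".
			fmt.Println("network controller:", e.Name())
		}
	}
}
```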
Thanks for the review. Allow me to clarify the two main ideas; I agree the first can be changed, but I'll need some additional thought about the second before I make further modifications.

**1. PCI Address vs Permanent MAC as the device name? (VERDICT: Agree to change to PCI Address)**

As you have already pointed out previously, network interface names (like eth0, eth1, etc.) are not permanent. When publishing a device in a ResourceSlice, we need some identifier which remains reasonably permanent so that we can consistently map it to the correct network interface. The "permanent MAC address" (NOT the same as the temporary MAC address) and the "PCI device" seem like two reasonable candidates. I had chosen the permanent MAC address because I assumed it's possible for multiple network interfaces to share the same PCI address, while the permanent MAC address will always be unique. But I don't have a real-life example of this, so I'm happy to use the PCI address as the "stable device name in ResourceSlice". (Yes, the PCI address as the primary identifier is what PR #183 was about.)

**2. Exec into other network namespaces to gather information (NEEDS DISCUSSION)**

If DraNet running on a Node restarts for any reason, it should be able to rebuild the ResourceSlice for that node. For all the network interfaces which are not allocated to some Pod, this works out naturally, since we can re-list the network interfaces and PCI devices and get all the information. But this does not work for network interfaces which are already allocated to some Pod. The missing part is that, while the PCI device information (and its attributes/characteristics) is still available in the host's network namespace, you do NOT have the network-related information (like the permanent MAC address) for interfaces which have been moved to another network namespace. At least one immediate reason why this is important: if you can't figure out the permanent MAC address of the interface, you cannot map it to the device attributes from the GCE Metadata Server, and you end up with a ResourceSlice which is missing attributes for such devices.

As an example, consider a PCI network device "0000:c0:14.0":

```
$ ls /sys/devices/pci0000:c0/0000:c0:14.0
ari_enabled class consistent_dma_mask_bits device driver enable link local_cpus msi_bus net power remove resource resource1 revision subsystem_device uevent
broken_parity_status config d3cold_allowed dma_mask_bits driver_override irq local_cpulist modalias msi_irqs numa_node power_state rescan resource0 resource2 subsystem subsystem_vendor vendor
```

I can find the network interfaces associated with this, and details about the network interface:

```
$ ls /sys/devices/pci0000:c0/0000:c0:14.0/net
eth1
$ ls /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1
addr_assign_type broadcast carrier_down_count dev_port duplex ifalias link_mode napi_defer_hard_irqs phys_port_id power speed testing type
addr_len carrier carrier_up_count device flags ifindex mtu netdev_group phys_port_name proto_down statistics threaded uevent
address carrier_changes dev_id dormant gro_flush_timeout iflink name_assign_type operstate phys_switch_id queues subsystem tx_queue_len
$ cat /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address
06:4d:54:dc:1d:e8
```

But if I move the network interface to a network namespace, this information is no longer accessible from the host's network namespace. (It is not visible in sysfs AND also not available through netlink.)

```
$ sudo ip netns add temp-ns
$ ip netns
temp-ns
$ sudo ip link set eth1 netns temp-ns
# No interface info associated with the PCI device
$ ls /sys/devices/pci0000:c0/0000:c0:14.0/net/ -l
total 0
$ cat /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address
cat: '/sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address': No such file or directory
```

This means that we cannot map the PCI device to its GCE Metadata information without some additional network information, like the permanent MAC address. For instance, these are the fields from GCE Metadata, one of which we have to use to map our PCI device to its metadata (the values below have been randomized from the actual source):

```json
{
  "accessConfigs": [
    {
      "externalIp": "137.186.202.64",
      "type": "ONE_TO_ONE_NAT"
    }
  ],
  "dnsServers": [
    "226.128.54.133"
  ],
  "forwardedIps": [],
  "gateway": "234.165.66.132",
  "ip": "2.63.165.114",
  "ipAliases": [
    "172.78.227.94/14"
  ],
  "mac": "06:4d:54:dc:1d:e8",
  "mtu": 1457,
  "network": "projects/826694234596/networks/wxisbtewlo-vpc",
  "nicType": "GVNIC",
  "physicalNicId": "1",
  "subnetmask": "254.224.192.252",
  "targetInstanceIps": []
}
```

This means that if we really have to rebuild the ResourceSlice properly, we will have to fetch the network interface information from the individual network namespaces. E.g.:

```
# Compare this to the previous output, where this file was no longer present in the host's network namespace
$ sudo ip netns exec temp-ns cat /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address
06:4d:54:dc:1d:e8
```
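To illustrate the kind of per-namespace lookup this would require, here is a minimal Go sketch, assuming the vishvananda/netns and vishvananda/netlink libraries; this is not code from the PR, and the namespace name `temp-ns` is just the one from the example above:

```go
// Sketch: list link attributes inside a named network namespace
// without shelling out to `ip netns exec`.
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

func main() {
	// Open the namespace created with `ip netns add temp-ns`.
	// (Requires enough privilege to read /var/run/netns.)
	ns, err := netns.GetFromName("temp-ns")
	if err != nil {
		panic(err)
	}
	defer ns.Close()

	// A netlink handle bound to that namespace; the calling
	// goroutine itself stays in the host namespace.
	handle, err := netlink.NewHandleAt(ns)
	if err != nil {
		panic(err)
	}
	defer handle.Close()

	links, err := handle.LinkList()
	if err != nil {
		panic(err)
	}
	for _, link := range links {
		attrs := link.Attrs()
		// Note: this is the runtime MAC; the *permanent* MAC would
		// need ethtool or the IFLA_PERM_ADDRESS netlink attribute.
		fmt.Printf("%s: mac=%s mtu=%d\n", attrs.Name, attrs.HardwareAddr, attrs.MTU)
	}
}
```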
this is what gemini suggested:

```go
package inventory

import (
	"os"
	"path/filepath"

	"github.com/vishvananda/netlink"
	"k8s.io/klog/v2"
)

// DiscoveredInterface represents a fully discovered network device,
// merging PCI and netlink info.
type DiscoveredInterface struct {
	PCIDevice // From your sysfs.go

	InterfaceName   string
	MACAddress      string
	InHostNamespace bool // Flag to indicate if it was found via netlink
	// ... other netlink attributes
}

func DiscoverAndMerge() (map[string]*DiscoveredInterface, error) {
	// --- Step 1: Get all physical PCI network devices (from your existing code) ---
	pciDevices, err := DiscoverPCIDevices()
	if err != nil {
		return nil, err
	}

	// Create a map for easy lookup, keyed by PCI address.
	mergedInterfaces := make(map[string]*DiscoveredInterface)
	for _, dev := range pciDevices {
		// Only consider network controllers.
		if dev.Class == 0x020000 {
			mergedInterfaces[dev.Address] = &DiscoveredInterface{
				PCIDevice:       *dev,
				InHostNamespace: false, // Assume it's not in the host NS until proven otherwise
			}
		}
	}

	// --- Step 2: Get all interfaces from the host's netlink view ---
	links, err := netlink.LinkList()
	if err != nil {
		return nil, err
	}
	for _, link := range links {
		// For each link, find its PCI address by resolving the sysfs symlink:
		// /sys/class/net/eth0/device -> ../../../0000:00:04.0
		devicePath := filepath.Join("/sys/class/net", link.Attrs().Name, "device")
		realPath, err := os.Readlink(devicePath)
		if err != nil {
			// This is expected for virtual devices (docker0, etc.), so we just skip them.
			continue
		}
		pciAddress := filepath.Base(realPath) // Extracts "0000:00:04.0"

		// --- Step 3: Merge the data ---
		if discoveredDev, ok := mergedInterfaces[pciAddress]; ok {
			// We found a matching PCI device! Update it with netlink info.
			klog.Infof("Matching host interface %s with PCI device %s", link.Attrs().Name, pciAddress)
			discoveredDev.InterfaceName = link.Attrs().Name
			discoveredDev.MACAddress = link.Attrs().HardwareAddr.String()
			discoveredDev.InHostNamespace = true
		}
	}

	return mergedInterfaces, nil
}
```
(In the hypothetical example above, both ens4 and ens5 are exposed by the same PCI device... perhaps some form of virtualization?) Again, I haven't found a real-life example of this, so we can cross that bridge when we come to it.
if the interface is attached to a pod, what is the value of adding the attributes? the system just needs to know the device is attached and cannot be allocated ... once it is released it will be "discoverable" again
Good question. This brings us to the Cluster Autoscaler integration we were discussing in #178. Essentially, suppose you initially published a ResourceSlice which had, let's say, an attribute like `gce.dra.net/rdmaType`, along with a ResourceClaimTemplate that selects on it. At this point, you have one pod scheduled on a Node using this ResourceClaimTemplate. Now let's say some new pods are created which use the same ResourceClaimTemplate and there are no available devices. The Cluster Autoscaler will try to recreate Nodes and their ResourceSlices (in simulation) to see which Node plus ResourceSlice would satisfy this claim; it does so based on some existing Node and its ResourceSlice. At this point, if that ResourceSlice is missing the attribute (because we removed it after failing to discover it, due to some pod claiming the network interface), the autoscaler will fail to scale up correctly, as it will conclude that no Node can satisfy the claim even if it scales up. The main expectation of DRA is that "ResourceSlices are expected to be somewhat constant, like CPU and Memory".
Does this also bring us back to the issue that we can assign the same device to multiple pods if we do not remove them from the ResourceSlice? There is no semantic for preventing allocation of a device multiple times, right?
Ah, so for confirmation: if we are talking about something similar to #139, then actually even if we remove the device from the ResourceSlice, we can still end up allocating multiple pods to the same device if all those pods use the exact same ResourceClaim. Yes, that does not have a solution yet, sadly! Some people could argue that this can be treated as a user misconfiguration: users should not share one ResourceClaim across multiple pods, and if they do, that will only work when the device under consideration is shareable. As long as one uses a ResourceClaimTemplate, this should not be a problem, since it will create a unique ResourceClaim for each pod.
Ah gotcha, makes sense!
there are some attributes that are dynamic, like IPs, MACs, ... and others that are static; RDMA is static and is the same regardless of namespace ... the autoscaler must only rely on static and well-defined attributes, and PCI passes those checks. The MRDMA type is something I do not understand why we want to publish; it's a GCE custom thing that does not make sense to expose, users just want an interface with RDMA
This is correct. In fact, the current autoscaler doesn't distinguish between the two kinds of attributes and ends up in an infinite scaling loop in the scenarios I raised here: https://kubernetes.slack.com/archives/C09R1LV8S/p1754006888690929
MRDMA is just one example; the same thing would apply to other attributes like the VPC name as well. In total, the network-interface-related attributes (`dra.net/alias`, `dra.net/ebpf`, `dra.net/encapsulation`, `dra.net/ifName`, `dra.net/ipv4`, `dra.net/mac`, `dra.net/mtu`, `gce.dra.net/networkName`, `gce.dra.net/networkProjectNumber`, `gce.dra.net/rdmaType`) won't be published if we don't exec into the network namespace.
Open questions
Right now we remove the device entirely; we are going to change that to just remove the attributes that are not discoverable from the host, but leave the device name, for two use cases:
I think this is a good next step to try, and it gives us the opportunity to iterate without breaking changes.
My thinking is that for physical network interfaces the device address is unambiguous for the user and the system, and if we surface the name in an attribute, that solves the UX. I think it is worth pursuing that path.
Thanks for the discussion. I find some comfort knowing that we both understand the choices we are making and their implications, and our idea is to incorporate things once we get user feedback -- this is perfectly reasonable to me! What you have in mind is exactly what PR #183 does. Since we've had lots of discussion here, I'll try to bring those changes into this PR. Also, just a heads-up that with this new logic, our KIND tests involving the "dummy" interface may not work directly, since dummy interfaces are not backed by a PCI device. We will figure out some alternative.
why would it not work? we are not going to break any behavior (#194 (comment)); I'm just saying we only do an additional listing, walking the PCI path and merging each physical device with its link attributes when it is in the host namespace, nothing else changes
So you are saying we still want those devices in the ResourceSlice. I'll keep them for now. Let's take this one step at a time.
you have to rebase
This change refactors the device discovery mechanism to use PCI devices as the fundamental unit of inventory. Network interface attributes are then discovered and associated with these PCI devices. Network interfaces which are not associated with a PCI device (like virtual interfaces) are added as their own devices.

Previously, network interfaces were the primary resource. However, when a network interface is moved into a pod's network namespace, it disappears from the host's view, causing the resource to be removed from the node's ResourceSlice. This could lead to race conditions and incorrect device availability information. By treating the PCI device as the stable, base resource, we ensure that the device remains visible in the ResourceSlice even when its associated network interface is moved.

This commit also introduces the `ghw` library for PCI device discovery and removes the manual sysfs parsing.

Fixes #178
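For context, a minimal sketch of what PCI discovery through `ghw` can look like (field names follow the jaypipes/ghw README; the actual discovery code in this change may structure things differently):

```go
// Sketch: list PCI network controllers using the ghw library.
package main

import (
	"fmt"

	"github.com/jaypipes/ghw"
)

func main() {
	pci, err := ghw.PCI()
	if err != nil {
		panic(err)
	}
	for _, device := range pci.Devices {
		// PCI base class "02" is "Network controller".
		if device.Class != nil && device.Class.ID == "02" {
			fmt.Printf("%s: %s %s (driver: %s)\n",
				device.Address, device.Vendor.Name, device.Product.Name, device.Driver)
		}
	}
}
```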
…Host and NetworkInterfaceConfigInPod
…them in tests
This is now required because not all devices have the exact same attributes.
E.g. PCI Devices whose network interface is moved to the pod's network namespace
don't have the network interface related attributes like:
- dra.net/alias:
- dra.net/ebpf:
- dra.net/encapsulation:
- dra.net/ifName:
- dra.net/ipv4:
- dra.net/mac:
- dra.net/mtu:
- gce.dra.net/networkName:
- gce.dra.net/networkProjectNumber:
- gce.dra.net/rdmaType
This means that the user needs to check for their existence if referring to them.
gauravkghildiyal left a comment
Thank you, updated and rebased.
michaelasp left a comment
This was a major change with lots of back and forth, glad to see we've come to a good conclusion. Approving :)
Updates the third-party dependencies in order for importing projects that have stricter dependency compliance restrictions.

For `github.com/jaypipes/pcidb`, uplifts it to the latest 1.1.0 release, which removed the archived `github.com/mitchellh/go-homedir` dependency.

For `howett.net/plist`, there hasn't been a tagged release of that repo since December 2023, so I manually `go get` the latest commit from January 2025 for that repository. `howett.net/plist` is used for MacOS support.

For `github.com/StackExchange/wmi`, I changed this to `github.com/yusufpapurcu/wmi`, which is the officially-supported fork of the `github.com/StackExchange/wmi` repo.

Closes Issue #323
Related Issue jaypipes/pcidb#36
Related google/dranet#194

Signed-off-by: Jay Pipes <[email protected]>