feat: Use PCI devices as the base for discovery #194
Conversation
aojea left a comment
I do not understand this proposal well or what problem it solves; virtual interfaces are in scope and should keep working as they do today
I prefer to use the device filesystem to detect the physical interfaces, since the path is stable across namespaces and allows dranet to avoid going into the network namespaces. Consider that with the OCI spec changes, runc can move network interfaces, and we may drop all root capabilities for dranet. I did a test on a local machine and it works fine.
my apologies, I didn't realize PCI was tried in #183 (comment); I commented there, but I'm bringing the comment to this PR for visibility. I think we do not need to overcomplicate this:
We want physical devices to be stable and not disappear from the ResourceSlice, so we need to discover them from the PCI bus. A gemini query tells me we can use the class for that; I double-checked and indeed 0x02 is for network controllers: https://sigops.acm.illinois.edu/old/roll_your_own/7.c.1.html. In my example it works: it finds the 3 network devices, even though one of them is already in a namespace. The Go code should be something like the sketch below. This can also be used to detect the InfiniBand devices @vsoch is suggesting in the other PR. The key is to deduplicate these devices from the ones obtained via the netlink list, but that sounds simple: we already know from an interface name whether it is physical, so we just don't process it in that loop.
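The Go snippet referenced above was not captured in this thread; as a rough sketch of the idea (assuming only the standard sysfs layout, nothing dranet-specific), scanning the PCI bus for network controllers could look like this:

```go
// Sketch: find PCI network controllers (base class 0x02) via sysfs.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	const pciDevices = "/sys/bus/pci/devices"
	entries, err := os.ReadDir(pciDevices)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(pciDevices, e.Name(), "class"))
		if err != nil {
			continue
		}
		// The class file holds e.g. "0x020000": base class 0x02
		// (network controller), then subclass and prog-if.
		class, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 0, 32)
		if err != nil {
			continue
		}
		if class>>16 == 0x02 {
			// e.Name() is the PCI address, e.g. "0000:00:04.0".
			fmt.Println("network controller:", e.Name())
		}
	}
}
```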
Thanks for the review. Allow me to clarify the two main ideas; I agree the first can be changed, but I'll need some additional thought about the second before I make further modifications.

**1. PCI Address vs Permanent MAC as the device name? (VERDICT: Agree to change to PCI Address)**

As you have already pointed out previously, network interface names (like eth0, eth1, etc.) are not permanent. When publishing a device in a ResourceSlice, we need some identifier which remains reasonably permanent so that we can consistently map it to the correct network interface. The "permanent MAC address" (NOT the same as the temporary MAC address) and the "PCI device" seem like two reasonable candidates. I had chosen the permanent MAC address because I assumed it's possible for multiple network interfaces to share the same PCI address, while the permanent MAC address will always be unique. But I don't have a real-life example of this, so I'm happy to use the PCI address as the "stable device name in ResourceSlice". (Yes, the PCI address as the primary identifier is what PR #183 was about.)

**2. Exec into other network namespaces to gather information (NEEDS DISCUSSION)**

If DraNet running on a Node restarts for any reason, it should be able to rebuild the ResourceSlice for that node. For all the network interfaces which are not allocated to some Pod, this works out naturally, since we can re-list the network interfaces and PCI devices and get all the information. But this does not work for network interfaces which are already allocated to some Pod. The missing part is that, while the PCI device information (and its attributes/characteristics) is still available in the host's network namespace, you do NOT have the network-related information (like the permanent MAC address) for interfaces which have been moved to another network namespace. At least one immediate reason why this is important: if you can't figure out the permanent MAC address of the interface, you cannot map it to the device attributes from the GCE Metadata Server, and you end up with a ResourceSlice which is missing attributes for such devices.

As an example, consider a PCI network device "0000:c0:14.0":

```
$ ls /sys/devices/pci0000:c0/0000:c0:14.0
ari_enabled class consistent_dma_mask_bits device driver enable link local_cpus msi_bus net power remove resource resource1 revision subsystem_device uevent
broken_parity_status config d3cold_allowed dma_mask_bits driver_override irq local_cpulist modalias msi_irqs numa_node power_state rescan resource0 resource2 subsystem subsystem_vendor vendor
```

I can find the network interfaces associated with this, and details about the network interface:

```
$ ls /sys/devices/pci0000:c0/0000:c0:14.0/net
eth1
$ ls /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1
addr_assign_type broadcast carrier_down_count dev_port duplex ifalias link_mode napi_defer_hard_irqs phys_port_id power speed testing type
addr_len carrier carrier_up_count device flags ifindex mtu netdev_group phys_port_name proto_down statistics threaded uevent
address carrier_changes dev_id dormant gro_flush_timeout iflink name_assign_type operstate phys_switch_id queues subsystem tx_queue_len
$ cat /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address
06:4d:54:dc:1d:e8
```

But if I move the network interface to a network namespace, this information is no longer accessible from the host's network namespace. (It is not visible in sysfs AND also not available through netlink.)

```
$ sudo ip netns add temp-ns
$ ip netns
temp-ns
$ sudo ip link set eth1 netns temp-ns
# No interface info associated with the PCI device
$ ls /sys/devices/pci0000:c0/0000:c0:14.0/net/ -l
total 0
$ cat /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address
cat: '/sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address': No such file or directory
```

This means that we cannot map the PCI device to its GCE Metadata information without some additional network information, like the permanent MAC address. For instance, these are the fields from GCE Metadata, one of which we have to use to map our PCI device to its metadata (the values below have been randomized from the actual source):

```json
{
  "accessConfigs": [
    {
      "externalIp": "137.186.202.64",
      "type": "ONE_TO_ONE_NAT"
    }
  ],
  "dnsServers": [
    "226.128.54.133"
  ],
  "forwardedIps": [],
  "gateway": "234.165.66.132",
  "ip": "2.63.165.114",
  "ipAliases": [
    "172.78.227.94/14"
  ],
  "mac": "06:4d:54:dc:1d:e8",
  "mtu": 1457,
  "network": "projects/826694234596/networks/wxisbtewlo-vpc",
  "nicType": "GVNIC",
  "physicalNicId": "1",
  "subnetmask": "254.224.192.252",
  "targetInstanceIps": []
}
```

This means that if we really have to rebuild the ResourceSlice properly, we will have to fetch the network interface information from the individual network namespaces. E.g.:

```
# Compare this to the previous output, where this file was no longer present in the host's network namespace
$ sudo ip netns exec temp-ns cat /sys/devices/pci0000:c0/0000:c0:14.0/net/eth1/address
06:4d:54:dc:1d:e8
```
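To illustrate the kind of per-namespace lookup this would require, here is a minimal Go sketch, assuming the vishvananda/netns and vishvananda/netlink libraries; this is not code from the PR, and the namespace name `temp-ns` is just the one from the example above:

```go
// Sketch: list link attributes inside a named network namespace
// without shelling out to `ip netns exec`.
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

func main() {
	// Open the namespace created with `ip netns add temp-ns`.
	// (Requires enough privilege to read /var/run/netns.)
	ns, err := netns.GetFromName("temp-ns")
	if err != nil {
		panic(err)
	}
	defer ns.Close()

	// A netlink handle bound to that namespace; the calling
	// goroutine itself stays in the host namespace.
	handle, err := netlink.NewHandleAt(ns)
	if err != nil {
		panic(err)
	}
	defer handle.Close()

	links, err := handle.LinkList()
	if err != nil {
		panic(err)
	}
	for _, link := range links {
		attrs := link.Attrs()
		// Note: this is the runtime MAC; the *permanent* MAC would
		// need ethtool or the IFLA_PERM_ADDRESS netlink attribute.
		fmt.Printf("%s: mac=%s mtu=%d\n", attrs.Name, attrs.HardwareAddr, attrs.MTU)
	}
}
```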
this is what gemini suggested:

```go
package inventory

import (
	"os"
	"path/filepath"

	"github.com/vishvananda/netlink"
	"k8s.io/klog/v2"
)

// DiscoveredInterface represents a fully discovered network device,
// merging PCI and netlink info.
type DiscoveredInterface struct {
	PCIDevice // From your sysfs.go

	InterfaceName   string
	MACAddress      string
	InHostNamespace bool // Flag to indicate if it was found via netlink
	// ... other netlink attributes
}

func DiscoverAndMerge() (map[string]*DiscoveredInterface, error) {
	// --- Step 1: Get all physical PCI network devices (from your existing code) ---
	pciDevices, err := DiscoverPCIDevices()
	if err != nil {
		return nil, err
	}

	// Create a map for easy lookup, keyed by PCI address.
	mergedInterfaces := make(map[string]*DiscoveredInterface)
	for _, dev := range pciDevices {
		// Only consider network controllers.
		if dev.Class == 0x020000 {
			mergedInterfaces[dev.Address] = &DiscoveredInterface{
				PCIDevice:       *dev,
				InHostNamespace: false, // Assume it's not in the host NS until proven otherwise
			}
		}
	}

	// --- Step 2: Get all interfaces from the host's netlink view ---
	links, err := netlink.LinkList()
	if err != nil {
		return nil, err
	}
	for _, link := range links {
		// For each link, find its PCI address by resolving the sysfs symlink:
		// /sys/class/net/eth0/device -> ../../../0000:00:04.0
		devicePath := filepath.Join("/sys/class/net", link.Attrs().Name, "device")
		realPath, err := os.Readlink(devicePath)
		if err != nil {
			// This is expected for virtual devices (docker0, etc.), so we just skip them.
			continue
		}
		pciAddress := filepath.Base(realPath) // Extracts "0000:00:04.0"

		// --- Step 3: Merge the data ---
		if discoveredDev, ok := mergedInterfaces[pciAddress]; ok {
			// We found a matching PCI device! Update it with netlink info.
			klog.Infof("Matching host interface %s with PCI device %s", link.Attrs().Name, pciAddress)
			discoveredDev.InterfaceName = link.Attrs().Name
			discoveredDev.MACAddress = link.Attrs().HardwareAddr.String()
			discoveredDev.InHostNamespace = true
		}
	}

	return mergedInterfaces, nil
}
```
(In the hypothetical example above, both ens4 and ens5 are exposed by the same PCI device... perhaps some form of virtualization?) Again, I haven't found a real-life example of this, so we can cross that bridge when we come to it.
if the interface is attached to a pod, what is the value of adding the attributes? the system just needs to know the device is attached and cannot be allocated ... once it is released it will be "discoverable" again
Good question. This brings us to the Cluster Autoscaler integration we were discussing in #178. Essentially, suppose you initially published a ResourceSlice which had, let's say, an attribute like `gce.dra.net/rdmaType`, along with a ResourceClaimTemplate that selects on it. At this point, you have one pod scheduled on a Node using this ResourceClaimTemplate. Now let's say some new pods are created which use the same ResourceClaimTemplate and there are no available devices. The Cluster Autoscaler will try to recreate Nodes and their ResourceSlices (in simulation) to see which Node plus ResourceSlice would satisfy this claim; it does so based on some existing Node and its ResourceSlice. At this point, if that ResourceSlice is missing the attribute (because we removed it after failing to discover it, due to some pod claiming the network interface), the autoscaler will fail to scale up correctly, as it will conclude that no Node can satisfy the claim even if it scales up. The main expectation of DRA is that "ResourceSlices are expected to be somewhat constant, like CPU and Memory".
Does this also bring us back to the issue that we can assign the same device to multiple pods if we do not remove them from the ResourceSlice? There is no semantic for preventing allocation of a device multiple times, right?
Ah, so for confirmation: if we are talking about something similar to #139, then actually even if we remove the device from the ResourceSlice, we can still end up allocating multiple pods to the same device if all those pods use the exact same ResourceClaim. Yes, that does not have a solution yet, sadly! Some people could argue that this can be treated as a user misconfiguration: users should not share one ResourceClaim across multiple pods, and if they do, that will only work when the device under consideration is shareable. As long as one uses a ResourceClaimTemplate, this should not be a problem, since it will create a unique ResourceClaim for each pod.
Ah gotcha, makes sense!
there are some attributes that are dynamic, like IPs, MACs, ... and others that are static; RDMA is static and is the same regardless of namespace ... the autoscaler must only rely on static and well-defined attributes, and PCI passes those checks. The MRDMA type is something I do not understand why we want to publish; it's a GCE custom thing that does not make sense to expose, users just want an interface with RDMA
This is correct. In fact, the current autoscaler doesn't distinguish between the two kinds of attributes and ends up in an infinite scaling loop in the scenarios I raised here: https://kubernetes.slack.com/archives/C09R1LV8S/p1754006888690929
MRDMA is just one example; the same thing would apply to other attributes like the VPC name as well. In total, the network-interface-related attributes (`dra.net/alias`, `dra.net/ebpf`, `dra.net/encapsulation`, `dra.net/ifName`, `dra.net/ipv4`, `dra.net/mac`, `dra.net/mtu`, `gce.dra.net/networkName`, `gce.dra.net/networkProjectNumber`, `gce.dra.net/rdmaType`) won't be published if we don't exec into the network namespace.
Open questions
Right now we remove the device entirely; we are going to change that to just remove the attributes that are not discoverable from the host, but leave the device name, for two use cases:
I think this is a good next step to try, and it gives us the opportunity to iterate without breaking changes.
My thinking is that for physical network interfaces the device address is unambiguous for the user and the system, and if we surface the name in an attribute, that solves the UX. I think it is worth pursuing that path.
Thanks for the discussion. I find some comfort knowing that we both understand the choices we are making and their implications, and our idea is to incorporate things once we get user feedback -- this is perfectly reasonable to me! What you have in mind is exactly what PR #183 does. Since we've had lots of discussion here, I'll try to bring those changes into this PR. Also, just a heads-up that with this new logic, our KIND tests involving the "dummy" interface may not work directly, since dummy interfaces are not backed by a PCI device. We will figure out some alternative.
why would it not work? we are not going to break any behavior (#194 (comment)); I'm just saying we only do an additional listing, walking the PCI path and merging each physical device with its link attributes when it is in the host namespace, nothing else changes
So you are saying we still want those devices in the ResourceSlice. I'll keep them for now. Let's take this one step at a time.
you have to rebase
This change refactors the device discovery mechanism to use PCI devices as the fundamental unit of inventory. Network interface attributes are then discovered and associated with these PCI devices. Network interfaces which are not associated with a PCI device (like virtual interfaces) are added as their own devices.

Previously, network interfaces were the primary resource. However, when a network interface is moved into a pod's network namespace, it disappears from the host's view, causing the resource to be removed from the node's ResourceSlice. This could lead to race conditions and incorrect device availability information. By treating the PCI device as the stable, base resource, we ensure that the device remains visible in the ResourceSlice even when its associated network interface is moved.

This commit also introduces the `ghw` library for PCI device discovery and removes the manual sysfs parsing.

Fixes #178
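For context, a minimal sketch of what PCI discovery through `ghw` can look like (field names follow the jaypipes/ghw README; the actual discovery code in this change may structure things differently):

```go
// Sketch: list PCI network controllers using the ghw library.
package main

import (
	"fmt"

	"github.com/jaypipes/ghw"
)

func main() {
	pci, err := ghw.PCI()
	if err != nil {
		panic(err)
	}
	for _, device := range pci.Devices {
		// PCI base class "02" is "Network controller".
		if device.Class != nil && device.Class.ID == "02" {
			fmt.Printf("%s: %s %s (driver: %s)\n",
				device.Address, device.Vendor.Name, device.Product.Name, device.Driver)
		}
	}
}
```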
…Host and NetworkInterfaceConfigInPod
…them in tests
This is now required because not all devices have the exact same attributes.
E.g. PCI Devices whose network interface is moved to the pod's network namespace
don't have the network interface related attributes like:
- dra.net/alias:
- dra.net/ebpf:
- dra.net/encapsulation:
- dra.net/ifName:
- dra.net/ipv4:
- dra.net/mac:
- dra.net/mtu:
- gce.dra.net/networkName:
- gce.dra.net/networkProjectNumber:
- gce.dra.net/rdmaType
This means that the user needs to check for their existence if referring to them.
gauravkghildiyal left a comment
Thank you, updated and rebased.
michaelasp left a comment
This was a major change with lots of back and forth, glad to see we've come to a good conclusion. Approving :)
Updates the third-party dependencies in order for importing projects that have stricter dependency compliance restrictions.

For `github.com/jaypipes/pcidb`, uplifts it to the latest 1.1.0 release, which removed the archived `github.com/mitchellh/go-homedir` dependency.

For `howett.net/plist`, there hasn't been a tagged release of that repo since December 2023, so I manually `go get` the latest commit from January 2025 for that repository. `howett.net/plist` is used for MacOS support.

For `github.com/StackExchange/wmi`, I changed this to `github.com/yusufpapurcu/wmi`, which is the officially-supported fork of the `github.com/StackExchange/wmi` repo.

Closes Issue #323
Related Issue jaypipes/pcidb#36
Related google/dranet#194

Signed-off-by: Jay Pipes <[email protected]>