This repository was archived by the owner on Dec 9, 2025. It is now read-only.

gauravkghildiyal commented Oct 15, 2025

The problem

During high-churn scenarios, pods were sometimes scheduled to nodes that did not have the required network devices, leaving these pods stuck in a failure state.

The root cause was traced to our driver's handling of netlink calls. Under heavy load, the kernel can return partial results for netlink state dumps, indicating this with the NLM_F_DUMP_INTR flag, which the netlink library surfaces as ErrDumpInterrupted. In some cases the driver processed these incomplete results, which led to the default network interface being incorrectly included in the ResourceSlice.

Although a subsequent sync would correct this, the window was large enough during high churn for the scheduler to assign the default interface as an additional device to the pod.

The solution

This PR fixes the issue by wrapping the critical netlink call sites with a retry mechanism that specifically handles the ErrDumpInterrupted error. This ensures that the driver always works with a complete and accurate list of network devices from the kernel, preventing the publication of incorrect ResourceSlice information.


The change consists of the following commits, all of which deal with replacing direct netlink calls with the retrying wrapper:

  1. refactor: Remove redundant error checks involving netlink.ErrDumpInterrupted
  2. feat: Add AddrList to nlwrap handle
  3. refactor: Add wrappers for RDMA netlink functions
  4. refactor: Replace direct netlink calls with nlwrap
  5. feat: Add a wrapper for netlink to retry on ErrDumpInterrupted

This commit replaces numerous direct calls to the `netlink` library with
the `nlwrap` wrapper. This ensures that the updated calls will benefit
from the retry logic implemented in the wrapper, making the code more
resilient to `ErrDumpInterrupted` errors.

Most of the wrapped functions were already available in the forked
wrapper.

This commit introduces wrappers for the `RdmaLinkByName` and
`RdmaSystemGetNetnsMode` netlink functions. These wrappers provide retry
logic to handle `ErrDumpInterrupted` errors.

The corresponding callsites in the driver have been updated to use the
new wrappers.

aojea commented Oct 15, 2025

Can we also add a rule to the golangci linter so that new code cannot bypass the wrapper? https://golangci-lint.run/docs/linters/configuration/#forbidigo

It is just a matter of adding this, as the Cilium folks did: https://github.com/cilium/cilium/blob/ca13b76e7affc7ac0f0799d769ec4e76b3ba0809/.golangci.yaml#L43-L49

aojea left a comment

Love it. Just use klog so that we don't end up with different loggers; even though the code is forked, we are allowed to modify it under the Apache license.

Also add the linter config I linked in one of the comments so we don't get bitten by mistakes like the one I made here. Sorry about that.

@gauravkghildiyal

Thanks for sharing the linter example, that's great. Updated.

@aojea aojea merged commit 67dd9e5 into google:main Oct 16, 2025
6 checks passed