This repository was archived by the owner on Dec 9, 2025. It is now read-only.
fix: Retry netlink calls on ErrDumpInterrupted by using a wrapper #263
Merged
9425740 to 8906e70
This commit replaces numerous direct calls to the `netlink` library with the `nlwrap` wrapper. This ensures that the updated calls will benefit from the retry logic implemented in the wrapper, making the code more resilient to `ErrDumpInterrupted` errors. Most of the wrapped functions were already available in the forked wrapper.
This commit introduces wrappers for the `RdmaLinkByName` and `RdmaSystemGetNetnsMode` netlink functions. These wrappers provide retry logic to handle `ErrDumpInterrupted` errors. The corresponding callsites in the driver have been updated to use the new wrappers.
a415fb3 to aee299e
aojea (Collaborator) reviewed on Oct 15, 2025
Can we also add a rule to the golangci linter so that new code cannot bypass the wrapper? See https://golangci-lint.run/docs/linters/configuration/#forbidigo. It is just a matter of adding this, as the cilium folks did: https://github.com/cilium/cilium/blob/ca13b76e7affc7ac0f0799d769ec4e76b3ba0809/.golangci.yaml#L43-L49
aojea (Collaborator) suggested changes on Oct 15, 2025
Love it. Just use klog so we don't end up with different loggers; even though the code is forked, we are allowed to modify it under the Apache license.
Also add the linter config I linked in one of the comments so we don't get bitten by mistakes like the one I made here. Sorry about that.
Author (Member)
Thanks for sharing the linter example, that's great. Updated.
aojea approved these changes on Oct 16, 2025
The problem
During high-churn scenarios, it was observed that pods were sometimes getting scheduled to nodes that did not have the required network devices, causing these pods to become stuck in a failure state.
The root cause was traced to our driver's handling of netlink calls. Under heavy load, the kernel can return partial results for netlink state dumps, indicating this with the `NLM_F_DUMP_INTR` flag, which the netlink library surfaces as `ErrDumpInterrupted`. The driver in some cases was processing these incomplete results, leading to the incorrect inclusion of the default network interface in the ResourceSlice.
Although a subsequent sync would correct this, the window was large enough during high churn for the scheduler to assign the default interface as an additional device to the pod.
Solution:
This PR fixes the issue by wrapping the critical netlink call sites with a retry mechanism that specifically handles the `ErrDumpInterrupted` error. This ensures that the driver always works with a complete and accurate list of network devices from the kernel, preventing the publication of incorrect ResourceSlice information.
The change has the following commits, all of which deal with this replacement