This repository was archived by the owner on Dec 9, 2025. It is now read-only.

@gauravkghildiyal (Member) commented Oct 12, 2025

  • The rate limiter for publishing inventory state to the Kubernetes API server is now configurable. The default minimum interval between ResourceSlice updates has been reduced from 5 seconds to 2 seconds, with a burst of 5. As a result, when (and only when) the inventory changes, i.e. when the netlink subscription set up by netlink.LinkSubscribe(nlChannel, doneCh) reports an event, the updates are propagated to the cluster more quickly, reducing pod scheduling delays caused by stale resource data.

  • An on-demand scan is now triggered if a requested device is not found in the local cache. This resolves allocation failures caused by race conditions where a device is released and immediately re-claimed (by a different pod). The scan instantly updates the driver's local state, ensuring newly freed devices are immediately available for allocation, independent of the rate-limited ResourceSlice update.

@gauravkghildiyal changed the title from "feat: Add configurable discovery rate limiting" to "[Not ready for review] feat: Add configurable discovery rate limiting" Oct 12, 2025
@gauravkghildiyal marked this pull request as draft October 12, 2025 05:35
@gauravkghildiyal changed the title from "[Not ready for review] feat: Add configurable discovery rate limiting" to "feat: Improve device allocation reliability in high churn scenarios" Oct 12, 2025
@gauravkghildiyal marked this pull request as ready for review October 12, 2025 19:55
This commit introduces configurable rate limiting for the inventory discovery
process. Previously, a fixed 5-second rate limit with a burst of 1 could delay
processing of netlink updates, leading to failures during high pod churn
scenarios.

Command-line flags have been added to control the inventory discovery rate limit
and burst size. The default values have been adjusted to be more responsive to
rapid pod lifecycle events, ensuring that device state is updated promptly.
Previously, if a device was released by a pod and immediately claimed by
another, the inventory might not have had a chance to update.

Now, if a device is not found in the local store, a new scan is triggered to
ensure that newly available devices are discovered before failing. This improves
the reliability of device allocation during high pod churn.
@gauravkghildiyal merged commit d014973 into google:main Oct 13, 2025
6 checks passed