Conversation

@corhere
Contributor

@corhere corhere commented Jul 24, 2025

- What I did

Backported all the outstanding fixes for NetworkDB to 25.0.

- How I did it

- How to verify it

  • CI tests
  • Running the convergence test locally: go test -tags slowtests ./libnetwork/networkdb

- Human readable description for the release notes

Improved the reliability of the Swarm overlay network control plane by fixing longstanding issues with NetworkDB

- A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Matthieu MOREL <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 7013997)
Signed-off-by: Cory Snider <[email protected]>
@corhere corhere added this to the 25.0.13 milestone Jul 24, 2025
@corhere corhere added the area/networking, kind/bugfix, and area/networking/d/overlay labels Jul 24, 2025
mmorel-35 and others added 16 commits July 25, 2025 16:20
Gracefully leaving the memberlist cluster is a best-effort operation.
Failing to successfully broadcast the leave message to a peer should not
prevent NetworkDB from cleaning up the memberlist instance on close. But
that was not the case in practice. Log the error returned from
(*memberlist.Memberlist).Leave instead of returning it and proceed with
shutting down irrespective of whether Leave() returns an error.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 16ed51d)
Signed-off-by: Cory Snider <[email protected]>
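
A minimal sketch of the shape of the change above, assuming a logrus-style logger and a NetworkDB struct that owns the memberlist instance; the field names are illustrative, not the actual moby internals:

```go
package networkdb

import (
	"time"

	"github.com/hashicorp/memberlist"
	"github.com/sirupsen/logrus"
)

// NetworkDB is trimmed to the one field this sketch needs.
type NetworkDB struct {
	memberlist *memberlist.Memberlist
}

// Close leaves and shuts down the memberlist instance. Leaving is
// best-effort: a failed Leave broadcast is logged rather than
// returned, and shutdown proceeds regardless.
func (nDB *NetworkDB) Close() {
	if err := nDB.memberlist.Leave(time.Second); err != nil {
		logrus.WithError(err).Warn("networkdb: failed to broadcast memberlist leave message")
	}
	if err := nDB.memberlist.Shutdown(); err != nil {
		logrus.WithError(err).Warn("networkdb: failed to shut down memberlist")
	}
}
```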
The loopback-test fixes seem to be sufficient to resolve the flakiness
of all the tests aside from TestFlakyNetworkDBIslands.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 697c17c)
Signed-off-by: Cory Snider <[email protected]>
The map key for nDB.networks is the network ID, so the ID field kept
inside the struct is redundant and not actually used anywhere in
practice.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 30b27ab)
Signed-off-by: Cory Snider <[email protected]>
When joining a network that was previously joined but not yet reaped,
NetworkDB replaces the network struct value with a zeroed-out one with
the entries count copied over. This is also the case when joining a
network that is currently joined! Consequently, joining a network has
the side effect of clearing the broadcast queue. If the queue is cleared
while messages are still pending broadcast, convergence may be delayed
until the next bulk sync cycle.

Make it an error to join a network twice without leaving. Retain the
existing broadcast queue when rejoining a network that has not yet been
reaped.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 51f3182)
Signed-off-by: Cory Snider <[email protected]>
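
A hedged sketch of the join logic just described; the real nDB.networks map is keyed by node ID first, and the field names here (leaving, reapTime) are illustrative:

```go
package networkdb

import (
	"fmt"
	"time"
)

type network struct {
	leaving  bool          // left but not yet reaped
	reapTime time.Duration // illustrative countdown to reaping
	// broadcast queue and entry count elided
}

type NetworkDB struct {
	localNetworks map[string]*network // keyed by network ID
}

// JoinNetwork errors on a double join and, when rejoining an
// unreaped network, keeps the existing struct so the pending
// broadcast queue is not thrown away.
func (nDB *NetworkDB) JoinNetwork(nid string) error {
	if n, ok := nDB.localNetworks[nid]; ok {
		if !n.leaving {
			return fmt.Errorf("networkdb: already joined network %s", nid)
		}
		n.leaving = false
		n.reapTime = 0
		return nil
	}
	nDB.localNetworks[nid] = &network{}
	return nil
}
```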
NetworkDB uses a multi-dimensional map of struct network to keep track of
network attachments for both remote nodes and the local node. Only a
subset of the struct fields are used for remote nodes' network
attachments. The tableBroadcasts pointer field in particular is
always initialized for network values representing local attachments
(read: nDB.networks[nDB.config.NodeID]) and always nil for remote
attachments. Consequently, unnecessary defensive nil-pointer checks are
peppered throughout the code despite the aforementioned invariant.

Enshrine in the type system the invariant that tableBroadcasts is
initialized iff the network attachment is for the local node. Pare down
struct network to only the fields needed for remote network attachments
and move the local-only fields into a new struct thisNodeNetwork. Elide
the unnecessary nil-checks.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit dbb0d88)
Signed-off-by: Cory Snider <[email protected]>
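
A sketch of the type split described above, under the assumption that the local-only state is layered on top of the shared fields; the exact fields are illustrative:

```go
package networkdb

import "github.com/hashicorp/memberlist"

// network keeps only what is needed to track a remote node's
// attachment (illustrative subset of fields).
type network struct {
	id      string
	ltime   uint64 // stand-in for the real Lamport clock type
	leaving bool
}

// thisNodeNetwork embeds network and adds the local-only state, so
// the compiler enforces the invariant: tableBroadcasts exists, and
// is always initialized, only for the local node's attachments.
type thisNodeNetwork struct {
	network
	tableBroadcasts *memberlist.TransmitLimitedQueue
}

func newThisNodeNetwork(id string, numNodes func() int) *thisNodeNetwork {
	return &thisNodeNetwork{
		network:         network{id: id},
		tableBroadcasts: &memberlist.TransmitLimitedQueue{NumNodes: numNodes, RetransmitMult: 4},
	}
}
```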
Log more details when assertions fail to provide a more complete picture
of what went wrong when TestCRUDTableEntries fails. Log the state of
each NetworkDB instance at various points in TestCRUDTableEntries to
provide an even more complete picture.

Increase the global logger verbosity in tests so warnings and debug logs
are printed to the test log.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit e9a7154)
Signed-off-by: Cory Snider <[email protected]>
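
If the logger is logrus (an assumption; moby's logging has since moved behind containerd/log), bumping the global verbosity for a test package can be as simple as:

```go
package networkdb

import (
	"os"
	"testing"

	"github.com/sirupsen/logrus"
)

// TestMain raises the global log level so warnings and debug output
// from NetworkDB are captured in the test log.
func TestMain(m *testing.M) {
	logrus.SetLevel(logrus.DebugLevel)
	os.Exit(m.Run())
}
```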
A network node is responsible for both broadcasting table events for
entries it owns and for rebroadcasting table events from other nodes it
has received. Table events to be broadcast are added to a single queue
per network, including events for rebroadcasting. As the memberlist
TransmitLimitedQueue is (to a first approximation) LIFO, a flood of
events from other nodes could delay the broadcasting of
locally-generated events indefinitely. Prioritize broadcasting local
events by splitting up the queues and only pulling from the rebroadcast
queue if there is free space in the gossip packet after draining the
local-broadcast queue.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 6ec6e09)
Signed-off-by: Cory Snider <[email protected]>
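
A sketch of the local-first draining described above. The two-queue split is from the commit message; the method name and struct layout are guesses. GetBroadcasts(overhead, limit) is the real memberlist API, returning messages whose encoded sizes (plus per-message overhead) fit within limit:

```go
package networkdb

import "github.com/hashicorp/memberlist"

type thisNodeNetwork struct {
	tableBroadcasts   *memberlist.TransmitLimitedQueue // locally-originated events
	tableRebroadcasts *memberlist.TransmitLimitedQueue // other nodes' events
}

// getBroadcasts fills the gossip packet from the local queue first
// and tops it up from the rebroadcast queue only with whatever space
// remains, so a flood of remote events cannot starve local ones.
func (n *thisNodeNetwork) getBroadcasts(overhead, limit int) [][]byte {
	msgs := n.tableBroadcasts.GetBroadcasts(overhead, limit)
	for _, m := range msgs {
		limit -= overhead + len(m)
	}
	if limit <= overhead {
		return msgs
	}
	return append(msgs, n.tableRebroadcasts.GetBroadcasts(overhead, limit)...)
}
```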
The rejoinClusterBootStrap periodic task rejoins with the bootstrap
nodes if none of them are members of the cluster. It correlates the
cluster nodes with the bootstrap list by comparing IP addresses,
ignoring ports. In normal operation this works out fine as every node
has a unique IP address, but in unit tests every node listens on a
distinct port of 127.0.0.1. This situation causes the check to
incorrectly filter out all nodes from the list, mistaking them for the
local node.

Filter out the local node using pointer equality of the *node to avoid
any ambiguity. Correlate the remote nodes by IP:port so that the check
behaves the same in tests and in production.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 1e1be54)
Signed-off-by: Cory Snider <[email protected]>
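
A sketch of the corrected filtering with hypothetical types; the essentials are pointer identity for the local node and full IP:port (not bare IP) for matching bootstrap addresses:

```go
package networkdb

import (
	"net"
	"strconv"
)

type node struct {
	Addr net.IP
	Port uint16
}

// shouldRejoin reports whether none of the bootstrap nodes are
// cluster members. The local node is skipped by pointer equality,
// which cannot be confused by shared IPs, and members are matched
// against the bootstrap set by IP:port.
func shouldRejoin(self *node, members []*node, bootstrap map[string]bool) bool {
	for _, n := range members {
		if n == self {
			continue
		}
		addr := net.JoinHostPort(n.Addr.String(), strconv.Itoa(int(n.Port)))
		if bootstrap[addr] {
			return false // a bootstrap node is already in the cluster
		}
	}
	return true
}
```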
With rejoinClusterBootStrap fixed in tests, split clusters should
reliably self-heal in tests as well as production. Work around the other
source of flakiness in TestNetworkDBIslands: timing out waiting for a
failed node to transition to gracefully left. This flake happens when
one of the leaving nodes sends its NodeLeft message to the other leaving
node, and the second is shut down before it has a chance to rebroadcast
the message to the remaining nodes. The proper fix would be to leverage
memberlist's own bookkeeping instead of duplicating it poorly with user
messages, but doing so requires a change in the memberlist module.
Instead, have the test check that the sum of failed and left nodes
matches the expected total, rather than waiting for every node to
report failed==3 && left==0.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit aff444d)
Signed-off-by: Cory Snider <[email protected]>
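
A rough sketch of the relaxed assertion, with a hypothetical countDeparted helper standing in for the test's real node bookkeeping:

```go
package networkdb

import (
	"testing"
	"time"
)

type NetworkDB struct{}

// countDeparted is a hypothetical helper returning how many peers a
// node currently sees as failed and as gracefully left.
func countDeparted(db *NetworkDB) (failed, left int) { /* elided */ return }

// waitForDeparted accepts any failed/left split as long as the total
// matches, since a leave message may race with the peer's shutdown.
func waitForDeparted(t *testing.T, db *NetworkDB, want int, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for {
		failed, left := countDeparted(db)
		if failed+left == want {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("failed=%d left=%d, want failed+left=%d", failed, left, want)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```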
NetworkDB's JoinNetwork function enqueues a message onto a
TransmitLimitedQueue while holding the NetworkDB mutex locked for
writing. The TransmitLimitedQueue has its own synchronization;
it locks its mutex when enqueueing a message. Locking order:
  1. (NetworkDB).RWMutex.Lock()
  2. (TransmitLimitedQueue).mu.Lock()

NetworkDB's gossip periodic task calls GetBroadcasts on the same
TransmitLimitedQueue to retrieve the enqueued messages. GetBroadcasts
invokes the queue's NumNodes callback while the mutex is locked. The
NumNodes callback function that NetworkDB sets locks the NetworkDB mutex
for reading to take the length of the nodes map. Locking order:
  1. (TransmitLimitedQueue).mu.Lock()
  2. (NetworkDB).RWMutex.RLock()

If one goroutine calls GetBroadcasts on the queue concurrently with
another goroutine calling JoinNetwork on the NetworkDB, the goroutines
may deadlock due to the lock inversion.

Fix the deadlock by caching the number of nodes in an atomic variable so
that the NumNodes callback can load the value without blocking or
violating Go's memory model. Fix a similar deadlock involving the
table-event broadcast queues in the same way.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 08bde5e)
Signed-off-by: Cory Snider <[email protected]>
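
A sketch of the atomic-cache fix described above; estNodes is an illustrative field name:

```go
package networkdb

import (
	"sync"
	"sync/atomic"
)

type nodeState struct{} // placeholder for the real per-node state

type NetworkDB struct {
	sync.RWMutex
	nodes map[string]*nodeState

	// estNodes caches len(nodes) so the queue's NumNodes callback
	// never needs the NetworkDB lock, breaking the lock inversion.
	estNodes atomic.Int64
}

// addNode mutates the map under the write lock and refreshes the
// cached count before releasing it.
func (nDB *NetworkDB) addNode(id string) {
	nDB.Lock()
	defer nDB.Unlock()
	nDB.nodes[id] = &nodeState{}
	nDB.estNodes.Store(int64(len(nDB.nodes)))
}

// estNumNodes is safe to call from inside the TransmitLimitedQueue's
// own lock: it performs a single atomic load and never blocks.
func (nDB *NetworkDB) estNumNodes() int {
	return int(nDB.estNodes.Load())
}
```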
When a node leaves a network, all entries owned by that node are
implicitly deleted. The other NetworkDB nodes handle the leave by
setting the deleted flag on the entries owned by the left node in their
local stores. This behaviour is problematic as it results in two
conflicting entries with the same Lamport timestamp propagating
through the cluster.

Consider two NetworkDB nodes, A and B, which are both joined to some
network. Node A in quick succession leaves the network, immediately
rejoins it, then creates an entry. If Node B processes the
entry-creation event first, it will add the entry to its local store
then set the deleted flag upon processing the network-leave. No matter
how many times B bulk-syncs with A, B will ignore the live entry for
having the same timestamp as its local tombstone entry. Once this
situation occurs, the only way to recover is for the entry to get
updated by A with a new timestamp.

There is no need for a node to store forged tombstones for another
node's entries. All nodes will purge the entries naturally when they
process the network-leave or node-leave event. Simply delete the
non-owned entries from the local store so there is no inconsistent state
to interfere with convergence when nodes rejoin a network. Have nodes
update their local store with tombstones for entries when leaving a
network so that after a rapid leave-then-rejoin the entry deletions
propagate to nodes which may have missed the leave event.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 21d9109)
Signed-off-by: Cory Snider <[email protected]>
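
A sketch of the receive-side change, with a hypothetical entry layout; the key point is that a network-leave from a remote node deletes that node's entries outright instead of writing tombstones that would tie with a re-created entry's timestamp:

```go
package networkdb

import "sync"

type entry struct {
	node  string // owning node
	ltime uint64 // illustrative Lamport timestamp
	value []byte
}

type NetworkDB struct {
	sync.Mutex
	// entries[networkID][key] is a hypothetical layout
	entries map[string]map[string]*entry
}

// handleNetworkLeave purges the leaving node's entries instead of
// marking them deleted, so no tombstone with an equal timestamp can
// shadow a live entry after a rapid leave-then-rejoin.
func (nDB *NetworkDB) handleNetworkLeave(nid, leftNode string) {
	nDB.Lock()
	defer nDB.Unlock()
	for key, e := range nDB.entries[nid] {
		if e.node == leftNode {
			delete(nDB.entries[nid], key)
		}
	}
}
```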
Add a feature to NetworkDB to log the encryption keys to a file for the
Wireshark memberlist plugin to consume, configured using an environment
variable.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit ebfafa1)
Signed-off-by: Cory Snider <[email protected]>
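
A hedged sketch of such a key-log hook; the environment variable name and output format here are guesses, not the actual interface:

```go
package networkdb

import (
	"fmt"
	"os"
)

// maybeWriteKeylog appends the cluster encryption keys to the file
// named by a (hypothetical) environment variable so an external tool
// such as the Wireshark memberlist plugin can decrypt captures.
func maybeWriteKeylog(keys [][]byte) error {
	path := os.Getenv("NETWORKDB_KEYLOG_FILE") // hypothetical variable name
	if path == "" {
		return nil // feature disabled
	}
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	for _, k := range keys {
		if _, err := fmt.Fprintf(f, "%x\n", k); err != nil {
			return err
		}
	}
	return nil
}
```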
Add a property-based test which asserts that a cluster of NetworkDB
nodes always eventually converges to a consistent state. As this test
takes a long time to run, it is build-tagged to exclude it from CI.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit d8730dc)
Signed-off-by: Cory Snider <[email protected]>
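
The slowtests tag matches the verification command in the PR description; a skeleton of how such a tag gates a test file:

```go
//go:build slowtests

package networkdb

import "testing"

// Excluded from regular CI runs; opt in with:
//
//	go test -tags slowtests ./libnetwork/networkdb
func TestNetworkDBAlwaysConverges(t *testing.T) {
	// property-based convergence check elided
}
```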
TestNetworkDBAlwaysConverges will occasionally find a failure where one
entry is missing on one node even after waiting a full five minutes. One
possible explanation is that the selection of nodes to gossip with is
biased in some way. Test that the mRandomNodes function picks a
uniformly distributed sample of node IDs of sufficient length.

The new test reveals that mRandomNodes may sometimes pick out a sample
of fewer than m nodes even when the number of nodes to pick from
(excluding the local node) is >= m. Put the test behind an xfail tag so
it is opt-in to run, without interfering with CI or bisecting.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 5799deb)
Signed-off-by: Cory Snider <[email protected]>
The property test for the mRandomNodes function revealed that it may
sometimes pick out a sample of fewer than m nodes even when the number
of nodes to pick from (excluding the local node) is >= m. Rewrite it
using a random shuffle or permutation so that it always picks a
uniformly-distributed sample of the requested size whenever the
population is large enough.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit ac5f464)
Signed-off-by: Cory Snider <[email protected]>
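
A sketch of a shuffle-based rewrite; the real mRandomNodes is a method with a different signature, so this free function is only illustrative:

```go
package networkdb

import "math/rand"

// mRandomNodes returns a uniformly distributed sample of up to m
// node IDs, excluding the local node. A Fisher–Yates shuffle makes
// the sample size min(m, len(candidates)) by construction, fixing
// the undersized samples the property test caught.
func mRandomNodes(m int, self string, nodes []string) []string {
	candidates := make([]string, 0, len(nodes))
	for _, id := range nodes {
		if id == self {
			continue // never pick ourselves to gossip with
		}
		candidates = append(candidates, id)
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > m {
		candidates = candidates[:m]
	}
	return candidates
}
```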
@corhere corhere force-pushed the backport-25.0/libn/all-the-networkdb-fixes branch from 089be93 to 728de37 July 25, 2025 20:20
@corhere corhere marked this pull request as ready for review July 25, 2025 20:27
Member

@akerouanton akerouanton left a comment

LGTM

@corhere corhere merged commit 59f062b into moby:25.0 Aug 7, 2025
135 of 136 checks passed
@corhere corhere deleted the backport-25.0/libn/all-the-networkdb-fixes branch August 7, 2025 15:44
@thaJeztah thaJeztah moved this from New to Complete in 🔦 Maintainer spotlight Aug 14, 2025