Conversation

@corhere
Contributor

@corhere corhere commented Jul 24, 2025

- What I did

Backported all the outstanding fixes for NetworkDB to 25.0.

- How I did it

- How to verify it

  • CI tests
  • Running the convergence test locally: go test -tags slowtests ./libnetwork/networkdb

- Human readable description for the release notes

Improved the reliability of the Swarm overlay network control plane by fixing longstanding issues with NetworkDB

- A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Matthieu MOREL <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 7013997)
Signed-off-by: Cory Snider <[email protected]>
@corhere corhere added this to the 25.0.13 milestone Jul 24, 2025
@corhere corhere added the area/networking, kind/bugfix, and area/networking/d/overlay labels Jul 24, 2025
mmorel-35 and others added 16 commits July 25, 2025 16:20
Gracefully leaving the memberlist cluster is a best-effort operation.
Failing to successfully broadcast the leave message to a peer should not
prevent NetworkDB from cleaning up the memberlist instance on close. But
that was not the case in practice. Log the error returned from
(*memberlist.Memberlist).Leave instead of returning it and proceed with
shutting down irrespective of whether Leave() returns an error.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 16ed51d)
Signed-off-by: Cory Snider <[email protected]>
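
A minimal sketch of the shape of the change above, assuming a logrus-style logger and a NetworkDB struct that owns the memberlist instance; the field names are illustrative, not the actual moby internals:

```go
package networkdb

import (
	"time"

	"github.com/hashicorp/memberlist"
	"github.com/sirupsen/logrus"
)

// NetworkDB is trimmed to the one field this sketch needs.
type NetworkDB struct {
	memberlist *memberlist.Memberlist
}

// Close leaves and shuts down the memberlist instance. Leaving is
// best-effort: a failed Leave broadcast is logged rather than
// returned, and shutdown proceeds regardless.
func (nDB *NetworkDB) Close() {
	if err := nDB.memberlist.Leave(time.Second); err != nil {
		logrus.WithError(err).Warn("networkdb: failed to broadcast memberlist leave message")
	}
	if err := nDB.memberlist.Shutdown(); err != nil {
		logrus.WithError(err).Warn("networkdb: failed to shut down memberlist")
	}
}
```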
The loopback-test fixes seem to be sufficient to resolve the flakiness
of all the tests aside from TestFlakyNetworkDBIslands.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 697c17c)
Signed-off-by: Cory Snider <[email protected]>
The map key for nDB.networks is the network ID, so the ID field kept
inside the struct is redundant and not actually used anywhere in
practice.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 30b27ab)
Signed-off-by: Cory Snider <[email protected]>
When joining a network that was previously joined but not yet reaped,
NetworkDB replaces the network struct value with a zeroed-out one with
the entries count copied over. This is also the case when joining a
network that is currently joined! Consequently, joining a network has
the side effect of clearing the broadcast queue. If the queue is cleared
while messages are still pending broadcast, convergence may be delayed
until the next bulk sync cycle.

Make it an error to join a network twice without leaving. Retain the
existing broadcast queue when rejoining a network that has not yet been
reaped.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 51f3182)
Signed-off-by: Cory Snider <[email protected]>
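
A hedged sketch of the join logic just described; the real nDB.networks map is keyed by node ID first, and the field names here (leaving, reapTime) are illustrative:

```go
package networkdb

import (
	"fmt"
	"time"
)

type network struct {
	leaving  bool          // left but not yet reaped
	reapTime time.Duration // illustrative countdown to reaping
	// broadcast queue and entry count elided
}

type NetworkDB struct {
	localNetworks map[string]*network // keyed by network ID
}

// JoinNetwork errors on a double join and, when rejoining an
// unreaped network, keeps the existing struct so the pending
// broadcast queue is not thrown away.
func (nDB *NetworkDB) JoinNetwork(nid string) error {
	if n, ok := nDB.localNetworks[nid]; ok {
		if !n.leaving {
			return fmt.Errorf("networkdb: already joined network %s", nid)
		}
		n.leaving = false
		n.reapTime = 0
		return nil
	}
	nDB.localNetworks[nid] = &network{}
	return nil
}
```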
NetworkDB uses a multi-dimensional map of struct network to keep track of
network attachments for both remote nodes and the local node. Only a
subset of the struct fields are used for remote nodes' network
attachments. The tableBroadcasts pointer field in particular is
always initialized for network values representing local attachments
(read: nDB.networks[nDB.config.NodeID]) and always nil for remote
attachments. Consequently, unnecessary defensive nil-pointer checks are
peppered throughout the code despite the aforementioned invariant.

Enshrine in the type system the invariant that tableBroadcasts is
initialized iff the network attachment is for the local node. Pare down
struct network to only the fields needed for remote network attachments
and move the local-only fields into a new struct thisNodeNetwork. Elide
the unnecessary nil-checks.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit dbb0d88)
Signed-off-by: Cory Snider <[email protected]>
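
A sketch of the type split described above, under the assumption that the local-only state is layered on top of the shared fields; the exact fields are illustrative:

```go
package networkdb

import "github.com/hashicorp/memberlist"

// network keeps only what is needed to track a remote node's
// attachment (illustrative subset of fields).
type network struct {
	id      string
	ltime   uint64 // stand-in for the real Lamport clock type
	leaving bool
}

// thisNodeNetwork embeds network and adds the local-only state, so
// the compiler enforces the invariant: tableBroadcasts exists, and
// is always initialized, only for the local node's attachments.
type thisNodeNetwork struct {
	network
	tableBroadcasts *memberlist.TransmitLimitedQueue
}

func newThisNodeNetwork(id string, numNodes func() int) *thisNodeNetwork {
	return &thisNodeNetwork{
		network:         network{id: id},
		tableBroadcasts: &memberlist.TransmitLimitedQueue{NumNodes: numNodes, RetransmitMult: 4},
	}
}
```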
Log more details when assertions fail to provide a more complete picture
of what went wrong when TestCRUDTableEntries fails. Log the state of
each NetworkDB instance at various points in TestCRUDTableEntries to
provide an even more complete picture.

Increase the global logger verbosity in tests so warnings and debug logs
are printed to the test log.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit e9a7154)
Signed-off-by: Cory Snider <[email protected]>
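
If the logger is logrus (an assumption; moby's logging has since moved behind containerd/log), bumping the global verbosity for a test package can be as simple as:

```go
package networkdb

import (
	"os"
	"testing"

	"github.com/sirupsen/logrus"
)

// TestMain raises the global log level so warnings and debug output
// from NetworkDB are captured in the test log.
func TestMain(m *testing.M) {
	logrus.SetLevel(logrus.DebugLevel)
	os.Exit(m.Run())
}
```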
A network node is responsible for both broadcasting table events for
entries it owns and for rebroadcasting table events from other nodes it
has received. Table events to be broadcast are added to a single queue
per network, including events for rebroadcasting. As the memberlist
TransmitLimitedQueue is (to a first approximation) LIFO, a flood of
events from other nodes could delay the broadcasting of
locally-generated events indefinitely. Prioritize broadcasting local
events by splitting up the queues and only pulling from the rebroadcast
queue if there is free space in the gossip packet after draining the
local-broadcast queue.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 6ec6e09)
Signed-off-by: Cory Snider <[email protected]>
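
A sketch of the local-first draining described above. The two-queue split is from the commit message; the method name and struct layout are guesses. GetBroadcasts(overhead, limit) is the real memberlist API, returning messages whose encoded sizes (plus per-message overhead) fit within limit:

```go
package networkdb

import "github.com/hashicorp/memberlist"

type thisNodeNetwork struct {
	tableBroadcasts   *memberlist.TransmitLimitedQueue // locally-originated events
	tableRebroadcasts *memberlist.TransmitLimitedQueue // other nodes' events
}

// getBroadcasts fills the gossip packet from the local queue first
// and tops it up from the rebroadcast queue only with whatever space
// remains, so a flood of remote events cannot starve local ones.
func (n *thisNodeNetwork) getBroadcasts(overhead, limit int) [][]byte {
	msgs := n.tableBroadcasts.GetBroadcasts(overhead, limit)
	for _, m := range msgs {
		limit -= overhead + len(m)
	}
	if limit <= overhead {
		return msgs
	}
	return append(msgs, n.tableRebroadcasts.GetBroadcasts(overhead, limit)...)
}
```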
The rejoinClusterBootStrap periodic task rejoins with the bootstrap
nodes if none of them are members of the cluster. It correlates the
cluster nodes with the bootstrap list by comparing IP addresses,
ignoring ports. In normal operation this works out fine as every node
has a unique IP address, but in unit tests every node listens on a
distinct port of 127.0.0.1. This situation causes the check to
incorrectly filter out all nodes from the list, mistaking them for the
local node.

Filter out the local node using pointer equality of the *node to avoid
any ambiguity. Correlate the remote nodes by IP:port so that the check
behaves the same in tests and in production.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 1e1be54)
Signed-off-by: Cory Snider <[email protected]>
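
A sketch of the corrected filtering with hypothetical types; the essentials are pointer identity for the local node and full IP:port (not bare IP) for matching bootstrap addresses:

```go
package networkdb

import (
	"net"
	"strconv"
)

type node struct {
	Addr net.IP
	Port uint16
}

// shouldRejoin reports whether none of the bootstrap nodes are
// cluster members. The local node is skipped by pointer equality,
// which cannot be confused by shared IPs, and members are matched
// against the bootstrap set by IP:port.
func shouldRejoin(self *node, members []*node, bootstrap map[string]bool) bool {
	for _, n := range members {
		if n == self {
			continue
		}
		addr := net.JoinHostPort(n.Addr.String(), strconv.Itoa(int(n.Port)))
		if bootstrap[addr] {
			return false // a bootstrap node is already in the cluster
		}
	}
	return true
}
```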
With rejoinClusterBootStrap fixed in tests, split clusters should
reliably self-heal in tests as well as production. Work around the other
source of flakiness in TestNetworkDBIslands: timing out waiting for a
failed node to transition to gracefully left. This flake happens when
one of the leaving nodes sends its NodeLeft message to the other leaving
node, and the second is shut down before it has a chance to rebroadcast
the message to the remaining nodes. The proper fix would be to leverage
memberlist's own bookkeeping instead of duplicating it poorly with user
messages, but doing so requires a change in the memberlist module.
Instead, have the test check that the sum of failed and left nodes
matches the expected total, rather than waiting for every node to
report failed==3 && left==0.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit aff444d)
Signed-off-by: Cory Snider <[email protected]>
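
A rough sketch of the relaxed assertion, with a hypothetical countDeparted helper standing in for the test's real node bookkeeping:

```go
package networkdb

import (
	"testing"
	"time"
)

type NetworkDB struct{}

// countDeparted is a hypothetical helper returning how many peers a
// node currently sees as failed and as gracefully left.
func countDeparted(db *NetworkDB) (failed, left int) { /* elided */ return }

// waitForDeparted accepts any failed/left split as long as the total
// matches, since a leave message may race with the peer's shutdown.
func waitForDeparted(t *testing.T, db *NetworkDB, want int, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for {
		failed, left := countDeparted(db)
		if failed+left == want {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("failed=%d left=%d, want failed+left=%d", failed, left, want)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```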
NetworkDB's JoinNetwork function enqueues a message onto a
TransmitLimitedQueue while holding the NetworkDB mutex locked for
writing. The TransmitLimitedQueue has its own synchronization;
it locks its mutex when enqueueing a message. Locking order:
  1. (NetworkDB).RWMutex.Lock()
  2. (TransmitLimitedQueue).mu.Lock()

NetworkDB's gossip periodic task calls GetBroadcasts on the same
TransmitLimitedQueue to retrieve the enqueued messages. GetBroadcasts
invokes the queue's NumNodes callback while the mutex is locked. The
NumNodes callback function that NetworkDB sets locks the NetworkDB mutex
for reading to take the length of the nodes map. Locking order:
  1. (TransmitLimitedQueue).mu.Lock()
  2. (NetworkDB).RWMutex.RLock()

If one goroutine calls GetBroadcasts on the queue concurrently with
another goroutine calling JoinNetwork on the NetworkDB, the goroutines
may deadlock due to the lock inversion.

Fix the deadlock by caching the number of nodes in an atomic variable so
that the NumNodes callback can load the value without blocking or
violating Go's memory model. Fix a similar deadlock involving the
table-event broadcast queues in the same way.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 08bde5e)
Signed-off-by: Cory Snider <[email protected]>
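
A sketch of the atomic-cache fix described above; estNodes is an illustrative field name:

```go
package networkdb

import (
	"sync"
	"sync/atomic"
)

type nodeState struct{} // placeholder for the real per-node state

type NetworkDB struct {
	sync.RWMutex
	nodes map[string]*nodeState

	// estNodes caches len(nodes) so the queue's NumNodes callback
	// never needs the NetworkDB lock, breaking the lock inversion.
	estNodes atomic.Int64
}

// addNode mutates the map under the write lock and refreshes the
// cached count before releasing it.
func (nDB *NetworkDB) addNode(id string) {
	nDB.Lock()
	defer nDB.Unlock()
	nDB.nodes[id] = &nodeState{}
	nDB.estNodes.Store(int64(len(nDB.nodes)))
}

// estNumNodes is safe to call from inside the TransmitLimitedQueue's
// own lock: it performs a single atomic load and never blocks.
func (nDB *NetworkDB) estNumNodes() int {
	return int(nDB.estNodes.Load())
}
```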
When a node leaves a network, all entries owned by that node are
implicitly deleted. The other NetworkDB nodes handle the leave by
setting the deleted flag on the entries owned by the left node in their
local stores. This behaviour is problematic as it results in two
conflicting entries with the same Lamport timestamp propagating
through the cluster.

Consider two NetworkDB nodes, A and B, which are both joined to some
network. Node A in quick succession leaves the network, immediately
rejoins it, then creates an entry. If Node B processes the
entry-creation event first, it will add the entry to its local store
then set the deleted flag upon processing the network-leave. No matter
how many times B bulk-syncs with A, B will ignore the live entry for
having the same timestamp as its local tombstone entry. Once this
situation occurs, the only way to recover is for the entry to get
updated by A with a new timestamp.

There is no need for a node to store forged tombstones for another
node's entries. All nodes will purge the entries naturally when they
process the network-leave or node-leave event. Simply delete the
non-owned entries from the local store so there is no inconsistent state
to interfere with convergence when nodes rejoin a network. Have nodes
update their local store with tombstones for entries when leaving a
network so that after a rapid leave-then-rejoin the entry deletions
propagate to nodes which may have missed the leave event.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 21d9109)
Signed-off-by: Cory Snider <[email protected]>
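
A sketch of the receive-side change, with a hypothetical entry layout; the key point is that a network-leave from a remote node deletes that node's entries outright instead of writing tombstones that would tie with a re-created entry's timestamp:

```go
package networkdb

import "sync"

type entry struct {
	node  string // owning node
	ltime uint64 // illustrative Lamport timestamp
	value []byte
}

type NetworkDB struct {
	sync.Mutex
	// entries[networkID][key] is a hypothetical layout
	entries map[string]map[string]*entry
}

// handleNetworkLeave purges the leaving node's entries instead of
// marking them deleted, so no tombstone with an equal timestamp can
// shadow a live entry after a rapid leave-then-rejoin.
func (nDB *NetworkDB) handleNetworkLeave(nid, leftNode string) {
	nDB.Lock()
	defer nDB.Unlock()
	for key, e := range nDB.entries[nid] {
		if e.node == leftNode {
			delete(nDB.entries[nid], key)
		}
	}
}
```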
Add a feature to NetworkDB to log the encryption keys to a file for the
Wireshark memberlist plugin to consume, configured using an environment
variable.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit ebfafa1)
Signed-off-by: Cory Snider <[email protected]>
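
A hedged sketch of such a key-log hook; the environment variable name and output format here are guesses, not the actual interface:

```go
package networkdb

import (
	"fmt"
	"os"
)

// maybeWriteKeylog appends the cluster encryption keys to the file
// named by a (hypothetical) environment variable so an external tool
// such as the Wireshark memberlist plugin can decrypt captures.
func maybeWriteKeylog(keys [][]byte) error {
	path := os.Getenv("NETWORKDB_KEYLOG_FILE") // hypothetical variable name
	if path == "" {
		return nil // feature disabled
	}
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	for _, k := range keys {
		if _, err := fmt.Fprintf(f, "%x\n", k); err != nil {
			return err
		}
	}
	return nil
}
```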
Add a property-based test which asserts that a cluster of NetworkDB
nodes always eventually converges to a consistent state. As this test
takes a long time to run, it is build-tagged to exclude it from CI.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit d8730dc)
Signed-off-by: Cory Snider <[email protected]>
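
The slowtests tag matches the verification command in the PR description; a skeleton of how such a tag gates a test file:

```go
//go:build slowtests

package networkdb

import "testing"

// Excluded from regular CI runs; opt in with:
//
//	go test -tags slowtests ./libnetwork/networkdb
func TestNetworkDBAlwaysConverges(t *testing.T) {
	// property-based convergence check elided
}
```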
TestNetworkDBAlwaysConverges will occasionally find a failure where one
entry is missing on one node even after waiting a full five minutes. One
possible explanation is that the selection of nodes to gossip with is
biased in some way. Test that the mRandomNodes function picks a
uniformly distributed sample of node IDs of sufficient length.

The new test reveals that mRandomNodes may sometimes pick out a sample
of fewer than m nodes even when the number of nodes to pick from
(excluding the local node) is >= m. Put the test behind an xfail tag so
it is opt-in to run, without interfering with CI or bisecting.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 5799deb)
Signed-off-by: Cory Snider <[email protected]>
The property test for the mRandomNodes function revealed that it may
sometimes pick out a sample of fewer than m nodes even when the number
of nodes to pick from (excluding the local node) is >= m. Rewrite it
using a random shuffle or permutation so that it always picks a
uniformly-distributed sample of the requested size whenever the
population is large enough.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit ac5f464)
Signed-off-by: Cory Snider <[email protected]>
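
A sketch of a shuffle-based rewrite; the real mRandomNodes is a method with a different signature, so this free function is only illustrative:

```go
package networkdb

import "math/rand"

// mRandomNodes returns a uniformly distributed sample of up to m
// node IDs, excluding the local node. A Fisher–Yates shuffle makes
// the sample size min(m, len(candidates)) by construction, fixing
// the undersized samples the property test caught.
func mRandomNodes(m int, self string, nodes []string) []string {
	candidates := make([]string, 0, len(nodes))
	for _, id := range nodes {
		if id == self {
			continue // never pick ourselves to gossip with
		}
		candidates = append(candidates, id)
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > m {
		candidates = candidates[:m]
	}
	return candidates
}
```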
@corhere corhere force-pushed the backport-25.0/libn/all-the-networkdb-fixes branch from 089be93 to 728de37 July 25, 2025 20:20
@corhere corhere marked this pull request as ready for review July 25, 2025 20:27
Member

@akerouanton akerouanton left a comment

LGTM

@corhere corhere merged commit 59f062b into moby:25.0 Aug 7, 2025
135 of 136 checks passed
@corhere corhere deleted the backport-25.0/libn/all-the-networkdb-fixes branch August 7, 2025 15:44
@thaJeztah thaJeztah moved this from New to Complete in 🔦 Maintainer spotlight Aug 14, 2025