[25.0] libnetwork/networkdb: backport all the fixes #50511
Merged: corhere merged 17 commits into moby:25.0 from corhere:backport-25.0/libn/all-the-networkdb-fixes on Aug 7, 2025
Conversation
Signed-off-by: Matthieu MOREL <[email protected]> Signed-off-by: Sebastiaan van Stijn <[email protected]> (cherry picked from commit 7013997) Signed-off-by: Cory Snider <[email protected]>
Signed-off-by: Matthieu MOREL <[email protected]>
Gracefully leaving the memberlist cluster is a best-effort operation. Failing to successfully broadcast the leave message to a peer should not prevent NetworkDB from cleaning up the memberlist instance on close. But that was not the case in practice. Log the error returned from (*memberlist.Memberlist).Leave instead of returning it and proceed with shutting down irrespective of whether Leave() returns an error. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 16ed51d) Signed-off-by: Cory Snider <[email protected]>
The loopback-test fixes seem to be sufficient to resolve the flakiness of all the tests aside from TestFlakyNetworkDBIslands. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 697c17c) Signed-off-by: Cory Snider <[email protected]>
The map key for nDB.networks is the network ID. The struct field is not actually used anywhere in practice. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 30b27ab) Signed-off-by: Cory Snider <[email protected]>
When joining a network that was previously joined but not yet reaped, NetworkDB replaces the network struct value with a zeroed-out one with the entries count copied over. This is also the case when joining a network that is currently joined! Consequently, joining a network has the side effect of clearing the broadcast queue. If the queue is cleared while messages are still pending broadcast, convergence may be delayed until the next bulk sync cycle. Make it an error to join a network twice without leaving. Retain the existing broadcast queue when rejoining a network that has not yet been reaped. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 51f3182) Signed-off-by: Cory Snider <[email protected]>
NetworkDB uses a multi-dimensional map of struct network to keep track of network attachments for both remote nodes and the local node. Only a subset of the struct fields are used for remote nodes' network attachments. The tableBroadcasts pointer field in particular is always initialized for network values representing local attachments (read: nDB.networks[nDB.config.NodeID]) and always nil for remote attachments. Consequently, unnecessary defensive nil-pointer checks are peppered throughout the code despite the aforementioned invariant. Enshrine the invariant that tableBroadcasts is initialized iff the network attachment is for the local node in the type system. Pare down struct network to only the fields needed for remote network attachments and move the local-only fields into a new struct thisNodeNetwork. Elide the unnecessary nil-checks. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit dbb0d88) Signed-off-by: Cory Snider <[email protected]>
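The shape of the refactor can be sketched like this. Field names and types are illustrative, not the exact moby definitions; the point is that the constructor makes the "tableBroadcasts is initialized iff local" invariant unrepresentable to violate.

```go
package main

type broadcastQueue struct{ msgs [][]byte }

// network holds only the fields needed for a remote node's attachment.
type network struct {
	ltime         uint64 // Lamport time of the latest state change
	leaving       bool   // node is in the process of leaving the network
	entriesNumber int
}

// thisNodeNetwork embeds network and adds the local-only state. Its
// constructor always initializes tableBroadcasts, so the defensive
// nil checks can be elided at every call site.
type thisNodeNetwork struct {
	network
	tableBroadcasts *broadcastQueue
}

func newThisNodeNetwork() *thisNodeNetwork {
	return &thisNodeNetwork{tableBroadcasts: &broadcastQueue{}}
}
```

With this split, remote attachments are plain `network` values with no broadcast queue at all, so the nil case simply cannot arise for local attachments.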
Log more details when assertions fail to provide a more complete picture of what went wrong when TestCRUDTableEntries fails. Log the state of each NetworkDB instance at various points in TestCRUDTableEntries to provide an even more complete picture. Increase the global logger verbosity in tests so warnings and debug logs are printed to the test log. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit e9a7154) Signed-off-by: Cory Snider <[email protected]>
A network node is responsible for both broadcasting table events for entries it owns and for rebroadcasting table events from other nodes it has received. Table events to be broadcast are added to a single queue per network, including events for rebroadcasting. As the memberlist TransmitLimitedQueue is (to a first approximation) LIFO, a flood of events from other nodes could delay the broadcasting of locally-generated events indefinitely. Prioritize broadcasting local events by splitting up the queues and only pulling from the rebroadcast queue if there is free space in the gossip packet after draining the local-broadcast queue. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 6ec6e09) Signed-off-by: Cory Snider <[email protected]>
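The queue-splitting strategy can be illustrated with a simplified sketch. The real code uses memberlist's TransmitLimitedQueue; the `queue`, `drain`, and `gossipPayload` names below are assumptions for illustration. Locally-generated events are drained first, and rebroadcasts only fill whatever space remains in the gossip packet.

```go
package main

// queue is a trivial FIFO stand-in for memberlist's TransmitLimitedQueue.
type queue struct{ msgs [][]byte }

// drain returns messages from the front of the queue until the next
// message would exceed the byte limit, along with the bytes used.
func (q *queue) drain(limit int) (out [][]byte, used int) {
	for len(q.msgs) > 0 && used+len(q.msgs[0]) <= limit {
		m := q.msgs[0]
		q.msgs = q.msgs[1:]
		out = append(out, m)
		used += len(m)
	}
	return out, used
}

// gossipPayload prioritizes local events: rebroadcasts are only pulled
// if there is free space left after draining the local queue.
func gossipPayload(local, rebroadcast *queue, limit int) [][]byte {
	msgs, used := local.drain(limit)
	more, _ := rebroadcast.drain(limit - used)
	return append(msgs, more...)
}
```

This way a flood of remote events can no longer starve the node's own table events out of the gossip packet.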
The rejoinClusterBootStrap periodic task rejoins with the bootstrap nodes if none of them are members of the cluster. It correlates the cluster nodes with the bootstrap list by comparing IP addresses, ignoring ports. In normal operation this works out fine as every node has a unique IP address, but in unit tests every node listens on a distinct port of 127.0.0.1. This situation causes the check to incorrectly filter out all nodes from the list, mistaking them for the local node. Filter out the local node using pointer equality of the *node to avoid any ambiguity. Correlate the remote nodes by IP:port so that the check behaves the same in tests and in production. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 1e1be54) Signed-off-by: Cory Snider <[email protected]>
With rejoinClusterBootStrap fixed in tests, split clusters should reliably self-heal in tests as well as production. Work around the other source of flakiness in TestNetworkDBIslands: timing out waiting for a failed node to transition to gracefully left. This flake happens when one of the leaving nodes sends its NodeLeft message to the other leaving node, and the second is shut down before it has a chance to rebroadcast the message to the remaining nodes. The proper fix would be to leverage memberlist's own bookkeeping instead of duplicating it poorly with user messages, but doing so requires a change in the memberlist module. Instead have the test check that the sum of failed+left nodes is expected instead of waiting for all nodes to have failed==3 && left==0. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit aff444d) Signed-off-by: Cory Snider <[email protected]>
NetworkDB's JoinNetwork function enqueues a message onto a TransmitLimitedQueue while holding the NetworkDB mutex locked for writing. The TransmitLimitedQueue has its own synchronization; it locks its mutex when enqueueing a message. Locking order: 1. (NetworkDB).RWMutex.Lock() 2. (TransmitLimitedQueue).mu.Lock() NetworkDB's gossip periodic task calls GetBroadcasts on the same TransmitLimitedQueue to retrieve the enqueued messages. GetBroadcasts invokes the queue's NumNodes callback while the mutex is locked. The NumNodes callback function that NetworkDB sets locks the NetworkDB mutex for reading to take the length of the nodes map. Locking order: 1. (TransmitLimitedQueue).mu.Lock() 2. (NetworkDB).RWMutex.RLock() If one goroutine calls GetBroadcasts on the queue concurrently with another goroutine calling JoinNetwork on the NetworkDB, the goroutines may deadlock due to the lock inversion. Fix the deadlock by caching the number of nodes in an atomic variable so that the NumNodes callback can load the value without blocking or violating Go's memory model. And fix a similar deadlock situation with the table-event broadcast queues. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 08bde5e) Signed-off-by: Cory Snider <[email protected]>
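The atomic-counter fix can be sketched as below. Names are illustrative rather than the exact moby identifiers; the essential point is that the `NumNodes`-style callback reads a cached atomic value and therefore never takes the NetworkDB lock while the queue's mutex is held, breaking the inversion.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// networkDB is a minimal stand-in for the real structure.
type networkDB struct {
	mu        sync.RWMutex
	nodes     map[string]struct{}
	nodeCount atomic.Int64 // kept in sync with len(nodes) under mu
}

func (db *networkDB) addNode(id string) {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.nodes[id] = struct{}{}
	db.nodeCount.Store(int64(len(db.nodes)))
}

// numNodes is safe to use as the TransmitLimitedQueue NumNodes callback:
// it is called with the queue mutex held, and takes no NetworkDB lock,
// so the 1→2 / 2→1 lock orderings can no longer deadlock.
func (db *networkDB) numNodes() int {
	return int(db.nodeCount.Load())
}
```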
When a node leaves a network, all entries owned by that node are implicitly deleted. The other NetworkDB nodes handle the leave by setting the deleted flag on the entries owned by the left node in their local stores. This behaviour is problematic as it results in two conflicting entries with the same Lamport timestamp propagating through the cluster. Consider two NetworkDB nodes, A, and B, which are both joined to some network. Node A in quick succession leaves the network, immediately rejoins it, then creates an entry. If Node B processes the entry-creation event first, it will add the entry to its local store then set the deleted flag upon processing the network-leave. No matter how many times B bulk-syncs with A, B will ignore the live entry for having the same timestamp as its local tombstone entry. Once this situation occurs, the only way to recover is for the entry to get updated by A with a new timestamp. There is no need for a node to store forged tombstones for another node's entries. All nodes will purge the entries naturally when they process the network-leave or node-leave event. Simply delete the non-owned entries from the local store so there is no inconsistent state to interfere with convergence when nodes rejoin a network. Have nodes update their local store with tombstones for entries when leaving a network so that after a rapid leave-then-rejoin the entry deletions propagate to nodes which may have missed the leave event. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 21d9109) Signed-off-by: Cory Snider <[email protected]>
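The core of the fix can be sketched as follows, with assumed names (`entry`, `handleRemoteNetworkLeave`) rather than the real moby types: when another node leaves, its entries are deleted outright instead of tombstoned at the same Lamport timestamp.

```go
package main

// entry models a table entry in a node's local store (illustrative).
type entry struct {
	owner   string
	ltime   uint64 // Lamport timestamp
	deleted bool   // tombstone flag
}

// handleRemoteNetworkLeave removes entries owned by the departed node
// rather than setting their deleted flag. A forged tombstone would keep
// the old ltime, so a re-created entry with the same timestamp could be
// ignored forever during bulk sync; outright deletion leaves no
// inconsistent state to block convergence on rejoin.
func handleRemoteNetworkLeave(store map[string]*entry, leftNode string) {
	for key, e := range store {
		if e.owner == leftNode {
			delete(store, key)
		}
	}
}
```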
Add a feature to NetworkDB to log the encryption keys to a file for the Wireshark memberlist plugin to consume, configured using an environment variable. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit ebfafa1) Signed-off-by: Cory Snider <[email protected]>
Add a property-based test which asserts that a cluster of NetworkDB nodes always eventually converges to a consistent state. As this test takes a long time to run, it is build-tagged to be excluded from CI. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit d8730dc) Signed-off-by: Cory Snider <[email protected]>
TestNetworkDBAlwaysConverges will occasionally find a failure where one entry is missing on one node even after waiting a full five minutes. One possible explanation is that the selection of nodes to gossip with is biased in some way. Test that the mRandomNodes function picks a uniformly distributed sample of node IDs of sufficient length. The new test reveals that mRandomNodes may sometimes pick out a sample of fewer than m nodes even when the number of nodes to pick from (excluding the local node) is >= m. Put the test behind an xfail tag so it is opt-in to run, without interfering with CI or bisecting. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit 5799deb) Signed-off-by: Cory Snider <[email protected]>
The property test for the mRandomNodes function revealed that it may sometimes pick out a sample of fewer than m nodes even when the number of nodes to pick from (excluding the local node) is >= m. Rewrite it using a random shuffle or permutation so that it always picks a uniformly-distributed sample of the requested size whenever the population is large enough. Signed-off-by: Cory Snider <[email protected]> (cherry picked from commit ac5f464) Signed-off-by: Cory Snider <[email protected]>
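A shuffle-based rewrite has the described property: whenever the population (excluding the local node) is at least m, the sample always has exactly m members, drawn uniformly. This is an illustrative sketch, not the actual mRandomNodes implementation.

```go
package main

import "math/rand"

// randomNodes picks up to m node IDs uniformly at random, excluding the
// local node. Fisher-Yates shuffling the full candidate slice and taking
// a prefix guarantees a complete sample whenever len(candidates) >= m,
// unlike rejection-sampling approaches that can come up short.
func randomNodes(localID string, ids []string, m int) []string {
	candidates := make([]string, 0, len(ids))
	for _, id := range ids {
		if id != localID {
			candidates = append(candidates, id)
		}
	}
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > m {
		candidates = candidates[:m]
	}
	return candidates
}
```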
robmry approved these changes on Jul 28, 2025
akerouanton approved these changes on Aug 7, 2025
akerouanton (Member) left a comment:
LGTM
Labels
- area/networking/d/overlay (Networking)
- area/networking (Networking)
- impact/changelog
- kind/bugfix (PR's that fix bugs)
- What I did
Backported all the outstanding fixes for NetworkDB to 25.0.
- How I did it
- How to verify it
go test -tags slowtests ./libnetwork/networkdb
- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)