@corhere
Contributor

@corhere corhere commented Jul 28, 2025

- What I did

Backported all the outstanding fixes for Swarm Overlay networks to 25.0

- How I did it

- How to verify it

  1. Set up a three-node Swarm with two workers and one manager
  2. Create a user-defined, attachable overlay network
  3. On the worker nodes, set up a simulated lossy underlay network using the tc netem scheduler. E.g.: tc qdisc add dev eth0 root netem delay 100ms 200ms loss 1%
  4. Verify that service-discovery DNS stays in sync:
    1. Deploy a globally-replicated service with constraint node.role!=manager which is connected to the user-defined overlay network from step 2. Arrange it so the tasks exit after a few seconds, e.g. alpine sh -c 'sleep 5'
    2. Run a basic container on the manager node, connected to the overlay network from step 2. Repeatedly make DNS queries for tasks.<service from step 4> while the service tasks keep flapping in the background. Verify that the set of IP addresses in the answer does not grow unbounded.
    3. Verify that the daemon logs on all three nodes show no evidence of persistent transient states
    4. Delete the basic container and service
  5. Verify that service-discovery DNS and overlay FDB do not fall out of sync when the same endpoint ID is repeatedly removed and re-added:
    1. On the manager node, docker run -d --name hellobasic --net my-overlay nginxdemos/hello:plain-text
    2. On one of the worker nodes, docker run --rm --net my-overlay alpine watch wget -qO- hellobasic
    3. Back on the manager, repeatedly run for run in {1..3}; do docker kill hellobasic && docker start hellobasic; done so that the endpoint flaps and is reattached with different IP and MAC addresses.
    4. Verify that the worker node's container is eventually able to communicate with the hellobasic container after each restart.
    5. Delete the containers
  6. Verify that the ingress routing mesh is reliable with flapping tasks:
    1. docker service create -d --name hello --network my-overlay --mode global --constraint 'node.role!=manager' -p 30888:80 nginxdemos/hello:plain-text
    2. curl port 30888 on each node, repeatedly. Verify that repeated requests all succeed, and hit different tasks.
    3. for run in {1..10}; do docker service update --force hello; sleep 15; done
    4. Repeat step ii.

- Human readable description for the release notes

Improved the reliability of Swarm overlay container networks by fixing longstanding issues with the overlay network driver

- A picture of a cute animal (not mandatory but encouraged)

@corhere corhere added this to the 25.0.13 milestone Jul 28, 2025
@corhere corhere force-pushed the backport-25.0/libn/all-the-overlay-fixes branch from fefaa78 to 5ce9ea2 Compare July 28, 2025 20:03
@corhere corhere force-pushed the backport-25.0/libn/all-the-overlay-fixes branch from 5ce9ea2 to 1624d91 Compare August 7, 2025 15:46
@corhere corhere marked this pull request as ready for review August 7, 2025 15:46
Copilot AI left a comment

Pull Request Overview

This PR backports multiple fixes for Swarm overlay networks to version 25.0 to improve reliability. It addresses longstanding issues with the overlay network driver including fixes for neighbor management, encryption handling, peer database operations, and event processing.

  • Refactors the neighbor management system in overlay networks to be more efficient and reliable
  • Improves event handling and watch mechanisms in networkdb with better type safety
  • Modernizes type signatures and data structures using generics and newer Go features

Reviewed Changes

Copilot reviewed 50 out of 50 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
libnetwork/service.go Updates SetMatrix type parameter for better type safety
libnetwork/osl/neigh_linux.go Refactors neighbor entry management with improved error handling
libnetwork/networkdb/watch.go Modernizes event system with WatchEvent type replacing separate event types
libnetwork/drivers/overlay/ Major refactor of peer database, encryption, and network management
libnetwork/internal/ Adds new utility packages for hashable types, count maps, and set matrices

Comment on lines 162 to 168
if prev != nil {
var err error
prevPeer, err = UnmarshalPeerRecord(prev)
if err != nil {
log.G(context.TODO()).WithError(err).Error("Failed to unmarshal previous peer record")
}
if prevPeer.TunnelEndpointIP == d.advertiseAddress {
Copilot AI Aug 7, 2025

Similar to line 181, this comparison accesses d.advertiseAddress without proper locking. The advertiseAddress field is protected by d.mu according to the struct comments, but it's being accessed here without holding the mutex.

Copilot uses AI. Check for mistakes.
if err != nil {
log.G(context.TODO()).WithError(err).Error("Failed to unmarshal previous peer record")
}
if prevPeer.TunnelEndpointIP.String() == n.providerAddress {
Copilot AI Aug 7, 2025

There's a potential nil pointer dereference here. If unmarshalPeerRecord failed and returned an error, prevPeer could be nil, but the code still tries to access prevPeer.TunnelEndpointIP.

Suggested change
if prevPeer.TunnelEndpointIP.String() == n.providerAddress {
if prevPeer != nil && prevPeer.TunnelEndpointIP.String() == n.providerAddress {

if err != nil {
log.G(context.TODO()).WithError(err).Error("Failed to unmarshal peer record")
}
if prevPeer.TunnelEndpointIP.String() == n.providerAddress {
Copilot AI Aug 7, 2025

This line contains a copy-paste error. It should be checking newPeer.TunnelEndpointIP, not prevPeer.TunnelEndpointIP, since this is in the block handling the new peer value.

Suggested change
if prevPeer.TunnelEndpointIP.String() == n.providerAddress {
if newPeer.TunnelEndpointIP.String() == n.providerAddress {

driver *driver
joinCnt int
// Ref count of VXLAN Forwarding Database entries programmed into the kernel
fdbCnt countmap.Map[hashable.IPMAC]
Copilot AI Aug 7, 2025

[nitpick] The comment above this field says 'Ref count of VXLAN Forwarding Database entries programmed into the kernel' but the field name 'fdbCnt' is not very descriptive. Consider renaming to 'fdbRefCount' or 'vxlanFdbRefCount' for better clarity.

Suggested change
fdbCnt countmap.Map[hashable.IPMAC]
fdbRefCount countmap.Map[hashable.IPMAC]

@corhere corhere force-pushed the backport-25.0/libn/all-the-overlay-fixes branch from 1624d91 to 8a0506b Compare August 7, 2025 17:41
Contributor

@robmry robmry left a comment

LGTM - assuming a rebase fixes the tests.

Member

@akerouanton akerouanton left a comment

LGTM

corhere added 11 commits August 11, 2025 15:13
The writeToStore() call was removed from CreateNetwork in
commit 0fa873c. The comment about
undoing the write is no longer applicable.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit d902773)
Signed-off-by: Cory Snider <[email protected]>
as all callers unconditionally set them to false.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 6ee58c2)
Signed-off-by: Cory Snider <[email protected]>
as all callers unconditionally set them to false.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit a8e8a4c)
Signed-off-by: Cory Snider <[email protected]>
func (*Namespace) AddNeighbor is only ever called with the force
parameter set to false. Remove the parameter and eliminate dead code.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 3bdf99d)
Signed-off-by: Cory Snider <[email protected]>
Scope local variables as narrowly as possible.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit b6d76eb)
Signed-off-by: Cory Snider <[email protected]>
The isDefault and nlHandle fields are immutable once the Namespace is
constructed.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 9866738)
Signed-off-by: Cory Snider <[email protected]>
The Namespace keeps some state for each inserted neighbor-table entry
which is used to delete the entry (and any related entries) given only
the IP and MAC address of the entry to delete. This state is not
strictly required as the retained data is a pure function of the
parameters passed to AddNeighbor(), and the kernel can inform us whether
an attempt to add a neighbor entry would conflict with an existing
entry. Get rid of the neighbor state in Namespace. It's just one more
piece of state that can cause lots of grief if it falls out of sync with
ground truth. Require callers to call DeleteNeighbor() with the same
arguments as they had passed to AddNeighbor(). Push the responsibility
for detecting attempts to insert conflicting entries into the neighbor
table onto the kernel by using (*netlink.Handle).NeighAdd() instead of
NeighSet().

Modernize the error messages and logging in DeleteNeighbor() and
AddNeighbor().
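
The difference between replace and create-exclusive semantics can be sketched with a toy in-memory table. This is only an illustration: neighTable, set, add and del are hypothetical names, not the libnetwork or netlink API, and the map merely stands in for the kernel's neighbor table.

```go
package main

import (
	"errors"
	"fmt"
)

// errExists stands in for the "already exists" error that a
// create-exclusive request reports when the entry is present.
var errExists = errors.New("neighbor entry already exists")

// neighTable is a toy, in-memory stand-in for a kernel neighbor table,
// mapping an IP address to a MAC address. (Hypothetical type.)
type neighTable map[string]string

// set models replace semantics: an existing entry is silently overwritten.
func (t neighTable) set(ip, mac string) { t[ip] = mac }

// add models create-exclusive semantics: a conflicting entry is reported
// to the caller instead of being overwritten.
func (t neighTable) add(ip, mac string) error {
	if _, ok := t[ip]; ok {
		return errExists
	}
	t[ip] = mac
	return nil
}

// del removes an entry only when called with the same arguments that
// created it, so no side bookkeeping is needed to locate the entry.
func (t neighTable) del(ip, mac string) error {
	if cur, ok := t[ip]; !ok || cur != mac {
		return errors.New("no matching neighbor entry")
	}
	delete(t, ip)
	return nil
}

func main() {
	tbl := neighTable{}
	fmt.Println(tbl.add("10.0.0.2", "02:42:0a:00:00:02")) // first add succeeds
	fmt.Println(tbl.add("10.0.0.2", "02:42:0a:00:00:99")) // conflict is surfaced
	fmt.Println(tbl.del("10.0.0.2", "02:42:0a:00:00:02")) // same arguments as the add
}
```

With add, the table itself is the source of truth for conflicts, which is the property the commit relies on when it drops the per-Namespace neighbor state.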

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 0d6e7cd)

libn/d/overlay: delete FDB entry from AF_BRIDGE

Starting with commit 0d6e7cd
DeleteNeighbor() needs to be called with the same options as the
AddNeighbor() call that created the neighbor entry. The calls in peerdb
were modified incorrectly, resulting in the deletes failing and leaking
neighbor entries. Fix up the DeleteNeighbor calls so that the FDB entry
is deleted from the FDB instead of the neighbor table, and the neighbor
is deleted from the neighbor table instead of the FDB.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 7a12bbe)

Signed-off-by: Cory Snider <[email protected]>
Make the SetMatrix key's type generic so that e.g. netip.Addr values can
be used as matrix keys.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 0317f77)
Signed-off-by: Cory Snider <[email protected]>
The netip types are really useful for tracking state in the overlay
driver as they are hashable, unlike net.IP and friends, making them
directly useable as map keys. Converting between netip and net types is
fairly trivial, but fewer conversions is more ergonomic.

The NetworkDB entries for the overlay peer table encode the IP addresses
as strings. We need to parse them to some representation before
processing them further. Parse directly into netip types and pass those
values around to cut down on the number of conversions needed.

The peerDB needs to marshal the keys and entries to structs of hashable
values to be able to insert them into the SetMatrix. Use netip.Addr in
peerEntry so that peerEntry values can be directly inserted into the
SetMatrix without conversions. Use a hashable struct type as the
SetMatrix key to avoid having to marshal the whole struct to a string
and parse it back out.

Use netip.Addr as the map key for the driver's encryption map so the
values do not need to be converted to and from strings. Change the
encryption configuration methods to take netip types so the peerDB code
can pass netip values directly.
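
As a minimal sketch of why hashable address types help, the following uses a struct of netip.Addr values directly as a map key. The peerKey type and the parsePeerKey and toNetIP helpers are made up for illustration and are not the driver's actual types.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// peerKey is a hypothetical hashable key. Every field is comparable,
// so the struct can be used as a map key without marshaling to a string.
type peerKey struct {
	IP   netip.Addr
	VTEP netip.Addr
}

// parsePeerKey parses string-encoded addresses straight into netip types.
func parsePeerKey(ip, vtep string) (peerKey, error) {
	a, err := netip.ParseAddr(ip)
	if err != nil {
		return peerKey{}, err
	}
	v, err := netip.ParseAddr(vtep)
	if err != nil {
		return peerKey{}, err
	}
	return peerKey{IP: a, VTEP: v}, nil
}

// toNetIP converts back to net.IP for APIs that still require it.
func toNetIP(a netip.Addr) net.IP { return net.IP(a.AsSlice()) }

func main() {
	seen := map[peerKey]int{}
	k1, _ := parsePeerKey("10.0.0.2", "192.168.1.10")
	k2, _ := parsePeerKey("10.0.0.2", "192.168.1.10")
	seen[k1]++
	seen[k2]++ // equal netip values are the same map key
	fmt.Println(len(seen), seen[k1], toNetIP(k1.IP))
}
```

A net.IP is a byte slice and cannot be a map key at all, which is why string round-trips were previously needed.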

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit d188df0)
Signed-off-by: Cory Snider <[email protected]>
The overlay driver's checkEncryption function configures the IPSec
parameters for the VXLAN tunnels to peer nodes. When called with
isLocal=true, it configures encryption for all peer nodes with at least
one peerDB entry. Since the local peers are also included in the peerDB,
it needs to filter those entries out. It does so by filtering out any
peer entries whose VTEP address is equal to the current local advertise
address. Trouble is, the local advertise address is not necessarily
constant. The driver tries to handle this case by calling
peerDBUpdateSelf() when the advertise address changes. This function
iterates through the peerDB and tries to update the VTEP address for all
local peer entries, but it does not actually do anything: it mutates a
temporary copy of the entry which is not persisted back into the peerDB.
(It used to be functional, but was broken when the peerDB was extended
to use SetMatrix.) So there may be cases where local peer entries are
not filtered out properly, resulting in spurious encryption parameters
being programmed into the kernel.

Filter out local peers when walking the peerDB by filtering on whether
the entry has the isLocal flag set. Remove the no-op code which attempts
to update local entries in the peerDB. No other code takes any interest
in the VTEP value for isLocal peer entries.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit a9e2d6d)
Signed-off-by: Cory Snider <[email protected]>
The VTEP value for a peer in peerDB is only accurate for a remote peer.
The VTEP for a local peer would be the driver's advertise address, which
is not necessarily constant for the lifetime of the driver instance.
The VTEP values persisted in the peerDB entries for local peers could be
stale or missing if not kept in sync with the advertise address. And the
peerDB could get polluted with duplicate entries for local peers if the
advertise address was to change, as entries which differ only by VTEP
are considered distinct by SetMatrix. Persisting the advertise address
as the VTEP for local peers creates lots of problems that are not easy
to solve.

Stop persisting the VTEP for local peers in peerDB. Any code that needs
to know the VTEP for local peers can look that up from the source of
truth: the driver's advertise address. Use the lack of a VTEP in peerDB
entries to signify local peers, making the isLocal flag redundant.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 48e0b24)
Signed-off-by: Cory Snider <[email protected]>
corhere and others added 19 commits August 11, 2025 15:13
Drop the isLocal boolean parameters from the peerDB functions. Local
peers have vtep == netip.Addr{}.
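
A rough sketch of that convention, assuming a simplified peerEntry type that is not the driver's real struct: the zero netip.Addr reports IsValid() == false, so it can double as the "local peer" marker.

```go
package main

import (
	"fmt"
	"net/netip"
)

// peerEntry is a hypothetical, simplified peer record.
type peerEntry struct {
	MAC  string
	VTEP netip.Addr // the zero value (netip.Addr{}) marks a local peer
}

// isLocal reports whether the entry has no VTEP recorded.
func (p peerEntry) isLocal() bool { return !p.VTEP.IsValid() }

// newPeerEntry builds an entry; an empty vtep string yields a local peer.
func newPeerEntry(mac, vtep string) (peerEntry, error) {
	if vtep == "" {
		return peerEntry{MAC: mac}, nil
	}
	addr, err := netip.ParseAddr(vtep)
	if err != nil {
		return peerEntry{}, err
	}
	return peerEntry{MAC: mac, VTEP: addr}, nil
}

// remotePeers keeps only the entries that refer to other nodes.
func remotePeers(all []peerEntry) []peerEntry {
	var out []peerEntry
	for _, p := range all {
		if !p.isLocal() {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	local, _ := newPeerEntry("02:42:0a:00:00:02", "")
	remote, _ := newPeerEntry("02:42:0a:00:00:03", "192.168.1.11")
	fmt.Println(local.isLocal(), remote.isLocal())
	fmt.Println(len(remotePeers([]peerEntry{local, remote})))
}
```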

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 4b1c123)
Signed-off-by: Cory Snider <[email protected]>
Since it is not meaningful to add or remove encryption between the local
node and itself, the isLocal parameter is redundant. Setting up
encryption for all network peers is now invoked by calling

    checkEncryption(nid, netip.Addr{}, true)

Calling checkEncryption with isLocal=true, add=false is now more
explicitly a no-op. It always was effectively a no-op, but that was not
easy to spot by inspection. In the world with the isLocal flag,
calls to checkEncryption where isLocal=true and add=false would have rIP
set to d.advertiseAddr. In other words, it was a request to remove
encryption parameters between the local peer and itself if peerDB had no
remote-peer entries for the network. So either the call would do
nothing, or it would remove encryption parameters that aren't used for
anything. Now the equivalent call always does nothing.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 0d89325)
Signed-off-by: Cory Snider <[email protected]>
The setupEncryption and removeEncryption functions take several
parameters, but all call sites pass the same values for all the
parameters aside from remoteIP: values taken from fields of the driver
struct. Refactor these functions to be methods of the driver struct and
drop the redundant parameters.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit cb4e7b2)
Signed-off-by: Cory Snider <[email protected]>
In addition to being three functions in a trenchcoat, the
checkEncryption function has a very subtle implementation which is
difficult to reason about. That is not a good property for security
relevant code to have.

Replace two of the three calls to checkEncryption with conditional calls
to setupEncryption and removeEncryption, lifting the conditional logic
which was hidden away in checkEncryption into the call sites to make it
easier to reason about the code. Replace the third call with a call to a
new initEncryption function.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 713f887)
Signed-off-by: Cory Snider <[email protected]>
The (*driver).Join function does many things to set up overlay
networking. One of the first things it does is call
(*network).joinSandbox, which in turn calls (*driver).initSandboxPeerDB.
The initSandboxPeerDB function iterates through the peer db to add
entries to the VXLAN FDB, neighbor table and IPsec security association
database in the kernel for all known peers on the overlay network.

One of the last things the (*driver).Join function does is call
(*driver).initEncryption. The initEncryption function iterates through
the peer db to add entries to the IPsec security association database in
the kernel for all known peers on the overlay network. But the preceding
initSandboxPeerDB call already did that! The initEncryption function is
redundant and can safely be removed.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit df6b405)
Signed-off-by: Cory Snider <[email protected]>
The peer db implementation is more complex than it needs to be.
Notably, the peerCRUD / peerCRUDOp function split is a vestige of its
evolution from a worker goroutine receiving commands over a channel.

Refactor the peer db operations to be easier to read, understand and
modify. Factor the kernel-programming operations out into dedicated
addNeighbor and deleteNeighbor functions. Inline the rest of the
peerCRUDOp functions into their respective peerCRUD wrappers.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 59437f5)
Signed-off-by: Cory Snider <[email protected]>
The overlay driver assumes that the peer table in NetworkDB will always
converge to a 1:1:1 mapping from peer endpoint IP address to MAC address
to VTEP. While this currently holds true in practice most of the time,
it is not an invariant and there are ways that users can violate this
assumption.

The driver detects whether peer entries conflict with each other by
matching up (IP, MAC) tuples. In the common case this works out fine as
the MAC address for an endpoint is generally derived from the assigned
IP address. If an IP address gets reassigned to a container on another
node the MAC address will follow, so the driver's conflict resolution
logic will behave as intended. However users may explicitly configure
the MAC address for a container's network endpoints. If an IP address
gets reassigned from a container with an auto-generated MAC address to a
container with a manually-configured MAC, or vice versa, the driver
would not detect the conflict as the (IP, MAC) tuples won't match up. It
would attempt to program the kernel's neighbor table with two
conflicting MAC addresses for one IP, which will fail. And since it
does not realize that there is a conflict, the driver won't reprogram
the kernel from the remaining entry when the other entry is deleted.

The assumption that only one IP address may resolve to a given MAC
address is violated if multiple IP addresses are assigned to an
endpoint. This rarely comes up in practice today as the overlay driver
only supports IPv4 single-stack connectivity for endpoints. If multiple
distinct peer entries exist with the same MAC address, the driver will
delete the MAC->VTEP mapping from the kernel's forwarding database when
any entry is deleted, even if other entries remain active. This
limitation is one of the biggest obstacles in the way of supporting IPv6
and dual-stack connectivity for endpoints attached to overlay networks.

Modify the peer db logic to correctly handle the cases where peer
entries have non-unique MAC or VTEP values. Treat any set of entries
with non-unique IP addresses as a conflict, irrespective of the entries'
MAC addresses. Maintain a reference count of forwarding database entries
and only delete the MAC->VTEP mapping from the kernel when there are no
longer any neighbor entries which resolve to that MAC.
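
The reference-counting idea can be sketched as follows. The countMap, fdbKey and fdb types are illustrative assumptions, not the countmap or hashable packages added by this PR, and the kernel map is only a stand-in for the kernel's forwarding database.

```go
package main

import "fmt"

// countMap is a hypothetical generic reference-count map.
type countMap[K comparable] map[K]int

// add adjusts the count for k by delta, drops the key when the count
// reaches zero, and returns the new count.
func (m countMap[K]) add(k K, delta int) int {
	n := m[k] + delta
	if n == 0 {
		delete(m, k)
	} else {
		m[k] = n
	}
	return n
}

// fdbKey identifies one MAC->VTEP forwarding entry.
type fdbKey struct{ MAC, VTEP string }

// fdb pairs the reference counts with a set that stands in for the
// entries actually programmed into the kernel.
type fdb struct {
	refs   countMap[fdbKey]
	kernel map[fdbKey]bool
}

func newFDB() *fdb {
	return &fdb{refs: countMap[fdbKey]{}, kernel: map[fdbKey]bool{}}
}

// addNeighbor programs the forwarding entry on the first reference only.
func (f *fdb) addNeighbor(mac, vtep string) {
	k := fdbKey{MAC: mac, VTEP: vtep}
	if f.refs.add(k, 1) == 1 {
		f.kernel[k] = true
	}
}

// deleteNeighbor removes the forwarding entry once the last neighbor
// that resolves to this MAC is gone.
func (f *fdb) deleteNeighbor(mac, vtep string) {
	k := fdbKey{MAC: mac, VTEP: vtep}
	if f.refs[k] == 0 {
		return // nothing to release
	}
	if f.refs.add(k, -1) == 0 {
		delete(f.kernel, k)
	}
}

func (f *fdb) programmed(mac, vtep string) bool {
	return f.kernel[fdbKey{MAC: mac, VTEP: vtep}]
}

func main() {
	f := newFDB()
	// Two IP addresses on one endpoint share a MAC and VTEP.
	f.addNeighbor("02:42:0a:00:00:02", "192.168.1.10")
	f.addNeighbor("02:42:0a:00:00:02", "192.168.1.10")
	f.deleteNeighbor("02:42:0a:00:00:02", "192.168.1.10")
	fmt.Println(f.programmed("02:42:0a:00:00:02", "192.168.1.10"))
}
```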

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 1c2b744)
Signed-off-by: Cory Snider <[email protected]>
    libnetwork/drivers/overlay/encryption.go:370:2: naked return in func `programSA` with 64 lines of code (nakedret)
        return
        ^

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 02b4c7c)
Signed-off-by: Cory Snider <[email protected]>
The IPsec encryption parameters (Security Association Database and
Security Policy Database entries) for a particular overlay network peer
(VTEP) are shared global state as they have to be programmed into the
root network namespace. The same parameters are used when encrypting
VXLAN traffic to a particular VTEP for all overlay networks. Deleting
the entries for a VTEP will break encryption to that VTEP across all
encrypted overlay networks, therefore the decision of when to delete the
entries must take the state of all overlay networks into account.
Unfortunately this is not the case.

The overlay driver uses local per-network state to decide when to
program and delete the parameters for a VTEP. In practice, the
parameters for all VTEPs participating in an encrypted overlay network
are deleted when the network is deleted. Encryption to that VTEP over
all other active encrypted overlay networks would be broken until some
other incidental peerDB event triggered a re-programming of the
parameters for that VTEP.

Change the setupEncryption and removeEncryption functions to be
reference-counted. The removeEncryption function needs to be called the
same number of times as setupEncryption before the parameters are deleted
from the kernel.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 057e35d)
Signed-off-by: Cory Snider <[email protected]>
It is easier to find all references when they are struct fields rather
than embedded structs.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 74713e1)
Signed-off-by: Cory Snider <[email protected]>
func (*driver) secMapWalk is a curious beast. It is named walk, yet it
also mutates the collection being iterated over. It returns an error,
but that error is always nil. It takes a callback that can break
iteration, yet the only caller makes no use of that affordance. Its
utility is limited and the abstraction hinders readability more than it
helps. Open-code the d.secMap.nodes loop into
func (*driver) updateKeys(), the only caller.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit a1d2997)
Signed-off-by: Cory Snider <[email protected]>
There is a dedicated mutex for synchronizing access to the encrMap.
Separately, the main driver mutex is used for synchronizing access to
the encryption keys. Their use is sufficient to prevent data races (if
used correctly, which is not the case) but not logical race conditions.
Programming the encryption parameters for a peer can race with
encryption keys being updated, which could lead to inconsistencies
between the parameters programmed into the kernel and the desired state.

Introduce a new mutex for synchronizing encryption operations. Use that
mutex to synchronize access to both encrMap and keys. Handle encryption
key updates in a critical section so they can no longer be interleaved
with kernel programming of encryption parameters.
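
A minimal sketch of the single-mutex approach, assuming a hypothetical encryptor type with one primary key. The real driver's fields and key handling differ; this only illustrates why one lock over both the keys and the programmed state prevents interleaving.

```go
package main

import (
	"fmt"
	"sync"
)

// encryptor is a hypothetical model: one mutex guards both the current
// key and the per-peer parameters "programmed into the kernel".
type encryptor struct {
	mu         sync.Mutex
	primaryKey string
	programmed map[string]string // peer address -> key in use
}

func newEncryptor(key string) *encryptor {
	return &encryptor{primaryKey: key, programmed: map[string]string{}}
}

// setupPeer programs a peer with whatever key is current.
func (e *encryptor) setupPeer(peer string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.programmed[peer] = e.primaryKey
}

// updateKey rotates the key and reprograms every peer in the same
// critical section, so a concurrent setupPeer cannot slip in between.
func (e *encryptor) updateKey(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.primaryKey = key
	for peer := range e.programmed {
		e.programmed[peer] = key
	}
}

// consistent reports whether every programmed peer uses the current key.
func (e *encryptor) consistent() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, k := range e.programmed {
		if k != e.primaryKey {
			return false
		}
	}
	return true
}

// hammer runs peer setup and key rotation concurrently.
func hammer(e *encryptor, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func(i int) { defer wg.Done(); e.setupPeer(fmt.Sprintf("192.168.1.%d", i)) }(i)
		go func(i int) { defer wg.Done(); e.updateKey(fmt.Sprintf("key-%d", i)) }(i)
	}
	wg.Wait()
}

func main() {
	e := newEncryptor("key-initial")
	hammer(e, 50)
	fmt.Println(e.consistent())
}
```

With two separate mutexes, setupPeer could read the old key, lose the CPU while the key is rotated and the peers reprogrammed, and then write the stale key back.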

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 843cd96)
Signed-off-by: Cory Snider <[email protected]>
The concurrency control in the overlay driver is logically unsound.
While the use of mutexes is sufficient to prevent data races --
violations of the Go memory model -- many operations which need to be
atomic are performed with unbounded concurrency.

Overhaul the use of locks in the overlay network driver. Implement sound
locking at the network granularity: operations may proceed concurrently
iff they are being applied to distinct networks. Push the responsibility
of locking up to the code which calls methods or accesses struct fields
to avoid deadlock situations like we had previously with
d.initSandboxPeerDB() and to make the code easier to reason about.

Each overlay network has a distinct peer db. The NetworkDB watch for the
overlay peer table for the network will only start after
(*driver).CreateNetwork returns and will be stopped before libnetwork
calls (*driver).DeleteNetwork, therefore the lifetime of the peer db for
a network is constrained to the lifetime of the network itself. Yet the
peer db for a network is tracked in a dedicated map, separately from the
network objects themselves. This has resulted in a parallel set of
mutexes to manage concurrency of the peer db distinct from the mutexes
for the driver and networks. Move the peer db for a network into a field
of the network struct and guard it from concurrent access using the
per-network lock. Move the methods for manipulating the peer db into the
network struct so that the methods can only be called if the caller has
a reference to the network object.

Network creation and deletion are synchronized using the driver-scope
mutex, but some of the kernel programming is performed outside of the
critical section. It is possible for network deletion to race with
recreating the network, interleaving the kernel programming for the
network creation and deletion, resulting in inconsistent kernel state.
Parallelize network creation and deletion soundly. Use a double-checked
locking scheme to soundly handle the case of concurrent CreateNetwork
and DeleteNetwork for the same network id without blocking operations
on other networks. Synchronize operations on a network so that
operations on the network such as adding a neighbor to the peer db are
performed atomically, not interleaved with deleting the network.
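
One possible shape for such a scheme is sketched below. It is an assumption-laden simplification, not the driver's code: the driver and network types, the deleted flag and the lockNetwork helper are all hypothetical, and a concurrent create of an existing network simply returns an error here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errExists   = errors.New("network already exists")
	errNotFound = errors.New("network not found")
)

// network carries its own lock, which also guards its peer db.
type network struct {
	mu      sync.Mutex
	deleted bool
	peers   map[string]string // peer IP -> MAC
}

// driver's lock guards only the networks map.
type driver struct {
	mu       sync.Mutex
	networks map[string]*network
}

func newDriver() *driver { return &driver{networks: map[string]*network{}} }

// lockNetwork returns the network with its lock held, or nil. The
// driver lock is released before blocking on the network lock, so
// operations on other networks are never held up.
func (d *driver) lockNetwork(id string) *network {
	d.mu.Lock()
	n := d.networks[id] // first check, under the driver lock
	d.mu.Unlock()
	if n == nil {
		return nil
	}
	n.mu.Lock()
	if n.deleted { // second check: deleted while we were waiting
		n.mu.Unlock()
		return nil
	}
	return n
}

func (d *driver) createNetwork(id string) error {
	n := &network{peers: map[string]string{}}
	n.mu.Lock() // held until setup completes
	defer n.mu.Unlock()
	d.mu.Lock()
	if _, ok := d.networks[id]; ok {
		d.mu.Unlock()
		return errExists
	}
	d.networks[id] = n
	d.mu.Unlock()
	// Kernel programming for the new network would happen here.
	return nil
}

func (d *driver) deleteNetwork(id string) error {
	n := d.lockNetwork(id)
	if n == nil {
		return errNotFound
	}
	defer n.mu.Unlock()
	// Kernel teardown would happen here, before the id is freed for reuse.
	n.deleted = true
	d.mu.Lock()
	delete(d.networks, id)
	d.mu.Unlock()
	return nil
}

func (d *driver) addPeer(id, ip, mac string) error {
	n := d.lockNetwork(id)
	if n == nil {
		return errNotFound
	}
	defer n.mu.Unlock()
	n.peers[ip] = mac
	return nil
}

func main() {
	d := newDriver()
	fmt.Println(d.createNetwork("n1"), d.addPeer("n1", "10.0.0.2", "02:42:0a:00:00:02"))
	fmt.Println(d.deleteNetwork("n1"), d.addPeer("n1", "10.0.0.3", "02:42:0a:00:00:03"))
}
```

No goroutine waits for a network lock while holding the driver lock, which keeps the lock ordering free of cycles in this sketch.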

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 89d3419)
Signed-off-by: Cory Snider <[email protected]>
When handling updates to existing entries, it is often necessary to know
what the previous value was. NetworkDB knows the previous and new values
when it broadcasts an update event for an entry. Include both values in
the update event so the watchers do not have to do their own parallel
bookkeeping.

Unify the event types under WatchEvent as representing the operation kind
in the type system has been inconvenient, not useful. The operation is
now implied by the nilness of the Value and Prev event fields.
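
A small sketch of how the operation can be derived from nilness alone. The watchEvent struct and its kind method are a hypothetical simplification written for this example and may not match the networkdb type field for field.

```go
package main

import "fmt"

// watchEvent is a hypothetical, simplified unified event.
type watchEvent struct {
	Table     string
	NetworkID string
	Key       string
	Value     []byte // current value, nil if the entry was deleted
	Prev      []byte // previous value, nil if the entry was created
}

// kind derives the operation from which of Prev and Value are set.
func (e watchEvent) kind() string {
	switch {
	case e.Prev == nil && e.Value != nil:
		return "create"
	case e.Prev != nil && e.Value != nil:
		return "update"
	case e.Prev != nil && e.Value == nil:
		return "delete"
	default:
		return "invalid"
	}
}

func main() {
	ev := watchEvent{Table: "overlay_peer_table", Key: "ep1", Prev: []byte("old"), Value: []byte("new")}
	fmt.Println(ev.kind())
}
```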

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 69c3c56)
Signed-off-by: Cory Snider <[email protected]>
Overlay is the only driver which makes use of the EventNotify facility,
yet all other driver implementations are forced to provide a stub
implementation. Move the EventNotify and DecodeTableEntry methods into a
new optional TableWatcher interface and remove the stubs from all the
other drivers.
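
The optional-interface pattern can be sketched as below. The interface and method signatures are deliberately simplified assumptions and do not reproduce the driverapi definitions; in particular, DecodeTableEntry is omitted.

```go
package main

import "fmt"

// Driver is a stand-in for the mandatory driver interface.
type Driver interface {
	Type() string
}

// TableWatcher is the optional capability. Only drivers that care
// about table events implement it. (Simplified, hypothetical signature.)
type TableWatcher interface {
	EventNotify(table, key string, prev, value []byte)
}

// bridgeDriver implements only the mandatory interface: no stub needed.
type bridgeDriver struct{}

func (*bridgeDriver) Type() string { return "bridge" }

// overlayDriver opts in to table events.
type overlayDriver struct{ events int }

func (*overlayDriver) Type() string { return "overlay" }

func (o *overlayDriver) EventNotify(table, key string, prev, value []byte) {
	o.events++
}

// notify delivers the event only if the driver implements TableWatcher,
// and reports whether it did.
func notify(d Driver, table, key string, prev, value []byte) bool {
	tw, ok := d.(TableWatcher)
	if !ok {
		return false
	}
	tw.EventNotify(table, key, prev, value)
	return true
}

func main() {
	o := &overlayDriver{}
	fmt.Println(notify(&bridgeDriver{}, "overlay_peer_table", "ep1", nil, []byte("v")))
	fmt.Println(notify(o, "overlay_peer_table", "ep1", nil, []byte("v")), o.events)
}
```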

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 844023f)
Signed-off-by: Cory Snider <[email protected]>
The macAddr and ipmac types are generally useful within libnetwork. Move
them to a dedicated package and overhaul the API to be more like that of
the net/netip package.

Update the overlay driver to utilize these types, adapting to the new
API.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit c7b9370)
Signed-off-by: Cory Snider <[email protected]>
Windows and Linux overlay driver instances are interoperable, working
from the same NetworkDB table for peer discovery. As both drivers
produce and consume serialized data through the table, they both need to
have a shared understanding of the shape and semantics of that data.
The Windows overlay driver contains a duplicate copy of the protobuf
definitions used for marshaling and unmarshaling the NetworkDB peer
entries for dubious reasons. It gives us the flexibility to have the
definitions diverge, which is only really useful for shooting ourselves
in the foot.

Make libnetwork/drivers/overlay the source of truth for the peer record
definitions and the name of the NetworkDB table for distributing peer
records.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 8340e10)
Signed-off-by: Cory Snider <[email protected]>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.

The overlay driver naively handled events for overlay_peer_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, inserting a new
entry into peerDB without removing the old one. This would
have various side effects, such as having the "transient state" of
multiple entries in peerDB with the same peer IP never settle.

Update driverapi to pass both the previous and new value of a table
entry into the driver. Modify the overlay driver to handle an UPDATE by
removing the previous peer entry from peerDB then adding the new one.
Modify the Windows overlay driver to match.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit e1a586a)

libn/d/overlay: don't deref nil PeerRecord on error

If unmarshaling the peer record fails, there is no need to check if it's
a record for a local peer. Attempting to do so anyway will result in a
nil-dereference panic. Don't do that.

The Windows overlay driver has a typo: prevPeer is being checked twice
for whether it was a local-peer record. Check prevPeer once and newPeer
once each, as intended.
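
The handling described above, including the nil guard, might look roughly like this. The peerRecord type, the comma-separated encoding and the handleEvent method are stand-ins invented for the example; the real records are protobuf messages.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strings"
)

// peerRecord is a hypothetical decoded table entry.
type peerRecord struct{ IP, MAC, VTEP string }

// unmarshalPeer decodes a toy "ip,mac,vtep" encoding. It returns a nil
// record on error, like a typical unmarshal helper.
func unmarshalPeer(b []byte) (*peerRecord, error) {
	parts := strings.Split(string(b), ",")
	if len(parts) != 3 {
		return nil, errors.New("malformed peer record")
	}
	return &peerRecord{IP: parts[0], MAC: parts[1], VTEP: parts[2]}, nil
}

// peerDB is a toy set of peer records.
type peerDB map[peerRecord]struct{}

// handleEvent treats every event uniformly: remove whatever prev
// described, then add whatever value describes. A coalesced UPDATE
// therefore cannot leave the old entry behind.
func (db peerDB) handleEvent(prev, value []byte) {
	if prev != nil {
		p, err := unmarshalPeer(prev)
		if err != nil {
			log.Printf("failed to unmarshal previous peer record: %v", err)
		} else { // only touch p when decoding succeeded
			delete(db, *p)
		}
	}
	if value != nil {
		p, err := unmarshalPeer(value)
		if err != nil {
			log.Printf("failed to unmarshal peer record: %v", err)
		} else {
			db[*p] = struct{}{}
		}
	}
}

func main() {
	db := peerDB{}
	old := []byte("10.0.0.2,02:42:0a:00:00:02,192.168.1.10")
	cur := []byte("10.0.0.2,02:42:0a:00:00:99,192.168.1.11")
	db.handleEvent(nil, old) // CREATE
	db.handleEvent(old, cur) // coalesced UPDATE with nothing in common
	fmt.Println(len(db))
}
```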

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 12c6345)

Signed-off-by: Cory Snider <[email protected]>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.

The libnetwork controller naively handled events for endpoint_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, adding a new
service binding without removing the old one. This would
have various side effects, such as the "transient state" of multiple
conflicting service bindings, where more than one endpoint is assigned
the same IP address, never settling.

Modify the libnetwork controller to handle an UPDATE by removing the
previous service binding then adding the new one.

Signed-off-by: Cory Snider <[email protected]>
(cherry picked from commit 4538a1d)
Signed-off-by: Cory Snider <[email protected]>
@corhere corhere force-pushed the backport-25.0/libn/all-the-overlay-fixes branch from 8a0506b to f099e91 Compare August 11, 2025 19:13
@corhere corhere merged commit 165516e into moby:25.0 Aug 11, 2025
202 of 203 checks passed
@corhere corhere deleted the backport-25.0/libn/all-the-overlay-fixes branch August 11, 2025 20:52
4 participants