libnet/portmappers/nat: retry host-namespace bind on EADDRINUSE in rootless #52927
notandruu wants to merge 1 commit into
…otless

In rootless mode, MapPorts performs two binds in different network namespaces. OSAllocator.RequestPortsInRange reserves a port and probes it with bind(2) in the daemon's (child) namespace, retrying on EADDRINUSE. configPortDriver then asks rootlesskit to bind the same host port in the host (parent) namespace, which has a separate port table.

A dynamic port that is free in the child namespace can still be in use in the host namespace, and only the host bind sees that. The child-side retry does not help, so concurrent container starts that request ephemeral host ports intermittently fail with:

    cannot expose port X: listen tcp 0.0.0.0:X: bind: address already in use

Retry the whole allocate-and-bind cycle when the port driver reports EADDRINUSE and the host port is dynamic. The allocator round-robins via its last-allocated pointer (which ReleasePort does not reset), so each attempt picks a different port and the retry converges. Fixed host ports are not retried, since the allocator would just pick the same port.

The errno is flattened to a string across the rootlesskit IPC boundary, so it is matched by message as well as with errors.Is.

This path is only exercised in rootless mode (pdc != nil); rootful behavior is unchanged.

Signed-off-by: Andrew Liu <[email protected]>
Did you do some manual tests too?
I wonder if the host namespace should be leading in all cases (or at least be the first check?) assuming that we have control over the child namespace, but host may be impacted by other tools?
related to that;
Fwiw I ran the (now corrected) reproducer script against the fixed implementation, and it showed the expected near-zero failure rate. (The script was biased against solutions where the retries all happen in a short time window.)
Yes — in addition to daniel-huss's reproducer script validation (thanks for running that), I tested manually against the rootless Lima VM used for the rootlesskit fix (#52804): concurrent
That's a reasonable direction. The catch is that the host-namespace bind today happens inside rootlesskit's port driver via The retry keeps the existing two-phase structure intact and just adds resilience at the seam where the failure occurs. If you'd prefer the host-namespace-first approach I'm happy to explore it, but wanted to flag the restructuring involved before committing to that direction.
Summary
In rootless mode, `MapPorts` binds the same host port twice, in two different network namespaces:

1. `OSAllocator.RequestPortsInRange` reserves a port and probes it with `bind(2)` in the daemon's (child) namespace, retrying on EADDRINUSE.
2. `configPortDriver` then asks rootlesskit to bind the same host port in the host (parent) namespace, which has a separate port table.

A dynamic port that is free in the child namespace can still be in use in the host namespace, and only step 2 sees that. The child-side retry provides no protection, so concurrent container starts requesting ephemeral host ports intermittently fail with:

    cannot expose port X: listen tcp 0.0.0.0:X: bind: address already in use

(@AkihiroSuda invited a PR on the issue.)
Approach
Wrap the allocate-and-bind cycle in a retry loop: when the port driver reports EADDRINUSE and the host port is dynamic, release the attempt's bindings (closing the child-namespace probe sockets and freeing the allocator reservation) and try again. The allocator round-robins via its last-allocated pointer, which `ReleasePort` does not reset, so each attempt picks a different port and the retry converges (capped at `maxHostBindAttempts`). Fixed host ports (`HostPort == HostPortEnd != 0`) are not retried, since the allocator would just return the same port.

The errno is flattened into a string across the rootlesskit IPC boundary, so `isAddrInUse` matches the message in addition to `errors.Is(err, syscall.EADDRINUSE)`.

This path is only reached in rootless mode (`pdc != nil`); rootful behavior is unchanged.

Tests
`go test ./daemon/libnetwork/portmappers/nat/` (Linux):

- `TestMapPortsRetriesHostNamespaceCollision`: a fake port driver returns EADDRINUSE on the first `AddPort` then succeeds; asserts `MapPorts` retries and that the retry binds a different host port. This exercises the real `OSAllocator` (only the rootlesskit IPC is faked), so it validates the actual round-robin convergence, not a mocked one.
- `TestMapPortsNoRetryForFixedPort`: a fixed host port that always fails is attempted exactly once (no retry).

Note on validation scope: the underlying failure is a nondeterministic race between concurrent starts, so I validated the fix mechanism (retry occurs, picks a fresh port, converges; fixed ports excluded) deterministically via the unit tests against the real allocator, rather than relying on an inherently flaky concurrent reproduction.
Release notes
Created with: Claude Code