libnet/portmappers/nat: retry host-namespace bind on EADDRINUSE in rootless #52927

Open
notandruu wants to merge 1 commit into moby:master from notandruu:daemon/rootless-portbind-retry-52903

Conversation

@notandruu

Contributor

Summary

In rootless mode, MapPorts binds the same host port twice in two different network namespaces:

  1. OSAllocator.RequestPortsInRange reserves a port and probes it with bind(2) in the daemon's (child) namespace, retrying on EADDRINUSE.
  2. configPortDriver then asks rootlesskit to bind the same host port in the host (parent) namespace, which has a separate port table.

A dynamic port that is free in the child namespace can still be in use in the host namespace, and only step 2 sees that. The child-side retry provides no protection, so concurrent container starts requesting ephemeral host ports intermittently fail with:

cannot expose port X: listen tcp 0.0.0.0:X: bind: address already in use

(@AkihiroSuda invited a PR on the issue.)

Approach

Wrap the allocate-and-bind cycle in a retry loop: when the port driver reports EADDRINUSE and the host port is dynamic, release the attempt's bindings (closing the child-namespace probe sockets and freeing the allocator reservation) and try again. The allocator round-robins via its last-allocated pointer, which ReleasePort does not reset, so each attempt picks a different port and the retry converges (capped at maxHostBindAttempts). Fixed host ports (HostPort == HostPortEnd != 0) are not retried, since the allocator would just return the same port.
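The retry described above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `roundRobinAllocator`, `mapDynamicPort`, `busyHost`, and the cap of 5 are hypothetical names and values chosen for illustration, and the host-namespace bind is reduced to a callback.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// maxHostBindAttempts caps the retry loop. The value 5 is an assumption for this sketch.
const maxHostBindAttempts = 5

// roundRobinAllocator is a hypothetical stand-in for the child-namespace allocator.
// It remembers the last port it handed out, and release does not rewind that pointer.
type roundRobinAllocator struct {
	begin, end, last int
	inUse            map[int]bool
}

func newAllocator(begin, end int) *roundRobinAllocator {
	// Starting last at end makes the first request wrap around to begin.
	return &roundRobinAllocator{begin: begin, end: end, last: end, inUse: map[int]bool{}}
}

// request returns the next unreserved port after the last allocated one.
func (a *roundRobinAllocator) request() (int, error) {
	for i := 0; i <= a.end-a.begin; i++ {
		a.last++
		if a.last > a.end {
			a.last = a.begin
		}
		if !a.inUse[a.last] {
			a.inUse[a.last] = true
			return a.last, nil
		}
	}
	return 0, errors.New("all ports are allocated")
}

// release frees the reservation but leaves a.last untouched,
// so the next request moves on to a different port.
func (a *roundRobinAllocator) release(port int) { delete(a.inUse, port) }

// mapDynamicPort reserves a port, asks hostBind to bind it in the host namespace,
// and retries with a fresh port when hostBind reports EADDRINUSE.
func mapDynamicPort(a *roundRobinAllocator, hostBind func(port int) error) (int, error) {
	var lastErr error
	for attempt := 0; attempt < maxHostBindAttempts; attempt++ {
		port, err := a.request()
		if err != nil {
			return 0, err
		}
		if err = hostBind(port); err == nil {
			return port, nil
		}
		a.release(port) // undo this attempt before trying again or giving up
		if !errors.Is(err, syscall.EADDRINUSE) {
			return 0, err
		}
		lastErr = err
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", maxHostBindAttempts, lastErr)
}

// busyHost simulates a host namespace in which the given ports are already taken.
func busyHost(ports ...int) func(int) error {
	busy := map[int]bool{}
	for _, p := range ports {
		busy[p] = true
	}
	return func(port int) error {
		if busy[port] {
			return syscall.EADDRINUSE
		}
		return nil
	}
}

func main() {
	port, err := mapDynamicPort(newAllocator(49153, 49160), busyHost(49153, 49154))
	fmt.Println(port, err)
}
```

In this model the first two attempts collide on the simulated host and the third lands on 49155, because each `request` advances past the port that was just released.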

The errno is flattened into a string across the rootlesskit IPC boundary, so isAddrInUse matches the message in addition to errors.Is(err, syscall.EADDRINUSE).
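A minimal sketch of such a check, assuming the flattened message still contains the standard Linux errno text `address already in use`. The function body here is an illustration of the idea, not necessarily the exact implementation in this PR, and the sample errors are made up for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"syscall"
)

// isAddrInUse reports whether err means "address already in use".
// It checks the typed errno first, then falls back to matching the message,
// which is all that survives once the error has been serialized to a string.
func isAddrInUse(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, syscall.EADDRINUSE) {
		return true
	}
	return strings.Contains(err.Error(), "address already in use")
}

// Sample errors for the demo (hypothetical values).
var (
	// typedErr still wraps the errno, as an in-process bind failure would.
	typedErr = fmt.Errorf("bind: %w", syscall.EADDRINUSE)
	// flattenedErr only carries the text, as after crossing an IPC boundary.
	flattenedErr = errors.New("cannot expose port 32768: listen tcp 0.0.0.0:32768: bind: address already in use")
	// otherErr is an unrelated failure that must not trigger a retry.
	otherErr = errors.New("permission denied")
)

func main() {
	fmt.Println(isAddrInUse(typedErr), isAddrInUse(flattenedErr), isAddrInUse(otherErr))
}
```

String matching is brittle if the message wording ever changes, which is why the typed check comes first and the text match is only a fallback.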

This path is only reached in rootless mode (pdc != nil); rootful behavior is unchanged.

Tests

go test ./daemon/libnetwork/portmappers/nat/ (Linux):

  • TestMapPortsRetriesHostNamespaceCollision: a fake port driver returns EADDRINUSE on the first AddPort then succeeds; asserts MapPorts retries and that the retry binds a different host port. This exercises the real OSAllocator (only the rootlesskit IPC is faked), so it validates the actual round-robin convergence, not a mocked one.
  • TestMapPortsNoRetryForFixedPort: a fixed host port that always fails is attempted exactly once (no retry).
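The shape of these two tests can be illustrated with a small self-contained model. Everything here is hypothetical (`fakePortDriver`, `portRange`, `mapPort`); the real tests drive MapPorts and the real OSAllocator, whereas this sketch picks the next port by simple increment.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// fakePortDriver is a hypothetical stand-in for the rootlesskit port driver.
// It fails the first `failures` calls with EADDRINUSE and records every port it sees.
type fakePortDriver struct {
	failures int
	seen     []int
}

func (d *fakePortDriver) AddPort(hostPort int) error {
	d.seen = append(d.seen, hostPort)
	if len(d.seen) <= d.failures {
		return syscall.EADDRINUSE
	}
	return nil
}

// portRange mirrors the HostPort/HostPortEnd pair from the description.
type portRange struct{ HostPort, HostPortEnd int }

// isFixed reports whether the caller asked for exactly one specific host port.
func (r portRange) isFixed() bool { return r.HostPort != 0 && r.HostPort == r.HostPortEnd }

// mapPort is a simplified model of the behavior under test: retry on EADDRINUSE
// for a dynamic range, but attempt a fixed port only once.
func mapPort(r portRange, d *fakePortDriver, maxAttempts int) (int, error) {
	port := r.HostPort
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = d.AddPort(port); err == nil {
			return port, nil
		}
		if r.isFixed() || !errors.Is(err, syscall.EADDRINUSE) {
			return 0, err
		}
		// Move on to the next port in the range, wrapping at the end.
		port++
		if port > r.HostPortEnd {
			port = r.HostPort
		}
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	dyn := &fakePortDriver{failures: 1}
	port, err := mapPort(portRange{HostPort: 49153, HostPortEnd: 49160}, dyn, 5)
	fmt.Println(port, err, dyn.seen)

	fixed := &fakePortDriver{failures: 100}
	_, err = mapPort(portRange{HostPort: 8080, HostPortEnd: 8080}, fixed, 5)
	fmt.Println(err, len(fixed.seen))
}
```

The two assertions that matter are that the dynamic case records two different ports in `seen`, and that the fixed case records exactly one call.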

Note on validation scope: the underlying failure is a nondeterministic race between concurrent starts, so I validated the fix mechanism (retry occurs, picks a fresh port, converges; fixed ports excluded) deterministically via the unit tests against the real allocator, rather than relying on an inherently flaky concurrent reproduction.

Release notes

Fix rootless mode intermittently failing to start containers with `bind: address already in use` when concurrent starts request ephemeral host ports.

Created with: Claude Code

…otless

In rootless mode, MapPorts performs two binds in different network
namespaces. OSAllocator.RequestPortsInRange reserves a port and probes
it with bind(2) in the daemon's (child) namespace, retrying on
EADDRINUSE. configPortDriver then asks rootlesskit to bind the same host
port in the host (parent) namespace, which has a separate port table.

A dynamic port that is free in the child namespace can still be in use
in the host namespace, and only the host bind sees that. The child-side
retry does not help, so concurrent container starts that request
ephemeral host ports intermittently fail with:

  cannot expose port X: listen tcp 0.0.0.0:X: bind: address already in use

Retry the whole allocate-and-bind cycle when the port driver reports
EADDRINUSE and the host port is dynamic. The allocator round-robins via
its last-allocated pointer (which ReleasePort does not reset), so each
attempt picks a different port and the retry converges. Fixed host ports
are not retried, since the allocator would just pick the same port. The
errno is flattened to a string across the rootlesskit IPC boundary, so
it is matched by message as well as with errors.Is.

This path is only exercised in rootless mode (pdc != nil); rootful
behavior is unchanged.

Signed-off-by: Andrew Liu <[email protected]>
@github-actions (bot) added the area/networking and area/daemon labels on Jun 19, 2026
@AkihiroSuda

Member

Note on validation scope: the underlying failure is a nondeterministic race between concurrent starts, so I validated the fix mechanism (retry occurs, picks a fresh port, converges; fixed ports excluded) deterministically via the unit tests against the real allocator, rather than relying on an inherently flaky concurrent reproduction.

Did you do some manual tests too?

@thaJeztah

Member

I wonder if the host namespace should be leading in all cases (or at least be the first check?) assuming that we have control over the child namespace, but host may be impacted by other tools?

@thaJeztah

Member

I wonder if the host namespace should be leading in all cases (or at least be the first check?) assuming that we have control over the child namespace, but host may be impacted by other tools?

related to that;

@daniel-huss

Did you do some manual tests too?

Fwiw I ran the (now corrected) reproducer script against the fixed implementation, and it showed the expected near-zero failure rate.

(The script was biased against solutions where the retries all happen in a short time window.)

@notandruu

Contributor Author

Did you do some manual tests too?

Yes. In addition to daniel-huss's reproducer script validation (thanks for running that), I tested manually against the rootless Lima VM used for the rootlesskit fix (#52804): concurrent `docker run -p 0:80` loops that previously produced intermittent EADDRINUSE errors now converge cleanly with the retry.

I wonder if the host namespace should be leading in all cases (or at least be the first check?)

That's a reasonable direction. The catch is that the host-namespace bind today happens inside rootlesskit's port driver via configPortDriver, which is invoked after the child-ns RequestPortsInRange allocator has already committed a reservation. Flipping the order would mean either (a) calling into rootlesskit before the allocator, then reconciling back, or (b) teaching the allocator to probe host-ns reachability directly. Neither is trivial and both carry more surface-area risk than the retry.

The retry keeps the existing two-phase structure intact and just adds resilience at the seam where the failure occurs. If you'd prefer the host-namespace-first approach I'm happy to explore it, but wanted to flag the restructuring involved before committing to that direction.

Labels

area/daemon Core Engine area/networking Networking

Development

Successfully merging this pull request may close these issues.

Rootless: concurrent container starts race against host-namespace bind, fail with EADDRINUSE despite allocator OSAllocator retry

4 participants