Rootless: concurrent container starts race against host-namespace bind, fail with EADDRINUSE despite OSAllocator retry #52903

Description

@daniel-huss

When containers are started in rootless mode and request an ephemeral host port, container starts can fail with:

error while calling RootlessKit PortManager.AddPort(): cannot expose port X: listen tcp 0.0.0.0:X: bind: address already in use

Retrying the start usually picks a different port and succeeds.

Diagnosis

daemon/libnetwork/portmappers/nat/mapper_linux.go::MapPorts performs two distinct host-port binds:

  1. OSAllocator.RequestPortsInRange (osallocator_linux.go) reserves a port in portallocator's in-memory map and calls bind(2) (osallocator_linux.go) as a probe, retrying on EADDRINUSE (maxAllocateAttempts, osallocator_linux.go). Under rootless Docker, this bind happens in dockerd's network namespace, i.e. rootlesskit's child netns.

  2. configPortDriver (mapper_linux.go) then calls rlkclient.PortDriverClient.AddPort, which makes rootlesskit perform a second bind in the host (parent) netns at rootlesskit/pkg/port/builtin/parent/tcp/tcp.go. On EADDRINUSE, the error propagates straight back to the API client with no retry (mapper_linux.go).

The two netns have separate kernel port tables. The retry in step 1 catches collisions in the child netns but provides no protection against host-netns collisions in step 2. The OSAllocator's bind probe verifies a port is free in the wrong network namespace.
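To make the failure mode concrete, here is a toy model of the two-step flow, not the actual moby code. `PortTable` and `map_port` are hypothetical names invented for illustration; each `PortTable` stands in for one netns's kernel port table.

```python
import errno


class PortTable:
    """Toy stand-in for the set of TCP ports bound in one network namespace."""

    def __init__(self, bound=()):
        self.bound = set(bound)

    def bind(self, port):
        if port in self.bound:
            raise OSError(errno.EADDRINUSE, "address already in use")
        self.bound.add(port)


def map_port(child, host, candidates):
    """Two-step flow: a retrying probe in `child`, then a single bind in `host`."""
    for port in candidates:
        try:
            child.bind(port)  # step 1: probe in the child table, retried on EADDRINUSE
        except OSError:
            continue
        host.bind(port)  # step 2: bind in the host table, no retry, error propagates
        return port
    raise OSError(errno.EADDRINUSE, "no free port in child table")
```

With port 33000 taken only in the child table, `map_port` moves on to 33001. With 33000 taken only in the host table, the probe succeeds and the second bind raises EADDRINUSE even though 33001 is free in both tables.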

Proposed fix

Retry the allocate-and-bind cycle when pdc.AddPort reports EADDRINUSE. Sadly, the IPC layer flattens errno into a string, so unless you're willing to do a coordinated API change, this will require matching a sentinel inside the error message string.
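A rough sketch of the retry idea, written in Python only to keep it short and runnable; the real change would live in the Go port mapper. All names here (`map_port_with_retry`, `allocate`, `add_port`, `release`, `SENTINEL`) are hypothetical, and the sentinel is just the "address already in use" substring from the error quoted above.

```python
# Assumption: the flattened error text still contains this substring.
SENTINEL = "address already in use"


def is_addr_in_use(message):
    """Best-effort EADDRINUSE detection on an error that is only a string."""
    return SENTINEL in message


def map_port_with_retry(allocate, add_port, release, max_attempts=10):
    """Retry the whole allocate-and-bind cycle when the host-side bind collides.

    allocate() returns a candidate port, add_port(port) performs the host-netns
    bind and raises RuntimeError on failure, release(port) returns the port.
    """
    last_err = None
    for _ in range(max_attempts):
        port = allocate()
        try:
            add_port(port)
            return port
        except RuntimeError as err:
            release(port)
            if not is_addr_in_use(str(err)):
                raise  # unrelated failure: do not mask it with retries
            last_err = err
    raise RuntimeError(f"no free host port after {max_attempts} attempts") from last_err
```

A real fix would also have to make sure the allocator does not hand the same port straight back after `release`. String matching is fragile, which is why a coordinated API change carrying the errno would be the cleaner option.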

Reproduce

#!/usr/bin/env bash
set -u

DOCKERD_PID="$(pgrep -x dockerd | head -1 || true)"
if [[ -z "$DOCKERD_PID" ]]; then
    echo "ERROR: dockerd not running" >&2; exit 1
fi
if [[ "$(readlink /proc/self/ns/net)" == "$(readlink /proc/"$DOCKERD_PID"/ns/net)" ]]; then
    echo "ERROR: this shell shares dockerd's netns — re-run from a plain user shell." >&2
    exit 1
fi

LO="${LO:-33000}"
HI="${HI:-33099}"
ITERATIONS="${ITERATIONS:-100}"
PARALLEL="${PARALLEL:-10}"
HOLD_S="${HOLD_S:-0.05}"
GAP_S="${GAP_S:-0.05}"
BACKOFF_S="${BACKOFF_S:-0.5}"
RETRY_BUDGET="${RETRY_BUDGET:-10}"

echo "attacking host ports $LO-$HI"

docker pull alpine:3 >/dev/null

python3 - "$LO" "$HI" "$HOLD_S" "$GAP_S" "$BACKOFF_S" <<'PY' &
import socket, sys, threading, time, random
lo, hi = int(sys.argv[1]), int(sys.argv[2])
HOLD_S, GAP_S, BACKOFF_S = float(sys.argv[3]), float(sys.argv[4]), float(sys.argv[5])
print(f"hammering host-netns ports {lo}-{hi} with one worker per port", flush=True)

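def worker(port):
    # Random start offset desynchronizes the workers so port states are independent.
    time.sleep(random.uniform(0, HOLD_S + GAP_S))
    while True:
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("0.0.0.0", port))
            s.listen(1)
        except OSError:
            # Port is busy (e.g. currently mapped to a container): back off, then retry.
            time.sleep(BACKOFF_S)
            continue
        time.sleep(HOLD_S)  # hold the port so a concurrent host-netns bind collides
        s.close()
        time.sleep(GAP_S)  # leave the port free before grabbing it again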

for port in range(lo, hi + 1):
    threading.Thread(target=worker, args=(port,), daemon=True).start()
while True:
    time.sleep(60)
PY
HAMMER_PID=$!
trap 'kill $HAMMER_PID 2>/dev/null' EXIT
sleep 0.5

echo "running $ITERATIONS docker iterations, up to $PARALLEL in parallel..."

export LO HI
results=$(seq 1 "$ITERATIONS" | xargs -P "$PARALLEL" -I{} bash -c '
    iter="$1"
    if out=$(docker run --rm -p "${LO}-${HI}:80" alpine:3 true 2>&1); then
        printf "iter %3d OK\n" "$iter"
    else
        printf "iter %3d FAIL: %s\n" "$iter" "$(echo "$out" | tr "\n" " " | sed "s/  */ /g")"
    fi
' _ {})

echo "$results"
fail=$(grep -c ' FAIL' <<<"$results")

echo
awk -v hold="$HOLD_S" -v gap="$GAP_S" -v retries="$RETRY_BUDGET" \
    -v fail="$fail" -v total="$ITERATIONS" \
    'BEGIN {
        p = hold / (hold + gap);
        printf "expected per-attempt failure rate:    %7.3f%%\n", p * 100;
        printf "expected after %dx retry (if fixed):   %7.3f%%\n", retries, (p^retries) * 100;
        printf "observed failure rate:                %7.3f%%  (%d/%d)\n", (fail/total) * 100, fail, total;
    }'

EDIT: Ugh, sorry the "reproducer" script didn't make sense, fixed it.
EDIT2: Added desync to hammer workers so the p^retries assumption holds.
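For reference, the numbers the awk summary prints with the script's default settings work out as below. This assumes each attempt lands on a port whose held/free state is independent of the previous attempt and that the bind itself takes negligible time; $T_\text{hold}$ and $T_\text{gap}$ are shorthand for HOLD_S and GAP_S.

```latex
p = \frac{T_\text{hold}}{T_\text{hold} + T_\text{gap}}
  = \frac{0.05}{0.05 + 0.05} = 0.5
\qquad
p^{\text{RETRY\_BUDGET}} = 0.5^{10} = \frac{1}{1024} \approx 0.098\%
```

So without a retry around AddPort roughly half of the 100 iterations should fail, while a retry budget of 10 would bring the expected count to about 0.1 failures per 100 iterations.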

Expected behavior

Port allocation succeeds.

docker version

Client:
 Version:           26.1.5+dfsg1
 API version:       1.45
 Go version:        go1.24.4
 Git commit:        a72d7cd
 Built:             Sat May  9 11:34:09 2026
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          26.1.5+dfsg1
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.24.4
  Git commit:       411e817
  Built:            Sat May  9 11:34:09 2026
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.24~ds1
  GitCommit:        1.7.24~ds1-6+deb13u1
 runc:
  Version:          1.1.15+ds1
  GitCommit:        1.1.15+ds1-2+b4
 docker-init:
  Version:          0.19.0
  GitCommit:

docker info

Client:
 Version:    26.1.5+dfsg1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.13.1+ds1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  2.26.1-4
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 20
  Running: 20
  Paused: 0
  Stopped: 0
 Images: 43
 Server Version: 26.1.5+dfsg1
 Storage Driver: btrfs
  Btrfs: 
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1.7.24~ds1-6+deb13u1
 runc version: 1.1.15+ds1-2+b4
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.12.90+deb13.1-amd64
 Operating System: Debian GNU/Linux 13 (trixie)
 OSType: linux
 Architecture: x86_64
 CPUs: 28
 Total Memory: 62.49GiB
 Name: work
 ID: 0b64ca5d-b38b-4de3-abe4-9dfdc0b6965b
 Docker Root Dir: /var/containers
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

No response

Metadata

Labels

area/networking, area/rootless, kind/bug, version/26.1
