IPSec Sequence Number Exhaustion in Encrypted Overlay Networks #52005

@johartl

Description

Encrypted overlay networks in Docker Swarm can experience silent connection failures between containers on different nodes when the IPSec ESP sequence number overflows the 32-bit limit (2^32 packets) within a key rotation interval.

Docker uses IPSec in transport mode to encrypt VXLAN traffic between swarm nodes. Each Security Association (SA) uses a 32-bit sequence number that increments with every packet. When this sequence number wraps around, packets are silently dropped by the kernel's XFRM subsystem as potential replay attacks, causing all communication between the affected node pair to fail.

The key rotation interval is hardcoded to 12 hours in
https://github.com/moby/swarmkit/blob/17b8d222e7dd3a2f71919224a5813bab5fe47f39/manager/keymanager/keymanager.go#L26

const (
    // DefaultKeyRotationInterval used by key manager
    DefaultKeyRotationInterval = 12 * time.Hour
)

This limits sustained throughput to roughly 100 MB/s per node pair per direction, assuming an average packet size of about 1000 bytes: 2^32 packets × 1000 bytes / 43,200 s ≈ 99.4 MB/s.

The issue is exacerbated because the 12-hour timer restarts whenever swarm leadership changes. The key manager is started fresh each time a node becomes leader, as shown in
https://github.com/moby/swarmkit/blob/17b8d222e7dd3a2f71919224a5813bab5fe47f39/manager/manager.go#L927

// becomeLeader starts the subsystems that are run on the leader.
func (m *Manager) becomeLeader(ctx context.Context) {
    // ...
    m.keyManager = keymanager.New(s, keymanager.DefaultConfig())
    // ...
    if m.keyManager != nil {
        go func(keyManager *keymanager.KeyManager) {
            if err := keyManager.Run(ctx); err != nil {
                log.G(ctx).WithError(err).Error("keymanager failed with an error")
            }
        }(m.keyManager)
    }
}

The Run function creates a fresh ticker without considering the age of existing keys:

func (k *KeyManager) Run(ctx context.Context) error {
    // ...
    ticker := time.NewTicker(k.config.RotationInterval)
    defer ticker.Stop()
    // ...
}

Worst-case scenario: if a leader election occurs at least once in every 12-hour window, the timer keeps restarting before it fires, keys never rotate, and the sequence numbers eventually overflow.

Proposed fixes:

  1. Reduce default interval (quick fix): Change DefaultKeyRotationInterval from 12 hours to 1 hour. This doesn't fully solve the problem but reduces the likelihood of an overflow during normal operation.
  2. Make rotation interval configurable: Expose RotationInterval as a cluster configuration option so operators can adjust based on their workload characteristics.
  3. Persist key creation timestamp: Store the key creation time in the cluster state and calculate the remaining rotation time when a new leader takes over, rather than restarting a fresh 12-hour timer.
  4. Proactive sequence number monitoring: Monitor XFRM state packet counters (via netlink) and trigger early key rotation when sequence numbers reach ~90% of the 32-bit limit.
  5. Use 64-bit Extended Sequence Numbers (ESN): Enable ESN on the SAs so that the sequence space grows from 2^32 to 2^64 packets.

I'm happy to contribute a PR for any of these fixes but would appreciate guidance on the preferred approach.

Reproduce

  1. Set up a Docker Swarm cluster with at least 2 nodes
  2. Create an encrypted overlay network:
     docker network create --driver overlay --opt encrypted=true my-encrypted-network
  3. Deploy services on the network across multiple nodes
  4. Generate high-throughput traffic between containers on different nodes (e.g., using iperf3)
  5. Monitor IPSec state with:
     sudo ip -s xfrm state
  6. Observe packet counters approaching 2^32 (~4.3 billion packets)
  7. Connection drops silently when the sequence number overflows

Expected behavior

  1. Encrypted overlay networks should handle high-throughput workloads without silent connection failures
  2. Key rotation should occur based on the actual age of the keys, not a timer that resets on leadership changes
  3. Optionally, the system should proactively rotate keys when sequence numbers approach exhaustion

docker version

Client:
 Version:           28.3.3
 API version:       1.51
 Go version:        go1.24.5
 Git commit:        980b856
 Built:             Fri Jul 25 11:33:09 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.3.3
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.24.5
  Git commit:       bea959c7
  Built:            Fri Jul 25 11:35:35 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.6
  GitCommit:        v1.2.6-0-ge89a299
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Version:    28.3.3
 Context:    default
 Debug Mode: false

Server:
 Containers: 13
  Running: 13
  Paused: 0
  Stopped: 0
 Images: 20
 Server Version: 28.3.3
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: active
  NodeID: 35iu7fcy47gr6i2umzq4kdfph
  Is Manager: true
  ClusterID: ff4eqtypagy9eyhar1n0g44j4
  Managers: 3
  Nodes: 3
  Default Address Pool: 10.0.0.0/8  
  SubnetSize: 24
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 500
   Number of Old Snapshots to Retain: 3
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 0
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.1.11.62
  Manager Addresses: REDACTED
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.6-0-ge89a299
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.154_1
 Operating System: REDACTED
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.63GiB
 Name: REDACTED
 ID: 14a00188-1774-4558-a208-918a0e803e32
 Docker Root Dir: /data/docker
 Debug Mode: false
 Labels:
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

WARNING: No swap limit support

Additional Info

Example output from sudo ip -s xfrm state when the sequence number overflow occurs:

src <IP1-REDACTED> dst <IP2-REDACTED>
	proto esp spi 0x0e85f66d(243660397) reqid 13681891(0x00d0c4e3) mode transport
	replay-window 0 seq 0x00000000 flag  (0x00000000)
	aead rfc4106(gcm(aes)) 0x71c4de3f8b2d51059a5b8546228d7b120e85f66d (160 bits) 64
	anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
	sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
	lifetime config:
	  limit: soft (INF)(bytes), hard (INF)(bytes)
	  limit: soft (INF)(packets), hard (INF)(packets)
	  expire add: soft 0(sec), hard 0(sec)
	  expire use: soft 0(sec), hard 0(sec)
	lifetime current:
	  3214713168983(bytes), 4294938279(packets)       # <-- packets close to 2^32
	  add 2026-01-20 10:40:21 use 2026-01-20 22:40:21
	stats:
	  replay-window 0 replay 0 failed 0
