Description
Encrypted overlay networks in Docker Swarm can experience silent connection failures between containers on different nodes when the IPSec ESP sequence number overflows the 32-bit limit (2^32 packets) within a key rotation interval.
Docker uses IPSec in transport mode to encrypt VXLAN traffic between swarm nodes. Each Security Association (SA) uses a 32-bit sequence number that increments with every packet. When this sequence number wraps around, packets are silently dropped by the kernel's XFRM subsystem as potential replay attacks, causing all communication between the affected node pair to fail.
The key rotation interval is hardcoded to 12 hours in
https://github.com/moby/swarmkit/blob/17b8d222e7dd3a2f71919224a5813bab5fe47f39/manager/keymanager/keymanager.go#L26
const (
	// DefaultKeyRotationInterval used by key manager
	DefaultKeyRotationInterval = 12 * time.Hour
)

This limits average throughput to roughly 100 MB/s per node pair per direction (assuming an average packet size of ~1000 bytes: 2^32 packets × 1000 bytes / 12 hours ≈ 100 MB/s).
The issue is exacerbated because the 12-hour timer restarts whenever swarm leadership changes. The key manager is started fresh each time a node becomes leader, as shown in
https://github.com/moby/swarmkit/blob/17b8d222e7dd3a2f71919224a5813bab5fe47f39/manager/manager.go#L927
// becomeLeader starts the subsystems that are run on the leader.
func (m *Manager) becomeLeader(ctx context.Context) {
	// ...
	m.keyManager = keymanager.New(s, keymanager.DefaultConfig())
	// ...
	if m.keyManager != nil {
		go func(keyManager *keymanager.KeyManager) {
			if err := keyManager.Run(ctx); err != nil {
				log.G(ctx).WithError(err).Error("keymanager failed with an error")
			}
		}(m.keyManager)
	}
}

The Run function creates a fresh ticker without considering the age of existing keys:
func (k *KeyManager) Run(ctx context.Context) error {
	// ...
	ticker := time.NewTicker(k.config.RotationInterval)
	defer ticker.Stop()
	// ...
}
Worst-case scenario: if a leader election occurs at least once in every 12-hour window, the ticker never fires, keys never rotate, and the sequence numbers eventually overflow.
Proposed fixes:
- Reduce default interval (quick fix): Change DefaultKeyRotationInterval from 12 hours to 1 hour. This doesn't fully solve the problem but reduces the likelihood during normal operations.
- Make rotation interval configurable: Expose RotationInterval as a cluster configuration option so operators can adjust it based on their workload characteristics.
- Persist key creation timestamp: Store the key creation time in the cluster state and calculate the remaining rotation time when a new leader takes over, rather than restarting a fresh 12-hour timer.
- Proactive sequence number monitoring: Monitor XFRM state packet counters (via netlink) and trigger early key rotation when sequence numbers reach ~90% of the 32-bit limit.
- Use 64-bit Extended Sequence Numbers (ESN).
I'm happy to contribute a PR for any of these fixes but would appreciate guidance on the preferred approach.
Reproduce
- Set up a Docker Swarm cluster with at least 2 nodes
- Create an encrypted overlay network:
docker network create --driver overlay --opt encrypted=true my-encrypted-network
- Deploy services on the network across multiple nodes
- Generate high-throughput traffic between containers on different nodes (e.g., using iperf3)
- Monitor IPSec state with:
sudo ip -s xfrm state
- Observe packet counters approaching 2^32 (~4.3 billion packets)
- Connection drops silently when the sequence number overflows
Expected behavior
- Encrypted overlay networks should handle high-throughput workloads without silent connection failures
- Key rotation should occur based on the actual age of the keys, not a timer that resets on leadership changes
- Optionally, the system should proactively rotate keys when sequence numbers approach exhaustion
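The proactive check could be as simple as the following sketch. It only shows the threshold decision; reading the per-SA packet counters (for example via netlink) is left to the caller. `shouldRotateEarly` is a hypothetical helper, and the 0.9 fraction is the ~90% figure suggested in this issue.

```go
package main

import "fmt"

// seqLimit is the number of packets a 32-bit ESP sequence number can count.
const seqLimit uint64 = 1 << 32

// shouldRotateEarly reports whether an SA's packet counter has reached the
// given fraction of the 32-bit sequence space.
func shouldRotateEarly(packets uint64, fraction float64) bool {
	return float64(packets) >= fraction*float64(seqLimit)
}

func main() {
	// Halfway through the sequence space: no early rotation yet.
	fmt.Println(shouldRotateEarly(1<<31, 0.9))

	// Counter value taken from the xfrm output in this issue: rotation is due.
	fmt.Println(shouldRotateEarly(4294938279, 0.9))
}
```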
docker version
Client:
Version: 28.3.3
API version: 1.51
Go version: go1.24.5
Git commit: 980b856
Built: Fri Jul 25 11:33:09 2025
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 28.3.3
API version: 1.51 (minimum version 1.24)
Go version: go1.24.5
Git commit: bea959c7
Built: Fri Jul 25 11:35:35 2025
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.7.27
GitCommit: 05044ec0a9a75232cad458027ca83437aae3f4da
runc:
Version: 1.2.6
GitCommit: v1.2.6-0-ge89a299
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Version: 28.3.3
Context: default
Debug Mode: false
Server:
Containers: 13
Running: 13
Paused: 0
Stopped: 0
Images: 20
Server Version: 28.3.3
Storage Driver: overlay2
Backing Filesystem: btrfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: active
NodeID: 35iu7fcy47gr6i2umzq4kdfph
Is Manager: true
ClusterID: ff4eqtypagy9eyhar1n0g44j4
Managers: 3
Nodes: 3
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 500
Number of Old Snapshots to Retain: 3
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.1.11.62
Manager Addresses: REDACTED
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
runc version: v1.2.6-0-ge89a299
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.154_1
Operating System: REDACTED
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.63GiB
Name: REDACTED
ID: 14a00188-1774-4558-a208-918a0e803e32
Docker Root Dir: /data/docker
Debug Mode: false
Labels:
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
Additional Info
Example output from sudo ip -s xfrm state when the sequence number overflow occurs:
src <IP1-REDACTED> dst <IP2-REDACTED>
proto esp spi 0x0e85f66d(243660397) reqid 13681891(0x00d0c4e3) mode transport
replay-window 0 seq 0x00000000 flag (0x00000000)
aead rfc4106(gcm(aes)) 0x71c4de3f8b2d51059a5b8546228d7b120e85f66d (160 bits) 64
anti-replay context: seq 0x0, oseq 0x0, bitmap 0x00000000
sel src 0.0.0.0/0 dst 0.0.0.0/0 uid 0
lifetime config:
limit: soft (INF)(bytes), hard (INF)(bytes)
limit: soft (INF)(packets), hard (INF)(packets)
expire add: soft 0(sec), hard 0(sec)
expire use: soft 0(sec), hard 0(sec)
lifetime current:
3214713168983(bytes), 4294938279(packets) # <-- packets close to 2^32
add 2026-01-20 10:40:21 use 2026-01-20 22:40:21
stats:
replay-window 0 replay 0 failed 0