This is really a Serf/Memberlist issue, but it causes trouble for Consul users, so I'm reporting it here as the solution likely involves all three layers.
When a node attempts to gracefully leave a cluster, it calls serf.Leave which will wait for BroadcastTimeout for the message to be sent out: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/serf/serf.go#L717
Resulting in output like:
<...> [INFO] agent: Gracefully shutting down agent...
<...> [ERR] consul: Failed to leave LAN Serf cluster: timeout while waiting for graceful leave
BroadcastTimeout is 5 seconds by default (and not configurable in Consul).
But if you follow the code through into memberlist, the notify channel we are waiting on is actually just waiting for the message to get through the broadcast queue the appropriate number of times: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/queue.go#L353-L356
In Consul, the retransmit mult defaults to 4 and so the retransmit limit ends up being: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/util.go#L71-L76
i.e. for different cluster sizes, the following retransmit limits are used:
numNodes = 1, retransmitLimit = 4, minBroadcastTime: 800ms
numNodes = 10, retransmitLimit = 8, minBroadcastTime: 1.6s
numNodes = 100, retransmitLimit = 12, minBroadcastTime: 2.4s
numNodes = 1000, retransmitLimit = 16, minBroadcastTime: 3.2s
numNodes = 2000, retransmitLimit = 16, minBroadcastTime: 3.2s
numNodes = 5000, retransmitLimit = 16, minBroadcastTime: 3.2s
numNodes = 7500, retransmitLimit = 16, minBroadcastTime: 3.2s
numNodes = 10000, retransmitLimit = 20, minBroadcastTime: 4s
https://play.golang.org/p/7Jk6G34fkle
Now at worst getBroadcasts is only called once every GossipInterval (defaults to 200ms in Consul). It is sometimes better than this since we also attempt to piggyback on any other message being sent, which might occur more often, but let's assume it's common to make only one transmit attempt every 200ms.
So the minBroadcastTime column shows the theoretical minimum time it would take to make enough rebroadcasts for the notify channel to be closed. In theory, even with 10k nodes this fits inside the 5s default, but only just, and it's only a minimum: any other messages being broadcast compete for the limited space in each UDP packet sent, which may well mean it takes several rounds of gossip for each broadcast to go out. In fact we prioritize messages that have been sent fewer times, so on each rebroadcast it gets increasingly likely that the message won't be delivered in the next gossip round.
Anecdotally, anyone running a large enough cluster (on the order of 1000 nodes or more) will often see the graceful leave "timeout", especially if there are any other changes in the cluster causing more gossip messages to be broadcast than usual.
The question is: What does this broadcast timeout achieve? If the goal is to keep the sending node around for long enough to ensure the message is sent, then the timeout should probably be proportional to the cluster size (that is, to the number of attempts that will be made to send it). If we only care that we made some effort to send more than a few times, we should probably not wait for every single retransmit. The ultimate question is: Why report this as an error to operators when it's just natural in any large cluster and doesn't typically mean that the broadcast was actually any less effective?
Possible solutions:
One or more of these are possible.
- don't show that error on leave, as it makes no real difference, especially since we immediately sleep afterwards for longer than 5s anyway, which probably means the broadcast does complete even when we say it "timed out"
- change Serf to make that timeout be proportional to cluster size so it at least indicates that something unusual happened (rather than just being always hit past a certain practical size)
- change Serf/memberlist to only wait for N broadcasts before we consider the broadcast "sent" even if we send more - i.e. regardless of cluster size only wait on, say, rebroadcast_mult broadcasts before saying "we sent this".