Graceful leave will often timeout on a large cluster though nothing is wrong #8435

@banks

Description

This is really a Serf/Memberlist issue, but it causes trouble for Consul users, so I'm reporting it here since the solution likely involves all three layers.

When a node attempts to gracefully leave a cluster, it calls serf.Leave which will wait for BroadcastTimeout for the message to be sent out: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/serf/serf.go#L717

Resulting in output like:

<...> [INFO] agent: Gracefully shutting down agent...
<...> [ERR] consul: Failed to leave LAN Serf cluster: timeout while waiting for graceful leave

BroadcastTimeout is 5 seconds by default (and not configurable in Consul).

But if you follow the code through into memberlist, the notify channel we are waiting on is only closed once the message has passed through the broadcast queue the full retransmit-limit number of times: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/queue.go#L353-L356

In Consul, the retransmit mult defaults to 4, so the retransmit limit ends up being: https://github.com/hashicorp/serf/blob/2a20f94a0fd71f606ea5f58ad3c1457f5ee11bf5/vendor/github.com/hashicorp/memberlist/util.go#L71-L76

i.e. for different cluster sizes, the following numbers of retransmits are used:

numNodes =      1, retransmitLimit =      4, minBroadcastTime: 800ms
numNodes =     10, retransmitLimit =      8, minBroadcastTime: 1.6s
numNodes =    100, retransmitLimit =     12, minBroadcastTime: 2.4s
numNodes =   1000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   2000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   5000, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =   7500, retransmitLimit =     16, minBroadcastTime: 3.2s
numNodes =  10000, retransmitLimit =     20, minBroadcastTime: 4s

https://play.golang.org/p/7Jk6G34fkle

Now, at worst, getBroadcasts is only called once every GossipInterval (which defaults to 200ms in Consul). It is sometimes better than this, since we also attempt to piggyback on any other message being sent, which might occur more often, but let's assume it's common to make only one transmit attempt every 200ms.

So the minBroadcastTime column shows the theoretical minimum time it would take to make enough rebroadcasts for the notify channel to be closed. In theory, even with 10k nodes this fits inside the 5s default, but only just, and it's only a minimum: any other messages being broadcast compete for the limited space in each UDP packet sent, which may well mean it takes several rounds of gossip for each broadcast to go out. In fact, we prioritize messages that have been sent fewer times, so with each re-broadcast it becomes increasingly likely that we won't deliver it in the next gossip round.

Anecdotally, anyone running a large enough cluster (on the order of 1000 nodes or more) will often see graceful leave "timeout", especially if there are any other changes in the cluster causing more gossip messages to be broadcast than usual.

The question is: what does this broadcast timeout achieve? If the goal is to keep the sending node around long enough to ensure the message is sent, then the timeout should probably be proportional to the cluster size and the number of attempts that will be made to send it. If we only care that we made some effort to send it more than a few times, we should probably not wait for every single retransmit. The ultimate question is: why report this as an error to operators when it's just natural in any large cluster and doesn't typically mean the broadcast was actually any less effective?

Possible solutions:

One or more of these are possible.

  1. Don't show that error on leave, since it makes no real difference, especially given that we immediately sleep afterwards for longer than 5s anyway, which probably means the broadcast does complete even when we say it "timed out".
  2. Change Serf to make that timeout proportional to cluster size, so that hitting it at least indicates something unusual happened (rather than it always being hit past a certain practical size).
  3. Change Serf/memberlist to only wait for N broadcasts before we consider the broadcast "sent", even if we send more; i.e. regardless of cluster size, only wait on, say, rebroadcast_mult broadcasts before saying "we sent this".


Labels: theme/internals, theme/operator-usability, theme/reliability
