multiple PGCoordinators going unhealthy can trigger a storm of DeleteTailnetPeer

PGCoordinator has a health mechanism: if it is unable to heartbeat over the pubsub, it declares itself unhealthy, disconnects all current coordinatees (agents, server tailnet, CLI), and immediately disconnects any new coordinatees that connect to it.

The purpose of this feature is that, if a coordinator loses its connection to the pubsub/database because of a network partition, it drops its connections so that coordinatees can retry and hopefully land on a healthy peer.

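However, if multiple PGCoordinators go unhealthy at the same time, a coordinatee that is disconnected from one unhealthy coordinator can reconnect to another unhealthy one and be disconnected again, so coordinatees bounce between coordinators.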

Furthermore, there is a bug in our implementation: when we disconnect a coordinatee that has never sent a node binding, we trigger an unnecessary DeleteTailnetPeer query to the database. The query is idempotent, so any individual query does no harm, but because we issue one per connection, bouncing coordinatees can trigger a storm of queries.
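To illustrate the bug described above, here is a rough sketch that compares deleting on every disconnect with deleting only when the coordinatee has sent a node binding. This is one possible approach, not necessarily the fix that was adopted. The `peerStore` interface, `countingStore`, `conn`, `closeAlways`, and `closeIfBound` are all hypothetical; only the query name `DeleteTailnetPeer` comes from the issue, and its signature here is an assumption:

```go
package main

// peerStore is a hypothetical interface standing in for the database layer.
// The real DeleteTailnetPeer query may take different arguments.
type peerStore interface {
	DeleteTailnetPeer(peerID string) error
}

// countingStore is a fake store that only counts how many deletes were issued.
type countingStore struct{ deletes int }

func (s *countingStore) DeleteTailnetPeer(string) error {
	s.deletes++
	return nil
}

// conn is a hypothetical coordinatee connection.
type conn struct {
	peerID string
	bound  bool // true once the coordinatee has sent a node binding
}

// Bind records that the coordinatee sent a node binding, which is the point
// at which a peer row could exist in the database.
func (c *conn) Bind() { c.bound = true }

// closeAlways mirrors the behavior described in the issue: every disconnect
// issues a delete, even when no peer row can exist.
func closeAlways(s peerStore, c *conn) error {
	return s.DeleteTailnetPeer(c.peerID)
}

// closeIfBound skips the query for coordinatees that never sent a binding.
func closeIfBound(s peerStore, c *conn) error {
	if !c.bound {
		return nil
	}
	return s.DeleteTailnetPeer(c.peerID)
}
```

With 1000 connections that are rejected before sending a binding, `closeAlways` issues 1000 queries against the fake store while `closeIfBound` issues none.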

**Impact:**
A contributing or major factor in a production outage at a customer.


multiple PGCoordinators going unhealthy can trigger a storm of DeleteTailnetPeer #12923
