multiple PGCoordinators going unhealthy can trigger a storm of DeleteTailnetPeer #12923

Closed
@spikecurtis

Description

PGCoordinator has a mechanism where if it is unable to heartbeat over the pub-sub, it declares itself unhealthy, disconnects any coordinatees (agents, server tailnet, CLI), and immediately disconnects any new coordinatees that connect to it.

The purpose of this feature is that if a coordinator loses its connection to the pubsub/database through a network partition, it drops its connections so that coordinatees can retry and hopefully land on a healthy peer.

However, if multiple PGCoordinators go unhealthy at the same time, coordinatees can bounce between coordinators.

Furthermore, there is a bug in our implementation such that when we disconnect a coordinatee that has never sent a node binding, we trigger an unnecessary DeleteTailnetPeer query to the database. The query is idempotent, so any individual query does no harm, but since we do it once per connection, this can trigger a storm of queries.
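One way to avoid the unnecessary query is to remember whether a coordinatee ever sent a node binding and skip the delete if it did not. The sketch below illustrates that idea only; it is not the fix that was merged. The `peerStore` interface, `connState`, `onNodeBinding`, `cleanup`, and `countingStore` are hypothetical names, and the `DeleteTailnetPeer` signature shown here is assumed rather than taken from the real database layer.

```go
package main

// peerStore is a hypothetical stand-in for the database layer. Only the
// DeleteTailnetPeer name comes from the issue; the signature is assumed.
type peerStore interface {
	DeleteTailnetPeer(peerID string) error
}

// connState tracks one coordinatee connection (hypothetical type).
type connState struct {
	peerID    string
	boundNode bool // true once the coordinatee has sent a node binding
}

// onNodeBinding records that this coordinatee has written peer state.
func (c *connState) onNodeBinding() { c.boundNode = true }

// cleanup runs when the coordinatee is disconnected. If no node binding
// was ever sent, there is no peer row to remove, so no query is issued.
func (c *connState) cleanup(store peerStore) error {
	if !c.boundNode {
		return nil
	}
	return store.DeleteTailnetPeer(c.peerID)
}

// countingStore is a fake store that counts delete queries, useful for
// checking how many queries a burst of disconnects would generate.
type countingStore struct{ deletes int }

func (s *countingStore) DeleteTailnetPeer(string) error {
	s.deletes++
	return nil
}
```

With this guard, a burst of connections that are dropped before sending any node binding issues no delete queries, while a coordinatee that did bind still has its peer removed on disconnect.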

Impact:
Contributing or major factor in a production outage at a customer.

Labels

customer-reported: Bugs reported by enterprise customers. Only humans may set this.
networking: Area: networking
s1: Bugs that break core workflows. Only humans may set this.
