Switch to Tailscale for agent networking

### Problem
Users have been reporting connection issues:
- https://github.com/coder/coder/issues/2497
- https://github.com/coder/coder/issues/2769
- https://github.com/coder/coder/issues/2658
- https://github.com/coder/coder/issues/2845

Some additional issues that haven't been reported:
- Throughput is ~30mb/s (improving it would likely require substantial effort)
- The initial connection takes a few hundred milliseconds (I believe @mafredri reported this actually)

And the debt we take on:
- Maintaining ~3000 lines of networking code that requires very specific domain knowledge to edit
- Intermittent disconnects have been occurring, and because we lack that domain knowledge, we have limited capacity to debug them

### Definition of Done
Our current networking stack with WebRTC is not durable, fast, or resilient. To fix all the issues listed, I propose we switch to Tailscale's OSS implementation on top of WireGuard. This will eliminate a significant portion of the effort we spend on networking, and let us focus on improving the fundamentals where we have higher leverage.

Read this blog post by Tailscale to understand how it works: https://tailscale.com/blog/how-tailscale-works/

A proof of concept and partial implementation are merged into the product under a hidden flag for SSH: `coder config-ssh --wireguard`. It appears to resolve the problems listed above. The code can be [viewed here](https://github.com/coder/coder/tree/wgnet).

### FAQ

#### How do users upgrade from WebRTC to WireGuard?

Once we remove WebRTC, agents that still use it will be unable to connect to Coder. Those endpoints will be removed, and workspaces will require a stop -> start to begin working again. This could be eased by doing it over a few releases, but I think making it explicit in the release notes would be better.

#### Does this talk to Tailscale's hosted offering at all?

Nope. A DERP server is embedded in the Coder server, just like how TURN has worked. A user could add Tailscale's DERP map to their Coder deployment to relay through Tailscale's servers, and we'll allow that in the future.
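To make the "embedded" part concrete, here is a minimal sketch of how a single relay region could be derived from the Coder access URL. The `DERPNode` and `DERPRegion` types, the `embeddedRegion` helper, and the example hostname are all hypothetical stand-ins for illustration; they are not Tailscale's actual structs or the code on the `wgnet` branch.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// DERPNode and DERPRegion are simplified, hypothetical stand-ins for the
// entries of a DERP map. They are not Tailscale's real types.
type DERPNode struct {
	HostName string
	Port     int
	UseTLS   bool
}

type DERPRegion struct {
	ID    int
	Name  string
	Nodes []DERPNode
}

// embeddedRegion derives one relay region from the Coder access URL, so
// relayed traffic targets the same host and port as the rest of the server.
func embeddedRegion(accessURL string) (DERPRegion, error) {
	u, err := url.Parse(accessURL)
	if err != nil {
		return DERPRegion{}, err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return DERPRegion{}, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	// Fall back to the scheme's default port when none is given.
	port := 80
	if u.Scheme == "https" {
		port = 443
	}
	if p := u.Port(); p != "" {
		if port, err = strconv.Atoi(p); err != nil {
			return DERPRegion{}, err
		}
	}
	node := DERPNode{HostName: u.Hostname(), Port: port, UseTLS: u.Scheme == "https"}
	// The ID and name are arbitrary placeholders for this sketch.
	return DERPRegion{ID: 1, Name: "embedded", Nodes: []DERPNode{node}}, nil
}

func main() {
	region, err := embeddedRegion("https://coder.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", region)
}
```

Because the only region points back at the deployment itself, nothing in this sketch references Tailscale's hosted relays.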

#### How does this change our architecture?

The architecture diagram [here](https://coder.com/docs/coder-oss/latest/architecture) stays *exactly* the same. DERP replaces TURN, and it already uses HTTP(S) as its protocol. This should simplify and reduce proxying.

#### Does this still require significant domain knowledge to edit?

Yes, but the amount of code shrinks substantially and the architecture keeps the same shape. The surface area of this problem should shrink with it.

#### Does this prevent us from using WebRTC from the browser for P2P connections in the future?

Yes, but that hasn't been possible anyway because of our TURN over WebSockets proxying.

#### How long will we dogfood the Tailscale networking before removing WebRTC?

It's very similar architecturally, so we don't expect to run into significant hurdles. We'll make this the default on dogfood over break, and will likely cut a new release containing it as default the week we're back. We've been dogfooding it on dev.coder.com for ~1 week already.

#### Does WireGuard require anything in the kernel?

Nope. Tailscale uses [wireguard-go](https://github.com/WireGuard/wireguard-go) in userspace.

#### How do multiple regions work?

Just as Tailscale describes in their post. Users will specify a region name for each of their Coder instances (in an HA deployment), and they will mesh together.
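As a rough illustration of "mesh together", the sketch below computes which peers each instance would connect to in a full mesh. The `Instance` type, the `meshPeers` helper, the example URLs, and the assumption that region names must be unique are all mine for the sake of the example; the real mesh behavior is whatever Tailscale's DERP implementation does.

```go
package main

import (
	"fmt"
	"sort"
)

// Instance is a hypothetical description of one Coder instance in an HA
// deployment: a user-chosen region name and the URL it is reachable at.
type Instance struct {
	Region string
	URL    string
}

// meshPeers returns, for every region, the URLs of all other instances.
// It assumes (for this sketch only) that region names are unique.
func meshPeers(instances []Instance) (map[string][]string, error) {
	seen := make(map[string]bool, len(instances))
	for _, in := range instances {
		if seen[in.Region] {
			return nil, fmt.Errorf("duplicate region name %q", in.Region)
		}
		seen[in.Region] = true
	}
	peers := make(map[string][]string, len(instances))
	for _, a := range instances {
		peers[a.Region] = []string{}
		for _, b := range instances {
			if a.Region != b.Region {
				peers[a.Region] = append(peers[a.Region], b.URL)
			}
		}
		// Sort so the result is stable regardless of input order.
		sort.Strings(peers[a.Region])
	}
	return peers, nil
}

func main() {
	peers, err := meshPeers([]Instance{
		{Region: "us-east", URL: "https://us.coder.example.com"},
		{Region: "eu-west", URL: "https://eu.coder.example.com"},
		{Region: "ap-south", URL: "https://ap.coder.example.com"},
	})
	if err != nil {
		panic(err)
	}
	for region, urls := range peers {
		fmt.Println(region, "->", urls)
	}
}
```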

#### What firewall ports are required?

None beyond the HTTP(S) server Coder already exposes, nothing extra!
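A minimal sketch of why no extra ports are needed: the relay is just another route on the same HTTP listener. The `/api/` and `/derp` paths, the handler bodies, and the `fetchFromSingleListener` helper are hypothetical placeholders; a real DERP handler upgrades the connection rather than returning a plain body.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newMux mounts a placeholder API handler and a placeholder relay handler
// on one mux, so both are served from a single listener and port.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "api")
	})
	// Stand-in for the relay endpoint; real DERP would hijack the connection.
	mux.HandleFunc("/derp", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "relay")
	})
	return mux
}

// fetchFromSingleListener starts one test server and requests the given
// path from it, returning the response body.
func fetchFromSingleListener(path string) (string, error) {
	srv := httptest.NewServer(newMux())
	defer srv.Close()

	resp, err := http.Get(srv.URL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	for _, path := range []string{"/api/status", "/derp"} {
		body, err := fetchFromSingleListener(path)
		if err != nil {
			panic(err)
		}
		fmt.Println(path, "->", body)
	}
}
```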

#### Does this require an external IP address? (if workspaces are running on the same node as Coder)

Nope. Just like the current networking stack, everything should _just work_ locally too.

Switch to Tailscale for agent networking #2779
