Switch to Tailscale for agent networking #2779

Closed
kylecarbs opened this issue Jul 1, 2022 · 13 comments

@kylecarbs
Member

kylecarbs commented Jul 1, 2022

Problem

Users have been reporting connection issues:

Some additional issues that haven't been reported:

  • Throughput is ~30mb/s (improving seems like it'd require substantial effort)
  • The initial connection takes a few hundred milliseconds (I believe @mafredri reported this actually)

And the debt we take on:

  • Maintaining ~3000 lines of networking code that requires very specific domain knowledge to edit
  • Intermittent disconnects have been occurring, and due to a lack of domain knowledge we have limited capacity to debug them

Definition of Done

Our current networking stack with WebRTC is not durable, fast, or resilient. To fix all the issues listed, I propose we switch to Tailscale's OSS implementation on top of WireGuard. This will free up a significant portion of the effort we currently spend on networking and let us focus on improving the fundamentals, where we have higher leverage.

Read this blog post by Tailscale to understand how it works: https://tailscale.com/blog/how-tailscale-works/

A proof of concept and partial implementation are merged into the product under a hidden flag for SSH: coder config-ssh --wireguard. It appears to resolve the problems listed above. The code can be viewed here.

FAQ

How do users upgrade from WebRTC to WireGuard?

Once we remove WebRTC, agents will be unable to connect to Coder. Those endpoints will be removed, and workspaces will require a stop -> start to begin working again. This could be eased by doing it over a few releases, but I think making it explicit in the release notes would be better.

Does this talk to Tailscale's hosted offering at all?

Nope. A DERP server is embedded in the Coder server, just like how TURN has worked. A user could add the Tailscale DERP mapping to their Coder deployment to enable this proxying, and we'll allow that in the future.

How does this change our architecture?

The architecture diagram here stays exactly the same. DERP is used instead of TURN, but it already uses HTTP(s) as the protocol. This should simplify and reduce proxying.

Does this still require significant domain knowledge to edit?

Yes, but the amount of code reduces substantially and the architecture maintains the same shape. The surface area of this problem should reduce.

Does this prevent us from using WebRTC from the browser for P2P connections in the future?

Yes, but due to our TURN over WebSockets proxying that hasn't been possible.

How long will we dogfood the Tailscale networking before removing WebRTC?

It's very similar architecturally, so it's not expected we'll run into significant hurdles. We'll make this the default on dogfood over break, and will likely cut a new release containing it as default the week we're back. We've been dogfooding it on dev.coder.com for ~1 week already.

Does WireGuard require anything in the kernel?

Nope. Tailscale uses wireguard-go in userspace.

How do multiple regions work?

Just as Tailscale describes in their post. Users will specify a region name for each of their Coder instances (in an HA deployment), and they will mesh together.
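
For readers following along, here is a minimal sketch of what "mesh together" could look like, assuming a full mesh in which every named region relays to every other one. The `Region` type, the `meshPeers` function, the region names, and the URLs are all hypothetical and are not Coder's or Tailscale's actual API.

```go
package main

import (
	"fmt"
	"sort"
)

// Region is a hypothetical description of one Coder instance and the
// relay embedded in it. The field names are illustrative only.
type Region struct {
	Name string // user-chosen region name
	URL  string // HTTP(S) address of that Coder instance
}

// meshPeers returns, for each region, the sorted names of every other
// region. This models a full mesh, which is an assumption of this sketch.
func meshPeers(regions []Region) map[string][]string {
	peers := make(map[string][]string, len(regions))
	for _, r := range regions {
		peers[r.Name] = []string{} // keep an entry even for a lone region
		for _, other := range regions {
			if other.Name != r.Name {
				peers[r.Name] = append(peers[r.Name], other.Name)
			}
		}
		sort.Strings(peers[r.Name])
	}
	return peers
}

func main() {
	// Example regions; names and URLs are placeholders.
	regions := []Region{
		{Name: "us-east", URL: "https://coder-us-east.example.com"},
		{Name: "eu-west", URL: "https://coder-eu-west.example.com"},
		{Name: "ap-south", URL: "https://coder-ap-south.example.com"},
	}
	for name, others := range meshPeers(regions) {
		fmt.Printf("%s meshes with %v\n", name, others)
	}
}
```

With three regions, each one ends up with the other two as peers; a single-region deployment has an empty peer list.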

What firewall ports are required?

It's just exposing an HTTP(s) server, nothing extra!

Does this require an external IP address? (if workspaces are running on the same node as Coder)

Nope. Just like the current networking stack, everything should just work locally too.

@kylecarbs kylecarbs self-assigned this Jul 1, 2022
@mafredri
Member

mafredri commented Jul 1, 2022

> Does this prevent us from using WebRTC from the browser for P2P connections in the future?

This issue tracking Tailscale in the browser could be of interest to us: tailscale/tailscale#3157.

@ntimo
Contributor

ntimo commented Jul 1, 2022

  • Another thing that would be interesting here: what firewall ports are required for this? (Both inbound and outbound)
  • And does this have an impact when running Coder on the same VM the workspace runs on? (E.g. Coder runs via Docker and is used to start containers as workspaces on the same node)

@kylecarbs
Member Author

@ntimo I've added those Q&As!

@ketang
Contributor

ketang commented Jul 1, 2022

> It appears to resolve the problems listed above

Can you provide data?

@kylecarbs
Member Author

My data is primarily empirical, based on usage of Tailscale. Their software is built to handle situations that are essentially identical to ours.

Colin ran tests for throughput, and those came out at ~1GB/s.

@ketang
Contributor

ketang commented Jul 2, 2022

1 GB/s is data!

What about initial connection setup time?

@kylecarbs
Member Author

Ahh my bad, I missed this.

The initial connection time is significantly lower, especially on slow connections. Two round-trips are required for Tailscale networking, while four are required for WebRTC.

@kylecarbs kylecarbs mentioned this issue Jul 19, 2022
20 tasks
@spikecurtis
Contributor

The blog post on how Tailscale works mentions both a DERP server and a "coordination server". Are we running both of these in Coder Server, or just DERP?

If we are not running our own coordination server, then how do clients learn the public keys for agents they want to connect to?

@spikecurtis
Contributor

How does access control work? Specifically, how do we enforce which Coder users are allowed to connect to which Agents?

@kylecarbs
Member Author

For people following this, you can use the latest version with CODER_TAILSCALE=true to test it out! It will become default once we ensure stability 😎.

@sharkymark
Contributor

> The blog post on how Tailscale works mentions both a DERP server and a "coordination server". Are we running both of these in Coder Server, or just DERP?
>
> If we are not running our own coordination server, then how do clients learn the public keys for agents they want to connect to?

@kylecarbs I was reading the Tailscale doc too - is a coordination server built into Coder along with an embedded DERP?

@kylecarbs
Member Author

@sharkymark yup!

@bpmct
Member

bpmct commented Sep 16, 2022

This is enabled by default in Coder v0.8.15! Closing :)
