Switch to Tailscale for agent networking #2779
Comments
This issue tracking Tailscale in the browser could be of interest to us: tailscale/tailscale#3157.
@ntimo I've added those Q&As!
Can you provide data?
My data is primarily empirical based on usage of Tailscale. Their software is built to handle situations that are essentially identical. Colin ran tests for throughput, and those came out at ~1GB/s.
1 GB/s is data! What about initial connection setup time?
Ahh my bad, I missed this. The initial connection time is significantly lower, especially on slow connections. Two round-trips are required for Tailscale networking, while four are required for WebRTC.
The blog post on how Tailscale works mentions both a DERP server and a "coordination server". Are we running both of these in Coder Server, or just DERP? If we are not running our own coordination server, then how do clients learn the public keys for agents they want to connect to?
How does access control work? Specifically, how do we enforce which Coder users are allowed to connect to which Agents?
For people following this, you can use the latest version with
@kylecarbs I was reading the Tailscale doc too - is a coordination server built into coder along with an embedded DERP?
@sharkymark yup!
This is enabled by default in Coder v0.8.15! Closing :)
Problem
Users have been reporting connection issues:
coder port-forward and localhost:3000 freezes if web app source code is updated #2658
Some additional issues that haven't been reported:
And the debt we take on:
Definition of Done
Our current networking stack with WebRTC is not durable, fast, or resilient. To fix all the issues listed, I propose we switch to using Tailscale's OSS implementation on top of Wireguard. This will remove a significant portion of our effort being spent on networking, and let us focus on improving the fundamentals where we have higher leverage.
Read this blog post by Tailscale to understand how it works: https://tailscale.com/blog/how-tailscale-works/
A proof of concept and partial implementation are merged into the product under a hidden flag for SSH: coder config-ssh --wireguard. It appears to resolve the problems listed above. The code can be viewed here.
FAQ
How do users upgrade from WebRTC to Wireguard?
Once we remove WebRTC, agents still using it will be unable to connect to Coder. Those endpoints will be removed, and workspaces will require a stop -> start to begin working again. This could be eased by phasing the removal over a few releases, but I think making it explicit in the release notes would be better.
Does this talk to Tailscale's hosted offering at all?
Nope. A DERP server is embedded in the Coder server, just like how TURN has worked. A user could add the Tailscale DERP mapping to their Coder deployment to enable this proxying, and we'll allow that in the future.
How does this change our architecture?
The architecture diagram here stays exactly the same. DERP is used instead of TURN, but it already uses HTTP(s) as the protocol. This should simplify and reduce proxying.
Does this still require significant domain knowledge to edit?
Yes, but the amount of code shrinks substantially and the architecture keeps the same shape. The surface area of this problem should shrink with it.
Does this prevent us from using WebRTC from the browser for P2P connections in the future?
Yes, but that hasn't been possible anyway because of our TURN over WebSockets proxying.
How long will we dogfood the Tailscale networking before removing WebRTC?
It's very similar architecturally, so it's not expected we'll run into significant hurdles. We'll make this the default on dogfood over break, and will likely cut a new release containing it as default the week we're back. We've been dogfooding it on dev.coder.com for ~1 week already.
Does Wireguard require anything in the kernel?
Nope. Tailscale uses wireguard-go in userspace.
How do multiple regions work?
Just as Tailscale describes in their post. Users will specify a region name for each of their Coder instances (in an HA deployment), and they will mesh together.
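As a rough illustration of what "mesh together" could mean, the sketch below enumerates the links in a full mesh of instances. This is an assumption for illustration only: the `instance` type, the region names, the URLs, and the full-mesh topology are all hypothetical and not taken from the Coder or Tailscale code.

```go
package main

import "fmt"

// instance is a hypothetical description of one Coder server in an HA deployment.
type instance struct {
	Region string // user-chosen region name
	URL    string // where this instance is reachable (example values only)
}

// meshLinks returns every unordered pair of instances, i.e. the links a
// full mesh would need. For n instances this yields n*(n-1)/2 pairs.
func meshLinks(instances []instance) [][2]instance {
	var links [][2]instance
	for i := range instances {
		for j := i + 1; j < len(instances); j++ {
			links = append(links, [2]instance{instances[i], instances[j]})
		}
	}
	return links
}

func main() {
	instances := []instance{
		{Region: "us-east", URL: "https://coder-use.example.com"},
		{Region: "eu-west", URL: "https://coder-euw.example.com"},
		{Region: "ap-south", URL: "https://coder-aps.example.com"},
	}
	for _, l := range meshLinks(instances) {
		fmt.Printf("%s <-> %s\n", l[0].Region, l[1].Region)
	}
}
```

With three regions this prints three links; the count grows quadratically, which is worth keeping in mind if a deployment has many regions.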
What firewall ports are required?
It's just exposing an HTTP(s) server, nothing extra!
Does this require an external IP address? (if workspaces are running on the same node as Coder)
Nope. Just like the current networking stack, everything should just work locally too.