fix: change tailnet AwaitReachable to wait for wireguard #8492

Conversation
tailnet/conn.go (outdated)
  // of a connection cause it to hang for an unknown
  // reason. TODO: @kylecarbs debug this!
- KeepAlive: ok && peerStatus.Active,
+ KeepAlive: true,
A note for reviewers: I think connections were hanging because of the Tailscale "trimming" of the WG nodes, which I have disabled in this PR.
I wonder if this could be the underlying cause of the test failures? I'm seeing a lot of changes that I wouldn't expect to be needed, so thought I'd point it out.
Turning this on is what triggers wireguard to go and proactively attempt to connect to a peer, rather than just passively waiting for the peer to connect.
I am seeing some complaints in the test about peers getting "invalid response messages". I can't help but wonder if it's related to each side attempting to initiate a handshake at the same time; maybe that's delaying handshakes by 5 seconds much more often.
I could make it so that only the clients turn this on...
I'm out of time before my vacation. We can leave this until I get back, or someone else can take over.
Nice work, this looks good to me 👍🏻. Approved, but I'd love for @kylecarbs or @coadler to give this a once-over too.
func (c *Conn) AwaitReachable(ctx context.Context, ip netip.Addr) bool {
	rightAway := make(chan struct{}, 1)
	rightAway <- struct{}{}
	tkr := time.NewTicker(50 * time.Millisecond)
Is 50ms a bit high to start off with? In non-ideal conditions every connection attempt may have an artificial delay of 50ms. A backoff could still be useful here.
Edit: Also see #8492 (comment)
50ms was the initial interval from the ping-based code, so I just kept it. What do you think is more reasonable?
That’s fair. And good question. I suppose something in the 15-25ms range would be nice. But I’m mostly basing this on what I feel a fast non-local ping is like.
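For illustration only, here is a rough sketch of what a backoff-based variant of this wait loop could look like. The starting interval, multiplier, and cap are made-up numbers for the discussion above, not values this PR settled on, and isReachable is a hypothetical stand-in for the real wireguard/status check.

```go
package tailnetsketch

import (
	"context"
	"time"
)

// awaitReachableWithBackoff is a sketch: check immediately, then retry with
// an exponentially growing delay instead of a fixed 50ms ticker.
func awaitReachableWithBackoff(ctx context.Context, isReachable func() bool) bool {
	const (
		initial = 20 * time.Millisecond
		maxWait = 500 * time.Millisecond
	)
	delay := time.Duration(0) // first check happens right away
	for {
		select {
		case <-ctx.Done():
			return false
		case <-time.After(delay):
		}
		if isReachable() {
			return true
		}
		// Grow the delay, capped so we keep probing reasonably often.
		if delay == 0 {
			delay = initial
		} else if delay *= 2; delay > maxWait {
			delay = maxWait
		}
	}
}
```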
	case <-ctx.Done():
		return false
	case <-tkr.C:
	case <-rightAway:
	}
Small change to signal intent (one-time use), this way it doesn't matter if the channel is ever closed in a refactor:
	}
	rightAway = nil
(The setup at the top could also be replaced with a closed unbuffered chan in this case.)
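A self-contained sketch of the suggested pattern, with check() as a hypothetical stand-in for the body of the loop: receiving from a nil channel blocks forever, so nilling rightAway after its first use makes the one-shot intent explicit and stays safe even if the channel were later closed in a refactor.

```go
package tailnetsketch

import (
	"context"
	"time"
)

func waitLoop(ctx context.Context, check func() bool) bool {
	rightAway := make(chan struct{}, 1)
	rightAway <- struct{}{} // pre-load so the first iteration doesn't wait for the ticker
	tkr := time.NewTicker(50 * time.Millisecond)
	defer tkr.Stop()
	for {
		select {
		case <-ctx.Done():
			return false
		case <-tkr.C:
		case <-rightAway:
		}
		rightAway = nil // one-time use: a nil channel never becomes ready again
		if check() {
			return true
		}
	}
}
```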
		c.logger.Debug(ctx, "missing node(s) for IP", slog.F("ip", ip.String()))
		continue
	}
	s := c.Status()
I'm considering this from the "single tailnet" perspective: should we instead rely on the peers reported in the status callback (SetStatusCallback)? Let's say this is the server and there are a lot of peers and traffic; then waiting for multiple connections will result in a lot of wg state reporting/mutex locking along the way.
This method could also let us signal this loop when a peer is available instead of polling it on an arbitrary timer.
Just a thought.
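To make the suggestion concrete, here is a hedged sketch of a callback-driven wait, written independently of tailscale's actual callback types: the status callback records which peers have a wireguard handshake and wakes waiters through a channel, so AwaitReachable could block on a notification instead of polling on a timer. All names here are illustrative.

```go
package tailnetsketch

import (
	"context"
	"net/netip"
	"sync"
)

type reachWaiter struct {
	mu      sync.Mutex
	up      map[netip.Addr]bool
	changed chan struct{} // closed and replaced whenever the peer set changes
}

func newReachWaiter() *reachWaiter {
	return &reachWaiter{up: map[netip.Addr]bool{}, changed: make(chan struct{})}
}

// onStatus would be invoked from the engine's status callback with whatever
// per-peer handshake information it reports.
func (r *reachWaiter) onStatus(ip netip.Addr, handshakeOK bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.up[ip] = handshakeOK
	close(r.changed) // wake all current waiters
	r.changed = make(chan struct{})
}

// await blocks until the peer is reported up or the context is done.
func (r *reachWaiter) await(ctx context.Context, ip netip.Addr) bool {
	for {
		r.mu.Lock()
		ok := r.up[ip]
		ch := r.changed
		r.mu.Unlock()
		if ok {
			return true
		}
		select {
		case <-ctx.Done():
			return false
		case <-ch:
		}
	}
}
```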
I like where your head is at.
I did think about this, but decided not to build it until/unless we get some evidence it's a problem at scale. The code I'm replacing had us sending a bunch of pings, and who knows how expensive that was! If we see extended lag when making a lot of coderd-proxied connections at scale, we can look into improving the efficiency here.
}

// getNodeKeysForIP returns all the keys for the given IP address. There could be multiple keys if the agent/client
// for the IP disconnects and reconnects, since we regenerate the node key each time.
since we regenerate the node key each time.
Should we perhaps not do this?
Hard to guarantee the agent's key will both be secure and stable across agent restarts.
Picking a random key each time makes it secure...
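To illustrate the trade-off, here is a hedged sketch of the "fresh node key on every start" behavior using tailscale's types/key package (assuming key.NewNode, which generates a new node private key). Making the key stable across restarts would mean persisting and protecting this private key, which is exactly the hard part alluded to above.

```go
package tailnetsketch

import (
	"tailscale.com/types/key"
)

// newNodeKey mimics what happens on each agent/client start: a brand new
// node key pair, so the public key the coordinator (and getNodeKeysForIP)
// sees is different after every restart.
func newNodeKey() (priv key.NodePrivate, pub key.NodePublic) {
	priv = key.NewNode()
	return priv, priv.Public()
}
```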
// Our use case is different: for clients, it's a point-to-point connection to a single workspace, and lasts only as
// long as the connection. For agents, it's connections to a small number of clients (CLI or Coderd) that are being
// actively used by the end user.
envknob.Setenv("TS_DEBUG_TRIM_WIREGUARD", "false")
One thing that comes to mind here is that agents will forever spin trying to connect to CLI connections that have gone away. Since we don't have the ability to tell agents that a client is gone, it seems like this would pile up.
Also, is this something we should keep disabled on coderd's tailnet as well? I guess it's sort-of moot since we're doing our own peer expiration.
One thing that comes to mind here is that agents will forever spin trying to connect to CLI connections that have gone away. Since we don't have the ability to tell agents that a client is gone, it seems like this would pile up.
Yeah, they already do that #7960
We need to enhance the coordinator to revoke nodes when clients disconnect.
Also, is this something we should keep disabled on coderd's tailnet as well? I guess it's sort-of moot since we're doing our own peer expiration.
I don't think we want to trim wireguard on Coderd's tailnet, since that means we cannot wait for wireguard to come up before attempting to connect to the agent, and we risk hitting stalls due to an inflated retry timeout (#7388). As you mention, we have to solve the problem of when to remove agent peers in Coderd. Trimming the wireguard connections saves memory, but we'd still be leaking memory if we just never removed them from tailnet.
I think I've found the reason for the failing tests. It seems like there's still a race condition in handshakes causing the 5s timeout.

I think the only way to fix this is by having the agent initiate the handshake instead of the client. The client needs to make sure it has the agent node in its netmap before propagating to the agent; I don't think there's an easy way to ensure the reverse. After changing to KeepAlive: c.connType == ConnTypeAgent, …
cc @mafredri, curious what you think of the above. I don't think this is super urgent, might just leave this until Spike gets back.

I think it's an interesting find, and it makes sense. I think leaving this until Spike gets back is fine, though I'd like to hear what he thinks of it as well. I haven't seen too many CI failures pointing towards this either now, so I'd agree it's not super urgent.
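A hedged sketch of the KeepAlive: c.connType == ConnTypeAgent idea, assuming the tailcfg.Node KeepAlive field that this vintage of Tailscale still had; the ConnType constants and the conversion function are simplified stand-ins for the real coder/tailnet code.

```go
package tailnetsketch

import (
	"tailscale.com/tailcfg"
	"tailscale.com/types/key"
)

type ConnType int

const (
	ConnTypeClient ConnType = iota
	ConnTypeAgent
)

// peerNode builds a netmap entry for a remote peer. Only the agent side
// sets KeepAlive, so just one side proactively initiates handshakes and the
// two peers don't race to handshake at the same time.
func peerNode(connType ConnType, peerKey key.NodePublic) *tailcfg.Node {
	return &tailcfg.Node{
		Key:       peerKey,
		KeepAlive: connType == ConnTypeAgent,
	}
}
```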
This is good analysis. Before this change, we'd wait for a Disco pong, then start the TCP session, and this would trigger the wireguard session to come up. Since we got a pong, we know the agent had received our node from the coordinator, so it mostly worked. So, the basic issue is that we want to ensure that, whichever peer initiates the handshake, the other peer has programmed wireguard to accept the handshake before we send it.

Option 1 (as you suggested): Have the agent initiate the handshake, and have the client only send its node after it has programmed the agent node into wireguard. I see a couple of issues with this.

Option 2: Reinstate the Disco ping/pong prior to sending the wireguard handshake. There is still a race condition, because tailscale programs the peer into the Disco layer before it programs wireguard. But I expect it to work pretty well, because the initiating side has to complete a full Disco ping/pong round trip before it sends the handshake, whereas the other side just needs to reconfigure wireguard before the handshake arrives. So it should mostly win the race and we don't have to wait 5s. Even if it loses the race, it'll retry after 5s and we'll be OK, but we will have to make test timeouts long enough to account for this.

Option 3: Enhance the coordinator to send confirmations. The client would send a node update, the agent would program wireguard and then send a confirmation for that node update, and the client would wait for the confirmation before initiating the wireguard handshake. The confirmations are asymmetric, in that the agent doesn't care when clients program its node into wireguard: it passively waits for handshakes. Obviously, I don't think Option 3 is in scope of this PR, and I'm nervous about adding a bunch of complexity to the coordinator: the edge cases around multiple agent connections are particularly gnarly for confirmations.
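Purely to make Option 3 concrete (it is explicitly out of scope here), a hypothetical shape for the confirmation exchange; these message types are invented for illustration and are not part of the real coordinator protocol.

```go
package tailnetsketch

// NodeUpdate is roughly what a peer already sends through the coordinator:
// its latest node, tagged here with a per-sender sequence number.
type NodeUpdate struct {
	Seq  int64  // monotonically increasing per sender
	Node []byte // serialized node (keys, endpoints, DERP region)
}

// NodeAck would be the new part: sent by the agent once the node from the
// NodeUpdate with the matching Seq has been programmed into wireguard. The
// client holds off on its wireguard handshake until it sees the ack for its
// latest Seq. The agent side never waits for acks, matching the asymmetry
// described above.
type NodeAck struct {
	Seq int64
}
```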
One interesting thing is that it looks like Tailscale removed keep-alives upstream. I think this essentially makes option 2 the only option for now, unless we want to maintain the keep-alive behavior on our fork.
That's going to basically torpedo this whole effort if we upgrade to the latest Tailscale. Tailscale.com has this mechanism to only program Wireguard nodes and enable handshakes if there is active traffic sent to them. That means the only way to get Tailscale to enable Wireguard is to send traffic. This is a layering violation, IMO, since you're using a higher layer to trigger bringing up a lower layer.

We can't start the requested TCP session, since this skews the initial round-trip-time measurement while Wireguard comes up. But we could add a simple UDP echo endpoint to the agent, and then ping it instead of relying on Disco pings for AwaitReachable(). This would have the dual role of sending traffic, to convince Tailscale to bring up Wireguard, and, since we are pinging at the IP layer (not the Disco layer), when we get a response we know the wireguard session is up.

@coadler WDYT? If the UDP echo thing sounds good, I'll close out this PR and restart down that path. If we continue with this PR, I don't think we'll be able to upgrade to the latest Tailscale without ripping it out.
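A hedged sketch of the UDP echo idea: the agent listens on a well-known port inside the tailnet and echoes datagrams back, and the client treats a reply as proof that the overlay IP path (and therefore wireguard) is up. The port, payload, and use of the host network here are illustrative; real code would listen via the tailnet's netstack.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// runEchoServer is what the agent side might run: echo every datagram back.
// The echoed traffic is also what would coax Tailscale into bringing up the
// wireguard session for the peer.
func runEchoServer(ctx context.Context, addr string) error {
	pc, err := net.ListenPacket("udp", addr)
	if err != nil {
		return err
	}
	defer pc.Close()
	go func() { <-ctx.Done(); pc.Close() }()
	buf := make([]byte, 512)
	for {
		n, from, err := pc.ReadFrom(buf)
		if err != nil {
			return err
		}
		if _, err := pc.WriteTo(buf[:n], from); err != nil {
			return err
		}
	}
}

// awaitEcho is what the client side might do instead of a Disco ping: keep
// sending small datagrams until one is echoed back, which proves end-to-end
// connectivity at the IP layer rather than only at the Disco layer.
func awaitEcho(ctx context.Context, addr string) bool {
	for ctx.Err() == nil {
		conn, err := net.Dial("udp", addr)
		if err == nil {
			_ = conn.SetDeadline(time.Now().Add(time.Second))
			_, _ = conn.Write([]byte("ping"))
			reply := make([]byte, 16)
			_, rerr := conn.Read(reply)
			conn.Close()
			if rerr == nil {
				return true
			}
		}
		select {
		case <-ctx.Done():
			return false
		case <-time.After(50 * time.Millisecond):
		}
	}
	return false
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	go func() {
		if err := runEchoServer(ctx, "127.0.0.1:40853"); err != nil {
			log.Println("echo server stopped:", err)
		}
	}()
	if awaitEcho(ctx, "127.0.0.1:40853") {
		log.Println("peer reachable: the IP path (here, loopback) is up")
	}
}
```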
Tailscale has multiple different ping types: https://github.com/tailscale/tailscale/blob/b940c19ca61bcd5650406e54f090da437796de0e/tailcfg/tailcfg.go#L1413. We could potentially use TSMP pings instead. Are we potentially able to salvage the keep-alive behavior? Also just FYI, I'm hoping to bring us to the latest Tailscale release early next week: #8913
@coadler I will play around with the different built-in tailscale pings. It would be great if we don't have to implement the ping ourselves. I'm not sure what TSMP is --- presumably Tailscale management protocol or something. It says it runs over IP, but since this is a tunnel, there are multiple IP layers involved. If it runs over the overlay IP layer, we should be good. I also agree it would be good to check that.
Since Tailscale removed keep-alives, it seems like open but idle connections (SSH, port-forward, etc.) can get trimmed fairly easily, causing hangs for a few seconds while the connection is set up again. This was taken from Spike's PR #8492.

Co-authored-by: Spike Curtis <[email protected]>
Should help address TCP hangs described in #7388 (comment)
Prior to this change, AwaitReachable() waited for a tailnet Disco ping to succeed, which indicates successful connectivity either over DERP or UDP. However, this does not indicate that we are able to exchange end-user packets over the link, because an encrypted wireguard tunnel must also come up.
With this change, we wait for wireguard. Since a successful handshake includes a request and response, this implies that the DERP or UDP connectivity is also present.
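For reference, a hedged sketch of the core check this PR moves to, assuming tailscale's ipnstate.Status as returned by Conn.Status(): a peer counts as reachable once one of its node keys shows a completed wireguard handshake. The helper name and surrounding plumbing are illustrative.

```go
package tailnetsketch

import (
	"tailscale.com/ipn/ipnstate"
	"tailscale.com/types/key"
)

// wireguardUp reports whether any of the peer's node keys (there can be
// several, see getNodeKeysForIP) has completed a wireguard handshake.
func wireguardUp(st *ipnstate.Status, keys []key.NodePublic) bool {
	for _, k := range keys {
		ps, ok := st.Peer[k]
		if !ok {
			continue
		}
		// A non-zero LastHandshake means the handshake request/response has
		// completed, which also implies DERP or UDP connectivity exists.
		if !ps.LastHandshake.IsZero() {
			return true
		}
	}
	return false
}
```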