fix: change tailnet AwaitReachable to wait for wireguard #8492


Closed
spikecurtis wants to merge 10 commits from the spike/7388-tcp-wait-for-wg branch

Conversation

spikecurtis (Contributor, author):

Should help address TCP hangs described in #7388 (comment)

Prior to this change, AwaitReachable() waited for a tailnet Disco ping to succeed, which indicates connectivity over either DERP or UDP. However, that does not mean we can exchange end-user packets over the link, because the encrypted wireguard tunnel must also come up.

With this change, we wait for a wireguard handshake instead. Since a successful handshake involves both a request and a response, it also implies that DERP or UDP connectivity is present.
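
For illustration, here is a rough sketch of the shape of the new wait loop, based on the review snippets further down in this conversation (getNodeKeysForIP is the helper introduced in this PR; the final code differs in the details):

    func (c *Conn) awaitWireguard(ctx context.Context, ip netip.Addr) bool {
        tkr := time.NewTicker(50 * time.Millisecond)
        defer tkr.Stop()
        for {
            select {
            case <-ctx.Done():
                return false
            case <-tkr.C:
            }
            keys := c.getNodeKeysForIP(ip) // node keys seen for this peer IP
            if len(keys) == 0 {
                continue // we haven't received the peer's node yet
            }
            s := c.Status() // *ipnstate.Status from the wrapped tailscale engine
            for _, k := range keys {
                // A nonzero LastHandshake means the wireguard tunnel is up,
                // which in turn implies DERP/UDP connectivity is working.
                if ps, ok := s.Peer[k]; ok && !ps.LastHandshake.IsZero() {
                    return true
                }
            }
        }
    }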

tailnet/conn.go (Outdated)

    // of a connection cause it to hang for an unknown
    // reason. TODO: @kylecarbs debug this!
    - KeepAlive: ok && peerStatus.Active,
    + KeepAlive: true,

spikecurtis (author):

A note for reviewers: I think connections were hanging because of the Tailscale "trimming" of the WG nodes, which I have disabled in this PR.

Member:

I wonder if this could be the underlying cause of the test failures? I'm seeing a lot of changes that I wouldn't expect to be needed, so thought I'd point it out.

spikecurtis (author):

Turning this on is what triggers wireguard to go and proactively attempt to connect to a peer, rather than just passively waiting for the peer to connect.

I am seeing some complaints in the test about peers getting "invalid response messages". I can't help but wonder if it's related to each side attempting to initiate a handshake at the same time. Maybe that's what's delaying the handshakes by 5 seconds much more often.

I could make it so that only the clients turn this on...

spikecurtis (author):

07df3ab

I'm out of time before my vacation. We can leave this until I get back, or someone else can take over.

mafredri (Member) left a review:

Nice work, this looks good to me 👍🏻. Approved, but I'd love for @kylecarbs or @coadler to give this a once-over too.

    func (c *Conn) AwaitReachable(ctx context.Context, ip netip.Addr) bool {
        rightAway := make(chan struct{}, 1)
        rightAway <- struct{}{}
        tkr := time.NewTicker(50 * time.Millisecond)

mafredri (Member) commented Jul 13, 2023:

Is 50ms a bit high to start off with? In non-ideal conditions every connection attempt may have an artificial delay of 50ms. A backoff could still be useful here.

Edit: Also see #8492 (comment)
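
For illustration, a minimal sketch of starting with a shorter interval and backing off (hypothetical; wireguardUp stands in for the status check used in the loop above):

    // Poll quickly at first so the common case resolves fast, then back off
    // so an unreachable peer doesn't keep a tight loop running.
    interval := 10 * time.Millisecond
    const maxInterval = 500 * time.Millisecond
    tmr := time.NewTimer(interval)
    defer tmr.Stop()
    for {
        select {
        case <-ctx.Done():
            return false
        case <-tmr.C:
        }
        if c.wireguardUp(ip) { // hypothetical helper; see the status check above
            return true
        }
        interval *= 2
        if interval > maxInterval {
            interval = maxInterval
        }
        tmr.Reset(interval)
    }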

spikecurtis (author):

50ms was the initial interval from the ping-based code, so I just kept it. What do you think is more reasonable?

Member:

That’s fair. And good question. I suppose something in the 15-25ms range would be nice. But I’m mostly basing this on what I feel a fast non-local ping is like.

        case <-ctx.Done():
            return false
        case <-tkr.C:
        case <-rightAway:
        }

Member:

Small change to signal intent (one-time use), this way it doesn't matter if the channel is ever closed in a refactor:

Suggested change:

    -   }
    +   }
    +   rightAway = nil

(The setup at the top could also be replaced with a closed, unbuffered channel in this case.)
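
A sketch of that closed-channel alternative (illustrative only):

    // The channel starts closed, so the first pass through the select fires
    // immediately; setting it to nil afterwards disables that case, because
    // receiving from a nil channel blocks forever.
    rightAway := make(chan struct{})
    close(rightAway)
    for {
        select {
        case <-ctx.Done():
            return false
        case <-tkr.C:
        case <-rightAway:
            rightAway = nil
        }
        // ... status check ...
    }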

            c.logger.Debug(ctx, "missing node(s) for IP", slog.F("ip", ip.String()))
            continue
        }
        s := c.Status()

Member:

I'm considering this from the "single tailnet" perspective: should we instead rely on the peers reported in the status callback (SetStatusCallback)? If this is the server and there are a lot of peers and traffic, waiting on multiple connections will mean a lot of wg state reporting and mutex locking along the way.

This approach could also let us signal this loop when a peer becomes available, instead of polling on an arbitrary timer.

Just a thought.
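
A rough sketch of the signalling idea, assuming the status callback tells us which peers have completed a handshake (all names here are illustrative, not from the PR; only the standard library's sync package is used):

    // peerSignal lets AwaitReachable block on a channel that is closed the
    // first time the wgengine status callback reports a completed handshake
    // for a peer, instead of polling Status() on a timer.
    type peerSignal struct {
        mu    sync.Mutex
        ready map[string]chan struct{} // keyed by the peer's node public key
    }

    func newPeerSignal() *peerSignal {
        return &peerSignal{ready: make(map[string]chan struct{})}
    }

    // wait returns a channel that is closed once the peer has handshaked.
    func (p *peerSignal) wait(nodeKey string) <-chan struct{} {
        p.mu.Lock()
        defer p.mu.Unlock()
        ch, ok := p.ready[nodeKey]
        if !ok {
            ch = make(chan struct{})
            p.ready[nodeKey] = ch
        }
        return ch
    }

    // markHandshaked is called from the status callback whenever a peer
    // reports a nonzero LastHandshake.
    func (p *peerSignal) markHandshaked(nodeKey string) {
        p.mu.Lock()
        defer p.mu.Unlock()
        ch, ok := p.ready[nodeKey]
        if !ok {
            ch = make(chan struct{})
            p.ready[nodeKey] = ch
        }
        select {
        case <-ch:
            // already closed
        default:
            close(ch)
        }
    }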

spikecurtis (author):

I like where your head is at.

I did think about this, but decided not to build it until/unless we get some evidence it's a problem at scale. The code I'm replacing had us sending a bunch of pings, and who knows how expensive that was! If we see extended lag when making a lot of coderd-proxied connections at scale, we can look into improving the efficiency here.

}

// getNodeKeysForIP returns all the keys for the given IP address. There could be multiple keys if the agent/client
// for the IP disconnects and reconnects, since we regenerate the node key each time.

Member:

since we regenerate the node key each time.

Should we perhaps not do this?

spikecurtis (author):

It's hard to guarantee that the agent's key will be both secure and stable across agent restarts.

Picking a random key each time makes it secure...
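
For context, generating a fresh node key is a one-liner with tailscale's types/key package, which is roughly what tailnet does when a Conn is created (sketch; the exact call site differs):

    // A new, random wireguard node key for every Conn, so a restarted agent
    // shows up as a new peer (new key) behind the same IP.
    nodePrivateKey := key.NewNode()
    nodePublicKey := nodePrivateKey.Public()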

// Our use case is different: for clients, it's a point-to-point connection to a single workspace, and lasts only as
// long as the connection. For agents, it's connections to a small number of clients (CLI or Coderd) that are being
// actively used by the end user.
envknob.Setenv("TS_DEBUG_TRIM_WIREGUARD", "false")

Contributor:

One thing that comes to mind here is that agents will forever spin trying to connect to CLI connections that have gone away. Since we don't have the ability to tell agents that a client is gone, it seems like this would pile up.

Contributor:

Also, is this something we should keep disabled on coderd's tailnet as well? I guess it's sort-of moot since we're doing our own peer expiration.

spikecurtis (author):

One thing that comes to mind here is that agents will forever spin trying to connect to CLI connections that have gone away. Since we don't have the ability to tell agents that a client is gone, it seems like this would pile up.

Yeah, they already do that #7960

We need to enhance the coordinator to revoke nodes when clients disconnect.

spikecurtis (author):

Also, is this something we should keep disabled on coderd's tailnet as well? I guess it's sort-of moot since we're doing our own peer expiration.

I don't think we want to trim wireguard on Coderd's tailnet, since that means we cannot wait for wireguard to come up before attempting to connect to the agent, and we risk hitting stalls due to an inflated retry timeout (#7388). As you mention, we have to solve the problem of when to remove agent peers in Coderd. Trimming the wireguard connections saves memory, but we'd still be leaking memory if we just never removed them from tailnet.

coadler (Contributor) commented Jul 19, 2023:

I think I've found the reason for the failing tests. It seems like there's still a race condition in handshakes causing the 5s timeout:

  1. Client wants to connect to an agent, adds the agent to its netmap and sends a handshake initiation
  2. Agent receives handshake and ignores it since it doesn't have the client in its netmap.
  3. Agent receives the client node, adds to netmap
  4. 5s timeout happens, handshake is resent
  5. Agent and client successfully handshake

I think the only way to fix this is by having the agent initiate the handshake instead of the client. The client needs to make sure it has the agent node in its netmap before propagating to the agent. I don't think there's an easy way to ensure the reverse.

After changing KeepAlive so that only the agent sets it, all tests seem to pass locally:

			KeepAlive:  c.connType == ConnTypeAgent,
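
For context, a sketch of roughly where that flag lives when tailnet converts coordinator nodes into netmap peers (simplified; connType and ConnTypeAgent are names proposed in this comment, and the surrounding fields are abbreviated):

    peer := &tailcfg.Node{
        ID:         nodeID,
        Key:        node.Key,
        DiscoKey:   node.DiscoKey,
        Addresses:  node.Addresses,
        AllowedIPs: node.AllowedIPs,
        // Only the agent keeps sessions alive and initiates handshakes. By
        // the time the agent learns about a client, the client already has
        // the agent in its netmap, so the handshake won't be ignored.
        KeepAlive: c.connType == ConnTypeAgent,
    }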

coadler (Contributor) commented Jul 20, 2023:

cc @mafredri, curious what you think of the above. I don't think this is super urgent, might just leave this until Spike gets back.

mafredri (Member):

cc @mafredri, curious what you think of the above. I don't think this is super urgent, might just leave this until Spike gets back.

I think it's an interesting find, and it makes sense. Leaving this until Spike gets back is fine, though I'd like to hear what he thinks of it as well.

I haven't seen many CI failures pointing towards this recently either, so I'd agree it's not super urgent.

@github-actions github-actions bot added the stale This issue is like stale bread. label Jul 28, 2023
@coadler coadler removed the stale This issue is like stale bread. label Jul 28, 2023

spikecurtis (author):

I think I've found the reason for the failing tests. It seems like there's still a race condition in handshakes causing the 5s timeout:

  1. Client wants to connect to an agent, adds the agent to its netmap and sends a handshake initiation
  2. Agent receives handshake and ignores it since it doesn't have the client in its netmap.
  3. Agent receives the client node, adds to netmap
  4. 5s timeout happens, handshake is resent
  5. Agent and client successfully handshake

I think the only way to fix this is by having the agent initiate the handshake instead of the client. The client needs to make sure it has the agent node in its netmap before propagating to the agent. I don't think there's an easy way to ensure the reverse.

After changing KeepAlive so that only the agent sets it, all tests seem to pass locally.

This is good analysis. Before this change, we'd wait for a Disco pong, then start the TCP session, and this would trigger the wireguard session to come up. Since we got a pong, we know the agent had received our node from the coordinator, so it mostly worked.

So, the basic issue is that we want to ensure that whichever peer initiates the handshake, the other peer has programmed wireguard to accept the handshake before we send it.

Option 1: (as you suggested) Have the agent initiate the handshake and the client only send its node after it has programmed the agent node into wireguard.

I see two issues:

  1. This works fine for end-user clients, but I don't think it's compatible with the Coderd single-tailnet, since node updates happen asynchronously from connecting to agents.
  2. It also means that the agent will continue to send handshakes after clients disconnect. It's a nice property of the client initiating the handshake that when the client goes away, so do the handshakes.

Option 2: Reinstate the Disco ping/pong prior to sending the wireguard handshake. There is still a race condition, because tailscale programs the peer into the Disco layer before it programs wireguard. But I expect it to work pretty well, because before our handshake reaches the peer, three things have to happen in sequence:

  1. transmit the pong over the network
  2. reconfigure wireguard to enable the handshake
  3. transmit the handshake over the network

whereas the other side only needs to reconfigure wireguard before the handshake arrives. So the other side should mostly win the race and we won't have to wait 5s. Even if it loses the race, the handshake will be retried after 5s and we'll be OK, but we will have to make test timeouts long enough to account for this.

Option 3: Enhance the coordinator to send confirmations. So, the client would send a node update, and the agent would program wireguard and then send a confirmation for that node update. Then, the client could wait for a confirmation before initiating the wireguard handshake.

The confirmations are asymmetric in that the agent doesn't care when clients program its node into wireguard --- it passively waits for handshakes.

Obviously, I don't think Option 3 is in scope of this PR. And, I'm nervous about adding a bunch of complexity to the coordinator: edge cases of multiple agent connections are particularly gnarly for confirmations.
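
Purely to make Option 3 concrete (nothing like this is implemented or proposed in code here), the coordinator protocol could carry a small acknowledgement message:

    // Hypothetical: sent back through the coordinator once a peer has
    // programmed the other side's node into wireguard. A client would wait
    // for this before initiating the handshake.
    type nodeUpdateAck struct {
        // AckedNodeKey is the public node key from the update being acknowledged.
        AckedNodeKey string `json:"acked_node_key"`
    }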

coadler (Contributor) commented Aug 3, 2023:

One interesting thing is that it looks like Tailscale removed KeepAlive ~1 month ago: tailscale/tailscale@7b1c3df

I think this essentially makes option 2 the only option for now, unless we want to maintain the keep-alive behavior on our fork.

spikecurtis (author):

One interesting thing is that it looks like Tailscale removed KeepAlive ~1 month ago: tailscale/tailscale@7b1c3df

I think this essentially makes option 2 the only option for now, unless we want to maintain the keep-alive behavior on our fork.

That's going to basically torpedo this whole effort if we upgrade to the latest Tailscale. Tailscale.com has this mechanism to only program Wireguard nodes and enable handshakes if there is active traffic sent to them. The KeepAlive was a mechanism to explicitly enable handshakes, but Tailscale.com doesn't use it, so they removed it.

That means the only way to get Tailscale to enable Wireguard is to send traffic. This is a layering violation, IMO, since you're using a higher layer to trigger bringing up a lower layer.

We can't just start the requested TCP session and let that be the triggering traffic, since that skews the initial round-trip-time measurement while Wireguard comes up. But we could add a simple UDP echo endpoint to the agent and ping it instead of relying on Disco pings for AwaitReachable(). That would serve a dual role: it sends traffic, which convinces Tailscale to bring up Wireguard, and because we are pinging at the IP layer (not the Disco layer), a response tells us the wireguard session is up.
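
A minimal sketch of such an echo endpoint on the agent (illustrative only; the listen address and port would need to be decided):

    // Echo every datagram back to its sender. A client-side UDP "ping" to
    // this endpoint only gets a reply once wireguard is passing traffic.
    func serveUDPEcho(conn *net.UDPConn) error {
        buf := make([]byte, 512)
        for {
            n, addr, err := conn.ReadFromUDP(buf)
            if err != nil {
                return err
            }
            if _, err := conn.WriteToUDP(buf[:n], addr); err != nil {
                return err
            }
        }
    }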

@coadler WDYT? If the UDP echo thing sounds good I'll close out this PR and restart down that path. If we continue with this PR I don't think we'll be able to upgrade to the latest Tailscale without ripping it out.

coadler (Contributor) commented Aug 4, 2023:

@coadler WDYT? If the UDP echo thing sounds good I'll close out this PR and restart down that path. If we continue with this PR I don't think we'll be able to upgrade to the latest Tailscale without ripping it out.

Tailscale has multiple different ping types: https://github.com/tailscale/tailscale/blob/b940c19ca61bcd5650406e54f090da437796de0e/tailcfg/tailcfg.go#L1413. We could potentially use TSMP to ensure the IP layer is up. Otherwise, I think the UDP endpoint is a good option.
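
For reference, the ping types defined in tailcfg at that link are roughly: PingDisco (disco layer only, doesn't prove wireguard is up), PingTSMP (sent over the tunnel's IP layer, so a reply implies the wireguard session is established), PingICMP, and PingPeerAPI. A hypothetical TSMP-based check could look like this (pingPeer is an illustrative wrapper, not an existing function):

    // Hypothetical: ask the engine for a TSMP ping; a reply means the IP
    // layer inside the tunnel, and therefore wireguard, is up.
    if err := c.pingPeer(ctx, ip, tailcfg.PingTSMP); err == nil {
        return true
    }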

Are we potentially able to salvage the ps.LastHandshake change from here? I think that would still be a good addition on its own, so we could bypass some unnecessary pings.

Also just FYI, I'm hoping to bring us to the latest Tailscale release early next week: #8913

spikecurtis (author):

@coadler I will play around with the different built-in tailscale pings. It will be great if we don't have to implement the ping ourselves. I'm not sure what TSMP is --- presumably Tailscale management protocol or something. It says it runs over IP, but since this is a tunnel there are multiple IP layers involved. If it runs over the overlay IP layer, we should be good.

I also agree it would be good to check ps.LastHandshake before trying to ping. In the coderd single tailnet use case, we will dial the agent multiple times, so if we already know wireguard is up to the peer, we can skip pings. Of course, in the coder ssh use case, we only dial the agent once, so we'll ping every time.
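
A sketch of that fast path (illustrative; reuses the status/key lookup from earlier in this PR):

    // Skip the ping entirely if a wireguard handshake already exists for this
    // peer, which is the common case for coderd's single tailnet.
    s := c.Status()
    for _, k := range c.getNodeKeysForIP(ip) {
        if ps, ok := s.Peer[k]; ok && !ps.LastHandshake.IsZero() {
            return true
        }
    }
    // Otherwise fall back to pinging to bring the tunnel up.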

@github-actions github-actions bot added the stale This issue is like stale bread. label Aug 13, 2023
coadler added a commit that referenced this pull request Aug 15, 2023
Since Tailscale removed keep-alives, it seems like open but idle connections (SSH, port-forward, etc.) can get trimmed fairly easily, causing hangs for a few seconds while the connection is set up again.

This was taken from Spike's PR #8492

Co-authored-by: Spike Curtis <[email protected]>
@github-actions github-actions bot closed this Aug 17, 2023
@github-actions github-actions bot deleted the spike/7388-tcp-wait-for-wg branch January 20, 2024 00:04