fix: change tailnet AwaitReachable to wait for wireguard #8492


Closed
spikecurtis wants to merge 10 commits from the spike/7388-tcp-wait-for-wg branch

Conversation

spikecurtis (Contributor, author):

Should help address TCP hangs described in #7388 (comment)

Prior to this change, AwaitReachable() waited for a tailnet Disco ping to succeed, which indicates connectivity over either DERP or UDP. However, that does not mean we can exchange end-user packets over the link, because the encrypted wireguard tunnel must also come up.

With this change, we wait for a wireguard handshake instead. Since a successful handshake involves both a request and a response, it also implies that DERP or UDP connectivity is present.
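
For illustration, here is a rough sketch of the shape of the new wait loop, based on the review snippets further down in this conversation (getNodeKeysForIP is the helper introduced in this PR; the final code differs in the details):

    func (c *Conn) awaitWireguard(ctx context.Context, ip netip.Addr) bool {
        tkr := time.NewTicker(50 * time.Millisecond)
        defer tkr.Stop()
        for {
            select {
            case <-ctx.Done():
                return false
            case <-tkr.C:
            }
            keys := c.getNodeKeysForIP(ip) // node keys seen for this peer IP
            if len(keys) == 0 {
                continue // we haven't received the peer's node yet
            }
            s := c.Status() // *ipnstate.Status from the wrapped tailscale engine
            for _, k := range keys {
                // A nonzero LastHandshake means the wireguard tunnel is up,
                // which in turn implies DERP/UDP connectivity is working.
                if ps, ok := s.Peer[k]; ok && !ps.LastHandshake.IsZero() {
                    return true
                }
            }
        }
    }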

tailnet/conn.go (Outdated)

    // of a connection cause it to hang for an unknown
    // reason. TODO: @kylecarbs debug this!
    - KeepAlive: ok && peerStatus.Active,
    + KeepAlive: true,

spikecurtis (author):

A note for reviewers: I think connections were hanging because of the Tailscale "trimming" of the WG nodes, which I have disabled in this PR.

Member:

I wonder if this could be the underlying cause of the test failures? I'm seeing a lot of changes that I wouldn't expect to be needed, so thought I'd point it out.

spikecurtis (author):

Turning this on is what triggers wireguard to go and proactively attempt to connect to a peer, rather than just passively waiting for the peer to connect.

I am seeing some complaints in the test about peers getting "invalid response messages". I can't help but wonder if it's related to each side attempting to initiate a handshake at the same time. Maybe that's what's delaying the handshakes by 5 seconds much more often.

I could make it so that only the clients turn this on...

spikecurtis (author):

07df3ab

I'm out of time before my vacation. We can leave this until I get back, or someone else can take over.

mafredri (Member) left a review:

Nice work, this looks good to me 👍🏻. Approved, but I'd love for @kylecarbs or @coadler to give this a once-over too.

    func (c *Conn) AwaitReachable(ctx context.Context, ip netip.Addr) bool {
        rightAway := make(chan struct{}, 1)
        rightAway <- struct{}{}
        tkr := time.NewTicker(50 * time.Millisecond)

mafredri (Member) commented Jul 13, 2023:

Is 50ms a bit high to start off with? In non-ideal conditions every connection attempt may have an artificial delay of 50ms. A backoff could still be useful here.

Edit: Also see #8492 (comment)
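
For illustration, a minimal sketch of starting with a shorter interval and backing off (hypothetical; wireguardUp stands in for the status check used in the loop above):

    // Poll quickly at first so the common case resolves fast, then back off
    // so an unreachable peer doesn't keep a tight loop running.
    interval := 10 * time.Millisecond
    const maxInterval = 500 * time.Millisecond
    tmr := time.NewTimer(interval)
    defer tmr.Stop()
    for {
        select {
        case <-ctx.Done():
            return false
        case <-tmr.C:
        }
        if c.wireguardUp(ip) { // hypothetical helper; see the status check above
            return true
        }
        interval *= 2
        if interval > maxInterval {
            interval = maxInterval
        }
        tmr.Reset(interval)
    }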

spikecurtis (author):

50ms was the initial interval from the ping-based code, so I just kept it. What do you think is more reasonable?

Member:

That’s fair. And good question. I suppose something in the 15-25ms range would be nice. But I’m mostly basing this on what I feel a fast non-local ping is like.

        case <-ctx.Done():
            return false
        case <-tkr.C:
        case <-rightAway:
        }

Member:

Small change to signal intent (one-time use), this way it doesn't matter if the channel is ever closed in a refactor:

Suggested change:

    -   }
    +   }
    +   rightAway = nil

(The setup at the top could also be replaced with a closed, unbuffered channel in this case.)
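
A sketch of that closed-channel alternative (illustrative only):

    // The channel starts closed, so the first pass through the select fires
    // immediately; setting it to nil afterwards disables that case, because
    // receiving from a nil channel blocks forever.
    rightAway := make(chan struct{})
    close(rightAway)
    for {
        select {
        case <-ctx.Done():
            return false
        case <-tkr.C:
        case <-rightAway:
            rightAway = nil
        }
        // ... status check ...
    }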

            c.logger.Debug(ctx, "missing node(s) for IP", slog.F("ip", ip.String()))
            continue
        }
        s := c.Status()

Member:

I'm considering this from the "single tailnet" perspective: should we instead rely on the peers reported in the status callback (SetStatusCallback)? If this is the server and there are a lot of peers and traffic, waiting on multiple connections will mean a lot of wg state reporting and mutex locking along the way.

This approach could also let us signal this loop when a peer becomes available, instead of polling on an arbitrary timer.

Just a thought.
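
A rough sketch of the signalling idea, assuming the status callback tells us which peers have completed a handshake (all names here are illustrative, not from the PR; only the standard library's sync package is used):

    // peerSignal lets AwaitReachable block on a channel that is closed the
    // first time the wgengine status callback reports a completed handshake
    // for a peer, instead of polling Status() on a timer.
    type peerSignal struct {
        mu    sync.Mutex
        ready map[string]chan struct{} // keyed by the peer's node public key
    }

    func newPeerSignal() *peerSignal {
        return &peerSignal{ready: make(map[string]chan struct{})}
    }

    // wait returns a channel that is closed once the peer has handshaked.
    func (p *peerSignal) wait(nodeKey string) <-chan struct{} {
        p.mu.Lock()
        defer p.mu.Unlock()
        ch, ok := p.ready[nodeKey]
        if !ok {
            ch = make(chan struct{})
            p.ready[nodeKey] = ch
        }
        return ch
    }

    // markHandshaked is called from the status callback whenever a peer
    // reports a nonzero LastHandshake.
    func (p *peerSignal) markHandshaked(nodeKey string) {
        p.mu.Lock()
        defer p.mu.Unlock()
        ch, ok := p.ready[nodeKey]
        if !ok {
            ch = make(chan struct{})
            p.ready[nodeKey] = ch
        }
        select {
        case <-ch:
            // already closed
        default:
            close(ch)
        }
    }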

spikecurtis (author):

I like where your head is at.

I did think about this, but decided not to build it until/unless we get some evidence it's a problem at scale. The code I'm replacing had us sending a bunch of pings, and who knows how expensive that was! If we see extended lag when making a lot of coderd-proxied connections at scale, we can look into improving the efficiency here.

}

// getNodeKeysForIP returns all the keys for the given IP address. There could be multiple keys if the agent/client
// for the IP disconnects and reconnects, since we regenerate the node key each time.

Member:

since we regenerate the node key each time.

Should we perhaps not do this?

spikecurtis (author):

It's hard to guarantee that the agent's key will be both secure and stable across agent restarts.

Picking a random key each time makes it secure...
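
For context, generating a fresh node key is a one-liner with tailscale's types/key package, which is roughly what tailnet does when a Conn is created (sketch; the exact call site differs):

    // A new, random wireguard node key for every Conn, so a restarted agent
    // shows up as a new peer (new key) behind the same IP.
    nodePrivateKey := key.NewNode()
    nodePublicKey := nodePrivateKey.Public()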

// Our use case is different: for clients, it's a point-to-point connection to a single workspace, and lasts only as
// long as the connection. For agents, it's connections to a small number of clients (CLI or Coderd) that are being
// actively used by the end user.
envknob.Setenv("TS_DEBUG_TRIM_WIREGUARD", "false")

Contributor:

One thing that comes to mind here is that agents will forever spin trying to connect to CLI connections that have gone away. Since we don't have the ability to tell agents that a client is gone, it seems like this would pile up.

Contributor:

Also, is this something we should keep disabled on coderd's tailnet as well? I guess it's sort-of moot since we're doing our own peer expiration.

spikecurtis (author):

One thing that comes to mind here is that agents will forever spin trying to connect to CLI connections that have gone away. Since we don't have the ability to tell agents that a client is gone, it seems like this would pile up.

Yeah, they already do that #7960

We need to enhance the coordinator to revoke nodes when clients disconnect.

spikecurtis (author):

Also, is this something we should keep disabled on coderd's tailnet as well? I guess it's sort-of moot since we're doing our own peer expiration.

I don't think we want to trim wireguard on Coderd's tailnet, since that means we cannot wait for wireguard to come up before attempting to connect to the agent, and we risk hitting stalls due to an inflated retry timeout (#7388). As you mention, we have to solve the problem of when to remove agent peers in Coderd. Trimming the wireguard connections saves memory, but we'd still be leaking memory if we just never removed them from tailnet.

coadler (Contributor) commented Jul 19, 2023:

I think I've found the reason for the failing tests. It seems like there's still a race condition in handshakes causing the 5s timeout:

  1. Client wants to connect to an agent, adds the agent to its netmap and sends a handshake initiation
  2. Agent receives handshake and ignores it since it doesn't have the client in its netmap.
  3. Agent receives the client node, adds to netmap
  4. 5s timeout happens, handshake is resent
  5. Agent and client successfully handshake

I think the only way to fix this is by having the agent initiate the handshake instead of the client. The client needs to make sure it has the agent node in its netmap before propagating to the agent. I don't think there's an easy way to ensure the reverse.

After changing KeepAlive so that only the agent sets it, all tests seem to pass locally:

			KeepAlive:  c.connType == ConnTypeAgent,
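
For context, a sketch of roughly where that flag lives when tailnet converts coordinator nodes into netmap peers (simplified; connType and ConnTypeAgent are names proposed in this comment, and the surrounding fields are abbreviated):

    peer := &tailcfg.Node{
        ID:         nodeID,
        Key:        node.Key,
        DiscoKey:   node.DiscoKey,
        Addresses:  node.Addresses,
        AllowedIPs: node.AllowedIPs,
        // Only the agent keeps sessions alive and initiates handshakes. By
        // the time the agent learns about a client, the client already has
        // the agent in its netmap, so the handshake won't be ignored.
        KeepAlive: c.connType == ConnTypeAgent,
    }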

coadler (Contributor) commented Jul 20, 2023:

cc @mafredri, curious what you think of the above. I don't think this is super urgent, might just leave this until Spike gets back.

mafredri (Member):

cc @mafredri, curious what you think of the above. I don't think this is super urgent, might just leave this until Spike gets back.

I think it's an interesting find, and it makes sense. Leaving this until Spike gets back is fine, though I'd like to hear what he thinks of it as well.

I haven't seen many CI failures pointing towards this recently either, so I'd agree it's not super urgent.

@github-actions github-actions bot added the stale This issue is like stale bread. label Jul 28, 2023
@coadler coadler removed the stale This issue is like stale bread. label Jul 28, 2023

spikecurtis (author):

I think I've found the reason for the failing tests. It seems like there's still a race condition in handshakes causing the 5s timeout:

  1. Client wants to connect to an agent, adds the agent to its netmap and sends a handshake initiation
  2. Agent receives handshake and ignores it since it doesn't have the client in its netmap.
  3. Agent receives the client node, adds to netmap
  4. 5s timeout happens, handshake is resent
  5. Agent and client successfully handshake

I think the only way to fix this is by having the agent initiate the handshake instead of the client. The client needs to make sure it has the agent node in its netmap before propagating to the agent. I don't think there's an easy way to ensure the reverse.

After changing KeepAlive so that only the agent sets it, all tests seem to pass locally.

This is good analysis. Before this change, we'd wait for a Disco pong, then start the TCP session, and this would trigger the wireguard session to come up. Since we got a pong, we know the agent had received our node from the coordinator, so it mostly worked.

So, the basic issue is that we want to ensure that whichever peer initiates the handshake, the other peer has programmed wireguard to accept the handshake before we send it.

Option 1: (as you suggested) Have the agent initiate the handshake and the client only send its node after it has programmed the agent node into wireguard.

I see two issues:

  1. This works fine for end-user clients, but I don't think it's compatible with the Coderd single-tailnet, since node updates happen asynchronously from connecting to agents.
  2. It also means that the agent will continue to send handshakes after clients disconnect. It's a nice property of the client initiating the handshake that when the client goes away, so do the handshakes.

Option 2: Reinstate the Disco ping/pong prior to sending the wireguard handshake. There is still a race condition, because tailscale programs the peer into the Disco layer before it programs wireguard. But I expect it to work pretty well, because before our handshake reaches the peer, three things have to happen in sequence:

  1. transmit the pong over the network
  2. reconfigure wireguard to enable the handshake
  3. transmit the handshake over the network

whereas the other side only needs to reconfigure wireguard before the handshake arrives. So the other side should mostly win the race and we won't have to wait 5s. Even if it loses the race, the handshake will be retried after 5s and we'll be OK, but we will have to make test timeouts long enough to account for this.

Option 3: Enhance the coordinator to send confirmations. So, the client would send a node update, and the agent would program wireguard and then send a confirmation for that node update. Then, the client could wait for a confirmation before initiating the wireguard handshake.

The confirmations are asymmetric in that the agent doesn't care when clients program its node into wireguard --- it passively waits for handshakes.

Obviously, I don't think Option 3 is in scope of this PR. And, I'm nervous about adding a bunch of complexity to the coordinator: edge cases of multiple agent connections are particularly gnarly for confirmations.
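
Purely to make Option 3 concrete (nothing like this is implemented or proposed in code here), the coordinator protocol could carry a small acknowledgement message:

    // Hypothetical: sent back through the coordinator once a peer has
    // programmed the other side's node into wireguard. A client would wait
    // for this before initiating the handshake.
    type nodeUpdateAck struct {
        // AckedNodeKey is the public node key from the update being acknowledged.
        AckedNodeKey string `json:"acked_node_key"`
    }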

coadler (Contributor) commented Aug 3, 2023:

One interesting thing is that it looks like Tailscale removed KeepAlive ~1 month ago: tailscale/tailscale@7b1c3df

I think this essentially makes option 2 the only option for now, unless we want to maintain the keep-alive behavior on our fork.

spikecurtis (author):

One interesting thing is that it looks like Tailscale removed KeepAlive ~1 month ago: tailscale/tailscale@7b1c3df

I think this essentially makes option 2 the only option for now, unless we want to maintain the keep-alive behavior on our fork.

That's going to basically torpedo this whole effort if we upgrade to the latest Tailscale. Tailscale.com has this mechanism to only program Wireguard nodes and enable handshakes if there is active traffic sent to them. The KeepAlive was a mechanism to explicitly enable handshakes, but Tailscale.com doesn't use it, so they removed it.

That means the only way to get Tailscale to enable Wireguard is to send traffic. This is a layering violation, IMO, since you're using a higher layer to trigger bringing up a lower layer.

We can't just start the requested TCP session and let that be the triggering traffic, since that skews the initial round-trip-time measurement while Wireguard comes up. But we could add a simple UDP echo endpoint to the agent and ping it instead of relying on Disco pings for AwaitReachable(). That would serve a dual role: it sends traffic, which convinces Tailscale to bring up Wireguard, and because we are pinging at the IP layer (not the Disco layer), a response tells us the wireguard session is up.
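
A minimal sketch of such an echo endpoint on the agent (illustrative only; the listen address and port would need to be decided):

    // Echo every datagram back to its sender. A client-side UDP "ping" to
    // this endpoint only gets a reply once wireguard is passing traffic.
    func serveUDPEcho(conn *net.UDPConn) error {
        buf := make([]byte, 512)
        for {
            n, addr, err := conn.ReadFromUDP(buf)
            if err != nil {
                return err
            }
            if _, err := conn.WriteToUDP(buf[:n], addr); err != nil {
                return err
            }
        }
    }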

@coadler WDYT? If the UDP echo thing sounds good I'll close out this PR and restart down that path. If we continue with this PR I don't think we'll be able to upgrade to the latest Tailscale without ripping it out.

coadler (Contributor) commented Aug 4, 2023:

@coadler WDYT? If the UDP echo thing sounds good I'll close out this PR and restart down that path. If we continue with this PR I don't think we'll be able to upgrade to the latest Tailscale without ripping it out.

Tailscale has multiple different ping types: https://github.com/tailscale/tailscale/blob/b940c19ca61bcd5650406e54f090da437796de0e/tailcfg/tailcfg.go#L1413. We could potentially use TSMP to ensure the IP layer is up. Otherwise, I think the UDP endpoint is a good option.
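
For reference, the ping types defined in tailcfg at that link are roughly: PingDisco (disco layer only, doesn't prove wireguard is up), PingTSMP (sent over the tunnel's IP layer, so a reply implies the wireguard session is established), PingICMP, and PingPeerAPI. A hypothetical TSMP-based check could look like this (pingPeer is an illustrative wrapper, not an existing function):

    // Hypothetical: ask the engine for a TSMP ping; a reply means the IP
    // layer inside the tunnel, and therefore wireguard, is up.
    if err := c.pingPeer(ctx, ip, tailcfg.PingTSMP); err == nil {
        return true
    }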

Are we potentially able to salvage the ps.LastHandshake change from here? I think that would still be a good addition on its own, so we could bypass some unnecessary pings.

Also just FYI, I'm hoping to bring us to the latest Tailscale release early next week: #8913

spikecurtis (author):

@coadler I will play around with the different built-in tailscale pings. It will be great if we don't have to implement the ping ourselves. I'm not sure what TSMP is --- presumably Tailscale management protocol or something. It says it runs over IP, but since this is a tunnel there are multiple IP layers involved. If it runs over the overlay IP layer, we should be good.

I also agree it would be good to check ps.LastHandshake before trying to ping. In the coderd single tailnet use case, we will dial the agent multiple times, so if we already know wireguard is up to the peer, we can skip pings. Of course, in the coder ssh use case, we only dial the agent once, so we'll ping every time.
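
A sketch of that fast path (illustrative; reuses the status/key lookup from earlier in this PR):

    // Skip the ping entirely if a wireguard handshake already exists for this
    // peer, which is the common case for coderd's single tailnet.
    s := c.Status()
    for _, k := range c.getNodeKeysForIP(ip) {
        if ps, ok := s.Peer[k]; ok && !ps.LastHandshake.IsZero() {
            return true
        }
    }
    // Otherwise fall back to pinging to bring the tunnel up.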

@github-actions github-actions bot added the stale This issue is like stale bread. label Aug 13, 2023
coadler added a commit that referenced this pull request Aug 15, 2023
Since Tailscale removed keep-alives, it seems like open but idle connections (SSH, port-forward, etc.) can get trimmed fairly easily, causing hangs for a few seconds while the connection is set up again.

This was taken from Spike's PR #8492

Co-authored-by: Spike Curtis <[email protected]>
@github-actions github-actions bot closed this Aug 17, 2023
@github-actions github-actions bot deleted the spike/7388-tcp-wait-for-wg branch January 20, 2024 00:04