net: KeepAlive is disabled by write to dead link #31490
Comments
tls.DialWithDialer literally calls net.Dialer.Dial, so I don't think there's anything TLS-specific here.
Yeah, I looked and while net/http used to mess with the keep-alives, crypto/tls does not. Can you figure out the KEEPCNT value on your system?
I have the standard defaults:
A likely factor... After resuming the client, its date jumps forward to the current time. On my system, that happens after 0-10 mins (it isn't immediate due to a VMWare or Linux driver issue). But that should cause a max wait of 13mins for the default KeepAlive.
cc @mikioh
@networkimprov, I'm a little confused about your setup. Anyway I suspect that in your test case, a It seems a common misconception that a connection with keep-alives enabled is going to return an error within I reproduced the issue on Linux, with client and server on the same host, using
I used netcat as server, and a simple Go program with default keep-alives (15s) as client. This is the first web search result I found about the issue.
I haven't seen this with the same client app running on MacOS, so it may be Linux-specific and suspend-related. It could be due to my system, which is ArchLinux on VMWare on Win7. The net.Dialer side, which sets a keepalive, gets suspended. EDIT: the server side is on a remote host.
EDIT 2: The client writes a pulse to the server every 115s. The lack of that pulse causes the server to drop the connection after the laptop suspends. On the MacOS client I use
@gopherbot add OS-Linux
@networkimprov, if I understand correctly:
EDIT: sorry, I misunderstood the last part of your message; what I wrote below was probably unnecessary. If the client doesn't receive a If the client writes anything to the connection within ~50s after resume, keep-alives are out of the game. Maybe MacOS just has a shorter retry timeout?
Is there a way to trigger a RST to the client on resume? I'm planning to add a server-to-client pulse and drop net.Dialer keepalive, but first I'll try to confirm that a write happens before the timeout on resume. @bradfitz maybe the docs could mention that writing to a lost connection disables keepalive? Because keepalive is now the default for both dialed and accepted connections.
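A minimal sketch of the server-to-client pulse idea mentioned above, assuming the client treats a missed pulse as a dead link. The names readPulse and silentPeerDetected and the 50ms window are hypothetical, chosen only for illustration, and net.Pipe stands in for a real TCP connection so the example runs without a network.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// readPulse waits up to maxSilence for data from the peer. A timeout error
// means no pulse arrived in time, which the caller can treat as a dead link.
func readPulse(c net.Conn, buf []byte, maxSilence time.Duration) (int, error) {
	if err := c.SetReadDeadline(time.Now().Add(maxSilence)); err != nil {
		return 0, err
	}
	return c.Read(buf)
}

// silentPeerDetected runs readPulse against an in-memory peer that never
// writes and reports whether the read failed with a timeout.
func silentPeerDetected() bool {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	_, err := readPulse(client, make([]byte, 1), 50*time.Millisecond)
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

func main() {
	fmt.Println("silent peer detected:", silentPeerDetected())
}
```

Unlike keepalive, a read deadline is enforced by the application, so it keeps working after a write to a dead link has taken the connection out of idle mode. In a real client, maxSilence would be somewhat longer than the pulse interval.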
If the server receives any segment after closing the connection, it should respond with
I agree. Given a lost connection, writing to it before the keep-alive timeout seems like a very common case to me.
// setTCPTimeout returns a Dialer.Control hook that sets TCP_USER_TIMEOUT
// (Linux only) on the socket before it connects.
func setTCPTimeout(d time.Duration) func(string, string, syscall.RawConn) error {
	return func(network, address string, c syscall.RawConn) error {
		var sysErr error
		var err = c.Control(func(fd uintptr) {
			// 0x12 is TCP_USER_TIMEOUT; the value is in milliseconds.
			sysErr = syscall.SetsockoptInt(int(fd), syscall.SOL_TCP, 0x12,
				int(d.Milliseconds()))
		})
		if sysErr != nil {
			return os.NewSyscallError("setsockopt", sysErr)
		}
		return err
	}
}

var dialer = &net.Dialer{
	KeepAlive: 15 * time.Second,
	Control:   setTCPTimeout(150 * time.Second),
}
A NAT, yes; my clients were connected to the server via a router doing NAT. Re docs, how about "keepalive ceases to function if a write is made to a dead link, as that kicks the link out of idle mode and retries the write repeatedly for ~15 minutes".
Do you think that default socket config in Go should set this in lieu of (or addition to) The Man7 page references RFC 793 and RFC 5482 for this feature. EDIT: a search for tcp_user_timeout reveals that it's apparently not implemented on Darwin nor Windows :-(
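For reference, the ~15 minute figure quoted above is consistent with the Linux retransmission backoff. A rough sketch of the arithmetic, assuming the Linux defaults of tcp_retries2=15, a 200ms minimum RTO and a 120s maximum RTO. None of these values are stated in this thread, and the real RTO depends on the measured round-trip time, so this is a lower bound rather than a prediction. The function name retransmitTimeout is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// retransmitTimeout sums the waits for the first transmission plus `retries`
// retransmissions, doubling the RTO each time and capping it at rtoMax.
func retransmitTimeout(retries int, rtoMin, rtoMax time.Duration) time.Duration {
	var total time.Duration
	rto := rtoMin
	for i := 0; i <= retries; i++ {
		total += rto
		rto *= 2
		if rto > rtoMax {
			rto = rtoMax
		}
	}
	return total
}

func main() {
	// Assumed Linux defaults, not taken from this thread.
	fmt.Println(retransmitTimeout(15, 200*time.Millisecond, 120*time.Second))
}
```

Under those assumptions the total is 924.6s, roughly 15.4 minutes: ten doubling waits from 0.2s to 102.4s (204.6s) plus six waits capped at 120s (720s). A larger initial RTO would push the total higher.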
It "ceases to function" because it is not needed anymore. The problem is not the scope of keep-alives, but the inability to change limits for non keep-alives:
About In my Linux tests, In Debian 9, kernel 4.9.168, Go 1.13:
In an older Debian, I observed I don't like this undocumented and changing behavior of The current OS support for TCP user timeout controls seems too divergent/unpredictable/poor to do something with it in the
Thanks for the investigation! What was the version of the older Debian which ignores tcp_user_timeout in zero-window? I don't think it's necessarily a show-stopper if zero-window (sometimes) triggers tcp_user_timeout. The question is whether that behavior is worse than keepalive shut-off on write to a dead link, which is pretty bad. Hopefully the io.EOF return can be fixed.
The one I tested was Debian 8, kernel 3.16.59 (probably anything older does the same).
It depends on the application. In many cases, timing out on zero-window shouldn't be a problem because there is no reason for the receiver not to consume its input for a significant amount of time, but there are exceptions. I'm not saying I wouldn't use FYI, commands like
Maybe we could propose to default to tcp_user_timeout on kernels newer than [x.y] and fall back to the current default keepalive otherwise? Aren't exception cases known to their authors and clients, who would necessarily need something different than the Go default? The keepalive default in Go exists as training wheels; it's just a starting point.
It's the newer behavior of
Currently they don't necessarily need something different than the Go default, but they will if we make If a knob for
OK, I'll investigate the options you found for Windows and MacOS when I get a chance. If they work, I'll file a proposal to add a field for this in net.Dialer, etc.
Was: net.Dialer.Dial() doesn't respect .KeepAlive
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Haven't tried 1.12.1+
What operating system and processor architecture are you using (go env)?
What did you do/see?
For a TLS connection that's been dropped, c.Read() returns a net.Error with e.Timeout()==true after ~18 minutes, apparently due to KeepAlive failure. The connection is closed by the server while the client (laptop running VMWare with Linux) is suspended. Error string: "read tcp n.n.n.n:p->n.n.n.n:p: read: connection timed out"
The note on 5bd7e9c5 (discussed in #23459) says the default for net.Dialer keepalive failure is just 2-3min. With an explicit KeepAlive, I also see odd waits:
net.Dialer{KeepAlive: 30 * time.Second}: 18min
net.Dialer{KeepAlive: 25 * time.Second}: 18min
net.Dialer{KeepAlive: 10 * time.Second}: 16min
net.Dialer{KeepAlive: 5 * time.Second}: <1min, but at least once 18min
Measurements aren't precise; 16-18 minutes could be 1000 or 1024 seconds. Code has:
Also filed #31449 to report that the error due to keepalive doesn't comply with the docs re connection timeout.
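The 2-3 minute expectation for keepalive failure mentioned above follows from simple arithmetic. A sketch, assuming Go applies the KeepAlive value as both the idle time and the probe interval on Linux, and that the kernel probe count (KEEPCNT) is 9; both are assumptions here, not values confirmed in this report. The function name keepAliveDetection is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// keepAliveDetection estimates how long an idle, dead connection takes to
// fail: one idle period before the first probe, then one interval per
// unanswered probe, with idle time and interval both equal to period.
func keepAliveDetection(period time.Duration, probes int) time.Duration {
	return period + time.Duration(probes)*period
}

func main() {
	// KEEPCNT=9 is an assumed value, not one reported in this issue.
	for _, p := range []time.Duration{5 * time.Second, 15 * time.Second, 30 * time.Second} {
		fmt.Printf("KeepAlive=%v -> about %v\n", p, keepAliveDetection(p, 9))
	}
}
```

Under those assumptions a 15s KeepAlive gives 150s and a 30s KeepAlive gives 300s, both well short of the 16-18 minutes measured, which suggests something other than keepalive probes determined the observed wait.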
cc @FiloSottile @bradfitz