Agent / PTY data race reading and resizing #3236


Closed
spikecurtis opened this issue Jul 26, 2022 · 11 comments
Labels
api Area: HTTP API

Comments

@spikecurtis
Contributor

Spotted during CI

Note that this trace is from a branch, and so line numbers might not be accurate to what is in main.

==================
WARNING: DATA RACE
Write at 0x00c00087af70 by goroutine 58:
  internal/poll.(*FD).destroy()
      /opt/hostedtoolcache/go/1.18.4/x64/src/internal/poll/fd_unix.go:86 +0xc9
  internal/poll.(*FD).readUnlock()
      /opt/hostedtoolcache/go/1.18.4/x64/src/internal/poll/fd_mutex.go:232 +0x44
  internal/poll.(*FD).Read.func1()
      /opt/hostedtoolcache/go/1.18.4/x64/src/internal/poll/fd_unix.go:147 +0x39
  runtime.deferreturn()
      /opt/hostedtoolcache/go/1.18.4/x64/src/runtime/panic.go:436 +0x32
  os.(*File).read()
      /opt/hostedtoolcache/go/1.18.4/x64/src/os/file_posix.go:31 +0xc7
  os.(*File).Read()
      /opt/hostedtoolcache/go/1.18.4/x64/src/os/file.go:119 +0x98
  github.com/coder/coder/pty.ReadWriter.Read()
      /home/runner/work/coder/coder/pty/pty.go:66 +0x6e
  github.com/coder/coder/pty.(*ReadWriter).Read()
      <autogenerated>:1 +0x7a
  io.copyBuffer()
      /opt/hostedtoolcache/go/1.18.4/x64/src/io/io.go:426 +0x28a
  io.Copy()
      /opt/hostedtoolcache/go/1.18.4/x64/src/io/io.go:385 +0xbb
  github.com/coder/coder/agent.(*agent).handleSSHSession.func4()
      /home/runner/work/coder/coder/agent/agent.go:468 +0x57

Previous read at 0x00c00087af70 by goroutine 93:
  os.(*File).Fd()
      /opt/hostedtoolcache/go/1.18.4/x64/src/os/file_unix.go:89 +0xa6
  github.com/creack/pty.Setsize()
      /home/runner/go/pkg/mod/github.com/creack/[email protected]/winsize_unix.go:23 +0x3d
  github.com/coder/coder/pty.(*otherPty).Resize()
      /home/runner/work/coder/coder/pty/pty_other.go:58 +0x104
  github.com/coder/coder/pty.(*otherPtyWithProcess).Resize()
      <autogenerated>:1 +0x67
  github.com/coder/coder/agent.(*agent).handleSSHSession.func2()
      /home/runner/work/coder/coder/agent/agent.go:458 +0x10a

Goroutine 58 (running) created at:
  github.com/coder/coder/agent.(*agent).handleSSHSession()
      /home/runner/work/coder/coder/agent/agent.go:467 +0xe9e
  github.com/coder/coder/agent.(*agent).init.func2()
      /home/runner/work/coder/coder/agent/agent.go:280 +0x97
  github.com/gliderlabs/ssh.(*session).handleRequests.func1()
      /home/runner/go/pkg/mod/github.com/gliderlabs/[email protected]/session.go:261 +0x4b

Goroutine 93 (running) created at:
  github.com/coder/coder/agent.(*agent).handleSSHSession()
      /home/runner/work/coder/coder/agent/agent.go:456 +0xd06
  github.com/coder/coder/agent.(*agent).init.func2()
      /home/runner/work/coder/coder/agent/agent.go:280 +0x97
  github.com/gliderlabs/ssh.(*session).handleRequests.func1()
      /home/runner/go/pkg/mod/github.com/gliderlabs/[email protected]/session.go:261 +0x4b

It's interesting that destroy is at the top of the stack, which might mean the race occurs while the TTY is being closed and its file descriptors are being cleaned up.

@spikecurtis spikecurtis added bug api Area: HTTP API labels Jul 26, 2022
@spikecurtis
Contributor Author

Looking into this more, I believe the race condition is in creack/pty: when calling Setsize(), it makes a "bare" call to get the file descriptor, which it then uses to issue an ioctl. Grabbing the file descriptor this way doesn't go through the fdmutex on the file.

The "modern" go solution to this, since Go 1.12, is to use f.SyscallConn() to wrap the ioctl. I'm testing this out locally and will see if the upstream wants a PR. Unfortunately it seems to break their riscv compiler, which is still on go 1.6 😱 , so they may not want it. We could consider forking...

@spikecurtis
Contributor Author

@kylecarbs @dwahler @mafredri do you think this is worth forking right away over, or should we wait and see whether upstream will accept a PR?

@mafredri
Member

Nice find! Whilst refactoring ptytest I felt there might be a race here, but I didn't consider that it would be during close.

Would guarding resize and close with the same mutex sufficiently protect against this case? If so, we could consider doing that instead of forking. Otherwise, I'd say go forth and fork.
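
For illustration, a minimal sketch of that guarding idea (type and field names here are made up, not the actual coder/pty implementation):

    import (
        "os"
        "sync"

        "github.com/creack/pty"
    )

    // guardedPty shares one mutex between Resize and Close and makes Resize
    // a no-op once the pty is closed, so Setsize is never called on an fd
    // that our own Close is tearing down.
    type guardedPty struct {
        mu     sync.Mutex
        closed bool
        tty    *os.File
    }

    func (p *guardedPty) Resize(height, width uint16) error {
        p.mu.Lock()
        defer p.mu.Unlock()
        if p.closed {
            return nil
        }
        return pty.Setsize(p.tty, &pty.Winsize{Rows: height, Cols: width})
    }

    func (p *guardedPty) Close() error {
        p.mu.Lock()
        defer p.mu.Unlock()
        if p.closed {
            return nil
        }
        p.closed = true
        return p.tty.Close()
    }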

mafredri added a commit that referenced this issue Jul 28, 2022
@mafredri
Member

I haven't seen this for a few CI runs now with #3270. If it doesn't resurface, could we consider this fixed?

@spikecurtis
Contributor Author

Sorry I missed #3270. I don’t think that actually fixes the race; as you’ll notice from the stack traces, the race ends up being between Resize and Read.

We have a goroutine that copies from the SSH session to the TTY file. I think what’s happening is that when the file is Closed, this doesn’t trigger the finalizer because we are still copying. It appears they are using a panic on the file to capture the EOF, which then is able to actually finalize the file.

The net effect is that in order to work around the issue, we would probably need to include Read and Write in the mutex’d operations. That feels pretty annoying given that Go already has a mutex that handles this.

@mafredri
Member

mafredri commented Jul 29, 2022

@spikecurtis The fact that internal/poll.(*FD).destroy() was at the top of the stack suggests Read is active while the file is Closed (as you wrote). However, the reason I thought this could help is that we're likely the ones calling Close.

My suspicion was then that we're calling Close, and Resize immediately after, while the active Read is exiting, and that #3270 fixes that case. This is strengthened by the fact that we never actually abort the resizing operation (except whenever the lib decides to close the channel), e.g. in:

coder/agent/agent.go, lines 456 to 463 in 74c8766:

    go func() {
        for win := range windowSize {
            resizeErr := ptty.Resize(uint16(win.Height), uint16(win.Width))
            if resizeErr != nil {
                a.logger.Warn(context.Background(), "failed to resize tty", slog.Error(resizeErr))
            }
        }
    }()

The race would still be there if the fd was being closed by other means, but I'm wondering how likely this is to happen or in what scenarios it would.

I thought about guarding Read/Write as well, but that feels like it'd be prone to deadlock (e.g. due to Read blocking until there is data).
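
Concretely, the deadlock risk looks something like this (a hypothetical sketch, not our actual code): Read holds the mutex while parked waiting for data, so Close can never acquire it.

    // Hypothetical wrapper showing why guarding Read with the same mutex
    // is deadlock-prone.
    type lockedPty struct {
        mu  sync.Mutex
        tty *os.File
    }

    func (p *lockedPty) Read(b []byte) (int, error) {
        p.mu.Lock()
        defer p.mu.Unlock()
        return p.tty.Read(b) // may block indefinitely until data arrives
    }

    func (p *lockedPty) Close() error {
        p.mu.Lock() // never acquired while a Read is parked above
        defer p.mu.Unlock()
        return p.tty.Close()
    }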

@spikecurtis
Contributor Author

spikecurtis commented Jul 29, 2022

I agree that we are almost certainly the ones calling Close. But there are three interacting goroutines (Close, Resize, Read). Yes, the race only occurs when Close is called, but fundamentally there is a problem between Resize and Read. Making Close and Resize mutually exclusive might narrow the window, but the race is still there.

Basically it goes like this:

  1. We call Close. This doesn’t finalize the file because we have a goroutine waiting to Read.
  2. We call Resize
  3. Read goroutine wakes up

1 and 2 can’t be concurrent because of the mutex, but 2 and 3 can, and we’ll have a race in that case.

@mafredri
Member

I'm not sure I fully follow, because that sounds like the case that is fixed by #3270. I.e. 2 and 3 are racy iff Close is called, and that can't happen because of the mutex. And if Close has been called, then Resize is a no-op.

This is not accounting for other ways the fd might possibly close (without calling Close).

@spikecurtis
Contributor Author

The mutex doesn’t prevent Close from being called! It just prevents it from being concurrent with Resize. The race is possible any time Close is called, not just when it happens concurrently with Resize.

@mafredri
Member

mafredri commented Jul 29, 2022

I don't think there's any race between Close, Read, or Write. The way I see it, the race is creack/pty calling (*os.File).Fd() (via Setsize) during the fd cleanup that happens in Read or Write after a call to Close.

So calls to (*os.File).Fd() are racy, but are we doing that anywhere other than in Setsize (which is now guarded)?

@spikecurtis
Contributor Author

@mafredri and I talked and I misunderstood the fix he made. It prevents the call to Setsize after the Close, so that should prevent a race with Read.
