fix: Disable compression for websocket dRPC transport #145

bryphe-coder · 2022-02-02T17:43:46Z

There is a race condition in the interop between the websocket and dRPC: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the byte[] being sent between them. This can lead to data races, in which both dRPC and the websocket are writing.

This is just tracking some experimentation to fix that race condition

Run results:

Run 1: peer test failure
Run 2: peer test failure
Run 3: TestWorkspaceHistory/CreateHistory - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45

status code 412: The provided project history is running. Wait for it to complete importing!`

Run 4: TestWorkspaceHistory/CreateHistory - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176

    workspacehistory_test.go:122: 
        	Error Trace:	workspacehistory_test.go:122
        	Error:      	Condition never satisfied
        	Test:       	TestWorkspaceHistory/CreateHistory

Run 5: peer failure
Run 6: Pass ✅
Run 7: Peer failure

Open Questions:

Is `dRPC` or `websocket` at fault for the data race?

It looks like this condition is specifically happening when dRPC decides to [SendError]). This constructs a new byte payload from MarshalError - so dRPC has created this buffer and owns it.

From dRPC's perspective, the callstack looks like this:

sendPacket
- writeFrame
  - AppendFrame
    - with finally the data race happening here:

// AppendFrame appends a marshaled form of the frame to the provided buffer.
func AppendFrame(buf []byte, fr Frame) []byte {
...
	out := buf
	out = append(out, control).   // <---------

This should be fine, since dPRC create this buffer, and is taking the byte buffer constructed from MarshalError and tacking a bunch of headers on it to create a proper frame.

Once dRPC is done writing, it hangs onto the buffer and resets it here_: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73

However... the websocket implementation, once it gets the buffer, it runs a statelessDeflate here, which compresses the buffer on the fly. This functionality actually mutates the buffer in place, which is where get our race.

In the case where the byte[] aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing dRPC -> websocket, they both are manipulating it (dRPC is reusing the buffer for the next write, and websocket is compressing on the fly).

Why does cloning on `Read` fail?

Get a bunch of errors like:

2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF
2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0

UPDATE:

We decided we could disable websocket compression, which would avoid the race because the in-place deflate operaton would no longer be run. Trying that out now:

Run 1: ✅
Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338
Run 3: ✅
Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168
Run 5: ✅

codecov · 2022-02-02T17:47:32Z

Codecov Report

Merging #145 (eef4d3a) into provisionerdaemon (2a76b60) will decrease coverage by 0.21%.
The diff coverage is 100.00%.

@@                  Coverage Diff                  @@
##           provisionerdaemon     #145      +/-   ##
=====================================================
- Coverage              68.04%   67.83%   -0.22%     
=====================================================
  Files                    101      101              
  Lines                   5095     5098       +3     
  Branches                  68       68              
=====================================================
- Hits                    3467     3458       -9     
- Misses                  1324     1333       +9     
- Partials                 304      307       +3

Flag	Coverage Δ
unittest-go-macos-latest	`64.59% <100.00%> (-0.22%)`	⬇️
unittest-go-ubuntu-latest	`66.89% <100.00%> (-0.07%)`	⬇️
unittest-go-windows-latest	`64.30% <100.00%> (-0.08%)`	⬇️
unittest-js	`64.92% <ø> (ø)`

Impacted Files	Coverage Δ
coderd/provisionerdaemons.go	`48.56% <100.00%> (-0.31%)`	⬇️
peerbroker/dial.go	`76.19% <0.00%> (-4.77%)`	⬇️
peer/channel.go	`84.75% <0.00%> (-2.44%)`	⬇️
peer/conn.go	`80.36% <0.00%> (-0.26%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a76b60...eef4d3a. Read the comment docs.

coadler · 2022-02-02T18:33:56Z

coderd/provisionerdaemons.go

+	cpy := make([]byte, len(p))
+	copy(cpy, p)
+	return c.conn.Write(cpy)


You can do 1 line slice copies as well!

Suggested change

cpy := make([]byte, len(p))

copy(cpy, p)

return c.conn.Write(cpy)

return c.conn.Write(append(([]byte)(nil), p...))

* feat: Add history middleware parameters These will be used for streaming logs, checking status, and other operations related to workspace and project history. * refactor: Move all HTTP routes to top-level struct Nesting all structs behind their respective structures is leaky, and promotes naming conflicts between handlers. Our HTTP routes cannot have conflicts, so neither should function naming. * Add provisioner daemon routes * Add periodic updates * Skip pubsub if short * Return jobs with WorkspaceHistory * Add endpoints for extracting singular history * The full end-to-end operation works * fix: Disable compression for websocket dRPC transport (#145) There is a race condition in the interop between the websocket and `dRPC`: https://github.com/coder/coder/runs/5038545709?check_suite_focus=true#step:7:117 - it seems both the websocket and dRPC feel like they own the `byte[]` being sent between them. This can lead to data races, in which both `dRPC` and the websocket are writing. This is just tracking some experimentation to fix that race condition ## Run results: ## - Run 1: peer test failure - Run 2: peer test failure - Run 3: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040858460?check_suite_focus=true#step:8:45 ``` status code 412: The provided project history is running. Wait for it to complete importing!` ``` - Run 4: `TestWorkspaceHistory/CreateHistory` - https://github.com/coder/coder/runs/5040957999?check_suite_focus=true#step:7:176 ``` workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` - Run 5: peer failure - Run 6: Pass ✅ - Run 7: Peer failure ## Open Questions: ## ### Is `dRPC` or `websocket` at fault for the data race? It looks like this condition is specifically happening when `dRPC` decides to [`SendError`]). This constructs a new byte payload from [`MarshalError`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/error.go#L15) - so `dRPC` has created this buffer and owns it. From `dRPC`'s perspective, the callstack looks like this: - [`sendPacket`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcstream/stream.go#L253) - [`writeFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L65) - [`AppendFrame`](https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/packet.go#L128) - with finally the data race happening here: ```go // AppendFrame appends a marshaled form of the frame to the provided buffer. func AppendFrame(buf []byte, fr Frame) []byte { ... out := buf out = append(out, control). // <--------- ``` This should be fine, since `dPRC` create this buffer, and is taking the byte buffer constructed from `MarshalError` and tacking a bunch of headers on it to create a proper frame. Once `dRPC` is done writing, it _hangs onto the buffer and resets it here__: https://github.com/storj/drpc/blob/f6e369438f636b47ee788095d3fc13062ffbd019/drpcwire/writer.go#L73 However... the websocket implementation, once it gets the buffer, it runs a `statelessDeflate` [here](https://github.com/nhooyr/websocket/blob/8dee580a7f74cf1713400307b4eee514b927870f/write.go#L180), which compresses the buffer on the fly. This functionality actually [mutates the buffer in place](https://github.com/klauspost/compress/blob/a1a9cfc821f00faf2f5231beaa96244344d50391/flate/stateless.go#L94), which is where get our race. In the case where the `byte[]` aren't being manipulated anywhere else, this compress-in-place operation would be safe, and that's probably the case for most over-the-wire usages. In this case, though, where we're plumbing `dRPC` -> websocket, they both are manipulating it (`dRPC` is reusing the buffer for the next `write`, and `websocket` is compressing on the fly). ### Why does cloning on `Read` fail? Get a bunch of errors like: ``` 2022/02/02 19:26:10 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [ERR] yamux: Failed to read header: unexpected EOF 2022/02/02 19:26:25 [WARN] yamux: frame for missing stream: Vsn:0 Type:0 Flags:0 StreamID:0 Length:0 ``` # UPDATE: We decided we could disable websocket compression, which would avoid the race because the in-place `deflate` operaton would no longer be run. Trying that out now: - Run 1: ✅ - Run 2: https://github.com/coder/coder/runs/5042645522?check_suite_focus=true#step:8:338 - Run 3: ✅ - Run 4: https://github.com/coder/coder/runs/5042988758?check_suite_focus=true#step:7:168 - Run 5: ✅ * fix: Remove race condition with acquiredJobDone channel (#148) Found another data race while running the tests: https://github.com/coder/coder/runs/5044320845?check_suite_focus=true#step:7:83 __Issue:__ There is a race in the p.acquiredJobDone chan - in particular, there can be a case where we're waiting on the channel to finish (in close) with <-p.acquiredJobDone, but in parallel, an acquireJob could've been started, which would create a new channel for p.acquiredJobDone. There is a similar race in `close(..)`ing the channel, which also came up in test runs. __Fix:__ Instead of recreating the channel everytime, we can use `sync.WaitGroup` to accomplish the same functionality - a semaphore to make close wait for the current job to wrap up. * fix: Bump up workspace history timeout (#149) This is an attempted fix for failures like: https://github.com/coder/coder/runs/5043435263?check_suite_focus=true#step:7:32 Looking at the timing of the test: ``` t.go:56: 2022-02-02 21:33:21.964 [DEBUG] (terraform-provisioner) <provision.go:139> ran apply t.go:56: 2022-02-02 21:33:21.991 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.050 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.090 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.140 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.195 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running t.go:56: 2022-02-02 21:33:22.240 [DEBUG] (provisionerd) <provisionerd.go:162> skipping acquire; job is already running workspacehistory_test.go:122: Error Trace: workspacehistory_test.go:122 Error: Condition never satisfied Test: TestWorkspaceHistory/CreateHistory ``` It appears that the `terraform apply` job had just finished - with less than a second to spare until our `require.Eventually` completes - but there's still work to be done (ie, collecting the state files). So my suspicion is that terraform might, in some cases, exceed our 5s timeout. Note that in the setup for this test - there is a similar project history wait that waits for 15s, so I borrowed that here. In the future - we can look at potentially using a simple echo provider to exercise this in the unit test, in a way that is more reliable in terms of timing. I'll log an issue to track that. Co-authored-by: Bryan <[email protected]>

bryphe-coder added 2 commits February 2, 2022 17:41

Add proxy 'Write' that clones the bytes prior to forwarding

236ab22

Bring back return

2a5ba89

bryphe-coder self-assigned this Feb 2, 2022

coadler reviewed Feb 2, 2022

View reviewed changes

kylecarbs approved these changes Feb 2, 2022

View reviewed changes

Turn off compression to avoid race

eef4d3a

bryphe-coder changed the title ~~experiment: Clone bytes on write between provisionerdaemon <-> websocket~~ fix: Disable compression for websocket dRPC transport Feb 2, 2022

kylecarbs approved these changes Feb 2, 2022

View reviewed changes

bryphe-coder marked this pull request as ready for review February 2, 2022 21:31

bryphe-coder merged commit 03ed951 into provisionerdaemon Feb 2, 2022

bryphe-coder deleted the bryphe/provisionerdaemon/experiment-race branch February 2, 2022 21:32

ketang unassigned bryphe-coder Mar 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Disable compression for websocket dRPC transport #145

fix: Disable compression for websocket dRPC transport #145

Uh oh!

bryphe-coder commented Feb 2, 2022 •

edited

Loading

Uh oh!

codecov bot commented Feb 2, 2022 •

edited

Loading

Uh oh!

coadler Feb 2, 2022

Uh oh!

Uh oh!

fix: Disable compression for websocket dRPC transport #145

fix: Disable compression for websocket dRPC transport #145

Uh oh!

Conversation

bryphe-coder commented Feb 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Run results:

Open Questions:

Is dRPC or websocket at fault for the data race?

Why does cloning on Read fail?

UPDATE:

Uh oh!

codecov bot commented Feb 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coadler Feb 2, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bryphe-coder commented Feb 2, 2022 •

edited

Loading

Is `dRPC` or `websocket` at fault for the data race?

Why does cloning on `Read` fail?

codecov bot commented Feb 2, 2022 •

edited

Loading