Race in remote fetcher can cause parallel tests to fail #96
Comments
Hey @vanzin, this looks like a good find and a genuine issue. I'll try to spend some time over the next few weeks looking into this. Can you describe your test setup for me?
I've experimented with this locally on my machine. I imported https://github.com/natefinch/atomic and replaced

```go
if err := os.WriteFile(cacheLocation, archiveBytes, file.FileHeader.Mode()); err != nil {
	return errorExtractingPostgres(err)
}
```

in remote_fetch.go with

```go
if err := atomic.WriteFile(cacheLocation, bytes.NewReader(archiveBytes)); err != nil {
	return errorExtractingPostgres(err)
}
```

All tests pass, but it's of course very difficult to write a test that ensures we have no data race (at least I don't know how to write such a test). On Monday I'm going to do some trial and error with my version and the GitLab CI pipeline. Thank you for looking into this @fergusstrange!
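For anyone skimming: `atomic.WriteFile` stages the data in a temporary file in the same directory and then renames it over the target, so concurrent readers never observe a half-written archive. A minimal self-contained sketch of the patched logic (`writeCache` is a hypothetical stand-in for the corresponding code in remote_fetch.go):

```go
package cache

import (
	"bytes"

	"github.com/natefinch/atomic"
)

// writeCache persists the downloaded archive bytes atomically:
// atomic.WriteFile writes to a temp file and renames it into place,
// so a parallel reader sees either the old file or the complete new
// one, never a partial write.
func writeCache(cacheLocation string, archiveBytes []byte) error {
	return atomic.WriteFile(cacheLocation, bytes.NewReader(archiveBytes))
}
```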
There's nothing special about the test setup. It's just different tests (as in different _test.go files), each one using a different embedded database, running in parallel. If I delete my local cache first, the race shows up again.
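For reference, a minimal sketch of that kind of setup, assuming the standard embedded-postgres API; the package name, test name, and port here are made up:

```go
package users_test

import (
	"testing"

	embeddedpostgres "github.com/fergusstrange/embedded-postgres"
)

func TestUsers(t *testing.T) {
	// Each test package starts its own server on its own port. With an
	// empty download cache, packages running in parallel all race to
	// fetch the remote archive and write the same cache file.
	db := embeddedpostgres.NewDatabase(embeddedpostgres.DefaultConfig().Port(9876))
	if err := db.Start(); err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() {
		if err := db.Stop(); err != nil {
			t.Error(err)
		}
	})

	// ... connect to localhost:9876 and run the actual test ...
}
```

A second package with the same shape on a different port, run via `go test ./...`, is enough to get two processes hitting the cache at once.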
Today I implemented the solution proposed in a previous comment (using the atomic package) and have not run into the data race anymore. I have run the GitLab pipeline described in #97 at least 20 times; before, the data race happened quite frequently.
Just an update: we have been using our own fork with the fix described in a previous comment (using the atomic package) for many days now, and we haven't seen a single unexpected failure. We used to see it all the time before.
I've opened a PR to fix this here: #105. I've taken a slightly different strategy than the one suggested above and used a `sync.Mutex`. I believe that my change is safe, but as it's only occurring randomly for us, it's hard to fully confirm! You should be able to pull in the PR branch to test it.
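I haven't studied the PR itself, but a minimal sketch of what a mutex-guarded fetch might look like; all names here are hypothetical, not the actual code from #105:

```go
package cache

import (
	"os"
	"sync"
)

// downloadMu serialises cache population so only one goroutine in
// this process downloads and writes the archive at a time.
var downloadMu sync.Mutex

// fetchOrUseCache downloads the archive via the supplied function
// unless another goroutine has already populated the cache.
func fetchOrUseCache(cacheLocation string, download func() ([]byte, error)) error {
	downloadMu.Lock()
	defer downloadMu.Unlock()

	// Reuse the cache if a previous caller already wrote it.
	if _, err := os.Stat(cacheLocation); err == nil {
		return nil
	}

	archiveBytes, err := download()
	if err != nil {
		return err
	}

	return os.WriteFile(cacheLocation, archiveBytes, 0o644)
}
```

One caveat: a `sync.Mutex` only serialises goroutines inside a single test binary, and `go test` runs each package as its own process, so cross-package runs may still need an atomic-rename approach as well.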
What about copying the code from https://github.com/natefinch/atomic and putting it under …?
Hey @nahojer, let's give @alecsammon some time to work through this challenge.
A pre-release involving lots of work from @alecsammon is available to test for anyone who's interested. I'm really keen to hear feedback from everyone here, as I only have limited access to certain hardware to test this with, so please shout whether it worked or not.
The above release appears to have resolved a bunch of these issues, so I'm closing this for now.
Is it possible to add an example, or a comment in the README, about the parallelization?
This code in remote_fetch.go has a race that can be triggered when the cache hasn't been populated yet. If you have multiple tests running in parallel, multiple processes will try to download the remote archive and write it to the cache location. This can lead to errors when extracting the archive.
That can happen when one test has successfully downloaded the archive into the cache location and opens the file; at the same time, another test starts writing its own cache file, and the first one ends up reading a partially written file and failing to uncompress it.
The usual way to fix this is to write the data to a temporary file and then move it into the final location, which is an atomic operation (except on Windows, if that's a concern). If Windows support is desired, you can try the move and, if it fails, check whether the error indicates that the target file already exists (which means some other process "won" the race).
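A minimal sketch of that approach; the function name and the exact Windows error handling are assumptions, not code from this repository:

```go
package cache

import (
	"errors"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to path by staging it in a temporary
// file in the same directory and renaming it into place.
func writeFileAtomic(path string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cache-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// On POSIX systems the rename is atomic: readers see either the
	// old file or the complete new one, never a partial write.
	if err := os.Rename(tmp.Name(), path); err != nil {
		// On Windows the move can fail when the target exists, which
		// here means another process "won" the race and the cache is
		// already populated.
		if errors.Is(err, os.ErrExist) {
			return nil
		}
		return err
	}
	return nil
}
```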