Race in remote fetcher can cause parallel tests to fail #96

Closed
vanzin opened this issue Jan 4, 2023 · 11 comments

Comments

@vanzin

vanzin commented Jan 4, 2023

This code in remote_fetch.go has a race that can be triggered when the cache hasn't been populated yet. If you have multiple tests running in parallel, multiple processes will try to download the remote archive and write it to the cache location.

This can lead to errors like this:

--- FAIL: TestSuite (0.59s)
    database.go:49: could not start database: &{%!e(string=unable to extract postgres archive: xz: data is truncated or corrupt)}
FAIL

That can happen when one test has successfully downloaded the archive into the cache location and opened the file; at the same time, another test starts writing its own copy of the cache file, so the first one ends up reading a partially written file and fails to decompress it.

The usual way to fix this is to write the data to a temporary file, and move it into the final location, which is an atomic operation (except on Windows, if that's a worry). If Windows support is desired, you can try the move, and if it fails, check if the error is because the target file exists (which means some other process "won" the race).

@fergusstrange
Owner

Hey @vanzin this looks like a good find and genuine issue.

I'll try to spend some time over the next few weeks to look into this.

Can you describe your test set up for me?

@nahojer

nahojer commented Jan 28, 2023

I've experimented with this locally on my machine. I imported https://github.com/natefinch/atomic and replaced

if err := os.WriteFile(cacheLocation, archiveBytes, file.FileHeader.Mode()); err != nil {
    return errorExtractingPostgres(err)
}

in remote_fetch.go with

if err := atomic.WriteFile(cacheLocation, bytes.NewReader(archiveBytes)); err != nil {
    return errorExtractingPostgres(err)
}

All tests pass, but it's of course very difficult to write a test that ensures we have no data race (at least I don't know how to write such a test). On Monday I'm gonna do some trial and error with my version in the GitLab CI pipeline.

Thank you for looking into this @fergusstrange !

@vanzin
Author

vanzin commented Jan 30, 2023

There's nothing special about the test setup. It's just different tests (as in different _test.go files), each one using a different embedded database, running in parallel. If I delete my ~/.embedded-postgres-go/ directory I very easily hit this issue.

@nahojer

nahojer commented Jan 30, 2023

Today I implemented the solution proposed in a previous comment (using the atomic package) and have not run into the data race since. I have run the GitLab pipeline described in #97 at least 20 times. Before the fix, the data race happened quite frequently.

@nahojer

nahojer commented Feb 4, 2023

Just an update. We have been using our own fork with the fix described in a previous comment (using the atomic package) for many days now, and we haven't seen a single unexpected failure. We used to see it all the time before.

@alecsammon
Contributor

I've opened a PR to fix this here: #105

I've taken a slightly different strategy than the one suggested above and used a `sync.Mutex`.

Using atomic.WriteFile looks like it works, but it has a couple of potential disadvantages.

  1. It introduces another dependency
  2. The archive will still be downloaded twice, which is not necessary.

I believe that my change is safe, but as the failure only occurs randomly for us, it's hard to fully confirm!

You should be able to add the following to your go.mod to test this:

replace github.com/fergusstrange/embedded-postgres => github.com/PaddleHQ/embedded-postgres v0.0.0-20230307104208-118f8fb312d5

@nahojer

nahojer commented Mar 12, 2023

What about copying the code from https://github.com/natefinch/atomic, putting it under internal/atomic, and giving a shoutout to the author? It's MIT licensed.
We have done this internally at my company and haven't seen a single race condition since we implemented this fix more than a month ago. Before the fix it happened multiple times a day on every other test pipeline.

@fergusstrange
Owner

Hey @nahojer let's give @alecsammon some time to work through this challenge.

@fergusstrange
Owner

A pre-release involving lots of work from @alecsammon is available to test for anyone that's interested.

I'm really keen to hear feedback from everyone here, as I only have limited access to certain hardware to test this with, so please shout if it worked or not.

@fergusstrange
Owner

The above release appears to have resolved a bunch of these issues so I'm closing this for now.

@manuelarte

Is it possible to add an example, or a comment in the README, about parallelization?
