Race in remote fetcher can cause parallel tests to fail #96
Comments
Hey @vanzin, this looks like a good find and a genuine issue. I'll try to spend some time over the next few weeks looking into this. Can you describe your test setup for me?
I've experimented with this locally on my machine. I imported https://github.com/natefinch/atomic and replaced

```go
if err := os.WriteFile(cacheLocation, archiveBytes, file.FileHeader.Mode()); err != nil {
	return errorExtractingPostgres(err)
}
```

in remote_fetch.go with

```go
if err := atomic.WriteFile(cacheLocation, bytes.NewReader(archiveBytes)); err != nil {
	return errorExtractingPostgres(err)
}
```

All tests pass, but it's of course very difficult to write a test that ensures we have no data race (at least I don't know how to write such a test). On Monday I'm going to do some trial and error with my version and the GitLab CI pipeline. Thank you for looking into this @fergusstrange!
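For anyone skimming: `atomic.WriteFile` stages the data in a temporary file in the same directory and then renames it over the target, so concurrent readers never observe a half-written archive. A minimal self-contained sketch of the patched logic (`writeCache` is a hypothetical stand-in for the corresponding code in remote_fetch.go):

```go
package cache

import (
	"bytes"

	"github.com/natefinch/atomic"
)

// writeCache persists the downloaded archive bytes atomically:
// atomic.WriteFile writes to a temp file and renames it into place,
// so a parallel reader sees either the old file or the complete new
// one, never a partial write.
func writeCache(cacheLocation string, archiveBytes []byte) error {
	return atomic.WriteFile(cacheLocation, bytes.NewReader(archiveBytes))
}
```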
There's nothing special about the test setup. It's just different tests (as in different _test.go files), each one using a different embedded database, running in parallel. If I delete my local cache first, the race shows up again.
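For reference, a minimal sketch of that kind of setup, assuming the standard embedded-postgres API; the package name, test name, and port here are made up:

```go
package users_test

import (
	"testing"

	embeddedpostgres "github.com/fergusstrange/embedded-postgres"
)

func TestUsers(t *testing.T) {
	// Each test package starts its own server on its own port. With an
	// empty download cache, packages running in parallel all race to
	// fetch the remote archive and write the same cache file.
	db := embeddedpostgres.NewDatabase(embeddedpostgres.DefaultConfig().Port(9876))
	if err := db.Start(); err != nil {
		t.Fatal(err)
	}
	t.Cleanup(func() {
		if err := db.Stop(); err != nil {
			t.Error(err)
		}
	})

	// ... connect to localhost:9876 and run the actual test ...
}
```

A second package with the same shape on a different port, run via `go test ./...`, is enough to get two processes hitting the cache at once.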
Today I implemented the solution proposed in a previous comment (using the atomic package) and have not run into the data race anymore. I have run the GitLab pipeline described in #97 at least 20 times; before, the data race happened quite frequently.
Just an update: we have been using our own fork with the fix described in a previous comment (using the atomic package) for many days now, and we haven't seen a single unexpected failure. We used to see it all the time before.
I've opened a PR to fix this here: #105. I've taken a slightly different strategy than the one suggested above and used a `sync.Mutex`. I believe that my change is safe, but as it's only occurring randomly for us, it's hard to fully confirm! You should be able to pull in the PR branch to test it.
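I haven't studied the PR itself, but a minimal sketch of what a mutex-guarded fetch might look like; all names here are hypothetical, not the actual code from #105:

```go
package cache

import (
	"os"
	"sync"
)

// downloadMu serialises cache population so only one goroutine in
// this process downloads and writes the archive at a time.
var downloadMu sync.Mutex

// fetchOrUseCache downloads the archive via the supplied function
// unless another goroutine has already populated the cache.
func fetchOrUseCache(cacheLocation string, download func() ([]byte, error)) error {
	downloadMu.Lock()
	defer downloadMu.Unlock()

	// Reuse the cache if a previous caller already wrote it.
	if _, err := os.Stat(cacheLocation); err == nil {
		return nil
	}

	archiveBytes, err := download()
	if err != nil {
		return err
	}

	return os.WriteFile(cacheLocation, archiveBytes, 0o644)
}
```

One caveat: a `sync.Mutex` only serialises goroutines inside a single test binary, and `go test` runs each package as its own process, so cross-package runs may still need an atomic-rename approach as well.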
What about copying the code from https://github.com/natefinch/atomic and putting it under …?
Hey @nahojer, let's give @alecsammon some time to work through this challenge.
A pre-release involving lots of work from @alecsammon is available to test for anyone who's interested. I'm really keen to hear feedback from everyone here, as I only have limited access to certain hardware to test this with, so please shout whether it worked or not.
The above release appears to have resolved a bunch of these issues, so I'm closing this for now.
Is it possible to add an example, or a comment in the README, about the parallelization?
This code in remote_fetch.go has a race that can be triggered when the cache hasn't been populated yet. If you have multiple tests running in parallel, multiple processes will try to download the remote archive and write it to the cache location. This can lead to errors when extracting the archive.
That can happen when one test has successfully downloaded the archive into the cache location and opens the file; at the same time, another test starts writing its own cache file, and the first one ends up reading a partially written file and failing to uncompress it.
The usual way to fix this is to write the data to a temporary file and then move it into the final location, which is an atomic operation (except on Windows, if that's a concern). If Windows support is desired, you can try the move and, if it fails, check whether the error indicates that the target file already exists (which means some other process "won" the race).
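A minimal sketch of that approach; the function name and the exact Windows error handling are assumptions, not code from this repository:

```go
package cache

import (
	"errors"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to path by staging it in a temporary
// file in the same directory and renaming it into place.
func writeFileAtomic(path string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cache-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// On POSIX systems the rename is atomic: readers see either the
	// old file or the complete new one, never a partial write.
	if err := os.Rename(tmp.Name(), path); err != nil {
		// On Windows the move can fail when the target exists, which
		// here means another process "won" the race and the cache is
		// already populated.
		if errors.Is(err, os.ErrExist) {
			return nil
		}
		return err
	}
	return nil
}
```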