
Concurrency fixes for the reference db #3561


Merged
merged 13 commits into master from cmn/refdb-para on Nov 14, 2016

Conversation

carlosmn (Member)

There's all sorts of races in there if you run many threads which all want to create and compress references.

We start by removing a useless test which doesn't even follow our own rules for concurrency, and by fixing a second one to use different objects so it actually performs a concurrency test.

A bunch of this is simply bubbling up error codes so we know what we're dealing with, or fixing how we report them.

This also makes the packing logic more robust and safer by ignoring transient errors (and really the non-transient ones too, since it's more important that we keep working) and by only deleting references if they haven't changed since we packed them.

We still need to make sure #1534 doesn't happen by locking the packed-refs file before reloading, but the test no longer fails every second run.

carlosmn changed the title from "[WIP] Concurrency fixes for the reference db" to "Concurrency fixes for the reference db" on Mar 10, 2016
carlosmn (Member, Author)

This finally works fine on my Debian. I don't know why AppVeyor doesn't like it; I've tried it with VS2015 and it works fine for me. Turns out I was testing the wrong branch, so I can repro.

stanhu (Contributor) commented Aug 27, 2016

Thanks for working on this fix. I can reproduce #1534 quite easily by running git gc continuously on a networked filesystem, pushing updates to a branch, and checking the SHA of that branch with Rugged. At times, it appears the HEAD momentarily goes "back in time"; it's also possible for the branch to disappear momentarily as well. More details here: https://gitlab.com/gitlab-org/gitlab-ce/issues/15392#note_14530450

The current workaround seems to be to initialize a new Rugged::Repository anytime you need to look up the latest SHA of a commit. This does not seem ideal, so I'm curious if there's anything the community can do to help move this PR forward.
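
With these fixes, lock contention surfaces to the caller as GIT_ELOCKED, so a caller can retry the lookup instead of re-opening the repository. A minimal sketch against the public C API; the helper name and the retry bound are illustrative, not part of the patch:

```c
#include <git2.h>

/* Hypothetical helper: retry a HEAD lookup while the refdb reports
 * GIT_ELOCKED, instead of re-initializing the repository. */
static int lookup_head_retrying(git_oid *out, git_repository *repo)
{
	int error, tries = 0;

	do {
		error = git_reference_name_to_id(out, repo, "HEAD");
	} while (error == GIT_ELOCKED && ++tries < 100); /* arbitrary bound */

	return error;
}
```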

```c
if (git_path_exists(full_path.ptr) && p_unlink(full_path.ptr) < 0) {
	if (failed)
		continue;
	/* We need to stopy anybody from updating the ref while we try to do a safe delete */
```

Contributor: stopy -> stop :)


```c
if (!git_reference_lookup(&ref, g_repo, name)) {
	cl_git_pass(git_reference_delete(ref));
	cl_git_pass(error);
	git_reference_free(ref);
}

if (i == 5) {
```

Member: This might be nice to have as a constant.

```c
git_reference *ref;
char name[128];
git_repository *repo;

cl_git_pass(git_repository_open(&repo, data->path));

for (i = 0; i < 10; ++i) {
```

Member: This might be nice to have as a constant.

```c
do {
	error = git_reference_name_to_id(&head, repo, "HEAD");
} while (error == GIT_ELOCKED);
cl_git_pass(error);

for (i = 0; i < 10; ++i) {
```

Member: This might be nice to have as a constant. (I realize this was already here but it might be a nice cleanup.)

```c
do {
	error = git_reference_create(&ref[i], repo, name, &head, 0, NULL);
} while (error == GIT_ELOCKED);
cl_git_pass(error);

if (i == 5) {
```

Member: This might be nice to have as a constant.
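
The cleanup being suggested would look something like this; the constant names are hypothetical, not from the patch:

```c
/* Hypothetical names for the test's magic numbers. */
#define THREAD_REFS 10 /* refs each thread creates and deletes */
#define PACK_AT      5 /* iteration at which the test packs references */
```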

Commits

We say it's going to work if you use a different repository in each thread. Let's do precisely that in our code instead of hoping that re-using the refdb is going to work. This test does currently fail, surfacing existing bugs.
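
A sketch of that per-thread discipline using the public API and pthreads; the worker body is illustrative, not the test's actual code:

```c
#include <git2.h>
#include <pthread.h>

/* Illustrative worker: every thread opens its own repository handle
 * instead of sharing one refdb. Assumes git_libgit2_init() was called
 * once by the process. */
static void *worker(void *payload)
{
	const char *path = payload;
	git_repository *repo;
	git_oid head;

	if (git_repository_open(&repo, path) < 0)
		return NULL;

	/* ... create/delete/pack references through this handle only ... */
	git_reference_name_to_id(&head, repo, "HEAD");

	git_repository_free(repo);
	return NULL;
}
```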

We can get useful information like GIT_ELOCKED out of this instead of just -1.

In order not to undo concurrent modifications to references, we must make sure that we only delete a loose reference if it still has the same value as when we packed it. This means we need to lock it and then compare its value with the one we put in the packed file.
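
A simplified sketch of that compare-before-delete idea in plain POSIX, rather than libgit2's internal lock-file helpers; the ".lock" suffix follows git's usual convention, and error handling is abbreviated:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Delete the loose ref at `path` only if it still contains `packed_val`
 * (the value we wrote into packed-refs). Returns 0 on success, -1 if the
 * ref is locked or has changed since we packed it. */
static int delete_if_unchanged(const char *path, const char *packed_val)
{
	char lockpath[4096], buf[64];
	ssize_t len;
	int fd, ret = -1;

	snprintf(lockpath, sizeof(lockpath), "%s.lock", path);

	/* O_EXCL makes lock acquisition atomic: if another thread holds
	 * the lock, back off and let the caller retry later. */
	if ((fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0666)) < 0)
		return -1;
	close(fd);

	if ((fd = open(path, O_RDONLY)) >= 0) {
		len = read(fd, buf, sizeof(buf) - 1);
		close(fd);
		if (len > 0) {
			buf[len] = '\0';
			/* Only unlink when the on-disk value still matches
			 * what we packed; otherwise someone updated it. */
			if (!strncmp(buf, packed_val, strlen(packed_val)))
				ret = unlink(path);
		}
	} else if (errno == ENOENT) {
		ret = 0; /* someone else already removed it: job done */
	}

	unlink(lockpath);
	return ret;
}
```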

We need to save the errno, lest we clobber it in the giterr_set() call. Also add code for reporting that a path component is missing, which is a distinct failure mode.
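
Roughly this pattern; the reporting helper below is a stand-in for giterr_set(GITERR_OS, ...), which formats strerror(errno) into the message, so anything that runs in between must not clobber errno:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Illustrative stand-in for giterr_set(GITERR_OS, ...). */
static void report_os_error(const char *msg)
{
	fprintf(stderr, "%s: %s\n", msg, strerror(errno));
}

static int remove_ref_file(const char *path)
{
	if (unlink(path) < 0) {
		int saved_errno = errno;

		/* ... cleanup that may itself touch errno ... */

		errno = saved_errno; /* restore before building the message */
		report_os_error("failed to delete reference");
		return -1;
	}
	return 0;
}
```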

There might be a few threads or processes working with references concurrently, so fortify the code to ignore errors that come from concurrent access and that do not stop us from continuing the work. This includes ignoring an unlinking error: either someone else removed the file, or we leave it around. In the former case the job is done, and in the latter the ref is still in a valid state.

We can reduce the duplication by cleaning up at the beginning of the loop, since that is something we want to do every time we continue.

This allows the caller to know that the error was e.g. due to the packed-refs file being already locked, so they can try again later. The logic simply consists of retrying for as long as the library says the data is locked; it eventually gets through.

Checking the size before we open the file descriptor can lead to the file being replaced from under us when renames aren't quite atomic, so we can end up reading too little of the file, leading us to think the file is corrupted.
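
That is the usual TOCTOU fix: open first, then size the file via the descriptor you actually hold, so a concurrent rename can't swap the file between the two steps. A minimal sketch, not the library's actual reader:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read a whole file, sizing it via fstat() on the open descriptor
 * rather than stat() on the path, so a concurrent rename cannot give
 * us the size of one file and the contents of another. */
static char *read_whole_file(const char *path, size_t *out_len)
{
	struct stat st;
	char *buf = NULL;
	int fd;

	if ((fd = open(path, O_RDONLY)) < 0)
		return NULL;

	if (fstat(fd, &st) == 0 &&
	    (buf = malloc(st.st_size + 1)) != NULL &&
	    read(fd, buf, st.st_size) == st.st_size) {
		buf[st.st_size] = '\0';
		*out_len = (size_t)st.st_size;
	} else {
		free(buf);
		buf = NULL;
	}

	close(fd);
	return buf;
}
```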

It does not help us to check whether the file exists before trying to unlink it, since it might be gone by the time unlink is called. Instead, try to remove it and handle the resulting error if it did not exist.
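
That is the standard race-free formulation: attempt the unlink and inspect errno, rather than testing existence first. A sketch:

```c
#include <errno.h>
#include <unistd.h>

/* Race-free delete: don't test for existence first, just unlink and
 * treat "already gone" as success, since another thread or process
 * may have removed the file in between. */
static int remove_if_present(const char *path)
{
	if (unlink(path) < 0 && errno != ENOENT)
		return -1; /* a real error */
	return 0; /* removed, or already gone: either way the job is done */
}
```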

At times we may try to delete a reference which a different thread has already taken care of.

On Windows we can find locked files even when reading a reference or the packed-refs file. Bubble up the error in this case as well, to allow callers on Windows to retry more intelligently.

ethomson merged commit 904e1e7 into master on Nov 14, 2016.
carlosmn deleted the cmn/refdb-para branch on November 15, 2016.