odb: freshen existing objects when writing #3861

ethomson · 2016-07-14T20:50:59Z

When writing an object, we calculate its OID and see if it exists in the object database. If it does, we need to freshen the file that contains it.

This adds a new function to backends, freshen, which will freshen the object if it exists.

Previously, we would do a simple exists check during git_odb_write, after calculating the new object ID. Now, we will use the freshen function if it is implemented on a particular backend (falling back to the exists check if it is not).

A slight disappointment is that we no longer look in the cache first to see if an object exists. To do that, we would need to store more information in the cache, namely the backend that the object was found in (or else we would need to go locate the object again to freshen it, and if we do that, there would be no point in looking in the cache at all). That's quite a yak to shave, it turns out. But it's probably not really a big deal, since this is only used in the write functions, where it is presumed that the object doesn't exist (and thus wouldn't be in the cache).

(So in the general case, this is as performant as before.)
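The freshen-or-exists fallback described in this comment might be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `demo_backend`, `demo_odb_freshen`, and the two sample callbacks are hypothetical names, the OID is simplified to a string, and the real `git_odb_backend` structure is considerably richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified backend; NOT libgit2's real git_odb_backend. */
typedef struct demo_backend {
	/* Optional. Returns 1 if the object exists (and was freshened), 0 if not found. */
	int (*freshen)(struct demo_backend *self, const char *oid);
	/* Required. Returns 1 if the object exists, 0 otherwise. */
	int (*exists)(struct demo_backend *self, const char *oid);
	int freshen_calls; /* bookkeeping for the illustration only */
	int exists_calls;
} demo_backend;

/*
 * Ask each backend in turn. Prefer freshen when the backend implements it;
 * otherwise fall back to the plain exists check. Returns true as soon as
 * any backend reports that it already holds the object.
 */
static bool demo_odb_freshen(demo_backend **backends, size_t n, const char *oid)
{
	size_t i;

	assert(backends != NULL && oid != NULL);

	for (i = 0; i < n; i++) {
		demo_backend *b = backends[i];
		int found = b->freshen ? b->freshen(b, oid) : b->exists(b, oid);

		if (found)
			return true;
	}

	return false; /* not found anywhere: the caller must write the object */
}

/* Sample callback: pretends the object is present and counts the call. */
static int demo_freshen_hit(demo_backend *self, const char *oid)
{
	(void)oid;
	self->freshen_calls++;
	return 1;
}

/* Sample callback: pretends the object is absent and counts the call. */
static int demo_exists_miss(demo_backend *self, const char *oid)
{
	(void)oid;
	self->exists_calls++;
	return 0;
}
```

A write path built on this would skip the actual object write whenever `demo_odb_freshen` returns true, which is the same shortcut the earlier exists check provided.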

carlosmn · 2016-07-15T22:25:11Z

xref #3650

ethomson · 2016-07-19T01:11:41Z

Note that git itself will only freshen a packfile once per execution to avoid thrashing it with touches, which would be quite annoying. That's probably not appropriate for us, but we can keep track of the last time we freshened a packfile and re-touch it only every n seconds.

I'm all ears if you have a good idea for what that n should be, otherwise I'll pick something that seems reasonable and we'll hope that we don't regret it. :P
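As a rough sketch of the "re-touch only every n seconds" idea above: the names `demo_pack`, `demo_pack_freshen`, and `DEMO_FRESHEN_INTERVAL` are hypothetical, the current time is passed in so the logic can be exercised without sleeping, and the actual timestamp update (for example via `utime`) is left as a comment rather than performed.

```c
#include <assert.h>
#include <time.h>

/* Assumed interval in seconds; the thread later settles on 2. */
#define DEMO_FRESHEN_INTERVAL 2

/* Hypothetical per-packfile state; not libgit2's real pack structure. */
typedef struct {
	time_t last_freshen; /* when the packfile was last touched */
	int touches;         /* how many times it has been touched */
} demo_pack;

/*
 * Touch the packfile unless it was already touched less than
 * DEMO_FRESHEN_INTERVAL seconds ago.
 * Returns 1 if the packfile was touched, 0 if the touch was skipped.
 */
static int demo_pack_freshen(demo_pack *p, time_t now)
{
	assert(p != NULL);

	if (p->touches > 0 &&
	    difftime(now, p->last_freshen) < DEMO_FRESHEN_INTERVAL)
		return 0; /* freshened recently enough; skip */

	/* A real implementation would update the file's mtime here. */
	p->touches++;
	p->last_freshen = now;
	return 1;
}
```

Writing many objects that all live in the same packfile then costs at most one touch per interval instead of one touch per object.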

arthurschreiber · 2016-08-01T14:22:43Z

@ethomson What's the status here? Anything I could help with?

ethomson · 2016-08-01T16:13:58Z

@arthurschreiber Just needs a review and some sanity checking, especially on the 2 second number that I picked out of thin air.

pks-t · 2016-08-04T08:40:57Z

src/odb.c

+		return odb_freshen_1(db, id, true);
+
+	/* Failed to refresh, hence not found */
+	return 0;


I think the return code semantics are a tad confusing - I guess this results from odb_freshen replacing the git_odb_exists calls, where the exists-function is obviously returning a boolean value. In this case, though, I'd rather expect to see our usual error semantics with -1 as error code, as odb_freshen does not hint at a returned boolean value.

ethomson · Aug 4, 2016

At the moment, I think that the similarity with refresh is beneficial, so I'm going to keep it returning a boolean for now, especially since it's just internal. But I think that odb.c has become a bit confusing overall, so I am going to give the whole file a closer look this weekend.

pks-t · 2016-08-04T09:07:22Z

I don't really like the two-second span between refreshes, as the value seems to be rather arbitrary. On the other hand, we have to choose a value here, and I have no arguments for or against any other value besides the trade-off between thrashing and accuracy.

Looks good besides this and the few comments I've left.

ethomson · 2016-08-04T13:38:42Z

> I don't really like the two-second span between refreshes, as the value seems to be rather arbitrary.

My initial thought was to just do one second but I really wanted to avoid dealing with sub-second precision, so to avoid refreshing between a rollover from one second to the next (but within the same one second span) I just chose two.

An argument to going much larger (10 seconds) could certainly be made, but I'm not one to make it. I wonder if @peff has any thoughts here.

Thanks for the review.
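One reading of the rollover concern in the comment above is that a clock with whole-second resolution can report "one second elapsed" after only a fraction of a second of real time. The following small example illustrates that reading; the helper names are hypothetical and millisecond inputs are used only to make the truncation visible.

```c
#include <assert.h>

/* Elapsed time as seen by a clock that only reports whole seconds. */
static long demo_elapsed_whole_seconds(long last_ms, long now_ms)
{
	assert(last_ms >= 0 && now_ms >= last_ms);
	return now_ms / 1000 - last_ms / 1000;
}

/* Returns 1 if the truncated elapsed time reaches the interval, else 0. */
static int demo_should_freshen(long last_ms, long now_ms, long interval_s)
{
	return demo_elapsed_whole_seconds(last_ms, now_ms) >= interval_s;
}
```

With a last freshen at 10.9 s and a new write at 11.1 s, only 200 ms have passed, yet the truncated clock reports one second. Under this reading, a one-second interval would freshen again at that point, while a two-second interval would not, and any freshen it does allow is guaranteed to be more than one real second after the previous one.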

peff · 2016-08-04T18:23:10Z

I considered doing a refresh-after-n-seconds like this in git, but didn't bother since we can usually assume our processes are short-lived. I agree it's a good thing for libgit2 to do.

When picking n, I think you have to consider what you're scaling against. And that's basically two things:

- You're freshening timestamps so that we don't exceed the grace period for gc.pruneExpire, which defaults to 2 weeks (though some hosting sites I could mention drop that to 1 hour, and when trying to aggressively drop objects, even to 5 minutes). So if you've freshened within that time period, there's no need (for these purposes) to freshen again. Probably anything up to about 60 seconds would be completely reasonable, even for insanely aggressive attempts to drop recent-but-not-currently-in-use objects.
- The act of checking the timestamp and pruning the objects is not actually atomic. The worst case of this is probably git repack --expire-unreachable, as repacks may take several minutes, during which packs could be freshened, but we would end up deleting them anyway (basically we come up with the list of objects to pack at the start of the program, chug on repacking for a while, then delete all the packs we assume we've obsoleted). Naively I want to say that a shorter freshening frequency would help there, but I actually don't think it would. Once you're inside the pruneExpire grace time you're good as long as you don't lose that race, but once you've lost it, it doesn't matter how many times you've freshened.

So...I think you could probably go much higher than you're at. But there's no reason to do so unless you're worried that calling utime once every 2 seconds can be considered thrashing. It probably isn't, though (especially given that it's letting you skip an object write entirely).

ethomson · 2016-08-04T20:12:04Z

Thanks, @peff - I agree that every two seconds ought not to be very expensive, even on a truly godawful filesystem.

Thanks @pks-t for the review.

ethomson force-pushed the ethomson/refresh_objects branch from b732160 to 78e7595 Compare August 1, 2016 15:58

ethomson force-pushed the ethomson/refresh_objects branch 2 times, most recently from 5e6fab0 to 3754e57 Compare August 3, 2016 02:05

pks-t reviewed Aug 4, 2016
Edward Thomson added 2 commits August 4, 2016 15:12

odb: freshen existing objects when writing

8f09a98

When writing an object, we calculate its OID and see if it exists in the object database. If it does, we need to freshen the file that contains it.

odb: only freshen pack files every 2 seconds

27051d4

Since writing multiple objects may all already exist in a single packfile, avoid freshening that packfile repeatedly in a tight loop. Instead, only freshen pack files every 2 seconds.

ethomson force-pushed the ethomson/refresh_objects branch from 3754e57 to 27051d4 Compare August 4, 2016 19:12

ethomson merged commit 73dab76 into master Aug 4, 2016

ethomson mentioned this pull request Aug 4, 2016

We should refresh objects when they're referenced #3650

Closed

ethomson deleted the ethomson/refresh_objects branch January 13, 2017 12:28

johnhaley81 mentioned this pull request Feb 6, 2017

Bump libgit to 0bf0526 nodegit/nodegit#1187

Merged
