odb: freshen existing objects when writing #3861

Merged
ethomson merged 2 commits into master from ethomson/refresh_objects on Aug 4, 2016

Conversation

ethomson
Member

When writing an object, we calculate its OID and see if it exists in the object database. If it does, we need to freshen the file that contains it.

This adds a new function to backends, freshen, which will freshen the object if it exists.

Previously, we would do a simple exists check during git_odb_write, after calculating the new object ID. Now, we will use the freshen function if it is implemented on a particular backend (falling back to the exists check if it isn't).

A slight disappointment is that we no longer look in the cache first to see if an object exists. To do that, we would need to store more information in the cache, namely the backend that the object was found in (or else we would need to go locate the object again to freshen it, and if we do that, there would be no point in looking in the cache at all.) That's quite a yak to shave, it turns out. But it's probably not really a big deal, since this is only used in the write functions, where it is presumed that the object doesn't exist (and thus wouldn't be in the cache).

(So in the general case, this is as performant as before.)
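
As a rough illustration of the fallback described above (not the code from this PR), the lookup could be sketched in C as follows. The `sketch_backend` struct and every `sketch_*` name are hypothetical stand-ins that do not match libgit2's real backend interface, and object IDs are plain strings for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified backend: NOT libgit2's real git_odb_backend. */
typedef struct sketch_backend {
	/* Returns true if the object exists in this backend. */
	bool (*exists)(struct sketch_backend *self, const char *oid);
	/* Optional (may be NULL). Returns true if the object exists and was freshened. */
	bool (*freshen)(struct sketch_backend *self, const char *oid);
	const char *stored_oid; /* the single object this toy backend holds */
	int freshen_calls;      /* how many times an object was actually freshened */
} sketch_backend;

bool sketch_exists(sketch_backend *self, const char *oid)
{
	return self->stored_oid != NULL && strcmp(self->stored_oid, oid) == 0;
}

bool sketch_freshen(sketch_backend *self, const char *oid)
{
	if (!sketch_exists(self, oid))
		return false;
	self->freshen_calls++; /* a real backend would touch the file here */
	return true;
}

/*
 * Returns true if some backend already holds the object, meaning the
 * write can be skipped. Backends that implement freshen are asked to
 * freshen; the others fall back to a plain exists check.
 */
bool sketch_odb_freshen(sketch_backend **backends, size_t count, const char *oid)
{
	size_t i;

	for (i = 0; i < count; i++) {
		sketch_backend *b = backends[i];
		bool found = b->freshen ? b->freshen(b, oid) : b->exists(b, oid);

		if (found)
			return true;
	}

	return false;
}
```

A caller on the write path would compute the object ID, call `sketch_odb_freshen`, and only write the object when it returns false.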

@carlosmn
Member

xref #3650

@ethomson
Member Author

Note that git itself will only freshen a pack file once per execution to avoid thrashing a packfile with touches, which would be quite annoying. That's probably not appropriate for us but we can keep track of the last time we've freshened a packfile and re-touch it only every n seconds.

I'm all ears if you have a good idea for what that n should be, otherwise I'll pick something that seems reasonable and we'll hope that we don't regret it. :P

@arthurschreiber
Member

@ethomson What's the status here? Anything I could help with?

ethomson force-pushed the ethomson/refresh_objects branch from b732160 to 78e7595 on August 1, 2016 15:58
@ethomson
Member Author

ethomson commented Aug 1, 2016

@arthurschreiber Just needs a review and some sanity checking, especially on the 2 second number that I picked out of thin air.

ethomson force-pushed the ethomson/refresh_objects branch 2 times, most recently from 5e6fab0 to 3754e57 on August 3, 2016 02:05

return odb_freshen_1(db, id, true);

/* Failed to refresh, hence not found */
return 0;
Member

I think the return code semantics are a tad confusing - I guess this results from odb_freshen replacing the git_odb_exists calls, where the exists-function is obviously returning a boolean value. In this case, though, I'd rather expect to see our usual error semantics with -1 as error code, as odb_freshen does not hint at a returned boolean value.

Member Author

At the moment, I think that the similarity with refresh is beneficial, so I'm going to keep it returning a boolean for now, especially since it's just internal. But I think that odb.c has become a bit confusing overall, so I am going to give the whole file a closer look this weekend.

@pks-t
Member

pks-t commented Aug 4, 2016

I don't really like the two-second span between refreshes, as the value seems to be rather arbitrary. On the other side we have to choose a value here and I have no arguments for or against any other value besides the trade-off between thrashing and accuracy.

Looks good besides this and the few comments I've left.

@ethomson
Member Author

ethomson commented Aug 4, 2016

> I don't really like the two-second span between refreshes, as the value seems to be rather arbitrary.

My initial thought was to just do one second but I really wanted to avoid dealing with sub-second precision, so to avoid refreshing between a rollover from one second to the next (but within the same one second span) I just chose two.

An argument for going much larger (10 seconds) could certainly be made, but I'm not one to make it. I wonder if @peff has any thoughts here.

Thanks for the review.
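
To make the whole-second reasoning above concrete, here is a minimal sketch of a per-pack throttle, assuming timestamps come from something like `time(NULL)`. The `sketch_pack` struct, `sketch_should_freshen`, and `SKETCH_FRESHEN_INTERVAL` are hypothetical names for illustration, not the identifiers used in the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Seconds between freshens; 2 is the value discussed in this thread. */
#define SKETCH_FRESHEN_INTERVAL 2

/* Hypothetical per-pack state: NOT libgit2's real pack structure. */
typedef struct {
	time_t last_freshen; /* 0 means "never freshened" */
} sketch_pack;

/*
 * Returns true if the caller should touch the pack file now, and records
 * `now` as the last freshen time. Returns false if the pack was freshened
 * less than SKETCH_FRESHEN_INTERVAL whole seconds ago.
 */
bool sketch_should_freshen(sketch_pack *pack, time_t now)
{
	if (pack->last_freshen != 0 &&
	    now - pack->last_freshen < SKETCH_FRESHEN_INTERVAL)
		return false;

	pack->last_freshen = now;
	return true;
}
```

With whole-second timestamps, a threshold of 1 could fire when only a fraction of a second has passed (a touch at 10.9s followed by a check at 11.0s differs by one whole second). A difference of two whole seconds guarantees that more than one real second has elapsed, without needing sub-second precision.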

@peff
Member

peff commented Aug 4, 2016

I considered doing a refresh-after-n-seconds like this in git, but didn't bother since we can usually assume our processes are short-lived. I agree it's a good thing for libgit2 to do.

When picking n, I think you have to consider what you're scaling against. And that's basically two things:

  • you're freshening timestamps so that we don't exceed the grace period for gc.pruneExpire, which defaults to 2 weeks (though some hosting sites I could mention drop that to 1 hour, and when trying to aggressively drop objects, even to 5 minutes). So if you've freshened within that time period, there's no need (for these purposes) to freshen again. Probably anything up to about 60 seconds would be completely reasonable, even for insanely aggressive attempts to drop recent-but-not-currently-in-use objects.
  • the act of checking the timestamp and pruning the objects is not actually atomic. The worst case of this is probably git repack --expire-unreachable, as repacks may take several minutes, during which packs could be freshened, but we would end up deleting them anyway (basically we come up with the list of objects to pack at the start of the program, chug on repacking for a while, then delete all the packs we assume we've obsoleted). Naively I want to say that a shorter freshening frequency would help there, but I actually don't think it would. Once you're inside the pruneExpire grace time you're good as long as you don't lose that race, but once you've lost it, it doesn't matter how many times you've freshened.

So...I think you could probably go much higher than you're at. But there's no reason to do so unless you're worried that calling utime once every 2 seconds can be considered thrashing. It probably isn't, though (especially given that it's letting you skip an object write entirely).
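
For readers unfamiliar with the mechanics, the following sketch shows what a freshen and a grace-period check amount to at the filesystem level, assuming a POSIX system. `sketch_touch` and `sketch_is_expired` are hypothetical helpers; the expiry test is a simplified stand-in for the mtime-versus-grace-period comparison described above, not git's or libgit2's actual code.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <utime.h>

/*
 * Freshen a file: passing NULL to utime() sets both the access and
 * modification times to the current time.
 * Returns 0 on success, -1 on error.
 */
int sketch_touch(const char *path)
{
	return utime(path, NULL);
}

/*
 * Simplified grace-period check: returns 1 if the file's mtime is more
 * than grace_seconds older than `now`, 0 if it is not, -1 on error.
 */
int sketch_is_expired(const char *path, time_t now, time_t grace_seconds)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	return (now - st.st_mtime) > grace_seconds ? 1 : 0;
}
```

Even with an aggressive 5-minute grace period (300 seconds), a file touched within the last couple of seconds is nowhere near expiry, which is why the exact freshen interval matters little for this purpose.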

Edward Thomson added 2 commits August 4, 2016 15:12
When writing an object, we calculate its OID and see if it exists in the
object database.  If it does, we need to freshen the file that contains
it.

Since multiple objects being written may all already exist in a single
packfile, avoid freshening that packfile repeatedly in a tight loop.
Instead, only freshen pack files every 2 seconds.
ethomson force-pushed the ethomson/refresh_objects branch from 3754e57 to 27051d4 on August 4, 2016 19:12
@ethomson
Member Author

ethomson commented Aug 4, 2016

Thanks, @peff - I agree that every two seconds ought not to be very expensive, even on a truly godawful filesystem.

Thanks @pks-t for the review.

ethomson merged commit 73dab76 into master on Aug 4, 2016
ethomson deleted the ethomson/refresh_objects branch on January 13, 2017 12:28
5 participants