Improve revision walk preparation logic #3921

carlosmn · 2016-09-01T11:21:16Z

This brings us closer to the code that's in git, makes it more efficient and introduces the slop mechanic in order to make it less likely a complex graph will trip us up.

This solves the failing tests presented in #3838 in a much more elegant manner than the commits I pushed to that branch, and resolves #3916. The gecko-dev walk in question now runs in 1.5s instead of longer than we care to measure.

Some of these tests now set the sorting since our unsorted iteration is now much less sorted than it used to be.

Chances are we're still doing something silly performance-wise, like the way we deal with parents in mark_uninteresting(), but this produces correct results and solves the immediate performance issue we're facing.

pks-t · 2016-09-01T11:39:40Z

src/revwalk.c

@@ -398,81 +398,191 @@ static int revwalk_next_reverse(git_commit_list_node **object_out, git_revwalk *
 	return *object_out ? 0 : GIT_ITEROVER;
 }

-
-static int interesting(git_pqueue *list)
+static int contains(git_pqueue *list, git_commit_list_node *node)


This looks like git_vector_search, which should in fact be more efficient. So maybe just add #define git_pqueue_search git_vector_search and remove this function altogether?

I ended up removing the whole block since we're no longer using this code.

pks-t · 2016-09-01T12:13:03Z

Mostly minor nits, looks very nice otherwise 👍

ethomson · 2016-09-02T22:34:54Z

I'm seeing some results that differ from git, using the repo in #3916 :

% git rev-list 0dd403224a5acb0702bdbf7ff405067f5d29c239 ^b7083959a30f2137d8a6e27a8489f8729873950c --date-order |head -10             
0dd403224a5acb0702bdbf7ff405067f5d29c239
a2812fa126be538f73efed589e78d6973f23df2f
21ac721516934679f9d6528eba41364bbb7f6f5d
a2da90fae1c4b5fd0cd33ff1a509d8589f8ce695
7f0262e9054aac9f44ee307ea2d1b9a2f2993da3
f09e8fef1a803905ee29457e12cd68a04af256c4
44a196676794033e6dc0a66b890cdf55e7a3c999
247986c342e2cab0f95b35c8f841ac609aa0882d
d07b49ff57344a58c389d31d1c0235c469c215b0
f9242cf7754ab3f64b2f6650f40b24c7020ac61c

And in libgit2, using this simple program:

git_repository_open(&repo, "/Users/ethomson/Temp/gecko-dev");
git_revwalk_new(&revwalk, repo);
git_oid_fromstr(&head_id, "0dd403224a5acb0702bdbf7ff405067f5d29c239");
git_revwalk_push(revwalk, &head_id);
git_oid_fromstr(&base_id, "b7083959a30f2137d8a6e27a8489f8729873950c");
git_revwalk_push(revwalk, &base_id);
git_revwalk_sorting(revwalk, GIT_SORT_TIME);

while (git_revwalk_next(&id, revwalk) == 0) {
    char idstr[GIT_OID_HEXSZ];
    git_oid_fmt(idstr, &id);
    printf("%.*s\n", GIT_OID_HEXSZ, idstr);
}

We get:

% revwalk | head -10
0dd403224a5acb0702bdbf7ff405067f5d29c239
a2812fa126be538f73efed589e78d6973f23df2f
21ac721516934679f9d6528eba41364bbb7f6f5d
a2da90fae1c4b5fd0cd33ff1a509d8589f8ce695
7f0262e9054aac9f44ee307ea2d1b9a2f2993da3
f09e8fef1a803905ee29457e12cd68a04af256c4
44a196676794033e6dc0a66b890cdf55e7a3c999
247986c342e2cab0f95b35c8f841ac609aa0882d
e1f9b3132b193b95d8d70acd0bdc2edc0ac33046
529df92f7d1e13e0ca613af7548509c23d919644

carlosmn · 2016-09-22T11:12:27Z

I was testing this pair of commits with rev-list --topo-order which does produce the same result as GIT_SORT_TIME | GIT_SORT_TOPOLOGICAL.

Unfortunately it does seem that this combination isn't quite the right one, as TIME | TOPO is what --date-order describes, rather than --topo-order, which is what we're getting.

carlosmn · 2016-09-22T11:51:03Z

As an aside, the snippet as given does not have an equivalent git incantation, since --date-order also implies a topological sort, so the options we get with git would be equivalent to TOPO or TIME | TOPO.

carlosmn · 2016-09-25T10:41:40Z

I have ported more git code and now we do agree on --topo-order with GIT_SORT_TOPOLOGICAL and --date-order with GIT_SORT_TIME | GIT_SORT_TOPOLOGICAL.

With the exception of a single commit, which git shows but we don't. It's probably some edge condition I'm not taking into account, but we're almost there.

carlosmn · 2016-09-27T14:38:14Z

This should be good to go. We're not quite as fast as git, but fairly close. We're not as careful with memory allocations which is likely part of the reason.

But with this port of the code, we produce the same outputs for the --date-order and --topo-order equivalents.

arthurschreiber · 2016-10-04T10:56:43Z

tests/revwalk/basic.c

+	cl_git_pass(git_oid_fromstr(&old_id, "8e73b769e97678d684b809b163bebdae2911720f"));
+	cl_git_pass(git_revwalk_hide(_walk, &old_id));
+
+   cl_git_pass(git_revwalk_next(&oid, _walk));


It looks like you used spaces here instead of tabs.

arthurschreiber · 2016-10-04T10:56:53Z

tests/revwalk/basic.c

+	cl_git_pass(git_oid_fromstr(&old_id, "b91e763008b10db366442469339f90a2b8400d0a"));
+	cl_git_pass(git_revwalk_hide(_walk, &old_id));
+
+   cl_git_pass(git_revwalk_next(&oid, _walk));


Also spaces instead of tabs here.

carlosmn · 2016-10-04T17:37:28Z

I've discovered that just passing in REVERSE will in fact not reverse things, but not giving it anything will... so I guess we'll have to fix that before merging.

arthurschreiber · 2016-10-05T11:36:08Z

src/revwalk.c

-				parent->in_degree++;
-			}
+	for (list = commits; list; list = list->next) {
+		printf("%s: commit %s\n", __func__, git_oid_tostr_s(&list->item->oid));


There is still some debugging code left here. 😄

We had some home-grown logic to figure out which objects to show during the revision walk, but it was rather inefficient, looking over the same list multiple times to figure out when we had run out of interesting commits. We now use the lists in a smarter way. We also introduce the slop mechanism to determine when to stop looking. When we run out of interesting objects, we continue preparing the walk for another 5 rounds in order to make it less likely that we miss objects in situations with complex graphs.
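The countdown described above could be sketched roughly as follows. This is a simplified illustration and not the code from this PR: `struct commit`, its fields, `any_interesting()` and the exact conditions for resetting the counter are assumptions made for the example. Only the idea of continuing for five more rounds comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define SLOP 5 /* extra rounds to keep going, per the commit message */

/* Hypothetical, simplified commit record used only for this sketch. */
struct commit {
	long time;         /* commit timestamp */
	int uninteresting; /* non-zero once the commit is known to be hidden */
};

/* Return non-zero if any commit in the pending list is still interesting. */
static int any_interesting(const struct commit *pending, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (!pending[i].uninteresting)
			return 1;
	return 0;
}

/*
 * Return how many more rounds the preparation should run; 0 means stop.
 * 'pending' is assumed to be sorted newest-first and 'last_time' is the
 * timestamp of the last commit accepted into the output.
 */
static int still_interesting(const struct commit *pending, size_t len,
	long last_time, int slop)
{
	assert(slop >= 0);

	/* Nothing left to look at: we are done. */
	if (len == 0)
		return 0;
	/* Pending commits are not older than our output: start over. */
	if (last_time <= pending[0].time)
		return SLOP;
	/* Something interesting is still queued: start over. */
	if (any_interesting(pending, len))
		return SLOP;
	/* Everything queued is hidden: count down towards stopping. */
	return slop > 0 ? slop - 1 : 0;
}
```

A caller would keep the returned value between rounds and stop the preparation loop once it reaches 0, so one round that looks finished does not end the walk on its own.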

After porting over the commit hiding and selection we were still left with mismatching output due to the topological sort. This ports the topological sorting code to make us match with our equivalent of `--date-order` and `--topo-order` against the output from `rev-list`.
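For readers unfamiliar with the approach, a topological sort driven by an `in_degree` counter can be sketched as below. This is a generic illustration under stated assumptions, not the ported code: `struct node`, `MAX_PARENTS`, the index-based parent links and the FIFO release order are all made up for the example. Only the `in_degree` name appears in the diff under review.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARENTS 2

/* Hypothetical commit node; parents are indices into the same array, -1 for none. */
struct node {
	int parents[MAX_PARENTS];
	int in_degree; /* number of children in the set that point at this node */
};

/*
 * Write an order with every child before its parents into 'out', which
 * must have room for 'len' entries. Returns the number of nodes emitted.
 */
static size_t topo_sort(struct node *nodes, size_t len, int *out)
{
	size_t i, head = 0, tail = 0;
	int p;

	/* Count the children of every node. */
	for (i = 0; i < len; i++)
		nodes[i].in_degree = 0;
	for (i = 0; i < len; i++) {
		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = nodes[i].parents[p];
			if (parent < 0)
				continue;
			assert((size_t)parent < len);
			nodes[parent].in_degree++;
		}
	}

	/* Tips (nodes without children) are ready immediately. */
	for (i = 0; i < len; i++)
		if (nodes[i].in_degree == 0)
			out[tail++] = (int)i;

	/* 'out' doubles as a FIFO: emit a node, then release its parents. */
	while (head < tail) {
		struct node *n = &nodes[out[head++]];

		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = n->parents[p];
			if (parent >= 0 && --nodes[parent].in_degree == 0)
				out[tail++] = parent;
		}
	}

	return tail;
}
```

Swapping the FIFO for a stack or for a queue ordered by commit time changes which of several ready commits is shown first, which is presumably where the difference between the `--topo-order` and `--date-order` equivalents comes from.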

We've now moved to code that's closer to git and produces the output during the preparation phase, so we no longer process the commits as part of generating the output. This makes a chunk of code redundant, as we're simply short-circuiting it by detecting we've processed the commits already.

…t sorting After `limit_list()` we already have the list in time-sorted order, which is what we want in the "default" case. Enqueueing into the "unsorted" list would just reverse it, and the topological sort will do its own sorting if it needs to.


ethomson · 2016-10-06T18:04:00Z

Hmm. With the test program above (showing and hiding the same commits) I'm still seeing differences.

GIT_SORT_TIME | GIT_SORT_TOPOLOGICAL (versus --topo-order) gives me the first several commits as being the same, but using only GIT_SORT_TIME (versus --date-order) gives me several differences even in the first few commits.

Worse, using either GIT_SORT_TIME or GIT_SORT_TIME|GIT_SORT_TOPOLOGICAL, we walk 489,028 commits while git walks 8,767.

carlosmn · 2016-10-07T11:16:51Z

The equivalent options are:

--topo-order is GIT_SORT_TOPOLOGICAL
--date-order is GIT_SORT_TOPOLOGICAL | GIT_SORT_TIME
GIT_SORT_TIME is something libgit2 has without any equivalency to anything anywhere at any time

I think you might have forgotten to change that _push in your sample program to a _hide, since

% git rev-list --count 0dd403224a5acb0702bdbf7ff405067f5d29c239 b7083959a30f2137d8a6e27a8489f8729873950c
489028

so it definitely looks like you're just listing everything starting from those commits.
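The mapping listed in this comment can be written down as a small lookup. The sketch below uses local stand-in constants rather than the real libgit2 header, so the enum and the function name `sort_mode_for()` are assumptions for illustration; only the two option-to-flag pairings come from the comment.

```c
#include <assert.h>
#include <string.h>

/* Local stand-ins that mirror the libgit2 sort flags named in the thread. */
enum sort_mode {
	SORT_NONE        = 0,
	SORT_TOPOLOGICAL = 1 << 0,
	SORT_TIME        = 1 << 1
};

/*
 * Map a rev-list ordering option to the equivalent flag combination.
 * Returns -1 for options this sketch does not know about.
 */
static int sort_mode_for(const char *rev_list_option)
{
	assert(rev_list_option != NULL);

	if (strcmp(rev_list_option, "--topo-order") == 0)
		return SORT_TOPOLOGICAL;
	if (strcmp(rev_list_option, "--date-order") == 0)
		return SORT_TOPOLOGICAL | SORT_TIME;
	return -1;
}
```

In a real program the result would be handed to `git_revwalk_sorting()`, and a `^commit` exclusion on the `rev-list` command line corresponds to `git_revwalk_hide()` rather than `git_revwalk_push()`.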

ethomson · 2016-10-07T14:42:56Z

I think you might have forgotten to change that _push in your sample program to a _hide ...

Oh, yes, duh, that's exactly my problem. I suspected I was doing something dumb but didn't expect it was that dumb. Alas.

pks-t reviewed Sep 1, 2016

carlosmn force-pushed the cmn/walk-limit-enough branch from 0e613a7 to b3e1dd1 Compare September 25, 2016 10:41

carlosmn force-pushed the cmn/walk-limit-enough branch 2 times, most recently from cdd3a3b to 76f4250 Compare September 27, 2016 14:21

arthurschreiber reviewed Oct 4, 2016

arthurschreiber reviewed Oct 5, 2016

carlosmn force-pushed the cmn/walk-limit-enough branch from a16df01 to ee82845 Compare October 5, 2016 13:36

Edward Thomson and others added 10 commits October 6, 2016 11:04

revwalk: introduce tests that hide old commits

565fb8d

Introduce some tests that show some commits, while hiding some commits that have a timestamp older than the common ancestors of these two commits.

vector, pqueue: add git_vector_reverse and git_pqueue_reverse

0bd4337

This is a convenience function to reverse the contents of a vector and a pqueue in-place. The pqueue function is useful in the case where we're treating it as a LIFO queue.
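An in-place reversal of this kind is usually a two-index swap. The following is a minimal sketch on a plain array of pointers; `ptr_array_reverse()` is a hypothetical name and this is not the body of `git_vector_reverse`.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse an array of pointers in place by swapping from both ends. */
static void ptr_array_reverse(void **items, size_t len)
{
	size_t a, b;

	assert(items != NULL || len == 0);

	if (len < 2)
		return; /* nothing to swap, and avoids len - 1 underflow */

	for (a = 0, b = len - 1; a < b; a++, b--) {
		void *tmp = items[a];
		items[a] = items[b];
		items[b] = tmp;
	}
}
```

Reversing a queue that was filled in order lets the consumer read it from the other end, which matches the LIFO use the commit message mentions.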

pqueue: support not having a comparison function

938f8e3

In this case, we simply behave like a vector.

commit_list: fix the date comparison function

5e2a29a

This returns the integer-cast truth value comparing the dates. What we want instead is a (-1, 0, 1) output depending on how they compare.
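To illustrate the difference: a comparator that returns `a < b` can only yield 0 or 1, so a sort can never learn that one element belongs before the other. A three-way version might look like the sketch below. `struct commit_node` is a hypothetical stand-in and the newest-first direction is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a commit list node; only the timestamp matters here. */
struct commit_node {
	int64_t time;
};

/* Three-way comparison: newer commits sort first, equal times compare as 0. */
static int time_cmp(const void *a, const void *b)
{
	const struct commit_node *ca = a;
	const struct commit_node *cb = b;

	assert(ca != NULL && cb != NULL);

	if (ca->time < cb->time)
		return 1;
	if (ca->time > cb->time)
		return -1;
	return 0;
}

/* Sort an array of nodes so that the newest commit comes first. */
static void sort_newest_first(struct commit_node *nodes, size_t len)
{
	qsort(nodes, len, sizeof(*nodes), time_cmp);
}
```

Comparing explicitly instead of returning `cb->time - ca->time` also avoids truncating a 64-bit difference to `int`.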

revwalk: style change

e93b7e3

Change the condition for returning 0 to be more in line with what we write elsewhere in the library.

Add revwalk note to CHANGELOG

4aed1b9

carlosmn added 2 commits October 6, 2016 11:04

revwalk: update the description for the default sorting

82d4c0e

It changed from implementation-defined to git's default sorting, as there are systems (e.g. rebase) which depend on this order. Also specify more explicitly how you can get git's "date-order".

rebase: don't ask for time sorting

3cc5ec9

`git-rebase--merge` does not ask for time sorting, but uses the default. We now produce the same default time-ordered output as git, so make use of that since it's not always the same output as our time sorting.

carlosmn force-pushed the cmn/walk-limit-enough branch from ee82845 to 3cc5ec9 Compare October 6, 2016 09:05

revwalk: don't show commits that become uninteresting after being enqueued

fedc05c

When we read from the list which `limit_list()` gives us, we need to check that the commit is still interesting, as it might have become uninteresting after it was added to the list.
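The check described in this commit message amounts to testing the flag again at read time instead of trusting the state at enqueue time. A minimal sketch, assuming a hypothetical `struct commit` with a flags field and an `UNINTERESTING` bit (neither is taken from the PR):

```c
#include <assert.h>
#include <stddef.h>

#define UNINTERESTING (1u << 0) /* hypothetical flag bit for this sketch */

/* Hypothetical commit record; 'id' only makes the example easier to follow. */
struct commit {
	unsigned flags;
	int id;
};

/*
 * Copy into 'out' only the commits that are still interesting when the
 * list is read. Returns the number of commits copied.
 */
static size_t collect_interesting(struct commit *const *list, size_t len,
	struct commit **out)
{
	size_t i, n = 0;

	assert(out != NULL || len == 0);

	for (i = 0; i < len; i++) {
		if (list[i]->flags & UNINTERESTING)
			continue; /* was hidden after it had been enqueued */
		out[n++] = list[i];
	}

	return n;
}
```

Because the list holds pointers, a flag set on a commit after it was queued is visible when the list is finally read.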

ethomson merged commit 45dc219 into master Oct 7, 2016

carlosmn mentioned this pull request Oct 24, 2016

revwalk: introduce tests that hide old commits #3838

Closed

carlosmn deleted the cmn/walk-limit-enough branch November 15, 2016 11:31

srajko mentioned this pull request Feb 6, 2017

Bump libgit to 0bf0526 nodegit/nodegit#1187

Merged
