Keep data about source of similarity #1643

ethomson · 2013-06-10T20:38:40Z

This change adds data about the source of similarity matching. Consider the case where you have two very similar files that are both renamed and modified such that they remain very similar. For example:

Initial commit:
Class1.cs: class Class1 { }
Class2.cs: class Class2 { }

And these files are both renamed:
ClassA.cs: class ClassA { }
ClassB.cs: class ClassB { }

The loop in git_diff_find_similar iterates over the possible rename targets (in this case, ClassA.cs and ClassB.cs), computes the similarity of each to every possible source, and records the best result for each target in the matches array.

If our deltas vector is:

Class1.cs (delete)
Class2.cs (delete)
ClassA.cs (add)
ClassB.cs (add)

Then we will decide that ClassA.cs is 96% similar to Class1.cs and record it as matches[2]. ClassA.cs is also 96% similar to Class2.cs, but since that score is not strictly better, the second match is ignored.

Similarly, we will decide that ClassB.cs is 96% similar to Class1.cs and record it as matches[3]. Again, the equally good match against Class2.cs is ignored.

After the loop to calculate similarity:

matches[2] = { idx = 0, similarity = 93 }
matches[3] = { idx = 0, similarity = 93 }

This gives us a rename from Class1.cs to ClassA.cs, but records Class2.cs as a delete and ClassB.cs as an add, because Class1.cs can only be consumed by one rename. This is suboptimal and differs from core git.

By adding data about the similarity source, we are able to avoid doubling up a single source as the best similarity match for two targets. With the proposed change, at the end of our loop:

matches[2] = { idx = 0, similarity = 93 }
matches[3] = { idx = 1, similarity = 93 }

And thus we have a rename from Class1.cs to ClassA.cs and a rename from Class2.cs to ClassB.cs.
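To make the difference concrete, here is a minimal sketch of the two selection rules. It is not the code in this PR: `match_rec`, `pick_matches`, `similarity_score`, and the hard-coded score of 93 are all hypothetical stand-ins, and the delta layout (sources at indexes 0 and 1, targets at 2 and 3) is fixed to match the example above.

```c
#define NUM_DELTAS  4
#define NUM_SOURCES 2   /* deltas 0..1 are the deletes (Class1.cs, Class2.cs) */

/* Hypothetical record mirroring the { idx, similarity } pairs shown above.
 * similarity == 0 means "no match recorded". */
typedef struct {
    int idx;
    unsigned similarity;
} match_rec;

/* Stand-in for real content comparison: every target/source pair in the
 * example is assumed to score the same. */
static unsigned similarity_score(int target, int source)
{
    (void)target;
    (void)source;
    return 93;
}

/* Fill `targets` with the best source for each add delta.
 * With track_sources == 0, only the per-target best is tracked (old rule).
 * With track_sources != 0, a source already claimed with an equal or better
 * score cannot be claimed again (the rule this change proposes). */
static void pick_matches(match_rec targets[NUM_DELTAS], int track_sources)
{
    match_rec sources[NUM_DELTAS];
    int i, j;

    for (i = 0; i < NUM_DELTAS; i++) {
        targets[i].idx = -1;
        targets[i].similarity = 0;
        sources[i].idx = -1;
        sources[i].similarity = 0;
    }

    for (i = NUM_SOURCES; i < NUM_DELTAS; i++) {   /* rename targets */
        for (j = 0; j < NUM_SOURCES; j++) {        /* rename sources */
            unsigned sim = similarity_score(i, j);

            if (sim <= targets[i].similarity)
                continue;   /* not strictly better for this target */
            if (track_sources && sim <= sources[j].similarity)
                continue;   /* source already taken by an equal or better match */

            targets[i].idx = j;
            targets[i].similarity = sim;
            sources[j].idx = i;
            sources[j].similarity = sim;
        }
    }
}
```

Calling `pick_matches(m, 0)` leaves both `m[2].idx` and `m[3].idx` at 0, which is the doubled-up source described above; calling `pick_matches(m, 1)` gives `m[2].idx == 0` and `m[3].idx == 1`.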

vmg · 2013-06-10T21:52:33Z

Very neat, but I think this is going to conflict with @arrbee's big diff PR. Could you re-open this PR on top of his branch so we can bring all this together?

arrbee · 2013-06-10T22:11:16Z

I think (though I haven't actually tested) that this is orthogonal to the work I did. I haven't touched git_diff_tform.c (apart from some #include changes) so I think we may be safe.

arrbee · 2013-06-10T22:18:12Z

This looks good to me. I think the match_sources array can be freed as soon as the first loop is over because it is only used in that loop to make sure we are picking the truly best target. That is pretty minor, but if you move the new git__free up, it will slightly reduce the peak memory usage by the algorithm (albeit by a pretty negligible amount).
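A rough sketch of that ordering, assuming a simplified layout rather than the real function: `sim_entry`, `best_sources_for_targets`, and the precomputed `scores` matrix are hypothetical, and plain `calloc`/`free` stand in for the library's own allocation wrappers. The point is only that the per-source array is released right after the scoring loop, before anything else runs.

```c
#include <stdlib.h>

/* Hypothetical record; similarity == 0 means "no match recorded". */
typedef struct {
    size_t idx;
    unsigned similarity;
} sim_entry;

/* scores is an n*n row-major matrix: scores[target * n + source].
 * Returns the per-target matches (caller frees), or NULL on allocation
 * failure. n is assumed to be greater than zero. */
static sim_entry *best_sources_for_targets(size_t n, const unsigned *scores)
{
    sim_entry *match_targets = calloc(n, sizeof(*match_targets));
    sim_entry *match_sources = calloc(n, sizeof(*match_sources));
    size_t i, j;

    if (!match_targets || !match_sources) {
        free(match_targets);
        free(match_sources);
        return NULL;
    }

    /* First loop: the only place match_sources is consulted. */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            unsigned sim = scores[i * n + j];

            if (sim > match_targets[i].similarity &&
                sim > match_sources[j].similarity) {
                match_targets[i].idx = j;
                match_targets[i].similarity = sim;
                match_sources[j].idx = i;
                match_sources[j].similarity = sim;
            }
        }
    }

    /* Release the per-source bookkeeping here, before any later pass,
     * so both arrays are never held longer than the loop needs them. */
    free(match_sources);

    /* Later passes would read only match_targets. */
    return match_targets;
}
```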

Edward Thomson added 2 commits June 10, 2013 15:16

failing unit test for similar renames

bda3fbb

keep source similarity in rename detection

690bf41

vmg pushed a commit that referenced this pull request Jun 12, 2013

Merge pull request #1643 from ethomson/rename_source

88c401b

Keep data about source of similarity

vmg merged commit 88c401b into libgit2:development Jun 12, 2013

phatblat pushed a commit to phatblat/libgit2 that referenced this pull request Sep 13, 2014

Merge pull request libgit2#1643 from ethomson/rename_source

b88e194

Keep data about source of similarity
