Keep data about source of similarity #1643

Merged: 2 commits, Jun 12, 2013

ethomson (Member)

This change adds data about the source of similarity matching. Consider the case where you have two files that are very similar to each other, and both are renamed and modified such that they remain very similar. For example:

Initial commit:
Class1.cs: class Class1 { }
Class2.cs: class Class2 { }

And these files are both renamed:
ClassA.cs: class ClassA { }
ClassB.cs: class ClassB { }

The loop in git_diff_find_similar iterates over the possible rename targets (in this case, ClassA.cs and ClassB.cs), computes each target's similarity to the possible sources, and records the best result in the matches array.

If our deltas vector is:

  0. Class1.cs (delete)
  1. Class2.cs (delete)
  2. ClassA.cs (add)
  3. ClassB.cs (add)

Then we will decide that ClassA.cs is 93% similar to Class1.cs and record it as matches[2]. ClassA.cs is also 93% similar to Class2.cs, but since that is not strictly better, the second match is ignored.

Similarly, we will decide that ClassB.cs is 93% similar to Class1.cs and record it as matches[3]. Again, the equally good match against Class2.cs is ignored.

After the loop to calculate similarity:

matches[2] = { idx = 0, similarity = 93 }
matches[3] = { idx = 0, similarity = 93 }

This gives us a rename from Class1.cs to ClassA.cs, and records Class2.cs as a delete and ClassB.cs as an add. This is suboptimal and differs from core git's behavior.
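To make the problem concrete, here is a minimal sketch of a per-target "best source" loop. It is not the actual code in git_diff_find_similar: the names naive_status, naive_match and naive_find_similar are hypothetical, and the similarity scores are assumed to be precomputed in a table rather than calculated from file contents.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real delta and match records. */
typedef enum { NAIVE_DELETED, NAIVE_ADDED } naive_status;

typedef struct {
    size_t idx;          /* index of the best source delta */
    unsigned similarity; /* score from 0 to 100; 0 means no match */
} naive_match;

/*
 * status[i] says whether delta i is a deleted (source) or added (target)
 * file. sim[to * n + from] is the assumed, precomputed similarity of
 * target `to` to source `from`.
 */
void naive_find_similar(const naive_status *status, const unsigned *sim,
                        size_t n, naive_match *matches)
{
    size_t to, from;

    assert(status != NULL && sim != NULL && matches != NULL);

    for (to = 0; to < n; to++) {
        matches[to].idx = 0;
        matches[to].similarity = 0;

        if (status[to] != NAIVE_ADDED)
            continue;

        for (from = 0; from < n; from++) {
            unsigned score;

            if (status[from] != NAIVE_DELETED)
                continue;

            score = sim[to * n + from];

            /* Strict comparison: an equally good later source is ignored. */
            if (score > matches[to].similarity) {
                matches[to].idx = from;
                matches[to].similarity = score;
            }
        }
    }
}
```

With the four deltas above and a score of 93 for every target/source pair, both matches[2] and matches[3] end up with idx = 0, because nothing records that delta 0 has already been chosen.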

By adding data about the similarity source, we are able to avoid doubling up a single source as the best similarity match for two targets. With the proposed change, at the end of our loop:

matches[2] = { idx = 0, similarity = 93 }
matches[3] = { idx = 1, similarity = 93 }

And thus we have a rename from Class1.cs to ClassA.cs and a rename from Class2.cs to ClassB.cs.
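The idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: tracked_status, tracked_match and tracked_find_similar are hypothetical names, the scores come from an assumed precomputed table, and the rule "a target may only take a source if it strictly beats that source's current best" is one plausible reading of the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real delta and match records. */
typedef enum { TRACKED_DELETED, TRACKED_ADDED } tracked_status;

typedef struct {
    size_t idx;          /* index of the matched delta */
    unsigned similarity; /* score from 0 to 100; 0 means no match */
} tracked_match;

/*
 * sim[to * n + from] is the assumed, precomputed similarity of target
 * `to` to source `from`. Returns 0 on success, -1 if allocation fails.
 */
int tracked_find_similar(const tracked_status *status, const unsigned *sim,
                         size_t n, tracked_match *matches)
{
    tracked_match *match_sources;
    size_t to, from;

    assert(status != NULL && sim != NULL && matches != NULL);

    if (n == 0)
        return 0;

    /* calloc zeroes every entry, so each source starts at similarity 0. */
    match_sources = calloc(n, sizeof(*match_sources));
    if (match_sources == NULL)
        return -1;

    for (to = 0; to < n; to++) {
        matches[to].idx = 0;
        matches[to].similarity = 0;

        if (status[to] != TRACKED_ADDED)
            continue;

        for (from = 0; from < n; from++) {
            unsigned score;

            if (status[from] != TRACKED_DELETED)
                continue;

            score = sim[to * n + from];

            /* Take the source only if it improves on both sides. */
            if (score > matches[to].similarity &&
                score > match_sources[from].similarity) {
                matches[to].idx = from;
                matches[to].similarity = score;
                match_sources[from].idx = to;
                match_sources[from].similarity = score;
            }
        }
    }

    /* The per-source scratch data is only needed inside the loop above. */
    free(match_sources);
    return 0;
}
```

With the same four deltas and a score of 93 for every pair, matches[2] gets idx = 0 and matches[3] gets idx = 1. The sketch is deliberately simple: it never releases a source once a target has claimed it, so it should not be read as a complete assignment algorithm.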

vmg (Member) commented Jun 10, 2013

Very neat, but I think this is going to conflict with @arrbee's big diff PR. Could you re-open this PR on top of his branch so we can bring all this together?

arrbee (Member) commented Jun 10, 2013

I think (though I haven't actually tested) that this is orthogonal to the work I did. I haven't touched git_diff_tform.c (apart from some #include changes) so I think we may be safe.

arrbee (Member) commented Jun 10, 2013

This looks good to me. I think the match_sources array can be freed as soon as the first loop is over because it is only used in that loop to make sure we are picking the truly best target. That is pretty minor, but if you move the new git__free up, it will slightly reduce the peak memory usage by the algorithm (albeit by a pretty negligible amount).

vmg pushed a commit that referenced this pull request Jun 12, 2013
Keep data about source of similarity
vmg merged commit 88c401b into libgit2:development Jun 12, 2013
phatblat pushed a commit to phatblat/libgit2 that referenced this pull request Sep 13, 2014
Keep data about source of similarity