Correctly return matched pathspec when passing "*" or "." #1367

Merged
merged 1 commit into from
Apr 11, 2013

Conversation

yorah
Contributor

@yorah yorah commented Feb 26, 2013

This issue was detected during implementation of libgit2/libgit2sharp/pull/343

Context
git_pathspec_init is the function used to initialize a git_vector of pathspecs from a list of pathspecs passed from the client. However, if the list of pathspecs passed by the client is not deemed interesting (through the use of git_pathspec_is_interesting), then an empty vector is initialized.

What is considered an uninteresting list of pathspec?

  • a NULL or empty list
  • if the client passed only one pathspec, and if this pathspec is (!str || !str[0] || (!str[1] && (str[0] == '*' || str[0] == '.')))

Later on, when the vector of pathspecs is used in git_pathspec_match_path, one of the early-exit branches is:

if (!vspec || !vspec->length)
    return true;

This means that we consider the pathspec a match, but we don't return which pathspec matched (it could be either '*' or '.') in the matched_pathspec out parameter.
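
As an illustration only (this is not code from the PR), the two checks described above can be sketched as follows. The function names are hypothetical stand-ins; only the single-pathspec expression and the early-exit condition are taken from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: a lone pathspec that is NULL, empty, "*" or "."
 * is treated as uninteresting (the expression quoted above). */
static bool spec_is_uninteresting(const char *str)
{
    return !str || !str[0] || (!str[1] && (str[0] == '*' || str[0] == '.'));
}

/* Hypothetical stand-in for the list-level check: a NULL or empty list, or
 * a list holding exactly one uninteresting pathspec, is not interesting. */
static bool list_is_interesting(const char *const *specs, size_t count)
{
    if (!specs || count == 0)
        return false;
    if (count == 1 && spec_is_uninteresting(specs[0]))
        return false;
    return true;
}

/* Sketch of the early exit: with an empty vector the path is reported as a
 * match, but *matched stays NULL, so the caller cannot tell whether "*" or
 * "." was responsible. Real matching is elided. */
static bool match_with_early_exit(size_t vector_length, const char **matched)
{
    assert(matched != NULL);
    *matched = NULL;
    if (vector_length == 0)
        return true;
    return false;
}
```

With a single "*" pathspec the list is uninteresting, the vector stays empty, and the match succeeds without any pattern being reported.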

What does it mean for libgit2sharp (and other clients?)
When passing just '*' or '.', the diff process does not correctly notify which pathspec matched the diffed file (notifying was introduced in #1249).

Ways to fix it
If it can help, I'm willing to tackle this issue, but I would like to check first with you if this is a valid issue (we could also say it's up to the clients to handle those cases, even if it's still a bit painful IMO), and if yes, what is the preferred way to solve it.

Here are the ways I can think of:

  • 1st proposal (Included in the PR at the moment): just before calling the notify callback, get the matched pathspec if it exists, or fallback to the one that was passed by the user if it doesn't ("*" and "." cases).
  • 2nd proposal: remove the git_pathspec_is_interesting shortcut. It means removing an existing optimization, but I don't think this will have an impact on performance (no numbers to back up this claim, tell me if you want/need some)

@arrbee
Member

arrbee commented Feb 26, 2013

Hmm. I really don't want to call fnmatch with "*" against every single file, which is why I wrote the is_interesting check. But now that all the logic is isolated in the pathspec files, I'm open to moving the is_interesting check into git_attr_fnmatch__parse (i.e. adding a flag to the git_attr_fnmatch object that indicates that it is a "MATCH_ALL" pattern) and then adding a different shortcut at the top of the git_vector_foreach loop inside git_pathspec_match_path. It means that using a "*" or "." pathspec will slow things down a bit (extra function calls for every file in the diff), but so long as we can still short circuit the match, I'm open to it. Actually, I like the idea because detecting trivial patterns (and also possibly patterns with no wildcards) will eventually allow for a number of optimizations to attributes and ignores as well.

To do what I'm describing, we would leave in the is_interesting check, but the only uninteresting pathspecs would be cases where the pathspec is NULL or consists only of NULL or empty strings. All other cases would be considered interesting, but when we parse "*" and "." we would mark them as uninteresting to match and would immediately consider them a match when testing.
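
A minimal sketch of that idea, assuming a simplified pattern struct rather than libgit2's real git_attr_fnmatch: the names `spec`, `SPEC_MATCH_ALL`, `spec_parse` and `specs_match_path` are hypothetical, and POSIX `fnmatch()` stands in for the library's own matcher.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag and struct; not libgit2's actual git_attr_fnmatch. */
#define SPEC_MATCH_ALL 0x1u

typedef struct {
    const char *pattern;
    unsigned int flags;
} spec;

/* "Parse" step: mark the trivial patterns "*" and "." as match-all. */
static void spec_parse(spec *out, const char *pattern)
{
    assert(out != NULL && pattern != NULL);
    out->pattern = pattern;
    out->flags = 0;
    if (pattern[0] != '\0' && pattern[1] == '\0' &&
        (pattern[0] == '*' || pattern[0] == '.'))
        out->flags |= SPEC_MATCH_ALL;
}

/* Match step: returns true on a match and reports which pattern matched. */
static bool specs_match_path(const spec *specs, size_t count,
                             const char *path, const char **matched)
{
    size_t i;
    assert(path != NULL && matched != NULL);
    *matched = NULL;
    for (i = 0; i < count; i++) {
        /* Shortcut: a match-all pattern never reaches fnmatch(). */
        if ((specs[i].flags & SPEC_MATCH_ALL) ||
            fnmatch(specs[i].pattern, path, 0) == 0) {
            *matched = specs[i].pattern;
            return true;
        }
    }
    return false;
}
```

The point of the sketch is that "*" and "." stay in the list, so the matched pattern can still be reported, while the expensive call is skipped for them.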

One concern I have with the behavior you are creating in libgit2sharp is that using a pathspec like [ "*.c", "*" ] will not raise an exception (assuming you have files ending in .c) but [ "*", "*.c" ] will raise one because the "*.c" pattern will never end up matching anything. Does that make sense?
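
The order dependence can be shown with a small sketch, assuming that only the first matching pattern is credited for each file. `count_first_matches` is a hypothetical helper, and POSIX `fnmatch()` is used in place of the library's matcher.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* For each path, credit only the first pattern that matches it.
 * hits[i] ends up as the number of paths credited to patterns[i]. */
static void count_first_matches(const char *const *patterns, size_t npatterns,
                                const char *const *paths, size_t npaths,
                                size_t *hits)
{
    size_t p, i;
    assert(patterns != NULL && paths != NULL && hits != NULL);
    for (i = 0; i < npatterns; i++)
        hits[i] = 0;
    for (p = 0; p < npaths; p++) {
        for (i = 0; i < npatterns; i++) {
            if (fnmatch(patterns[i], paths[p], 0) == 0) {
                hits[i]++;
                break; /* later patterns are never consulted */
            }
        }
    }
}
```

For the files main.c and README, the order [ "*.c", "*" ] credits each pattern once, while [ "*", "*.c" ] credits "*" twice and leaves "*.c" with no match at all.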

@yorah
Contributor Author

yorah commented Feb 26, 2013

Thanks for your proposal, really elegant as always! I will have a go at implementing it in the next few days.

One concern I have with the behavior you are creating in libgit2sharp is that using a pathspec like [ "*.c", "*" ] will not raise an exception (assuming you have files ending in .c) but [ "*", "*.c" ] will raise one because the "*.c" pattern will never end up matching anything. Does that make sense?

Yes, it does! Actually, this is a behaviour I already identified, and which is covered by a (currently) failing test. My initial naive proposal would have been to modify the notify_cb signature so that we could send a list of matching pathspecs instead of just the first one.

However, I now realize that it might have an undesirable side-effect on performance. Would it be OK to add this behaviour, but deactivated by default, and have a flag to activate it (something like GIT_DIFF_NOTIFY_ALL_MATCHED_PATHSPECS)?

@arrbee
Member

arrbee commented Feb 26, 2013

Not to put too fine a point on it, but in my mind, the need to do something like that points to this being a misfeature. I don't have much knowledge of the C# API design esthetic, so I've tried to stay out of this, but the idea that you would run a potentially expensive fnmatch call over every single pathspec entry for every single file just to raise an exception because one might not be used feels to me like taking this too far.

I'd love to go back to the rationale that spawned this feature. I think the column "Ignore Unmatched Pathspec" is intertwining the cases of pathspecs with wildcards and pathspecs without wildcards. If you provide a list of 10 filenames to be staged and one doesn't match, then it seems reasonable to me that that could be an error, but as soon as you start injecting wildcard matches into the list, I think you are getting onto fairly shaky ground.

Interestingly, you can fix some cases of this problem inside libgit2sharp without adding the "notify all matched" behavior by sorting the pathspec from most specific to least specific (i.e. items with no wildcards first, items with a mix next, and items that are all wildcards at the end). There are two problems with that:

  1. There are still cases that will have problems (e.g. you specified "a*" and "*.c" and there is a "a_file.c") - although again this makes me feel like this is a misfeature when wildcard patterns are involved
  2. Sorting the list in that order will actually pessimize the behavior in many cases, requiring more checks per file instead of fewer.
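
A minimal sketch of that ordering, assuming that only '*' and '?' count as wildcard characters (bracket expressions and escapes are ignored here). `spec_rank`, `spec_cmp` and `sort_specs` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* 0 = no wildcards, 1 = mix of literal text and wildcards,
 * 2 = wildcards only. */
static int spec_rank(const char *s)
{
    size_t wild = 0, len, i;
    assert(s != NULL);
    len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] == '*' || s[i] == '?')
            wild++;
    if (wild == 0)
        return 0;
    return wild == len ? 2 : 1;
}

/* qsort comparator over an array of C strings. */
static int spec_cmp(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return spec_rank(sa) - spec_rank(sb);
}

/* Sort from most specific to least specific. qsort is not stable, so
 * entries of equal rank may change relative order. */
static void sort_specs(const char **specs, size_t count)
{
    qsort(specs, count, sizeof(*specs), spec_cmp);
}
```

As the list above notes, this ordering does not resolve overlapping wildcard patterns, and it can increase the number of checks per file.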

Would you consider going back to the original requirements for raising an exception and distinguishing between the wildcards vs no wildcards cases? Or maybe three cases: no wildcards, one wildcard (i.e. cannot have a conflicting match so simply putting the wildcard item at the end as a catchall will guarantee consistency), and multiple wildcards, where the third case would not enforce that all pathspecs must have a match.

/cc @jamill @nulltoken

@jamill
Member

jamill commented Feb 27, 2013

I'd love to go back to the rationale that spawned this feature. I think the column "Ignore Unmatched Pathspec" is intertwining the cases of pathspecs with wildcards and pathspecs without wildcards. If you provide a list of 10 filenames to be staged and one doesn't match, then it seems reasonable to me that that could be an error, but as soon as you start injecting wildcard matches into the list, I think you are getting onto fairly shaky ground.

@arrbee - I agree with this paragraph. I think it is reasonable to treat wildcards differently than explicitly named files, especially if it is expensive. IIRC, the original case was that the consumer was attempting to stage / unstage a file, the call succeeded, but the file was not staged. At least in that case, we could detect that the caller wanted to act on a specific file (but there was no matching file).

@arrbee
Member

arrbee commented Feb 27, 2013

Hey @yorah - I just wanted to say that I certainly didn't mean my comment as any critique of your work. All the code you've been writing, etc., has been of fine quality! I just wanted to steer the conversation back to why we were heading in this direction. I hope it didn't come across too negatively!

@yorah
Contributor Author

yorah commented Feb 27, 2013

I hope it didn't come across too negatively!

@arrbee don't worry, this is actually the opposite! I didn't answer yet because as usual, your comments made me think about new aspects of the situation that I wasn't seeing before. I have the feeling that you are always 10 moves ahead, so I'm taking my time to try to say something that is not completely stupid 😉

Again, thanks a lot for your comments (on this PR and others), and for your time and patience!

@yorah
Contributor Author

yorah commented Feb 28, 2013

Thanks @jamill @arrbee for your comments. As you both said, the original requirement was to be able to distinguish specifically named files from wildcard pathspecs. If I understood correctly, this is exactly what you said:

Would you consider going back to the original requirements for raising an exception and distinguishing between the wildcards vs no wildcards cases?

The end result is that it only makes sense to raise an exception (and to notify of matched/unmatched pathspecs) when the client passes non-pathspecs to the Stage()/Unstage() methods. That said, it also seems reasonable to think that the client of the libgit2sharp API usually knows whether he's sending a wildcard pathspec or an explicitly named file.

Thus, my proposal is to say that when the user passes an OnUnmatchedPathspecs callback, or when he sets the ShouldFailOnUnmatchedPathspec property to true, the passed array of paths will be considered as explicitly named paths (by passing the GIT_DIFF_DISABLE_PATHSPEC_MATCH flag to the libgit2 diff function).

That way, notification and exception throwing will only be used with explicit file names.

Examples

If the index contains the following files: readme.txt, readmectxt

| passed pathspecs | should fail | unmatched callback set | matched files | matched pathspecs | throw exception? | notified pathspecs |
|---|---|---|---|---|---|---|
| "readme*txt", "readme.txt" | yes | yes | "readme.txt" | "readme.txt" | yes | "readme*txt" |
| "readme*txt", "readme.txt" | yes | no | "readme.txt" | "readme.txt" | yes | |
| "readme*txt", "readme.txt" | no | yes | "readme.txt" | "readme.txt" | no | "readme*txt" |
| "readme*txt", "readme.txt" | no | no | "readme.txt", "readmectxt" | "readme*txt" | no | |
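
The difference between the two modes in the readme.txt / readmectxt example can be sketched as below. This is a simplification under stated assumptions: literal mode is modeled as an exact string comparison, which may not be everything GIT_DIFF_DISABLE_PATHSPEC_MATCH does in libgit2, and `spec_matches` / `spec_matches_any` are hypothetical helpers built on POSIX `fnmatch()`.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* literal == true models explicit paths: the spec is compared as an exact
 * path instead of being treated as an fnmatch pattern. */
static bool spec_matches(const char *spec, const char *path, bool literal)
{
    assert(spec != NULL && path != NULL);
    return literal ? strcmp(spec, path) == 0
                   : fnmatch(spec, path, 0) == 0;
}

/* Returns true if at least one path matches the spec. */
static bool spec_matches_any(const char *spec, const char *const *paths,
                             size_t npaths, bool literal)
{
    size_t i;
    assert(paths != NULL);
    for (i = 0; i < npaths; i++)
        if (spec_matches(spec, paths[i], literal))
            return true;
    return false;
}
```

In literal mode "readme*txt" matches neither file and would be the pathspec reported as unmatched; in pattern mode it matches both readme.txt and readmectxt.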

So, what about this issue?

Well, if you (@jamill @nulltoken @arrbee) think the behavior described above makes sense, then we obviously don't need anything new in libgit2.

However, for the sake of having a "more consistent" behavior, maybe it would be a good idea to implement your proposal (adding a flag to the git_attr_fnmatch and so on, so that we notify correctly of "*" and "." matched pathspecs). Unless we consider that performance is of utmost importance in this code path, and we don't want to clutter it with a corner-case.

Please tell me if you want me to implement it or if you prefer to leave the existing code as is!

@nulltoken
Member

@yorah Thanks a lot for all this work. This sounds reasonable to me. I agree that explicit filepaths are much more important than pathspecs.

@jamill @haacked @dahlbyk From a consumer perspective, how do you feel about this?

@dahlbyk
Member

dahlbyk commented Mar 1, 2013

Late to this thread, but it strikes me as off to have pathspec behavior change completely based on superficially unrelated parameters.

I have no idea if this is feasible, but could the failure and/or unmatched callback handling first check if the offending pathspec matches any of the files already found? If so, consider it redundant and move along. From a trivial test, that seems to be how git add readme*txt readme.txt works.
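
One way to read this suggestion is sketched below, purely as an illustration: a pathspec that was never credited with a match is considered redundant if it would have matched one of the files that other pathspecs already found. `unmatched_spec_is_redundant` is a hypothetical helper and POSIX `fnmatch()` is an assumed matcher.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if spec matches any file in the already-found set, in which
 * case the caller could skip the failure / unmatched callback for it. */
static bool unmatched_spec_is_redundant(const char *spec,
                                        const char *const *found,
                                        size_t nfound)
{
    size_t i;
    assert(spec != NULL && found != NULL);
    for (i = 0; i < nfound; i++)
        if (fnmatch(spec, found[i], 0) == 0)
            return true;
    return false;
}
```

Note that this costs one extra pass over the found files for every unmatched pathspec.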

@jamill
Member

jamill commented Mar 1, 2013

From a trivial test, that seems to be how git add readme*txt readme.txt works.

To add another datapoint, git add * doesn't report any failures if there are no matches. Is @dahlbyk's suggestion feasible and not overly complex / expensive? If so, that would be great. Otherwise, I think keeping the feature less complex (and less expensive) while covering explicit filepaths also has its advantages.

@ethomson
Member

ethomson commented Mar 1, 2013

@jamill that's interesting - I think that's msysgit-specific behavior. On Unix - where your shell expands wildcards - your shell would give you an error if * didn't match anything:

zsh: no matches found: *

If it didn't, git add would complain because * would be empty:

% git add
Nothing specified, nothing added.
Maybe you wanted to say 'git add .'?

So, what happens if I try to stage a file with a * in the filename?

@ethomson
Member

ethomson commented Mar 1, 2013

% touch '*'
% git add \*
% git commit -m"asterisk\!"
% git ls-files --stage
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0      *

Whee!

@carlosmn
Member

carlosmn commented Mar 1, 2013

git add * wouldn't complain if your shell is configured to pass the glob to the program if it can't expand it. As Ed showed, zsh complains, but bash will simply pass it on, so you're in effect typing git add '*' which passes the glob to git directly. Then the glob is expanded by git to its list of files. Thus, that can't fail because git will only consider files it already knows about.

@arrbee
Member

arrbee commented Mar 1, 2013

Late to this thread, but it strikes me as off to have pathspec behavior change completely based on superficially unrelated parameters.

Maybe this is the point you are trying to make, but this sounds like a libgit2sharp design issue. Right now, the core library will tell you for each file in a diff the first pathspec that it found in the list that matched that file. The behavior is stable.

I have no idea if this is feasible, but could the failure and/or unmatched callback handling first check if the offending pathspec matches any of the files already found? If so, consider it redundant and move along. From a trivial test, that seems to be how git add readme*txt readme.txt works.

I'm open to exposing the pathspec checking facility as a utility API in libgit2 if it will help create the behavior you want in libgit2sharp.

If you want to iterate over the unmatched pathspec collection and for each item iterate over every file in the diff, you can do that. However, I still don't think that will give you the behavior you want...

Now that I think about it, this points out a fundamental flaw in the implementation so far. If you give a pathspec ["foo.c", "bar.c" ] and there are no diffs in "bar.c" but the file does exist, then the current implementation will generally not make a callback for "bar.c". I would think that git diff file_without_changes.c should not be an error, but that's what we would get at the moment.

Oh, maybe you are always using GIT_DIFF_INCLUDE_UNMODIFIED then I guess it works, but for many repos that's going to result in a much larger diff list object with a lot of stuff you don't care about. Does libgit2sharp always use that flag?

@arrbee
Member

arrbee commented Mar 1, 2013

At a higher level, I don't believe libgit2 should be spending time replicating shell functionality to the extent that we can avoid it. For one thing, different shells behave differently, and for another, it muddies the separation of concerns.

For example, the diff APIs don't take strings to specify the trees to diff (with an implicit rev-parse), they take the tree objects. If you want to parse specs and pass them in, we have an explicit function for that. Just so, I don't think diff should try to recreate various shell error conditions for pathspec expansion.

I started to write more about ideas that I have, but I think there is a lot for you folks to discuss still about the behavior you want to achieve. For example, if the caller tries to diff with "*.baz" and yet "file.baz" is either untracked or ignored, is that an error? A shell match will find those files (and hence no error even if you are in a shell that does error for no matches - which mine does not) but there will be no entry in the diff list (again, unless you are always using INCLUDE_UNTRACKED, INCLUDE_UNMODIFIED, and INCLUDE_IGNORED, which feels like a pretty expensive choice to make on your user's behalf to me).

@yorah
Contributor Author

yorah commented Mar 1, 2013

Does libgit2sharp always use that flag?

The implementation proposed in libgit2/libgit2sharp#343 is two-fold:

  • we always include untracked and ignored files (here)
  • we only include unmodified files if the user wants to be notified on unmatched pathspecs, or if he wants an exception to be thrown for an unmatched one (here)

It means that when we don't notify/throw, we "only" have the overhead of untracked/ignored files.

Not sure yet about @dahlbyk's proposal. I thought earlier of a case where it does not work, but I didn't write it down, and can't find it now... Will try to find it again.

@dahlbyk maybe the ShouldFail property should be renamed to something more meaningful and more coherent with what it does (like ExplicitFilePath, even though I still don't like this name).

@arrbee
Member

arrbee commented Mar 1, 2013

I just had an interesting idea that might solve this problem...

When I was working on the file similarity metric, I ended up writing a pluggable similarity metric API so callers could experiment with alternative ways of comparing files (because it's an interesting problem). What do you think about implementing a pluggable file filtering algorithm for diff (and eventually for status, etc). It would look something like:

typedef struct {
    /* process options and create a filter object */
    int (*filter_setup)(void **filter, const void *options_struct);
    /* release the filter object */
    int (*filter_free)(void *filter);
    /* given a filter object, suggest start and end paths for iteration */
    int (*filter_suggest_bounds)(char **iterstart, char **iterend, void *filter, void *payload);
    /* given a filter and file info, return 0 if matches, > 0 if no match, < 0 error/stop */
    int (*filter_match)(void *filter, const git_index_entry *file, void *payload);
    void *payload;
} git_diff_filter;

typedef struct {
    ...
    git_strarray pathspec;
    git_diff_filter *filter; /* if NULL, internal fnmatch filter will be used */
} git_diff_options;

Now, if you pass the pluggable filter as NULL, we will just use the internal implementation which will look at the flags and the pathspec from the opts and do much what we have today. But if you want to write a pluggable filter for libgit2sharp that exhaustively checks the pathspec, have at it! You can probably use Windows-native PathMatchSpec or something like that.

What I like about this is that you can take it further, if you want, and implement filtering just for small files or just for files with executable bits set or whatever rule you want to narrow your diff to a particular set of files.
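
A minimal sketch of what such custom filters might look like, under several assumptions: `file_entry` is a reduced stand-in for git_index_entry, `diff_filter` keeps only the match hook of the proposed struct, and a positive return value is taken to mean "no match". None of these names exist in libgit2.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for git_index_entry, reduced to what this sketch needs. */
typedef struct {
    const char *path;
    unsigned int mode;       /* e.g. 0100755 for an executable file */
    unsigned long file_size;
} file_entry;

/* Reduced version of the proposed filter struct: only the match hook. */
typedef struct {
    int (*filter_match)(void *filter, const file_entry *file, void *payload);
    void *payload;
} diff_filter;

/* Matches files no larger than the limit that payload points to.
 * Returns 0 on match, 1 on no match, -1 on error. */
static int small_file_match(void *filter, const file_entry *file, void *payload)
{
    const unsigned long *limit = payload;
    (void)filter;
    if (!file || !limit)
        return -1;
    return file->file_size <= *limit ? 0 : 1;
}

/* Matches files with any executable bit set. */
static int executable_match(void *filter, const file_entry *file, void *payload)
{
    (void)filter;
    (void)payload;
    if (!file)
        return -1;
    return (file->mode & 0111) ? 0 : 1;
}

/* Counts the entries a filter accepts; a negative return is an error. */
static int count_matches(const diff_filter *f, const file_entry *files, size_t n)
{
    int count = 0;
    size_t i;
    assert(f != NULL && f->filter_match != NULL);
    for (i = 0; i < n; i++) {
        int rc = f->filter_match(NULL, &files[i], f->payload);
        if (rc < 0)
            return rc;
        if (rc == 0)
            count++;
    }
    return count;
}
```

Both filters share one calling convention, which is what would let a diff driver swap them without knowing what rule they apply.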

This would not supplant the notifications that already exist because those are a post-match operation that lets you incrementally monitor what files are going into a diff list.

By the way, the reason that I wrote the API to take the options_struct as a const void * was so that we could pass either the git_diff_options or the git_status_options or what have you. That's probably a poor choice, but the filter needs to know the diff and/or status flags to do the right thing. The correct thing to do is probably to isolate filter flags that allow "DISABLE_PATHSPEC_MATCH" and "PATHSPEC_ICASE" and use a narrower API there. But you get the idea...

@yorah
Contributor Author

yorah commented Mar 4, 2013

@arrbee Sorry for not answering before, I actually got caught up playing with the pluggable file filtering idea that you had (and looking at the pluggable similarity API that you did)!

There are still some things in your proposals that I don't understand clearly (iterstart and iterend, and how filter_suggest_bounds is expected to be used => I guess it has to do with how iterators are built, where we currently pass the common prefix, but I didn't look at it in detail yet).

You can probably use Windows-native PathMatchSpec or something like that.

I don't like this part, as libgit2sharp is supposed to be cross-platform. There is no out-of-the-box fnmatch equivalent on .NET. If you're still OK with that, exposing the pathspec checking facility of libgit2 (fnmatch.p_fnmatch()?) would likely help.

Anyway, I will get a push ready with what I have in the next few days.

@arrbee
Member

arrbee commented Mar 4, 2013

Sorry, that probably could have used some more explanation...

The internal iterators take a start of range and end of range string prefix so that they don't have to iterate over the entire hierarchy when you are using a narrow pathspec. For example, if you give a pathspec of a*, the current implementation is actually quite efficient about not scanning much data that doesn't have an 'a' prefix. Even more, giving a path spec like path/to/file/* will scan quite narrowly. To do this, I rely on preprocessing the pathspec data to get a suggested "start" prefix string and "end" prefix string. This could stay internal to the library, assuming that pathspecs will remain described via an fnmatch strarray, but in case we wanted to really plug in a completely different way of specifying a pathspec, I thought it could be in the pluggable API. Maybe no need.
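
A rough sketch of the prefix idea for a single pattern, assuming the prefix is simply everything before the first wildcard character (escapes are ignored, and the real preprocessing of a whole pathspec list into start and end strings is not reproduced). `literal_prefix` and `path_in_range` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies the literal part of pattern (everything before the first '*', '?'
 * or '[') into out. Returns the prefix length, or (size_t)-1 if out is too
 * small to hold it. */
static size_t literal_prefix(const char *pattern, char *out, size_t outsize)
{
    size_t len;
    assert(pattern != NULL && out != NULL);
    len = strcspn(pattern, "*?[");
    if (len >= outsize)
        return (size_t)-1;
    memcpy(out, pattern, len);
    out[len] = '\0';
    return len;
}

/* A prefix-bounded iterator only needs to visit paths that start with the
 * prefix; everything else can be skipped without running the matcher. */
static bool path_in_range(const char *path, const char *prefix)
{
    assert(path != NULL && prefix != NULL);
    return strncmp(path, prefix, strlen(prefix)) == 0;
}
```

For "a*" the prefix is "a", and for "path/to/file/*" it is "path/to/file/", so the longer the literal prefix, the narrower the scan.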

Regarding exposing fnmatch, I suppose we could do so. It is a slippery slope, I guess, from exposing fnmatch to exposing the current pathspec internal API (where there is a pre-match spec parsing phase separate from the actual match operation). At some point, you're just "plugging in" for the purpose of creating an Observer wrapper to the process, at which point we may as well expose the default plugin implementation and support that behavior directly without necessarily exposing the component APIs.

Did that last paragraph make sense? I may need more caffeine. I'm worried that I didn't state things clearly...

If you like this direction, let me know. I'd be happy to take a stab at encapsulating the current behavior in such a pluggable API, if you like, and then you could take it and see if it extended naturally to cover the problem you want to solve (or you could just write the whole thing, but I don't want you to feel like you have to do that all by yourself if you don't want to / don't have time).

@yorah
Contributor Author

yorah commented Mar 5, 2013

Here is a first spike of the plugin file filtering thingie. Basically, the plugin infrastructure is there, and the existing behaviour has been encapsulated into it.

All existing tests seem to pass. To be honest, this is mainly due to sheer luck.

This is far from being finished:

  • it misses a lot of tests (for the moment, I focused on adding the plugin code while keeping the existing code paths in working condition)
  • I think it must leak from a lot of places (hopefully Travis/Valgrind will help me on that one ;)
  • I have a problem with my naming (in some places, I end up doing filter->filter->..., which really doesn't feel right)
  • I didn't investigate yet how to make it work nicely with status
  • I still keep moving things around

Well, to sum it up, this is really just a spike, to keep you updated about my progress (and so you can tell me if I'm going in completely the wrong direction ;). If you don't have time to take a peek, this is also all right, I will push an update in 1-2 days.

@yorah
Contributor Author

yorah commented Mar 5, 2013

Mmm, I also added 2 failing tests related to passing "." as pathspecs. I'm not sure yet what to do with them, and if we should do anything at all.

@yorah
Contributor Author

yorah commented Mar 5, 2013

Regarding exposing fnmatch, I suppose we could do so. It is a slippery slope

Agreed. I will spark some more discussion on the libgit2sharp issues to see what can be considered a "stable" behaviour, and if we can find a way so that we don't need it.

Let's forget about that for now..

@yorah
Contributor Author

yorah commented Mar 8, 2013

I think I'm ready for a first review, whenever someone has time!

Since last comments: no more leaks, works with status/checkout (no tests on that yet), a few tests showing how to implement a custom filter.
The default filter (relying on fnmatch) is also implemented as a plugin, so all the existing tests relying on diff actually leverage what I did.

@yorah
Contributor Author

yorah commented Apr 9, 2013

Ok, it's time for some update on this PR.

Since the merge of libgit2/libgit2sharp#343, we won't need to report all matched pathspecs after all (because we only care about reporting unmatched explicit paths). It means that for libgit2sharp at least, we don't need the pluggable filter mechanism.

What does it mean for this PR:

  • I believe the first commit (94c7afb) fixes an inconsistent behavior, and thus is still interesting
  • For the second commit (8cc723c), it is more about a strange behavior that I discovered and for which I wrote a failing test. I don't know how to fix it though. I can open up a separate issue to keep track of it if you want.
  • For the last commit (7bb20ce), which is the one about the pluggable filter, I can either remove it (if you think nobody will use it), or rebase it on top of the latest changes.

@vmg
Member

vmg commented Apr 10, 2013

@yorah: I'm still undecided on the file filtering API. @arrbee: do you think this will see any use at all?

Regardless, can you rebase this PR? It's not merging cleanly anymore.

@vmg
Member

vmg commented Apr 10, 2013

That was fast, but there are a couple warnings and a phat segfault. :)

@yorah
Contributor Author

yorah commented Apr 10, 2013

That was fast, but there are a couple warnings and a phat segfault. :)

Let's see if it's better now :)

Edit: Ack, still some warnings...

@yorah
Contributor Author

yorah commented Apr 10, 2013

Much better now.

@vmg
Member

vmg commented Apr 10, 2013

Looking great. @arrbee: pluggable filter, yay or nay?

@arrbee
Member

arrbee commented Apr 10, 2013

I agree with @vmg that this is looking great, @yorah!

I lean towards dropping a29aa9f (i.e. the filtering plugin) if we don't know of a use for it. It seems like added complexity with unclear immediate benefit.

Regarding the failing tests, I'm wondering if the pathspec matcher should special case "." and know to skip over "./" before initiating a match. However, let's track that as a separate issue and get this PR merged!

@vmg
Member

vmg commented Apr 10, 2013

Yeah, I'm having a hard time picturing practical usage for the filtering API. Let's drop that last commit then and I'll merge the PR.

@yorah
Contributor Author

yorah commented Apr 10, 2013

Cool, I will drop the two last commits tomorrow then, and open up a separate issue for the failing test!

I also moved all tests related to notifying into their own file.

@yorah
Contributor Author

yorah commented Apr 11, 2013

Removed last 2 commits, and rebased on top of vNext 😃

vmg pushed a commit that referenced this pull request Apr 11, 2013
Correctly return matched pathspec when passing "*" or "."
@vmg vmg merged commit acd4077 into libgit2:development Apr 11, 2013
@vmg
Member

vmg commented Apr 11, 2013

Thank you! Looking great!

@yorah yorah deleted the fix/pathspecs_behaviour branch April 11, 2013 12:48
phatblat pushed a commit to phatblat/libgit2 that referenced this pull request Sep 13, 2014
Correctly return matched pathspec when passing "*" or "."

8 participants