Clarify behavior of index/workdir operations on case insensitive platforms #1689

Closed
wants to merge 3 commits into from

Conversation

nulltoken
Member

This LibGit2Sharp test doesn't pass any longer when being run against f2c4188.

The failing test has been stripped down to put this issue under the bright light of Clar.

@nulltoken
Member Author

/cc @arrbee

@nulltoken
Member Author

Of course, in order to see the failure, this has to be run on a Windows box ;-)

@arrbee
Member

arrbee commented Jun 29, 2013

I haven't had a chance to run this yet, but I think that:

+  cl_git_pass(git_status_file(&status, repo, "NEW_FILE"));
+  cl_assert_equal_i(GIT_STATUS_INDEX_NEW, status);

at the end of the new test should actually fail. Based on some other failing tests that @ethomson wrote in 1540b19 we intentionally changed the behavior of tree-to-index comparisons so that they would always be case sensitive. If the tree contains "new_file" and the index contains "NEW_FILE" then those should be treated the same even on a case insensitive platform - only the filesystem is case-insensitive, not the actual data in the index.

However, you point out an interesting dilemma. Although tree-to-index comparisons should be done case sensitively, pathspecs should probably be applied case insensitively for filtering purposes on platforms where the filesystem itself is case insensitive. This suggests that the fix for @ethomson's use case that I wrote in eefef64 is probably the wrong approach. I think on a case-insensitive filesystem, the traversal of the tree and index still need to be done with case-insensitive ordering (or actually semi-case-insensitive - "FILE" should still sort before "file"), but using case-sensitive filename comparison to decide what the true deltas are.

Ugh, that is a mess because the case-insensitive tree and index iterators currently collapse a sequence of case-mismatched entries into a single, albeit conflicted, entry. That behavior is actually important for checkout, where these multiple files are going to get mapped onto one another.

Hmm. contemplation commences

For the time being, I think applying pathspecs case-sensitively seems like the least bad thing to break of the many pieces that can be broken here. I'll need to think more about how to get all scenarios unbroken.

@nulltoken
Member Author

@arrbee Thanks for your answer. That indeed makes sense.

Playing a bit further with this, it looks like git.git isn't that case insensitive on Windows.

git-status

$ mkdir case_test && cd case_test

$ git init .
Initialized empty Git repository in C:/Users/ntk/AppData/Local/Temp/case_test/.git/

$ touch lowercase.txt

$ git status
# On branch master
#
# Initial commit
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       lowercase.txt
nothing added to commit but untracked files present (use "git add" to track)

$ git status lowercase.txt
# On branch master
#
# Initial commit
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       lowercase.txt
nothing added to commit but untracked files present (use "git add" to track)

$ git status LOWERCASE.TXT
# On branch master
#
# Initial commit
#
nothing to commit (create/copy files and use "git add" to track)

git-add

$ git add LOWERCASE.txt
fatal: pathspec 'LOWERCASE.txt' did not match any files

$ git add L*
fatal: pathspec 'L*' did not match any files

$ git add l*

$ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   lowercase.txt
#

@nulltoken
Member Author

Closing this, as there's no need to keep the issue open any longer.

1ec680b may be worth cherry-picking, though.

@nulltoken nulltoken closed this Jul 2, 2013
@yorah
Contributor

yorah commented Jul 2, 2013

I haven't had a chance to run this yet, but I think that:

cl_git_pass(git_status_file(&status, repo, "NEW_FILE"));
cl_assert_equal_i(GIT_STATUS_INDEX_NEW, status);

at the end of the new test should actually fail

I think I'm a bit lost. What would be the expected status then? We currently get GIT_STATUS_CURRENT, but I don't see what it actually means.

@nulltoken
Member Author

Actually reopening this issue and giving it a more meaningful name

@nulltoken nulltoken reopened this Jul 2, 2013
@nulltoken
Member Author

I'm starting to wonder if we shouldn't try and mimic git.git... Indeed, it might be a bit confusing to expose a different behavior depending on whether the file actually exists in the index or not.

Indeed, @yorah's right. GIT_STATUS_CURRENT is returned when querying the status of file NEW_FILE (when new_file exists in the index). And I can't find a valid explanation for this behavior.

  • If we were case sensitive, we should return GIT_ENOTFOUND
  • If we were case insensitive, we should return GIT_STATUS_INDEX_NEW

Besides this, I think I've found a bug: when new_file exists in the workdir and not in the index, calling git_index_add_bypath() with NEW_FILE as the path actually creates an entry in the index with the name in uppercase.

@yorah
Contributor

yorah commented Jul 2, 2013

If the tree contains "new_file" and the index contains "NEW_FILE" then those should be treated the same even on a case insensitive platform

Sorry to spam, but after stepping through the code, did you mean "... then those should not be treated the same ..." ?
Or am I just lost beyond any hope? (Edit: please don't answer this one ;)

@arrbee
Member

arrbee commented Jul 2, 2013

Okay. This is a mess. Thanks for teasing out all the subtle points! Let's talk about some of the cases:

git_index_add_bypath("FILE.TXT") when "file.txt" exists on disk

Here the behavior depends on whether file.txt has already been added to the index or not. In 6ea999b I made it so that git_index_add_bypath will preserve the capitalization of the existing entry in the index if there is one. So, let's not worry about that case.

If there is no existing entry, there is no API that I know of to read the "true" capitalization of a file from disk without scanning the whole directory. Well, maybe realpath() but that has its own set of problems like making it impossible to add symlinks. On a case-insensitive filesystem, I'm not even sure if there is a "true" capitalization. On a case-insensitive-but-case-preserving filesystem, there definitely is one, but I think we'd have to make git_index_add_bypath do pattern matching to get it. It's even worse because to get the true caps of every path component, you have to do that scan in every directory up the tree (i.e. you give "PATH/TO/FILE.TXT" and the "true" caps is "Path/to/FILE.txt" - discovering that is a lot of directory scanning).
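To illustrate why recovering the on-disk capitalization means scanning, here is a minimal sketch, not libgit2 code. It assumes the caller has already listed the entries of one directory (for example with a readdir-style scan) and looks for the entry that equals the requested name when case is ignored. The names `eq_icase` and `find_true_case` are hypothetical, and the comparison is ASCII-only. For a multi-component path such as "PATH/TO/FILE.TXT", the same lookup would have to be repeated in each directory along the way.

```c
#include <ctype.h>
#include <stddef.h>

/* ASCII-only, case-insensitive equality (hypothetical helper). */
int eq_icase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/*
 * Return the entry from one directory listing whose name equals `name`
 * ignoring case, or NULL if none does. The first match wins.
 */
const char *find_true_case(const char *const *entries, size_t n,
                           const char *name)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (eq_icase(entries[i], name))
            return entries[i];
    }
    return NULL;
}
```

The cost grows with the size of each directory scanned, which is the overhead the comment above is reluctant to put on every git_index_add_bypath() call.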

Core Git always runs pattern matching. You, too, could call git_index_add_all() with a pathspec of "FILE.TXT" and I would expect it to do the right thing because it will actually read the working directory entries and find one that matches that pattern. That is essentially what we do with git_status_file() to get around this same situation, but I don't think we want to impose that overhead on git_index_add_bypath().

git_status_file("FILE.TXT") returns GIT_STATUS_CURRENT when "file.txt" is used

This is an interesting artifact of the changes that make the tree-to-index diffs case sensitive, but leave the index-to-workdir diffs as case insensitive, interacting with the way that status decides what to return. In this case, we end up with a git_diff_delta record for the index-to-workdir comparison that shows GIT_DELTA_UNMODIFIED (because "file.txt" in the index was compared to "file.txt" in the working directory and they both matched the pattern "FILE.TXT" case insensitively), and we get no git_diff_delta at all for the tree-to-index comparison because no entry matches "FILE.TXT" case sensitively.

In order to get GIT_STATUS_INDEX_NEW, we would need a GIT_DELTA_ADDED record from the tree-to-index comparison, which we don't get. In order to get GIT_ENOTFOUND we would need no record from either comparison, which we don't get.
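The interplay described here can be modeled with a toy function. This is only a sketch under stated assumptions, not how libgit2 computes status: the `toy_` names are hypothetical, each area holds at most one entry, contents are assumed identical wherever the file exists, and only the added/unmodified outcomes are modeled. The pathspec is applied case-sensitively on the tree-to-index side and case-insensitively (ASCII only) on the index-to-workdir side.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not libgit2's own values. */
typedef enum { TOY_NOT_FOUND, TOY_CURRENT, TOY_INDEX_NEW } toy_status;

/* ASCII-only, case-insensitive equality (hypothetical helper). */
int eq_icase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* head/index/workdir: the entry name in each area, or NULL if absent. */
toy_status toy_status_file(const char *head, const char *index,
                           const char *workdir, const char *pattern)
{
    /* Tree-to-index side: the pathspec is applied case-sensitively. */
    int head_hit = head && strcmp(head, pattern) == 0;
    int index_hit = index && strcmp(index, pattern) == 0;

    /* Index-to-workdir side: the pathspec is applied case-insensitively. */
    int idx_wd_hit = (index && eq_icase(index, pattern)) ||
                     (workdir && eq_icase(workdir, pattern));

    /* An "added" record from tree-to-index means the file is new in the index. */
    if (index_hit && !head_hit)
        return TOY_INDEX_NEW;

    /* Only an "unmodified" index-to-workdir record: reported as current. */
    if (idx_wd_hit)
        return TOY_CURRENT;

    /* No record from either comparison. */
    return TOY_NOT_FOUND;
}
```

With "new_file" in the index and working directory and nothing in HEAD, the pattern "new_file" yields `TOY_INDEX_NEW`, while "NEW_FILE" falls through to `TOY_CURRENT` because only the case-insensitive side sees a match.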

Does that make it clear why this behavior is happening?

Obviously this pattern of behavior is confusing. I suspect that knowing this odd interplay you may be able to come up with some combination of paths and patterns that gives a more seriously incorrect result.

So, let's think for a second about what we want to get as results. I started to make a table, but it gets pretty big. If you consider that HEAD, Index, and Working Directory can each have either "A", "a", or no entry and you can try matching against "A" or "a", then the table is large. Some interesting examples though...

  • HEAD: A, Index: A, WD: a, Pattern: A

    Status matches A->A in head-to-index, and A->a in index-to-workdir detecting diffs as needed

    Add by path matches Pattern 'A' to 'a' in workdir but preserves 'A' capitalization found in index
  • HEAD: A, Index: A, WD: a, Pattern: a

    Status doesn't match pattern in head-to-index, and does match A->a in index-to-workdir detecting diffs as needed - this is the GIT_STATUS_CURRENT case from above

    Add by path matches Pattern 'a' to 'a' in workdir but preserves 'A' capitalization found in index
  • HEAD: A, Index: a, WD: a, Pattern: a

    Status does not match between HEAD and index, so a is considered ADDED in head-to-index; if there was no pattern, this could be detected as an 'A->a' rename, however

    Add by path matches a to workdir which already matches capitalization in index.

Anyhow, there are so many variants. I don't want to go on too long. We want to be able to detect case-changing renames, but we also want pathspecs to match correctly across the three areas. I'm open to suggestions about what the correct behaviors are in the various cases and then we can try to plan out how to get those behaviors. I'd prefer to understand all the cases well, first, however, because spot fixing issues tends to make others arise.

@arrbee
Member

arrbee commented Jul 5, 2013

I think I have come up with reasonable short-term and long-term solutions for this issue - not that I've coded the solution, but I have a plan. Let me write up my thoughts here and see if you think that it will work. I'd really appreciate some feedback!

Short Term

  • Because git_status_file actually considers it an error if there is more than one file specified by the path, for this API we can perform the HEAD-to-index comparison case-insensitively (on a case insensitive filesystem, that is) and then verify after the diff that the actual case of the HEAD and index items actually match one another. This will fix the situation where a pathspec that does not match the filename case is used and matches the working directory but not the actual HEAD and index filenames.
  • Alternatively, we could turn off case-insensitive pathspec matching completely, even for working directory entries. This will make our behavior more like core Git where f* will not match FILE.TXT even on a filesystem that is case insensitive. This would probably break a bunch of existing tests and, to me, is not the most intuitive behavior, but at least it would be explainable.
  • We can make git_index_add_bypath call p_realpath on the file to be added (assuming that tests show that would yield the "true" capitalization of the file on disk) so long as p_lstat says that it is not a symlink. If the resulting path is no longer prefixed by the working directory of the repo or no longer matches the original path when compared case-insensitively, then we would ignore the p_realpath version and just use the uncorrected original, but in the vast majority of cases this could allow us to match the on-disk capitalization for a newly added file.
    • We would not change the behavior of preserving the case of an existing index entry for a file that is already present in the index.
    • If we add this, we should fix git_index_update_all() and git_index_remove_all() to bypass the p_realpath call, since they invoke git_index_add_bypath and would not want to do this.
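The validation step in the p_realpath proposal could look roughly like the following. This is a sketch of the proposed check only, not existing libgit2 code: `pick_index_path` and `eq_icase` are hypothetical names, `workdir` is assumed to be an already canonical absolute path ending in '/', `resolved` stands in for whatever p_realpath would return, and the comparison is ASCII-only.

```c
#include <ctype.h>
#include <string.h>

/* ASCII-only, case-insensitive equality (hypothetical helper). */
int eq_icase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/*
 * Decide which relative path to record in the index.
 *   workdir   - canonical working directory, ending in '/' (assumption)
 *   requested - workdir-relative path the caller asked to add
 *   resolved  - absolute path a realpath-style call produced
 * Returns the corrected relative path, or `requested` if the resolved
 * path cannot be trusted as a pure capitalization fix.
 */
const char *pick_index_path(const char *workdir, const char *requested,
                            const char *resolved)
{
    size_t wlen = strlen(workdir);
    const char *rel;

    /* Resolved path left the working directory: keep the original. */
    if (strncmp(resolved, workdir, wlen) != 0)
        return requested;

    /* Anything beyond a change of case means it is a different path. */
    rel = resolved + wlen;
    if (!eq_icase(rel, requested))
        return requested;

    return rel;
}
```

The symlink check (p_lstat in the proposal) would happen before this function is reached; it is left out of the sketch.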

Long Term

  • First off, I'm assuming that in the long term we want pathspecs to match (or at least have the option to match) case insensitively on a case insensitive filesystem. If we don't want that, then the following is irrelevant. It sounds like we could have more discussion about this, but I'd love to talk more with the Windows and MacOS based users of the library to determine what their expectations are.
  • Let's break general case-insensitivity into file ordering vs filename comparison. For tree-to-index diffs, we should always be doing case-sensitive filename comparison because both of those places are always case sensitive. For index/tree-to-workdir filename comparison, case-sensitivity should match the platform / file-system. For file ordering, all diffs should use either case-sensitive order or case-insensitive ordering that matches the platform filesystem. This would mean that pathspecs can match a range of files correctly.
  • To clarify, "case-insensitive ordering" actually means case-insensitive but stable. It would mean case-insensitive and then case-sensitive within the case-insensitive equivalents (i.e. fully consistent ordering "A, a, B, b, C, c" not "a, A, B, b, c, C"). Frankly, I think we do this already for case-insensitive sorting just so we can be consistent.
  • We already have flags to sort the results of status either case sensitively or case insensitively after the status has been calculated. We can offer equivalent flags for diff if the user wants that level of control.
  • The other related long-term project is to rewrite the index so that it can be simultaneously accessed case sensitively and case insensitively, where case insensitive traversal and file lookups are handled by a secondary index. This is critical for thread safety because some operations require case sensitive traversal and some require case insensitive. If we do this, we could potentially make the index contents update somewhat transactionally, so that the index iterator could operate on a snapshot of the index content and not mess up if another thread attempted to modify the index while the iteration was occurring.
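The "case-insensitive but stable" ordering from the list above can be sketched as a two-level comparator: compare while ignoring case first, and fall back to a case-sensitive comparison only to break ties. This is an illustration with hypothetical function names and ASCII-only folding, not the comparator libgit2 uses.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* ASCII-only, case-insensitive comparison (hypothetical helper). */
int cmp_icase(const char *a, const char *b)
{
    for (; *a && *b; a++, b++) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca - cb;
    }
    /* At least one string has ended; the shorter one sorts first. */
    return (unsigned char)*a - (unsigned char)*b;
}

/* Case-insensitive first, case-sensitive as the tie-breaker. */
int cmp_icase_stable(const char *a, const char *b)
{
    int c = cmp_icase(a, b);
    return c ? c : strcmp(a, b);
}

/* qsort adapter: the array elements are `const char *`. */
static int cmp_qsort(const void *pa, const void *pb)
{
    return cmp_icase_stable(*(const char *const *)pa,
                            *(const char *const *)pb);
}

/* Sort an array of paths in place using the stable ordering. */
void sort_paths(const char **paths, size_t n)
{
    qsort(paths, n, sizeof(*paths), cmp_qsort);
}
```

Because the tie-breaker is strcmp, uppercase ASCII letters sort before their lowercase equivalents, so "FILE" comes before "file" and a mixed list comes out as "A, a, B, b, C, c".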

I'm pretty pleased about the idea of distinguishing iteration order from filename comparison and how it can resolve this situation. The basics of doing so are already present in the iterator code, so I don't think it will be too disruptive to add this capability and have a more elegant solution for this problem. And the short term fixes are something that I can probably knock out over the next week if we think they are worth addressing.

Please let me know what you think!

@yorah
Contributor

yorah commented Jul 18, 2013

Sorry for letting this thread sleep, especially after the time you took to write your explanation and proposals.

Alternatively, we could turn off case-insensitive pathspec matching completely, even for working directory entries. This will make our behavior more like core Git where f* will not match FILE.TXT even on a filesystem that is case insensitive. This would probably break a bunch of existing tests and, to me, is not the most intuitive behavior, but at least it would be explainable.

I like the predictability of that solution. And from what I understand, it would simplify things a lot?

@nulltoken
Member Author

@arrbee I like the idea of providing a "better" service than git.git. However, I think that in this particular case I'd indeed prefer to rely on a case sensitive pattern matching behavior. As we're going to store tree entries in a case insensitive container, I'd prefer to be picky regarding what we accept.

I have the feeling that dealing with the platform preferred case handling in one tree, whereas we're case insensitive in the two others may turn into a gigantic magnet for corner-case issues.

Maybe I'm not seeing the bigger picture, but I can't think of real-life use cases where we would need a case insensitive pattern matching behavior.

@ben
Member

ben commented Oct 1, 2013

Bump.

My 2¢: it's not unreasonable for libgit2 to require the correct case when given a path. The internal representation is case-sensitive, we shouldn't hide that. Also, I don't want to think about İ.txt vs. i.txt vs. I.txt. We don't have a full Unicode-compliant capitalization engine, let's not pretend we do.

But I'm no expert, and I certainly haven't thought about this as much as @arrbee has.

@ethomson
Member

It's been 1.5 years since anybody last thought about this and we've had a lot of case insensitive changes since then to better match git.git.

@nulltoken is there still some unit test in LibGit2Sharp that is failing? Or some other behavior that we think may be incorrect?

@nulltoken
Member Author

@ethomson
Member

Looking at the totality of this, and the issues (here and in LibGit2Sharp) that are now closed, I think that these are mostly resolved. If there are other, small, specific issues that I've missed then we should raise those separately.

@ethomson ethomson closed this Dec 30, 2016
@ethomson ethomson deleted the bug/status_case branch January 9, 2019 10:22