-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
bpo-36305: Fixes to path handling and parsing in pathlib #12361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Lib/pathlib.py
Outdated
@@ -171,7 +171,7 @@ def splitroot(self, part, sep=sep): | |||
else: | |||
return part[:index2], sep, part[index2+1:] | |||
drv = root = '' | |||
if second == ':' and first in self.drive_letters: | |||
if second == ':' and first in self.drive_letters and third == sep: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The is not the solution. "C:" is a drive-relative path. Of course it has a drive. We need to fix absolute
to properly support drive-relative paths. See the FIXME
comment in the code. It's not necessarily as simple as calling ntpath._getfullpatname
, as suggested in the comment, unless we give up on preserving ".." components in Windows. (That wouldn't be a problem really since preserving ".." components serves no point in Windows, unlike POSIX.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I will modify the PR.
This is not a bug. The resolved value of "c:" depends on its working directory, so joining it with "a" has to remain as "c:a". If we call |
Lib/test/test_pathlib.py
Outdated
# NTFS file-stream paths | ||
check(['c:'], ('', '', ['c:'])) | ||
check(['c:a'], ('', '', ['c:a'])) | ||
check(['a', 'Z:b', 'c'], ('', '', ['a', 'Z:b', 'c'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that "c:" is not a file stream path, and neither is "c:a". The system interprets "[A-Za-z]:" as a drive. If we want to create a file stream named "a" in a file named "c" in the working directory, we have to use a qualified path such as "./c:a"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which by the way is a separate bug in pathlib, if you want to look into it. For example, it's wrong for it to remove the explicit reference to the working directory in the following:
>>> p = Path('./c:a')
>>> str(p)
'c:a'
We gave it the path of a file stream, but it mistakenly handles it as a drive-relative path when getting the path as a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and yet,
>>> p = Path('c:a')
>>> p
WindowsPath('c:a')
>>> Path('.') / p
WindowsPath('c:a')
>>> Path('./c:a')
WindowsPath('c:a')
I don't think it's reasonable for Path('.') / Path('c:a')
to give a different result than Path('./c:a')
. I'm pretty sure I've seen it documented somewhere (although not in the official docs) that using Path('a/b/c')
is a portable and shorter way of writing Path('a')/Path('b')/Path('c')
, and the proposed behaviour violates that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does make sense, and does seem to have been the assumption in pathlib (seeing the implementation for parse_parts
).
But this is a wrong assumption, as sometimes a path part can be interpreted in more than one way (for example, c:a
that can either be a drive-relative path or a file with a stream). That means that separating a path to its parts may cause a loss of information about the original path and lead to incorrect results.
Assuming that this assumption can't be true, I think that the current behavior -
WindowsPath('a') / WindowsPath('./c:b')
returning WindowsPath('c:b')
is surely worse than
WindowsPath('a') / WindowsPath('./c:b')
returning WindowsPath('a/c:b')
(which would be the case if my fixes are applied)
Part of this was fixed with bpo-34775, causing the merge conflict. |
Pathlib's internal representation of paths is more complex than a single string. This allows operations like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once #101667 lands, you should only need to change _format_parsed_parts()
to fix the original bug.
check(['a', 'Z:b', 'c'], ('Z:', '', ['Z:', 'a', 'b', 'c'])) | ||
check(['Y:a', 'Z:b', 'c'], ('Z:', '', ['Z:', 'a', 'b', 'c'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right - if we switch drives mid-way through we should discard everything before. This is what os.path.join()
does.
@@ -746,7 +756,9 @@ class PureWindowsPathTest(_BasePurePathTest, unittest.TestCase): | |||
|
|||
equivalences = _BasePurePathTest.equivalences.copy() | |||
equivalences.update({ | |||
'c:a': [ ('c:', 'a'), ('c:', 'a/'), ('/', 'c:', 'a') ], | |||
'./a:b': [ ('./a:b',) ], | |||
'a:b:c': [ ('./b:c', 'a:'), ('b:', 'a:b:c') ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above, I don't see why ('./b:c', 'a:')
shouldn't resolve to a:
.
The bugs
This PR fixes three bugs with path parsing in
pathlib
.The following examples show these bugs (when the cwd is
C:\d
):WindowsPath('C:a').absolute()
should returnWindowsPath('C:\\d\\a')
but returnsWindowsPath('C:a')
.This is caused by flawed logic in the
parse_parts
method of the_Flavour
class.WindowsPath('./b:a').absolute()
should returnWindowsPath('C:\\d\\b:a')
but returnsWindowsPath('b:a')
.This is caused by the limited interface of
parse_parts
, and affects thePath.absolute
,Path.expanduser
andPath.__rtruediv__
methods.WindowsPath('./b:a').resolve()
should returnWindowsPath('C:\\d\\b:a')
but returnsWindowsPath('b:a')
.This is caused by missing logic in the
resolve
method and inPath.__str__
The fixes
To fix the first bug, I fixed a flaw in the
parse_parts
method.The second one was more complicated, as with the current interface of
parse_parts
(called by_parse_args
), the bug can't be fixed. Let's take a simple example:WindowsPath(WindowsPath('./a:b'))
andWindowsPath('a:b')
- before the bugfix, they are equal. That happens because in both cases,parse_parts
is called with['a:b']
. This part can be interpreted in two ways - either as the relative path'b'
with the drive'a:'
, or as a file'a'
with the NTFS data-stream'b'
.That means that in some cases, passing the flattened
_parts
of a path toparse_parts
is lossy. Therefore we have to a modifyparse_parts
's interface to enable passing more detailed information about the given parts. What I decided to do was allow passing tuples in addition to strings, thus supporting the old interface. The tuples would contain the drive, root and path parts, enough information to determine the correct parsing of path parts with data-streams, and maybe more future edge cases.After modifying
parse_parts
's interface, I changed_parse_args
to use it and madePath.absolute
,Path.expanduser
andPath.__rtruediv__
passPath
objects to_parse_args
instead of path parts, to preserve the path information.To solve the third bug I had to make small changes to both the
resolve
method and toPath.__str__
.Notes
In addition to the changes in the code, I've added regression tests and modified old incorrect tests.
Details about drive-relative paths can be found here.
https://bugs.python.org/issue36305