Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Decode subprocess output to utf-8 or regex will fail #1879

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

tomspur
Copy link
Contributor

@tomspur tomspur commented Apr 3, 2013

This is an alternative fix for issue #1767, which changes the regular expression to bytes instead. But that will fail later on in if os.path.splitext(file)[1][1:] in fontexts, as fontexts will then be a set of strings and the former will be the current extension, but in bytes.

After this patch both will be strings and anything works here for me so far.

The RH bug for this is at:
https://bugzilla.redhat.com/show_bug.cgi?id=928326

@mdboom
Copy link
Member

mdboom commented Apr 3, 2013

This will only work if the filesystem encoding happens to be utf-8, though, right? (That's usually the case, but not always). I think instead this needs to use sys.getfilesystemencoding().

@tomspur
Copy link
Contributor Author

tomspur commented Apr 5, 2013

Three variables must be considered:

  • Regular expressions compiler
  • output of subprocess, which is processed with the reqular expressions compiler. By default returns bytes
  • fontexts synonyms, which are plain strings at the moment.

Things to consider:

  • The reqular expressions compiler and the output of subprocess encoding must match, which is the case, when using a bytes reqular expression or decoding the bytes object to a string. utf-8 is only the default for bytes.decode and could also be omited.
  • Even if the system file encoding will be differently, subprocess returns byes, which can be decoded to whatever you like. So utf-8, what is the default for bytes.decode anyway looks fine.

The only reason for sys.getfilesystemencoding() would be, that matplotlib uses the same encoding internally than the file system. But even there, I think it would be more sane to directly use utf-8 instead and would favour that in any case...

@mdboom
Copy link
Member

mdboom commented Apr 8, 2013

I think you've raised a valid issue with the fix in #1767, however I think this still is not correct.

If the file system encoding were, for example latin-1, it's possible, and actually quite easy, to have filenames that are not valid utf-8. (Note: I'm not saying not valid Unicode -- utf-8 is only an encoding, and not all bytestrings are valid utf-8). I looked deeper into fc-match -- the manpage doesn't really address encodings, but it does look from the code like the output is basically utf-8, though the filename part is passed in directly through, so it can be in the filesystem encoding. So I think the correct fix is:

--- a/lib/matplotlib/font_manager.py
+++ b/lib/matplotlib/font_manager.py
@@ -1289,11 +1289,12 @@ if USE_FONTCONFIG and sys.platform != 'win32':
         if pipe.returncode == 0:
             for match in _fc_match_regex.finditer(output):
                 file = match.group(1)
+                file = file.decode(sys.getfilesystemencoding())
                 if os.path.splitext(file)[1][1:] in fontexts:
                     return file
         return None

-    _fc_match_regex = re.compile(r'\sfile:\s+"([^"]*)"')
+    _fc_match_regex = re.compile(rb'\sfile:\s+"([^"]*)"')
     _fc_match_cache = {}

     def findfont(prop, fontext='ttf'):

@pelson
Copy link
Member

pelson commented Apr 18, 2013

I've marked this a blocker for v1.3 as it sounds pretty nasty. @mdboom & @tomspur - do we have a consensus on where this PR should go?

@tomspur
Copy link
Contributor Author

tomspur commented Apr 18, 2013

@pelson I think @mdboom's appoach is more reasonable. Unfortunately, I currently have other problems with my former reproducer from above (unrelated to matplotlib) and cannot verify his solution currently...

@mdboom mdboom closed this in 2415c62 Apr 18, 2013
@tomspur tomspur deleted the font_manager_utf8regex branch March 8, 2017 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants