Decode subprocess output to utf-8 or regex will fail #1879

tomspur · 2013-04-03T00:20:54Z

This is an alternative fix for issue #1767, which changes the regular expression to bytes instead. But that will fail later on in if os.path.splitext(file)[1][1:] in fontexts, as fontexts will then be a set of strings and the former will be the current extension, but in bytes.

After this patch both will be strings and anything works here for me so far.

The RH bug for this is at:
https://bugzilla.redhat.com/show_bug.cgi?id=928326

mdboom · 2013-04-03T12:35:39Z

This will only work if the filesystem encoding happens to be utf-8, though, right? (That's usually the case, but not always). I think instead this needs to use sys.getfilesystemencoding().

tomspur · 2013-04-05T11:14:22Z

Three variables must be considered:

Regular expressions compiler
output of subprocess, which is processed with the reqular expressions compiler. By default returns bytes
fontexts synonyms, which are plain strings at the moment.

Things to consider:

The reqular expressions compiler and the output of subprocess encoding must match, which is the case, when using a bytes reqular expression or decoding the bytes object to a string. utf-8 is only the default for bytes.decode and could also be omited.
Even if the system file encoding will be differently, subprocess returns byes, which can be decoded to whatever you like. So utf-8, what is the default for bytes.decode anyway looks fine.

The only reason for sys.getfilesystemencoding() would be, that matplotlib uses the same encoding internally than the file system. But even there, I think it would be more sane to directly use utf-8 instead and would favour that in any case...

mdboom · 2013-04-08T17:36:58Z

I think you've raised a valid issue with the fix in #1767, however I think this still is not correct.

If the file system encoding were, for example latin-1, it's possible, and actually quite easy, to have filenames that are not valid utf-8. (Note: I'm not saying not valid Unicode -- utf-8 is only an encoding, and not all bytestrings are valid utf-8). I looked deeper into fc-match -- the manpage doesn't really address encodings, but it does look from the code like the output is basically utf-8, though the filename part is passed in directly through, so it can be in the filesystem encoding. So I think the correct fix is:

--- a/lib/matplotlib/font_manager.py
+++ b/lib/matplotlib/font_manager.py
@@ -1289,11 +1289,12 @@ if USE_FONTCONFIG and sys.platform != 'win32':
         if pipe.returncode == 0:
             for match in _fc_match_regex.finditer(output):
                 file = match.group(1)
+                file = file.decode(sys.getfilesystemencoding())
                 if os.path.splitext(file)[1][1:] in fontexts:
                     return file
         return None

-    _fc_match_regex = re.compile(r'\sfile:\s+"([^"]*)"')
+    _fc_match_regex = re.compile(rb'\sfile:\s+"([^"]*)"')
     _fc_match_cache = {}

     def findfont(prop, fontext='ttf'):

pelson · 2013-04-18T12:28:31Z

I've marked this a blocker for v1.3 as it sounds pretty nasty. @mdboom & @tomspur - do we have a consensus on where this PR should go?

tomspur · 2013-04-18T12:47:50Z

@pelson I think @mdboom's appoach is more reasonable. Unfortunately, I currently have other problems with my former reproducer from above (unrelated to matplotlib) and cannot verify his solution currently...

Decode subprocess output to utf-8 or regex will fail

e46a976

tomspur mentioned this pull request Apr 3, 2013

bytes regex matching issue in font_manager.py around 1283 (line number) #1767

Closed

tomspur mentioned this pull request Apr 10, 2013

Python 3 and matplotlib 1.2 issue on Linux #1873

Closed

mdboom mentioned this pull request Apr 18, 2013

Fix filename decoding when calling fc-match #1921

Merged

mdboom closed this in 2415c62 Apr 18, 2013

tomspur deleted the font_manager_utf8regex branch March 8, 2017 20:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Decode subprocess output to utf-8 or regex will fail #1879

Decode subprocess output to utf-8 or regex will fail #1879

Uh oh!

tomspur commented Apr 3, 2013

Uh oh!

mdboom commented Apr 3, 2013

Uh oh!

tomspur commented Apr 5, 2013

Uh oh!

mdboom commented Apr 8, 2013

Uh oh!

pelson commented Apr 18, 2013

Uh oh!

tomspur commented Apr 18, 2013

Uh oh!

Uh oh!

Uh oh!

Decode subprocess output to utf-8 or regex will fail #1879

Decode subprocess output to utf-8 or regex will fail #1879

Uh oh!

Conversation

tomspur commented Apr 3, 2013

Uh oh!

mdboom commented Apr 3, 2013

Uh oh!

tomspur commented Apr 5, 2013

Uh oh!

mdboom commented Apr 8, 2013

Uh oh!

pelson commented Apr 18, 2013

Uh oh!

tomspur commented Apr 18, 2013

Uh oh!

Uh oh!