-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Handle Unicode font filenames correctly/Fix crashing MacOSX backend #2433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ntconfig in the filesystemencoding. Fix a number of buglets that did not allow fonts with non-ascii filenames to be opened. This involves taking advantage of code already in FT2Font to use unicode filenames, and encoding the filenames before passing into ttconv.
# The bulk of the output from fc-list is ascii, so we keep the | ||
# result in bytes and parse it as bytes, until we extract the | ||
# filename, which is in sys.filesystemencoding(). | ||
for line in output.split(b'\n'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can b'\n' or b':' be part of a bytestring representation of a unicode filename itself? If so, this will fail. I don't see the rationale for doing all this splitting on bytestrings instead of converting to unicode, doing the string-based processing, and then converting back to bytes before returning from this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The need to do this is because if the file system encoding is something that isn't byte-based, such as utf-16be on Windows, then you have a combination of the ascii output from fontconfig with the embedded filename path in utf-16.
It is possible that '\n' could be in the filename, but again, that's kind of a degenerate case. It would not appear unexpectedly as a utf-8 surrogate pair (since utf-8 ensures that the control characters always remain control characters). It could appear in utf-16 as an upper byte, I suppose.
The colon is indeed an issue, however.
Looking at the fc-list manpage further, I think we can provide it with a custom output format that would only contain the filepath (which is the only part we care about anyway). That would simplify this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a new solution to this using the --format
argument to fontconfig.
anything complex ourselves.
Travis is very unhappy -- looking into it. |
The fix for the Mac OS X backend in src/_macosx.m looks fine. We may want to raise an Exception if the result of setfont is not NULL, just to be sure that we don't get a hard crash if for some reason setfont fails to find the font. |
@mdehoon: I'll add the setfont NULL check. Thanks for the tip. |
I have added the NULL check on setfont. |
That looks fine to me. You may consider to skip the call to CGContextSelectFont(cr, name, size, kCGEncodingMacRoman) if font is NULL. |
Handle Unicode font filenames correctly/Fix crashing MacOSX backend
This is an alternative to #2431.
First, when handling the output of fontconfig (fc-match and fc-list), the bulk of the output is in ascii, so it is maintained, parsed and handled as bytes. Only once the filename part of the output is extracted do we decode it using sys.getfilesystemencoding().
Behind that bug, if you actually try to use a font file with non-ascii characters in the path:
Every time we passed the path to FT2Font, we were calling str() on it. This was required when FT2Font did not handle Unicode paths, prior to f4adec. However, then and now, it just serves to raise an exception when provided with a non-ascii path.
Within FT2Font, that unicode string was encoded for the purpose of displaying error messages. This could fail both on encoding and displaying it. Encoding it using "unicode_escape" ensures that we'll have something printable in any environment.
ttconv does not handle Unicode filenames. ttconv is in the process of being eliminated for MEP14, so I didn't want to invest a lot of effort fixing that. Instead, since it is only called from two places, the path is encoded back to filesystemencoding on the Python side before calling into it. This doesn't change the API to ttconv.
This will require a manual backport to 1.3.x if we want that -- things are different enough there that this won't really work or apply.
This approach is not necessarily "best practice" for POSIX platforms, which is to keep paths as byte strings everywhere. I tried that first, but there are too many assumptions throughout the code that file paths are unicode (particularly post f4adec with the move to six and
from __future__ import unicode_literals
). The downside of that is pretty minor -- if a path contains a sequence that does not decode using the filesystemencoding, such as an invalid utf-8 surrogate pair, it will not be openable by matplotlib. I'd consider that a pathological, even if technically legal, case, however, so I don't think we should worry too much about it. Keeping the paths as unicode objects does make the processing of them much easier, particularly withsix
and Python 3.