Don't include the postscript title if it is not latin-1 encodable. #11130

anntzer · 2018-04-25T23:29:41Z

PR Summary

There does not appear to be a complete Unicode encoding available for postscript, so even if certain non-latin1 characters can be handled (won't be done in this PR, in any case), we'll always need to know what to do in the case we can't encode the title.

The %%Title is optional, as is clear from the is_writable_file_like(outfile) clause.

PR Checklist

Has Pytest style unit tests
Code is PEP 8 compliant
New features are documented, with examples if plot related
Documentation is sphinx and numpydoc compliant
Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

jenshnielsen · 2018-04-26T08:07:51Z

👍 An alternative would be to use replace or ignore options for encode https://docs.python.org/3/howto/unicode.html#converting-to-bytes

anntzer · 2018-04-26T08:21:40Z

That'll happen after someone does an exegesis of the 700 pages of the postscript standard to figure out what the correct approach (if any) is, but my limited understanding is summarized above. Anyways, I think "saving with missing optional metadata" is better than "failing to save".

jenshnielsen · 2018-04-26T13:58:05Z

@anntzer I do not understand your comment

If you do as in your PR and catch any exception, the title will be blank whenever it contains a non-latin-1 character. If, on the other hand, you do

title.encode("latin-1", 'replace')

you will get the title but with any invalid char replaced by a ?

I don't see how this requires reading 700 pages of standard
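
For readers comparing the two options discussed here, the following is a minimal sketch, not the code from this PR. The helper names `title_or_none` and `title_replaced` and the sample titles are hypothetical; only `str.encode` with the `"replace"` error handler comes from the comment above.

```python
def title_or_none(title):
    """Option 1 (hypothetical helper): drop the title if it is not latin-1 encodable."""
    try:
        title.encode("latin-1")
    except UnicodeEncodeError:
        return None  # caller would then skip writing %%Title
    return title


def title_replaced(title):
    """Option 2 (hypothetical helper): keep the title, substituting '?' for
    every character that latin-1 cannot represent."""
    return title.encode("latin-1", "replace").decode("latin-1")


latin1_title = "über"      # every character exists in latin-1
other_title = "Δt plot"    # 'Δ' does not exist in latin-1

print(title_or_none(latin1_title), title_or_none(other_title))
print(title_replaced(latin1_title), title_replaced(other_title))
```

With the first option a single unencodable character removes the whole title; with the second the rest of the title survives.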

timhoffm · 2018-04-26T19:37:25Z

+1 for replace because the standard says:

An application or spooler may optionally use the general header comments %%Creator:, %%Title:, and %%CreationDate: to provide information about a document. These header comments are strongly recommended for EPS files.

anntzer · 2018-04-26T21:35:28Z

The idea was that perhaps the standard specifies a way to include non-ASCII strings, or perhaps it doesn't; I don't know. In fact it is not even clear that latin-1 encoding (which we use right now) is correct, as PostScript traditionally uses something else.
Not that I really care, feel free to push something else to this PR.

timhoffm · 2018-04-28T09:48:11Z

From my understanding, the PostScript standard supports more general strings than ASCII. However, you have to do it yourself. From https://unix.stackexchange.com/questions/269659/graphviz-how-to-get-utf-8-and-external-postscript-procedures

rendering utf-8 fonts in Postscript is a do-it-yourself job. It would probably take weeks or months of work.

IMO not worth looking into. Let's just replace.

anntzer · 2018-04-28T21:18:36Z

fixed accordingly

tacaswell · 2018-04-28T23:31:47Z

Thanks all!

wilfriedh · 2018-11-23T09:16:38Z

After updating to the latest release of Anaconda, which includes matplotlib 3.0.1, savefig to eps fails if the filename contains e.g. an "ü" (byte 0xfc), which is part of the ISO 8859-1 character set.
Error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 11: invalid start byte

Savefig to png works without problems. -- System is Windows 7 pro 64bit.
If I comment-out the following line in backend_ps.py (line number 975 in the current github version)
title = title.encode("latin-1", "replace").decode()
savefig to eps works, the "ü" is included in the %%Title: ... as such (in ANSI encoding), and the saved eps is interpreted by ghostscript without problem.

Q: Is this the right place for this issue, or is this an Anaconda issue?
I also posted this to the Anaconda tracker as
https://github.com/ContinuumIO/anaconda-issues/issues/10356

Traceback:
File "D:/Daten/DemandRegio/Daten/Zeitverwendung/suf_zve_2012_2013/daten/zve-HH-to-Profile-allHHSizes_v8_stacked.py", line 541, in
transparent=False, frameon=False)

File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 689, in savefig
res = fig.savefig(*args, **kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\figure.py", line 2094, in savefig
self.canvas.print_figure(fname, **kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backend_bases.py", line 2075, in print_figure
**kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backends\backend_ps.py", line 921, in print_eps
return self._print_ps(outfile, 'eps', *args, **kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backends\backend_ps.py", line 950, in _print_ps
**kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backends\backend_ps.py", line 976, in _print_figure
title = title.encode("latin-1", "replace").decode()
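
The error message is consistent with the last line of the traceback: `bytes.decode()` with no argument uses UTF-8, so bytes produced by a latin-1 encode cannot in general be decoded that way. The snippet below is a small reproduction under that assumption; the sample string is made up, and passing `"latin-1"` to `decode` is shown only as one way to make the round trip consistent, not necessarily what the later fix does.

```python
title = "für"  # hypothetical title containing 'ü' (0xfc in latin-1)

encoded = title.encode("latin-1", "replace")  # the single byte 0xfc stands for 'ü'

error = None
try:
    encoded.decode()  # default codec is UTF-8, where 0xfc is not a valid start byte
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the same codec that was used for encoding round-trips cleanly.
roundtrip = encoded.decode("latin-1")

print(type(error).__name__, roundtrip)
```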

anntzer · 2018-11-23T12:10:22Z

fwiw I looked a bit more into the standard. The following excerpts are relevant:

PostScript standard section 3.2.2

Literal Text Strings

Within a text string, the \ (backslash) character is treated as an “escape” for various purposes, such as including newline characters, unbalanced parentheses, and the \ character itself in the string. The character immediately following the \ determines its precise interpretation.

\n line feed (LF)
\r carriage return (CR)
\t horizontal tab
\b backspace
\f form feed
\\ backslash
\( left parenthesis
\) right parenthesis
\ddd character code ddd (octal)

PostScript Document Structuring Conventions

%%Title:

<textline>: This is a modified version of the <text> elementary type. If the first character encountered is a left parenthesis, it is equivalent to a <text> string. If not, the token is considered to be the rest of the characters on the line until end of line

A text string comprises any printable characters and is usually considered to be delimited by blanks. If blanks or special characters are desired inside the text string, the entire string should be enclosed in parentheses. Document managers parsing text strings should be prepared to handle multiple parentheses. Special characters can be denoted using the PostScript language string \ escape mechanism.
The following are examples of valid DSC text strings:
Thisisatextstring
(This is a text string with spaces)
(This is a text string (with parentheses))
(This is a special character \262 using the \ mechanism)
It is a good idea to enclose numbers that should be treated as text strings in parentheses to avoid confusion. For example, use (1040) instead of 1040.
The sequence () denotes an empty string.
Note that a text string must obey the 255 character line limit as set forth in
section 3, “DSC Conformance.”

A quick test shows that at least okular does convert \ddd escapes (while it gets confused by latin-1-encoded, non-ASCII strings) when using the "Import PostScript as PDF" functionality (the metadata can then be checked using "Properties"); however it does not e.g. check for a starting opening parenthesis as required by the spec.

Which is kind of strange as I think okular relies on libspectre, which explicitly does handle this (https://github.com/freedesktop/libspectre/blob/48696f7e724923564dd6c8908afdb7c9d4893f02/libspectre/ps.c#L1305).

So I guess we could implement \ddd escapes and get some small additional correctness there.
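
As a rough illustration of what such escaping could look like, here is a sketch that follows the quoted rules: wrap the string in parentheses, backslash-escape `\`, `(` and `)`, and write every non-printable or non-ASCII byte as `\ddd` in octal. The function name `escape_dsc_text` and the choice of latin-1 as the byte encoding are assumptions made for the example, not matplotlib code, and the 255-character line limit is not enforced.

```python
def escape_dsc_text(title, encoding="latin-1"):
    """Hypothetical helper: render `title` as a parenthesized DSC text string.

    The `encoding` default is an assumption; the DSC excerpt does not say
    which character set the resulting byte values refer to.
    """
    raw = title.encode(encoding, "replace")
    pieces = []
    for byte in raw:
        char = chr(byte)
        if char in "\\()":
            pieces.append("\\" + char)        # escape backslash and parentheses
        elif 0x20 <= byte < 0x7F:
            pieces.append(char)               # printable ASCII passes through
        else:
            pieces.append("\\%03o" % byte)    # everything else as \ddd (octal)
    return "(" + "".join(pieces) + ")"


print("%%Title: " + escape_dsc_text("für (x)"))
```

The output file then contains only ASCII bytes, but a reader of the file still has to guess which character set the octal codes refer to.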

wilfriedh · 2018-11-26T09:39:19Z

I don't agree.
You write that okular will correctly interpret \ddd escapes. But \ddd escapes are only a different way to write 8 bit character constants.
While there exists only one correct interpretation of a character coded in Unicode or UTF-8, there are a lot of different yet legal interpretations of a plaintext character byte with a value above 127d.
If such a character is in a PostScript text string to be displayed or printed, there is a character font defined in the PostScript file which determines what this character should look like.
But there is no way to specify a font or codepage which determines what characters above 127d in the %%Title: string (or other comments) should look like.
Eps should be built in a way that it is correctly interpreted by any conforming eps interpreter using any codepage, and using a character above 127d in PostScript comments will produce different results on different interpreters or systems.
Fortunately the eps backend outputs the graphic itself as bitmap, so any text in the graphic is not changed if the eps is displayed on a system with a codepage different from the one it was created on. So the problem is only with the %%Title: string.

anntzer · 2018-11-26T09:49:26Z

I agree with your interpretation.
Can you submit a PR replacing "latin-1" by "ascii" in the #12869 patch?
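
A minimal sketch of the suggested change, assuming it amounts to swapping the codec name in the encode/decode round trip; the helper name `ascii_title` is hypothetical and this is not copied from either patch.

```python
def ascii_title(title):
    """Hypothetical helper: restrict a title to ASCII, replacing every other
    character with '?' so the result reads the same under any codepage."""
    return title.encode("ascii", "replace").decode("ascii")


print(ascii_title("für"))      # 'ü' is outside ASCII and becomes '?'
print(ascii_title("Δt plot"))  # so does 'Δ'
```

Compared with latin-1, this loses characters such as "ü", but the resulting bytes are all below 128 and are therefore unambiguous.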

wilfriedh · 2018-11-26T11:05:20Z

Sorry, I don't (yet) know how to submit a PR. Up to now I only used the (bug)trackers of github, and I have no knowledge of the git system.

anntzer · 2018-11-26T11:18:37Z

Done in #12890.

anntzer added the backend: ps label Apr 25, 2018

anntzer added this to the v3.0 milestone Apr 25, 2018

dstansby approved these changes Apr 26, 2018

Don't include the postscript title if it is not latin-1 encodable.

9e311d9

anntzer force-pushed the postscript-unicode-title branch from 099efbc to 9e311d9 Compare April 28, 2018 21:18

tacaswell merged commit 3ba79fb into matplotlib:master Apr 28, 2018

anntzer deleted the postscript-unicode-title branch April 29, 2018 00:31

wilfriedh mentioned this pull request Nov 23, 2018

matplotlib 3.0.1 savefig to eps fails ContinuumIO/anaconda-issues#10356

Open

anntzer mentioned this pull request Nov 23, 2018

Fix latin-1-ization of Title in eps. #12869

Merged

6 tasks

anntzer mentioned this pull request Nov 26, 2018

Restrict postscript title to ascii. #12890

Merged

6 tasks
