Don't include the postscript title if it is not latin-1 encodable. #11130

Merged on Apr 28, 2018 (1 commit)

Conversation

anntzer
Contributor

@anntzer anntzer commented Apr 25, 2018

PR Summary

Closes #11124.

There does not appear to be a complete Unicode encoding available for postscript, so even if certain non-latin1 characters can be handled (won't be done in this PR, in any case), we'll always need to know what to do in the case we can't encode the title.

The %%Title is optional, as is clear from the is_writable_file_like(outfile) clause.

PR Checklist

  • Has Pytest style unit tests
  • Code is PEP 8 compliant
  • New features are documented, with examples if plot related
  • Documentation is sphinx and numpydoc compliant
  • Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
  • Documented in doc/api/api_changes.rst if API changed in a backward-incompatible way

@anntzer anntzer added this to the v3.0 milestone Apr 25, 2018
@jenshnielsen
Member

👍 An alternative would be to use replace or ignore options for encode https://docs.python.org/3/howto/unicode.html#converting-to-bytes

@anntzer
Contributor Author

anntzer commented Apr 26, 2018

That'll happen after someone does an exegesis of the 700 pages of the postscript standard to figure out what the correct approach (if any) is, but my limited understanding is summarized above. Anyways, I think "saving with missing optional metadata" is better than "failing to save".

@jenshnielsen
Member

@anntzer I do not understand your comment

If you do as in your PR and catch any exception, the title will be blank whenever it contains a non-latin-1 character. If, on the other hand, you do

title.encode("latin-1", 'replace')

you will get the title, but with any invalid character replaced by a ?

I don't see how this requires reading 700 pages of standard
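
The difference between the two approaches discussed above can be sketched roughly as follows. The helper names `title_or_none` and `title_replaced` are hypothetical and are not taken from the matplotlib source; the sketch only illustrates the behaviour of `str.encode` with and without the "replace" error handler.

```python
def title_or_none(title):
    """Drop the title entirely if it is not latin-1 encodable."""
    try:
        title.encode("latin-1")
    except UnicodeEncodeError:
        return None
    return title


def title_replaced(title):
    """Keep the title, substituting '?' for non-latin-1 characters."""
    return title.encode("latin-1", "replace").decode("latin-1")


print(title_or_none("plot_α.eps"))   # None
print(title_replaced("plot_α.eps"))  # plot_?.eps
print(title_replaced("plot_ü.eps"))  # plot_ü.eps
```

With the first approach the optional %%Title would simply be omitted; with the second it is kept in a degraded form.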

@timhoffm
Member

+1 for replace because the standard says:

An application or spooler may optionally use the general header comments %%Creator:, %%Title:, and %%CreationDate: to provide information about a document. These header comments are strongly recommended for EPS files.

@anntzer
Contributor Author

anntzer commented Apr 26, 2018

The idea was that perhaps the standard specifies a way to include non-ASCII strings, or perhaps it doesn't; I don't know. In fact it is not even clear that latin-1 encoding (which we use right now) is correct, as PostScript traditionally uses something else.
Not that I really care, feel free to push something else to this PR.

@timhoffm
Member

From my understanding, the PostScript standard supports more general strings than ASCII. However, you have to do it yourself. From https://unix.stackexchange.com/questions/269659/graphviz-how-to-get-utf-8-and-external-postscript-procedures

rendering utf-8 fonts in Postscript is a do-it-yourself job. It would probably take weeks or months of work.

IMO not worth looking into. Let's just replace.

@anntzer anntzer force-pushed the postscript-unicode-title branch from 099efbc to 9e311d9 Compare April 28, 2018 21:18
@anntzer
Contributor Author

anntzer commented Apr 28, 2018

fixed accordingly

@tacaswell tacaswell merged commit 3ba79fb into matplotlib:master Apr 28, 2018
@tacaswell
Member

Thanks all!

@anntzer anntzer deleted the postscript-unicode-title branch April 29, 2018 00:31
@wilfriedh

wilfriedh commented Nov 23, 2018

After updating to the latest release of Anaconda, which includes matplotlib 3.0.1, savefig to eps fails if the filename contains e.g. an "ü" (0xfc), which is part of the ISO 8859-1 character set.
Error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 11: invalid start byte

Savefig to png works without problems. -- System is Windows 7 pro 64bit.
If I comment out the following line in backend_ps.py (line number 975 in the current github version)
title = title.encode("latin-1", "replace").decode()
savefig to eps works, the "ü" is included in the %%Title: ... as such (in ANSI encoding), and the saved eps is interpreted by ghostscript without problem.

Q: Is this the right place for this issue, or is this an Anaconda issue?
I also posted this to the Anaconda tracker as
https://github.com/ContinuumIO/anaconda-issues/issues/10356

Traceback:
File "D:/Daten/DemandRegio/Daten/Zeitverwendung/suf_zve_2012_2013/daten/zve-HH-to-Profile-allHHSizes_v8_stacked.py", line 541, in
transparent=False, frameon=False)

File "C:\Anaconda3\lib\site-packages\matplotlib\pyplot.py", line 689, in savefig
res = fig.savefig(*args, **kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\figure.py", line 2094, in savefig
self.canvas.print_figure(fname, **kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backend_bases.py", line 2075, in print_figure
**kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backends\backend_ps.py", line 921, in print_eps
return self._print_ps(outfile, 'eps', *args, **kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backends\backend_ps.py", line 950, in _print_ps
**kwargs)

File "C:\Anaconda3\lib\site-packages\matplotlib\backends\backend_ps.py", line 976, in _print_figure
title = title.encode("latin-1", "replace").decode()
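
The traceback above is consistent with `bytes.decode()` defaulting to UTF-8 when no codec is given: a latin-1 byte such as 0xfc is not valid UTF-8 on its own. A minimal reproduction, independent of matplotlib and using a made-up file name, might look like this:

```python
title = "Grün.eps"  # hypothetical file name containing "ü" (0xfc in latin-1)

encoded = title.encode("latin-1", "replace")  # b'Gr\xfcn.eps'

try:
    encoded.decode()  # no codec given, so UTF-8 is assumed
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Decoding with the same codec that was used for encoding round-trips cleanly.
print(encoded.decode("latin-1"))
```

This suggests the failure comes from the codec mismatch between the `encode` and `decode` calls rather than from Anaconda or the platform.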

@anntzer
Contributor Author

anntzer commented Nov 23, 2018

fwiw I looked a bit more into the standard. The following excerpts are relevant:

PostScript standard section 3.2.2

Literal Text Strings

Within a text string, the \ (backslash) character is treated as an “escape” for various purposes, such as including newline characters, unbalanced parentheses, and the \ character itself in the string. The character immediately following the \ determines its precise interpretation.

\n line feed (LF)
\r carriage return (CR)
\t horizontal tab
\b backspace
\f form feed
\\ backslash
\( left parenthesis
\) right parenthesis
\ddd character code ddd (octal)

PostScript Document Structuring Conventions

%%Title:

This is a modified version of the <text> elementary type. If the first character encountered is a left parenthesis, it is equivalent to a <text> string. If not, the token is considered to be the rest of the characters on the line until end of line
A text string comprises any printable characters and is usually considered to be delimited by blanks. If blanks or special characters are desired inside the text string, the entire string should be enclosed in parentheses. Document managers parsing text strings should be prepared to handle multiple parentheses. Special characters can be denoted using the PostScript language string \ escape mechanism.

The following are examples of valid DSC text strings:
Thisisatextstring
(This is a text string with spaces)
(This is a text string (with parentheses))
(This is a special character \262 using the \ mechanism)
It is a good idea to enclose numbers that should be treated as text strings in parentheses to avoid confusion. For example, use (1040) instead of 1040.
The sequence () denotes an empty string.
Note that a text string must obey the 255 character line limit as set forth in section 3, “DSC Conformance.”

A quick test shows that at least okular does convert \ddd escapes (while it gets confused by latin-1-encoded, non-ASCII strings) when using the "Import PostScript as PDF" functionality (the metadata can then be checked using "Properties"); however it does not e.g. check for a starting opening parenthesis as required by the spec.

Which is kind of strange as I think okular relies on libspectre, which explicitly does handle this (https://github.com/freedesktop/libspectre/blob/48696f7e724923564dd6c8908afdb7c9d4893f02/libspectre/ps.c#L1305).

So I guess we could implement \ddd escapes and get some small additional correctness there.
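
As a rough illustration of what implementing `\ddd` escapes could involve, here is a small sketch. The function name `ps_text_string` is hypothetical, and the use of latin-1 to turn characters into bytes is an assumption (one this thread itself questions); it is not what matplotlib does.

```python
def ps_text_string(text, encoding="latin-1"):
    """Return `text` as a parenthesised DSC text string.

    Printable ASCII is kept as-is; backslashes and parentheses are
    backslash-escaped; every other byte becomes a \\ddd octal escape.
    The choice of `encoding` is an assumption, not mandated by the spec.
    """
    out = []
    for byte in text.encode(encoding, "replace"):
        char = chr(byte)
        if char in "\\()":
            out.append("\\" + char)       # \\, \( or \)
        elif 0x20 <= byte < 0x7F:
            out.append(char)              # printable ASCII
        else:
            out.append("\\%03o" % byte)   # three-digit octal escape
    return "(" + "".join(out) + ")"


print(ps_text_string("Grün (v2)"))  # (Gr\374n \(v2\))
```

The sketch does not enforce the 255 character line limit mentioned in the excerpt, and it leaves open how a reader of the file should interpret bytes above 127.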

@wilfriedh

wilfriedh commented Nov 26, 2018

I don't agree.
You write that okular will correctly interpret \ddd escapes. But \ddd escapes are only a different way of writing 8-bit character constants.
While there is only one correct interpretation of a character encoded in Unicode or UTF-8, there are many different yet legal interpretations of a plain-text character byte with a value above 127 (decimal).
If such a character is in a PostScript text string to be displayed or printed, a font defined in the PostScript file determines what this character looks like.
But there is no way to specify a font or code page that determines what characters above 127 in the %%Title: string (or other comments) look like.
An EPS file should be built so that it is interpreted correctly by any conforming EPS interpreter using any code page, and a character above 127 in PostScript comments will produce different results on different interpreters or systems.
Fortunately the eps backend outputs the graphic itself as bitmap, so any text in the graphic is not changed if the eps is displayed on a system with a codepage different from the one it was created on. So the problem is only with the %%Title: string.

@anntzer
Contributor Author

anntzer commented Nov 26, 2018

I agree with your interpretation.
Can you submit a PR replacing "latin-1" by "ascii" in the #12869 patch?
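
The change requested here would amount to something like the following. This is only a sketch with a hypothetical function name, not the actual patch from #12869 or #12890.

```python
def ascii_title(title):
    """Replace every non-ASCII character with '?' so the result is pure ASCII."""
    return title.encode("ascii", "replace").decode("ascii")


print(ascii_title("Grün.eps"))  # Gr?n.eps
```

Because the output contains only 7-bit characters, its interpretation no longer depends on the code page of the system reading the file.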

@wilfriedh

Sorry, I don't (yet) know how to submit a PR. Up to now I only used the (bug)trackers of github, and I have no knowledge of the git system.

@anntzer
Contributor Author

anntzer commented Nov 26, 2018

Done in #12890.

Successfully merging this pull request may close these issues.

[Bug] savefig cannot save file with a Unicode name
6 participants