Make pdftex.map parsing stricter #20400

QuLogic · 2021-06-10T00:39:37Z

PR Summary

A test has started failing in Fedora Rawhide with TeX Live 2021; while I think there are some issues in the pdftex.map there (see my investigation here), I found the parser in Matplotlib to be a bit laxer than it should be. Annoyingly, dvipdfm and pdflatex appear to use different parsers for this file, but I chose to emulate what pdflatex does, as it appears to match what's in the pdfTeX manual, which we claim to follow.

Some more details are available in the commit messages, but behaviours copied from pdflatex include:

ignoring duplicate lines
ignoring lines with out of range special entries, or on the wrong font type
failing on subset TrueType fonts without encoding files

PR Checklist

Has pytest style unit tests (and pytest passes).
Is Flake 8 compliant (run flake8 on changed files to check).
[n/a] New features are documented, with examples if plot related.
[n/a] Documentation is sphinx and numpydoc compliant (the docs should build without error).
Conforms to Matplotlib style conventions (install flake8-docstrings and run flake8 --docstring-convention=all).
[n/a] New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
[n/a] API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

QuLogic · 2021-06-10T03:51:38Z

Also, I noticed that encodingfile and fontfile have inconsistent types. If they're not absolute, they're passed to find_tex_file, which returns a str, but otherwise they're bytes. We seem to pass these results to open or pathlib.Path, which accept both, but I wonder if we should reconcile this difference?

anntzer · 2021-06-10T05:56:22Z

lib/matplotlib/dviread.py

+            self._unparsed = defaultdict(list)
+            for line in file:
+                tfmname = line.split(b' ', 1)[0]
+                self._unparsed[tfmname].append(line)


I usually write this as

    self._unparsed = {}
    for line in file:
        ...; self._unparsed.setdefault(tfmname, []).append(line)

(defaultdict's autovivification always makes me a bit nervous) but I guess you should time whichever is fastest.
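The difference between the two spellings can be seen in a small sketch. The map lines and the helper names `group_with_defaultdict` and `group_with_setdefault` are made up for illustration and are not Matplotlib's code. Both build the same mapping, but a mere lookup of a missing key adds an entry to the `defaultdict`, while the plain dict raises `KeyError` and stays unchanged.

```python
from collections import defaultdict

# Made-up map lines for illustration; real pdftex.map lines may carry more fields.
LINES = [
    b"cmr12 CMR12 <cmr12.pfb",
    b"cmr10 CMR10 <cmr10.pfb",
    b"cmr12 CMR12 <cmr12.pfb",
]


def group_with_defaultdict(lines):
    """Group lines by their first field using defaultdict."""
    grouped = defaultdict(list)
    for line in lines:
        tfmname = line.split(b" ", 1)[0]
        grouped[tfmname].append(line)
    return grouped


def group_with_setdefault(lines):
    """Group lines by their first field using a plain dict."""
    grouped = {}
    for line in lines:
        tfmname = line.split(b" ", 1)[0]
        grouped.setdefault(tfmname, []).append(line)
    return grouped


auto = group_with_defaultdict(LINES)
plain = group_with_setdefault(LINES)

# Both approaches build the same mapping.
same_content = dict(auto) == plain

# Looking up a missing key silently adds it to the defaultdict ...
auto[b"missing"]
vivified = b"missing" in auto

# ... whereas the plain dict raises KeyError and is left untouched.
try:
    plain[b"missing"]
except KeyError:
    pass
untouched = b"missing" not in plain
```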

Using timeit.timeit('dviread.PsfontsMap(".../pdftex.map")', setup='from matplotlib import dviread') run 5 times and discarding the slowest and fastest results, the average is 0.202 microseconds for defaultdict and 0.183 microseconds for setdefault, for a map file that is 40827 lines long. Not sure if that's long or short.

I'd say that's on the long side, thanks for checking.

anntzer · 2021-06-10T06:03:33Z

I guess the spirit of the module would be to keep everything as bytes (so convert back the result of find_tex_file using os.fsencode).

anntzer

modulo comments above.

This can be tested by placing two lines with the same `tfmname`, but different `psname` in a `pdftex.map`:
```
cmr12 CMR10 <cmr12.pfb
cmr12 CMR12 <cmr12.pfb
```
and then running `TEXFONTMAPS=/path/to/pdftex.map pdflatex` on a file using Computer Modern. It will warn about the second line, and embed `CMR10` as the name in the resulting PDF.
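The first-entry-wins behaviour described above can be sketched roughly as follows. This is an illustration only: `first_entry_per_tfmname` is a hypothetical helper, not the function in `dviread.py`, and emitting a Python warning for the ignored line is an assumption made here to mirror what pdflatex is described as doing.

```python
import warnings


def first_entry_per_tfmname(lines):
    """Keep only the first map line seen for each tfmname (hypothetical helper)."""
    entries = {}
    for line in lines:
        tfmname = line.split(b" ", 1)[0]
        if tfmname in entries:
            # Later duplicates are ignored; warning here is an assumption.
            name = tfmname.decode("ascii", "replace")
            warnings.warn(f"ignoring duplicate entry for {name}")
            continue
        entries[tfmname] = line
    return entries


lines = [b"cmr12 CMR10 <cmr12.pfb", b"cmr12 CMR12 <cmr12.pfb"]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    entries = first_entry_per_tfmname(lines)
```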

As noted in the pdftex manual, `SlantFont` and `ExtendFont` are only allowed for T1 fonts, and within range ±1 and ±2, respectively. This can be confirmed the same way as the previous commit, by copying the lines from the `test.map` (though using a _real_ tfmname).
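A minimal sketch of the checks described in this commit message, assuming the special values have already been parsed into floats. `effects_are_valid` is a hypothetical name, and treating the range boundaries as inclusive is an assumption of this sketch.

```python
def effects_are_valid(is_t1, slant=None, extend=None):
    """Return False if a map line's font effects should cause it to be ignored.

    is_t1:  whether the line refers to a Type 1 font.
    slant:  parsed SlantFont value, or None if absent.
    extend: parsed ExtendFont value, or None if absent.
    """
    has_effects = slant is not None or extend is not None
    if has_effects and not is_t1:
        # Effects are only allowed for T1 fonts.
        return False
    if slant is not None and abs(slant) > 1:
        # SlantFont must lie within +/-1 (boundary assumed inclusive).
        return False
    if extend is not None and abs(extend) > 2:
        # ExtendFont must lie within +/-2 (boundary assumed inclusive).
        return False
    return True
```

For example, `effects_are_valid(True, slant=0.167)` accepts the line, while `effects_are_valid(False, slant=0.167)` rejects it because the font is not T1.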

As noted in the pdftex manual,

> The *encodingfile* field may be omitted if you are sure that the font
> resource has the correct built-in encoding. In general this option is
> highly recommended, and it is *required* when subsetting a TrueType
> font.

This can be confirmed in a similar way to the previous commits, though instead of ignoring the line, pdflatex quits while attempting to embed the font.
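The rule quoted from the manual can be expressed as a small predicate. This is a sketch under assumptions: `truetype_subset_is_usable` is a hypothetical name, the fields are taken as already parsed, and detecting TrueType fonts by a `.ttf` or `.ttc` extension is a simplification made here.

```python
def truetype_subset_is_usable(fontfile, is_subsetted, encodingfile):
    """Return False for a subsetted TrueType font without an encoding file.

    fontfile:     font file name as bytes, e.g. b"font.ttf".
    is_subsetted: whether the map line requests subsetting.
    encodingfile: encoding file name as bytes, or None if omitted.
    """
    # Assumption: TrueType fonts are recognised by file extension only.
    is_truetype = fontfile.lower().endswith((b".ttf", b".ttc"))
    return not (is_truetype and is_subsetted and encodingfile is None)
```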

QuLogic · 2021-06-10T07:05:32Z

Oops, I thought I tried it, but I guess Path doesn't like bytes, so I'll have to go with str.
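The type mismatch can be reproduced with the standard library alone. In this sketch the file name is just an example; `pathlib.Path` rejects bytes with a `TypeError`, so a bytes name has to go through `os.fsdecode` first, and `os.fsencode` converts back.

```python
import os
import pathlib

raw = b"cmr12.pfb"  # example name, as it would come from a bytes-based parser

# pathlib.Path does not accept bytes arguments.
try:
    pathlib.Path(raw)
    path_accepts_bytes = True
except TypeError:
    path_accepts_bytes = False

# Decoding first gives a str that Path is happy with.
as_str = os.fsdecode(raw)
path = pathlib.Path(as_str)

# os.fsencode accepts str and path-like objects and returns bytes again.
round_tripped = os.fsencode(path)
```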

anntzer · 2021-06-10T07:23:27Z

lib/matplotlib/dviread.py

+            if not encodingfile.startswith(b"/"):
+                encodingfile = find_tex_file(encodingfile)
+            else:
+                encodingfile = encodingfile.decode('utf-8', errors='replace')


that should be os.fsdecode then? at least on linux... (using surrogateescape rather than replace may matter)
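Why the error handler may matter can be shown with a file name that is not valid UTF-8. The path below is invented for illustration, and the sketch fixes the codec to UTF-8 explicitly instead of calling `os.fsdecode`, whose codec and error handler depend on the platform (on POSIX it uses the filesystem encoding with `surrogateescape`).

```python
# Invented path whose name contains a Latin-1 byte that is invalid in UTF-8.
raw = b"/fonts/caf\xe9.enc"

# errors="replace" substitutes U+FFFD, so the original byte is lost.
replaced = raw.decode("utf-8", errors="replace")
lossy = replaced.encode("utf-8", errors="surrogateescape")

# errors="surrogateescape" smuggles the byte through as a lone surrogate,
# so encoding with the same handler restores the original bytes.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

With `replace`, the decoded name no longer refers to the file on disk; with `surrogateescape`, the round trip is lossless.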

This matches what find_tex_file does.

that's what find_tex_file does for filename which should effectively be just a filename (not an absolute path); the absolute path's encoding is determined by the kwargs a bit further down (well, things are a bit more complicated, but still...).

But that's the encoding for communicating to/from kpsewhich only. For converting the input to find_tex_file (which is directly from this file), it uses replaced utf-8 like above.

Can we split that out to a separate issue/PR, and keep the type instability for now? I am not convinced that this is correct, but mostly just need to spend some time setting up a system with weird fsencoding for testing...

Sure, can do that.

Actually, looking at it again, it seems that we can just remove the startswith("/") check and always pass things to find_tex_file. I have checked that kpsewhich (whether called directly or via luatex) will just happily pass-through absolute paths, so we don't need to pre-filter them out. It is true that in theory this may make things slightly slower (due to the subprocess interaction), but in practice I haven't seen any absolute paths in pdftex.map either on my machine or on the shared macos...

Hmm, okay, changed to call that always then.

It should work for absolute paths as well.

jklymak

I'm approving on the merits of @anntzer's review.

QuLogic added the topic: text label Jun 10, 2021

QuLogic added this to the v3.5.0 milestone Jun 10, 2021

QuLogic added the backend: pdf label Jun 10, 2021

anntzer reviewed Jun 10, 2021


anntzer approved these changes Jun 10, 2021


QuLogic added 3 commits June 10, 2021 02:20

QuLogic force-pushed the stricter-psfontsmap branch from d9d727b to ccaf495 Compare June 10, 2021 06:30

QuLogic force-pushed the stricter-psfontsmap branch from ccaf495 to b447363 Compare June 10, 2021 07:20

anntzer reviewed Jun 10, 2021


QuLogic force-pushed the stricter-psfontsmap branch from b447363 to aa8e129 Compare June 10, 2021 22:41

Always call find_tex_file for PsfontsMap.

70245b1

It should work for absolute paths as well.

jklymak approved these changes Jun 16, 2021


jklymak merged commit 7d50020 into matplotlib:master Jun 16, 2021

QuLogic deleted the stricter-psfontsmap branch June 16, 2021 19:26
