
[MNT]: mathtext.MathTextParser is slow #20821

Open
HDembinski opened this issue Aug 10, 2021 · 14 comments

@HDembinski

Summary

The mathtext.MathTextParser is a performance bottleneck.

I profiled my interactive app, which uses PyQt and matplotlib to draw complex plots with very little text, and yet a significant share of time (20%) is spent in mathtext.MathTextParser. See
profile_graph.pdf

Proposed fix

The parser uses pyparsing internally and could potentially be sped up by switching to lark.

@tacaswell
Member

Effectively re-writing the mathtext module does not seem like a small project (I assume the output of the parsing step will be different enough that we will have to re-write the code that consumes it). It might be a good GSOC project?

A less invasive fix might be to add some strategic @functools.lru_cache. Judging by the function names in the PDF, it looks like we are already doing caching, but either the caching itself is still slow (maybe too many function calls?) or you are generating cache misses. It would be good to sort out which one of those it is.

If your text is not changing frequently you may also get a speed win (at the cost of an expensive first draw) by switching to usetex as we aggressively cache those results.
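The "strategic @functools.lru_cache" idea above can be sketched with the stdlib alone. Here parse_mathtext is a hypothetical stand-in for the expensive parser entry point, not matplotlib's actual function; the real change would wrap the real entry point instead:

```python
import functools

@functools.lru_cache(maxsize=128)
def parse_mathtext(s):
    # Hypothetical stand-in for an expensive mathtext parse; the real change
    # would decorate matplotlib's actual parser entry point instead.
    return s.count("\\")

parse_mathtext(r"$\alpha + \beta$")   # first call: computed, stored
parse_mathtext(r"$\alpha + \beta$")   # second call: served from the cache
info = parse_mathtext.cache_info()    # one hit, one miss
```

Note that this only helps if the same strings recur; dynamically generated labels would produce cache misses every time, which is exactly the distinction raised above.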

@HDembinski
Author

HDembinski commented Aug 30, 2021

It is probably not easy to replace the parser, true.

The hot-fix for my app was to remove all LaTeX, which produced a perceptible performance gain.

Some of the LaTeX was static, some was dynamically changing. I did not try to remove only the dynamic components which cannot be cached.

@tacaswell tacaswell added this to the unassigned milestone Aug 30, 2021
@ptmcg

ptmcg commented Oct 25, 2021

I'd be happy to work with a maintainer on possible performance improvements in this parser.

@tacaswell
Member

That would be very appreciated @ptmcg !

Our LaTeX parsing code is in https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/_mathtext.py, and the code for parsing font config strings is in https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/fontconfig_pattern.py.

I suspect there are a lot more opportunities for optimization in the LaTeX parsing code, but the font config code is run more frequently (though I could be wrong about both of those).

@anntzer
Contributor

anntzer commented Oct 26, 2021

Thanks!
Note that there's a slight rewrite of the parser in #21448 as well (I have a similar change for fontconfig in the works), but I don't think it would change any of the internal structures used by pyparsing.

From some quick profiling, I guess it may help a bit to replace the custom caching mechanism (_parseCache) with lru_cache, although things are complicated by the presence of debugActions, which always get executed (so one would need to maintain the old path when debugActions are present, and only enable the lru_cache-based one in their absence?).

@ptmcg

ptmcg commented Oct 26, 2021

Is the custom caching even being used? I don't see enablePackrat being called anywhere.

@anntzer
Contributor

anntzer commented Oct 26, 2021

ParserElement.enablePackrat()
?

@ptmcg

ptmcg commented Oct 26, 2021

My mistake, I was looking at an extract from _mathtext.py that I was using to test in isolation, and had left out that line 🤦.

Replacing the custom cache in pyparsing with lru_cache is probably not going to be helpful, as lru_cache won't cache exceptions and pyparsing expressions raise exceptions far more often than they return matched tokens.
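The point about exceptions can be demonstrated with a small stdlib-only sketch (try_parse is an illustrative toy, not pyparsing's actual matching code): lru_cache stores results only when the function returns, so every failed match attempt would re-run in full.

```python
import functools

calls = 0

@functools.lru_cache(maxsize=None)
def try_parse(s):
    # Toy stand-in for a pyparsing match attempt that fails by raising.
    global calls
    calls += 1
    if not s.startswith("\\"):
        raise ValueError("expected a control sequence")
    return s[1:]

for _ in range(3):
    try:
        try_parse("alpha")    # raises every time: nothing is cached
    except ValueError:
        pass

try_parse("\\alpha")          # miss: computed and cached
try_parse("\\alpha")          # hit
# calls is now 4: three uncached failures plus one successful computation
```

Since pyparsing expressions fail (raise) far more often than they succeed, a cache that only remembers successes would miss exactly the hot path.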

I looked at the font config parser, and it is pretty simple. I don't see any obvious places for improvement.

In _mathtext.py there are a couple of places I would start.

  • Remove the unnecessary Forward declarations for a number of your expressions. While these wrappers have no effect on the parsing logic, they add a couple of extra levels of function calls, and function calls are performance killers in Python. Start with some of the most commonly used (like p.bslash), or the most frequently encountered (probably symbol and its component expressions). For experimentation, you could just replace the <<= assignment with = and run your tests. Once you've deForwarded the expressions you want, run your tests with -Wd:::pyparsing and let the warnings tell you where to clean up the now-unneeded Forward assignments.

  • Change the definition of symbol_name to p.bslash + oneOf(list(tex2uni), asKeyword=True). Using asKeyword wraps the regex created by oneOf in \b markers, which I believe should be equivalent to the code in place now. This will allow you to drop the following FollowedBy/Regex lookahead and the wrapping Combine - fewer terms -> fewer function calls. I didn't see anywhere else that oneOf was combined with a following lookahead, but you could probably benefit from adding asKeyword on other oneOf's as well (function maybe?).

  • Given the size of this grammar, there might be some benefit in bumping the packrat cache size from the 128 default to 256.
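The first and third bullets can be sketched with a tiny grammar (assuming pyparsing is importable; the real mathtext expressions are of course much larger):

```python
from pyparsing import Forward, ParserElement, Word, nums

# A Forward is only needed for genuinely recursive rules; for a plain
# expression it just adds an extra layer of function calls per parse attempt.
number_fwd = Forward()
number_fwd <<= Word(nums)     # unnecessary indirection

number = Word(nums)           # same grammar, one fewer call level

# The packrat cache size can also be bumped from the 128 default:
ParserElement.enablePackrat(cache_size_limit=256)

number_fwd.parseString("42")  # both parse identically...
number.parseString("42")      # ...but this one skips the Forward hop
```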

Some things not to do:

  • Don't go through replacing punctuation expressions (like p.bslash) with string literals. Each of these would be converted to separate pyparsing Literals, and so would defeat the packrat cache's ability to detect the recurrence of that shared expression. I've attached a little demo script to show how this would affect the cache (uses pyparsing3's cache hit indicators in the debug output).

Just to set expectations: In general, I've found my pyparsing tweaking to have at best 10-30% improvement, no orders of magnitude boosts (other than when adding packrat parsing to a recursive grammar, especially one using infix_notation).

I've also attached PDF and HTML versions of a railroad diagram for this parser. From the generated regexes and expression names for the oneOf expressions, it is interesting to see the reordering that oneOf does to avoid masking of longer names by shorter names (like 'ggg' being tried before 'gg').
pyparsing_files.zip
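The reordering observation above can be illustrated with plain re, as a simplified stand-in for the regex that oneOf generates: without longest-first ordering, the regex engine takes the first alternative that matches, so a shorter name masks a longer one.

```python
import re

# Naive alternation order: 'gg' masks 'ggg' because the engine commits to
# the first alternative that matches at the current position.
naive = re.compile(r"\\(gg|ggg|g)")

# Longest-first order, as pyparsing's oneOf generates:
ordered = re.compile(r"\\(ggg|gg|g)")

naive.match(r"\ggg").group(1)     # 'gg'  (longer name masked)
ordered.match(r"\ggg").group(1)   # 'ggg' (matched correctly)
```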

@anntzer
Contributor

anntzer commented Oct 26, 2021

Thank you very much for the many detailed suggestions.

In the following, I use pytest lib/matplotlib/tests/test_mathtext.py -k 'mathtext_rendering and png' as a quick-and-dirty benchmark. This covers much more than just parsing, but it's easy enough to run. AFAICT, #21448 already speeds up the above command from ~11s to ~10s (consistently across >5 runs; the first run is always a bit slower, likely due to caching effects), which is quite significant given that parsing is itself only one part of the code.

I have a followup PR which gets rid of the unneeded Forward definitions, which shaves off another ~0.3s. I have also tried restoring the reusable Literals for backslashes and braces, but this doesn't actually yield any improvement. However, I guess the better way to handle most backslashes may be to fold them into the following term; e.g., change the definition of function from "\\" + oneOf(self._function_names)("name") to oneOf([rf"\{name}" for name in self._function_names])("name") (and strip the backslash in the parse action), which would remove one level from the tree.

As for the asKeyword suggestion, I also have some followup work along these lines (originally aimed at getting rid of the accentprefixed hack). Given that #21448 already improves performance, I'd rather have it go in first and re-adjust the backslashes later (it won't be a simple revert to the original state anyway).
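The proposed "fold the backslash into the term" rewrite can be sketched with a small illustrative subset (assuming pyparsing is importable; function_names here is a made-up stand-in for self._function_names):

```python
from pyparsing import Suppress, oneOf

function_names = ["sin", "cos", "log"]   # illustrative subset

# Before: two terms per match -- a backslash literal plus the alternation.
before = Suppress("\\") + oneOf(function_names)

# After: a single oneOf over backslash-prefixed names; a parse action
# strips the backslash, removing one level from the parse tree.
after = oneOf([rf"\{name}" for name in function_names])
after.addParseAction(lambda t: t[0][1:])

before.parseString(r"\sin")   # -> ['sin']
after.parseString(r"\sin")    # -> ['sin'], with one fewer nested match
```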

As a side point, it seems a pity that literal strings each generate their own Literal instance that is not shared. I guess in theory they are mutable (one could re-access the nodes and set different parse actions on them), but perhaps (just armchair-quarterbacking here) it may make sense to declare that literal strings generate immutable Literal instances and dedupe them?

@ptmcg

ptmcg commented Oct 26, 2021

Ouch! I misspoke. oneOf with asKeyword=True does not just wrap a regex in \b markers; it goes the slow-boat route and generates a MatchFirst of Keywords (or CaselessKeywords if caseless is also True). This will be much slower than creating a Regex, which is probably why mathtext.py was written the way it was.

I can see why I did not do this before: CaselessKeywords and CaselessLiterals don't just match caselessly, they also return the given string, not whatever was matched from the input. I have an implementation for this now; it will be included in 3.0.2. With all your asKeyword oneOf's, I think it will be some help.

@ptmcg

ptmcg commented Oct 27, 2021

3.0.2 has been pushed out, please try it, your oneOf's with asKeyword=True should be snappier now.

@anntzer
Contributor

anntzer commented Oct 29, 2021

Thanks for the suggestion, but I guess I'll probably just go for the good-old-regex route (discussion starting at #21454 (comment)). In any case, that'll probably wait until after 3.5 is released, to let things settle down.

@anntzer
Contributor

anntzer commented Oct 30, 2021

Also, AFAICS restoring explicitly reused Literals for braces and backslashes on top of #21448 is actually slower (by ~5% on #21448 (comment)).

@anntzer
Contributor

anntzer commented Nov 17, 2021

Notes to self:

@story645 story645 modified the milestones: unassigned, needs sorting Oct 6, 2022