
[MNT]: mathtext.MathTextParser is slow #20821

Open
HDembinski opened this issue Aug 10, 2021 · 14 comments

@HDembinski

Summary

The mathtext.MathTextParser is a performance bottleneck.

I profiled my interactive app, which uses PyQt and matplotlib to draw complex plots with very little text, and yet a significant share of time (20%) is spent in mathtext.MathTextParser. See
profile_graph.pdf

Proposed fix

The parser uses pyparsing internally and could potentially be sped up by switching to lark.

@tacaswell
Member

Effectively re-writing the mathtext module does not seem like a small project (I assume the output of the parsing step will be different enough that we will have to re-write the code that consumes it). It might be a good GSOC project?

A less invasive fix might be to add some strategic @functools.lru_cache. Judging by the function names in the PDF, it looks like we are already doing caching, but either the caching itself is still slow (maybe too many function calls?) or you are generating cache misses. It would be good to sort out which one of those it is.

If your text is not changing frequently you may also get a speed win (at the cost of an expensive first draw) by switching to usetex as we aggressively cache those results.
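The "strategic @functools.lru_cache" idea above can be sketched with the stdlib alone. Here parse_mathtext is a hypothetical stand-in for the expensive parser entry point, not matplotlib's actual function; the real change would wrap the real entry point instead:

```python
import functools

@functools.lru_cache(maxsize=128)
def parse_mathtext(s):
    # Hypothetical stand-in for an expensive mathtext parse; the real change
    # would decorate matplotlib's actual parser entry point instead.
    return s.count("\\")

parse_mathtext(r"$\alpha + \beta$")   # first call: computed, stored
parse_mathtext(r"$\alpha + \beta$")   # second call: served from the cache
info = parse_mathtext.cache_info()    # one hit, one miss
```

Note that this only helps if the same strings recur; dynamically generated labels would produce cache misses every time, which is exactly the distinction raised above.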

@HDembinski
Author

HDembinski commented Aug 30, 2021

It is probably not easy to replace the parser, true.

The hot-fix for my app was to remove all LaTeX, which produced a perceptible performance gain.

Some of the LaTeX was static, some was dynamically changing. I did not try to remove only the dynamic components which cannot be cached.

@tacaswell tacaswell added this to the unassigned milestone Aug 30, 2021
@ptmcg

ptmcg commented Oct 25, 2021

I'd be happy to work with a maintainer on possible performance improvements in this parser.

@tacaswell
Member

That would be very appreciated @ptmcg !

Our LaTeX parsing code is in https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/_mathtext.py, and the code for parsing font config strings is in https://github.com/matplotlib/matplotlib/blob/main/lib/matplotlib/fontconfig_pattern.py.

I suspect there are a lot more opportunities for optimization in the LaTeX parsing code, but the font config code is run more frequently (though I could be wrong about both of those).

@anntzer
Contributor

anntzer commented Oct 26, 2021

Thanks!
Note that there's a slight rewrite of the parser in #21448 as well (I have a similar change for fontconfig in the works), but I don't think it would change any of the internal structures used by pyparsing.

From some quick profiling, I guess it may help a bit to replace the custom caching mechanism (_parseCache) with lru_cache, although things are complicated by the presence of debugActions, which always get executed (so one would need to maintain the old path when debugActions are present, and only enable the lru_cache-based one in their absence?).

@ptmcg

ptmcg commented Oct 26, 2021

Is the custom caching even being used? I don't see enablePackrat being called anywhere.

@anntzer
Contributor

anntzer commented Oct 26, 2021

ParserElement.enablePackrat()
?

@ptmcg

ptmcg commented Oct 26, 2021

My mistake, I was looking at an extract from _mathtext.py that I was using to test in isolation, and had left out that line 🤦.

Replacing the custom cache in pyparsing with lru_cache is probably not going to be helpful, as lru_cache won't cache exceptions and pyparsing expressions raise exceptions far more often than they return matched tokens.
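The point about exceptions can be demonstrated with a small stdlib-only sketch (try_parse is an illustrative toy, not pyparsing's actual matching code): lru_cache stores results only when the function returns, so every failed match attempt would re-run in full.

```python
import functools

calls = 0

@functools.lru_cache(maxsize=None)
def try_parse(s):
    # Toy stand-in for a pyparsing match attempt that fails by raising.
    global calls
    calls += 1
    if not s.startswith("\\"):
        raise ValueError("expected a control sequence")
    return s[1:]

for _ in range(3):
    try:
        try_parse("alpha")    # raises every time: nothing is cached
    except ValueError:
        pass

try_parse("\\alpha")          # miss: computed and cached
try_parse("\\alpha")          # hit
# calls is now 4: three uncached failures plus one successful computation
```

Since pyparsing expressions fail (raise) far more often than they succeed, a cache that only remembers successes would miss exactly the hot path.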

I looked at the font config parser, and it is pretty simple. I don't see any obvious places for improvement.

In _mathtext.py there are a couple of places I would start.

  • Remove the unnecessary Forward declarations for a number of your expressions. While these wrappers have no effect on the parsing logic, they add a couple of extra levels of function calls, and function calls are performance killers in Python. Start with some of the most commonly used (like p.bslash), or the most frequently encountered (probably symbol and its component expressions). For experimentation, you could just replace the <<= assignment with = and run your tests. Once you've deForwarded the expressions you want, run your tests with -Wd:::pyparsing and let the warnings tell you where to clean up the now-unneeded Forward assignments.

  • Change the definition of symbol_name to p.bslash + oneOf(list(tex2uni), asKeyword=True). Using asKeyword wraps the regex created by oneOf in \b markers, which I believe should be equivalent to the code in place now. This will allow you to drop the following FollowedBy/Regex lookahead and the wrapping Combine - fewer terms -> fewer function calls. I didn't see anywhere else that oneOf was combined with a following lookahead, but you could probably benefit from adding asKeyword on other oneOf's as well (function maybe?).

  • Given the size of this grammar, there might be some benefit in bumping the packrat cache size from the 128 default to 256.
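The first and third bullets can be sketched with a tiny grammar (assuming pyparsing is importable; the real mathtext expressions are of course much larger):

```python
from pyparsing import Forward, ParserElement, Word, nums

# A Forward is only needed for genuinely recursive rules; for a plain
# expression it just adds an extra layer of function calls per parse attempt.
number_fwd = Forward()
number_fwd <<= Word(nums)     # unnecessary indirection

number = Word(nums)           # same grammar, one fewer call level

# The packrat cache size can also be bumped from the 128 default:
ParserElement.enablePackrat(cache_size_limit=256)

number_fwd.parseString("42")  # both parse identically...
number.parseString("42")      # ...but this one skips the Forward hop
```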

Some things not to do:

  • Don't go through replacing punctuation expressions (like p.bslash) with string literals. Each of these would be converted to separate pyparsing Literals, and so would defeat the packrat cache's ability to detect the recurrence of that shared expression. I've attached a little demo script to show how this would affect the cache (uses pyparsing3's cache hit indicators in the debug output).

Just to set expectations: In general, I've found my pyparsing tweaking to have at best 10-30% improvement, no orders of magnitude boosts (other than when adding packrat parsing to a recursive grammar, especially one using infix_notation).

I've also attached PDF and HTML versions of a railroad diagram for this parser. From the generated regexes and expression names for the oneOf expressions, it is interesting to see the reordering that oneOf does to avoid masking of longer names by shorter names (like 'ggg' being tried before 'gg').
pyparsing_files.zip
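The reordering observation above can be illustrated with plain re, as a simplified stand-in for the regex that oneOf generates: without longest-first ordering, the regex engine takes the first alternative that matches, so a shorter name masks a longer one.

```python
import re

# Naive alternation order: 'gg' masks 'ggg' because the engine commits to
# the first alternative that matches at the current position.
naive = re.compile(r"\\(gg|ggg|g)")

# Longest-first order, as pyparsing's oneOf generates:
ordered = re.compile(r"\\(ggg|gg|g)")

naive.match(r"\ggg").group(1)     # 'gg'  (longer name masked)
ordered.match(r"\ggg").group(1)   # 'ggg' (matched correctly)
```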

@anntzer
Contributor

anntzer commented Oct 26, 2021

Thank you very much for the many detailed suggestions.

In the following, I use pytest lib/matplotlib/tests/test_mathtext.py -k 'mathtext_rendering and png' as a quick-and-dirty benchmark. This covers much more than just parsing, but it's easy enough to run. AFAICT, #21448 already speeds up the above command from ~11s to ~10s (consistently across >5 runs; the first run is always a bit slower, likely due to caching effects), which is quite significant given that parsing is itself only one part of the code.

I have a followup PR which gets rid of the unneeded Forward definitions, which shaves off another ~0.3s. I have also tried restoring the reusable Literals for backslashes and braces, but this doesn't actually yield any improvement. However, I guess the better way to handle most backslashes may be to fold them into the following term; e.g., change the definition of function from "\\" + oneOf(self._function_names)("name") to oneOf([rf"\{name}" for name in self._function_names])("name") (and strip the backslash in the parse action), which would remove one level from the tree.

As for the asKeyword suggestion, I also have some followup work along these lines (originally aimed at getting rid of the accentprefixed hack). Given that #21448 already improves performance, I'd rather have it go in first and re-adjust the backslashes later (it won't be a simple revert to the original state anyway).
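The proposed "fold the backslash into the term" rewrite can be sketched with a small illustrative subset (assuming pyparsing is importable; function_names here is a made-up stand-in for self._function_names):

```python
from pyparsing import Suppress, oneOf

function_names = ["sin", "cos", "log"]   # illustrative subset

# Before: two terms per match -- a backslash literal plus the alternation.
before = Suppress("\\") + oneOf(function_names)

# After: a single oneOf over backslash-prefixed names; a parse action
# strips the backslash, removing one level from the parse tree.
after = oneOf([rf"\{name}" for name in function_names])
after.addParseAction(lambda t: t[0][1:])

before.parseString(r"\sin")   # -> ['sin']
after.parseString(r"\sin")    # -> ['sin'], with one fewer nested match
```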

As a side point, it seems a pity that literal strings each generate their own Literal instance that is not shared. I guess in theory they are mutable (one could re-access the nodes and set different parse actions on them), but perhaps (just armchair-quarterbacking here) it may make sense to declare that literal strings generate immutable Literal instances and dedupe them?

@ptmcg

ptmcg commented Oct 26, 2021

Ouch! I misspoke. oneOf with asKeyword=True does not just wrap a regex in \b markers; it goes the slow-boat route and generates a MatchFirst of Keywords (or CaselessKeywords if caseless is also True). This will be much slower than creating a Regex, which is probably why mathtext.py was written the way it was.

I can see why I did not do this before: CaselessKeywords and CaselessLiterals don't just match caselessly, they also return the given string, not whatever was matched from the input. I have an implementation for this now; it will be included in 3.0.2. With all your asKeyword oneOf's, I think it will be some help.

@ptmcg

ptmcg commented Oct 27, 2021

3.0.2 has been pushed out, please try it, your oneOf's with asKeyword=True should be snappier now.

@anntzer
Contributor

anntzer commented Oct 29, 2021

Thanks for the suggestion, but I guess I'll probably just go for the good-old-regex route (discussion starting at #21454 (comment)). In any case, that'll probably wait until after 3.5 is released, to let things settle down.

@anntzer
Contributor

anntzer commented Oct 30, 2021

Also, AFAICS restoring explicitly reused Literals for braces and backslashes on top of #21448 is actually slower (by ~5% on #21448 (comment)).

@anntzer
Contributor

anntzer commented Nov 17, 2021

Notes to self:

@story645 story645 modified the milestones: unassigned, needs sorting Oct 6, 2022