Merge runs of consecutive simple rules in RegexLexer into one regex #3155
Open
eendebakpt wants to merge 3 commits into
RegexLexer.get_tokens_unprocessed tries each rule's compiled regex in turn at every position, so a state with N rules can cost up to N match() calls per character. For most lexers the common tokens (whitespace, names, numbers, operators, ...) sit in a long run of plain-token rules that are all attempted on every identifier character.

Merge maximal runs of *consecutive* "simple" rules -- a plain _TokenType action, no state transition, shared flags, and a foldable pattern (no named group, backreference, or global inline flag) -- into a single combined regex ``(?P<g0>r0)|(?P<g1>r1)|...``. Python's alternation is leftmost-match, so this is exactly equivalent to trying those rules in order; non-simple rules stay in place as barriers, preserving every rule's relative order. Dispatch to the matched rule's token via the capturing-group index, which is robust even when a rule has inner groups.

The transformation is output-preserving and on by default; set ``RegexLexer.merge_simple_rules = False`` to disable. This roughly halves the number of per-position match attempts (PythonLexer's root state: 56 -> 32 entries) for ~1.2x faster lexing, with no change to the emitted token stream (verified against the full test suite and a new parity test across bundled lexers).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
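The "runs with barriers" idea from the commit message can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: the `(pattern, token, new_state)` rule shape, the names `RULES`, `is_simple` and `merge_runs`, and the text-based foldability check are all assumptions made for this example, and the shared-flags condition is left out for brevity.

```python
import re
from itertools import groupby

# Hypothetical rule shape for this sketch: (pattern, token, new_state).
# new_state is None when the rule performs no state transition.
RULES = [
    (r"\s+", "Whitespace", None),
    (r"[A-Za-z_]\w*", "Name", None),
    (r"\d+", "Number", None),
    (r'"', "String", "string"),              # state transition -> barrier
    (r"[-+*/=]", "Operator", None),
    (r"(?P<open>\()", "Punctuation", None),  # named group -> barrier
    (r"[)\],;]", "Punctuation", None),
]

# Conservative textual checks (an assumption of this sketch): a false
# positive only means a rule is left unmerged, never merged wrongly.
_UNFOLDABLE = re.compile(
    r"\(\?P[<=]"          # named group or named backreference
    r"|\\[1-9]"           # numeric backreference
    r"|\(\?\("            # conditional on a group
    r"|\(\?[aiLmsux]+\)"  # global inline flag such as (?i)
)


def is_simple(rule):
    """A rule is mergeable if it has no state transition and a foldable pattern."""
    pattern, _token, new_state = rule
    return new_state is None and not _UNFOLDABLE.search(pattern)


def merge_runs(rules):
    """Group rules into entries; a multi-rule entry stands for one merged run."""
    entries = []
    for simple, run in groupby(rules, key=is_simple):
        run = list(run)
        if simple and len(run) > 1:
            entries.append(run)               # one combined entry
        else:
            entries.extend([r] for r in run)  # barriers and lone rules stay as-is
    return entries


entries = merge_runs(RULES)
print(len(RULES), "->", len(entries))  # 7 -> 5
```

Only the first three rules form a run of more than one simple rule, so they collapse into a single entry; every other rule keeps its position, which is what preserves the relative order of all rules.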
Member
Thanks for the PR, this is very interesting! Reducing the number of match calls seems like an easy win. I probably won't be able to review very soon, but I'll get to it (or maybe @Anteru will, of course).
RegexLexer.get_tokens_unprocessed tries each rule's regex in turn at every position, so a state with N rules costs up to N re.match() calls per character. In practice the common tokens (whitespace, names, numbers, operators, …) sit in a long run of plain-token rules that are attempted on every identifier character.

This merges maximal runs of consecutive simple rules in a state into one combined regex (?P<g0>r0)|(?P<g1>r1)|…. A rule is simple when it has a plain _TokenType action, no state transition, shared flags, and a foldable pattern (no named group, backreference, or global inline flag). Python's alternation tries its branches left to right and takes the first one that matches, so the combined regex is exactly equivalent to trying those rules in order; non-simple rules stay in place as barriers, preserving every rule's relative order. The matched rule is recovered from the capturing-group index, so it works even when a rule has inner groups.
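The equivalence claim and the group-index dispatch can be checked on a toy rule set. The sketch below is an illustration under assumptions, not the code from this PR: `SIMPLE_RULES`, `lex_merged` and `lex_sequential` are hypothetical names, tokens are plain strings instead of _TokenType objects, and the dispatch uses `Match.lastindex` together with `Pattern.groupindex` from the standard `re` module.

```python
import re

# Hypothetical simple rules (pattern, token). The second pattern contains an
# inner capturing group, which shifts the numeric index of later groups.
SIMPLE_RULES = [
    (r"\s+", "Whitespace"),
    (r"(0x)[0-9a-fA-F]+", "Number.Hex"),
    (r"\d+", "Number"),
    (r"[A-Za-z_]\w*", "Name"),
]

# Wrap each rule in its own named group and join with ordered alternation.
combined = re.compile(
    "|".join(f"(?P<g{i}>{p})" for i, (p, _) in enumerate(SIMPLE_RULES))
)

# Map each wrapper group's numeric index to its token.
token_by_index = {
    combined.groupindex[f"g{i}"]: tok for i, (_, tok) in enumerate(SIMPLE_RULES)
}


def lex_merged(text):
    """One match() call per position; dispatch on the outermost matched group."""
    pos, out = 0, []
    while pos < len(text):
        m = combined.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at position {pos}")
        # The wrapper group closes last, so lastindex names it even when
        # the rule has inner groups.
        out.append((token_by_index[m.lastindex], m.group()))
        pos = m.end()
    return out


def lex_sequential(text):
    """Reference behaviour: try each rule's regex in order at every position."""
    compiled = [(re.compile(p), tok) for p, tok in SIMPLE_RULES]
    pos, out = 0, []
    while pos < len(text):
        for rx, tok in compiled:
            m = rx.match(text, pos)
            if m:
                out.append((tok, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return out


sample = "x 0x1F 42 y1"
assert lex_merged(sample) == lex_sequential(sample)
```

On this input both functions produce the same token stream, with `0x1F` reported as Number.Hex even though the match also sets an inner group. Backreferences are excluded from merging precisely because wrapping patterns this way renumbers their groups.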
The transformation is output-preserving and on by default; set RegexLexer.merge_simple_rules = False to disable it. For PythonLexer it nearly halves the per-position match attempts in the root state (56 → 32 entries).

Benchmark

pyperf compare_to, lexing a representative multi-screen source file with each lexer (script below):
bench_lexers.py
Impact on IPython
This optimization came out of an investigation into the latency of the Python/IPython REPL. IPython highlights its input prompt with PygmentsLexer(Python3Lexer), re-lexing the visible buffer on every keystroke (prompt_toolkit keeps no cross-keystroke token cache). This change makes that per-keystroke lexing ~1.27x faster with identical highlighting, so typing latency in the terminal REPL improves accordingly, most noticeably while editing larger multi-line cells, where lexing dominates the per-keystroke work.