Merge runs of consecutive simple rules in RegexLexer into one regex #3155

Open
eendebakpt wants to merge 3 commits into pygments:master from eendebakpt:regexlexer-group-simple-rules

@eendebakpt

RegexLexer.get_tokens_unprocessed tries each rule's regex in turn at every position, so a state with N rules costs up to N re.match() calls per character. In practice the common tokens (whitespace, names, numbers, operators, …) sit in a long run of plain-token rules that are attempted on every identifier character.
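
To make the cost concrete, here is a minimal sketch of a first-match-wins rule loop that counts its `match()` calls. The rule table and the `lex_counting` helper are hypothetical and much smaller than any real lexer state; this is not the Pygments implementation.

```python
import re

# Hypothetical rule table: (compiled regex, token name), tried in order.
RULES = [
    (re.compile(r"\s+"), "Whitespace"),
    (re.compile(r"#.*"), "Comment"),
    (re.compile(r"\d+"), "Number"),
    (re.compile(r"[+\-*/=]"), "Operator"),
    (re.compile(r"[A-Za-z_]\w*"), "Name"),
]


def lex_counting(text):
    """Return (tokens, number of match() calls) for a first-match-wins loop."""
    pos, tokens, attempts = 0, [], 0
    while pos < len(text):
        for regex, token in RULES:
            attempts += 1
            m = regex.match(text, pos)
            if m:
                tokens.append((token, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched: emit one error character and move on.
            tokens.append(("Error", text[pos]))
            pos += 1
    return tokens, attempts


tokens, attempts = lex_counting("total = 42")
print(len(tokens), "tokens,", attempts, "match() calls")
```

In this toy table the Name rule sits last, so every identifier pays for all five rules before it matches.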

This merges maximal runs of consecutive simple rules in a state into one combined regex (?P<g0>r0)|(?P<g1>r1)|…. A rule is simple when it has a plain _TokenType action, no state transition, shared flags, and a foldable pattern (no named group, backreference, or global inline flag). Python's alternation tries its branches left to right and stops at the first one that matches, so the combined regex is exactly equivalent to trying those rules in order; non-simple rules stay in place as barriers, preserving every rule's relative order. The matched rule is recovered from the capturing-group index, so it works even when a rule has inner groups.
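
The idea can be sketched in a few lines. The rule list, `merge_rules`, and `lex_merged` below are hypothetical names for illustration, and the dispatch via `Match.lastindex` is one way to recover the matched rule, not necessarily how the patch does it.

```python
import re

# Hypothetical simple rules: (pattern, token name), in priority order.
RULES = [
    (r"\s+", "Whitespace"),
    (r"(\d+)(\.\d+)?", "Number"),  # contains inner groups on purpose
    (r"[A-Za-z_]\w*", "Name"),
    (r"[+\-*/=]", "Operator"),
]


def merge_rules(rules):
    """Build one alternation and map each wrapper group's index to its token."""
    parts, token_by_index, index = [], {}, 1
    for i, (pattern, token) in enumerate(rules):
        parts.append(f"(?P<g{i}>{pattern})")
        token_by_index[index] = token
        # Advance past the wrapper group and any groups inside the pattern.
        index += 1 + re.compile(pattern).groups
    return re.compile("|".join(parts)), token_by_index


def lex_merged(text, rules=RULES):
    merged, token_by_index = merge_rules(rules)
    pos, tokens = 0, []
    while pos < len(text):
        m = merged.match(text, pos)  # one call instead of one per rule
        if m is None:
            tokens.append(("Error", text[pos]))
            pos += 1
            continue
        # The wrapper group closes last, so lastindex is its index.
        tokens.append((token_by_index[m.lastindex], m.group()))
        pos = m.end()
    return tokens


print(lex_merged("x = 3.14"))
```

Because each wrapper group closes after any group nested inside it, `lastindex` points at the wrapper even for the Number rule, which has two inner groups.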

The transformation is output-preserving and on by default; set RegexLexer.merge_simple_rules = False to disable it. For PythonLexer it shrinks the root state from 56 entries to 32, about 43% fewer regexes to try per position in the worst case.
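
How merging runs shrinks a state's entry list can be illustrated with a toy count. The flag sequence below is made up and does not reflect PythonLexer's actual rules; `count_entries` is a hypothetical helper.

```python
from itertools import groupby


def count_entries(simple_flags):
    """Count dispatch entries once each maximal run of simple rules is merged."""
    entries = 0
    for is_simple, run in groupby(simple_flags):
        run_length = len(list(run))
        # A run of simple rules collapses to one entry; barriers stay as they are.
        entries += 1 if is_simple else run_length
    return entries


# Hypothetical state: S = simple rule, B = barrier (non-simple) rule.
flags = [c == "S" for c in "SSSBSSBBSSSS"]
print(len(flags), "->", count_entries(flags))
```

The saving depends on how the barriers are spread out: many short runs between barriers save little, while a few long runs save a lot.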

Benchmark

pyperf compare_to, lexing a representative multi-screen source file with each lexer (script below):

| lexer | before | after | speedup |
|---|---|---|---|
| python | 5.72 ms | 4.52 ms | 1.27x |
| javascript | 1.96 ms | 1.35 ms | 1.46x |
| c | 5.12 ms | 3.59 ms | 1.43x |
| ruby | 3.29 ms | 2.56 ms | 1.29x |
| geometric mean | | | 1.36x |
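
The summary row can be reproduced from the per-lexer speedups listed above. This is a small check using the rounded values from the table, so it only confirms the figure to two decimals.

```python
from statistics import geometric_mean

# Speedups copied from the benchmark table (rounded to two decimals).
speedups = {"python": 1.27, "javascript": 1.46, "c": 1.43, "ruby": 1.29}

mean_speedup = geometric_mean(list(speedups.values()))
print(f"{mean_speedup:.2f}x")
```
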
bench_lexers.py
"""pyperf benchmark for RegexLexer lexing throughput.

    python bench_lexers.py -o base.json     # on master
    python bench_lexers.py -o patch.json    # on the branch
    python -m pyperf compare_to base.json patch.json
"""
import pyperf
from pygments.lexers import PythonLexer, JavascriptLexer, CLexer, RubyLexer

PYTHON = '''import os, sys
from collections import defaultdict


class Cache(dict):
    """A small bounded cache."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._hits = 0

    def get_or_set(self, key, factory):
        if key in self:
            self._hits += 1
            return self[key]
        value = self[key] = factory(key)
        return value


def main(argv=None):
    data = defaultdict(list)
    for i, line in enumerate(sys.stdin):
        parts = line.strip().split(",")
        if not parts or parts[0].startswith("#"):
            continue
        data[parts[0]].append(float(parts[1]) * 2 + 1)
    total = sum(v for vs in data.values() for v in vs)
    print(f"{total=:.3f} over {len(data)} keys")
    return 0
'''

JAVASCRIPT = '''const cache = new Map();

function memoize(fn) {
  return function (...args) {
    const key = JSON.stringify(args);
    if (cache.has(key)) return cache.get(key);
    const result = fn.apply(this, args);
    cache.set(key, result);
    return result;
  };
}

const fib = memoize((n) => (n < 2 ? n : fib(n - 1) + fib(n - 2)));
console.log(`fib(20) = ${fib(20)}`);
'''

C = '''#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node { int value; struct node *next; } node_t;

static node_t *push(node_t *head, int v) {
    node_t *n = malloc(sizeof(node_t));
    if (!n) { perror("malloc"); exit(1); }
    n->value = v; n->next = head;
    return n;
}

int main(int argc, char **argv) {
    node_t *head = NULL;
    for (int i = 0; i < argc; i++)
        head = push(head, (int)strtol(argv[i], NULL, 10) + 0xFF);
    for (node_t *p = head; p; p = p->next) printf("%d\\n", p->value);
    return 0;
}
'''

RUBY = '''require "json"

class Stack
  def initialize
    @items = []
  end

  def push(x)
    @items << x
    self
  end

  def pop
    @items.pop
  end

  def empty?
    @items.empty?
  end
end

s = Stack.new
[1, 2, 3].each { |n| s.push(n * 2) }
puts s.pop until s.empty?
'''

# Repeat so each measured op lexes a realistic multi-screen file.
CASES = [
    ("python", PythonLexer(), PYTHON * 4),
    ("javascript", JavascriptLexer(), JAVASCRIPT * 4),
    ("c", CLexer(), C * 4),
    ("ruby", RubyLexer(), RUBY * 4),
]


def make(lexer, code):
    def run():
        for _ in lexer.get_tokens(code):
            pass
    return run


if __name__ == "__main__":
    runner = pyperf.Runner()
    for name, lexer, code in CASES:
        runner.bench_func(f"lex {name}", make(lexer, code))

Impact on IPython

This optimization came out of an investigation into the latency of the python/ipython REPL. IPython highlights its input prompt with PygmentsLexer(Python3Lexer), re-lexing the visible buffer on every keystroke (prompt_toolkit keeps no cross-keystroke token cache). This change makes that per-keystroke lexing about 1.27x faster with identical highlighting, so typing latency in the terminal REPL drops accordingly, most noticeably while editing larger multi-line cells, where lexing dominates the per-keystroke work.

eendebakpt and others added 3 commits June 10, 2026 19:49
RegexLexer.get_tokens_unprocessed tries each rule's compiled regex in turn at
every position, so a state with N rules can cost up to N match() calls per
character.  For most lexers the common tokens (whitespace, names, numbers,
operators, ...) sit in a long run of plain-token rules that are all attempted on
every identifier character.

Merge maximal runs of *consecutive* "simple" rules -- a plain _TokenType action,
no state transition, shared flags, and a foldable pattern (no named group,
backreference, or global inline flag) -- into a single combined regex
``(?P<g0>r0)|(?P<g1>r1)|...``.  Python's alternation is leftmost-match, so this
is exactly equivalent to trying those rules in order; non-simple rules stay in
place as barriers, preserving every rule's relative order.  Dispatch to the
matched rule's token via the capturing-group index, which is robust even when a
rule has inner groups.  The transformation is output-preserving and on by
default; set ``RegexLexer.merge_simple_rules = False`` to disable.

This cuts the number of per-position match attempts by about 43% in
PythonLexer's root state (56 -> 32 entries) for ~1.2x faster lexing, with no
change to the emitted token stream (verified against the full test suite and a
new parity test across bundled lexers).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
@birkenfeld

Member

Thanks for the PR, this is very interesting! Reducing the number of match calls seems like an easy win.

I probably won't be able to review very soon, but I'll get to it (or maybe @Anteru will, of course).
