eendebakpt · 2026-06-10T21:43:26Z

RegexLexer.get_tokens_unprocessed tries each rule's regex in turn at every position, so a state with N rules costs up to N re.match() calls per character. In practice the common tokens (whitespace, names, numbers, operators, …) sit in a long run of plain-token rules that are attempted on every identifier character.
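
To make the per-position cost concrete, here is a simplified stand-in for the matching loop. It is a sketch, not Pygments' actual code: the rule list, the token names, and the `lex_naive` helper are hypothetical, and state handling is omitted entirely.

```python
import re

# Hypothetical rule list for a single state: (pattern, token name).
RULES = [
    (r"\s+", "Whitespace"),
    (r"[0-9]+", "Number"),
    (r"[A-Za-z_]\w*", "Name"),
    (r"[-+*/=]", "Operator"),
]
COMPILED = [(re.compile(p).match, tok) for p, tok in RULES]


def lex_naive(text):
    """Try each rule in order at every position and count match() calls."""
    pos, attempts, tokens = 0, 0, []
    while pos < len(text):
        for match, tok in COMPILED:
            attempts += 1
            m = match(text, pos)
            if m:
                tokens.append((pos, tok, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched: emit the character as an error and move on.
            tokens.append((pos, "Error", text[pos]))
            pos += 1
    return tokens, attempts


tokens, attempts = lex_naive("x = 42")
print(attempts, tokens)
```

Rules late in the list pay for every failed attempt before them, which is why a long run of plain-token rules is expensive.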

This merges maximal runs of consecutive simple rules in a state into one combined regex (?P<g0>r0)|(?P<g1>r1)|…. A rule is simple when it has a plain _TokenType action, no state transition, shared flags, and a foldable pattern (no named group, backreference, or global inline flag). Python's alternation is leftmost-match, so the combined regex is exactly equivalent to trying those rules in order; non-simple rules stay in place as barriers, preserving every rule's relative order. The matched rule is recovered from the capturing-group index, so it works even when a rule has inner groups.
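
The merging step can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `merge_rules`, `lex_merged`, and the rule list are hypothetical names, all rules are assumed to be "simple", and using `Match.lastindex` is one possible way to recover the matched rule (the wrapper group closes last, so `lastindex` reports it even when the rule has inner groups).

```python
import re

# Hypothetical "simple" rules: (pattern, token name). The second pattern
# has inner capturing groups to show why index bookkeeping is needed.
RULES = [
    (r"\s+", "Whitespace"),
    (r"([0-9]+)(\.[0-9]+)?", "Number"),
    (r"[A-Za-z_]\w*", "Name"),
    (r"[-+*/=]", "Operator"),
]


def merge_rules(rules):
    """Combine rules into one alternation; map wrapper group index -> token."""
    parts, index_to_token, next_index = [], {}, 1
    for i, (pattern, token) in enumerate(rules):
        parts.append(f"(?P<g{i}>{pattern})")
        index_to_token[next_index] = token
        # Skip past this rule's wrapper group plus its own inner groups.
        next_index += 1 + re.compile(pattern).groups
    return re.compile("|".join(parts)), index_to_token


def lex_merged(text, rules=RULES):
    """Lex with a single match() call per position."""
    combined, index_to_token = merge_rules(rules)
    pos, tokens = 0, []
    while pos < len(text):
        m = combined.match(text, pos)
        if m is None:
            tokens.append((pos, "Error", text[pos]))
            pos += 1
            continue
        # The wrapper group is the last group to close, so lastindex is
        # the wrapper's index rather than an inner group's.
        tokens.append((pos, index_to_token[m.lastindex], m.group()))
        pos = m.end()
    return tokens


print(lex_merged("x = 3.14"))
```

Because alternation takes the first alternative that matches rather than the longest, the rule order inside the combined pattern has to be the original rule order.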

The transformation is output-preserving and on by default; set RegexLexer.merge_simple_rules = False to disable it. For PythonLexer it cuts the worst-case number of per-position match attempts in the root state by about 43% (56 → 32 entries).

Benchmark

pyperf compare_to, lexing a representative multi-screen source file with each
lexer (script below):

lexer	before	after	speedup
python	5.72 ms	4.52 ms	1.27x
javascript	1.96 ms	1.35 ms	1.46x
c	5.12 ms	3.59 ms	1.43x
ruby	3.29 ms	2.56 ms	1.29x
geometric mean			1.36x
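
The summary row can be checked from the speedup column. A minimal sketch, assuming the reported (rounded) speedups are the inputs; pyperf itself works from the unrounded timings, so recomputing from the rounded before/after columns may differ in the last digit.

```python
from statistics import geometric_mean

# Speedup column from the table above, as reported.
speedups = {"python": 1.27, "javascript": 1.46, "c": 1.43, "ruby": 1.29}

overall = geometric_mean(list(speedups.values()))
print(f"geometric mean: {overall:.2f}x")
```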

bench_lexers.py

"""pyperf benchmark for RegexLexer lexing throughput.

    python bench_lexers.py -o base.json     # on master
    python bench_lexers.py -o patch.json    # on the branch
    python -m pyperf compare_to base.json patch.json
"""
import pyperf
from pygments.lexers import PythonLexer, JavascriptLexer, CLexer, RubyLexer

PYTHON = '''import os, sys
from collections import defaultdict


class Cache(dict):
    """A small bounded cache."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._hits = 0

    def get_or_set(self, key, factory):
        if key in self:
            self._hits += 1
            return self[key]
        value = self[key] = factory(key)
        return value


def main(argv=None):
    data = defaultdict(list)
    for i, line in enumerate(sys.stdin):
        parts = line.strip().split(",")
        if not parts or parts[0].startswith("#"):
            continue
        data[parts[0]].append(float(parts[1]) * 2 + 1)
    total = sum(v for vs in data.values() for v in vs)
    print(f"{total=:.3f} over {len(data)} keys")
    return 0
'''

JAVASCRIPT = '''const cache = new Map();

function memoize(fn) {
  return function (...args) {
    const key = JSON.stringify(args);
    if (cache.has(key)) return cache.get(key);
    const result = fn.apply(this, args);
    cache.set(key, result);
    return result;
  };
}

const fib = memoize((n) => (n < 2 ? n : fib(n - 1) + fib(n - 2)));
console.log(`fib(20) = ${fib(20)}`);
'''

C = '''#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node { int value; struct node *next; } node_t;

static node_t *push(node_t *head, int v) {
    node_t *n = malloc(sizeof(node_t));
    if (!n) { perror("malloc"); exit(1); }
    n->value = v; n->next = head;
    return n;
}

int main(int argc, char **argv) {
    node_t *head = NULL;
    for (int i = 0; i < argc; i++)
        head = push(head, (int)strtol(argv[i], NULL, 10) + 0xFF);
    for (node_t *p = head; p; p = p->next) printf("%d\\n", p->value);
    return 0;
}
'''

RUBY = '''require "json"

class Stack
  def initialize
    @items = []
  end

  def push(x)
    @items << x
    self
  end

  def pop
    @items.pop
  end

  def empty?
    @items.empty?
  end
end

s = Stack.new
[1, 2, 3].each { |n| s.push(n * 2) }
puts s.pop until s.empty?
'''

# Repeat so each measured op lexes a realistic multi-screen file.
CASES = [
    ("python", PythonLexer(), PYTHON * 4),
    ("javascript", JavascriptLexer(), JAVASCRIPT * 4),
    ("c", CLexer(), C * 4),
    ("ruby", RubyLexer(), RUBY * 4),
]


def make(lexer, code):
    def run():
        for _ in lexer.get_tokens(code):
            pass
    return run


if __name__ == "__main__":
    runner = pyperf.Runner()
    for name, lexer, code in CASES:
        runner.bench_func(f"lex {name}", make(lexer, code))

Impact on IPython

This optimization came out of an investigation into the latency of the Python/IPython REPL. IPython highlights its input prompt with PygmentsLexer(Python3Lexer), re-lexing the visible buffer on every keystroke (prompt_toolkit keeps no cross-keystroke token cache). This change makes that per-keystroke lexing about 1.27x faster with identical highlighting, so the lexing share of typing latency in the terminal REPL shrinks accordingly, most noticeably while editing larger multi-line cells, where lexing dominates the per-keystroke work.

RegexLexer.get_tokens_unprocessed tries each rule's compiled regex in turn at every position, so a state with N rules can cost up to N match() calls per character. For most lexers the common tokens (whitespace, names, numbers, operators, ...) sit in a long run of plain-token rules that are all attempted on every identifier character.

Merge maximal runs of *consecutive* "simple" rules -- a plain _TokenType action, no state transition, shared flags, and a foldable pattern (no named group, backreference, or global inline flag) -- into a single combined regex ``(?P<g0>r0)|(?P<g1>r1)|...``. Python's alternation is leftmost-match, so this is exactly equivalent to trying those rules in order; non-simple rules stay in place as barriers, preserving every rule's relative order. Dispatch to the matched rule's token via the capturing-group index, which is robust even when a rule has inner groups.

The transformation is output-preserving and on by default; set ``RegexLexer.merge_simple_rules = False`` to disable. This roughly halves the number of per-position match attempts (PythonLexer's root state: 56 -> 32 entries) for ~1.2x faster lexing, with no change to the emitted token stream (verified against the full test suite and a new parity test across bundled lexers).

Co-Authored-By: Claude Opus 4.8 <[email protected]>

birkenfeld · 2026-06-11T13:18:22Z

Thanks for the PR, this is very interesting! Reducing the number of match calls seems like an easy win.

I probably won't be able to review very soon, but I'll get to it (or maybe @Anteru will, of course).

eendebakpt and others added 3 commits June 10, 2026 19:49

Add CHANGES and AUTHORS entries for the RegexLexer optimization

0fc2e1e

Co-Authored-By: Claude Opus 4.8 <[email protected]>

Remove unused imports in test_merge_simple_rules (ruff F401)

b48c1a4

Co-Authored-By: Claude Opus 4.8 <[email protected]>

Merge runs of consecutive simple rules in RegexLexer into one regex #3155

eendebakpt wants to merge 3 commits into pygments:master from eendebakpt:regexlexer-group-simple-rules
