
Improve memory usage in the contextual lexer ? #1539

@erezsh


Quite by chance, I realized today that there is a way to make the contextual lexer more memory-efficient.

Currently, the contextual lexer generates a basic lexer for each set of accepted tokens. That can result in hundreds of lexers, each compiling its own regexes.

But I came to realize that if one set of tokens is a subset of another, we can use the superset lexer for both. This rests on the assumption that if such a superset lexer exists, it implicitly validates that there are no conflicts between its tokens:
i.e. since those tokens were already used together, there are probably no conflicts, and using the superset lexer for a subset will simply leave some parts of the DFA unused.
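The sharing idea can be sketched as follows. This is a minimal, self-contained illustration of the technique, not Lark's actual code: `build_lexers` and `make_lexer` are hypothetical names, and the real implementation (shown in the appendix below) operates on parser states and terminal configurations rather than bare token sets.

```python
def make_lexer(tokens):
    # Stand-in for compiling a real lexer (with its regexes) from a token set.
    return ("lexer", tokens)

def build_lexers(token_sets):
    """Map each frozenset of tokens to a lexer, reusing superset lexers."""
    lexer_by_tokens = {}
    # Process larger sets first, so any superset lexer already exists
    # by the time its subsets are considered.
    for tokens in sorted(token_sets, key=len, reverse=True):
        key = frozenset(tokens)
        if key in lexer_by_tokens:
            continue
        # frozenset's `<` operator tests for a proper subset.
        superset = next((v for k, v in lexer_by_tokens.items() if key < k), None)
        if superset is not None:
            # Reuse the superset lexer; no new regexes are compiled.
            lexer_by_tokens[key] = superset
        else:
            lexer_by_tokens[key] = make_lexer(key)
    return lexer_by_tokens
```

For example, given the token sets `{A,B,C}`, `{A,B}`, `{B}`, and `{D}`, only two lexers are built: one for `{A,B,C}` (shared by `{A,B}` and `{B}`) and one for `{D}`.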

To test it, I implemented this in Lark. All the tests pass.

For the Python grammar, it reduces the number of lexers from 229 to 88.

Caveats:

  1. While all the tests pass, that doesn't guarantee backwards compatibility. Perhaps there are grammars out there that this change would break? (though I can't think of any)

  2. While it saves memory, it's unclear how noticeable the savings will be to users. A few quick measurements didn't reveal a significant difference in speed or memory footprint.

So, while this is an "easy win", given the risk of breaking compatibility and the very low impact on performance, maybe it's better to abandon this improvement?

Anyway, I just thought I'd document this idea for the future.


Appendix: this is the implementation (in lexer.py):

        states_list = list(states.items())
        states_list.sort(key=lambda x: len(x[1]), reverse=True)  # Sort by number of accepts, descending

        i = 0
        for state, accepts in states_list:
            key = frozenset(accepts)

            if key in lexer_by_tokens:
                continue

            # Check if we already have a "superset" lexer that accepts all of these tokens.
            # If we do, we can skip creating a new one to save space, based on assumption that if such lexer exists,
            # it implicitly validates that there are no conflicts between its tokens.
            # For the Python grammar, it reduces the number of lexers from 229 to 88.
            superset_lexer = next((v for k, v in lexer_by_tokens.items() if key < k), None)
            if superset_lexer is not None:
                # We already have a lexer that accepts all of these tokens, so we can skip creating a new one
                lexer_by_tokens[key] = superset_lexer
                continue

            accepts = set(accepts) | set(conf.ignore) | set(always_accept)
            lexer_conf = copy(trad_conf)
            lexer_conf.terminals = [terminals_by_name[n] for n in accepts if n in terminals_by_name]
            lexer = self.BasicLexer(lexer_conf, comparator)
            i += 1
            lexer_by_tokens[key] = lexer
