
Improve memory usage in the contextual lexer ? #1539

@erezsh


Quite by chance, I realized today that there is a way to make the contextual lexer more memory-efficient.

Currently, the contextual lexer generates a basic lexer for each set of accepted tokens. That can result in hundreds of lexers, each compiling its own regexes.

But I came to realize that if one set of tokens is a subset of another, we can use the superset lexer for both. This rests on the assumption that if such a superset lexer exists, it implicitly validates that there are no conflicts between its tokens:
i.e. since those tokens were already used together, there are probably no conflicts, and using the superset lexer for a subset will simply leave some parts of the DFA unused.
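The sharing idea can be sketched as follows. This is a minimal, self-contained illustration of the technique, not Lark's actual code: `build_lexers` and `make_lexer` are hypothetical names, and the real implementation (shown in the appendix below) operates on parser states and terminal configurations rather than bare token sets.

```python
def make_lexer(tokens):
    # Stand-in for compiling a real lexer (with its regexes) from a token set.
    return ("lexer", tokens)

def build_lexers(token_sets):
    """Map each frozenset of tokens to a lexer, reusing superset lexers."""
    lexer_by_tokens = {}
    # Process larger sets first, so any superset lexer already exists
    # by the time its subsets are considered.
    for tokens in sorted(token_sets, key=len, reverse=True):
        key = frozenset(tokens)
        if key in lexer_by_tokens:
            continue
        # frozenset's `<` operator tests for a proper subset.
        superset = next((v for k, v in lexer_by_tokens.items() if key < k), None)
        if superset is not None:
            # Reuse the superset lexer; no new regexes are compiled.
            lexer_by_tokens[key] = superset
        else:
            lexer_by_tokens[key] = make_lexer(key)
    return lexer_by_tokens
```

For example, given the token sets `{A,B,C}`, `{A,B}`, `{B}`, and `{D}`, only two lexers are built: one for `{A,B,C}` (shared by `{A,B}` and `{B}`) and one for `{D}`.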

To test it, I implemented this in Lark. All the tests pass.

For the Python grammar, it reduces the number of lexers from 229 to 88.

Caveats:

  1. While all the tests pass, that doesn't guarantee backwards compatibility. Perhaps there are grammars out there that this change would break? (though I can't think of any)

  2. While it saves memory, it's unclear how noticeable the savings will be to users. A few quick measurements didn't reveal a significant difference in speed or memory footprint.

So, while this is an "easy win", given the risk of breaking compatibility and the very low impact on performance, maybe it's better to abandon this improvement?

Anyway, I just thought I'd document this idea for the future.


Appendix: this is the implementation (in lexer.py):

        states_list = list(states.items())
        states_list.sort(key=lambda x: len(x[1]), reverse=True)  # Sort by number of accepts, descending

        i = 0
        for state, accepts in states_list:
            key = frozenset(accepts)

            if key in lexer_by_tokens:
                continue

            # Check if we already have a "superset" lexer that accepts all of these tokens.
            # If we do, we can skip creating a new one to save space, based on assumption that if such lexer exists,
            # it implicitly validates that there are no conflicts between its tokens.
            # For the Python grammar, it reduces the number of lexers from 229 to 88.
            superset_lexer = next((v for k, v in lexer_by_tokens.items() if key < k), None)
            if superset_lexer is not None:
                # We already have a lexer that accepts all of these tokens, so we can skip creating a new one
                lexer_by_tokens[key] = superset_lexer
                continue

            accepts = set(accepts) | set(conf.ignore) | set(always_accept)
            lexer_conf = copy(trad_conf)
            lexer_conf.terminals = [terminals_by_name[n] for n in accepts if n in terminals_by_name]
            lexer = self.BasicLexer(lexer_conf, comparator)
            i += 1
            lexer_by_tokens[key] = lexer
