Inefficient keyword matching #11

@eivindjahren


The tokenizer is quite inefficient when the stream is positioned at the start of a tag key (or possibly at the end of the tag). Consider the state the tokenizer is in at the start of the third line of the following file:
"""
roff-asc
tag a
array int b 1 1
endtag
"""

The list of accepted keywords in this state is

  • endtag
  • char
  • byte
  • bool
  • int
  • float
  • double
  • array

If endtag is read, we should yield from the tokenize_tag generator; if array is read, we should yield from the tokenize_array_tagkey generator; and for the others, we should yield from the tokenize_simple_tagkey generator.

The current tokenizer is woefully inefficient here, as it invokes a seek for each failed match.
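To make the cost concrete, here is a hypothetical sketch of that strategy (an assumption for illustration, not the project's actual code): each candidate keyword is tried in turn, and the stream is seeked back after every failed match, so matching the last candidate costs a read and a seek per preceding keyword.

```python
import io

# Hypothetical sketch of the current strategy (assumed, not the actual
# implementation): try each accepted keyword in turn and seek the stream
# back after every failed match -- up to one seek per keyword.
def match_keyword_naive(stream, keywords):
    for kw in keywords:
        pos = stream.tell()
        if stream.read(len(kw)) == kw:
            return kw
        stream.seek(pos)  # rewind so the next candidate can be tried
    return None

keywords = ["endtag", "char", "byte", "bool", "int", "float", "double", "array"]
# Matching "array" only succeeds on the last candidate, after seven rewinds.
match_keyword_naive(io.StringIO("array int b 1 1"), keywords)
```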

Since these keywords are uniquely determined by their first two letters, we could create a much more efficient tokenizer for this state:

kw_lookup = {kw[:2]: kw for kw in ["endtag", "char", "byte", "bool", "int", "float", "double", "array"]}

def tokenize_endtag_or_tagkey(stream):
    # The first two characters uniquely identify the candidate keyword.
    kw_candidate = kw_lookup[stream.read(2)]
    if kw_candidate[2:] == stream.read(len(kw_candidate) - 2):
        kw = TokenKind.keyword[kw_candidate]
        yield Token(kw)
        if kw == TokenKind.ENDTAG:
            return
        elif kw == TokenKind.ARRAY:
            yield from tokenize_array_tagkey(stream)
        else:
            yield from tokenize_simple_tagkey(stream)
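A self-contained demonstration of the two-letter dispatch, with a stub TokenKind standing in for the project's real token machinery (the Enum lookup by name here is an assumption; the actual TokenKind mapping is not shown in this issue):

```python
import io
from enum import Enum, auto

# Stub for the project's TokenKind; only the dispatch logic is the point.
class TokenKind(Enum):
    ENDTAG = auto(); CHAR = auto(); BYTE = auto(); BOOL = auto()
    INT = auto(); FLOAT = auto(); DOUBLE = auto(); ARRAY = auto()

keywords = ["endtag", "char", "byte", "bool", "int", "float", "double", "array"]
kw_lookup = {kw[:2]: kw for kw in keywords}

def dispatch(stream):
    # One read of two characters selects the unique candidate;
    # one more read verifies the remaining characters. No seeks needed.
    kw_candidate = kw_lookup[stream.read(2)]
    if kw_candidate[2:] == stream.read(len(kw_candidate) - 2):
        return TokenKind[kw_candidate.upper()]
    return None

dispatch(io.StringIO("endtag\n"))      # TokenKind.ENDTAG in two reads
dispatch(io.StringIO("array int b 1 1"))  # TokenKind.ARRAY in two reads
```

Each state transition costs exactly two reads and zero seeks, regardless of which keyword appears.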
