The tokenizer is quite inefficient when the stream is positioned at the start of a tag key (or possibly at the end of a tag). The tokenizer state in question is the one it is in at the start of the third line of the following file:
"""
roff-asc
tag a
array int b 1 1
endtag
"""
The list of keywords accepted in this state is: endtag, char, byte, bool, int, float, double, array.
If endtag is read we should yield from the tokenize_tag generator; if array is read, from the tokenize_array_tagkey generator; and for the other keywords, from the tokenize_simple_tagkey generator.
The current tokenizer is woefully inefficient here as it will invoke a seek for each failed match.
Since these keywords are unique in their first two letters, we could create a much more efficient tokenizer for this state:
kw_lookup = {kw[:2]: kw for kw in ["endtag", "char", "byte", "bool", "int", "float", "double", "array"]}

def tokenize_endtag_or_tagkey(stream):
    kw_candidate = kw_lookup[stream.read(2)]
    if kw_candidate[2:] == stream.read(len(kw_candidate) - 2):
        kw = TokenKind.keyword[kw_candidate]
        yield Token(kw)
        if kw == TokenKind.ENDTAG:
            return
        elif kw == TokenKind.ARRAY:
            yield from tokenize_array_tagkey(stream)
        else:
            yield from tokenize_simple_tagkey(stream)
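To make the idea concrete, here is a self-contained, runnable sketch of the two-letter dispatch. The TokenKind enum and the two downstream generators are hypothetical stand-ins for the real types in the codebase, kept minimal just to show the single-read dispatch; they are not the actual implementations.

```python
import io
from enum import Enum, auto

# Hypothetical stand-in for the codebase's TokenKind type.
class TokenKind(Enum):
    ENDTAG = auto()
    CHAR = auto()
    BYTE = auto()
    BOOL = auto()
    INT = auto()
    FLOAT = auto()
    DOUBLE = auto()
    ARRAY = auto()

KEYWORDS = ["endtag", "char", "byte", "bool", "int", "float", "double", "array"]
# The two-letter prefixes are all distinct, so a single 2-character read
# uniquely identifies the candidate keyword -- no per-keyword seek/retry.
kw_lookup = {kw[:2]: kw for kw in KEYWORDS}

# Stub downstream generators (stand-ins for the real ones in the issue).
def tokenize_array_tagkey(stream):
    yield ("array-body", stream.read())

def tokenize_simple_tagkey(stream):
    yield ("simple-body", stream.read())

def tokenize_endtag_or_tagkey(stream):
    # One 2-char read selects the candidate keyword...
    kw_candidate = kw_lookup[stream.read(2)]
    # ...and one more read verifies the remaining characters.
    if kw_candidate[2:] == stream.read(len(kw_candidate) - 2):
        kw = TokenKind[kw_candidate.upper()]
        yield kw
        if kw is TokenKind.ENDTAG:
            return
        elif kw is TokenKind.ARRAY:
            yield from tokenize_array_tagkey(stream)
        else:
            yield from tokenize_simple_tagkey(stream)

# Usage: a stream positioned at the start of a tag key.
tokens = list(tokenize_endtag_or_tagkey(io.StringIO("array int b 1 1")))
print(tokens)  # ARRAY keyword token, then the array tag-key tokens
```

Each accepted keyword costs exactly two reads here, regardless of how many keywords are in the table, versus one seek-and-retry per failed match in the current tokenizer.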