Description
vscode-textmate uses oniguruma (and soon onigasm) to tokenize source code. It works line by line, calling 'findNextMatch' with an increasing offset and a varying set of regexp-patterns until all tokens of the line have been evaluated.
The set of regexp-patterns can be different for each find, based on the state of the tokenizer. The tokenizer is a state machine where the previously found tokens define the state the tokenizer is in.
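The loop described above can be sketched roughly as follows. This is only an illustration, not the vscode-textmate implementation: JavaScript `RegExp` stands in for oniguruma, and `Rule`, `tokenizeLine`, the `grammar` table and this `findNextMatch` helper are hypothetical names made up for the example.

```typescript
// Hypothetical types; a real grammar is far richer than this.
interface Rule { name: string; regex: RegExp; nextState?: string }
interface Token { start: number; end: number; rule: string }
interface Match { rule: Rule; index: number; length: number }

// Return the earliest match among all patterns of the set, at or after `offset`.
function findNextMatch(rules: Rule[], line: string, offset: number): Match | null {
  let best: Match | null = null;
  for (const rule of rules) {
    const re = new RegExp(rule.regex.source, "g");
    re.lastIndex = offset; // start searching at the current offset
    const m = re.exec(line);
    if (m && (best === null || m.index < best.index)) {
      best = { rule, index: m.index, length: m[0].length };
    }
  }
  return best;
}

// Tokenize one line; the current state selects which pattern-set is searched.
function tokenizeLine(grammar: Record<string, Rule[]>, line: string, state = "base"): Token[] {
  const tokens: Token[] = [];
  let offset = 0;
  while (offset <= line.length) {
    const match = findNextMatch(grammar[state], line, offset);
    if (match === null) break;
    tokens.push({ start: match.index, end: match.index + match.length, rule: match.rule.name });
    state = match.rule.nextState ?? state; // the found token defines the next state
    offset = match.index + Math.max(match.length, 1); // always advance
  }
  return tokens;
}

const grammar: Record<string, Rule[]> = {
  base: [
    { name: "string.begin", regex: /"/, nextState: "string" },
    { name: "number", regex: /\d+/ },
  ],
  string: [{ name: "string.end", regex: /"/, nextState: "base" }],
};

const tokens = tokenizeLine(grammar, 'a 12 "3" 4');
console.log(tokens.map((t) => t.rule).join(","));
```

Note that the `3` inside the quotes is not reported as a number, because the 'string' state searches a different pattern-set than the 'base' state.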
Tokenizing a long line is quite expensive. The extreme case is uglified, optimized source code where all newlines are removed and the whole content consists of just one line.
To improve the performance of tokenizing a long line, the following characteristics of the tokenize algorithm can be used:
- the same string is searched multiple times
- some regexp-pattern sets are used multiple times within a line, e.g. if they are associated with a 'base' state that the tokenizer state machine falls into often.
- some regexp-patterns appear in multiple pattern-sets. An example is the regexp to find a comment token, which is present in almost every pattern-set. The same holds for the regexp-patterns that find identifiers or keywords.
Based on these observations there's the potential for some performance improvements. Most of them are already implemented in node-oniguruma.
- store the encoded UTF-8 string and the UTF-8 to UTF-16 offset mapping table along with the input string to avoid re-encoding and recalculating the offsets. The first part is nicely tackled in "Encode string once and reuse for subsequent calls" (#5)
- remember the last search result of each regexp-pattern. The previous search result can be reused if:
  - the previous input string was the same
  - the current search position is the same as or larger than the previous search position
  - the result offset is at or after the current search position
- often, the first find done on a string is done with index 0. In the best case, no match is found on the whole line, so that pattern never has to be searched again for that line. It is even better if that pattern is the 'comment' pattern. See here how it is done in node-oniguruma.
- reuse regexp-pattern search results across pattern-sets (if I'm not mistaken, 'node-oniguruma' doesn't do that)
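To make the last two ideas more concrete, here is a minimal sketch of a per-pattern result cache that is also shared across pattern-sets. It is not the node-oniguruma code: `Line`, `CachedRegExp`, `shared` and the `searches` counter are hypothetical names, JavaScript `RegExp` stands in for oniguruma, and the sketch assumes patterns without lookbehind or `\G`-style anchors, whose results could depend on the start position.

```typescript
interface SearchResult { index: number; length: number }

// Hypothetical wrapper giving each input string a cheap identity,
// so "same input string" is an id comparison instead of a string compare.
let nextLineId = 0;
class Line {
  readonly id = ++nextLineId;
  readonly text: string;
  constructor(text: string) { this.text = text; }
}

// Hypothetical pattern wrapper that remembers its last search.
class CachedRegExp {
  searches = 0; // number of real regexp executions, for illustration only
  private readonly source: string;
  private lastLineId = -1;
  private lastPosition = -1;
  private lastResult: SearchResult | null = null;

  constructor(source: string) { this.source = source; }

  search(line: Line, position: number): SearchResult | null {
    if (this.lastLineId === line.id && this.lastPosition <= position) {
      // A cached "no match" still holds for any later position; a cached
      // match is still the first match if it starts at or after `position`.
      if (this.lastResult === null || this.lastResult.index >= position) {
        return this.lastResult;
      }
    }
    const re = new RegExp(this.source, "g");
    re.lastIndex = position;
    const m = re.exec(line.text);
    this.searches++;
    this.lastLineId = line.id;
    this.lastPosition = position;
    this.lastResult = m ? { index: m.index, length: m[0].length } : null;
    return this.lastResult;
  }
}

// Sharing one instance per pattern source lets the cache work across pattern-sets.
const registry = new Map<string, CachedRegExp>();
function shared(source: string): CachedRegExp {
  let re = registry.get(source);
  if (!re) {
    re = new CachedRegExp(source);
    registry.set(source, re);
  }
  return re;
}

const baseSet = [shared("//"), shared("\\d+")];
const otherSet = [shared("//"), shared("\"")];

const line = new Line("let x = 1; let y = 2;");
baseSet[0].search(line, 0);   // real search from index 0: no comment on this line
otherSet[0].search(line, 10); // other pattern-set, same pattern: cached "no match"
baseSet[1].search(line, 0);   // real search, finds the first number
baseSet[1].search(line, 5);   // cached: the match still lies at or after position 5
const second = baseSet[1].search(line, 9); // cached match is behind us: search again
```

In this sketch the comment pattern is executed only once for the whole line, even though two pattern-sets ask for it at different positions.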