Tackling performance issues when searching on the same line multiple times

`vscode-textmate` uses oniguruma (and soon onigasm) to tokenize source code. It works line by line, calling `findNextMatch` with an increasing offset and a varying set of regexp-patterns until all tokens of the line have been found.
The set of regexp-patterns can be different for each find, based on the state of the tokenizer. The tokenizer is a state machine where the previously found tokens define the state the tokenizer is in. 
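
To make the loop concrete, here is a minimal sketch of such a state-machine tokenizer. It uses JavaScript `RegExp` objects as a stand-in for oniguruma, and the names `Rule`, `Grammar`, `tokenizeLine` and this `findNextMatch` are hypothetical illustrations, not the actual `vscode-textmate` or `node-oniguruma` API.

```typescript
// Hypothetical types for illustration only.
interface Rule {
  name: string;
  pattern: RegExp;   // needs the "g" flag so that lastIndex is honoured
  nextState: string; // state the tokenizer enters after this rule matches
}

type Grammar = Record<string, Rule[]>; // state name -> pattern set

interface Token { start: number; end: number; name: string; }

// Stand-in for the scanner call: runs every pattern of the set from `offset`
// and returns the match that starts earliest (the first rule wins on ties).
function findNextMatch(rules: Rule[], line: string, offset: number) {
  let best: { rule: Rule; start: number; end: number } | null = null;
  for (const rule of rules) {
    rule.pattern.lastIndex = offset;
    const m = rule.pattern.exec(line);
    if (m !== null && (best === null || m.index < best.start)) {
      best = { rule, start: m.index, end: m.index + m[0].length };
    }
  }
  return best;
}

function tokenizeLine(grammar: Grammar, line: string, startState: string): Token[] {
  const tokens: Token[] = [];
  let state = startState;
  let offset = 0;
  while (offset < line.length) {
    // The pattern set depends on the current state.
    const rules = grammar[state] ?? [];
    const match = findNextMatch(rules, line, offset);
    if (match === null) break;
    tokens.push({ start: match.start, end: match.end, name: match.rule.name });
    state = match.rule.nextState;
    // Guard against empty matches so the loop always advances.
    offset = match.end > offset ? match.end : offset + 1;
  }
  return tokens;
}

const grammar: Grammar = {
  base: [
    { name: "string.begin", pattern: /"/g, nextState: "string" },
    { name: "identifier", pattern: /[A-Za-z_]\w*/g, nextState: "base" },
  ],
  string: [
    { name: "string.end", pattern: /"/g, nextState: "base" },
  ],
};

const tokens = tokenizeLine(grammar, 'foo "bar" baz', "base");
console.log(tokens.map(t => t.name).join(", "));
```

Note that every call to `findNextMatch` searches the same line again with every pattern of the current set, which is where the cost on long lines comes from.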

Tokenizing a long line is quite expensive. The extreme case is uglified, optimized source code where all newlines have been removed and the whole content consists of just one line.

To improve the performance of tokenizing a long line, the following characteristics of the tokenize algorithm can be used:
- the same string is searched multiple times
- some regexp-pattern sets are used multiple times within a line, e.g. if they are associated with a 'base' state that the tokenizer state machine falls into often.
- some regexp-patterns appear in multiple pattern-sets. An example is the regexp to find a comment token, which is present in almost every pattern-set. The same holds for the regexp-patterns to find identifiers or keywords.

Based on these observations, there is potential for some performance improvements. Most of them are already implemented in `node-oniguruma`.

- store the encoded UTF-8 string and the UTF-8 to UTF-16 offset mapping table along with the input string, to avoid re-encoding the string and recalculating the offsets on every search. The first part is nicely tackled in #5
- remember the last search result of each regexp-pattern. The previous search result can be reused if:
   - the previous input string was the same
   - the current search position is the same as or larger than the previous search position
   - the previous result's offset is at or after the current search position
   - often, the first find on a string is done with index 0. In the best case, no match was found on the whole line, so that result answers every later search on the same line; even better if that pattern was the 'comment' pattern. See [here](https://github.com/atom/node-oniguruma/blob/4d2c67eb5619c2f5b0ac6784f3e4ca1cb80997e5/src/onig-reg-exp.cc#L31) for how it is done in node-oniguruma.
- reuse regexp-pattern search results across pattern-sets (if I'm not mistaken, `node-oniguruma` doesn't do that)
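
The "remember the last search result" idea can be sketched roughly as follows. This is an illustration under assumptions, not the `node-oniguruma` implementation: `SearchString`, `CachingPattern` and `searchCount` are hypothetical names, and JavaScript `RegExp` stands in for oniguruma. It treats an earlier "no match" as reusable too, following the best case described above.

```typescript
// Hypothetical names throughout; not the node-oniguruma API.
interface SearchResult { start: number; end: number; }

let nextStringId = 0;

// Wraps a line so that "the same input string" can be recognised by id
// instead of by comparing the (possibly very long) contents.
class SearchString {
  readonly id: number;
  readonly value: string;
  constructor(value: string) {
    this.id = nextStringId++;
    this.value = value;
  }
}

class CachingPattern {
  private readonly regexp: RegExp;
  private lastStringId = -1;
  private lastPosition = 0;
  private lastResult: SearchResult | null = null;
  searchCount = 0; // number of real regexp searches, kept for illustration

  constructor(source: string) {
    this.regexp = new RegExp(source, "g");
  }

  search(str: SearchString, position: number): SearchResult | null {
    if (str.id === this.lastStringId && position >= this.lastPosition) {
      // No match from an earlier position means no match from here either;
      // a match that starts at or after `position` is still the next match.
      if (this.lastResult === null || this.lastResult.start >= position) {
        return this.lastResult;
      }
    }
    // Cache miss: run the real search and remember its outcome.
    this.searchCount++;
    this.regexp.lastIndex = position;
    const m = this.regexp.exec(str.value);
    this.lastStringId = str.id;
    this.lastPosition = position;
    this.lastResult = m === null ? null : { start: m.index, end: m.index + m[0].length };
    return this.lastResult;
  }
}

const line = new SearchString("let x = 1; // note");
const comment = new CachingPattern("//.*");
const first = comment.search(line, 0);   // real search
const second = comment.search(line, 4);  // reused: match starts after position 4
const third = comment.search(line, 14);  // past the match start: searched again
const fourth = comment.search(line, 16); // reused: "no match" stays valid
console.log(comment.searchCount);
```

Because the cache lives on the pattern object rather than on the pattern-set, a pattern object that is shared between several pattern-sets would also share its cached result, which is one possible way to approach the last item in the list. Patterns whose result depends on the search start position itself would have to bypass such a cache.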




