Replace the tokenizer with a flex-based scanner #3846
kivikakk merged 17 commits into github-linguist:master from kivikakk:c-ext
Looking good. I assume we're now properly handling all the weird Flex crashes you were seeing? I also think it'd be neat if the Rakefile had a way to rebuild the lexer (and also to check that we're using the right version of Flex).

I've not tested this yet, but I wonder how well this new tokenizer will fare with non-ASCII. From the look of things, it should be 👌, but thought I'd ask to be sure. For context, an attempt to improve Linguist's support of non-ASCII in the Ruby implementation has been started in #3748.
Yeah; I constrained our use of features which turn out to be dangerous (LOOKING AT YOU, TRAILING CONTEXT), and everything works as expected now. ✨
+1, will add.
It'll do as well as it currently does, which is to say Not Hugely Well; non-ASCII stuff will get skipped. It wouldn't be too hard to make it grok things we're likely to see in UTF-8 text, though it'd be a lot harder to do this and only match word-characters (since we'd have to add actual Unicode understanding to our lexer at that stage).
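To make the trade-off concrete, here is a minimal Ruby sketch (not the flex scanner itself, and not Linguist code) contrasting ASCII-only word matching with Unicode-aware matching. The sample text and both patterns are assumptions chosen for illustration; Ruby's regex engine ships the Unicode property tables that a hand-written lexer would have to supply itself.

```ruby
# Sample text with two non-ASCII letters (U+00EF and U+00E9).
text = "na\u00EFve caf\u00E9 foo_bar"

# ASCII-only word matching: non-ASCII characters are skipped,
# which also splits the words that contain them.
ascii_tokens = text.scan(/[A-Za-z0-9_]+/)

# Unicode-aware word matching via Ruby's \p{L} (letters) and
# \p{N} (numbers) property classes.
unicode_tokens = text.scan(/[\p{L}\p{N}_]+/)

p ascii_tokens
p unicode_tokens
```

The first scan drops the accented letters and breaks the words around them; the second keeps each word whole, at the cost of needing Unicode character data.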
* Don't read and split the entire file if we only ever use the first/last n lines
* Only consider the first 50KiB when using heuristics/classifying. This can save a *lot* of time; running a large number of regexes over 1MiB of text takes a while.
* Memoize File.size/read/stat; re-reading in a 500KiB file every time `data` is called adds up a lot.
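A rough sketch of the three ideas in this commit message, assuming a hypothetical `LazyFile` class. The class, constant, and method names here are invented for illustration and are not Linguist's actual API; only the first-n-lines case is shown.

```ruby
# Hypothetical illustration; names are assumptions, not Linguist's API.
class LazyFile
  # "First 50KiB" from the commit message.
  MAX_CLASSIFY_BYTES = 50 * 1024

  def initialize(path)
    @path = path
  end

  # Memoized: the file is read from disk at most once.
  def data
    @data ||= File.binread(@path)
  end

  # Memoized file size, so repeated calls do not hit the filesystem.
  def size
    @size ||= File.size(@path)
  end

  # Only this prefix would be handed to heuristics/classification.
  def classify_data
    @classify_data ||= data.byteslice(0, MAX_CLASSIFY_BYTES)
  end

  # First n lines, without splitting the whole file into lines:
  # the enumerator stops as soon as n lines have been produced.
  def first_lines(n)
    data.each_line.first(n).map(&:chomp)
  end
end
```

With `||=`, repeated calls to `data` return the same String object instead of re-reading the file, which is the saving the third bullet describes.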
    end

    def encoded_newlines_re
      @encoded_newlines_re ||= Regexp.union(["\r\n", "\r", "\n"].
Does the \R extension not work here?
I also take it Ruby's regex engine doesn't have the equivalent of Perl's /a modifier?
I'm changing as little code as I can; this is just a refactor from:
\R also catches [\v\f] which we definitely don't want.
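A quick way to check this, as a sketch: compare a plain union of the three newline strings against `\R`. The union below covers only the visible part of the line in the diff and leaves out whatever follows the truncation.

```ruby
# Explicit union: matches only CRLF, CR and LF.
newlines = Regexp.union(["\r\n", "\r", "\n"])

["\r\n", "\r", "\n", "\v", "\f"].each do |s|
  union_hit = newlines.match?(s)
  r_hit = /\R/.match?(s)
  puts "#{s.inspect}: union=#{union_hit} \\R=#{r_hit}"
end
```

Both patterns accept the three newline forms, but only `\R` also accepts vertical tab and form feed.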
> I also take it Ruby's regex engine doesn't have the equivalent of Perl's /a modifier?
~$ ruby -e '//a'
-e:1: unknown regexp option - a

It doesn't, and more to the point, it wouldn't help for our use here, which isn't about Unicode-aware matching so much as avoiding terrible encoding exceptions rising from the deep. /a modifies the meaning of several sequences in the regular expression itself, rather than changing how a regular expression is applied to a given byte-sequence-tagged-with-an-encoding (i.e. a String), whatever the meaning of its contents.
Ah, I see. ;) Just thought to ask, since it's used very little in Perl (for good reasons). Thanks!
I'd like to merge this! Anyone feel like doing a final review?
lildude left a comment
Caveat pre-emptor: I have a copy of Dennis Ritchie's book but I'm far from being a C expert.
From what I do know this looks good to me, and the perf improvement is fantastic!!
@lildude Thank you! The responsibility is mine if this somehow goes belly-up.
Preliminary benchmarks put this in at a 12x speedup.
It doesn't produce identical results, but near enough. (Enough that all the tests should pass.)
/cc @vmg because he luuuuurves C