Unicode #5

kmarius · 2022-07-22T18:19:26Z

Notes:

For the string input, converting ASCII (non-Unicode) text to UTF-16 incurs a performance cost of, say, 10% (depending on the length, of course).
We can detect non-ASCII input by checking the highest bit of each byte and do the conversion to UTF-16 only if necessary.
The regex input does not have to be converted: we can simply check for non-BMP characters (i.e. those that need two code units in UTF-16) and set a UTF16 flag for the regex compiler. Everything inside the BMP works fine without the flag. I assume there is a tradeoff when activating UTF16.
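
A minimal sketch of the two checks described in the notes, assuming the input is available as a byte buffer with a known length. The function names `has_non_ascii` and `has_non_bmp` are hypothetical and not taken from this PR; the non-BMP check relies on the fact that code points above U+FFFF are encoded in UTF-8 with a four-byte sequence whose lead byte matches `11110xxx`, and it assumes the input is otherwise well-formed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if any byte has its highest bit set,
 * i.e. the input is not pure ASCII and needs the UTF-16 path. */
static bool has_non_ascii(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] & 0x80)
            return true;
    }
    return false;
}

/* Hypothetical helper: true if the UTF-8 input contains a four-byte
 * lead byte (11110xxx), which is how code points above U+FFFF
 * (outside the BMP) are encoded. Assumes well-formed input. */
static bool has_non_bmp(const char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xF8) == 0xF0)
            return true;
    }
    return false;
}
```

With checks like these, pure-ASCII subjects could skip the conversion entirely, and the UTF16 flag would only be set for patterns that actually contain a non-BMP character.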

Todo:

make all unicode work
add more tests
check with asan/valgrind
tidy up

L3MON4D3 · 2022-07-24T16:31:18Z

Doing the conversion only if necessary sounds like a great idea 👍

L3MON4D3 · 2022-08-15T17:11:48Z

Hi :D
How long do you think you'll need to finish this feature?

It would be awesome if you can get it ready soon-ish, I'd love if luasnip immediately handles unicode correctly.
If it takes some more time, that's alright too, not supporting transforming unicode is not a huge loss IMO

kmarius · 2022-08-15T17:23:54Z

I think it does work correctly, but I don't have a good idea of what to test yet. There are some lists like https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt and https://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt. I would also like some malformed UTF-8. This is arguably more important than being correct because a crash here will crash neovim.

L3MON4D3 · 2022-08-15T17:29:57Z

Good point!

I guess quickjs's applying a regex to unicode is well tested, so there would need to be tests for transforming utf8 to utf16 (and back)?
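
For reference, the part of the UTF-8 to UTF-16 round trip that is easiest to get wrong is the surrogate-pair arithmetic for non-BMP code points. The following is only a sketch of that step under the standard UTF-16 rules; `utf16_encode` and `utf16_decode_pair` are hypothetical names and not the functions used in this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Hypothetical helper: combine a high/low surrogate pair back into
 * a code point. The caller must pass a valid pair. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

A round-trip test would encode a set of code points (ASCII, BMP, and non-BMP such as U+1F600), decode them again, and compare against the originals.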

kmarius · 2022-08-15T18:41:59Z

@L3MON4D3 Ok, feeding random bytes into lre_compile will crash when repeated tens of thousands of times. I think checking for valid utf8 can be done if we don't trust the user. Stopping the conversion from utf8 to utf16 prevents crashing during matching. Would you expect some kind of error in that case? Or just no matches?
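
A validity check of the kind mentioned above could look roughly like this. It is a sketch of a strict UTF-8 validator, not the code from this PR: `utf8_is_valid` is a hypothetical name, and rejecting encoded surrogates is an assumption that would have to be relaxed if CESU-8 input were meant to be accepted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s[0..len) is well-formed UTF-8.
 * Rejects stray continuation bytes, truncated sequences, overlong
 * encodings, surrogates, code points above U+10FFFF, and the
 * obsolete lead bytes 0xF8..0xFF (which include 0xFD). */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;           /* number of continuation bytes expected */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return false;

        if (len - i - 1 < n)
            return false;                       /* truncated sequence */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF)
            return false;                       /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;                       /* encoded surrogate */
        i += n + 1;
    }
    return true;
}
```

Running a check like this before compiling or matching means malformed input is turned into an error early instead of reaching the regex engine.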

L3MON4D3 · 2022-08-15T19:23:23Z

An error sounds appropriate 👍
Better to distinguish between error and no match (we should just log errors).

kmarius · 2022-08-15T20:05:50Z

I went with returning nil, err on error.

As for the crashing, it looks like lre_compile segfaults if the input string contains 0xfd. This byte used to indicate the beginning of a six-byte sequence, which is unneeded and was removed. Valid UTF-8 cannot contain 0xfd, so we can check for that.
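
The scan for the offending byte can be very small. This is a sketch, assuming the pattern is passed as a pointer plus length; `contains_byte_fd` is a hypothetical name, and how the result is turned into the `nil, err` return on the Lua side is left out here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the buffer contains the byte 0xFD,
 * which never occurs in valid UTF-8. memchr compares bytes as
 * unsigned char, so the sign of plain char does not matter. */
static bool contains_byte_fd(const char *s, size_t len)
{
    return memchr(s, 0xFD, len) != NULL;
}
```

A caller would run this on the regex source before handing it to the compiler and report an error if it returns true.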

L3MON4D3 · 2022-08-16T07:40:02Z

Nice, that sounds good
Looking at the spec, 0xfc indicates six bytes as well, does it also crash?

kmarius · 2022-08-16T11:02:45Z

No, only 0xfd, and only in combination with other non-ascii chars. I also cannot reproduce it outside of the lua/luajit context. I think I will merge #8 and then this PR later today, seems easier in this order.

L3MON4D3 · 2022-08-16T12:30:26Z

> No, only 0xfd, and only in combination with other non-ascii chars.

Okay, that's good I guess :D

> I think I will merge #8 and then this PR later today, seems easier in this order.

Niiice, TY for getting it done this quickly ❤️

kmarius force-pushed the unicode branch from 5f0f1d8 to b429668 Compare July 29, 2022 22:43

kmarius marked this pull request as draft August 4, 2022 07:35

kmarius force-pushed the unicode branch 2 times, most recently from 2951999 to 1ea5a0f Compare August 15, 2022 18:00

kmarius force-pushed the unicode branch from 29cfc54 to b9f1c43 Compare August 15, 2022 19:36

kmarius and others added 15 commits August 16, 2022 20:32

unicode, first version

a434146

also allocate space for string terminator

4b96706

add conversion to CESU8

6dedbf7

convert properly

12d9ff5

safer cesu8 conversion

2f2ef4e

tests for unicode

0d0fbb3

handle unicode regex

bbf85ca

distinguish between ascii/non-ascii input

8a9facf

move utf16 conversion into its function

aef7b21

index shift

f77667d

update gitignore

021c05c

more unicode tests

9767c48

abort on malformed utf8 in matcher input

c78feba

return error on malformed unicode

dc6c951

scan regex input for invalid byte

ee2650c

kmarius added 2 commits August 16, 2022 20:42

update test.lua

4fce1ad

tidy up

bc43a6d

kmarius force-pushed the unicode branch from 22a1b62 to bc43a6d Compare August 16, 2022 18:42

add tests for named groups with unicode

ba9bee4

kmarius mentioned this pull request Aug 16, 2022

support named groups #8

Merged

kmarius marked this pull request as ready for review August 16, 2022 19:02

kmarius merged commit dd65498 into master Aug 16, 2022
