@kmarius
Owner

@kmarius kmarius commented Jul 22, 2022

Notes:

  • for the string input, converting ascii (non-unicode) to utf16 incurs a performance cost of, say, 10% (depending on the length, of course)
  • we can detect non-ascii in the string by checking the highest bit of each character and do the conversion to utf16 only if necessary
  • the regex input does not have to be converted, we can simply check for non-bmp (i.e. two code units in utf16) and set a UTF16 flag for the regex compiler. Everything below bmp works fine. I assume there is a tradeoff when activating UTF16.
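The two checks in the notes above could look roughly like this. It is a minimal sketch in Python rather than the project's C, the function names `has_non_ascii` and `has_non_bmp` are made up for illustration, and the second check assumes the input is already well-formed UTF-8:

```python
def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte has its highest bit set (value >= 0x80)."""
    return any(b & 0x80 for b in data)


def has_non_bmp(data: bytes) -> bool:
    """Return True if UTF-8 input contains a code point above U+FFFF.

    Assumption: `data` is well-formed UTF-8. In that case, code points
    outside the BMP are exactly the four-byte sequences, and their lead
    byte lies in the range 0xF0..0xF4. These are also the code points
    that need two code units (a surrogate pair) in UTF-16.
    """
    return any(0xF0 <= b <= 0xF4 for b in data)
```

With checks like these, a pure-ASCII string can skip the conversion entirely, and the UTF16 flag only needs to be set when `has_non_bmp` is true for the regex.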

Todo:

  • make all unicode work
  • add more tests
  • check with asan/valgrind
  • tidy up

@L3MON4D3
Contributor

Doing the conversion only if necessary sounds like a great idea 👍

@kmarius kmarius marked this pull request as draft August 4, 2022 07:35
@L3MON4D3
Contributor

Hi :D
How long do you think you'll need to finish this feature?

It would be awesome if you can get it ready soon-ish, I'd love it if luasnip handled unicode correctly right away.
If it takes some more time, that's alright too, not supporting transforming unicode is not a huge loss IMO.

@kmarius
Owner Author

kmarius commented Aug 15, 2022

I think it does work correctly, but I don't have a good idea of what to test yet. There are some lists like https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt and https://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt. I would also like to test some malformed utf-8. Not crashing on it is arguably more important than being correct, because a crash here will crash neovim.

@L3MON4D3
Contributor

Good point!

I guess quickjs's regex matching on unicode is well tested, so we would mainly need tests for converting utf8 to utf16 (and back)?
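One way to build such tests is to compare the conversion against a reference implementation. The sketch below uses Python's built-in codecs as that reference; the helper names `utf8_to_utf16_units` and `utf16_units_to_utf8` are hypothetical and are not the project's functions:

```python
from typing import List


def utf8_to_utf16_units(data: bytes) -> List[int]:
    """Decode UTF-8 and return the UTF-16 code units as integers.

    Raises UnicodeDecodeError if `data` is not valid UTF-8.
    """
    raw = data.decode("utf-8").encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf16_units_to_utf8(units: List[int]) -> bytes:
    """Convert a list of UTF-16 code units back to UTF-8 bytes."""
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le").encode("utf-8")


def round_trips(data: bytes) -> bool:
    """Return True if UTF-8 -> UTF-16 -> UTF-8 reproduces the input."""
    return utf16_units_to_utf8(utf8_to_utf16_units(data)) == data
```

The expected code units from a reference like this could then be compared with what the C conversion produces for the same input, including inputs with characters outside the BMP and malformed byte sequences.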

@kmarius kmarius force-pushed the unicode branch 2 times, most recently from 2951999 to 1ea5a0f Compare August 15, 2022 18:00
@kmarius
Owner Author

kmarius commented Aug 15, 2022

@L3MON4D3 Ok, feeding random bytes into lre_compile will crash when repeated tens of thousands of times. I think checking for valid utf8 can be done if we don't trust the user. Stopping the conversion from utf8 to utf16 prevents crashing during matching. Would you expect some kind of error in that case? Or just no matches?

@L3MON4D3
Contributor

An error sounds appropriate 👍
Better to distinguish between error and no match (we should just log errors).

@kmarius
Owner Author

kmarius commented Aug 15, 2022

I went with returning nil, err on error.

For the crashing, it looks like lre_compile segfaults if the input string contains 0xfd. This byte used to indicate the beginning of a six-byte sequence, which turned out to be unneeded and was removed from the encoding. Valid utf-8 cannot contain 0xfd, so we can check for that.
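A rough sketch of such a check, written in Python for illustration. The name `check_input` and the error message format are hypothetical, and the returned pair only mimics the `nil, err` convention mentioned above. Note that this sketch rejects every byte value that can never occur in UTF-8 (0xC0, 0xC1 and 0xF5 through 0xFF), which is broader than the single 0xfd check described here:

```python
# Byte values that never appear anywhere in well-formed UTF-8.
# 0xFD is one of them; the wider set is an assumption of this sketch.
INVALID_BYTES = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))


def check_input(data: bytes):
    """Return (data, None) if no forbidden byte occurs, else (None, message).

    This is not a full UTF-8 validator: truncated or overlong sequences
    built from otherwise legal bytes are not detected.
    """
    for offset, byte in enumerate(data):
        if byte in INVALID_BYTES:
            return None, "invalid UTF-8 byte 0x%02x at offset %d" % (byte, offset)
    return data, None
```

Returning the error separately keeps "invalid input" distinguishable from "no match", so the caller can log the former.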

@L3MON4D3
Contributor

Nice, that sounds good
Looking at the spec, 0xfc indicates six bytes as well, does it also crash?

@kmarius
Owner Author

kmarius commented Aug 16, 2022

No, only 0xfd, and only in combination with other non-ascii chars. I also cannot reproduce it outside of the lua/luajit context. I think I will merge #8 and then this PR later today, seems easier in this order.

@L3MON4D3
Contributor

No, only 0xfd, and only in combination with other non-ascii chars.

Okay, that's good I guess :D

I think I will merge #8 and then this PR later today, seems easier in this order.

Niiice, TY for getting it done this quickly ❤️

@kmarius kmarius mentioned this pull request Aug 16, 2022
@kmarius kmarius marked this pull request as ready for review August 16, 2022 19:02
@kmarius kmarius merged commit dd65498 into master Aug 16, 2022