-
Notifications
You must be signed in to change notification settings - Fork 5k
Validating UTF-8 in less than one instruction per byte #43688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
We looked at this back in 2018 when the original blog post was published. The ultimate result was that our in-box decoder is faster for real-world payloads and scenarios. See dotnet/corefxlab#1831 for some discussion on this. |
Ah, good to know. It's basically
(from dotnet/corefxlab#1831 (comment)) As you've mentioned the original blog-post, that's about a state-machine. |
I pinged the team offline about this. After running benchmarks, our in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic. The Lemire logic likely exceeds the performance of the in-box implementation when validating CJK text. That's a potential area of improvement for us. Note: I have to couch this in non-absolutes, as the Lemire logic is pure validation ("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?"). So the comparison is a bit rough. |
🚀 Nice. Thanks for the update. |
In the paper, he benchmarks on JSON and HTML which should be mostly-Latin. He gives a 10x performance advantage over other methods. What could explain that his result is so wildly different from the result documented here in this issue tracker? |
The inbox is not just validation:
|
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process. This process is part of our issue cleanup automation. |
This issue will now be closed since it had been marked |
Lemire, et.al. published an intersting paper: Validating UTF-8 in less than one instruction per byte. It's something we should have a look at.
/cc: @GrabYourPitchforks
The text was updated successfully, but these errors were encountered: