Validating UTF-8 in less than one instruction per byte #43688

Closed
gfoidl opened this issue Oct 21, 2020 · 10 comments
Labels
area-System.Text.Encoding backlog-cleanup-candidate An inactive issue that has been marked for automated closure. enhancement Product code improvement that does NOT require public API changes/additions no-recent-activity tenet-performance Performance related issue

Comments

@gfoidl
Member

gfoidl commented Oct 21, 2020

Lemire et al. published an interesting paper: Validating UTF-8 in less than one instruction per byte. It's something we should have a look at.

/cc: @GrabYourPitchforks

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 21, 2020
@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@GrabYourPitchforks
Member

We looked at this back in 2018 when the original blog post was published. The ultimate result was that our in-box decoder is faster for real-world payloads and scenarios. See dotnet/corefxlab#1831 for some discussion on this.

@ghost

ghost commented Oct 21, 2020

Tagging subscribers to this area: @tarekgh, @krwq
See info in area-owners.md if you want to be subscribed.

@gfoidl
Member Author

gfoidl commented Oct 21, 2020

Ah, good to know. It's basically

UTF8 doesn't consist of randomly distributed characters. If you see a character from a certain character set, the next several characters (excluding whitespace and punctuation) are almost certainly going to be in that same range

(from dotnet/corefxlab#1831 (comment))
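One common way to exploit that observation is an ASCII fast path: test several bytes at once for a set high bit and only fall back to per-sequence work when a non-ASCII byte shows up. The sketch below is an illustration only. The thread does not describe how the in-box decoder is actually written, and the name `ascii_prefix_length` and the 8-byte word size are assumptions made for the example.

```python
HIGH_BITS = 0x8080808080808080  # high bit of each of 8 bytes


def ascii_prefix_length(data: bytes) -> int:
    """Return the length of the leading run of ASCII bytes in data.

    Hypothetical helper: it mimics a word-at-a-time fast path, not the
    actual .NET implementation.
    """
    i, n = 0, len(data)
    # Fast path: inspect 8 bytes per step. Any byte >= 0x80 sets one of
    # the bits in HIGH_BITS, so a zero result means 8 ASCII bytes.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            break
        i += 8
    # Slow path: finish the partial word (or the tail) byte by byte.
    while i < n and data[i] < 0x80:
        i += 1
    return i
```

For mostly-Latin input the fast loop covers nearly the whole buffer, which is why long runs of a single character class matter so much for throughput.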

The original blog post you mentioned is about a state machine.
Ridiculously fast unicode (UTF-8) validation is the "new" blog post (from yesterday) that accompanies the paper linked above, which is about a (vectorized) lookup.

@tarekgh tarekgh added enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue and removed untriaged New issue has not been triaged by the area owner labels Oct 21, 2020
@tarekgh tarekgh added this to the Future milestone Oct 21, 2020
@GrabYourPitchforks
Member

I pinged the team offline about this. After running benchmarks, our in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic. The Lemire logic likely exceeds the performance of the in-box implementation when validating CJK text. That's a potential area of improvement for us.

Note: I have to couch this in non-absolutes, as the Lemire logic is pure validation ("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?"). So the comparison is a bit rough.
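To make the difference concrete, here is a minimal scalar sketch of the "further analysis" variant: it validates the bytes and, in the same pass, counts the UTF-16 code units that transcoding would produce and how many of those come from surrogate pairs. The function name `analyze_utf8` and its return shape are assumptions for illustration. It is not the in-box .NET code and makes no attempt at vectorization. The byte ranges follow the well-formed UTF-8 table in the Unicode Standard.

```python
from typing import Tuple


def analyze_utf8(data: bytes) -> Tuple[bool, int, int]:
    """Return (is_valid, utf16_code_units, surrogate_pairs).

    Hypothetical helper. On invalid input it returns (False, 0, 0).
    """
    i, n = 0, len(data)
    units = 0   # UTF-16 code units after transcoding
    pairs = 0   # scalars outside the BMP, i.e. surrogate pairs
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # ASCII: one byte, one unit
            i += 1
            units += 1
            continue
        # Pick the sequence length and the allowed range of the SECOND
        # byte; the narrowed ranges reject overlong forms, surrogates
        # and code points above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:
            length, lo, hi = 3, 0xA0, 0xBF  # no overlong 3-byte forms
        elif b0 == 0xED:
            length, lo, hi = 3, 0x80, 0x9F  # no U+D800..U+DFFF
        elif 0xE1 <= b0 <= 0xEF:
            length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:
            length, lo, hi = 4, 0x90, 0xBF  # no overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:
            length, lo, hi = 4, 0x80, 0x8F  # nothing above U+10FFFF
        else:
            return False, 0, 0              # stray continuation, C0/C1, F5+
        if i + length > n:
            return False, 0, 0              # truncated sequence
        if not lo <= data[i + 1] <= hi:
            return False, 0, 0
        for j in range(i + 2, i + length):  # remaining continuation bytes
            if not 0x80 <= data[j] <= 0xBF:
                return False, 0, 0
        if length == 4:
            units += 2                      # encoded as a surrogate pair
            pairs += 1
        else:
            units += 1
        i += length
    return True, units, pairs
```

A pure validator only needs the boolean, so it can skip the bookkeeping on `units` and `pairs`. That extra work is part of why the two approaches are hard to compare head to head.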

@gfoidl
Member Author

gfoidl commented Oct 21, 2020

in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic.

🚀 Nice.

Thanks for the update.

@GSPP

GSPP commented Oct 22, 2020

In the paper, he benchmarks on JSON and HTML which should be mostly-Latin. He gives a 10x performance advantage over other methods. What could explain that his result is so wildly different from the result documented here in this issue tracker?

@gfoidl
Member Author

gfoidl commented Oct 22, 2020

The in-box implementation does more than validation. The Lemire logic is pure validation

("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?")

Contributor

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

@dotnet-policy-service dotnet-policy-service bot added backlog-cleanup-candidate An inactive issue that has been marked for automated closure. no-recent-activity labels Apr 23, 2025
Contributor

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

@dotnet-policy-service dotnet-policy-service bot removed this from the Future milestone May 7, 2025