Validating UTF-8 in less than one instruction per byte #43688

gfoidl · 2020-10-21T16:29:21Z

Lemire et al. published an interesting paper: Validating UTF-8 in less than one instruction per byte. It's something we should have a look at.

/cc: @GrabYourPitchforks

Dotnet-GitSync-Bot · 2020-10-21T16:29:25Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

GrabYourPitchforks · 2020-10-21T18:03:56Z

We looked at this back in 2018 when the original blog post was published. The ultimate result was that our in-box decoder is faster for real-world payloads and scenarios. See dotnet/corefxlab#1831 for some discussion on this.

ghost · 2020-10-21T18:09:02Z

Tagging subscribers to this area: @tarekgh, @krwq
See info in area-owners.md if you want to be subscribed.

gfoidl · 2020-10-21T18:40:59Z

Ah, good to know. It's basically:

> UTF8 doesn't consist of randomly distributed characters. If you see a character from a certain character set, the next several characters (excluding whitespace and punctuation) are almost certainly going to be in that same range.

(from dotnet/corefxlab#1831 (comment))
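One common way to exploit this kind of clustering is an ASCII fast path: skip over long runs of single-byte characters a word at a time before doing any per-character work. The sketch below illustrates only the general idea. It is not the .NET in-box code, and the name `ascii_prefix_length` and the 8-byte word size are assumptions made for illustration.

```python
HIGH_BITS = 0x8080808080808080  # the top bit of each of 8 bytes


def ascii_prefix_length(data: bytes) -> int:
    """Return the number of leading bytes in `data` that are ASCII (< 0x80)."""
    n = len(data)
    i = 0
    # Word-at-a-time loop: an 8-byte chunk is all ASCII exactly when
    # none of its bytes has the high bit set.
    while i + 8 <= n:
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            break
        i += 8
    # Byte-at-a-time tail: finish the partial chunk or locate the
    # first non-ASCII byte inside the chunk that failed the test.
    while i < n and data[i] < 0x80:
        i += 1
    return i
```

On mostly-Latin input, nearly all of the buffer is consumed by the first loop, so the slower multi-byte handling only runs on the few bytes that need it.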

You mentioned the original blog post; that one is about a state machine.
"Ridiculously fast unicode (UTF-8) validation" is the "new" blog post (from yesterday) that accompanies the paper linked above, and it is about a (vectorized) lookup.
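For reference, this is what any of these validators has to decide, written as a plain scalar loop. It is neither the state machine from the original blog post nor the vectorized lookup from the paper; it is a minimal sketch that follows the well-formed byte ranges from the Unicode standard, and the function name `is_valid_utf8` is chosen here for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar UTF-8 validation based on the well-formed byte sequence ranges."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # single-byte (ASCII) sequence
            i += 1
            continue
        # For each lead byte: the number of continuation bytes and the
        # allowed range of the FIRST continuation byte. The narrowed ranges
        # reject overlong forms, surrogates, and values above U+10FFFF.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes U+D800..U+DFFF
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # excludes values above U+10FFFF
        else:
            return False  # 0x80..0xC1 and 0xF5..0xFF are never valid lead bytes
        if i + need >= n:
            return False  # sequence is truncated
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

The branching on every lead byte is what both the state machine and the vectorized lookup try to avoid, each in its own way.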

GrabYourPitchforks · 2020-10-21T19:17:12Z

I pinged the team offline about this. After running benchmarks, our in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic. The Lemire logic likely exceeds the performance of the in-box implementation when validating CJK text. That's a potential area of improvement for us.

Note: I have to couch this in non-absolutes, as the Lemire logic is pure validation ("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?"). So the comparison is a bit rough.
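To make the difference concrete, here is a minimal sketch of the "further analysis" variant: it answers the validity question and, for valid input, also reports how many UTF-16 code units transcoding would produce and how many surrogate pairs are among them. It leans on Python's strict decoder rather than a hand-written scanner, so it says nothing about how the .NET implementation works; the names `Utf8Analysis` and `analyze_utf8` are hypothetical.

```python
from typing import NamedTuple, Optional


class Utf8Analysis(NamedTuple):
    utf16_code_units: int  # number of UTF-16 code units after transcoding
    surrogate_pairs: int   # how many code points need a surrogate pair


def analyze_utf8(data: bytes) -> Optional[Utf8Analysis]:
    """Return analysis results for valid UTF-8, or None if `data` is invalid."""
    try:
        text = data.decode("utf-8")  # strict mode: raises on malformed input
    except UnicodeDecodeError:
        return None
    # Code points above U+FFFF take two UTF-16 code units (a surrogate pair);
    # every other code point takes exactly one.
    pairs = sum(1 for ch in text if ord(ch) > 0xFFFF)
    return Utf8Analysis(utf16_code_units=len(text) + pairs, surrogate_pairs=pairs)
```

A pure validator only needs to produce the yes/no part of this result, which is why a direct timing comparison between the two is approximate.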

gfoidl · 2020-10-21T19:42:17Z

> in-box implementation of all-Latin / mostly-Latin text exceeds the performance of the updated Lemire logic.

🚀 Nice.

Thanks for the update.

GSPP · 2020-10-22T08:15:14Z

In the paper, he benchmarks on JSON and HTML, which should be mostly Latin. He reports a 10x performance advantage over other methods. What could explain why his result is so wildly different from the result documented here in this issue tracker?

gfoidl · 2020-10-22T10:36:49Z

The in-box logic is not just validation:

> the Lemire logic is pure validation ("is this valid UTF-8?"), while the in-box logic is further analysis ("Is this valid UTF-8? If so, how many chars would result from transcoding? Of those, how many are surrogate pairs?")

dotnet-policy-service · 2025-04-23T16:28:10Z

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

dotnet-policy-service · 2025-05-07T19:35:06Z

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label Oct 21, 2020

GrabYourPitchforks added the area-System.Text.Encoding label Oct 21, 2020

tarekgh added the enhancement (Product code improvement that does NOT require public API changes/additions) and tenet-performance (Performance related issue) labels and removed the untriaged (New issue has not been triaged by the area owner) label Oct 21, 2020

tarekgh added this to the Future milestone Oct 21, 2020

dotnet-policy-service bot added the backlog-cleanup-candidate (An inactive issue that has been marked for automated closure.) and no-recent-activity labels Apr 23, 2025

dotnet-policy-service bot removed this from the Future milestone May 7, 2025

dotnet-policy-service bot closed this as completed May 7, 2025