Tags: hryx/zig

tautological-feature-only

Verified

This commit was signed with the committer’s verified signature.
hryx Stevie Hryciw
Elide evaluation of tautological integral comparison operations

json-invalid-utf8-v2

Verified

This commit was signed with the committer’s verified signature.
hryx Stevie Hryciw
json: disallow overlong and out-of-range UTF-8

Fixes ziglang#2379

= Overlong (non-shortest) sequences

UTF-8's unique encoding scheme allows for some Unicode codepoints
to be represented in multiple ways. For any of these characters,
the spec forbids all but the shortest form. These disallowed longer
sequences are called "overlong". As an interesting side effect of
this rule, the bytes C0 and C1 never appear in valid UTF-8.
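
To make "overlong" concrete, here is a small sketch (not the parser's code; `naive_decode` and `strict_accepts` are hypothetical helpers made up for illustration). It decodes a sequence by bit arithmetic alone, with no validity checks, and shows that four different byte sequences all yield U+002F ("/"). Only the 1-byte form is accepted by Python's built-in strict decoder.

```python
def naive_decode(seq: bytes) -> int:
    """Decode one UTF-8 sequence by bit arithmetic only (no validity checks)."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:        # 110xxxxx: 2-byte lead
        cp = lead & 0x1F
    elif lead >> 4 == 0b1110:     # 1110xxxx: 3-byte lead
        cp = lead & 0x0F
    elif lead >> 3 == 0b11110:    # 11110xxx: 4-byte lead
        cp = lead & 0x07
    else:
        raise ValueError("not a leading byte")
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
    return cp


def strict_accepts(seq: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Shortest form of "/" followed by 2-, 3- and 4-byte overlong forms.
forms = [b"\x2f", b"\xc0\xaf", b"\xe0\x80\xaf", b"\xf0\x80\x80\xaf"]
decoded = [naive_decode(f) for f in forms]
accepted = [strict_accepts(f) for f in forms]
```

All four entries of `decoded` are 0x2F, while `accepted` is true only for the first form.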

= Codepoint range

UTF-8 disallows representation of codepoints beyond U+10FFFF,
which is the highest codepoint that can be encoded in UTF-16.
Because a 4-byte sequence is structurally capable of encoding such
codepoints, they must be explicitly rejected. This rule also has an
interesting side effect, which is that bytes F5 to FF never appear.
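
The arithmetic behind the range limit can be sketched as follows (the helper name `four_byte_value` is an assumption for illustration, not something from the patch). A 4-byte sequence carries 21 payload bits, so its structure alone can reach well past U+10FFFF, and the cutoff falls in the middle of the F4 lead byte's range.

```python
MAX_CODEPOINT = 0x10FFFF


def four_byte_value(b0: int, b1: int, b2: int, b3: int) -> int:
    """Combine the payload bits of a 4-byte UTF-8 sequence (no validity checks)."""
    return (
        ((b0 & 0x07) << 18)
        | ((b1 & 0x3F) << 12)
        | ((b2 & 0x3F) << 6)
        | (b3 & 0x3F)
    )


# Largest value the 4-byte bit layout can express.
largest_structural = four_byte_value(0xF7, 0xBF, 0xBF, 0xBF)
# Last valid codepoint and the first one past the limit.
last_valid = four_byte_value(0xF4, 0x8F, 0xBF, 0xBF)
first_invalid = four_byte_value(0xF4, 0x90, 0x80, 0x80)
```

This is why a second byte following F4 has to be limited to 80..8F rather than the usual 80..BF.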

= References

Detecting an overlong version of a codepoint could get gnarly, but
luckily The Unicode Consortium did the hard work by creating this
handy table of valid byte sequences:

https://unicode.org/versions/corrigendum1.html

I thought this mapped nicely to the parser's state machine, so I
rearranged the relevant states to make use of it.
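
As a rough illustration of how such a table can drive validation, here is a minimal table-driven checker. It is a sketch, not the parser's actual state machine: the names `TABLE` and `is_valid_utf8` are made up, and the rows are the usual shortest-form ranges. It assumes that surrogates (ED followed by A0..BF) are not special-cased, so those sequences pass.

```python
# Each row: (lead byte range, allowed range for the second byte, sequence length).
# Third and fourth bytes, when present, are always plain continuations 80..BF.
TABLE = [
    ((0x00, 0x7F), None, 1),
    ((0xC2, 0xDF), (0x80, 0xBF), 2),
    ((0xE0, 0xE0), (0xA0, 0xBF), 3),  # A0 floor rejects 3-byte overlongs
    ((0xE1, 0xEF), (0x80, 0xBF), 3),
    ((0xF0, 0xF0), (0x90, 0xBF), 4),  # 90 floor rejects 4-byte overlongs
    ((0xF1, 0xF3), (0x80, 0xBF), 4),
    ((0xF4, 0xF4), (0x80, 0x8F), 4),  # 8F ceiling rejects > U+10FFFF
]


def is_valid_utf8(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        for (lo, hi), second, length in TABLE:
            if lo <= lead <= hi:
                break
        else:
            return False  # 80..C1 and F5..FF match no row
        if i + length > n:
            return False  # truncated sequence
        if length > 1:
            s_lo, s_hi = second
            if not s_lo <= data[i + 1] <= s_hi:
                return False
            for b in data[i + 2 : i + length]:
                if not 0x80 <= b <= 0xBF:
                    return False
        i += length
    return True
```

The restricted second-byte ranges do all the work: no codepoint ever has to be computed to reject an overlong or out-of-range sequence.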

json-invalid-utf8-v1

Verified

This commit was signed with the committer’s verified signature.
hryx Stevie Hryciw
Fixes ziglang#2379

A few UTF-8 edge cases were not covered; this patch covers them.

The octets C0 and C1 are universally invalid according to UTF-8. Why?

1. As solo bytes, they are outside of the range of ASCII (00-7F).
2. As continuation bytes, they do not start with the bit pattern 0b10.
3. As leading bytes of a multibyte sequence, they would result in a
   2-byte "overlong" representation of ASCII which could have been
   encoded as plain ASCII. UTF-8 disallows this inferior encoding.

This patch fixes the third case above; the others were already covered.
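
The three cases can be checked mechanically. This is only a sketch with made-up helper names (`is_ascii`, `is_continuation`, `two_byte_value`). It enumerates every 2-byte sequence led by C0 or C1 and confirms that the results cover exactly the ASCII range.

```python
def is_ascii(b: int) -> bool:
    return b <= 0x7F


def is_continuation(b: int) -> bool:
    return (b & 0xC0) == 0x80  # bit pattern 10xxxxxx


def two_byte_value(lead: int, cont: int) -> int:
    """Payload of a 2-byte sequence: 5 bits from the lead, 6 from the continuation."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# Cases 1 and 2: C0 and C1 are neither ASCII nor continuation bytes.
solo_or_continuation = [
    is_ascii(b) or is_continuation(b) for b in (0xC0, 0xC1)
]

# Case 3: every 2-byte sequence led by C0 or C1 lands in 00..7F.
values = sorted(
    two_byte_value(lead, cont)
    for lead in (0xC0, 0xC1)
    for cont in range(0x80, 0xC0)
)

# The first lead byte that produces something outside ASCII is C2.
first_non_ascii = two_byte_value(0xC2, 0x80)
```

`values` comes out as every integer from 0x00 to 0x7F exactly once, and C2 80 is the first 2-byte sequence that encodes U+0080.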

Next, the octets F5 through FF are never seen in UTF-8. Why?

1. Same as above!
2. Same as above!
3. As leading bytes of a multibyte sequence, F5 to F7 would start a
   4-byte sequence which encodes a codepoint beyond U+10FFFF. That
   is the highest codepoint supported by UTF-16, and UTF-8 explicitly
   limits its encoding range to that of UTF-16. F8 to FF do not match
   any leading-byte pattern for sequences of up to 4 bytes.

Again, this patch fixes the third case.
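
A quick way to see the third case: compute the smallest codepoint each 4-byte lead can produce, with all continuation payload bits zero. The helper names below are hypothetical and only for illustration.

```python
MAX_CODEPOINT = 0x10FFFF


def is_four_byte_lead(b: int) -> bool:
    return (b >> 3) == 0b11110  # bit pattern 11110xxx, i.e. F0..F7


def min_four_byte_value(lead: int) -> int:
    """Smallest value a 4-byte sequence with this lead can carry."""
    return (lead & 0x07) << 18


# Smallest reachable codepoint for each structurally possible 4-byte lead.
minimums = {lead: min_four_byte_value(lead) for lead in range(0xF0, 0xF8)}

# Leads whose smallest value is already past the limit.
always_out_of_range = [
    lead for lead, lowest in minimums.items() if lowest > MAX_CODEPOINT
]

# F8..FF do not have the 4-byte lead pattern at all.
high_bytes_are_leads = any(is_four_byte_lead(b) for b in range(0xF8, 0x100))
```

F4 starts at U+100000 and so is still usable, while F5 already starts at 0x140000.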

References used for this change:

- https://tools.ietf.org/html/rfc3629
- https://tools.ietf.org/html/rfc8259

This patch explicitly ignores the case of (technically) disallowed
surrogate codepoints in the range U+D800 to U+DFFF because the Zig
community already has ongoing discussions about how to handle that.
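
For context on the surrogate case left out here, this sketch shows what such a sequence looks like. It illustrates the encoding only, not anything the patch does, and the variable names are made up. The bytes ED A0 80 are structurally a normal 3-byte sequence, but the value they carry is U+D800, which a strict decoder such as Python's rejects.

```python
SURROGATE_MIN, SURROGATE_MAX = 0xD800, 0xDFFF

seq = bytes([0xED, 0xA0, 0x80])

# Payload of a 3-byte sequence: 4 bits from the lead, 6 from each continuation.
value = ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
is_surrogate = SURROGATE_MIN <= value <= SURROGATE_MAX

try:
    seq.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```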

0.5.0

Release 0.5.0

stage2-recursive-parser-pre-cleanup

Verified

This commit was signed with the committer’s verified signature.
hryx Stevie Hryciw
Remove redundant TODO comments

0.4.0

Release 0.4.0

0.3.0

Release 0.3.0

0.2.0

Release 0.2.0

0.1.1

Release 0.1.1

0.1.0

Release 0.1.0