Inconsistent about invalid UTF-8 #30
I understand RFC 7159 as follows: JSON is a text format made of Unicode characters, encoded in UTF-8, UTF-16, or UTF-32. So data that is not valid UTF-8, UTF-16, or UTF-32 cannot be JSON. While it may make sense for a parser to convert invalid sequences into U+FFFD, this conversion is not a Unicode requirement, which is why I think a strict JSON validator should reject invalid UTF-8. I'm willing to be convinced by another interpretation :-) Meanwhile, the tests have to be consistent, and it seems they are not.
Unicode allows parsers to either reject ill-formed input, or to replace ill-formed input with U+FFFD. It's very common to pick the latter approach. JSON operates on Unicode text, so while it's correct for a JSON parser to reject ill-formed input, it's also correct for a JSON parser to replace ill-formed input with U+FFFD. I think it's valuable to have tests in the test suite for this to ensure that the parser doesn't crash, but the tests should be considered implementation-defined as the parser can legitimately choose either option.
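A minimal sketch of the two behaviours described above, assuming Python and a hypothetical input containing a stray 0xFF byte (the byte value and key name are made up for illustration):

```python
import json

# Hypothetical input: a JSON document whose string contains an invalid UTF-8 byte (0xFF).
raw = b'{"key": "abc\xffdef"}'

# Option 1: reject ill-formed input outright (what a strict validator would do).
try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    print("rejected: not valid UTF-8")

# Option 2: replace ill-formed sequences with U+FFFD and keep parsing.
text = raw.decode("utf-8", errors="replace")
value = json.loads(text)  # parses fine: {'key': 'abc\ufffddef'}
print(value)
```

Both outcomes are consistent with Unicode, which is the argument for treating such test cases as implementation-defined rather than mandatory failures.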
There are multiple tests that check how the parser handles invalid UTF-8 sequences in the byte stream. Some of them are i_ tests, but some of them are n_ tests. I think the n_ tests are incorrect and should be turned into i_ tests, because it's perfectly reasonable for parsers to convert invalid UTF-8 sequences to U+FFFD instead of failing to parse.