Inconsistent about invalid UTF-8 #30
I understand RFC 7159 as follows: JSON is a text format made of Unicode characters, encoded in UTF-8, UTF-16, or UTF-32. So data that is not valid UTF-8, UTF-16, or UTF-32 cannot be JSON. While it may make sense for a parser to convert invalid sequences into U+FFFD, this conversion is not a Unicode requirement, which is why I think a strict JSON validator should reject invalid UTF-8. I'm willing to be convinced by another interpretation :-) Meanwhile, the tests have to be consistent, and it seems they are not.
Unicode allows parsers to either reject ill-formed input, or to replace ill-formed input with U+FFFD. It's very common to pick the latter approach. JSON operates on Unicode text, so while it's correct for a JSON parser to reject ill-formed input, it's also correct for a JSON parser to replace ill-formed input with U+FFFD. I think it's valuable to have tests in the test suite for this to ensure that the parser doesn't crash, but the tests should be considered implementation-defined as the parser can legitimately choose either option.
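A minimal sketch of the two behaviours described above, assuming Python and a hypothetical input containing a stray 0xFF byte (the byte value and key name are made up for illustration):

```python
import json

# Hypothetical input: a JSON document whose string contains an invalid UTF-8 byte (0xFF).
raw = b'{"key": "abc\xffdef"}'

# Option 1: reject ill-formed input outright (what a strict validator would do).
try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    print("rejected: not valid UTF-8")

# Option 2: replace ill-formed sequences with U+FFFD and keep parsing.
text = raw.decode("utf-8", errors="replace")
value = json.loads(text)  # parses fine: {'key': 'abc\ufffddef'}
print(value)
```

Both outcomes are consistent with Unicode, which is the argument for treating such test cases as implementation-defined rather than mandatory failures.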
There are multiple tests that check how the parser handles invalid UTF-8 sequences in the byte stream. Some of them are i_ tests, but some of them are n_ tests. I think the n_ tests are incorrect and should be turned into i_ tests, because it's perfectly reasonable for parsers to convert invalid UTF-8 sequences to U+FFFD instead of failing to parse.