Inconsistent about invalid UTF-8 #30

Closed
lilyball opened this issue Oct 27, 2016 · 2 comments

Comments

@lilyball

There are multiple tests that check how the parser handles encountering invalid UTF-8 sequences in the byte stream. Some of them are i_ tests, but some of them are n_ tests. I think the n_ tests are incorrect and should be turned into i_ tests, because it's perfectly reasonable for parsers to convert invalid UTF-8 sequences to U+FFFD instead of failing to parse.
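To make the two behaviors concrete, here is a minimal sketch using Python's standard `json` module and its built-in UTF-8 codec. The input bytes and the helper names `parse_strict` and `parse_lenient` are made up for illustration and are not taken from the test suite.

```python
import json

# Illustrative input: a JSON array holding one string whose only content
# byte is 0xFF, which can never appear in well-formed UTF-8.
raw = b'["\xff"]'

def parse_strict(data: bytes):
    """Hypothetical helper: fail on ill-formed UTF-8 (errors='strict')."""
    return json.loads(data.decode("utf-8", errors="strict"))

def parse_lenient(data: bytes):
    """Hypothetical helper: replace ill-formed sequences with U+FFFD."""
    return json.loads(data.decode("utf-8", errors="replace"))

try:
    parse_strict(raw)
    strict_outcome = "accepted"
except UnicodeDecodeError:
    strict_outcome = "rejected"

lenient_result = parse_lenient(raw)

print(strict_outcome)   # rejected
print(lenient_result)   # ['\ufffd'], i.e. one U+FFFD replacement character
```

Both helpers receive the same bytes; the only difference is the decoder's error policy, which is the choice the `n_` tests currently forbid.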

@nst
Owner

nst commented Oct 30, 2016

I understand RFC 7159 as follows: JSON is a text format made of Unicode characters, encoded in UTF-8 or UTF-16 or UTF-32. So, data that is not valid UTF-8, UTF-16 or UTF-32 cannot be JSON.

While it may make sense for a parser to convert invalid sequences into U+FFFD, this conversion is not a Unicode requirement, which is why I think that a strict JSON validator should reject invalid UTF-8. I'm entirely willing to be convinced by another interpretation :-)

Meanwhile, the tests have to be consistent, and it seems they are not.

@lilyball
Author

lilyball commented Oct 31, 2016

Unicode allows parsers to either reject ill-formed input, or to replace ill-formed input with U+FFFD. It's very common to pick the latter approach. JSON operates on Unicode text, so while it's correct for a JSON parser to reject ill-formed input, it's also correct for a JSON parser to replace ill-formed input with U+FFFD. I think it's valuable to have tests in the test suite for this to ensure that the parser doesn't crash, but the tests should be considered implementation-defined as the parser can legitimately choose either option.
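As a rough sketch of what "implementation-defined" could mean for a test runner: the grading below lets an `i_` test pass on either outcome as long as the parser does not crash, while an `n_` test passes only on rejection. The functions `run_parser` and `verdict` and the file names are hypothetical; the stand-in parser is Python's `json` module with strict decoding, not the suite's actual harness.

```python
import json

def run_parser(data: bytes) -> str:
    """Hypothetical parser under test: returns 'accept', 'reject' or 'crash'.

    This stand-in decodes strictly, so it rejects ill-formed UTF-8.
    Any exception other than the two expected ones counts as a crash.
    """
    try:
        json.loads(data.decode("utf-8"))
        return "accept"
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "reject"
    except Exception:
        return "crash"

def verdict(test_name: str, outcome: str) -> str:
    """Grade an outcome by test-name prefix (only i_ and n_ handled here)."""
    if outcome == "crash":
        return "fail"                      # crashing is never acceptable
    if test_name.startswith("i_"):
        return "pass"                      # accept or reject are both fine
    if test_name.startswith("n_"):
        return "pass" if outcome == "reject" else "fail"
    raise ValueError(f"unhandled test prefix: {test_name}")

outcome = run_parser(b'["\xff"]')
print(verdict("i_hypothetical_invalid_utf8.json", outcome))   # pass
print(verdict("n_hypothetical_invalid_utf8.json", "accept"))  # fail
```

Under this grading, a parser that substitutes U+FFFD fails the same input when it is filed as `n_` but passes when it is filed as `i_`.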
