Test for unpaired surrogates is rejected by some JSON parsers #1062
This is tricky, since we're fighting other potential uses. A JSON parser that produces UTF-8 can't encode an isolated surrogate (except by substituting U+FFFD). The isolated-surrogate tests exist to allow for badly behaved or badly formed strings in UTF-16 implementations. Are the tests for this truly generic?
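To make the encoding point concrete, here is a minimal Python sketch (Python is used only for illustration; it is not the parser discussed in this issue). It shows that the escaped form is plain ASCII at the JSON level, that the decoded string holds a lone U+D800, and that a strict UTF-8 encoder has nothing to emit for it. The manual U+FFFD substitution is an assumption about what a lossy implementation might do, not a description of any particular parser.

```python
import json

# The JSON text is pure ASCII: the surrogate appears only as a \u escape.
doc = '{ "src": "{\\ud800}" }'

# Python's json module accepts the escape and returns a str containing U+D800.
src = json.loads(doc)["src"]
has_lone_surrogate = any(0xD800 <= ord(c) <= 0xDFFF for c in src)

# Strict UTF-8 has no byte sequence for a lone surrogate, so encoding fails.
try:
    src.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# A lossy implementation can only substitute U+FFFD, which destroys the
# information the test is trying to carry.
lossy = "".join(
    "\ufffd" if 0xD800 <= ord(c) <= 0xDFFF else c for c in src
).encode("utf-8")
```

A parser that stores strings as UTF-8 therefore has to either reject the document or alter the test input before the implementation under test ever sees it.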
My proposal, since this is the test format, is to change the encoding to base64 (it could be base64-encoded UTF-32) or to numeric code points. Could the test structure be modified so that the JSON source doesn't have to carry invalid strings through the processing pipeline? That might be helpful for implementations.
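A rough sketch of what the two proposed encodings could look like, again in Python for illustration. The field names src_codepoints and src_b64, and the choice of big-endian UTF-32 code units, are hypothetical; nothing in the test suite defines them. The point is only that the JSON file then contains nothing but ASCII digits and letters, and the harness rebuilds the ill-formed string after parsing.

```python
import base64
import json
import struct

def to_code_points(s):
    """Represent a string as a list of numeric code points."""
    return [ord(c) for c in s]

def from_code_points(cps):
    """Rebuild a string, including lone surrogates, from code points."""
    return "".join(chr(cp) for cp in cps)

def to_b64_utf32(s):
    """Pack each code point as a big-endian 32-bit unit, then base64 it."""
    cps = to_code_points(s)
    raw = struct.pack(">%dI" % len(cps), *cps)
    return base64.b64encode(raw).decode("ascii")

def from_b64_utf32(b64):
    """Inverse of to_b64_utf32."""
    raw = base64.b64decode(b64)
    cps = struct.unpack(">%dI" % (len(raw) // 4), raw)
    return from_code_points(cps)

# The problematic test source: "{", a lone U+D800, "}".
source = "{" + chr(0xD800) + "}"

# Hypothetical test-case shapes; either field alone would be enough.
test_case = {
    "src_codepoints": to_code_points(source),
    "src_b64": to_b64_utf32(source),
}
encoded = json.dumps(test_case)

# Any JSON parser can read this back; the harness then decodes the field.
decoded = json.loads(encoded)
round_trip_cp = from_code_points(decoded["src_codepoints"])
round_trip_b64 = from_b64_utf32(decoded["src_b64"])
```

Numeric code points stay readable in a diff, while base64 is more compact for long inputs. Either way the implementation under test still has to be able to represent the decoded string, which a UTF-8-only implementation cannot.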
The syntax-errors.json file contains a test (which should fail) for an unpaired surrogate in a literal:
https://github.com/unicode-org/message-format-wg/blob/main/test/tests/syntax-errors.json#L195
This JSON is rejected by the parser I'm using in ICU4C, https://github.com/nlohmann/json/ :
{ "src": "{\ud800}" }
because it's not valid UTF-8. (Note: I would think that the escaping with \u would make it valid, but that's the error that the JSON parser emits.)
I solved the problem by adding an ICU-specific patch to the JSON parser, but that's not ideal. In this comment, @srl295 suggested using base64 encoding for the test instead.