Test for unpaired surrogates is rejected by some JSON parsers #1062

Closed
catamorphism opened this issue Mar 27, 2025 · 2 comments · Fixed by #1072
Labels
test-suite Issue pertains to tests

Comments

@catamorphism
Collaborator

The syntax-errors.json file contains a test (which should fail) for an unpaired surrogate in a literal:

https://github.com/unicode-org/message-format-wg/blob/main/test/tests/syntax-errors.json#L195

This JSON is rejected by the parser I'm using in ICU4C, https://github.com/nlohmann/json/ :

{ "src": "{\ud800}" }

because it's not valid UTF-8. (Note: I would think that the escaping with \u would make it valid, but that's the error that the JSON parser emits.)
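To illustrate why the escaped form can still trip a parser, here is a minimal Python sketch. It only shows the behaviour of Python's standard `json` module, not of nlohmann/json; the helper name `utf8_encodable` is made up for this example. The JSON text itself is plain ASCII, but the string it decodes to contains a lone surrogate, which has no well-formed UTF-8 encoding.

```python
import json

# The test case from the issue, as raw JSON text (ASCII only).
raw = r'{ "src": "{\ud800}" }'

# Python's json module accepts the lone surrogate escape and keeps it
# as a single code point in the resulting str.
doc = json.loads(raw)
src = doc["src"]


def utf8_encodable(s: str) -> bool:
    """Return True if s can be encoded as well-formed UTF-8 (hypothetical helper)."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(len(src), hex(ord(src[1])))  # 3 0xd800
print(utf8_encodable(src))         # False
```

A parser whose in-memory strings are UTF-8 has to either reject this input or substitute another character at the point where Python raises `UnicodeEncodeError`.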

I solved the problem by adding an ICU-specific patch to the JSON parser, but that's not ideal. In this comment, @srl295 suggested using base64 encoding for the test instead.

@aphillips
Member

This is tricky, since we're fighting other potential uses. A JSON parser that produces UTF-8 can't encode an isolated surrogate (except as FFFD). The iso-surrogate stuff is to permit badly behaved/badly formed strings in UTF-16 implementations. Are the tests for this truly generic?
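As a rough illustration of the "except as FFFD" point, the following Python sketch replaces surrogate code points with U+FFFD before encoding to UTF-8. The function name `replace_surrogates` is an assumption made for this example, not part of any implementation discussed here.

```python
def replace_surrogates(s: str) -> str:
    """Replace every surrogate code point (U+D800..U+DFFF) with U+FFFD.

    In a Python str, a surrogate code point is never part of a valid pair,
    so each one is unencodable in UTF-8 and is replaced individually.
    """
    return "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s)


original = "{\ud800}"
cleaned = replace_surrogates(original)

print(cleaned.encode("utf-8"))  # b'{\xef\xbf\xbd}'
```

After this substitution the test input no longer contains the unpaired surrogate, so a UTF-8 based implementation cannot observe the error the test is meant to trigger.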

@srl295
Member

srl295 commented Apr 14, 2025

> This is tricky, since we're fighting other potential uses. A JSON parser that produces UTF-8 can't encode an isolated surrogate (except as FFFD). The iso-surrogate stuff is to permit badly behaved/badly formed strings in UTF-16 implementations. Are the tests for this truly generic?

My proposal, since this is the test format, is to change the encoding to be base64 (could be base 64 encoded UTF-32) - or numeric code points such as "codepoints": "U+DB88" etc.

Could the test structure be modified so that the JSON source doesn't have to include invalid strings in the processing pipeline? That might be helpful for implementations.
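A minimal sketch of how a test harness might decode either encoding, written in Python. The space-separated `U+XXXX` list format and the function names `from_codepoints`, `to_base64_utf32`, and `from_base64_utf32` are assumptions for illustration; the thread does not specify a concrete format. UTF-32 is treated here as raw big-endian 32-bit integers so that surrogate values pass through unchanged.

```python
import base64
import struct


def from_codepoints(spec: str) -> str:
    """Build a string from a list such as 'U+007B U+D800 U+007D' (assumed format)."""
    # tok[2:] strips the leading 'U+' from each token.
    return "".join(chr(int(tok[2:], 16)) for tok in spec.split())


def to_base64_utf32(s: str) -> str:
    """Encode each code point as a big-endian 32-bit integer, then base64."""
    data = struct.pack(f">{len(s)}I", *(ord(ch) for ch in s))
    return base64.b64encode(data).decode("ascii")


def from_base64_utf32(b64: str) -> str:
    """Inverse of to_base64_utf32; surrogate values are passed through as-is."""
    data = base64.b64decode(b64)
    count = len(data) // 4
    return "".join(chr(cp) for cp in struct.unpack(f">{count}I", data))


src = from_codepoints("U+007B U+D800 U+007D")
encoded = to_base64_utf32(src)

# The JSON file would only ever contain ASCII; the harness rebuilds the
# ill-formed string after parsing.
print(from_base64_utf32(encoded) == src)  # True
```

With either scheme the JSON parser only sees ordinary ASCII strings, and each implementation decides for itself how (or whether) to materialize the lone surrogate.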

@aphillips added the PR-needed label (Issue is blocked on having a PR created) on Apr 21, 2025
aphillips added a commit that referenced this issue Apr 21, 2025
Fixes #1062 per 2025-04-21 call
@aphillips removed the PR-needed label (Issue is blocked on having a PR created) on Apr 21, 2025
@eemeli closed this as completed in 12b82c4 on Apr 28, 2025