ICU-23059 ICU4C MF2: Spec test updates #3423
looks good, comments about:
- upstreaming patch (check with ICU-TC)
- improving documentation on enum and possibly name of 'soft' error
// ICU PATCH
codepoint = codepoint1;
This should be upstreamed as an option.
Also, the ICU PATCH comment here should probably link to a future ticket that records the patch, e.g. // TODO ICU-nnnn Patched because …
I'm not sure if it would be accepted upstream; see https://json.nlohmann.me/home/faq/#parse-errors-reading-non-ascii-characters --
"Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259."
Since the codepoint in the test .json file is escaped, I'm not sure why it's considered to be invalid UTF-8, but maybe I'm not understanding something.
Actually, what is the impact of the JSON changes on other clients of the JSON library if any?
Does the MF library supporting malformed UTF-8 mean that the JSON parser must also? Is there another solution for test files, such as using numeric escapes?
Well, there aren't any other clients within ICU that use the JSON library, as far as I know. If you mean "what would the impact be if the changes were upstreamed?", I'm not sure.
The test file already uses a numeric escape: https://github.com/unicode-org/message-format-wg/blob/main/test/tests/syntax-errors.json#L195 but the (unpatched) JSON parser rejects it even so. Note that this particular test case tests for a syntax error, but the message string still has to be representable within the test file. So this particular problem is not caused by the MF library supporting malformed UTF-8 (although it does), but rather by the storage format for test cases needing to represent malformed UTF-8. Hope that makes sense; if not, let me know.
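To make the behaviour under discussion concrete, here is a minimal C++17 sketch of decoding `\uXXXX` escapes with a switch between rejecting and passing through unpaired surrogates. It is not the vendored JSON library's code or API: `SurrogatePolicy`, `parseHex4`, `startsEscape` and `decodeEscapedText` are hypothetical names invented for illustration, and other JSON escapes are deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical policy switch; not part of any real JSON library's API.
enum class SurrogatePolicy { Reject, PassThrough };

// Parse exactly four hex digits starting at s[pos]; nullopt if they are missing or invalid.
static std::optional<std::uint32_t> parseHex4(const std::string& s, std::size_t pos) {
    if (pos > s.size() || s.size() - pos < 4) return std::nullopt;
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
        else return std::nullopt;
        value = value * 16 + digit;
    }
    return value;
}

// True if a backslash-u escape starts at index i.
static bool startsEscape(const std::string& s, std::size_t i) {
    return i + 1 < s.size() && s[i] == '\\' && s[i + 1] == 'u';
}

// Decode text containing \uXXXX escapes into code points.
// Bytes outside an escape are copied as-is (assumed ASCII for this sketch).
std::optional<std::vector<std::uint32_t>> decodeEscapedText(const std::string& s,
                                                            SurrogatePolicy policy) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        if (!startsEscape(s, i)) {
            out.push_back(static_cast<unsigned char>(s[i]));
            ++i;
            continue;
        }
        const auto unit = parseHex4(s, i + 2);
        if (!unit) return std::nullopt;
        i += 6;  // consumed backslash, 'u', and four hex digits
        std::uint32_t cp = *unit;
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && startsEscape(s, i)) {
            const auto next = parseHex4(s, i + 2);
            if (next && *next >= 0xDC00 && *next <= 0xDFFF) {
                // Well-formed pair: combine into one supplementary code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (*next - 0xDC00);
                i += 6;
                out.push_back(cp);
                continue;
            }
        }
        // Unpaired surrogate: strict mode fails, lenient mode keeps it.
        if ((isHigh || isLow) && policy == SurrogatePolicy::Reject) return std::nullopt;
        out.push_back(cp);
    }
    assert(i == s.size());
    return out;
}
```

With `SurrogatePolicy::Reject`, an input such as `\uD800` fails to decode; with `SurrogatePolicy::PassThrough`, the lone surrogate is kept as a code point. A runtime flag is used here only to keep the sketch testable in one build.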
Nlohmann wrote somewhere that rejecting unpaired surrogates was required by the JSON standard, but I don't find support for that in either json.org or ECMA-404.
It may be better to have another representation in the test files, perhaps supporting base64 or something. Will other implementations also run into trouble using these same test files on other platforms/environments? Something parallel to bidiIsolation:
{
"description": "various characters as unquoted-literal",
"src": "e8Kh2J3hmoHigIvigJDigLDigaDigarjgIHugIDvt7B9",
"exp": "wqHYneGageKAi+KAkOKAsOKBoOKBquOAge6AgO+3sA==",
"encoding": "base64"
}
ICU-TC should be aware of the vendoring/patch, and I think it would be a good idea to open a ticket upstream at least. In fact, it would make a pretty simple #ifdef switch upstream.
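For the base64 idea above, a test runner would need to decode the `src` and `exp` fields back to raw bytes before handing them to the formatter. The following is a minimal C++17 sketch of such a decoder, assuming the standard base64 alphabet with `=` padding; `base64Value` and `decodeBase64` are hypothetical helper names, not existing ICU or test-harness functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Map one base64 alphabet character to its 6-bit value, or -1 if it is not in the alphabet.
static int base64Value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decode padded base64 into raw bytes; nullopt on malformed input.
// The result may contain arbitrary bytes, including ill-formed UTF-8.
std::optional<std::string> decodeBase64(const std::string& in) {
    if (in.size() % 4 != 0) return std::nullopt;
    std::string out;
    for (std::size_t i = 0; i < in.size(); i += 4) {
        int pad = 0;
        std::uint32_t triple = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            const char c = in[i + j];
            if (c == '=') {
                // Padding is only valid in the last two slots of the final group.
                if (i + 4 != in.size() || j < 2) return std::nullopt;
                ++pad;
                triple <<= 6;
            } else {
                if (pad > 0) return std::nullopt;  // data after padding
                const int v = base64Value(c);
                if (v < 0) return std::nullopt;
                assert(v < 64);
                triple = (triple << 6) | static_cast<std::uint32_t>(v);
            }
        }
        // Each 4-character group yields up to three bytes, fewer if padded.
        out.push_back(static_cast<char>((triple >> 16) & 0xFF));
        if (pad < 2) out.push_back(static_cast<char>((triple >> 8) & 0xFF));
        if (pad < 1) out.push_back(static_cast<char>(triple & 0xFF));
    }
    return out;
}
```

Because the decoder returns bytes rather than validated text, the JSON file itself stays plain ASCII and the JSON parser never sees the malformed sequence.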
approving - I think the other issues are good to discuss but don't need to block
I created unicode-org/message-format-wg#1062 to air your suggestion about using base64 in the test.
I also created https://unicode-org.atlassian.net/browse/ICU-23090 to track the ICU patch to the JSON parser, and will add that as a comment in the file.
Update spec tests to current version from message-format-wg

- Update parser for changed name-start grammar rule
- Validate number literals in :number implementation (since parser no longer does this)
- Disallow `:number`/`:integer` select option set from variable

See unicode-org/message-format-wg#1016

As part of this, un-skip tests where the `bad-option` error is expected, and implement validating digit size options (pending PR unicode-org#2973 is intended to do this more fully)
Update spec tests to current version from message-format-wg

- Update parser for changed name-start grammar rule
- Validate number literals in :number implementation (since parser no longer does this)
- Disallow `:number`/`:integer` select option set from variable

See "Require select option to be set by a literal value" message-format-wg#1016

As part of this, un-skip tests where the `bad-option` error is expected, and implement validating digit size options (pending PR "ICU-22747 MF2: Validate digit size for :number and :integer; validate options for :datetime" #2973 is intended to do this more fully)
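Since the description moves number-literal validation out of the parser and into the `:number` implementation, a small sketch of what such a check could look like may help. It assumes a JSON-style number grammar (optional `-`, an integer part with no leading zeros, optional fraction, optional exponent); that grammar is an assumption here, and `isNumberLiteral` is a hypothetical name, not the function this PR adds.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if s matches, in full, the assumed grammar:
//   ["-"] ("0" / nonzero-digit *digit) ["." 1*digit] [("e" / "E") ["+" / "-"] 1*digit]
bool isNumberLiteral(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();

    // Consume a run of ASCII digits and report how many were consumed.
    auto digits = [&]() {
        const std::size_t start = i;
        while (i < n && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
        return i - start;
    };

    if (i < n && s[i] == '-') ++i;  // optional sign

    // Integer part: a single "0", or a nonzero digit followed by more digits.
    if (i < n && s[i] == '0') {
        ++i;
    } else if (i < n && s[i] >= '1' && s[i] <= '9') {
        ++i;
        digits();
    } else {
        return false;
    }

    // Optional fraction: "." followed by at least one digit.
    if (i < n && s[i] == '.') {
        ++i;
        if (digits() == 0) return false;
    }

    // Optional exponent: e/E, optional sign, at least one digit.
    if (i < n && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        if (i < n && (s[i] == '+' || s[i] == '-')) ++i;
        if (digits() == 0) return false;
    }

    assert(i <= n);
    return i == n;  // reject trailing characters
}
```

Under this assumed grammar, operands such as `01`, `+1`, `1.` and `.5` are rejected, which is the kind of input a function would have to flag itself once the parser stops checking literals.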