ICU-23059 ICU4C MF2: Spec test updates #3423

Merged: catamorphism merged 1 commit into unicode-org:main from more-spec-tests-updates on Mar 27, 2025

@catamorphism (Contributor) commented Mar 1, 2025

Update spec tests to current version from message-format-wg

Checklist

  • Required: Issue filed: ICU-23059
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@catamorphism catamorphism marked this pull request as ready for review March 1, 2025 01:51
@catamorphism catamorphism requested a review from srl295 March 1, 2025 01:51
@catamorphism catamorphism force-pushed the more-spec-tests-updates branch from 4fa63b3 to e137728 Compare March 1, 2025 01:55
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/test/intltest/messageformat2test_read_json.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism force-pushed the more-spec-tests-updates branch from e137728 to 82f4fb9 Compare March 3, 2025 20:47
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_errors.cpp is different
  • icu4c/source/i18n/messageformat2_errors.h is different
  • icu4c/source/i18n/messageformat2_evaluation.cpp is different
  • icu4c/source/i18n/messageformat2_evaluation.h is now changed in the branch
  • icu4c/source/i18n/messageformat2_function_registry.cpp is different
  • testdata/message2/spec/functions/integer.json is different
  • testdata/message2/spec/functions/number.json is different
  • testdata/message2/spec/syntax.json is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/messageformat2_evaluation.cpp is different
  • icu4c/source/i18n/messageformat2_evaluation.h is different
  • icu4c/source/i18n/messageformat2_formatter.cpp is now changed in the branch
  • icu4c/source/i18n/messageformat2_function_registry_internal.h is different
  • icu4c/source/i18n/messageformat2_function_registry.cpp is different
  • icu4c/source/i18n/messageformat2.cpp is different
  • icu4c/source/i18n/unicode/messageformat2_formattable.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@srl295 (Member) previously approved these changes Mar 27, 2025, and left a comment:

looks good, comments about:

  • upstreaming patch (check with ICU-TC)
  • improving documentation on enum and possibly name of 'soft' error

Comment on lines 7760 to 7761:

    // ICU PATCH
    codepoint = codepoint1;

@srl295 (Member):

This should be upstreamed as an option.
Also, ICU PATCH here should probably link to a future ticket noting the patch is here, // TODO ICU-nnnn Patched because …

@catamorphism (Contributor, Author):

I'm not sure if it would be accepted upstream; see https://json.nlohmann.me/home/faq/#parse-errors-reading-non-ascii-characters --

"Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259."

Since the codepoint in the test .json file is escaped, I'm not sure why it's considered to be invalid UTF-8, but maybe I'm not understanding something.

@srl295 (Member) left a comment:

Actually, what is the impact of the JSON changes on other clients of the JSON library if any?

Does the MF library supporting malformed UTF-8 mean that the JSON parser must also? is there another solution for test files, such as using numeric escapes?

@catamorphism (Contributor, Author) commented:

> Actually, what is the impact of the JSON changes on other clients of the JSON library if any?

Well, there aren't any other clients within ICU that use the JSON library, as far as I know. If you mean "what would the impact be if the changes were upstreamed?", I'm not sure.

> Does the MF library supporting malformed UTF-8 mean that the JSON parser must also? is there another solution for test files, such as using numeric escapes?

The test file already does use a numeric escape: https://github.com/unicode-org/message-format-wg/blob/main/test/tests/syntax-errors.json#L195

but the (unpatched) JSON parser rejects it even so. Note that this particular test case tests for a syntax error, but the message string still has to be representable within the test file, so this particular problem is not caused by the MF library supporting malformed UTF-8 (although it does), but rather, the storage format for test cases needing to represent malformed UTF-8.

Hope that makes sense, if not, let me know.
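
For readers following the thread, the parser behavior under discussion can be sketched roughly as below. This is a minimal illustration, not the nlohmann/json source and not the actual ICU patch: `parseHex4`, `decodeUnicodeEscape`, `allowLoneSurrogates`, and `checkLoneSurrogateHandling` are hypothetical names, and the lenient branch assumes that the patched line (`codepoint = codepoint1;`) simply passes an unpaired surrogate through as its own code point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Parse exactly four hex digits starting at s[pos]; returns false if malformed.
static bool parseHex4(const std::string& s, std::size_t pos, std::uint32_t& value) {
    if (pos + 4 > s.size()) return false;
    value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = s[i];
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9') digit = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') digit = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') digit = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;
        value = value * 16 + digit;
    }
    return true;
}

// Hypothetical helper: decode one \uXXXX escape (or a surrogate pair) at s[pos].
// On success, stores the code point in `out` and advances `pos` past the escape(s).
// On rejection, returns false and leaves `pos` unchanged.
bool decodeUnicodeEscape(const std::string& s, std::size_t& pos,
                         bool allowLoneSurrogates, std::uint32_t& out) {
    if (pos + 2 > s.size() || s[pos] != '\\' || s[pos + 1] != 'u') return false;
    std::uint32_t unit = 0;
    if (!parseHex4(s, pos + 2, unit)) return false;
    const std::size_t next = pos + 6;

    if (unit >= 0xD800 && unit <= 0xDBFF) {
        // High surrogate: a strict parser requires a following \uDC00..\uDFFF.
        std::uint32_t low = 0;
        if (next + 2 <= s.size() && s[next] == '\\' && s[next + 1] == 'u' &&
            parseHex4(s, next + 2, low) && low >= 0xDC00 && low <= 0xDFFF) {
            out = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            pos = next + 6;
            return true;
        }
        if (!allowLoneSurrogates) return false;  // assumed unpatched behavior
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
        // Low surrogate with no preceding high surrogate.
        if (!allowLoneSurrogates) return false;  // assumed unpatched behavior
    }
    out = unit;  // assumed patched behavior: keep the code unit as-is
    pos = next;
    return true;
}

// Usage sketch: the same escaped input is rejected or accepted depending on the flag.
inline void checkLoneSurrogateHandling() {
    std::size_t pos = 0;
    std::uint32_t cp = 0;
    const bool strictOk = decodeUnicodeEscape("\\uD800", pos, false, cp);
    const bool lenientOk = decodeUnicodeEscape("\\uD800", pos, true, cp);
    assert(!strictOk);
    assert(lenientOk && cp == 0xD800);
}
```

The point of the sketch is that the input is pure ASCII (an escape sequence), so the rejection happens when the escape is interpreted, not when the file bytes are read.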

@srl295 (Member) commented Mar 27, 2025:

> > Actually, what is the impact of the JSON changes on other clients of the JSON library if any?

> Well, there aren't any other clients within ICU that use the JSON library, as far as I know. If you mean "what would the impact be if the changes were upstreamed?", I'm not sure.

Nlohmann wrote somewhere that rejecting unpaired surrogates was required by the JSON standard, but I don't find support for that in either json.org or ECMA-404.

> > Does the MF library supporting malformed UTF-8 mean that the JSON parser must also? is there another solution for test files, such as using numeric escapes?

> The test file already does use a numeric escape: https://github.com/unicode-org/message-format-wg/blob/main/test/tests/syntax-errors.json#L195

> but the (unpatched) JSON parser rejects it even so. Note that this particular test case tests for a syntax error, but the message string still has to be representable within the test file, so this particular problem is not caused by the MF library supporting malformed UTF-8 (although it does), but rather, the storage format for test cases needing to represent malformed UTF-8.

It may be better to have another representation in the test files, perhaps supporting BASE64 or something. Will other implementations also run into trouble using these same test files on other platforms/environments?

something parallel to bidiIsolation

    {
      "description": "various characters as unquoted-literal",
      "src": "e8Kh2J3hmoHigIvigJDigLDigaDigarjgIHugIDvt7B9",
      "exp": "wqHYneGageKAi+KAkOKAsOKBoOKBquOAge6AgO+3sA==",
      "encoding": "base64"
    }

> Hope that makes sense, if not, let me know.

ICU-TC should be aware of the vendoring/patch, and I think it would be a good idea to open a ticket upstream at least. In fact, it would make a pretty simple #ifdef switch upstream.
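
The base64 variant proposed in this comment would require test runners to decode `src` and `exp` before use. A minimal sketch of such a decoder follows, assuming the standard RFC 4648 alphabet with `=` padding. `base64Value`, `decodeBase64`, and `checkProposedTestCase` are hypothetical helpers, not part of ICU or of the message-format-wg test harness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Map one character of the standard base64 alphabet to 0..63, or -1 if invalid.
static int base64Value(char c) {
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Decode padded base64 into raw bytes. The result is a byte string, so it can
// hold sequences that are not well-formed UTF-8. Returns false on bad input.
bool decodeBase64(const std::string& in, std::string& out) {
    out.clear();
    if (in.size() % 4 != 0) return false;

    // Count up to two trailing '=' padding characters.
    std::size_t padding = 0;
    while (padding < 2 && padding < in.size() && in[in.size() - 1 - padding] == '=') {
        ++padding;
    }

    std::uint32_t buffer = 0;  // only the low bits are ever read back
    int bits = 0;              // number of not-yet-emitted bits in buffer
    for (std::size_t i = 0; i < in.size() - padding; ++i) {
        const int v = base64Value(in[i]);
        if (v < 0) return false;  // also rejects '=' in the middle
        buffer = (buffer << 6) | static_cast<std::uint32_t>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
        }
    }
    return true;
}

// Usage sketch with the values from the example test case.
inline void checkProposedTestCase() {
    std::string src;
    std::string expected;
    const bool okSrc = decodeBase64("e8Kh2J3hmoHigIvigJDigLDigaDigarjgIHugIDvt7B9", src);
    const bool okExp = decodeBase64("wqHYneGageKAi+KAkOKAsOKBoOKBquOAge6AgO+3sA==", expected);
    assert(okSrc && okExp);
    // The message source is the expected output wrapped in braces.
    assert(src == "{" + expected + "}");
}
```

For the example values, the decoded `src` is the decoded `exp` wrapped in `{` and `}`, that is, a single placeholder containing the literal.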

@srl295 (Member) previously approved these changes Mar 27, 2025, and left a comment:

Approving. I think the other issues are good to discuss but don't need to block.

@catamorphism (Contributor, Author) commented:

I created unicode-org/message-format-wg#1062 to air your suggestion about using base64 in the test.

@catamorphism (Contributor, Author) commented:

I also created https://unicode-org.atlassian.net/browse/ICU-23090 to track the ICU patch to the JSON parser, and will add that as a comment in the file.

Update spec tests to current version from message-format-wg

  - Update parser for changed name-start grammar rule
  - Validate number literals in :number implementation (since parser no longer does this)
  - Disallow `:number`/`:integer` select option set from variable

    See unicode-org/message-format-wg#1016

    As part of this, un-skip tests where the `bad-option` error is
    expected, and implement validating digit size options
    (pending PR unicode-org#2973 is intended
    to do this more fully)
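
Since the parser no longer validates number literals, that check has to happen inside the function implementation. A minimal sketch of such a check follows, assuming the JSON-like number form (optional minus sign, no leading zeros, optional fraction, optional exponent). `isNumberLiteral` and `checkNumberLiterals` are hypothetical names, and the validation code actually added to ICU in this PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed grammar (JSON-like):
//   ["-"] ("0" / (1-9 *DIGIT)) ["." 1*DIGIT] [("e" / "E") ["-" / "+"] 1*DIGIT]
bool isNumberLiteral(const std::string& s) {
    std::size_t i = 0;
    auto isDigit = [&s](std::size_t k) {
        return k < s.size() && s[k] >= '0' && s[k] <= '9';
    };

    // Optional leading minus; a leading plus is not allowed.
    if (i < s.size() && s[i] == '-') ++i;

    // Integer part: a single "0", or a nonzero digit followed by more digits.
    if (!isDigit(i)) return false;
    if (s[i] == '0') {
        ++i;
    } else {
        while (isDigit(i)) ++i;
    }

    // Optional fraction: "." followed by at least one digit.
    if (i < s.size() && s[i] == '.') {
        ++i;
        if (!isDigit(i)) return false;
        while (isDigit(i)) ++i;
    }

    // Optional exponent: "e" or "E", optional sign, at least one digit.
    if (i < s.size() && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        if (i < s.size() && (s[i] == '-' || s[i] == '+')) ++i;
        if (!isDigit(i)) return false;
        while (isDigit(i)) ++i;
    }

    // Anything left over means the operand is not a valid number literal.
    return i == s.size();
}

// Usage sketch: operands that fail this check would be reported as errors by
// the function rather than by the parser.
inline void checkNumberLiterals() {
    assert(isNumberLiteral("-0.5"));
    assert(isNumberLiteral("1.5E-2"));
    assert(!isNumberLiteral("01"));
    assert(!isNumberLiteral("+1"));
}
```
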
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • icu4c/source/tools/toolutil/json-json.hpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism (Contributor, Author) commented Mar 27, 2025

@srl295 I squashed and added a reference to ICU-23090 as a comment in json-json.hpp. That's the only change in the force-push, if you don't mind re-approving one more time.

@catamorphism catamorphism merged commit d0e30ac into unicode-org:main Mar 27, 2025
101 checks passed