Updated Unikemet regular expressions and regenerated UCDXML #1160

Merged
merged 2 commits into from
Jun 14, 2025

Conversation

Contributor

@jowilco jowilco commented Jun 14, 2025

Fix for https://github.com/unicode-org/properties/issues/435

Updated Unikemet regular expressions; however, RegEx for kEH_FVal doesn't validate against current data.

There are two issues:

  1. The separator is ambiguous: "The delimiters '/' or '|' are used to separate alternative values..."
  2. The syntax specifies letters and combining characters (e.g., h + COMBINING DOT BELOW [U+0323], rather than ḥ [U+1E25]). But Unikemet.txt in UCD uses the composed forms.

To avoid this for now, we'll treat kEH_FVal as SINGLE_VALUED.
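
As a rough illustration of the workaround, the sketch below keeps the value of a property listed as single-valued as one opaque string, while other properties are split on either delimiter. It is written in Python for brevity and is not the project's code: the function `split_value`, the property name `kExample`, and the sample values are placeholders, the `SINGLE_VALUED` set only mirrors the name used above, and splitting on both `/` and `|` is just one reading of the quoted sentence.

```python
import re

# Properties whose values are kept whole (assumption, mirroring the workaround).
SINGLE_VALUED = {"kEH_FVal"}

# One reading of the quoted rule: either "/" or "|" separates alternatives.
DELIMITERS = re.compile(r"[/|]")

def split_value(prop: str, value: str) -> list:
    """Return the alternatives in value, or [value] for single-valued properties."""
    if prop in SINGLE_VALUED:
        return [value]
    return DELIMITERS.split(value)

print(split_value("kEH_FVal", "a/b|c"))   # ['a/b|c']
print(split_value("kExample", "a/b|c"))   # ['a', 'b', 'c']
```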

I updated the rest of the Unikemet regular expressions from the latest version of UAX 57.

…sn't validate against current data, so hack for now.
Member

@markusicu markusicu left a comment

changes lgtm, assuming CI is happy now -- please squash-and-merge when ready

@markusicu
Member

> The separator is ambiguous: "The delimiters '/' or '|' are used to separate alternative values..."

I haven't looked at it. If it's ambiguous or otherwise problematic, do you want to report it? Via a PAG issue?

@jowilco
Contributor Author

jowilco commented Jun 14, 2025

> > The separator is ambiguous: "The delimiters '/' or '|' are used to separate alternative values..."
>
> I haven't looked at it. If it's ambiguous or otherwise problematic, do you want to report it? Via a PAG issue?

Add for the next PAG discussion? I was more worried about the composed vs decomposed characters for the syntax.

@jowilco jowilco merged commit 73a2410 into unicode-org:main Jun 14, 2025
20 checks passed
@jowilco jowilco deleted the issue435_kEH_FVal branch June 14, 2025 00:31
@markusicu
Member

> Add for the next PAG discussion?

yes, asap given the timing and holidays/vacations

> I was more worried about the composed vs decomposed characters for the syntax.

I didn't quite understand the problem. Usually a composite of letter+mark is also a letter.
Can you give an example from spec & data where you think there is a problem?

@jowilco
Contributor Author

jowilco commented Jun 14, 2025

Hi Markus:
Compare the syntax for:

Then compare those to U+13002 kEH_FVal ḥmsꞽ from [Unikemet.txt](https://www.unicode.org/Public/UCD/latest/ucd/Unikemet.txt).

The ḥ (U+1E25) that begins the kEH_FVal value for U+13002 is specified in https://www.unicode.org/reports/tr57/#kEH_FVal, and would be covered by h combined with \x{323} from https://www.unicode.org/reports/tr57/proposed.html#kEH_FVal
However, I've tried the syntax with two different regular expression tools, and neither of them matches ḥ against [h\x{323}]+.

So, the problem is that the syntax makes sense for humans, but not for (at least) some parsers.
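
The mismatch can be reproduced in a few lines. This is a minimal sketch in Python (not one of the two tools mentioned above), and the character class is the simplified `[h\x{323}]+` from this comment rather than the full UAX 57 syntax. The helper `code_points` is introduced here only for illustration.

```python
import re
import unicodedata

# Simplified character class from the comment: "h" plus COMBINING DOT BELOW.
PATTERN = re.compile("[h\u0323]+")

composed = "\u1e25"                                  # ḥ as a single precomposed code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # "h" followed by U+0323

def code_points(text: str) -> list:
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points(composed))                      # ['U+1E25']
print(code_points(decomposed))                    # ['U+0068', 'U+0323']
print(PATTERN.fullmatch(composed) is not None)    # False
print(PATTERN.fullmatch(decomposed) is not None)  # True
```

The character class lists two separate code points, so a regex engine that compares code points has nothing to match U+1E25 against.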

@markusicu
Member

The spec says

> All data in the Unikemet database is stored in UTF-8 using Normalization Form C (NFC). Note, however, that the “Syntax” descriptions below, used for validation of property values, operate on Normalization Form D (NFD), primarily because that makes the regular expressions simpler.

I think that means that for validation you have to normalize the values to NFD and then run the regexes. But I would expect UCDXML to store the original values (in NFC form) just like Unikemet.txt.
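
That reading could be sketched roughly as follows: normalize to NFD only for the regex check, and keep the original NFC string for output. The names `validate_nfd` and `checked_value`, the simplified pattern, and the sample value are assumptions for illustration. The real syntax is defined in UAX 57, and this is not the code used to generate UCDXML.

```python
import re
import unicodedata

# Hypothetical, simplified stand-in for a UAX 57 "Syntax" regex (assumption).
SIMPLIFIED_SYNTAX = re.compile("[a-z\u0323]+")

def validate_nfd(value: str, syntax: re.Pattern) -> bool:
    """Check value against syntax after converting it to NFD."""
    return syntax.fullmatch(unicodedata.normalize("NFD", value)) is not None

def checked_value(value: str, syntax: re.Pattern) -> str:
    """Return value unchanged (still in its stored form) if its NFD form validates."""
    if not validate_nfd(value, syntax):
        raise ValueError(f"value does not match syntax: {value!r}")
    return value  # normalization was applied only for the check

# Hypothetical sample value containing the precomposed ḥ (U+1E25).
stored = checked_value("\u1e25ms", SIMPLIFIED_SYNTAX)
print(stored == "\u1e25ms")  # True
```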

@jowilco
Contributor Author

jowilco commented Jun 14, 2025

Do we make that caveat in any other UAX?

@markusicu
Member

> Do we make that caveat in any other UAX?

Same wording in UAX38: https://www.unicode.org/reports/tr38/#DatabaseDesign

I haven't checked others.
