Updated Unikemet regular expressions and regenerated UCDXML #1160
…sn't validate against current data, so hack for now.
changes lgtm, assuming CI is happy now -- please squash-and-merge when ready
I haven't looked at it. If it's ambiguous or otherwise problematic, do you want to report it? Via a PAG issue?
Add for the next PAG discussion? I was more worried about the composed vs decomposed characters for the syntax.
yes, asap given the timing and holidays/vacations
I didn't quite understand the problem. Usually a composite of letter+mark is also a letter.
Hi Markus:
Then compare those to
The ḥ (U+1E25) from the first char of kEH_FVal for cp 13002 is specified in https://www.unicode.org/reports/tr57/#kEH_FVal, and would be covered by h combined with \x{323} from https://www.unicode.org/reports/tr57/proposed.html#kEH_FVal
So, the problem is that the syntax makes sense for humans, but not for (at least) some parsers.
I think that means that for validation you have to normalize the values to NFD and then run the regexes. But I would expect UCDXML to store the original values (in NFC form) just like Unikemet.txt.
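A minimal sketch of that normalize-then-validate idea, in Python for brevity. The pattern here is a hypothetical, simplified stand-in (a base letter optionally followed by U+0323), not the actual kEH_FVal regex from UAX #57, and `validates` is an illustrative name rather than anything in the tooling:

```python
import re
import unicodedata

# Hypothetical, simplified pattern: one or more ASCII letters, each optionally
# followed by COMBINING DOT BELOW (U+0323). NOT the real kEH_FVal regex.
PATTERN = re.compile(r"(?:[a-z]\u0323?)+")


def validates(value: str) -> bool:
    """Normalize the value to NFD, then match it in full against PATTERN."""
    decomposed = unicodedata.normalize("NFD", value)
    return PATTERN.fullmatch(decomposed) is not None


# Stored (NFC) form: the first character is the precomposed U+1E25.
nfc_value = "\u1e25tp"

# Matching the stored form directly fails: U+1E25 is not in [a-z].
print(PATTERN.fullmatch(nfc_value) is not None)

# After NFD, U+1E25 becomes "h" + U+0323, which the pattern accepts.
print(validates(nfc_value))
```

The stored value is left untouched; only the copy handed to the regex is decomposed, so the data can stay in NFC form as it is in Unikemet.txt.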
Do we make that caveat in any other UAX?
Same wording in UAX38: https://www.unicode.org/reports/tr38/#DatabaseDesign I haven't checked others.
Fix for https://github.com/unicode-org/properties/issues/435
Updated Unikemet regular expressions; however, RegEx for kEH_FVal doesn't validate against current data.
There are two issues:
To avoid this for now, we'll treat kEH_FVal as SINGLE_VALUED.
I updated the rest of the Unikemet regular expressions from the latest version of UAX 57.