CLDR-15725 compound unit transforms #2156
Put in a proof of concept, with rules using regex. We still need to analyze more cases to see what rules would result.
@@ -0,0 +1,219 @@
package org.unicode.cldr.util;
Should this be in an existing or new ICU class (since presumably it will go there eventually)?
Once we get this further along, it (or an optimized version of it) could go into ICU. We should work first on the rules, and make sure that the syntax is powerful enough for what we need (and reasonably clear).
Note that the current implementation uses regex for convenience. We would need to make sure that we use syntax that is easily supported across platforms. It doesn't have to be the same, but would have to be easily transformed into "native" regex.
This looks very promising! (Presumably we want the code in ICU eventually...)
Force-pushed from 5cd9898 to d20ceca.
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
Fleshed out a bit more, added a first cut at the list rules for ICU. Also tested inserting spaces between Chinese and non-Chinese letters, and changing 'a' to 'an' before vowels in English.
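For readers following along, the English behavior described here can be approximated with plain java.util.regex. This is only a sketch under assumptions, not the PR's BoundaryTransform implementation: the class name, the `format` helper, and the use of U+2759 as a temporary marker for the placeholder boundary are all hypothetical, and the vowel test is purely spelling-based, so it shares the limitation noted in the rule's comment (for example, it does not produce "an hour").

```java
import java.util.regex.Pattern;

// Hypothetical sketch: approximates "change 'a' to 'an' before a vowel"
// at the point where a value is substituted into a pattern.
final class ArticleBoundarySketch {
    // Temporary marker for the placeholder boundary (U+2759); an assumption
    // of this sketch, chosen because it is unlikely to occur in real text.
    private static final String BOUNDARY = "\u2759";

    // " a" followed by a space, the boundary, and a vowel-initial value.
    private static final Pattern A_BEFORE_VOWEL =
            Pattern.compile("(?<= a) " + BOUNDARY + "(?=[aeiou])");

    static String format(String pattern, String value) {
        // Mark where the substituted value begins.
        String marked = pattern.replace("{0}", BOUNDARY + value);
        // Turn "a <boundary>" into "an " when the value starts with a vowel letter.
        String fixed = A_BEFORE_VOWEL.matcher(marked).replaceAll("n ");
        // Drop any marker that was not consumed by the rule.
        return fixed.replace(BOUNDARY, "");
    }
}
```

With this sketch, `ArticleBoundarySketch.format("Take a {0}", "apple")` yields "Take an apple", while "book" and "hour" are left with "a".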
" a⦅ ❙⦆[aeiou]→n ;" // doesn't handle "an hour", or "an underground" | ||
)) | ||
.put("zh", BoundaryTransform.from( | ||
"[\\p{L}&&\\p{sc=hani}]⦅❙⦆[\\p{L}&&\\P{sc=hani}]→ ;" // add space at chinese/non-chinese letter junction |
Won't this add a space between Han characters and shared CJK punctuation, such as brackets?
The \p{L}&& restricts it to letters
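To make that point concrete, here is a minimal Java sketch of the same idea using standard regex lookarounds. It is not the PR's BoundaryTransform code: the class and method names are hypothetical, and `\p{IsHan}` is used as the java.util.regex spelling of the Han script property. Like the rule in the diff, it only covers a Han letter followed by a non-Han letter, not the reverse order.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: insert a space where a Han letter is immediately
// followed by a non-Han letter. Punctuation and digits are not letters,
// so the intersection with \p{L} excludes them on both sides.
final class HanBoundarySketch {
    private static final Pattern HAN_THEN_OTHER_LETTER = Pattern.compile(
            "(?<=[\\p{L}&&\\p{IsHan}])"      // preceded by a Han letter
            + "(?=[\\p{L}&&\\P{IsHan}])");   // followed by a non-Han letter

    static String addSpace(String text) {
        // The match is zero-width, so this inserts a space without
        // consuming either neighboring character.
        return HAN_THEN_OTHER_LETTER.matcher(text).replaceAll(" ");
    }
}
```

Under these assumptions, a Han character followed by "a" gains a space, while a Han character followed by a CJK bracket such as U+300C, or by a digit, is left unchanged.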
{"en", "Take a {0}", "book", "Take a book"},
{"en", "Take a {0}", "apple", "Take an apple"},
{"zh", "舘{0}", "豈", "舘豈"},
{"zh", "舘{0}", "a", "舘 a"},
Need some test cases with punctuation, such as CJK brackets.
Looks good, but I had a concern about spacing getting added between Han characters and CJK punctuation like brackets.
Good question. That is only adding spaces between letters (between a Han
and a non-Han). I suspect it will need some refinement....
…On Mon, Jul 18, 2022 at 5:04 PM Peter Edberg ***@***.***> wrote:
Looks good but I had a concern about spacing getting added between Han
characters and CJK punctuation like brackets.
If the current code does not add spaces between Han characters and punctuation, that is good; I was afraid that it did. I think we may need to add spaces between Han characters and Latin digits; I will check on that. But the current code seems to make some of the improvements we want without doing any harm, so that is good.