CLDR-15725 compound unit transforms #2156

Draft
wants to merge 6 commits into
base: main
from

Conversation

macchiati
Member

CLDR-15725

  • This PR completes the ticket.

@macchiati macchiati requested a review from pedberg-icu July 5, 2022 22:01
@macchiati
Member Author

Put in a proof of concept, with rules using regex.
Added spreadsheet with current results, just a few rule sets.
https://docs.google.com/spreadsheets/d/1L58nWgxn8sWiOmfTf1VgdejRp12fJqZ91fG5ru4v-VQ/edit#gid=0

Need to analyze more cases, seeing what rules would result.

@@ -0,0 +1,219 @@
package org.unicode.cldr.util;
Contributor

Should this be in an existing or new icu class (since presumably it will go there eventually)?

Member Author

Once we get this further along, it (or an optimized version of it) could go into ICU. We should work first on the rules, and make sure that the syntax is powerful enough for what we need (and reasonably clear).

Member Author

Note that the current implementation uses regex for convenience. We would need to make sure that we use syntax that is easily supported across platforms. It doesn't have to be the same, but would have to be easily transformed into "native" regex.

@pedberg-icu
Contributor

This looks very promising! (Presumably we want the code in ICU eventually...)

@macchiati macchiati marked this pull request as ready for review July 5, 2022 22:58
@macchiati macchiati force-pushed the CLDR-15725-compound-unit-transforms branch from 5cd9898 to d20ceca Compare July 17, 2022 19:11
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@macchiati macchiati requested a review from pedberg-icu July 18, 2022 17:25
@macchiati
Member Author

Fleshed out a bit more, added a first cut at the list rules for ICU. Also tested inserting spaces between Chinese and non-Chinese letters, and changing 'a' to 'an' before vowels in English.

" a⦅ ❙⦆[aeiou]→n ;" // doesn't handle "an hour", or "an underground"
))
.put("zh", BoundaryTransform.from(
"[\\p{L}&&\\p{sc=hani}]⦅❙⦆[\\p{L}&&\\P{sc=hani}]→ ;" // add space at chinese/non-chinese letter junction
Contributor

Won't this add a space between Han characters and shared CJK punctuation, such as brackets?

Member Author

The \p{L}&& intersection restricts each side of the boundary to letters, so punctuation such as CJK brackets does not match.
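
To make the point about \p{L}&& concrete, here is a minimal standalone sketch, assuming plain java.util.regex rather than the PR's BoundaryTransform class. The class and method names (HanLetterClasses, isHanLetter, isNonHanLetter) are hypothetical, and \p{IsHan} is used as the Java spelling of the script test written as \p{sc=hani} in the rule. It only checks which single characters fall into each of the two character classes: 舘 (U+8218) is a Han letter, "a" is a non-Han letter, and 「 (U+300C), 。 (U+3002) and the digit 1 are in neither class because they are not letters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR: shows how "&&" intersects
// character classes in java.util.regex.
class HanLetterClasses {
    // Letters that are also in the Han script.
    static final Pattern HAN_LETTER = Pattern.compile("[\\p{L}&&\\p{IsHan}]");
    // Letters that are NOT in the Han script.
    static final Pattern NON_HAN_LETTER = Pattern.compile("[\\p{L}&&\\P{IsHan}]");

    static boolean isHanLetter(String s) {
        return HAN_LETTER.matcher(s).matches();
    }

    static boolean isNonHanLetter(String s) {
        return NON_HAN_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+8218 Han ideograph, Latin "a", U+300C left corner bracket,
        // U+3002 ideographic full stop, ASCII digit one.
        String[] samples = {"\u8218", "a", "\u300C", "\u3002", "1"};
        for (String s : samples) {
            System.out.println(String.format("U+%04X", s.codePointAt(0))
                + " hanLetter=" + isHanLetter(s)
                + " nonHanLetter=" + isNonHanLetter(s));
        }
    }
}
```

Since the rule needs a Han letter on the left and a non-Han letter on the right, a bracket or a digit on either side should leave the boundary untouched under this reading.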

{"en", "Take a {0}", "book", "Take a book"},
{"en", "Take a {0}", "apple", "Take an apple"},
{"zh", "舘{0}", "豈", "舘豈"},
{"zh", "舘{0}", "a", "舘 a"},
Contributor

Need some test cases with punctuation, such as CJK brackets.
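
As a rough illustration of what such cases might check, here is a standalone sketch that re-creates the two rules from the diff with java.util.regex lookarounds. It is not the PR's BoundaryTransform implementation: the class name BoundaryRuleSketch, the apply method, and the reading of the rule syntax (context before, the ⦅…⦆ span containing the ❙ boundary marker being replaced, context after) are all assumptions. The bracket rows are hypothetical additions whose expected results follow from the rule only matching letters.

```java
import java.util.regex.Pattern;

// Hypothetical harness, not the PR's code: approximates the "en" and "zh"
// boundary rules with lookbehind/lookahead around a boundary marker.
class BoundaryRuleSketch {
    static final String MARK = "\u2759"; // boundary marker used in the rules

    // " a" before, " <MARK>" matched, a vowel after: replaced by "n ".
    static final Pattern EN =
        Pattern.compile("(?<= a) " + MARK + "(?=[aeiou])");
    // Han letter before, MARK matched, non-Han letter after: replaced by " ".
    static final Pattern ZH = Pattern.compile(
        "(?<=[\\p{L}&&\\p{IsHan}])" + MARK + "(?=[\\p{L}&&\\P{IsHan}])");

    static String apply(String locale, String pattern, String value) {
        // Mark the start of the substituted value, apply the locale's rule,
        // then drop any marker the rule did not consume.
        String s = pattern.replace("{0}", MARK + value);
        if (locale.equals("en")) {
            s = EN.matcher(s).replaceAll("n ");
        } else if (locale.equals("zh")) {
            s = ZH.matcher(s).replaceAll(" ");
        }
        return s.replace(MARK, "");
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"en", "Take a {0}", "book", "Take a book"},
            {"en", "Take a {0}", "apple", "Take an apple"},
            {"zh", "\u8218{0}", "a", "\u8218 a"},
            // Hypothetical punctuation rows: no space expected.
            {"zh", "\u8218{0}", "\u300Ca\u300D", "\u8218\u300Ca\u300D"},
            {"zh", "\u300C{0}\u300D", "a", "\u300Ca\u300D"},
        };
        for (String[] r : rows) {
            String actual = apply(r[0], r[1], r[2]);
            System.out.println(r[0] + ": " + (actual.equals(r[3]) ? "ok" : "MISMATCH"));
        }
    }
}
```

Under this reading, digits are also left alone (for example 舘 followed by 1 gets no space), because a digit is not a letter.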

@pedberg-icu
Contributor

Looks good but I had a concern about spacing getting added between Han characters and CJK punctuation like brackets.

@macchiati
Member Author

macchiati commented Jul 19, 2022 via email

@pedberg-icu
Contributor

If the current code does not add spaces between Han characters and punctuation, that is good; I was afraid that it did. I think we may need to add spaces between Han characters and Latin digits, and will check on that. But it seems like the current code makes some of the improvements we want without doing any harm, so that is good.

pedberg-icu
pedberg-icu previously approved these changes Jul 19, 2022
@macchiati macchiati marked this pull request as draft August 4, 2022 23:44