CLDR-15725 compound unit transforms #2156

macchiati · 2022-07-04T04:10:04Z

This PR completes the ticket.

macchiati · 2022-07-05T22:02:58Z

Added a proof of concept, with rules that use regex.
Added a spreadsheet with the current results for just a few rule sets:
https://docs.google.com/spreadsheets/d/1L58nWgxn8sWiOmfTf1VgdejRp12fJqZ91fG5ru4v-VQ/edit#gid=0

Still need to analyze more cases to see what rules would result.

pedberg-icu · 2022-07-05T22:48:55Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/BoundaryTransform.java

@@ -0,0 +1,219 @@
+package org.unicode.cldr.util;


Should this be in an existing or new ICU class (since presumably it will go there eventually)?

Once we get this further along, it (or an optimized version of it) could go into ICU. We should work first on the rules, and make sure that the syntax is powerful enough for what we need (and reasonably clear).

Note that the current implementation uses regex for convenience. We would need to make sure that we use syntax that is easily supported across platforms. It doesn't have to be the same, but would have to be easily transformed into "native" regex.

pedberg-icu · 2022-07-05T22:49:48Z

This looks very promising! (Presumably we want the code in ICU eventually...)

jira-pull-request-webhook · 2022-07-17T19:11:29Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati · 2022-07-18T17:27:12Z

Fleshed out a bit more, and added a first cut at the list rules for ICU. Also tested inserting spaces between Chinese and non-Chinese letters, and changing 'a' to 'an' before vowels in English.
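
For readers following along, here is a minimal sketch of the English behavior described above, written with plain `java.util.regex` rather than the rule syntax in `BoundaryTransform`. The class name `IndefiniteArticleSketch` and the method `format` are hypothetical and not part of this PR. The sketch assumes a single `{0}` placeholder and only looks at the junction between the pattern text and the inserted value. Like the rule in the PR, it keys on spelling rather than pronunciation, so it does not handle "an hour".

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the BoundaryTransform code from the PR.
public class IndefiniteArticleSketch {
    // The text before the placeholder must end in the standalone word "a" plus one space.
    private static final Pattern BEFORE = Pattern.compile("(?<!\\p{L})a $");
    // The inserted value must start with a lowercase ASCII vowel letter.
    private static final Pattern AFTER = Pattern.compile("^[aeiou]");

    public static String format(String pattern, String value) {
        int at = pattern.indexOf("{0}");
        if (at < 0) {
            return pattern; // no placeholder, nothing to insert
        }
        String prefix = pattern.substring(0, at);
        String suffix = pattern.substring(at + 3);
        if (BEFORE.matcher(prefix).find() && AFTER.matcher(value).find()) {
            // Turn the trailing "a " into "an " at the junction.
            prefix = prefix.substring(0, prefix.length() - 1) + "n ";
        }
        return prefix + value + suffix;
    }

    public static void main(String[] args) {
        System.out.println(format("Take a {0}", "book"));  // Take a book
        System.out.println(format("Take a {0}", "apple")); // Take an apple
        System.out.println(format("Take a {0}", "hour"));  // Take a hour (known limitation)
    }
}
```

Unlike the rule in the PR, which requires a space before the "a", this sketch also fires when the pattern starts with "a".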

pedberg-icu · 2022-07-19T00:00:51Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/BoundaryTransform.java

+            " a⦅ ❙⦆[aeiou]→n ;" // doesn't handle "an hour", or "an underground"
+            ))
+        .put("zh", BoundaryTransform.from(
+            "[\\p{L}&&\\p{sc=hani}]⦅❙⦆[\\p{L}&&\\P{sc=hani}]→ ;" // add space at chinese/non-chinese letter junction


Won't this add a space between Han characters and shared CJK punctuation such as brackets?

The \p{L}&& intersection restricts both sides to letters, so punctuation such as brackets is not matched.
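
The effect of the letter restriction can be checked with a small standalone sketch. It is an approximation built on plain `java.util.regex` lookarounds, not the `BoundaryTransform` rule engine, and the class name `HanLetterSpacingSketch` and the method `addSpace` are hypothetical. It uses `\p{IsHan}`, the Java spelling of the Han script property, and mirrors the rule's single direction (Han letter followed by non-Han letter). The sample strings are written as `\u` escapes: U+8218 (舘) and U+8C48 (豈) are Han letters, and U+300C (「) is a CJK opening bracket.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the BoundaryTransform code from the PR.
public class HanLetterSpacingSketch {
    // Zero-width position with a Han letter behind it and a non-Han letter ahead of it.
    private static final Pattern HAN_THEN_OTHER_LETTER =
        Pattern.compile("(?<=[\\p{L}&&\\p{IsHan}])(?=[\\p{L}&&\\P{IsHan}])");

    public static String addSpace(String text) {
        // Insert one space at every matching junction.
        return HAN_THEN_OTHER_LETTER.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Han letter + Latin letter: a space is inserted.
        System.out.println(addSpace("\u8218a").equals("\u8218 a"));             // true
        // Han letter + Han letter: unchanged.
        System.out.println(addSpace("\u8218\u8C48").equals("\u8218\u8C48"));   // true
        // Han letter + CJK bracket: unchanged, because the bracket is not a letter.
        System.out.println(addSpace("\u8218\u300C").equals("\u8218\u300C"));   // true
        // Han letter + ASCII digit: unchanged, because the digit is not a letter.
        System.out.println(addSpace("\u8218" + "1").equals("\u8218" + "1"));   // true
    }
}
```

In this sketch digits are left alone for the same reason brackets are: neither is in `\p{L}`.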

pedberg-icu · 2022-07-19T00:03:05Z

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestBoundaryTransform.java

+            {"en", "Take a {0}", "book", "Take a book"},
+            {"en", "Take a {0}", "apple", "Take an apple"},
+            {"zh", "舘{0}", "豈", "舘豈"},
+            {"zh", "舘{0}", "a", "舘 a"},


Need some test cases with punctuation such as CJK brackets.

pedberg-icu · 2022-07-19T00:03:55Z

Looks good but I had a concern about spacing getting added between Han characters and CJK punctuation like brackets.

macchiati · 2022-07-19T01:10:29Z

Good question. That is only adding spaces between letters (between a Han and a non-Han). I suspect it will need some refinement....

pedberg-icu · 2022-07-19T03:29:50Z

If the current code does not add spaces between Han chars and punctuation, that is good; I was afraid that it did. I think we may need to add spaces between Han chars and Latin digits; I will check on that. But it seems like the current code makes some of the improvements we want without doing any harm, so that is good.

macchiati requested a review from pedberg-icu July 5, 2022 22:01

pedberg-icu reviewed Jul 5, 2022

macchiati marked this pull request as ready for review July 5, 2022 22:58

macchiati added 4 commits July 17, 2022 12:10

CLDR-15725 First cut, very draft

caa6371

CLDR-15725 cleaned up syntax, initial rules

25d5878

CLDR-15725 fix typo

575da75

CLDR-15725 Modified to allow for named groups, with replacement

d20ceca

macchiati force-pushed the CLDR-15725-compound-unit-transforms branch from 5cd9898 to d20ceca Compare July 17, 2022 19:11

CLDR-15725 add a few more boundary conditions, tweak code

43ee1f6

macchiati requested a review from pedberg-icu July 18, 2022 17:25

pedberg-icu reviewed Jul 19, 2022

pedberg-icu previously approved these changes Jul 19, 2022

CLDR-15725 tweaks

5ae382c

macchiati dismissed pedberg-icu’s stale review via 5ae382c July 19, 2022 23:04

macchiati marked this pull request as draft August 4, 2022 23:44
