ICU-13441 For zh/ja, tailor linebreak classes for quotations such as “ 201C and ” 201D #223

xhacker · 2018-10-18T09:21:39Z

Special tailoring for CJ so that “ 201C and ” 201D become OP and CL instead of QU.

line_cj.txt was copied from line.txt. The diffs between line_cj.txt and line.txt are identical to the changes for line_loose_cj.txt and line_normal_cj.txt.
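To illustrate the effect of the reclassification, here is a minimal sketch in Python. It is a toy model, not ICU's rule engine and not the contents of line_cj.txt. It assumes a simplified reading of four UAX #14 rules (LB13, LB14, the classic form of LB19, and LB31), ignores spaces, and treats every character other than the two quotation marks as ID. The names `DEFAULT_CLASSES`, `CJ_TAILORED_CLASSES`, `line_class`, and `break_opportunities` are hypothetical.

```python
# Toy model of a few UAX #14 pair rules; NOT ICU's actual rule engine.
# Assumption: every character except U+201C/U+201D is class ID (ideographic).
DEFAULT_CLASSES = {"\u201C": "QU", "\u201D": "QU"}      # untailored classes
CJ_TAILORED_CLASSES = {"\u201C": "OP", "\u201D": "CL"}  # classes after this tailoring


def line_class(ch, overrides):
    """Return a simplified line-break class for one character."""
    return overrides.get(ch, "ID")


def break_opportunities(text, overrides):
    """Return offsets i where a break is allowed between text[i-1] and text[i]."""
    breaks = []
    for i in range(1, len(text)):
        before = line_class(text[i - 1], overrides)
        after = line_class(text[i], overrides)
        if after == "CL":            # LB13: no break before a closing mark
            continue
        if before == "OP":           # LB14: no break after an opening mark
            continue
        if "QU" in (before, after):  # LB19 (classic form): no break around QU
            continue
        breaks.append(i)             # LB31: break everywhere else
    return breaks


sample = "中文“引号”测试"
print(break_opportunities(sample, DEFAULT_CLASSES))      # [1, 4, 7]
print(break_opportunities(sample, CJ_TAILORED_CLASSES))  # [1, 2, 4, 6, 7]
```

With QU, the quoted run is glued to the characters on both sides; with OP and CL, a break becomes possible before “ and after ”, while the marks still stay attached to the quoted text.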

Checklist

Issue filed: https://unicode-org.atlassian.net/browse/ICU-13441
Updated PR title and link in previous line to include Issue number
Issue accepted
Tests included
Documentation is changed or added


CLAassistant · 2018-10-18T09:21:48Z

All committers have signed the CLA.

jungshik · 2018-10-31T02:52:22Z

Contrary to what @pedberg wrote in the bug, this PR does add a new rule file, line_cj, and increases the data size by tens of kilobytes. We really need to find a way to share rules across locales because the differences are very small.

jungshik · 2018-10-31T02:55:17Z

BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?

xhacker · 2018-10-31T07:24:02Z

> BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?

They are currently in ICU, although ideally they should be in CLDR.

sffc · 2018-11-03T05:37:09Z

This PR is approved. Should it be merged?

jungshik · 2018-11-03T14:45:22Z

IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This can open a floodgate for other locales + quotation character combinations, too.

xhacker · 2018-11-04T04:55:42Z

> IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This can open a floodgate for other locales + quotation character combinations, too.

CJ text doesn’t have spaces to provide break opportunities. If we don’t allow breaking at quotation marks, in some cases the result can be ridiculous:

Other locales may have the same issue but the results won’t be that bad.

pedberg-icu · 2018-11-07T00:43:29Z

> Contrary to what @pedberg wrote in the bug, this PR does add a new rule file, line_cj, and increases the data size by tens of kilobytes.

Yes, when I wrote that in the bug, I had forgotten that the default behavior for CJ is not the same as the behavior in line_normal_cj.txt (that is, for CJ, in addition to the 3 W3C linebreak modes of strict, normal, and loose, we also have the default behavior which is none of those). We could reclaim that tens of kilobytes by actually making the default CJ behavior be the same as the line_normal_cj behavior.

> BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?

For the standard rules, it is basically Unicode (UAX #29 and UAX #14), sometimes with some emoji-related extensions from UTS #51 or elsewhere. For the tailored rules, it is generally ICU; these are then manually synced over to the CLDR rules late in each CLDR development cycle. But they do not exactly match because, for instance, CLDR does not have the dictionary-based breaking that ICU supports.

> This PR is approved. Should it be merged?

> IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This can open a floodgate for other locales + quotation character combinations, too.

Dongyuan responded about the motivation for the quote-related CJ tailoring. But it seems there may not be agreement on merging this, so I will bring it to the TC meeting.

pedberg-icu · 2018-11-15T04:32:20Z

I just created [ICU-20274] to consider whether the default Japanese linebreak rules should match the W3C line-break=normal rules as defined for Japanese (i.e., whether ja should default to line_normal_cj rather than line_normal).

jungshik · 2018-11-16T16:32:39Z

Thank you, Peter, for the explanation about the canonical source.

> We could reclaim that tens of kilobytes by actually making the default CJ behavior be the same as the line_normal_cj behavior.

I'll follow up on ICU-20274.

As for the motivation for this PR, I know why this PR was made, but I don't think it's just Chinese or Japanese that may not have a space before or after U+201[CD].

FrankYFTang · 2019-04-04T06:42:11Z

Could we fold the change in _cj into the base rules instead? That would reduce the need for _cj and would benefit other languages, for example the use of U+201C/201D in Yi pages (see the pages at http://yi.people.com.cn for example).

For example, the following text, copied from http://yi.people.com.cn/157841/15690736.html , has no spaces and also uses “ and ”:
ꃀꀉꒉꊰꉆꃢꌠꉌꊈꃄꌬꂿꄷ，ꄈꍏꑷꐊꈴꃀꏃꃢꈿ，ꂱꄷꃅꍔꃚꏲꉻꄺꀱꌋꆀꌅꊋꏭꆦ、ꐊꑷꃅꃹꅼꃅꄺꀱ、ꐊꑷꃅꏦꈴꇩꏲꌠꄹꎆ、“ꊯꌕꉬ”ꄐꎖꏤꄻ、ꐊꑷꃅꏦꇨꃅꄈꏱꄻꑠꌤꃆꀉꒉꀊꂾꌠꏤꇽꄐꁆ。ꉬꈎꅑꅸꇁꌠ，ꉪꊇꏓꇲꄉ“ꉬꑵꋍꉻ”ꏓꄜꄐꉻꄹꎆ、ꋮꄮꃅ“ꇖꑵꐊꑷ”ꌠꄐꇐꄐꉻꄹꎆ，“ꊰꑋꉬ”ꄐꎖꃅꊋꐛ，“ꊯꌕꉬ”ꄐꎖꁣꀋꐥꃅꃅ，ꄈꌋꆀꇩꏤꌤꑘꐊꑷꃅꐛꊂꀊꏀꌠꀹꄻ。

FrankYFTang · 2019-04-04T06:51:49Z

Another language that has no spaces and uses U+201C/201D is Tibetan. See the examples on pages at http://tibet.people.com.cn . For example, the first line of the text of
http://tibet.cpc.people.com.cn/15760510.html

  ཤིན་ཧྭ་གསར་འགྱུར་ཁང་པེ་ཅིང་ནས་ཟླ་4ཚེས་1ཉིན་གློག་འཕྲིན་འབྱོར་གསལ། གོང་ཚེས་ཉིན་རྒྱལ་ཁབ་ཀྱི་ཀྲུའུ་ཞི་ཞི་ཅིན་ཕིང་གིས་མི་དམངས་ཚོགས་ཁང་ཆེ་མོར“བསླབ་པ་རྒན་གྲས་ཚོགས་པའི”འཐུས་མི་ཚོགས་པ་དང་མཇལ་འཕྲད་གནང་བ་རེད།

shows that there are no spaces in Tibetan and that it also uses U+201C/201D. So this need is clearly not just for zh and ja.

FrankYFTang · 2019-04-04T07:05:37Z

https://www.thairath.co.th/content/1536760 shows that Thai also uses U+201C/201D, and we know Thai does not rely on spaces for line breaking.

Commit c5e3f6b: ICU-13441 For zh/ja, tailor linebreak classes for quotations such as “ 201C and ” 201D

pedberg-icu self-assigned this Oct 24, 2018

pedberg-icu requested review from aheninger and pedberg-icu October 24, 2018 18:11

aheninger approved these changes Oct 30, 2018

pedberg-icu approved these changes Nov 7, 2018

pedberg-icu merged commit 46a888b into unicode-org:master Nov 15, 2018

xhacker deleted the ICU-13441 branch November 15, 2018 05:15

xfq mentioned this pull request Feb 20, 2020

UAX #14 for line-breaking with quotation marks w3c/clreq#245

Open

peng1999 mentioned this pull request May 30, 2023

Use icu4x for linebreaking algorithm typst/typst#1355

Merged

YDX-2147483647 mentioned this pull request Jan 16, 2024

Chinese punctuation is placed at the beginning of the line in some cases typst/typst#3082

Closed
