ICU-13441 For zh/ja, tailor linebreak classes for quotations such as “ 201C and ” 201D #223

Merged (1 commit) on Nov 15, 2018

xhacker
Contributor

@xhacker xhacker commented Oct 18, 2018

Special tailoring for CJ so that “ 201C and ” 201D become OP and CL instead of QU.

line_cj.txt was copied from line.txt. The diffs between line_cj.txt and line.txt are identical to the changes for line_loose_cj.txt and line_normal_cj.txt.
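
To illustrate what the reclassification changes, here is a minimal sketch in Python. It is not ICU's rule syntax and not the content of line_cj.txt; it is a toy model that assumes a simplified reading of four UAX #14 rules (LB13, LB14, LB19, LB31) and treats every character other than the two quotation marks as an ideograph (class ID). The function names and the `cj_tailoring` flag are hypothetical.

```python
# Toy model only: a simplified reading of a few UAX #14 pair rules,
# not ICU's actual break rules or data.
OPEN_QUOTE = "\u201C"   # left double quotation mark
CLOSE_QUOTE = "\u201D"  # right double quotation mark


def line_class(ch, cj_tailoring):
    """Return a simplified line-break class for one character."""
    if ch == OPEN_QUOTE:
        return "OP" if cj_tailoring else "QU"
    if ch == CLOSE_QUOTE:
        return "CL" if cj_tailoring else "QU"
    # Assumption: everything else is an ideograph (ID).
    return "ID"


def break_allowed(before, after):
    """Decide whether a break is allowed between two adjacent classes."""
    if "QU" in (before, after):  # LB19 (simplified): no break around QU
        return False
    if after == "CL":            # LB13 (simplified): no break before CL
        return False
    if before == "OP":           # LB14 (simplified): no break after OP
        return False
    return True                  # LB31: break everywhere else


def break_opportunities(text, cj_tailoring):
    """Return the indices i where a break is allowed before text[i]."""
    classes = [line_class(ch, cj_tailoring) for ch in text]
    return [
        i for i in range(1, len(text))
        if break_allowed(classes[i - 1], classes[i])
    ]


def show(text, cj_tailoring):
    """Render the text with '|' at each break opportunity."""
    breaks = set(break_opportunities(text, cj_tailoring))
    return "".join(("|" if i in breaks else "") + ch
                   for i, ch in enumerate(text))


sample = "\u4ed6\u8bf4\u201c\u4f60\u597d\u201d\u5417"  # 他说“你好”吗
print(show(sample, cj_tailoring=False))
print(show(sample, cj_tailoring=True))
```

In this model, treating both marks as QU glues each quotation mark to both of its neighbours, while treating them as OP and CL restores a break before the opening mark and after the closing mark.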

@CLAassistant

CLAassistant commented Oct 18, 2018

CLA assistant check
All committers have signed the CLA.

@pedberg-icu pedberg-icu self-assigned this Oct 24, 2018
@jungshik
Contributor

Contrary to what @pedberg wrote in the bug, this PR does add a new rule file, line_cj, and increases the data size by tens of kilobytes. We really need to find a way to share rules across locales, because the differences are very small.

@jungshik
Contributor

BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?

@xhacker
Contributor Author

xhacker commented Oct 31, 2018

BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?

They are currently in ICU, although ideally they should be in CLDR.

@sffc
Member

sffc commented Nov 3, 2018

This PR is approved. Should it be merged?

@jungshik
Contributor

jungshik commented Nov 3, 2018

IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This can open a floodgate for other locales + quotation character combinations, too.

@xhacker
Contributor Author

xhacker commented Nov 4, 2018

IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This can open a floodgate for other locales + quotation character combinations, too.

Chinese and Japanese text has no spaces to provide break opportunities. If we don’t allow breaking at quotation marks, the result can be ridiculous in some cases:

[Screenshot: 2018-08-29 3 17 20]

Other locales may have the same issue but the results won’t be that bad.

@pedberg-icu
Contributor

pedberg-icu commented Nov 7, 2018

As opposed to what @pedberg wrote in the bug, this PR does add a new rule file, line_cj and increased the data size by tens of kilobytes.

Yes, when I wrote that in the bug, I had forgotten that the default behavior for CJ is not the same as the behavior in line_normal_cj.txt (that is, for CJ, in addition to the 3 W3C linebreak modes of strict, normal, and loose, we also have the default behavior which is none of those). We could reclaim that tens of kilobytes by actually making the default CJ behavior be the same as the line_normal_cj behavior.

BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?

For the standard rules, it is basically Unicode (UAX #29 and UAX #14), sometimes with some emoji-related extensions from UTS #51 or elsewhere. For the tailored rules, it is generally ICU; these are then manually synced over to the CLDR rules late in each CLDR development cycle. But they do not match exactly because, for instance, CLDR does not have the dictionary-based breaking that ICU supports.

This PR is approved. Should it be merged?
IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This can open a floodgate for other locales + quotation character combinations, too.

Dongyuan responded about the motivation for the quote-related CJ tailoring. But since there may not be agreement on merging this, I will bring it to the TC meeting.

@pedberg-icu pedberg-icu merged commit 46a888b into unicode-org:master Nov 15, 2018
@pedberg-icu
Contributor

pedberg-icu commented Nov 15, 2018

I just created [ICU-20274] to consider whether the default Japanese linebreak rules should match the W3C line-break=normal rules as defined for Japanese (i.e., whether ja should default to line_normal_cj rather than line_normal).

@xhacker xhacker deleted the ICU-13441 branch November 15, 2018 05:15
@jungshik
Contributor

Thank you, Peter, for the explanation about the canonical source.

We could reclaim that tens of kilobytes by actually making the default CJ behavior be the same as the line_normal_cj behavior.

I'll follow up on ICU-20274.

As for the motivation for this PR, I know why this PR was made, but I don't think it's just Chinese or Japanese that may not have a space before or after U+201[CD].

@FrankYFTang
Contributor

Could we fold the _cj change into the base rules instead? That would reduce the need for _cj and would benefit other languages, for example Yi pages that use U+201C/201D (see the pages at http://yi.people.com.cn).

For example, the following text, copied from http://yi.people.com.cn/157841/15690736.html, contains no spaces and also uses “ and ”:
ꃀꀉꒉꊰꉆꃢꌠꉌꊈꃄꌬꂿꄷ,ꄈꍏꑷꐊꈴꃀꏃꃢꈿ,ꂱꄷꃅꍔꃚꏲꉻꄺꀱꌋꆀꌅꊋꏭꆦ、ꐊꑷꃅꃹꅼꃅꄺꀱ、ꐊꑷꃅꏦꈴꇩꏲꌠꄹꎆ、“ꊯꌕꉬ”ꄐꎖꏤꄻ、ꐊꑷꃅꏦꇨꃅꄈꏱꄻꑠꌤꃆꀉꒉꀊꂾꌠꏤꇽꄐꁆ。ꉬꈎꅑꅸꇁꌠ,ꉪꊇꏓꇲꄉ“ꉬꑵꋍꉻ”ꏓꄜꄐꉻꄹꎆ、ꋮꄮꃅ“ꇖꑵꐊꑷ”ꌠꄐꇐꄐꉻꄹꎆ,“ꊰꑋꉬ”ꄐꎖꃅꊋꐛ,“ꊯꌕꉬ”ꄐꎖꁣꀋꐥꃅꃅ,ꄈꌋꆀꇩꏤꌤꑘꐊꑷꃅꐛꊂꀊꏀꌠꀹꄻ。

@FrankYFTang
Contributor

Another language without spaces that uses U+201C/201D is Tibetan. See the pages at http://tibet.people.com.cn. For example, the first line of the text at
http://tibet.cpc.people.com.cn/15760510.html

  ཤིན་ཧྭ་གསར་འགྱུར་ཁང་པེ་ཅིང་ནས་ཟླ་4ཚེས་1ཉིན་གློག་འཕྲིན་འབྱོར་གསལ། གོང་ཚེས་ཉིན་རྒྱལ་ཁབ་ཀྱི་ཀྲུའུ་ཞི་ཞི་ཅིན་ཕིང་གིས་མི་དམངས་ཚོགས་ཁང་ཆེ་མོར“བསླབ་པ་རྒན་གྲས་ཚོགས་པའི”འཐུས་མི་ཚོགས་པ་དང་མཇལ་འཕྲད་གནང་བ་རེད།

shows that Tibetan text has no spaces and also uses U+201C/201D. So this need is clearly not just for zh and ja.

@FrankYFTang
Contributor

https://www.thairath.co.th/content/1536760 shows that Thai also uses U+201C/201D, and we know that Thai does not rely on spaces for line breaking.
