ICU-13441 For zh/ja, tailor linebreak classes for quotations such as “ 201C and ” 201D #223
Contrary to what @pedberg wrote in the bug, this PR does add a new rule file, line_cj, and increases the data size by tens of kilobytes. We really need to find a way to share rules across locales, because the differences are very small.
BTW, what's the canonical source of the brk rules? Is it ICU rather than CLDR?
They are currently in ICU, although ideally they should be in CLDR.
This PR is approved. Should it be merged?
IMHO, we have to go back to the bug. I don't fully understand why U+201C and U+201D have to be treated specially in zh/ja only. This could open the floodgates for other locale and quotation-character combinations, too.
CJ don’t have spaces to provide breaking opportunities. If we don’t allow breaking at quotation marks, the result can in some cases be ridiculous. Other locales may have the same issue, but the results won’t be as bad.
Yes, when I wrote that in the bug, I had forgotten that the default behavior for CJ is not the same as the behavior in line_normal_cj.txt (that is, for CJ, in addition to the 3 W3C linebreak modes of strict, normal, and loose, we also have the default behavior which is none of those). We could reclaim that tens of kilobytes by actually making the default CJ behavior be the same as the line_normal_cj behavior.
For the standard rules, it is basically Unicode (UAX #29 and UAX #14), sometimes with some emoji-related extensions from UTS #51 or elsewhere. For the tailored rules, it is generally ICU; these are then manually synched over to the CLDR rules late in each CLDR development cycle. But they do not match exactly because, for instance, CLDR does not have the dictionary-based breaking that ICU supports.
Dongyuan responded about the motivation for the quote-related CJ tailoring. But since it seems there may not be agreement on merging this, I will bring it to the TC meeting.
I just created [ICU-20274] to consider whether the default Japanese linebreak rules should match the W3C line-break=normal rules as defined for Japanese (i.e., whether ja should default to line_normal_cj rather than line_normal).
Thank you, Peter, for the explanation about the canonical source.
I'll follow up on ICU-20274. As for the motivation for this PR, I know why this PR was made, but I don't think it's just Chinese or Japanese that may not have a space before or after U+201[CD].
Could we fold the _cj change into the base rules instead? That would reduce the need for _cj and would benefit other languages, for example Yi, whose pages also use U+201C/201D (see http://yi.people.com.cn). The text at http://yi.people.com.cn/157841/15690736.html, for instance, has no spaces and also uses “ and ”.
Another language that has no spaces and uses U+201C/201D is Tibetan. See the pages at http://tibet.people.com.cn for examples: the first line of text on such a page shows that Tibetan has no spaces and also uses U+201C/201D. So this need is clearly not just for zh and ja.
https://www.thairath.co.th/content/1536760 shows that Thai also uses U+201C/201D, and we know that Thai does not rely on spaces for line breaking.
Special tailoring for CJ so that “ (U+201C) and ” (U+201D) become OP and CL instead of QU. line_cj.txt was copied from line.txt. The diffs between line_cj.txt and line.txt are identical to the changes for line_loose_cj.txt and line_normal_cj.txt.
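To make the effect of the tailoring concrete, here is a minimal sketch in Python. It is not the ICU rule syntax and not how ICU implements line breaking. It is a toy model that assumes every character other than the two quotation marks is an ideograph (class ID) and applies only simplified forms of three UAX #14 rules (LB13, LB14, LB19). The names `line_break_class` and `break_opportunities` are hypothetical helpers made up for this illustration.

```python
# Toy model only: LB13, LB14 and LB19 from UAX #14 are modeled in simplified form;
# every other pair of characters is treated as a break opportunity.
DEFAULT_CLASSES = {"\u201C": "QU", "\u201D": "QU"}   # default: both are quotation marks
CJ_TAILORING = {"\u201C": "OP", "\u201D": "CL"}      # the tailoring described in this PR


def line_break_class(ch, tailored):
    """Return a simplified line break class; unlisted characters are assumed to be ID."""
    if tailored and ch in CJ_TAILORING:
        return CJ_TAILORING[ch]
    return DEFAULT_CLASSES.get(ch, "ID")


def break_opportunities(text, tailored):
    """Return the indices i where a break is allowed between text[i-1] and text[i]."""
    breaks = []
    for i in range(1, len(text)):
        before = line_break_class(text[i - 1], tailored)
        after = line_break_class(text[i], tailored)
        if after == "CL":            # LB13: no break before closing punctuation
            continue
        if before == "OP":           # LB14: no break after opening punctuation
            continue
        if "QU" in (before, after):  # LB19: no break before or after a quotation mark
            continue
        breaks.append(i)
    return breaks


sample = "他说“你好”然后走了"
print(break_opportunities(sample, tailored=False))  # [1, 4, 7, 8, 9]
print(break_opportunities(sample, tailored=True))   # [1, 2, 4, 6, 7, 8, 9]
```

With the default QU class, the quoted text is glued to the characters on both sides of each quotation mark. With the OP/CL tailoring, a break becomes possible before “ (index 2) and after ” (index 6), which is the extra breaking opportunity discussed in the conversation.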