Recognize supplementary (non-BMP) punctuation & symbols #190
tats-u wants to merge 5 commits into micromark:main
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #190   +/-   ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files           57        58     +1
  Lines        11932     12496   +564
==========================================
+ Hits         11932     12496   +564
|
|
Thanks for working on this! Appreciate it! I would like to see a test case in the next version of CommonMark for this. Then things can be changed here |
|
Do you know which repository I should submit a PR to, https://github.com/commonmark/commonmark-spec-web or https://github.com/commonmark/commonmark-spec? Also, 'Should handle lonely surrogate pair around emphasis' is not suitable for the common tests shared with implementations in languages other than JS/Java/C#, because it contains lonely surrogates, which are not valid in UTF-8 (or UTF-32). |
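To illustrate the encoding concern raised here: a JavaScript string can hold a lonely surrogate, but that string cannot be written out as UTF-8 without loss. The following is only a sketch. `hasLoneSurrogate` is a hypothetical helper and not part of micromark; `TextEncoder` is the standard Web/Node.js API, which substitutes U+FFFD for an unpaired surrogate when encoding.

```javascript
// Hypothetical helper (not part of micromark): reports whether a string
// contains a surrogate code unit that is not part of a valid pair.
function hasLoneSurrogate(value) {
  for (let index = 0; index < value.length; index++) {
    const code = value.charCodeAt(index)
    const isHigh = code >= 0xd800 && code <= 0xdbff
    const isLow = code >= 0xdc00 && code <= 0xdfff

    if (isHigh) {
      // `charCodeAt` yields NaN past the end, which fails both comparisons.
      const next = value.charCodeAt(index + 1)
      if (next >= 0xdc00 && next <= 0xdfff) {
        index++ // Valid pair: skip the low half.
        continue
      }
      return true
    }

    if (isLow) return true
  }

  return false
}

const paired = '*\u{1F600}*' // Emphasis markers around a full surrogate pair.
const lonely = '*\uD83D*' // Emphasis markers around the high half only.

const encoder = new TextEncoder()
const pairedBytes = Array.from(encoder.encode(paired))
const lonelyBytes = Array.from(encoder.encode(lonely))

console.log(hasLoneSurrogate(paired)) // false
console.log(hasLoneSurrogate(lonely)) // true
console.log(pairedBytes) // [42, 240, 159, 152, 128, 42]
console.log(lonelyBytes) // [42, 239, 191, 189, 42] (U+FFFD replaces the lone surrogate)
```

A test fixture stored as a UTF-8 file would therefore carry U+FFFD rather than the original surrogate, which is why such a case is hard to share across implementations.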
|
PR goes to Maintainers are conservative with breaking changes like this. As for the algorithm in the appendix (https://spec.commonmark.org/0.31.2/#phase-2-inline-structure), I think that is very complex; you might want to ask John to do that. |
|
I’d recommend not testing lonely surrogates for now then. You can also ask John how best to test that. The CM spec does not mention UTF-8, so perhaps this is out of scope for CM. |
Indeed it may be better left as implementation-defined. |
|
I was talking about the CM spec. If it is impossible to add a test there, then I do not recommend trying to add a test there. |
|
I was wrong; I think that the spec treats isolated surrogate code units as non-punctuation now, because a character there is a Unicode code point, which includes surrogate code points (U+D800 to U+DFFF), and the general category of surrogate code points is Cs (not P or S). |
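The classification described in this comment can be checked in JavaScript with Unicode property escapes. This is a sketch under assumptions: `isPunctuationOrSymbol` is a hypothetical name, and the regular expression only approximates "general category P or S"; it is not micromark's actual implementation.

```javascript
// Matches exactly one code point whose general category is P or S.
const punctuationOrSymbol = /^[\p{P}\p{S}]$/u
// Matches exactly one surrogate code point (general category Cs).
const surrogate = /^\p{Cs}$/u

// Hypothetical helper: classify by code point, not by UTF-16 code unit.
function isPunctuationOrSymbol(codePoint) {
  return punctuationOrSymbol.test(String.fromCodePoint(codePoint))
}

const emoji = '\u{1F600}' // Non-BMP symbol (So), stored as two code units.

const results = {
  // BMP punctuation: `*` (Po).
  asterisk: isPunctuationOrSymbol(0x2a),
  // Non-BMP symbol, read as a whole code point.
  emojiByCodePoint: isPunctuationOrSymbol(emoji.codePointAt(0)),
  // Non-BMP punctuation: U+10100 AEGEAN WORD SEPARATOR LINE (Po).
  aegeanSeparator: isPunctuationOrSymbol(0x10100),
  // Same emoji, but only its first code unit (0xD83D) is inspected.
  emojiByCodeUnit: isPunctuationOrSymbol(emoji.charCodeAt(0)),
  // An isolated surrogate is category Cs, so neither P nor S.
  loneSurrogateIsCs: surrogate.test('\uD83D')
}

console.log(results)
```

Reading by code unit sees only a surrogate, which is Cs, so a supplementary symbol is recognized only when the full code point is examined.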
|
I rebased this PR to main, but I got in CI:
I have not modified this file. It is strange. |
Probably will be fixed by #196 |
|
Rebased to main. Please tell me if I need to squash commits into one. |
Initial checklist
Description of changes
Fixes #189
We might need more tests to deal with abnormal surrogate pair patterns.