Recognize supplementary (non-BMP) punctuation & symbols #190

Open
tats-u wants to merge 5 commits into micromark:main from tats-u:non-bmp

Conversation

tats-u (Contributor) commented Jan 22, 2025

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and discussions and couldn’t find anything or linked relevant results below
  • I made sure the docs are up to date
  • I included tests (or that’s not needed)

Description of changes

Fixes #189
We might need more tests to deal with abnormal surrogate pair patterns.
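
To illustrate why supplementary characters need special handling in JavaScript (a minimal sketch, not the code in this PR): strings are sequences of UTF-16 code units, so a check that looks at one code unit at a time only ever sees a surrogate, never the symbol itself. The helper names below are hypothetical.

```javascript
// Hypothetical helpers for illustration only; not micromark's implementation.
// Matches one code point in the Unicode general categories P (punctuation) or S (symbol).
const punctuationOrSymbol = /^[\p{P}\p{S}]$/u

// Classifies by UTF-16 code unit: a supplementary character is split in two.
function isPunctuationByCodeUnit(value, index) {
  return punctuationOrSymbol.test(String.fromCharCode(value.charCodeAt(index)))
}

// Classifies by code point: a valid surrogate pair is read as one character.
function isPunctuationByCodePoint(value, index) {
  return punctuationOrSymbol.test(String.fromCodePoint(value.codePointAt(index)))
}

const bmp = '!' // U+0021, category Po
const supplementary = '\u{1F600}' // U+1F600 GRINNING FACE, category So

console.log(isPunctuationByCodeUnit(bmp, 0)) // true
console.log(isPunctuationByCodeUnit(supplementary, 0)) // false
console.log(isPunctuationByCodePoint(supplementary, 0)) // true
```

The code-unit check fails for U+1F600 because it only sees the high surrogate U+D83D, whose category is Cs.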

github-actions bot added the 👋 phase/new (Post is being triaged automatically) and 🤞 phase/open (Post is being triaged manually) labels and removed the 👋 phase/new label on Jan 22, 2025
codecov bot commented Jan 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (2edb5e7) to head (ba3ae01).
Report is 46 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##              main      #190    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           57        58     +1     
  Lines        11932     12496   +564     
==========================================
+ Hits         11932     12496   +564     

wooorm (Member) commented Jan 22, 2025

Thanks for working on this! Appreciate it!

I would like to see a test case for this in the next version of CommonMark. Then things can be changed here.

tats-u (Contributor, Author) commented Jan 26, 2025

Do you know which repository I should submit a PR to, https://github.com/commonmark/commonmark-spec-web or https://github.com/commonmark/commonmark-spec?

Also, the test 'Should handle lonely surrogate pair around emphasis' is not suitable for the common tests shared with implementations in languages other than JS/Java/C#, because it contains lone surrogates, which are not valid in UTF-8 (and possibly UTF-32).
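
As a small illustration of why such a test does not travel well (a sketch, not part of this PR): a lone surrogate has no UTF-8 encoding, so the standard `TextEncoder` substitutes U+FFFD when a JavaScript string containing one is encoded.

```javascript
// A lone high surrogate: valid as a UTF-16 code unit in a JS string,
// but not a Unicode scalar value, so it has no UTF-8 encoding.
const lone = '*\uD83D*'

// TextEncoder always produces UTF-8 and replaces lone surrogates with U+FFFD.
const bytes = Array.from(new TextEncoder().encode(lone))

// U+FFFD is encoded in UTF-8 as the three bytes EF BF BD.
const hex = bytes.map((byte) => byte.toString(16).padStart(2, '0')).join(' ')
console.log(hex) // 2a ef bf bd 2a

// Decoding gives back the replacement character, not the original code unit.
const roundTripped = new TextDecoder().decode(new Uint8Array(bytes))
console.log(roundTripped === lone) // false
console.log(roundTripped === '*\uFFFD*') // true
```

A shared test fixture stored as UTF-8 therefore cannot carry the original input unchanged.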

wooorm (Member) commented Jan 27, 2025

PR goes to commonmark/commonmark-spec.

Maintainers are conservative with breaking changes like this.
I recommend making it as easy as possible to merge. As short and clear as possible.
You do not have to do everything in one big PR: if someone finds something controversial, that would block the whole PR.
If things are blocked, try and get something in there first, and improve on it later.

For the algorithm in the appendix (https://spec.commonmark.org/0.31.2/#phase-2-inline-structure), I think that is very complex, you might want to ask John to do that?
From what I heard, John already has a (local?) branch for CJK+emphasis in cmark? So perhaps John can develop/merge that together with a PR to the spec to change the appendix?

wooorm (Member) commented Jan 27, 2025

I’d recommend not testing lonely surrogates for now then. You can also ask John how best to test that. The CM spec does not mention UTF-8, so perhaps this is out of scope for CM.

tats-u (Contributor, Author) commented Feb 16, 2025

> I’d recommend not testing lonely surrogates for now then.

Indeed, it may be better left implementation-defined. I will remove it from the test case in this repository later. Update: should I just delete the assert.equal, so that the test only ensures micromark does not throw a runtime exception for such ill-formed inputs?
Incidentally, the official term seems to be "isolated surrogate code unit", and .isWellFormed() returns false for strings that contain one.
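
For reference, a sketch of that check (illustrative only). `String.prototype.isWellFormed()` is part of ES2024; the fallback loop is an assumption for older runtimes, and `isWellFormedString` is a hypothetical name.

```javascript
// Hypothetical helper: true when `value` contains no isolated surrogate code units.
function isWellFormedString(value) {
  // Use the built-in where available (ES2024).
  if (typeof value.isWellFormed === 'function') return value.isWellFormed()

  // Fallback for older runtimes: scan the UTF-16 code units manually.
  for (let index = 0; index < value.length; index++) {
    const code = value.charCodeAt(index)
    if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = value.charCodeAt(index + 1)
      if (!(next >= 0xdc00 && next <= 0xdfff)) return false
      index++ // Skip the low surrogate of a valid pair.
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      // Low surrogate without a preceding high surrogate.
      return false
    }
  }

  return true
}

console.log(isWellFormedString('*\u{1F600}*')) // true
console.log(isWellFormedString('*\uD83D*')) // false
console.log(isWellFormedString('*\uDE00\uD83D*')) // false
```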

tats-u changed the title from "Recognize non-BMP punctuation & symbols" to "Recognize supplementary (non-BMP) punctuation & symbols" on Feb 16, 2025
wooorm (Member) commented Feb 17, 2025

I was talking about the CM spec. If it is impossible to add a test there, then I do not recommend trying to add a test there.
It is possible to have a test here, so we can have a test here.

tats-u (Contributor, Author) commented Feb 17, 2025

I was wrong. I think the spec currently treats isolated surrogate code units as non-punctuation: a character in the spec is a Unicode code point, which includes the surrogate code points (U+D800 to U+DFFF), and the general category of surrogate code points is Cs (not P or S).
However, I think the spec should be revised so that implementations do not have to overthink surrogate code units or other ill-formed code unit sequences (e.g. by allowing implementations to replace them in advance, at their discretion, with U+FFFD, which counts as punctuation because its category is So).
I think the test cases for isolated surrogate code units may remain, but keeping them is not strongly recommended.
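
The category argument can be checked with Unicode property escapes (an illustrative sketch, not part of this PR). `toWellFormed()` is ES2024 and may be missing in older runtimes, so the sketch uses a regular expression for the replacement instead.

```javascript
// Under the `u` flag a valid surrogate pair is one code point,
// so this class only matches isolated surrogate code units.
const isolatedSurrogate = /[\uD800-\uDFFF]/gu
const punctuationOrSymbol = /[\p{P}\p{S}]/u

const lone = '\uD83D'

// An isolated surrogate has general category Cs, which is neither P nor S.
console.log(/\p{Cs}/u.test(lone)) // true
console.log(punctuationOrSymbol.test(lone)) // false

// Replacing isolated surrogates up front (one possible normalization).
const input = 'a\uD83D*b*\u{1F600}'
const normalized = input.replace(isolatedSurrogate, '\uFFFD')
console.log(normalized === 'a\uFFFD*b*\u{1F600}') // true

// U+FFFD REPLACEMENT CHARACTER has category So, so it matches \p{S}.
console.log(/^\p{So}$/u.test('\uFFFD')) // true
```

Note that such a replacement changes how the neighbouring `*` is classified for emphasis, since the character next to it becomes a symbol.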

tats-u (Contributor, Author) commented Mar 23, 2025

I rebased this PR onto main, but CI failed with:

Error: test/util/slow-stream.js(23,49): error TS2345: Argument of type 'BufferEncoding | undefined' is not assignable to parameter of type 'BufferEncoding'.

I have not modified this file. It is strange.

tats-u (Contributor, Author) commented Mar 30, 2025

> I have not modified this file. It is strange.

This will probably be fixed by #196.

tats-u (Contributor, Author) commented Apr 2, 2025

Rebased to main. Please tell me if I need to squash commits into one.

Labels

🤞 phase/open Post is being triaged manually

Development

Successfully merging this pull request may close these issues.

Recognize non-BMP punctuation & symbols (to prepare for CJK support in the future)

2 participants