Rationalize name-char #1008

Merged
merged 14 commits into from
Feb 17, 2025

macchiati
Member

@macchiati macchiati commented Feb 16, 2025

#724 Rationalize name-char

@macchiati macchiati marked this pull request as ready for review February 16, 2025 01:27
@macchiati
Member Author

Here is a first cut at changes for name-char. Feedback welcome.

Collaborator

@eemeli eemeli left a comment

A rather cursory initial pass.

In addition to line comments, this PR needs to be accompanied by some test cases.

Comment on lines 60 to 62
/ %xA1-61B ; omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
/ %x61D-167F ; omit BidiControl %x61C
/ %x1681-1FFF ; omit Whitespace %x1680
Collaborator

In the other character range definitions, the "omit" comments are offset by one compared to how they're here, as in (showing these three lines only as an example):

Suggested change
- / %xA1-61B ; omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
- / %x61D-167F ; omit BidiControl %x61C
- / %x1681-1FFF ; omit Whitespace %x1680
+ / %xA1-61B ; omit BidiControl %x61C
+ / %x61D-167F ; omit Whitespace %x1680
+ / %x1681-1FFF ; omit Whitespace %x2000-200A

The same style should be used in all these comments.

Member Author

I don't see a difference on my screen. Do you want an additional space before or after 'omit', or a space deleted before or after 'omit'?

Member Author

I changed the EA brackets to guillemets, since they line up better for monospace.

Collaborator

I meant offset in a vertical direction, so a comment like "omit BidiControl %x61C" should follow the range %xA1-61B, rather than the range %x61D-167F.
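
To illustrate the convention described in this comment: if each "omit" comment names the code points skipped immediately after the range on its own line, the comments can be derived mechanically from the ranges. The following is a minimal Python sketch and not part of the PR. The helper names `gaps_after` and `fmt` are hypothetical, and the ranges are the four quoted in this thread.

```python
# Ranges quoted from the reviewed ABNF lines (inclusive code point bounds).
RANGES = [
    (0xA1, 0x61B),
    (0x61D, 0x167F),
    (0x1681, 0x1FFF),
    (0x200B, 0x200D),
]


def gaps_after(ranges):
    """For each range except the last, return the omitted span that follows it."""
    result = []
    for (_, end), (next_start, _) in zip(ranges, ranges[1:]):
        result.append((end + 1, next_start - 1))
    return result


def fmt(gap):
    """Format a gap in ABNF-style hex, collapsing single code points."""
    lo, hi = gap
    return f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}"


# Pair each range with the gap that follows it on the same line.
for (start, end), gap in zip(RANGES, gaps_after(RANGES)):
    print(f"/ %x{start:X}-{end:X} ; omit {fmt(gap)}")
```

Under these assumptions, the computed gaps are %x61C, %x1680, and %x2000-200A, which matches the placement in the suggested change above.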

Member Author

Ok, will work on that.

/ %xA1-61B ; omit Cc %x7F-9F, Whitespace %xA0, Ascii 【`】 【{|}~】
/ %x61D-167F ; omit BidiControl %x61C
/ %x1681-1FFF ; omit Whitespace %x1680
/ %x200B-200D ; omit Whitespace %x2000-200A
Collaborator

This set is ZWSP, ZWNJ, and ZWJ. Should they really be included in name-start? That seems surprising to me, and with no positive utility.

We will need ZWNJ and ZWJ within names, though, so maybe it's fine for them to be here. But why ZWSP?

Member

We still have name-char. Why not put the joiners in there?

I kind of also question ZWSP.

Member Author

@macchiati macchiati Feb 16, 2025

There is no particular utility to having ZWSP start a name-char, nor any utility to having it end a name-char.

It doesn't hurt to move that one (ZWSP) to name-char, but it doesn't really make a dent either — and we really wouldn't want to go too far down the very long and slippery slope. That's for linters and guidance.

That being said, if people want it out I can remove it.

Member

The question here is why put these characters in name-start, where they have no utility? At least in name-char they would be enclosed or at the end?

Member Author

What I'm afraid of is that if we move that to name-char, it will just open it up to people endlessly complaining that:

"ZWSP" is in name-char instead of name-start: why is XXX in name-start when it should also just be in name-char???

Member Author

That is, the basic difference between name-char and name-start is that

  • name is used in identifiers and variables, and can't start with a digit, -, or .
  • name-char is used in literals, and can start with a digit, -, or .

The syntactic motivation is clear: to make sure that identifiers and variables are distinguishable from numbers. That is a clear syntactic need.

ZWSP certainly isn't needed at the start of an identifier or variable, but there is a large and complicated list of characters that are also not needed at the start of identifiers and variables, and plucking just one of those characters out, without any syntactic need, doesn't actually provide much value.
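
The syntactic need described in this comment (a name must not start with a digit, -, or . so that it cannot be confused with a number) can be sketched in Python. This is a deliberately simplified illustration, not the grammar from the specification: the names `NON_START`, `can_start_name`, and `looks_like_name` are hypothetical, and every other restriction on name characters is ignored.

```python
# Hypothetical, simplified sketch; NOT the name-start rule from the specification.
# Assumption: the only start restriction modeled here is "no digit, '-', or '.'".
NON_START = set("0123456789-.")


def can_start_name(ch: str) -> bool:
    """True if ch is allowed as the first character under this simplified rule."""
    return ch not in NON_START


def looks_like_name(token: str) -> bool:
    """True if token is non-empty and its first character may start a name."""
    return bool(token) and can_start_name(token[0])


for token in ["count", "_tmp", "42", "-1.5", ".5"]:
    print(token, looks_like_name(token))
```

A check this small is enough to separate the number-like tokens from the rest, which is the point being made: excluding further characters such as ZWSP from the start position would not serve the same purpose.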

Collaborator

Given that we do not allow space characters or control characters, I'd prefer not allowing zero-width spaces in names or unquoted literals.

Member Author

A zero-width space is not a space; that is just a name used for familiarity. It is a Format character, like many others.

https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bgc%3Dformat%7D&g=&i=
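
The General_Category values behind this point can be checked locally. The sketch below uses Python's standard `unicodedata` module, whose data depends on the Unicode version bundled with the interpreter; the selection of code points is an assumption drawn from the characters mentioned in this thread.

```python
import unicodedata

# Code points discussed in this thread: two space separators that the ranges
# omit, the Arabic letter mark, and the three zero-width characters.
CODE_POINTS = (0x00A0, 0x1680, 0x061C, 0x200B, 0x200C, 0x200D)

for cp in CODE_POINTS:
    ch = chr(cp)
    # category() returns the General_Category: "Zs" is Space_Separator,
    # "Cf" is Format.
    print(f"U+{cp:04X} {unicodedata.name(ch)}: gc={unicodedata.category(ch)}")
```

U+00A0 and U+1680 report `Zs`, while U+200B ZERO WIDTH SPACE reports `Cf`, the same category as U+200C and U+200D.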

Member

@aphillips aphillips left a comment

A good start. Some small things.

@aphillips aphillips added the labels syntax (Issues related with syntax or ABNF), normative (Issue affects normative text in the specification), and LDML47 (LDML 47 Release (Stable)) Feb 16, 2025
@eemeli
Collaborator

eemeli commented Feb 17, 2025

This also needs to be fixed:

Valid content for _names_ is based on <cite>Namespaces in XML 1.0</cite>'s
[NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
This is different from XML's [Name](https://www.w3.org/TR/xml/#NT-Name)
in that it MUST NOT contain a U+003A COLON `:`.
Otherwise, the set of characters allowed in a _name_ is large.

spec/syntax.md Outdated
Comment on lines 790 to 791
They are similar to <cite>Namespaces in XML 1.0</cite>'s [NCName](https://www.w3.org/TR/xml-names/#NT-NCName),
but have been updated to be more consistent.
Collaborator

Given how far we've departed from that spec, this note seems only relevant within the history of the spec, and not with where we're ending up with this PR. I'd prefer dropping it, and moving the preceding sentence (if keeping) above the preceding note.

Member Author

ok.

Co-authored-by: Addison Phillips
Co-authored-by: Eemeli Aro
Comment on lines +57 to +58
/ %x2B ; «+» omit Ascii: «,-./0123456789:;<=>?@» «[\]^»
/ %x5F ; «_» omit Cc: %x7F-9F, Whitespace: %xA0, Ascii: «`» «{|}~»
Collaborator

Why are there two sets of guillemet-quoted ASCII characters here?

Suggested change
- / %x2B ; «+» omit Ascii: «,-./0123456789:;<=>?@» «[\]^»
- / %x5F ; «_» omit Cc: %x7F-9F, Whitespace: %xA0, Ascii: «`» «{|}~»
+ / %x2B ; «+» omit Ascii: «,-./0123456789:;<=>?@[]^» and REVERSE SOLIDUS "\"
+ / %x5F ; «_» omit Ascii: «`{|}~», Cc: %x7F-9F, Whitespace: %xA0

@aphillips aphillips merged commit b343bfa into main Feb 17, 2025
1 check passed
@aphillips aphillips deleted the macchiati-i724-rationalize-name-char branch February 17, 2025 18:04
eemeli added a commit to messageformat/messageformat that referenced this pull request Mar 2, 2025