From 681963f5c0dd6339cc97bb71eab67c2581b1e960 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 18 Sep 2023 00:03:01 +0200 Subject: [PATCH 01/32] Document the design of quoted literals --- exploration/0000-quoted-literals.md | 163 ++++++++++++++++++++++++++++ 1 file changed, 163 insertions(+) create mode 100644 exploration/0000-quoted-literals.md diff --git a/exploration/0000-quoted-literals.md b/exploration/0000-quoted-literals.md new file mode 100644 index 0000000000..026871940f --- /dev/null +++ b/exploration/0000-quoted-literals.md @@ -0,0 +1,163 @@ +# Quoted Literals + +## Objective + +Document the rationale for including quoted literals in MessageFormat and for delimiting them with the vertical line character, `|`. + +## Background + +MessageFormat allows both quoted and unquoted literals. Unquoted literals satisfy many common use-cases for literals: they are sufficient to represent numbers and single-word option values and variant keys. Quoted literals are helpful in exotic use-cases. + +In early drafts of the MessageFormat syntax, quoted literals were delimited first with quotation marks (`"foo bar"`), and then with round parentheses, e.g. `(foo bar)`. See [#263](https://github.com/unicode-org/message-format-wg/issues/263). + +[#414](https://github.com/unicode-org/message-format-wg/pull/414) proposed to revert these changes and go back to using single and/or double quotes as delimiters. The proposal was rejected. This document is an artifact of that rejection. + +## Use-Cases + +_What use-cases do we see? Ideally, quote concrete examples._ + +In general, quoted literals are useful for: + +1. encoding literals containing whitespace, like literals consisting of multiple words, +1. encoding literals containing exotic characters that do not conform to the `unquoted` production in ABNF.
+ +More specifically: + +* Message authors and translators need to be able to use the apostrophe in the message content, and may want to use the single quote character to represent it instead of the typograhic (curly) apostrophe. + + > ``` + > {…{|New Year's Eve|}…} + > ``` + +* Message authors may want to use literals to define locale-aware dates as literals in the RFC 7231 format: + + > ``` + > {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} + > ``` + +* Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements: + + > ``` + > {{+button title=|Click here!|}Submit{-button}} + > ``` + > + > Note that quoted literals cannot contain placeholders, making interpolating data into them impossible. + > + > ``` + > -- This is impossible in MessageFormat 2.0. + > {{+button title=|Goodbye, {$userName}!|}Sign out{-button}} + > ``` + +* Selector function implementers may want to support exotic characters in variant keys to effectively create "mini-DSLs" for the matching logic: + + > ``` + > match {$count :myNumber} + > when |<10| {A handful.} + > when * {Lots.} + > ``` + +* Message authors may want to protect untranslatable strings: + + > ``` + > {Visit {|http://www.example.com| @translate=false}.} + > ``` + > + > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). + +* Message authors may want to decorate substrings as being written in a particular language, different from the message's language, for the purpose of accessibility, text-to-speech, and semantic correctness. + + > ``` + > {The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}.} + > ``` + > + > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). 
+ +* Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. + + > ```js + > let message = new MessageFormat( + > "en", "{A message with {|a literal|}.}"); + > ``` + +* Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. + + > ```json + > { + > "msg": "{A message with {|a literal|}.}" + > } + > ``` + +## Requirements + +_What properties does the solution have to manifest to enable the use-cases above?_ + +* **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, and it also introduces the backslash into the message, `\`, which is the escape introducer. The backslash then needs to be escaped too, when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) +* **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. However, note that many programming languages also provide alternative ways of delimiting strings, e.g. *raw strings* or triple-quoted literals. +* **[r3; high priority]** Minimize the need to change the message in other ways than to escape some of its characters (e.g. rephrase content or change syntax). +* **[r4; medium priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. 
+* **[r5; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). + +## Constraints + +_What prior decisions and existing conditions limit the possible design?_ + +* **[c1]** MessageFormat uses the backslash, `\`, as the escape sequence introducer. +* **[c2]** Straight quotation marks, `'` and `"`, are common in content across many languages, even if other Unicode codepoints should be used in well-formatted text. +* **[c3]** Straight quotation marks, `'` and `"`, are common as string delimiters in many programming languages. + +## Proposed Design + +_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ + +Use the vertical line character, `|`, to delimit quoted strings. For example: + +> ``` +> {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} +> ``` + +```abnf +literal = quoted / unquoted +quoted = "|" *(quoted-char / quoted-escape) "|" +quoted-char = %x0-5B ; omit \ + / %x5D-7B ; omit | + / %x7D-D7FF ; omit surrogates + / %xE000-10FFFF +quoted-escape = backslash ( backslash / "|" ) +``` + +By being both uncommon in text content and uncommon as a string delimiter in other programming languages, the vertical line sidesteps the "inwards" and "outwards" problems of escaping. + +* [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. This means no extra `\` that need escaping. +* [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. +* [r3 GOOD] Message don't have to be modified otherwise before embedding them. 
+* [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals. +* [r5 POOR] Vertical lines cannot be paired by parsers nor IDEs. + +## Alternatives Considered + +_What other solutions are available?_ +_How do they compare against the requirements?_ +_What other properties they have?_ + +### Use quotation marks + +Early drafts of the syntax specification used double quotes to delimit literals. This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015), and was later proposed back in [#414](https://github.com/unicode-org/message-format-wg/pull/414). + +* [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, which then needs to be escaped itself in code which uses `\` as the escape character (which is common). +* [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). +* [r3 FAIR] One of the suggestions proposed to allow for both single and double quotation marks, and make them interchangeable in case one set was used by the inner content or surrounding code. This, however, requires directed modification of the message's body. +* [r4 GOOD] Quotation marks are universally recognized as string delimiters. +* [r5 POOR] Quotation marks cannot be paired by parsers nor IDEs. + +### Use round or angle brackets + +* Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising [r4 POOR]. 
Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. +* Angle brackets require escaping in XML-based storage formats [r2 POOR]. + +### Change escape introducer + +Changing the escape sequence introducer from backslash [c1] to another character could help partially mitigate the burden of first escaping literal delimiters and then escaping the escapes themselves [r1]. However, it wouldn't address other requirements and use-cases. + +### Double delimiters to escape them + +This is the approach taken by ICU MessageFormat 1.0 for quotes. It allows literals to contain quotes [r1 GOOD] at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR]. From 6fd7cc152c0cd917c94eb9096d42d729ff4b5ac6 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Sun, 17 Sep 2023 22:14:37 +0000 Subject: [PATCH 02/32] style: Apply Prettier --- exploration/0000-quoted-literals.md | 135 ++++++++++++++-------------- 1 file changed, 67 insertions(+), 68 deletions(-) diff --git a/exploration/0000-quoted-literals.md b/exploration/0000-quoted-literals.md index 026871940f..f1dd6f1eaa 100644 --- a/exploration/0000-quoted-literals.md +++ b/exploration/0000-quoted-literals.md @@ -23,87 +23,86 @@ In general, quoted literals are useful for: More specifically: -* Message authors and translators need to be able to use the apostrophe in the message content, and may want to use the single quote character to represent it instead of the typograhic (curly) apostrophe. +- Message authors and translators need to be able to use the apostrophe in the message content, and may want to use the single quote character to represent it instead of the typograhic (curly) apostrophe. 
- > ``` - > {…{|New Year's Eve|}…} - > ``` + > ``` + > {…{|New Year's Eve|}…} + > ``` -* Message authors may want to use literals to define locale-aware dates as literals in the RFC 7231 format: +- Message authors may want to use literals to define locale-aware dates as literals in the RFC 7231 format: - > ``` - > {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} - > ``` + > ``` + > {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} + > ``` -* Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements: +- Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements: - > ``` - > {{+button title=|Click here!|}Submit{-button}} - > ``` - > - > Note that quoted literals cannot contain placeholders, making interpolating data into them impossible. - > - > ``` - > -- This is impossible in MessageFormat 2.0. - > {{+button title=|Goodbye, {$userName}!|}Sign out{-button}} - > ``` + > ``` + > {{+button title=|Click here!|}Submit{-button}} + > ``` + > + > Note that quoted literals cannot contain placeholders, making interpolating data into them impossible. + > + > ``` + > -- This is impossible in MessageFormat 2.0. 
+ > {{+button title=|Goodbye, {$userName}!|}Sign out{-button}} + > ``` -* Selector function implementers may want to support exotic characters in variant keys to effectively create "mini-DSLs" for the matching logic: +- Selector function implementers may want to support exotic characters in variant keys to effectively create "mini-DSLs" for the matching logic: - > ``` - > match {$count :myNumber} - > when |<10| {A handful.} - > when * {Lots.} - > ``` + > ``` + > match {$count :myNumber} + > when |<10| {A handful.} + > when * {Lots.} + > ``` -* Message authors may want to protect untranslatable strings: +- Message authors may want to protect untranslatable strings: - > ``` - > {Visit {|http://www.example.com| @translate=false}.} - > ``` - > - > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). + > ``` + > {Visit {|http://www.example.com| @translate=false}.} + > ``` + > + > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). -* Message authors may want to decorate substrings as being written in a particular language, different from the message's language, for the purpose of accessibility, text-to-speech, and semantic correctness. +- Message authors may want to decorate substrings as being written in a particular language, different from the message's language, for the purpose of accessibility, text-to-speech, and semantic correctness. - > ``` - > {The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}.} - > ``` - > - > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). 
+ > ``` + > {The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}.} + > ``` + > + > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). -* Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. +- Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. - > ```js - > let message = new MessageFormat( - > "en", "{A message with {|a literal|}.}"); - > ``` + > ```js + > let message = new MessageFormat("en", "{A message with {|a literal|}.}"); + > ``` -* Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. +- Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. - > ```json - > { - > "msg": "{A message with {|a literal|}.}" - > } - > ``` + > ```json + > { + > "msg": "{A message with {|a literal|}.}" + > } + > ``` ## Requirements _What properties does the solution have to manifest to enable the use-cases above?_ -* **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, and it also introduces the backslash into the message, `\`, which is the escape introducer. The backslash then needs to be escaped too, when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) -* **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. 
In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. However, note that many programming languages also provide alternative ways of delimiting strings, e.g. *raw strings* or triple-quoted literals. -* **[r3; high priority]** Minimize the need to change the message in other ways than to escape some of its characters (e.g. rephrase content or change syntax). -* **[r4; medium priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. -* **[r5; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). +- **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, and it also introduces the backslash into the message, `\`, which is the escape introducer. The backslash then needs to be escaped too, when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) +- **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. However, note that many programming languages also provide alternative ways of delimiting strings, e.g. _raw strings_ or triple-quoted literals. 
+- **[r3; high priority]** Minimize the need to change the message in other ways than to escape some of its characters (e.g. rephrase content or change syntax). +- **[r4; medium priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. +- **[r5; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). ## Constraints _What prior decisions and existing conditions limit the possible design?_ -* **[c1]** MessageFormat uses the backslash, `\`, as the escape sequence introducer. -* **[c2]** Straight quotation marks, `'` and `"`, are common in content across many languages, even if other Unicode codepoints should be used in well-formatted text. -* **[c3]** Straight quotation marks, `'` and `"`, are common as string delimiters in many programming languages. +- **[c1]** MessageFormat uses the backslash, `\`, as the escape sequence introducer. +- **[c2]** Straight quotation marks, `'` and `"`, are common in content across many languages, even if other Unicode codepoints should be used in well-formatted text. +- **[c3]** Straight quotation marks, `'` and `"`, are common as string delimiters in many programming languages. ## Proposed Design @@ -127,11 +126,11 @@ quoted-escape = backslash ( backslash / "|" ) By being both uncommon in text content and uncommon as a string delimiter in other programming languages, the vertical line sidesteps the "inwards" and "outwards" problems of escaping. -* [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. 
This means no extra `\` that need escaping. -* [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. -* [r3 GOOD] Message don't have to be modified otherwise before embedding them. -* [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals. -* [r5 POOR] Vertical lines cannot be paired by parsers nor IDEs. +- [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. This means no extra `\` that need escaping. +- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. +- [r3 GOOD] Message don't have to be modified otherwise before embedding them. +- [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals. +- [r5 POOR] Vertical lines cannot be paired by parsers nor IDEs. ## Alternatives Considered @@ -143,16 +142,16 @@ _What other properties they have?_ Early drafts of the syntax specification used double quotes to delimit literals. This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015), and was later proposed back in [#414](https://github.com/unicode-org/message-format-wg/pull/414). -* [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, which then needs to be escaped itself in code which uses `\` as the escape character (which is common). -* [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. 
See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). -* [r3 FAIR] One of the suggestions proposed to allow for both single and double quotation marks, and make them interchangeable in case one set was used by the inner content or surrounding code. This, however, requires directed modification of the message's body. -* [r4 GOOD] Quotation marks are universally recognized as string delimiters. -* [r5 POOR] Quotation marks cannot be paired by parsers nor IDEs. +- [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, which then needs to be escaped itself in code which uses `\` as the escape character (which is common). +- [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). +- [r3 FAIR] One of the suggestions proposed to allow for both single and double quotation marks, and make them interchangeable in case one set was used by the inner content or surrounding code. This, however, requires directed modification of the message's body. +- [r4 GOOD] Quotation marks are universally recognized as string delimiters. +- [r5 POOR] Quotation marks cannot be paired by parsers nor IDEs. ### Use round or angle brackets -* Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising [r4 POOR]. Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. -* Angle brackets require escaping in XML-based storage formats [r2 POOR]. +- Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising [r4 POOR]. 
Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. +- Angle brackets require escaping in XML-based storage formats [r2 POOR]. ### Change escape introducer From 1fe4a4c650476ea2f26a58760205e15231743c79 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 18 Sep 2023 00:19:24 +0200 Subject: [PATCH 03/32] Rename file to match the PR --- exploration/{0000-quoted-literals.md => 0477-quoted-literals.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename exploration/{0000-quoted-literals.md => 0477-quoted-literals.md} (100%) diff --git a/exploration/0000-quoted-literals.md b/exploration/0477-quoted-literals.md similarity index 100% rename from exploration/0000-quoted-literals.md rename to exploration/0477-quoted-literals.md From 10f7354e2c34c2dc02740305b20eebe2a80dd794 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 18 Sep 2023 00:23:28 +0200 Subject: [PATCH 04/32] Add metadata --- exploration/0477-quoted-literals.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index f1dd6f1eaa..5d15549316 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -1,5 +1,15 @@ # Quoted Literals +Status: **Accepted** + +
+ Metadata: Pull Request #477
+ ## Objective Document the rationale for including quoted literals in MessageFormat and for delimiting them with the vertical line character, `|`. From d37b89330f456983ef58f05677a1cb8dbca971c3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 18 Sep 2023 12:45:29 +0200 Subject: [PATCH 05/32] Address Eemeli's review comments --- exploration/0477-quoted-literals.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 5d15549316..b2bd51013e 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -102,7 +102,7 @@ _What properties does the solution have to manifest to enable the use-cases abov - **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, and it also introduces the backslash into the message, `\`, which is the escape introducer. The backslash then needs to be escaped too, when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) - **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. However, note that many programming languages also provide alternative ways of delimiting strings, e.g. _raw strings_ or triple-quoted literals. -- **[r3; high priority]** Minimize the need to change the message in other ways than to escape some of its characters (e.g. rephrase content or change syntax). +- **[r3; medium priority]** Minimize the need to change the message in other ways than to escape some of its characters (e.g. 
rephrase content, use typographic apostrophes, or switch to using a second set of delimiters). - **[r4; medium priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. - **[r5; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). @@ -139,7 +139,7 @@ By being both uncommon in text content and uncommon as a string delimiter in oth - [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. This means no extra `\` that need escaping. - [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. - [r3 GOOD] Message don't have to be modified otherwise before embedding them. -- [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals. +- [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals in email and chat. Vertical bars can also be used to delimit [regular expressions in `sed`](https://en.wikipedia.org/wiki/Vertical_bar#Delimiter), and as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). - [r5 POOR] Vertical lines cannot be paired by parsers nor IDEs.
## Alternatives Considered @@ -156,12 +156,13 @@ Early drafts of the syntax specification used double quotes to delimit literals. - [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). - [r3 FAIR] One of the suggestions proposed to allow for both single and double quotation marks, and make them interchangeable in case one set was used by the inner content or surrounding code. This, however, requires directed modification of the message's body. - [r4 GOOD] Quotation marks are universally recognized as string delimiters. -- [r5 POOR] Quotation marks cannot be paired by parsers nor IDEs. +- [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. ### Use round or angle brackets -- Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising [r4 POOR]. Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. +- Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising, especially given the well-established meaning in prose [r4 POOR]. That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22). Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. - Angle brackets require escaping in XML-based storage formats [r2 POOR]. +- All brackets can be easily paired by parsers and IDEs [r5 GOOD]. 
### Change escape introducer From 49e5f0b33f00405953ca51e5a226d9d1e3b28d82 Mon Sep 17 00:00:00 2001 From: Eemeli Aro Date: Mon, 18 Sep 2023 14:22:17 +0300 Subject: [PATCH 06/32] Split "Dual quoting" from "Use quotation marks" --- exploration/0477-quoted-literals.md | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index b2bd51013e..5eac47a754 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -150,14 +150,33 @@ _What other properties they have?_ ### Use quotation marks -Early drafts of the syntax specification used double quotes to delimit literals. This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015), and was later proposed back in [#414](https://github.com/unicode-org/message-format-wg/pull/414). +Early drafts of the syntax specification used double quotes to delimit literals. +This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015). - [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, which then needs to be escaped itself in code which uses `\` as the escape character (which is common). - [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). -- [r3 FAIR] One of the suggestions proposed to allow for both single and double quotation marks, and make them interchangeable in case one set was used by the inner content or surrounding code. This, however, requires directed modification of the message's body. +- [r3 ???] 
- [r4 GOOD] Quotation marks are universally recognized as string delimiters. - [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. +### Dual quoting + +PR [#414](https://github.com/unicode-org/message-format-wg/pull/414) proposes to +allow either single quotes `'` or double quotes `"` as literal delimiters, +a variant of the "Use quotation marks" solution. + +- [r1 FAIR] Writing `"` and `'` in literals doesn't require escaping them via `\`, + as long as they do not match the literal's delimiter. + Literals containing both `'` and `"` will need to have at least one of those characters + escaped via `\`, which may itself need escaping in the container format. +- [r2 GOOD] Embedding messages in certain container formats requires escaping the literal delimiters. + If the container format does not itself support dual quoting, + the embedded message's quotes may be adjusted to avoid their escaping. +- [r3 ???] +- [r4 GOOD] Quotation marks are universally recognized as string delimiters. +- [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, + but many text editors provide features to make working with and around quotes easier. + ### Use round or angle brackets - Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising, especially given the well-established meaning in prose [r4 POOR]. That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22). Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. 
From 0ae3235e6ae64e12207edc285f5c528bf01306e3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 18 Sep 2023 18:25:24 +0200 Subject: [PATCH 07/32] Apply suggestions by Richard Co-authored-by: Richard Gibson --- exploration/0477-quoted-literals.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 5eac47a754..f3a1367083 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -85,7 +85,7 @@ More specifically: - Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. > ```js - > let message = new MessageFormat("en", "{A message with {|a literal|}.}"); + > let message = new MessageFormat('en', '{A message with {|a literal|}.}'); > ``` - Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. @@ -102,7 +102,7 @@ _What properties does the solution have to manifest to enable the use-cases abov - **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, and it also introduces the backslash into the message, `\`, which is the escape introducer. The backslash then needs to be escaped too, when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) - **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. However, note that many programming languages also provide alternative ways of delimiting strings, e.g. 
_raw strings_ or triple-quoted literals. -- **[r3; medium priority]** Minimize the need to change the message in other ways than to escape some of its characters (e.g. rephrase content, use typographic apostrophes, or switch to using a second set of delimtiers). +- **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages (e.g. rephrasing content, using typographic apostrophes, or switching outer delimiters). - **[r4; medium priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. - **[r5; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). From 4aea7449543dc7861b699f049d58e341a44af2db Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Mon, 18 Sep 2023 16:25:45 +0000 Subject: [PATCH 08/32] style: Apply Prettier --- exploration/0477-quoted-literals.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index f3a1367083..c8eb3f521f 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -85,7 +85,7 @@ More specifically: - Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. 
> ```js - > let message = new MessageFormat('en', '{A message with {|a literal|}.}'); + > let message = new MessageFormat("en", "{A message with {|a literal|}.}"); > ``` - Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. From 4b84ef92b01a7f2d3a15983f49e27fac670f1c93 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Tue, 19 Sep 2023 20:13:08 +0200 Subject: [PATCH 09/32] Use a modified RFC 3339 datetime format --- exploration/0477-quoted-literals.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index c8eb3f521f..7bde526a20 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -39,10 +39,10 @@ More specifically: > {…{|New Year's Eve|}…} > ``` -- Message authors may want to use literals to define locale-aware dates as literals in the RFC 7231 format: +- Message authors may want to use literals to define locale-aware dates as literals in a modified RFC 3339 format: > ``` - > {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} + > {The Unix epoch is defined as {|1970-01-01 00:00:00Z| :datetime}.} > ``` - Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements: From 19b9e834b1c8e801e95c19283f9fb9f61f3ecc12 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Tue, 19 Sep 2023 20:13:29 +0200 Subject: [PATCH 10/32] Make the note a note --- exploration/0477-quoted-literals.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 7bde526a20..348216de76 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -50,8 +50,9 @@ More specifically: > ``` > {{+button title=|Click 
here!|}Submit{-button}} > ``` - > - > Note that quoted literals cannot contain placeholders, making interpolating data into them impossible. + + > [!NOTE] + > Quoted literals cannot contain placeholders, making interpolating data into them impossible. > > ``` > -- This is impossible in MessageFormat 2.0. From b0dddf05370bcfd152c526c85c1b51c91b5b0d07 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Tue, 19 Sep 2023 20:13:54 +0200 Subject: [PATCH 11/32] Add more examples of exotic variant keys --- exploration/0477-quoted-literals.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 348216de76..320b8e5aa6 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -59,12 +59,19 @@ More specifically: > {{+button title=|Goodbye, {$userName}!|}Sign out{-button}} > ``` -- Selector function implementers may want to support exotic characters in variant keys to effectively create "mini-DSLs" for the matching logic: +- Selector function implementers may want to support multi-word variant keys or exotic characters in variant keys to effectively create "mini-DSLs" for the matching logic: > ``` - > match {$count :myNumber} + > match ($count :choice} > when |<10| {A handful.} + > when |11..19| {Umpteen.} > when * {Lots.} + > + > match {$arbitraryString} + > when |can't resolve| {Can't resolve!} + > when |11'233.44| {Locale formatted number} + > when |New York| {A multi-word proper name} + > when * {Imagine more...} > ``` - Message authors may want to protect untranslatable strings: From d5975c46fec582956127ca7938e6fe3df47cdaab Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Tue, 19 Sep 2023 20:14:11 +0200 Subject: [PATCH 12/32] quotes are not automatically paired ... 
--- exploration/0477-quoted-literals.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 320b8e5aa6..247364659b 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -148,7 +148,7 @@ By being both uncommon in text content and uncommon as a string delimiter in oth - [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. - [r3 GOOD] Message don't have to be modified otherwise before embedding them. - [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals in email and chat. Vertical bars can also be used to delimit [regular expressions in `sed`](https://en.wikipedia.org/wiki/Vertical_bar#Delimiter), and as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). -- [r5 POOR] Vertical lines cannot be paired by parsers nor IDEs. +- [r5 POOR] Vertical lines are not automatically paired by parsers nor IDEs. ## Alternatives Considered @@ -165,7 +165,7 @@ This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/2 - [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). - [r3 ???] - [r4 GOOD] Quotation marks are universally recognized as string delimiters. 
-- [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. +- [r5 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. ### Dual quoting From dafa8b7be6690be2eee19737018a108b097fb5af Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Wed, 20 Sep 2023 22:30:19 +0200 Subject: [PATCH 13/32] Use sembrs --- exploration/0477-quoted-literals.md | 118 ++++++++++++++++++++++------ 1 file changed, 93 insertions(+), 25 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 247364659b..a75b12139d 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -12,15 +12,26 @@ Status: **Accepted** ## Objective -Document the rationale for including quoted literals in MessageFormat and for delimiting them with the vertical line character, `|`. +Document the rationale for including quoted literals in MessageFormat +and for delimiting them with the vertical line character, `|`. ## Background -MessageFormat allows both quoted and unquoted literals. Unquoted literals satisfy many common use-cases for literals: they are sufficient to represent numbers and single-word option values and variant keys. Quoted literals are helpful in exotic use-cases. +MessageFormat allows both quoted and unquoted literals. +Unquoted literals satisfy many common use-cases for literals: +they are sufficient to represent numbers +and single-word option values and variant keys. +Quoted literals are helpful in exotic use-cases. -In early drafts of the MessageFormat syntax, quoted literals used to be delimited first with quotation marks (`"foo bar"`), and then with round parentheses, e.g. `(foo bar)`. See [#263](https://github.com/unicode-org/message-format-wg/issues/263). 
+In early drafts of the MessageFormat syntax, +quoted literals used to be delimited first with quotation marks (`"foo bar"`), +and then with round parentheses, e.g. `(foo bar)`. +See [#263](https://github.com/unicode-org/message-format-wg/issues/263). -In [#414](https://github.com/unicode-org/message-format-wg/pull/414) proposed to revert these changes and go back to using single and/or double quotes as delimiters. The propsal was rejected. This document is an artifact of that rejection. +[#414](https://github.com/unicode-org/message-format-wg/pull/414) proposed to revert these changes +and go back to using single and/or double quotes as delimiters. +The propsal was rejected. +This document is an artifact of that rejection. ## Use-Cases @@ -33,7 +44,9 @@ In general, quoted literals are useful for: More specifically: -- Message authors and translators need to be able to use the apostrophe in the message content, and may want to use the single quote character to represent it instead of the typograhic (curly) apostrophe. +- Message authors and translators need to be able to use the apostrophe in the message content, + and may want to use the single quote character + to represent it instead of the typograhic (curly) apostrophe. > ``` > {…{|New Year's Eve|}…} @@ -59,7 +72,9 @@ More specifically: > {{+button title=|Goodbye, {$userName}!|}Sign out{-button}} > ``` -- Selector function implementers may want to support multi-word variant keys or exotic characters in variant keys to effectively create "mini-DSLs" for the matching logic: +- Selector function implementers may want to support multi-word variant keys + or exotic characters in variant keys + to effectively create "mini-DSLs" for the matching logic: > ``` > match ($count :choice} @@ -82,7 +97,9 @@ More specifically: > > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). 
-- Message authors may want to decorate substrings as being written in a particular language, different from the message's language, for the purpose of accessibility, text-to-speech, and semantic correctness. +- Message authors may want to decorate substrings as being written in a particular language, + different from the message's language, + for the purpose of accessibility, text-to-speech, and semantic correctness. > ``` > {The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}.} @@ -90,7 +107,8 @@ More specifically: > > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). -- Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. +- Developers may want to embed messages with quoted literals in code written in another programming language + which uses single or double quotes to delimit strings. > ```js > let message = new MessageFormat("en", "{A message with {|a literal|}.}"); @@ -108,19 +126,46 @@ More specifically: _What properties does the solution have to manifest to enable the use-cases above?_ -- **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, and it also introduces the backslash into the message, `\`, which is the escape introducer. The backslash then needs to be escaped too, when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) -- **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. 
However, note that many programming languages also provide alternative ways of delimiting strings, e.g. _raw strings_ or triple-quoted literals. -- **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages (e.g. rephrasing content, using typographic apostrophes, or switching outer delimiters). -- **[r4; medium priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. -- **[r5; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). +- **[r1; high priority]** Minimize the need to escape characters inside literals. + In particular, choose a delimiter that isn't frequently used in translation content. + Having to escape characters inside literals is inconvenient and error-prone when done by hand, + and it also introduces the backslash into the message, `\`, + which is the escape introducer. + The backslash then needs to be escaped too, + when the message is embedded in code or containers. + (This is how some syntaxes produce the gnarly `\\\`.) + +- **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. + In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. + However, note that many programming languages also provide alternative ways of delimiting strings, e.g. _raw strings_ or triple-quoted literals. + +- **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages + (e.g. 
rephrasing content, using typographic apostrophes, or switching outer delimiters). + +- **[r4; medium priority]** Don't surprise users with syntax that's too exotic. + We expect quoted literals to be rare, + which means fewer opportunities to get used to their syntax and remember it. + +- **[r5; low priority]** Be able to pair the opening and the closing delimiter, + to aid parsers recover from syntax errors, + and to leverage IDE's ability to highlight matching pairs of delimiters, + to visually indicate to the user editing a message the bounds of the literal under caret. + However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) + or are outside patterns (when used as variant keys). ## Constraints _What prior decisions and existing conditions limit the possible design?_ -- **[c1]** MessageFormat uses the backslash, `\`, as the escape sequence introducer. -- **[c2]** Straight quotation marks, `'` and `"`, are common in content across many languages, even if other Unicode codepoints should be used in well-formatted text. -- **[c3]** Straight quotation marks, `'` and `"`, are common as string delimiters in many programming languages. +- **[c1]** MessageFormat uses the backslash, `\`, + as the escape sequence introducer. + +- **[c2]** Straight quotation marks, `'` and `"`, + are common in content across many languages, + even if other Unicode codepoints should be used in well-formatted text. + +- **[c3]** Straight quotation marks, `'` and `"`, + are common as string delimiters in many programming languages. ## Proposed Design @@ -142,12 +187,19 @@ quoted-char = %x0-5B ; omit \ quoted-escape = backslash ( backslash / "|" ) ``` -By being both uncommon in text content and uncommon as a string delimiter in other programming languages, the vertical line sidesteps the "inwards" and "outwards" problems of escaping. 
+By being both uncommon in text content +and uncommon as a string delimiter in other programming languages, +the vertical line sidesteps the "inwards" and "outwards" problems of escaping. -- [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. This means no extra `\` that need escaping. +- [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. + This means no extra `\` that need escaping. - [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. - [r3 GOOD] Message don't have to be modified otherwise before embedding them. -- [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. There's prior art in a practice of using vertical lines as delimiters for inline code literals in email and chat. Vertical bars can also be used to delimit [regular expressions in `sed`](https://en.wikipedia.org/wiki/Vertical_bar#Delimiter), and as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). +- [r4 FAIR] Vertical lines are not commonly used as string delimiters + and thus can be harder to learn for beginners. + There's prior art in a practice of using vertical lines as delimiters for inline code literals in email and chat. + Vertical bars can also be used to delimit [regular expressions in `sed`](https://en.wikipedia.org/wiki/Vertical_bar#Delimiter), + and as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). - [r5 POOR] Vertical lines are not automatically paired by parsers nor IDEs. ## Alternatives Considered @@ -161,11 +213,18 @@ _What other properties they have?_ Early drafts of the syntax specification used double quotes to delimit literals. This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015). 
-- [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, which then needs to be escaped itself in code which uses `\` as the escape character (which is common). -- [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. Most notably, storing MF2 messages in JSON suffers from this. In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). +- [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, + which then needs to be escaped itself in code + which uses `\` as the escape character (which is common). +- [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. + Most notably, storing MF2 messages in JSON suffers from this. + In many programming languages, however, alternatives to quotation marks exist, + which could be used to allow unescaped quotes in messages. + See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). - [r3 ???] - [r4 GOOD] Quotation marks are universally recognized as string delimiters. -- [r5 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. +- [r5 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, + but many text editors provide features to make working with and around quotes easier. ### Dual quoting @@ -187,14 +246,23 @@ a variant of the "Use quotation marks" solution. ### Use round or angle brackets -- Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising, especially given the well-established meaning in prose [r4 POOR]. 
That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22). Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. +- Round parentheses are very uncommon as string delimiters [r2 GOOD], + and thus may be surprising, + especially given the well-established meaning in prose [r4 POOR]. + That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22). + Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. - Angle brackets require escaping in XML-based storage formats [r2 POOR]. - All brackets can be easily paired by parsers and IDEs [r5 GOOD]. ### Change escape introducer -Changing the escape sequence introducer from backslash [c1] to another character could help partially mitigate the burden of first escaping literal delimiters and then escaping the escapes themselves [r1]. However, it wouldn't address other requirements and use-cases. +Changing the escape sequence introducer from backslash [c1] to another character +could help partially mitigate the burden of first escaping literal delimiters +and then escaping the escapes themselves [r1]. +However, it wouldn't address other requirements and use-cases. ### Double delimiters to escape them -This is the approach taken by ICU MessageFormat 1.0 for quotes. It allows literals to contain quotes [r1 GOOD] at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR]. +This is the approach taken by ICU MessageFormat 1.0 for quotes. +It allows literals to contain quotes [r1 GOOD] +at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR]. 
From 842ccda8beb55d7e6537ff0deb0c902532be50cb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Sun, 24 Sep 2023 20:49:50 +0200 Subject: [PATCH 14/32] Add a comparison table --- exploration/0477-quoted-literals.md | 70 ++++++++++++++++++++++++++--- 1 file changed, 65 insertions(+), 5 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index a75b12139d..f9a0699ab4 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -208,7 +208,7 @@ _What other solutions are available?_ _How do they compare against the requirements?_ _What other properties they have?_ -### Use quotation marks +### [a1] Use quotation marks Early drafts of the syntax specification used double quotes to delimit literals. This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015). @@ -226,7 +226,7 @@ This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/2 - [r5 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. -### Dual quoting +### [a2] Dual quoting PR [#414](https://github.com/unicode-org/message-format-wg/pull/414) proposes to allow either single quotes `'` or double quotes `"` as literal delimiters, @@ -244,7 +244,7 @@ a variant of the "Use quotation marks" solution. - [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. -### Use round or angle brackets +### [a3] Use round or angle brackets - Round parentheses are very uncommon as string delimiters [r2 GOOD], and thus may be surprising, @@ -254,15 +254,75 @@ a variant of the "Use quotation marks" solution. - Angle brackets require escaping in XML-based storage formats [r2 POOR]. - All brackets can be easily paired by parsers and IDEs [r5 GOOD]. 
-### Change escape introducer +### [a4] Change escape introducer Changing the escape sequence introducer from backslash [c1] to another character could help partially mitigate the burden of first escaping literal delimiters and then escaping the escapes themselves [r1]. However, it wouldn't address other requirements and use-cases. -### Double delimiters to escape them +### [a5] Double delimiters to escape them This is the approach taken by ICU MessageFormat 1.0 for quotes. It allows literals to contain quotes [r1 GOOD] at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR]. + +## Comparison table + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Proposal [a1] [a2] [a3] [a4] [a5]
[r1] escape inside literals ++ - + - ++ ++
[r2] escape when embedding ++ + ++ -/++ -
[r3] escape by modifying ++ ? ? ++
[r4] no surprises + ++ ++ - - +
[r5] pair delimiters - + + ++
From b92eba10c25ce2d6b979233999d3dfe076e9e9d1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Sun, 24 Sep 2023 20:50:52 +0200 Subject: [PATCH 15/32] Change angle brackets' r2 from POOR to FAIR It's POOR when embedding into an XML dialect, but otherwise, <> shouldn't cause too many issues for r2. --- exploration/0477-quoted-literals.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index f9a0699ab4..36fb1b4d8e 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -251,7 +251,7 @@ a variant of the "Use quotation marks" solution. especially given the well-established meaning in prose [r4 POOR]. That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22). Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. -- Angle brackets require escaping in XML-based storage formats [r2 POOR]. +- Angle brackets require escaping in XML-based storage formats [r2 FAIR]. - All brackets can be easily paired by parsers and IDEs [r5 GOOD]. 
### [a4] Change escape introducer @@ -293,7 +293,7 @@ at the expense of doubling the amount of escaping required when embedding messag ++ + ++ - -/++ + +/++ - From 49dd24dd9174fb1e09af7cbc7e71a3f85e23b369 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Sun, 24 Sep 2023 20:56:18 +0200 Subject: [PATCH 16/32] Use single quotes for the JS example Co-authored-by: Eemeli Aro --- exploration/0477-quoted-literals.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index 36fb1b4d8e..ce23756d8b 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -110,8 +110,9 @@ More specifically: - Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. + > ```js - > let message = new MessageFormat("en", "{A message with {|a literal|}.}"); + > let message = new MessageFormat('en', '{A message with {|a literal|}.}'); > ``` - Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. From bed3efe7457113054b940d7197ccb8e7bf496e63 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 10 Nov 2023 18:43:45 +0100 Subject: [PATCH 17/32] Update exploration/0477-quoted-literals.md Co-authored-by: Eemeli Aro --- exploration/0477-quoted-literals.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/exploration/0477-quoted-literals.md b/exploration/0477-quoted-literals.md index ce23756d8b..b1c7fe50ae 100644 --- a/exploration/0477-quoted-literals.md +++ b/exploration/0477-quoted-literals.md @@ -199,8 +199,7 @@ the vertical line sidesteps the "inwards" and "outwards" problems of escaping. - [r4 FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. 
There's prior art in a practice of using vertical lines as delimiters for inline code literals in email and chat. - Vertical bars can also be used to delimit [regular expressions in `sed`](https://en.wikipedia.org/wiki/Vertical_bar#Delimiter), - and as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). + Vertical bars can also be used as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). - [r5 POOR] Vertical lines are not automatically paired by parsers nor IDEs. ## Alternatives Considered From d65481cf655e1acc0797d8fc4fc1b5ecbff79552 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Thu, 9 Nov 2023 10:28:13 +0100 Subject: [PATCH 18/32] Drop the PR number from the filename --- exploration/{0477-quoted-literals.md => quoted-literals.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename exploration/{0477-quoted-literals.md => quoted-literals.md} (100%) diff --git a/exploration/0477-quoted-literals.md b/exploration/quoted-literals.md similarity index 100% rename from exploration/0477-quoted-literals.md rename to exploration/quoted-literals.md From 657e0fffd3d64118766f88a86e367db774188f78 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 10 Nov 2023 18:50:47 +0100 Subject: [PATCH 19/32] Update to the current syntax --- exploration/quoted-literals.md | 41 +++++++++++++++++----------------- 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index b1c7fe50ae..0503008c14 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -49,19 +49,19 @@ More specifically: to represent it instead of the typograhic (curly) apostrophe. 
> ``` - > {…{|New Year's Eve|}…} + > …{|New Year's Eve|}… > ``` - Message authors may want to use literals to define locale-aware dates as literals in a modified RFC 3339 format: > ``` - > {The Unix epoch is defined as {|1970-01-01 00:00:00Z| :datetime}.} + > The Unix epoch is defined as {|1970-01-01 00:00:00Z| :datetime}. > ``` - Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements: > ``` - > {{+button title=|Click here!|}Submit{-button}} + > {+button title=|Click here!|}Submit{-button} > ``` > [!NOTE] @@ -69,7 +69,7 @@ More specifically: > > ``` > -- This is impossible in MessageFormat 2.0. - > {{+button title=|Goodbye, {$userName}!|}Sign out{-button}} + > {+button title=|Goodbye, {$userName}!|}Sign out{-button} > ``` - Selector function implementers may want to support multi-word variant keys @@ -77,49 +77,50 @@ More specifically: to effectively create "mini-DSLs" for the matching logic: > ``` - > match ($count :choice} - > when |<10| {A handful.} - > when |11..19| {Umpteen.} - > when * {Lots.} + > {{ match {$count :choice} + > when |<10| {{A handful.}} + > when |11..19| {{Umpteen.}} + > when * {{Lots.}} + > }} > - > match {$arbitraryString} - > when |can't resolve| {Can't resolve!} - > when |11'233.44| {Locale formatted number} - > when |New York| {A multi-word proper name} - > when * {Imagine more...} + > {{ match {$arbitraryString} + > when |can't resolve| {{Can't resolve!}} + > when |11'233.44| {{Locale formatted number}} + > when |New York| {{A multi-word proper name}} + > when * {{Imagine more...}} + > }} > ``` - Message authors may want to protect untranslatable strings: > ``` - > {Visit {|http://www.example.com| @translate=false}.} + > Visit {|http://www.example.com| @translate=false}. > ``` > - > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). 
+ > See the [expression attributes design proposal](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). - Message authors may want to decorate substrings as being written in a particular language, different from the message's language, for the purpose of accessibility, text-to-speech, and semantic correctness. > ``` - > {The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}.} + > The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}. > ``` > - > See [design proposal 0002](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). + > See the [expression attributes design proposal](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). - Developers may want to embed messages with quoted literals in code written in another programming language which uses single or double quotes to delimit strings. - > ```js - > let message = new MessageFormat('en', '{A message with {|a literal|}.}'); + > let message = new MessageFormat('en', 'A message with {|a literal|}.'); > ``` - Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. > ```json > { - > "msg": "{A message with {|a literal|}.}" + > "msg": "A message with {|a literal|}." 
> } > ``` From 670bd094df341b2e101f08366de59b7913f0752d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 10 Nov 2023 18:51:46 +0100 Subject: [PATCH 20/32] Address some of review comments --- exploration/quoted-literals.md | 36 ++++++++++++++++++++++++++-------- 1 file changed, 28 insertions(+), 8 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 0503008c14..e88b1805c4 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -144,7 +144,7 @@ _What properties does the solution have to manifest to enable the use-cases abov - **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages (e.g. rephrasing content, using typographic apostrophes, or switching outer delimiters). -- **[r4; medium priority]** Don't surprise users with syntax that's too exotic. +- **[r4; medium/high priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. @@ -155,6 +155,25 @@ _What properties does the solution have to manifest to enable the use-cases abov However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). +
+ How can paired delimiters improve parsing recovery? + If both paired delimiters are made special in the literal, + i.e. both the opening and the closing delimiter require escaping inside the literal to be part of its contents, + then the start of another literal can be an anchor point for a parser to stop parsing and attempt to rewind and recover. + + ``` + There {:is a=|broken literal=|here|} + ^ ^ + The closing delimiter is missing here. + The syntax error occurs here. + There {:is a=[broken literal=[here]} + ^^ ^ + The closing delimiter is missing here. + | The parser can recognize a new literal here... + and rewind to here. + ``` +
+ ## Constraints _What prior decisions and existing conditions limit the possible design?_ @@ -173,7 +192,9 @@ _What prior decisions and existing conditions limit the possible design?_ _Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ -Use the vertical line character, `|`, to delimit quoted strings. For example: +Use the vertical line character, `|`, to delimit quoted strings. +The vertical line is rarely found in text content, +and it has sufficiently good delimitation properties. > ``` > {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} @@ -189,18 +210,17 @@ quoted-char = %x0-5B ; omit \ quoted-escape = backslash ( backslash / "|" ) ``` -By being both uncommon in text content -and uncommon as a string delimiter in other programming languages, +By being both uncommon in text content and uncommon as a string delimiter in other programming languages, the vertical line sidesteps the "inwards" and "outwards" problems of escaping. - [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. This means no extra `\` that need escaping. - [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. - [r3 GOOD] Message don't have to be modified otherwise before embedding them. -- [r4 FAIR] Vertical lines are not commonly used as string delimiters +- [r4 POOR/FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. - There's prior art in a practice of using vertical lines as delimiters for inline code literals in email and chat. - Vertical bars can also be used as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). + Vertical bars can be used as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). 
+ However, typically vertical lines tend to be used as delimiters for *separating* rather than for *enclosing*. - [r5 POOR] Vertical lines are not automatically paired by parsers nor IDEs. ## Alternatives Considered @@ -309,7 +329,7 @@ at the expense of doubling the amount of escaping required when embedding messag [r4] no surprises - + + -/+ ++ ++ - From 496a409accfd70d5f7afcb22c207142756ae9467 Mon Sep 17 00:00:00 2001 From: Eemeli Aro Date: Fri, 10 Nov 2023 10:45:28 -0800 Subject: [PATCH 21/32] Add alternative to allow either ', ", or | as delimiters --- exploration/quoted-literals.md | 41 ++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index e88b1805c4..0d15e7af9c 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -288,6 +288,41 @@ This is the approach taken by ICU MessageFormat 1.0 for quotes. It allows literals to contain quotes [r1 GOOD] at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR]. +### [a6] Accept either `|` or quotes + +Allow any of the following as literal delimiters: + +- the vertical line character `|` +- single quotes `'` +- double quotes `"` + +This approach supports multiple different quoting styles to be used for literals. +This flexibility allows for using a familiar and common style such as `'single'` or `"double"` quotes, +while also allowing for `|pipes|` when the message's contents or embedding would otherwise require additional escaping. 
+ +```abnf +literal = quoted / unquoted +quoted = "|" *(quoted-char / "'" / DQUOTE / quoted-escape) "|" + / "'" *(quoted-char / DQUOTE / "|" / quoted-escape) "'" + / DQUOTE *(quoted-char / "'" / "|" / quoted-escape) DQUOTE +quoted-char = %x0-21 ; omit " + / %x23-26 ; omit ' + / %x28-5B ; omit \ + / %x5D-7B ; omit | + / %x7D-D7FF ; omit surrogates + / %xE000-10FFFF +quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) +``` + +- [r1 GOOD] Writing any two of `|`, `"` and `'` in literals doesn't require escaping them via `\`. + This means no extra `\` that need escaping. +- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. +- [r3 FAIR] Message don't have to be modified otherwise before embedding them, + unless they happen to contain conflicting quote delimiters. +- [r4 GOOD] Quotation marks are universally recognized as string delimiters. +- [r5 FAIR] Using the same marks for quote-start and quote-end cannot be paired by parsers nor IDEs, + but many text editors provide features to make working with and around quotes easier. + ## Comparison table @@ -299,6 +334,7 @@ at the expense of doubling the amount of escaping required when embedding messag + @@ -308,6 +344,7 @@ at the expense of doubling the amount of escaping required when embedding messag + @@ -317,6 +354,7 @@ at the expense of doubling the amount of escaping required when embedding messag + @@ -326,6 +364,7 @@ at the expense of doubling the amount of escaping required when embedding messag + @@ -335,6 +374,7 @@ at the expense of doubling the amount of escaping required when embedding messag + @@ -344,6 +384,7 @@ at the expense of doubling the amount of escaping required when embedding messag +
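|                             | [a3] | [a4] | [a5] | [a6] |
|-----------------------------|:----:|:----:|:----:|:----:|
| [r1] escape inside literals |  -   |  ++  |  ++  |  ++  |
| [r2] escape when embedding  | +/++ |      |  -   |  ++  |
| [r3] escape by modifying    |  ++  |      |      |  +   |
| [r4] no surprises           |  -   |  -   |  +   |  ++  |
| [r5] pair delimiters        |  ++  |      |      |  +   |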
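
The `quoted` and `quoted-escape` productions proposed for [a6] in the patch above can be read as a small scanning routine. What follows is only a minimal sketch in JavaScript, assuming a hypothetical helper named `readQuotedLiteral` that is not part of any MessageFormat implementation. It ignores the surrogate exclusions of `quoted-char` and only illustrates how the opening delimiter decides which character closes the literal and which characters may appear unescaped.

```js
// Hypothetical helper for illustration only; not an API of any MessageFormat library.
// Reads one quoted literal that starts at index `start` of `source`.
// Assumption: any of |, ' or " may open a literal, the same character closes it,
// and a backslash may escape a backslash or any of the three delimiters.
function readQuotedLiteral(source, start = 0) {
  const delimiters = ['|', "'", '"'];
  const open = source[start];
  if (!delimiters.includes(open)) {
    throw new SyntaxError(`Expected a literal delimiter at index ${start}`);
  }

  let value = '';
  let i = start + 1;
  while (i < source.length) {
    const ch = source[i];
    if (ch === '\\') {
      // quoted-escape: the backslash must be followed by a backslash or a delimiter.
      const next = source[i + 1];
      if (next !== '\\' && !delimiters.includes(next)) {
        throw new SyntaxError(`Invalid escape at index ${i}`);
      }
      value += next;
      i += 2;
    } else if (ch === open) {
      // Only the delimiter that opened the literal can close it.
      return { value, end: i + 1 };
    } else {
      // The other two delimiters are ordinary content here.
      value += ch;
      i += 1;
    }
  }
  throw new SyntaxError(`Unterminated literal starting at index ${start}`);
}
```

Under these assumptions, `|New Year's Eve|` is read without any escapes, while the same text delimited with single quotes would need `\'` for the apostrophe. That difference is the trade-off the [r1] score is meant to capture.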
From d2a3b44254e18107174cbe4d778cac8c91c32fdc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 13 Nov 2023 17:55:46 +0100 Subject: [PATCH 22/32] Clarify r3; add r6 --- exploration/quoted-literals.md | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 0d15e7af9c..a693553001 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -142,7 +142,7 @@ _What properties does the solution have to manifest to enable the use-cases abov However, note that many programming languages also provide alternative ways of delimiting strings, e.g. _raw strings_ or triple-quoted literals. - **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages - (e.g. rephrasing content, using typographic apostrophes, or switching outer delimiters). + (e.g. rephrasing content, using typographic apostrophes, or switching literal delimiters). - **[r4; medium/high priority]** Don't surprise users with syntax that's too exotic. We expect quoted literals to be rare, @@ -155,6 +155,16 @@ _What properties does the solution have to manifest to enable the use-cases abov However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). +- **[r6; low priority]** Be able to drop messages into host formats without changing them. + For example, if the host format uses `""` to delimit strings literals, + then all occurences of `"` inside messages must be escaped. + While we can't prevent `"` from appearing in translation content, + we can minimize the overhead of the MessageFormat 2 syntax. 
+ This requirement is scored as _low_, because many storage formats don't use delimiters at all (`.properties`, YAML), + or they are meant to be primarily used by machines (JSON), + and because many programming languages provide a way to delimit _raw strings_, + e.g. via backticks in JavaScript and `"""` in Python. +
How can paired delimiters improve parsing recovery? If both paired delimiters are made special in the literal, @@ -242,7 +252,6 @@ This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/2 In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). -- [r3 ???] - [r4 GOOD] Quotation marks are universally recognized as string delimiters. - [r5 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. @@ -260,7 +269,6 @@ a variant of the "Use quotation marks" solution. - [r2 GOOD] Embedding messages in certain container formats requires escaping the literal delimiters. If the container format does not itself support dual quoting, the embedded message's quotes may be adjusted to avoid their escaping. -- [r3 ???] - [r4 GOOD] Quotation marks are universally recognized as string delimiters. - [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. 
@@ -386,5 +394,15 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + + + [r6] drop into host + + + + + + + + From 9495a969f67fd24a553c9d0fa402c57d889ccbcb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 13 Nov 2023 18:10:31 +0100 Subject: [PATCH 23/32] Apply suggestions from code review Co-authored-by: Addison Phillips --- exploration/quoted-literals.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index a693553001..db8f229c86 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -65,16 +65,17 @@ More specifically: > ``` > [!NOTE] - > Quoted literals cannot contain placeholders, making interpolating data into them impossible. - > - > ``` - > -- This is impossible in MessageFormat 2.0. - > {+button title=|Goodbye, {$userName}!|}Sign out{-button} +> Quoted literals are not evaluated as part of a pattern or option sequence. +> This means that their contents cannot be dynamic. +> ``` +> -- The "title" contains the string "{$userName}" +> {+button title=|Goodbye, {$userName}!|}Sign out{-button} > ``` -- Selector function implementers may want to support multi-word variant keys - or exotic characters in variant keys - to effectively create "mini-DSLs" for the matching logic: +- Selector function implementers might need to match different string values + such as those present in data values. + These might include keys containing arbitrary text, multiple words, + or other sequences not otherwise permitted in the syntax. 
> ``` > {{ match {$count :choice} From f82ba41b0379f800360d1dc7d30186e7cd3167dd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 13 Nov 2023 18:15:00 +0100 Subject: [PATCH 24/32] Fix formatting --- exploration/quoted-literals.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index db8f229c86..0817b396a8 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -65,11 +65,11 @@ More specifically: > ``` > [!NOTE] -> Quoted literals are not evaluated as part of a pattern or option sequence. -> This means that their contents cannot be dynamic. -> ``` -> -- The "title" contains the string "{$userName}" -> {+button title=|Goodbye, {$userName}!|}Sign out{-button} + > Quoted literals are not evaluated as part of a pattern or option sequence. + > This means that their contents cannot be dynamic. + > ``` + > -- The "title" contains the string "{$userName}" + > {+button title=|Goodbye, {$userName}!|}Sign out{-button} > ``` - Selector function implementers might need to match different string values From 89494c5532ab31e8476304b5895df77de0b87b84 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 17 Nov 2023 10:25:24 +0100 Subject: [PATCH 25/32] Factor r6 back into r2; apply @eemeli's suggestion about other reasons for escaping --- exploration/quoted-literals.md | 30 ++++++++---------------------- 1 file changed, 8 insertions(+), 22 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 0817b396a8..080d985bc7 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -138,9 +138,15 @@ _What properties does the solution have to manifest to enable the use-cases abov when the message is embedded in code or containers. (This is how some syntaxes produce the gnarly `\\\`.) 
-- **[r2; high priority]** Minimize the need to escape characters when embedding messages in code or containers. +- **[r2; medium priority]** Minimize the need to escape characters or change the host format's string delimiters when embedding messages in code or containers. In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. - However, note that many programming languages also provide alternative ways of delimiting strings, e.g. _raw strings_ or triple-quoted literals. + + This requirement is scored as _medium_, because many storage formats don't use delimiters at all (`.properties`, YAML), + or they are meant to be primarily used by machines (JSON), + and because many programming languages provide a way to delimit _raw strings_, + e.g. via `` in JavaScript and `"""` in Python. + Also, messages including e.g. newlines or `\` escapes in their source + will likely need those characters accounted for when dropping them into new host formats. - **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages (e.g. rephrasing content, using typographic apostrophes, or switching literal delimiters). @@ -156,16 +162,6 @@ _What properties does the solution have to manifest to enable the use-cases abov However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) or are outside patterns (when used as variant keys). -- **[r6; low priority]** Be able to drop messages into host formats without changing them. - For example, if the host format uses `""` to delimit strings literals, - then all occurences of `"` inside messages must be escaped. - While we can't prevent `"` from appearing in translation content, - we can minimize the overhead of the MessageFormat 2 syntax. 
- This requirement is scored as _low_, because many storage formats don't use delimiters at all (`.properties`, YAML), - or they are meant to be primarily used by machines (JSON), - and because many programming languages provide a way to delimit _raw strings_, - e.g. via backticks in JavaScript and `"""` in Python. -
 How can paired delimiters improve parsing recovery? If both paired delimiters are made special in the literal, @@ -395,15 +391,5 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + - - [r6] drop into host - - - - - - - - From 815b822626341148762a2af84e51528fd2f4b491 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 17 Nov 2023 11:02:33 +0100 Subject: [PATCH 26/32] Factor r3 into r1; rename other reqs --- exploration/quoted-literals.md | 54 +++++++++++++--------------------- 1 file changed, 21 insertions(+), 33 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 080d985bc7..9677406c6e 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -132,11 +132,13 @@ _What properties does the solution have to manifest to enable the use-cases abov - **[r1; high priority]** Minimize the need to escape characters inside literals. In particular, choose a delimiter that isn't frequently used in translation content. Having to escape characters inside literals is inconvenient and error-prone when done by hand, - and it also introduces the backslash into the message, `\`, - which is the escape introducer. - The backslash then needs to be escaped too, - when the message is embedded in code or containers. - (This is how some syntaxes produce the gnarly `\\\`.) + and it also introduces the backslash `\` into the message as the escape introducer. + When the message is embedded in code or containers, the backslash then needs to be escaped too; + this is how some syntaxes produce the gnarly `\\\`. + + By minimizing the need to escape characters, + we also minimize the incentive to _avoid_ escaping by changing translation content, + e.g. by rephrasing content or by using typographic punctuation marks. - **[r2; medium priority]** Minimize the need to escape characters or change the host format's string delimiters when embedding messages in code or containers. 
In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. @@ -148,14 +150,11 @@ _What properties does the solution have to manifest to enable the use-cases abov Also, messages including e.g. newlines or `\` escapes in their source will likely need those characters accounted for when dropping them into new host formats. -- **[r3; medium priority]** Minimize the incentive to avoid escaping by changing messages - (e.g. rephrasing content, using typographic apostrophes, or switching literal delimiters). - -- **[r4; medium/high priority]** Don't surprise users with syntax that's too exotic. +- **[r3; medium/high priority]** Do not surprise users with syntax that's too exotic. We expect quoted literals to be rare, which means fewer opportunities to get used to their syntax and remember it. -- **[r5; low priority]** Be able to pair the opening and the closing delimiter, +- **[r4; low priority]** Be able to pair the opening and the closing delimiter, to aid parsers recover from syntax errors, and to leverage IDE's ability to highlight matching pairs of delimiters, to visually indicate to the user editing a message the bounds of the literal under caret. @@ -223,12 +222,11 @@ the vertical line sidesteps the "inwards" and "outwards" problems of escaping. - [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. This means no extra `\` that need escaping. - [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. -- [r3 GOOD] Message don't have to be modified otherwise before embedding them. -- [r4 POOR/FAIR] Vertical lines are not commonly used as string delimiters +- [r3 POOR/FAIR] Vertical lines are not commonly used as string delimiters and thus can be harder to learn for beginners. Vertical bars can be used as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). 
However, typically vertical lines tend to be used as delimiters for *separating* rather than for *enclosing*. -- [r5 POOR] Vertical lines are not automatically paired by parsers nor IDEs. +- [r4 POOR] Vertical lines are not automatically paired by parsers nor IDEs. ## Alternatives Considered @@ -249,8 +247,8 @@ This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/2 In many programming languages, however, alternatives to quotation marks exist, which could be used to allow unescaped quotes in messages. See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). -- [r4 GOOD] Quotation marks are universally recognized as string delimiters. -- [r5 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, +- [r3 GOOD] Quotation marks are universally recognized as string delimiters. +- [r4 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. ### [a2] Dual quoting @@ -266,8 +264,8 @@ a variant of the "Use quotation marks" solution. - [r2 GOOD] Embedding messages in certain container formats requires escaping the literal delimiters. If the container format does not itself support dual quoting, the embedded message's quotes may be adjusted to avoid their escaping. -- [r4 GOOD] Quotation marks are universally recognized as string delimiters. -- [r5 FAIR] Quotation marks cannot be paired by parsers nor IDEs, +- [r3 GOOD] Quotation marks are universally recognized as string delimiters. +- [r4 FAIR] Quotation marks cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. ### [a3] Use round or angle brackets @@ -321,11 +319,11 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) - [r1 GOOD] Writing any two of `|`, `"` and `'` in literals doesn't require escaping them via `\`. 
 This means no extra `\` that need escaping. -- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. -- [r3 FAIR] Message don't have to be modified otherwise before embedding them, + Messages don't have to be modified otherwise before embedding them, unless they happen to contain conflicting quote delimiters. -- [r4 GOOD] Quotation marks are universally recognized as string delimiters. -- [r5 FAIR] Using the same marks for quote-start and quote-end cannot be paired by parsers nor IDEs, +- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. +- [r3 GOOD] Quotation marks are universally recognized as string delimiters. +- [r4 FAIR] Using the same marks for quote-start and quote-end cannot be paired by parsers nor IDEs, but many text editors provide features to make working with and around quotes easier. ## Comparison table @@ -362,17 +360,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) ++ - [r3] escape by modifying - ++ - ? - ? - ++ - - - + - - - [r4] no surprises + [r3] no surprises -/+ ++ ++ @@ -382,7 +370,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) ++ - [r5] pair delimiters + [r4] pair delimiters - + + From de03143ea8cc647018c77c005ddfa4209e39f88f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 17 Nov 2023 11:02:59 +0100 Subject: [PATCH 27/32] Add r5 about one way of doing things --- exploration/quoted-literals.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 9677406c6e..0bae6b07cd 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -180,6 +180,11 @@ _What properties does the solution have to manifest to enable the use-cases abov ``` 
+- **[r5; low priority]** Do not require users to choose between too many syntax options. + > There should be one — and preferably only one — obvious way to do it.
+ > Although that way may not be obvious at first unless you're Dutch.
+ > — _[The Zen of Python](https://peps.python.org/pep-0020/)_ + ## Constraints _What prior decisions and existing conditions limit the possible design?_ @@ -379,5 +384,15 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + + + [r5] one way + ++ + ++ + + + ++ + + + - + From 3d3798bfa5ed4834c1474d23c4f8542c4bbcfaa4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 17 Nov 2023 11:14:27 +0100 Subject: [PATCH 28/32] Add priorities to the table --- exploration/quoted-literals.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 0bae6b07cd..1d780c7324 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -336,6 +336,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + @@ -346,6 +347,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + @@ -356,6 +358,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + @@ -366,6 +369,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + @@ -376,6 +380,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + @@ -386,6 +391,7 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) + From b9c149c02149ddff39465e375f6df256fa8765b2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Fri, 17 Nov 2023 11:14:53 +0100 Subject: [PATCH 29/32] Fix the Zen of Python quote --- exploration/quoted-literals.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 1d780c7324..921e90e6f7 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -182,8 +182,7 @@ _What properties does the solution have to manifest to enable the use-cases abov - **[r5; low priority]** Do not require users to choose between too many syntax options. 
> There should be one — and preferably only one — obvious way to do it.
- > Although that way may not be obvious at first unless you're Dutch.
- > — _[The Zen of Python](https://peps.python.org/pep-0020/)_ + > —_[The Zen of Python](https://peps.python.org/pep-0020/)_ ## Constraints From 1f3f1b9029098b2fd1f180599570e48bed18c73e Mon Sep 17 00:00:00 2001 From: Eemeli Aro Date: Mon, 20 Nov 2023 19:14:24 +0200 Subject: [PATCH 30/32] Apply suggestions from code review --- exploration/quoted-literals.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 921e90e6f7..f6f0a3eb05 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -306,6 +306,10 @@ Allow any of the following as literal delimiters: This approach supports multiple different quoting styles to be used for literals. This flexibility allows for using a familiar and common style such as `'single'` or `"double"` quotes, while also allowing for `|pipes|` when the message's contents or embedding would otherwise require additional escaping. +This means that literals could for example prefer `'single quotes'`, +but use `"double 'em"` if necessary, +or `|'pipe' characters|` if the whole message is wrapped in `"quotes"` due to the host format +or if the literal value contains both `'` and `"` quotes. ```abnf literal = quoted / unquoted From e15390a51883001d6eb77800de47d83ae6e01e6c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Stanis=C5=82aw=20Ma=C5=82olepszy?= Date: Mon, 27 Nov 2023 14:02:21 +0100 Subject: [PATCH 31/32] Use a Markdown table --- exploration/quoted-literals.md | 76 ++++------------------------------ 1 file changed, 7 insertions(+), 69 deletions(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index f6f0a3eb05..6eba257ed5 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -336,72 +336,10 @@ quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) ## Comparison table -
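|                             | Priority | Proposal | [a1] | [a2] |
|-----------------------------|----------|:--------:|:----:|:----:|
| [r1] escape inside literals | HIGH     |    ++    |  -   |  +   |
| [r2] escape when embedding  | MED      |    ++    |  +   |  ++  |
| [r3] no surprises           | MED/HIGH |   -/+    |  ++  |  ++  |
| [r4] pair delimiters        | LOW      |    -     |  +   |  +   |
| [r5] one way                | LOW      |    ++    |  ++  |  +   |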
-|                             | Priority | Proposal | [a1] | [a2] | [a3] | [a4] | [a5] | [a6] |
-|-----------------------------|----------|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|
-| [r1] escape inside literals | HIGH     |    ++    |  -   |  +   |  -   |  ++  |  ++  |  ++  |
-| [r2] escape when embedding  | MED      |    ++    |  +   |  ++  | +/++ |      |  -   |  ++  |
-| [r3] no surprises           | MED/HIGH |   -/+    |  ++  |  ++  |  -   |  -   |  +   |  ++  |
-| [r4] pair delimiters        | LOW      |    -     |  +   |  +   |  ++  |      |      |  +   |
-| [r5] one way                | LOW      |    ++    |  ++  |  +   |  ++  |      |      |  -   |
+|                             | Priority | Proposal | [a1] | [a2] | [a3] | [a4] | [a5] | [a6] |
+|-----------------------------|----------|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|
+| [r1] escape inside literals | HIGH     |    ++    |  -   |  +   |  -   |  ++  |  ++  |  ++  |
+| [r2] escape when embedding  | MED      |    ++    |  +   |  ++  | +/++ |      |  -   |  ++  |
+| [r3] no surprises           | MED/HIGH |   -/+    |  ++  |  ++  |  -   |  -   |  +   |  ++  |
+| [r4] pair delimiters        | LOW      |    -     |  +   |  +   |  ++  |      |      |  +   |
+| [r5] one way                | LOW      |    ++    |  ++  |  +   |  ++  |      |      |  -   |
From 709318a9a1a3e10c8383bbbc271b569146af3891 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Mon, 27 Nov 2023 09:43:22 -0800 Subject: [PATCH 32/32] Update exploration/quoted-literals.md Co-authored-by: Eemeli Aro --- exploration/quoted-literals.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exploration/quoted-literals.md b/exploration/quoted-literals.md index 6eba257ed5..e46d7451c7 100644 --- a/exploration/quoted-literals.md +++ b/exploration/quoted-literals.md @@ -1,6 +1,6 @@ # Quoted Literals -Status: **Accepted** +Status: **Proposed**
Metadata