Document the design of quoted literals #477
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,345 @@ | ||
# Quoted Literals | ||
|
||
Status: **Proposed** | ||
|
||
<details> | ||
<summary>Metadata</summary> | ||
<dl> | ||
<dt>Pull Request</dt> | ||
<dd><a href="https://github.com/unicode-org/message-format-wg/pull/477">#477</a></dd> | ||
</dl> | ||
</details> | ||
|
||
## Objective | ||
|
||
Document the rationale for including quoted literals in MessageFormat | ||
and for delimiting them with the vertical line character, `|`. | ||
|
||
## Background | ||
|
||
MessageFormat allows both quoted and unquoted literals. | ||
Unquoted literals satisfy many common use-cases for literals: | ||
they are sufficient to represent numbers | ||
and single-word option values and variant keys. | ||
Quoted literals are helpful in exotic use-cases. | ||
|
||
In early drafts of the MessageFormat syntax, | ||
quoted literals were delimited first with quotation marks (`"foo bar"`), | ||
and then with round parentheses, e.g. `(foo bar)`. | ||
See [#263](https://github.com/unicode-org/message-format-wg/issues/263). | ||
|
||
[#414](https://github.com/unicode-org/message-format-wg/pull/414) proposed to revert these changes | ||
and go back to using single and/or double quotes as delimiters. | ||
The proposal was rejected. | ||
This document is an artifact of that rejection. | ||
|
||
## Use-Cases | ||
|
||
_What use-cases do we see? Ideally, quote concrete examples._ | ||
|
||
In general, quoted literals are useful for: | ||
|
||
1. encoding literals containing whitespace, like literals consisting of multiple words, | ||
1. encoding literals containing exotic characters that do not conform to the `unquoted` production in ABNF. | ||
|
||
More specifically: | ||
|
||
- Message authors and translators need to be able to use the apostrophe in the message content, | ||
and may want to use the single quote character | ||
to represent it instead of the typographic (curly) apostrophe. | ||
|
||
> ``` | ||
> …{|New Year's Eve|}… | ||
> ``` | ||
|
||
- Message authors may want to define locale-aware dates as literals in a modified RFC 3339 format: | ||
|
||
> ``` | ||
> The Unix epoch is defined as {|1970-01-01 00:00:00Z| :datetime}. | ||
> ``` | ||
|
||
- Message authors may want to use multiple words as values of certain options passed to custom functions and markup elements: | ||
|
||
> ``` | ||
> {+button title=|Click here!|}Submit{-button} | ||
> ``` | ||
|
||
> [!NOTE] | ||
> Quoted literals are not evaluated as part of a pattern or option sequence. | ||
> This means that their contents cannot be dynamic. | ||
> ``` | ||
> -- The "title" contains the string "{$userName}" | ||
> {+button title=|Goodbye, {$userName}!|}Sign out{-button} | ||
> ``` | ||
|
||
- Selector function implementers might need to match different string values | ||
such as those present in data values. | ||
These might include keys containing arbitrary text, multiple words, | ||
or other sequences not otherwise permitted in the syntax. | ||
|
||
> ``` | ||
> {{ match {$count :choice} | ||
> when |<10| {{A handful.}} | ||
> when |11..19| {{Umpteen.}} | ||
> when * {{Lots.}} | ||
> }} | ||
> | ||
> {{ match {$arbitraryString} | ||
> when |can't resolve| {{Can't resolve!}} | ||
> when |11'233.44| {{Locale formatted number}} | ||
> when |New York| {{A multi-word proper name}} | ||
> when * {{Imagine more...}} | ||
> }} | ||
> ``` | ||
|
||
- Message authors may want to protect untranslatable strings: | ||
|
||
> ``` | ||
> Visit {|http://www.example.com| @translate=false}. | ||
> ``` | ||
> | ||
> See the [expression attributes design proposal](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). | ||
|
||
- Message authors may want to decorate substrings as being written in a particular language, | ||
different from the message's language, | ||
for the purpose of accessibility, text-to-speech, and semantic correctness. | ||
|
||
> ``` | ||
> The official native name of the Republic of Poland is {|Rzeczpospolita Polska| @lang=pl}. | ||
> ``` | ||
> | ||
> See the [expression attributes design proposal](https://github.com/unicode-org/message-format-wg/blob/main/exploration/0002-expression-attributes.md). | ||
|
||
- Developers may want to embed messages with quoted literals in code written in another programming language | ||
which uses single or double quotes to delimit strings. | ||
|
||
> ```js | ||
> let message = new MessageFormat('en', 'A message with {|a literal|}.'); | ||
> ``` | ||
|
||
- Developers and localization engineers may want to embed messages with quoted literals in a container format, such as JSON. | ||
|
||
> ```json | ||
> { | ||
> "msg": "A message with {|a literal|}." | ||
> } | ||
> ``` | ||
|
||
## Requirements | ||
|
||
_What properties does the solution have to manifest to enable the use-cases above?_ | ||
|
||
- **[r1; high priority]** Minimize the need to escape characters inside literals. | ||
In particular, choose a delimiter that isn't frequently used in translation content. | ||
Having to escape characters inside literals is inconvenient and error-prone when done by hand, | ||
and it also introduces the backslash `\` into the message as the escape introducer. | ||
When the message is embedded in code or containers, the backslash then needs to be escaped too; | ||
this is how some syntaxes produce the gnarly `\\\`. | ||
|
||
By minimizing the need to escape characters, | ||
we also minimize the incentive to _avoid_ escaping by changing translation content, | ||
e.g. by rephrasing content or by using typographic punctuation marks. | ||
|
||
- **[r2; medium priority]** Minimize the need to escape characters or change the host format's string delimiters when embedding messages in code or containers. | ||
In particular, choose a delimiter that isn't frequently used as a string delimiter in programming languages and container formats. | ||
|
||
This requirement is scored as _medium_, because many storage formats don't use delimiters at all (`.properties`, YAML), | ||
or they are meant to be primarily used by machines (JSON), | ||
and because many programming languages provide alternative string delimiters that leave quotes unescaped, | ||
e.g. <code>``</code> in JavaScript and `"""` in Python. | ||
Also, messages including e.g. newlines or `\` escapes in their source | ||
will likely need those characters accounted for when dropping them into new host formats. | ||
|
||
- **[r3; medium/high priority]** Do not surprise users with syntax that's too exotic. | ||
We expect quoted literals to be rare, | ||
which means fewer opportunities to get used to their syntax and remember it. | ||
|
||
- **[r4; low priority]** Be able to pair the opening and the closing delimiter, | ||
to help parsers recover from syntax errors, | ||
and to leverage IDEs' ability to highlight matching pairs of delimiters, | ||
which visually indicates the bounds of the literal under the caret to the user editing a message. | ||
However, quoted literals are usually short and already enclosed in a placeholder (which has its own delimiters) | ||
or are outside patterns (when used as variant keys). | ||
|
||
<details> | ||
<summary>How can paired delimiters improve parsing recovery?</summary> | ||
If both paired delimiters are made special in the literal, | ||
i.e. both the opening and the closing delimiter require escaping inside the literal to be part of its contents, | ||
then the start of another literal can be an anchor point for a parser to stop parsing and attempt to rewind and recover. | ||
|
||
``` | ||
There {:is a=|broken literal=|here|} | ||
^ ^ | ||
The closing delimiter is missing here. | ||
The syntax error occurs here. | ||
There {:is a=[broken literal=[here]} | ||
^^ ^ | ||
The closing delimiter is missing here. | ||
| The parser can recognize a new literal here... | ||
and rewind to here. | ||
``` | ||
</details> | ||
|
||
- **[r5; low priority]** Do not require users to choose between too many syntax options. | ||
> There should be one — and preferably only one — obvious way to do it.<br> | ||
> —_[The Zen of Python](https://peps.python.org/pep-0020/)_ | ||
|
||
## Constraints | ||
|
||
_What prior decisions and existing conditions limit the possible design?_ | ||
|
||
- **[c1]** MessageFormat uses the backslash, `\`, | ||
as the escape sequence introducer. | ||
|
||
- **[c2]** Straight quotation marks, `'` and `"`, | ||
are common in content across many languages, | ||
even if other Unicode codepoints should be used in well-formatted text. | ||
|
||
- **[c3]** Straight quotation marks, `'` and `"`, | ||
are common as string delimiters in many programming languages. | ||
|
||
## Proposed Design | ||
|
||
_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._ | ||
|
||
Use the vertical line character, `|`, to delimit quoted strings. | ||
The vertical line is rarely found in text content, | ||
and it has sufficiently good delimitation properties. | ||
|
||
> ``` | ||
> {The Unix epoch is defined as {|Thu, 01 Jan 1970 00:00:00 GMT| :datetime}.} | ||
> ``` | ||
|
||
```abnf | ||
literal = quoted / unquoted | ||
quoted = "|" *(quoted-char / quoted-escape) "|" | ||
quoted-char = %x0-5B ; omit \ | ||
/ %x5D-7B ; omit | | ||
/ %x7D-D7FF ; omit surrogates | ||
/ %xE000-10FFFF | ||
quoted-escape = backslash ( backslash / "|" ) | ||
``` | ||
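
As an illustration of the grammar above, here is a minimal sketch of how a parser might read a `quoted` literal. The function name `parseQuoted` and its return shape are hypothetical and not part of the specification. The sketch assumes that `backslash` is the `\` character and does not check for surrogate code points.

```js
// Hypothetical helper: reads a |quoted| literal starting at index `start`.
// Returns the decoded value and the index just past the closing "|".
function parseQuoted(source, start = 0) {
  if (source[start] !== '|') {
    throw new SyntaxError(`Expected "|" at index ${start}`);
  }
  let value = '';
  let i = start + 1;
  while (i < source.length) {
    const ch = source[i];
    if (ch === '|') {
      return { value, end: i + 1 };
    }
    if (ch === '\\') {
      // quoted-escape = backslash ( backslash / "|" )
      const next = source[i + 1];
      if (next !== '\\' && next !== '|') {
        throw new SyntaxError(`Invalid escape at index ${i}`);
      }
      value += next;
      i += 2;
      continue;
    }
    // quoted-char: anything other than "\" and "|"
    // (this sketch does not reject surrogates)
    value += ch;
    i += 1;
  }
  throw new SyntaxError('Missing closing "|"');
}

console.log(parseQuoted("|New Year's Eve|").value); // New Year's Eve
console.log(parseQuoted(String.raw`|a \| b|`).value); // a | b
```

Note that the apostrophe in the first example needs no escaping, which is the property [r1] asks for.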
|
||
By being both uncommon in text content and uncommon as a string delimiter in other programming languages, | ||
the vertical line sidesteps the "inwards" and "outwards" problems of escaping. | ||
|
||
- [r1 GOOD] Writing `"` and `'` in literals doesn't require escaping them via `\`. | ||
This means no extra `\` that need escaping. | ||
- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. | ||
- [r3 POOR/FAIR] Vertical lines are not commonly used as string delimiters | ||
and thus can be harder to learn for beginners. | ||
Vertical bars can be used as a separator in [delimiter-separated data formats](http://www.catb.org/~esr/writings/taoup/html/ch05s02.html). | ||
However, vertical lines are typically used for *separating* rather than for *enclosing*. | ||
- [r4 POOR] Vertical lines are not automatically paired by parsers nor IDEs. | ||
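
To make the "inwards" and "outwards" escaping problems concrete, the sketch below embeds the same message in JSON twice: once with `|` delimiters, and once with an assumed double-quote-delimited literal syntax in which inner quotes are escaped with `\`. The helper `countBackslashes` and both sample messages are illustrative assumptions; only `JSON.stringify` is a standard API.

```js
// Hypothetical helper for this demo: counts backslashes in a string.
const countBackslashes = (s) => (s.match(/\\/g) || []).length;

// The same literal value, She said "hi", under two delimiter choices.
const withPipes = 'Reply: {|She said "hi"|}';
// Assumed quote-delimited syntax: quotes inside the literal need a \ escape.
const withQuotes = 'Reply: {"She said \\"hi\\""}';

// Embedding in JSON escapes every " and every \ once more.
const jsonPipes = JSON.stringify({ msg: withPipes });
const jsonQuotes = JSON.stringify({ msg: withQuotes });

console.log(jsonPipes);
console.log(jsonQuotes);
console.log(countBackslashes(jsonPipes), countBackslashes(jsonQuotes)); // 2 8
```

With `|` delimiters, only the two quotes that belong to the content are escaped by JSON. With quote delimiters, the delimiters themselves and the inner `\` escapes are escaped again, which yields the `\\\"` sequences mentioned under [r1].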
|
||
## Alternatives Considered | ||
|
||
_What other solutions are available?_ | ||
_How do they compare against the requirements?_ | ||
_What other properties do they have?_ | ||
|
||
### [a1] Use quotation marks | ||
|
||
Early drafts of the syntax specification used double quotes to delimit literals. | ||
This changed in [#263](https://github.com/unicode-org/message-format-wg/issues/263#issue-1233590015). | ||
|
||
- [r1 POOR] Writing `"` and `'` in literals requires escaping them via `\`, | ||
which then needs to be escaped itself in code | ||
which uses `\` as the escape character (which is common). | ||
- [r2 FAIR] Embedding messages in certain programming languages and containers requires escaping the literal delimiters. | ||
Most notably, storing MF2 messages in JSON suffers from this. | ||
In many programming languages, however, alternatives to quotation marks exist, | ||
which could be used to allow unescaped quotes in messages. | ||
See [comment on #263](https://github.com/unicode-org/message-format-wg/issues/263#issuecomment-1430929542). | ||
- [r3 GOOD] Quotation marks are universally recognized as string delimiters. | ||
- [r4 FAIR] Quotation marks are not automatically paired by parsers nor IDEs, | ||
but many text editors provide features to make working with and around quotes easier. | ||
|
||
### [a2] Dual quoting | ||
|
||
PR [#414](https://github.com/unicode-org/message-format-wg/pull/414) proposes to | ||
allow either single quotes `'` or double quotes `"` as literal delimiters, | ||
a variant of the "Use quotation marks" solution. | ||
|
||
- [r1 FAIR] Writing `"` and `'` in literals doesn't require escaping them via `\`, | ||
as long as they do not match the literal's delimiter. | ||
Literals containing both `'` and `"` will need to have at least one of those characters | ||
escaped via `\`, which may itself need escaping in the container format. | ||
- [r2 GOOD] Embedding messages in certain container formats requires escaping the literal delimiters. | ||
If the container format does not itself support dual quoting, | ||
the embedded message's quotes may be adjusted to avoid their escaping. | ||
- [r3 GOOD] Quotation marks are universally recognized as string delimiters. | ||
- [r4 FAIR] Quotation marks cannot be paired by parsers nor IDEs, | ||
but many text editors provide features to make working with and around quotes easier. | ||
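
The [r1] trade-off of dual quoting can be sketched as a serializer that picks whichever quote does not occur in the value. `quoteDual` is a hypothetical helper and is not taken from #414; it assumes `\` is the escape introducer, per [c1].

```js
// Hypothetical serializer for the dual-quoting alternative.
function quoteDual(value) {
  const hasSingle = value.includes("'");
  const hasDouble = value.includes('"');
  // Prefer single quotes; switch to double quotes only when that avoids escaping.
  const delim = hasSingle && !hasDouble ? '"' : "'";
  // Escape backslashes first, then any occurrence of the chosen delimiter.
  const body = value
    .replaceAll('\\', '\\\\')
    .replaceAll(delim, '\\' + delim);
  return delim + body + delim;
}

console.log(quoteDual('New York'));      // 'New York'
console.log(quoteDual("can't resolve")); // "can't resolve"
console.log(quoteDual(`say "it's"`));    // 'say "it\'s"'
```

Only the last value, which contains both kinds of quote, ends up with a `\` that may need further escaping in the host format.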
|
||
### [a3] Use round or angle brackets | ||
|
||
- Round parentheses are very uncommon as string delimiters [r2 GOOD], | ||
and thus may be surprising, | ||
especially given their well-established meaning in prose [r3 POOR]. | ||
That said, there's prior art in using them for [delimiting strings in PostScript](https://en.wikipedia.org/wiki/PostScript#%22Hello_world%22). | ||
Furthermore, they are relatively common in text, where they'd require escaping [r1 POOR]. | ||
- Angle brackets require escaping in XML-based storage formats [r2 FAIR]. | ||
- All brackets can be easily paired by parsers and IDEs [r4 GOOD]. | ||
|
||
### [a4] Change escape introducer | ||
|
||
Changing the escape sequence introducer from backslash [c1] to another character | ||
could help partially mitigate the burden of first escaping literal delimiters | ||
and then escaping the escapes themselves [r1]. | ||
However, it wouldn't address other requirements and use-cases. | ||
|
||
### [a5] Double delimiters to escape them | ||
|
||
This is the approach taken by ICU MessageFormat 1.0 for quotes. | ||
It allows literals to contain quotes [r1 GOOD] | ||
at the expense of doubling the amount of escaping required when embedding messages in code [r2 POOR]. | ||
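
A minimal sketch of delimiter doubling follows. The helpers `quoteDoubled` and `unquoteDoubled` are hypothetical and only illustrate the general idea; they do not reproduce ICU MessageFormat's actual quoting rules, and `unquoteDoubled` does not validate stray single delimiters.

```js
// Hypothetical helpers illustrating delimiter doubling.
function quoteDoubled(value, delim = "'") {
  return delim + value.replaceAll(delim, delim + delim) + delim;
}

function unquoteDoubled(quoted, delim = "'") {
  if (quoted.length < 2 || !quoted.startsWith(delim) || !quoted.endsWith(delim)) {
    throw new SyntaxError('Literal is not delimited');
  }
  // Collapse each doubled delimiter back into a single one.
  return quoted.slice(1, -1).replaceAll(delim + delim, delim);
}

const literal = quoteDoubled("can't resolve");
console.log(literal);                 // 'can''t resolve'
console.log(unquoteDoubled(literal)); // can't resolve

// Embedded in a single-quoted JavaScript string, each of the four
// apostrophes in the literal needs a further backslash.
const embedded = 'when \'can\'\'t resolve\' {{...}}';
console.log(embedded); // when 'can''t resolve' {{...}}
```

No backslash appears inside the literal itself, but the host language's escaping cost grows with every doubled delimiter.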
|
||
### [a6] Accept either `|` or quotes | ||
|
||
Allow any of the following as literal delimiters: | ||
|
||
- the vertical line character `|` | ||
- single quotes `'` | ||
- double quotes `"` | ||
|
||
This approach allows multiple quoting styles to be used for literals. | ||
This flexibility allows for using a familiar and common style such as `'single'` or `"double"` quotes, | ||
while also allowing for `|pipes|` when the message's contents or embedding would otherwise require additional escaping. | ||
|
||
This means that literals could for example prefer `'single quotes'`, | ||
but use `"double 'em"` if necessary, | ||
or `|'pipe' characters|` if the whole message is wrapped in `"quotes"` due to the host format | ||
or if the literal value contains both `'` and `"` quotes. | ||
|
||
```abnf | ||
literal = quoted / unquoted | ||
quoted = "|" *(quoted-char / "'" / DQUOTE / quoted-escape) "|" | ||
/ "'" *(quoted-char / DQUOTE / "|" / quoted-escape) "'" | ||
/ DQUOTE *(quoted-char / "'" / "|" / quoted-escape) DQUOTE | ||
quoted-char = %x0-21 ; omit " | ||
/ %x23-26 ; omit ' | ||
/ %x28-5B ; omit \ | ||
/ %x5D-7B ; omit | | ||
/ %x7D-D7FF ; omit surrogates | ||
/ %xE000-10FFFF | ||
quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE ) | ||
``` | ||
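
One way to read the grammar above is that the opening character selects the delimiter, and the other two marks become ordinary content. The sketch below follows that reading; `parseQuotedAny` and `DELIMITERS` are hypothetical names, `backslash` is assumed to be `\`, and surrogates are not checked.

```js
// Hypothetical parser for a literal delimited by |, ' or ".
const DELIMITERS = new Set(['|', "'", '"']);

function parseQuotedAny(source, start = 0) {
  const delim = source[start];
  if (!DELIMITERS.has(delim)) {
    throw new SyntaxError(`Expected a literal delimiter at index ${start}`);
  }
  let value = '';
  for (let i = start + 1; i < source.length; i += 1) {
    const ch = source[i];
    if (ch === delim) return { value, delim, end: i + 1 };
    if (ch === '\\') {
      // quoted-escape = backslash ( backslash / "|" / "'" / DQUOTE )
      const next = source[i + 1];
      if (next !== '\\' && !DELIMITERS.has(next)) {
        throw new SyntaxError(`Invalid escape at index ${i}`);
      }
      value += next;
      i += 1; // skip the escaped character
    } else {
      value += ch;
    }
  }
  throw new SyntaxError(`Missing closing ${delim}`);
}

console.log(parseQuotedAny(`'single quotes'`).value);        // single quotes
console.log(parseQuotedAny(`"double 'em"`).value);           // double 'em
console.log(parseQuotedAny(`|'pipe' "characters"|`).value);  // 'pipe' "characters"
```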
|
||
- [r1 GOOD] Writing any two of `|`, `"` and `'` in literals doesn't require escaping them via `\`. | ||
This means no extra `\` that need escaping. | ||
Messages don't have to be modified otherwise before embedding them, | ||
unless they happen to contain conflicting quote delimiters. | ||
Comment on lines +330 to +331
This is actually a con "in a trenchcoat" :-) Messages MUST be modified if they "happen to contain conflicting quote delimiters", e.g. those that conflict with the syntax. My big concern here is: how do tools up and down the translation stack decide what quotes to use? There should be a "con" somewhere for having multiple ways of doing the same thing.
|
||
- [r2 GOOD] Embedding messages in most code or containers doesn't require escaping the literal delimiters. | ||
- [r3 GOOD] Quotation marks are universally recognized as string delimiters. | ||
- [r4 FAIR] Identical opening and closing marks cannot be paired by parsers or IDEs, | ||
but many text editors provide features to make working with and around quotes easier. | ||
|
||
## Comparison table | ||
|
||
| | Priority | Proposal | [a1] | [a2] | [a3] | [a4] | [a5] | [a6] | | ||
|-----------------------------|----------|:--------:|:----:|:----:|:----:|:----:|:----:|:----:| | ||
| [r1] escape inside literals | HIGH | ++ | - | + | - | ++ | ++ | ++ | | ||
| [r2] escape when embedding | MED | ++ | + | ++ | +/++ | | - | ++ | | ||
| [r3] no surprises | MED/HIGH | -/+ | ++ | ++ | - | - | + | ++ | | ||
| [r4] pair delimiters | LOW | - | + | + | ++ | | | + | | ||
| [r5] one way | LOW | ++ | ++ | + | ++ | | | - | |