diff --git a/spec/formatting.md b/spec/formatting.md index 34b5c5f028..c141451217 100644 --- a/spec/formatting.md +++ b/spec/formatting.md @@ -768,7 +768,16 @@ That is, the text can can consist of a mixture of left-to-right and right-to-lef The display of bidirectional text is defined by the [Unicode Bidirectional Algorithm](http://www.unicode.org/reports/tr9/) [UAX9]. -The directionality of the message as a whole is provided by the _formatting context_. +The directionality of the formatted _message_ as a whole is provided by the _formatting context_. + +> [!NOTE] +> Keep in mind the difference between the formatted output of a _message_, +> which is the topic of this section, +> and the syntax of a _message_ prior to formatting. +> The processing of a _message_ depends on the logical sequence of Unicode code points, +> not on the presentation of the _message_. +> Affordances to allow users appropriate control over the appearance of the +> _message_'s syntax have been provided. When a _message_ is formatted, _placeholders_ are replaced with their formatted representation. 
diff --git a/spec/message.abnf b/spec/message.abnf index a5966ee0bf..0b251ce270 100644 --- a/spec/message.abnf +++ b/spec/message.abnf @@ -1,41 +1,41 @@ message = simple-message / complex-message -simple-message = [s] [simple-start pattern] +simple-message = o [simple-start pattern] simple-start = simple-start-char / escaped-char / placeholder pattern = *(text-char / escaped-char / placeholder) placeholder = expression / markup -complex-message = [s] *(declaration [s]) complex-body [s] +complex-message = o *(declaration o) complex-body o declaration = input-declaration / local-declaration complex-body = quoted-pattern / matcher -input-declaration = input [s] variable-expression -local-declaration = local s variable [s] "=" [s] expression +input-declaration = input o variable-expression +local-declaration = local s variable o "=" o expression -quoted-pattern = "{{" pattern "}}" +quoted-pattern = o "{{" pattern "}}" -matcher = match-statement s variant *([s] variant) +matcher = match-statement s variant *(o variant) match-statement = match 1*(s selector) selector = variable -variant = key *(s key) [s] quoted-pattern +variant = key *(s key) quoted-pattern key = literal / "*" ; Expressions expression = literal-expression / variable-expression / function-expression -literal-expression = "{" [s] literal [s function] *(s attribute) [s] "}" -variable-expression = "{" [s] variable [s function] *(s attribute) [s] "}" -function-expression = "{" [s] function *(s attribute) [s] "}" +literal-expression = "{" o literal [s function] *(s attribute) o "}" +variable-expression = "{" o variable [s function] *(s attribute) o "}" +function-expression = "{" o function *(s attribute) o "}" -markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone - / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close +markup = "{" o "#" identifier *(s option) *(s attribute) o ["/"] "}" ; open and standalone + / "{" o "/" identifier *(s option) *(s 
attribute) o "}" ; close ; Expression and literal parts function = ":" identifier *(s option) -option = identifier [s] "=" [s] (literal / variable) +option = identifier o "=" o (literal / variable) -attribute = "@" identifier [[s] "=" [s] (literal / variable)] +attribute = "@" identifier [o "=" o (literal / variable)] variable = "$" name @@ -52,13 +52,13 @@ match = %s".match" ; Names and identifiers ; identifier matches https://www.w3.org/TR/REC-xml-names/#NT-QName -; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName but excludes U+FFFD +; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName but excludes U+FFFD and U+061C identifier = [namespace ":"] name namespace = name -name = name-start *name-char +name = [bidi] name-start *name-char [bidi] name-start = ALPHA / "_" / %xC0-D6 / %xD8-F6 / %xF8-2FF - / %x370-37D / %x37F-1FFF / %x200C-200D + / %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF / %x3001-D7FF / %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF name-char = name-start / DIGIT / "-" / "." @@ -66,8 +66,8 @@ name-char = name-start / DIGIT / "-" / "." ; Restrictions on characters in various contexts simple-start-char = content-char / "@" / "|" -text-char = content-char / s / "." / "@" / "|" -quoted-char = content-char / s / "." / "@" / "{" / "}" +text-char = content-char / ws / "." / "@" / "|" +quoted-char = content-char / ws / "." 
/ "@" / "{" / "}" content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) / %x0B-0C ; omit CR (%x0D) / %x0E-1F ; omit SP (%x20) @@ -83,5 +83,15 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) escaped-char = backslash ( backslash / "{" / "|" / "}" ) backslash = %x5C ; U+005C REVERSE SOLIDUS "\" -; Whitespace -s = 1*( SP / HTAB / CR / LF / %x3000 ) +; Required whitespace +s = *bidi ws o + +; Optional whitespace +o = *(ws / bidi) + +; Bidirectional marks and isolates +; ALM / LRM / RLM / LRI, RLI, FSI & PDI +bidi = %x061C / %x200E / %x200F / %x2066-2069 + +; Whitespace characters +ws = SP / HTAB / CR / LF / %x3000 diff --git a/spec/syntax.md b/spec/syntax.md index aef6720684..ea55af8a06 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -134,17 +134,23 @@ A **_local variable_** is a _variable_ created as the result of a _lo > > An exception to this is: whitespace inside a _pattern_ is **always** significant. > [!NOTE] -> The syntax assumes that each _message_ will be displayed with a left-to-right display order +> The MessageFormat 2 syntax assumes that each _message_ will be displayed +> with a left-to-right display order > and be processed in the logical character order. -> The syntax also permits the use of right-to-left characters in _identifiers_, +> The syntax permits the use of right-to-left characters in _identifiers_, > _literals_, and other values. -> This can result in confusion when viewing the _message_. +> This can result in confusion when viewing the _message_, +> and users might incorrectly insert bidi controls or marks that negatively affect the output +> of the _message_. +> +> To assist with this, the syntax permits the use of various controls and +> strongly-directional markers in both optional and required _whitespace_ +> in a _message_, as well as encouraging the use of isolating controls +> with _expressions_ and _quoted patterns_. +> See: [whitespace](#whitespace) (below) for more information. 
> -> Additional restrictions or requirements, -> such as permitting the use of certain bidirectional control characters in the syntax, -> might be added during the Tech Preview to better manage bidirectional text. -> Feedback on the creation and management of _messages_ -> containing bidirectional tokens is strongly desired. +> Additional restrictions or requirements might be added during the +> Tech Preview to better manage bidirectional text. A _message_ can be a _simple message_ or it can be a _complex message_. @@ -160,7 +166,7 @@ Whitespace at the start or end of a _simple message_ is significant, and a part of the _text_ of the _message_. ```abnf -simple-message = [s] [simple-start pattern] +simple-message = o [simple-start pattern] simple-start = simple-start-char / escaped-char / placeholder ``` @@ -176,7 +182,7 @@ Whitespace at the start or end of a _complex message_ is not significant, and does not affect the processing of the _message_. ```abnf -complex-message = [s] *(declaration [s]) complex-body [s] +complex-message = o *(declaration o) complex-body o ``` ### Declarations @@ -193,8 +199,8 @@ A **_local-declaration_** binds a _variable_ to the resolved value of ```abnf declaration = input-declaration / local-declaration -input-declaration = input [s] variable-expression -local-declaration = local s variable [s] "=" [s] expression +input-declaration = input o variable-expression +local-declaration = local s variable o "=" o expression ``` _Variables_, once declared, MUST NOT be redeclared. @@ -254,7 +260,7 @@ A _quoted pattern_ starts with a sequence of two U+007B LEFT CURLY BRACKET `{{` and ends with a sequence of two U+007D RIGHT CURLY BRACKET `}}`. ```abnf -quoted-pattern = "{{" pattern "}}" +quoted-pattern = o "{{" pattern "}}" ``` A _quoted pattern_ MAY be empty. @@ -285,8 +291,8 @@ be preserved during formatting. ```abnf simple-start-char = content-char / "@" / "|" -text-char = content-char / s / "." 
/ "@" / "|" -quoted-char = content-char / s / "." / "@" / "{" / "}" +text-char = content-char / ws / "." / "@" / "|" +quoted-char = content-char / ws / "." / "@" / "{" / "}" content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) / %x0B-0C ; omit CR (%x0D) / %x0E-1F ; omit SP (%x20) @@ -352,7 +358,7 @@ otherwise, a corresponding _Data Model Error_ will be produced during processing _Literal_ _keys_ are compared by their contents, not their syntactical appearance. ```abnf -matcher = match-statement s variant *([s] variant) +matcher = match-statement s variant *(o variant) match-statement = match 1*(s selector) ``` @@ -425,7 +431,7 @@ Each _key_ is separated from each other by whitespace. Whitespace is permitted but not required between the last _key_ and the _quoted pattern_. ```abnf -variant = key *(s key) [s] quoted-pattern +variant = key *(s key) quoted-pattern key = literal / "*" ``` @@ -461,9 +467,9 @@ A **_function-expression_** contains a _function_ without an _operand expression = literal-expression / variable-expression / function-expression -literal-expression = "{" [s] literal [s function] *(s attribute) [s] "}" -variable-expression = "{" [s] variable [s function] *(s attribute) [s] "}" -function-expression = "{" [s] function *(s attribute) [s] "}" +literal-expression = "{" o literal [s function] *(s attribute) o "}" +variable-expression = "{" o variable [s function] *(s attribute) o "}" +function-expression = "{" o function *(s attribute) o "}" ``` There are several types of _expression_ that can appear in a _message_. @@ -549,7 +555,7 @@ and will produce a _Duplicate Option Name_ error during processing. The order of _options_ is not significant. ```abnf -option = identifier [s] "=" [s] (literal / variable) +option = identifier o "=" o (literal / variable) ``` > Examples of _functions_ with _options_ @@ -594,8 +600,8 @@ It MAY include _options_. is a _pattern_ part ending a span. 
```abnf -markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone - / "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close +markup = "{" o "#" identifier *(s option) *(s attribute) o ["/"] "}" ; open and standalone + / "{" o "/" identifier *(s option) *(s attribute) o "}" ; close ``` > A _message_ with one `button` markup span and a standalone `img` markup element: @@ -637,7 +643,7 @@ all but the last _attribute_ with the same _identifier_ are ignored. The order of _attributes_ is not otherwise significant. ```abnf -attribute = "@" identifier [[s] "=" [s] literal] +attribute = "@" identifier [o "=" o literal] ``` > Examples of _expressions_ and _markup_ with _attributes_: @@ -727,7 +733,12 @@ A **_name_** is a character sequence used in an _identifier_ or as the name for a _variable_ or the value of an _unquoted literal_. -_Variable_ names are prefixed with `$`. +A _name_ can be preceded or followed by bidirectional marks or isolating controls +to aid in presenting names that contain right-to-left or neutral characters. +These characters are **not** part of the value of the _name_ and MUST be treated as if they were not present +when matching _name_ or _identifier_ strings or _unquoted literal_ values. + +_Variable_ _names_ are prefixed with `$`. Valid content for _names_ is based on Namespaces in XML 1.0's [NCName](https://www.w3.org/TR/xml-names/#NT-NCName). @@ -763,14 +774,14 @@ in this release. 
```abnf variable = "$" name -option = identifier [s] "=" [s] (literal / variable) +option = identifier o "=" o (literal / variable) identifier = [namespace ":"] name namespace = name -name = name-start *name-char +name = [bidi] name-start *name-char [bidi] name-start = ALPHA / "_" / %xC0-D6 / %xD8-F6 / %xF8-2FF - / %x370-37D / %x37F-1FFF / %x200C-200D + / %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF / %x3001-D7FF / %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF name-char = name-start / DIGIT / "-" / "." @@ -803,24 +814,112 @@ and inside _patterns_ only escape `{` and `}`. ### Whitespace -**_Whitespace_** is defined as one or more of -U+0009 CHARACTER TABULATION (tab), -U+000A LINE FEED (new line), -U+000D CARRIAGE RETURN, -U+3000 IDEOGRAPHIC SPACE, -or U+0020 SPACE. +The syntax limits whitespace characters outside of a _pattern_ to the following: +`U+0009 CHARACTER TABULATION` (tab), +`U+000A LINE FEED` (new line), +`U+000D CARRIAGE RETURN`, +`U+3000 IDEOGRAPHIC SPACE`, +or `U+0020 SPACE`. Inside _patterns_ and _quoted literals_, whitespace is part of the content and is recorded and stored verbatim. Whitespace is not significant outside translatable text, except where required by the syntax. +There are two whitespace productions in the syntax. +**_Optional whitespace_** is whitespace that is not required by the syntax, +but which users might want to include to increase the readability of a _message_. +**_Required whitespace_** is whitespace that is required by the syntax. + +Both types of whitespace optionally permit the use of the bidirectional isolate controls +and certain strongly directional marks. 
+These can assist users in presenting _messages_ that contain right-to-left +text, _literals_, or _names_ (including those for _functions_, _options_, +_option values_, and _keys_). + +_Messages_ that contain right-to-left (aka RTL) characters SHOULD use one of the +following mechanisms to make messages display intelligibly in plain-text editors: + +1. Use paired isolating bidi controls `U+2066 LEFT-TO-RIGHT ISOLATE` ("LRI") + and `U+2069 POP DIRECTIONAL ISOLATE` ("PDI") as permitted by the ABNF around + parts of any _message_ containing RTL characters: + - _inside_ of _placeholder_ markers `{` and `}` + - _outside_ _quoted-pattern_ markers `{{` and `}}` + - _outside_ of _variable_, _function_, _markup_, or _attribute_, + including the identifying sigil (e.g. `$var` or `:ns:name`) +2. Use the 'local-effect' bidi marks + `U+061C ARABIC LETTER MARK`, `U+200E LEFT-TO-RIGHT MARK` or + `U+200F RIGHT-TO-LEFT MARK` as permitted by the ABNF before or after _identifiers_, + _names_, unquoted _literals_, or _option_ values, + especially when the values contain a mix of neutral, weakly directional, and + strongly directional characters. + +> [!IMPORTANT] +> Always take care **not** to add bidirectional controls or marks +> where they would be semantically significant +> or where they would unintentionally become part of the _message_'s output: +> - do not put them inside of a _literal_ except when they are part of the value, +> (instead put them outside of _literal_ quotes, such as `|...|`) +> - do not put them inside quoted _patterns_ except when they are part of the text, +> (instead put them outside of quoted _patterns_, such as `{{...}}`) +> - do not put them outside _placeholders_, +> (instead put them inside the _placeholder_, such as `{$foo :number}`) +> +> Controls placed inside _literal_ quotes or quoted _patterns_ are part of the _literal_ +> or _pattern_. +> Controls in a _pattern_ will appear in the output of the message. 
+> Controls inside _literal_ quotes are part of the _literal_ and +> will be considered in operations such as matching a _key_ to a _selector_. + +> [!NOTE] +> Users cannot be expected to create or manage bidirectional controls or +> marks in _messages_, since the characters are invisible and can be difficult +> to manage. +> Tools (such as resource editors or translation editors) +> and other implementations of MessageFormat 2 serialization are strongly +> encouraged to provide paired isolates around any right-to-left +> syntax as described above so that _messages_ display appropriately as plain text. + +These definitions of _whitespace_ implement +[UAX#31 Requirement R3a-2](https://www.unicode.org/reports/tr31/#R3a-2). +It is a profile of R3a-1 in that specification because: +- The following pattern whitespace characters are not allowed: + `U+000B VERTICAL TABULATION`, + `U+000C FORM FEED`, + `U+0085 NEXT LINE`, + `U+2028 LINE SEPARATOR` and + `U+2029 PARAGRAPH SEPARATOR`. +- The character `U+3000 IDEOGRAPHIC SPACE` + _is_ interpreted as whitespace. +- The following directional marks and isolates + are treated as ignorable format controls: + `U+061C ARABIC LETTER MARK`, + `U+200E LEFT-TO-RIGHT MARK`, + `U+200F RIGHT-TO-LEFT MARK`, + `U+2066 LEFT-TO-RIGHT ISOLATE`, + `U+2067 RIGHT-TO-LEFT ISOLATE`, + `U+2068 FIRST STRONG ISOLATE`, + and `U+2069 POP DIRECTIONAL ISOLATE`. + (The character `U+061C` is an addition according to R3a.) + + > [!NOTE] > The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for > compatibility with certain East Asian keyboards and input methods, > in which users might accidentally create these characters in a _message_. 
```abnf -s = 1*( SP / HTAB / CR / LF / %x3000 ) +; Required whitespace +s = *bidi ws o + +; Optional whitespace +o = *(ws / bidi) + +; Bidirectional marks and isolates +; ALM / LRM / RLM / LRI, RLI, FSI & PDI +bidi = %x061C / %x200E / %x200F / %x2066-2069 + +; Whitespace characters +ws = SP / HTAB / CR / LF / %x3000 ``` ## Complete ABNF