System Guide
This is the guide to the Lezer parser system. It provides a prose description of the system's functionality. For the item-by-item documentation of its interface, see the reference manual.
Overview
Lezer is a parser system written in JavaScript. Given a formal description of a grammar, it can produce a set of parse tables. Such tables provide a description that the parser system can use to efficiently construct a syntax tree for a given piece of text, describing the structure of the text in terms of the grammar.
The tables are generated by the @lezer/generator tool, a build tool that takes a file in the format described later in this guide, and outputs a big, largely unreadable blob of JavaScript that represents the parse tables. This is something that happens offline, as part of the build process for a grammar package.
The @lezer/lr package provides the run-time parsing system. Combined with a parser built by the generator, it gives you a parser object that can take a source file and return a tree.
These trees, represented by data structures from the @lezer/common package, are more limited than the abstract syntax trees you might have seen in other contexts. They are not very abstract. Each node only stores a type, a start and end position, and a flat sequence of children. When writing a grammar, you choose which productions are stored as nodes—the others don't appear in the tree at all.
This means that the tree is very compact in memory and cheap to build. It does make doing refined analysis on it somewhat awkward. The use case guiding the design of this library is an editor system, which keeps a syntax tree of the edited document, and uses it for things like syntax highlighting and smart indentation.
To support this use case the parser has a few other interesting properties. It can be used incrementally, meaning it can efficiently re-parse a document that is slightly changed compared to some previous version given the parse for the old version. And it has error recovery built in, meaning that even if the input doesn't adhere to the grammar, it can still produce some reasonable syntax tree for it.
This system's approach is heavily influenced by tree-sitter, a similar system written in C and Rust, and several papers by Tim Wagner and Susan Graham on incremental parsing (1, 2). It exists as a different system because it has different priorities than tree-sitter—as part of a JavaScript system, it is written in JavaScript, with relatively small library and parser table size. It also generates more compact in-memory trees, to avoid putting too much pressure on the user's machine.
Parser Algorithm
Lezer is based on LR parsing, an algorithm invented by Donald Knuth in 1965, which by pre-analyzing the grammar can derive fully deterministic (and thus efficient) parsers for some grammars.
Roughly, this algorithm abstractly interprets the grammar, recording the various states the parser can be in, and creating a table for each state mapping terminal symbols (tokens) to the action that should be taken when the parser sees that token in this state. If there's more than one action to take for a given token in a given state, the grammar cannot be parsed with this algorithm. Such problems are usually called “shift/reduce” or “reduce/reduce” conflicts. More about that in a moment.
When writing grammars for LR-based tools, it can help to have a rough feel for this algorithm. The Wikipedia article linked above is a good introduction. For a more in-depth treatment, I recommend Chapter 9 of this book (PDF).
Ambiguity
A lot of grammars are either impossible or extremely awkward to
represent as a plain LR grammar. If an element's syntactic role only
becomes clear later in the parse (for example JavaScript's (x = 1) + 1 versus (x = 1) => x, where (x = 1) can either be an expression
or a parameter list), plain LR often isn't very practical.
GLR is an extension of the parsing algorithm that allows a parse to ‘split’ at an ambiguous point by applying more than one action for a given token. The alternative parses then continue beside each other. When a parse can't make any more progress, it is dropped. As long as at least one parse continues, we can get our syntax tree.
GLR can be extremely helpful when dealing with local ambiguities, but when applied naively it can easily lead to an explosion of concurrent parses and, when the grammar is actually ambiguous, multiple parses continuing indefinitely so that you're in effect parsing the document multiple times at once. This completely ruins the properties that made LR parsing useful: predictability and efficiency.
Lezer allows GLR parsing but requires you to explicitly annotate the places in your grammar where it is allowed to happen, so that you can use it to solve problems that were otherwise difficult, but it won't accidentally start happening all over your grammar.
The parse state splitting is optimized in Lezer, so that, though it is more expensive than just doing a linear parse, you can have ambiguities in commonly encountered constructs in your language and still have a fast parser.
Error Recovery
Though the parser has a strict mode,
by default it'll proceed through any text, no matter how badly it fits
the grammar, and come up with a tree at the end.
To do this it uses the GLR mechanism to try various recovery tricks (ignoring the current token or skipping ahead to a place that matches the current token) alongside each other to see which one, a few tokens later, gets the best result.
Ignored tokens are added to the tree, wrapped in an error node. Similarly, the place where the parser skipped ahead is marked with an error node.
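In code, strict mode is enabled through the parser's configure method. A minimal sketch, assuming the generated parser is exported from a module called "./lang.js":

import {parser} from "./lang.js"

// Derive a parser that throws on invalid input instead of recovering.
const strictParser = parser.configure({strict: true})
try {
  strictParser.parse("(1 +") // malformed input
} catch (e) {
  console.log("parse error:", e.message)
}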
Incremental Parsing
In order to avoid re-doing work, the parser allows you to provide a cache of tree fragments, which hold information about trees produced by previous parses, annotated with information about the document changes that happened in the meantime. The parser will, when possible, reuse nodes from this cache rather than re-parsing the parts of the document they cover.
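In code, such a cache is a list of tree fragments. A sketch, assuming a generated parser module "./lang.js" and a single edit that inserted five characters at offset 100 (oldDoc and newDoc stand for the document text before and after that edit):

import {TreeFragment} from "@lezer/common"
import {parser} from "./lang.js"

let tree = parser.parse(oldDoc)            // initial full parse
let fragments = TreeFragment.addTree(tree) // seed the fragment cache

// Map the cache across the edit, then parse the new text with it.
fragments = TreeFragment.applyChanges(fragments, [
  {fromA: 100, toA: 100, fromB: 100, toB: 105}
])
let newTree = parser.parse(newDoc, fragments) // reuses unchanged nodes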
Because the syntax tree represents sequences of matches of repeat
operators (specified in the grammar notation with + and *) as
balanced sub-trees, the cost of re-matching unchanged parts of the
document is low, and you can quickly create a new tree even for a huge
document.
This isn't bulletproof, though—even a tiny document change, if it changes the meaning of the stuff that comes after it, can require a big part of the document to be re-parsed. An example would be adding or removing a block comment opening marker.
Contextual Tokenization
In classical parsing techniques there is a strict separation between the tokenizer, the thing that splits the input into a sequence of atomic tokens, and the parser.
This separation can be problematic, though. Sometimes, the meaning of a piece of text depends on context, such as with the ambiguity between JavaScript's regular expression notation and its division operator syntax. At other times, a sub-language used in the grammar (say, the content of a string) has a different concept of what a token is than the rest of the grammar.
Lezer supports contextual token reading. It will allow the tokens you declare to overlap (to match the same input) as long as such tokens can't occur in the same place anywhere in the grammar.
You can also define external tokenizers, which cause the parser to call out to your code to read a token. Such tokenizers will, again, only be called when the tokens they produce apply at the current position.
Even white space, the type of tokens implicitly skipped by the parser, is contextual, and you can have different rules skip different things.
Writing a Grammar
Lezer's parser generator defines its own notation for grammars. You can take a look at the JavaScript grammar to see an example.
A grammar is a collection of rules, which define terms. Terms can
be either tokens, in which case they directly match a piece of input
text, or nonterminals, which match expressions made up of other
terms. Both tokens and nonterminals are defined using similar syntax,
but they are explicitly distinguished (tokens must appear in a
@tokens block), and tokens are more limited in what they can
match—for example, they can't contain arbitrary recursion.
Formally, tokens must match a regular language (roughly the thing that basic regular expressions can match), whereas nonterminals can express a context-free language.
Here is an example of a simple grammar:
@top Program { expression }
expression { Name | Number | BinaryExpression }
BinaryExpression { "(" expression ("+" | "-") expression ")" }
@tokens {
Name { @asciiLetter+ }
Number { @digit+ }
}
This would match strings like (a+1) or (100-(2+4)). It doesn't allow white space between tokens, and the parentheses around binary expressions are required (without them, the grammar would be ambiguous and the generator would complain).
@top defines the entry point to the grammar. This is the rule that
will be used to match the entire input. It'll usually contain
something that repeats, like statement+.
You'll notice that the example has some terms starting with a lowercase letter, and some that are capitalized. This difference is significant. Capitalized rules will show up as nodes in the syntax tree produced by the parser, lowercase rules will not.
(If you are writing rule names in a script that doesn't have case, you can use an underscore at the start of a name to indicate that the rule should not be in the tree.)
Operators
The grammar notation supports the repetition operators * (any number
of repetitions of the thing before), + (one or more repetitions),
and ? (optional, zero or one repetitions). These have high
precedence and only apply to the element directly in front of them.
The pipe | character is used to represent choice, matching either of
the expressions on its sides. It can, of course, be repeated to
express more than two choices (x | y | z). Choice has the lowest
precedence of any operator, and, if there are no parentheses present,
will apply to the entire expressions to its left and right.
Choice in context-free grammars is
commutative,
meaning that a | b is exactly equivalent to b | a.
Parentheses can be used for grouping, as in x (y | z)+.
Sequences of things are expressed by simply putting them next to each
other. a b means a followed by b.
Tokens
Named tokens are defined in a @tokens block. You can also, outside
of the tokens block, use string literals like "+" as tokens. These
automatically define a new token. String literals use the same
escaping rules as JavaScript strings.
String literals inside of token rules work differently. They can be combined with other expressions to form a bigger token. All token rules (and literal tokens) are compiled to a deterministic finite automaton, which can then be used to efficiently match them against a text stream.
So an expression like "a" "b" "c" in a nonterminal rule is a
sequence of three tokens. In a token rule, it is exactly equivalent to
the string "abc".
You can express character sets using set notation, somewhat similar to
square bracket notation in regular expressions. $[.,] means either a
period or a comma (there's no special meaning associated with . in
this notation, or with escapes like \s). $[a-z] matches a, z,
and any character that, in the Unicode character code ordering, comes
between them. To create an inverted character set, matching only
characters not mentioned in the set, you write an exclamation mark
rather than a dollar sign before the brackets. So ![x] matches any
character that is not x.
The parser generator defines a few built-in character sets which can
be accessed with @ notation:
- @asciiLetter matches $[a-zA-Z]
- @asciiLowercase matches $[a-z]
- @asciiUppercase is equivalent to $[A-Z]
- @digit matches $[0-9]
- @whitespace matches any character the Unicode standard defines as whitespace
- @eof matches the end of the input
In addition, an underscore _ matches any character.
Token rules cannot refer to nonterminal rules. But they can refer to
each other, as long as the references don't form a non-tail recursive
loop. I.e. a rule x cannot, directly or indirectly, include a
reference to x, unless that reference appears at the very end of the
rule.
Skip Expressions
Our initial example does not allow any whitespace between the tokens. Almost all real languages define some kind of special tokens, usually covering spacing and comments, that may occur between the other tokens and are ignored when matching the grammar.
To support whitespace, you must add a @skip rule to your grammar.
@skip { space | Comment }
You could define the space and Comment tokens like this:
@tokens {
space { @whitespace+ }
Comment { "//" ![\n]* }
// ...
}
A skip rule will be matched zero or more times between other tokens. So the rule above also handles a long sequence of comments and whitespace.
Skipped tokens may be capitalized (as Comment is), in which case
they will appear in the syntax tree. It is allowed for a skip
expression to match things more complicated than single tokens.
When your grammar needs more than one kind of whitespace, for example when your strings aren't plain tokens but need their own internal structure, but you don't want to match whitespace and comments between the pieces of string, you can create a skip block like this:
@skip {} {
String {
stringOpen (stringEscape | stringContent)* stringClose
}
}
The initial braces contain the skip expression—in this case we want to skip nothing so it is empty—and the second pair of braces contain rules inside of which this skip expression is used.
Parse states can only have a single set of skip expressions associated with them (otherwise it would be unclear what to skip when in that state). This means that a rule with custom skip expressions like above, if used in another skip context, must be clearly delimited on both sides. It cannot, for example, have an optional or repeated term at the end, because it would be unclear what to skip at the point where that term may or may not follow.
Precedence
Let's go back to the binary operator rule in the example. If we define it like this, removing the parentheses, we get an error message.
BinaryExpression { expression ("+" | "-") expression }
The error says "shift/reduce conflict" at expression "+" expression · "+". The · indicates the parse position, and the parser is telling
us that, after reading what looks like a binary expression and seeing
another operator, it doesn't know whether to first reduce the initial
tokens to a BinaryExpression or to first shift the second operator
and leave the first for later. Basically, it doesn't know whether to
parse x + y + z as (x + y) + z or x + (y + z).
This kind of issue can be solved without resorting to GLR parsing, fortunately. Doing so involves what is basically a crude hack on top of the LR parser generator algorithm: specify precedence to resolve these conflicts, so that when the parser generator runs into the ambiguity, we tell it to take one parse and discard the other.
This can be used not only for associativity (the error above), but
also for operator precedence (giving groupings involving *
precedence over groupings involving +, for example) and various
other issues.
The way to specify precedence in Lezer is with, first, a @precedence
block that enumerates the precedence names you are going to use in the
grammar, in order of precedence (highest first), and then inserting
precedence markers at ambiguous positions.
@precedence { times @left, plus @left }
@top Program { expression }
expression { Number | BinaryExpression }
BinaryExpression {
expression !times "*" expression |
expression !plus "+" expression
}
@tokens {
Number { @digit+ }
}
This tells the parser generator that the !times precedence marker
has a higher precedence than !plus, and that both are left
associative (preferring (a + b) + c to a + (b + c)). You can also
specify @right after the precedence name to make it right
associative, or leave off the associativity to not specify any (in which
case the precedence won't resolve shift/reduce conflicts with itself).
The ! markers have to be inserted at the point where the conflict
occurs. In this case, the conflict happens directly in front of the
operators, so that's where the markers are placed.
This grammar can now correctly parse things like 1+2*3+4 into nodes
that group the operators by precedence, as in (1+(2*3))+4.
It is also possible, instead of specifying an associativity for a
given precedence, to make it a cut operator by using the keyword
@cut. A cut operator will override other interpretations even though
no conflict was detected yet. An example of where this is appropriate
is the way JavaScript's function keyword has a different meaning in
statement position. A statement can start with an expression, and an
expression can be a function expression, but when we see the
function token at the start of a statement, we only want to enter
the function declaration statement rule, not the function expression
rule.
@precedence { decl @cut }
@top Program { statement+ }
statement { FunctionDeclaration | FunctionExpression }
FunctionExpression { "function" "..." }
FunctionDeclaration { !decl "function" "..." }
This will parse function... as a FunctionDeclaration, though it
would also match FunctionExpression (which could be used elsewhere
in the grammar, if it was a real grammar).
Allowing Ambiguity
Still, some things cannot be resolved with precedence declarations. To explicitly allow Lezer to try multiple actions at a given point, you can use ambiguity markers.
@top Program { (GoodStatement | BadStatement)+ }
GoodStatement { "(" GoodValue ")" ";" }
GoodValue { "val" ~statement }
BadStatement { "(" BadValue ")" "!" }
BadValue { "val" ~statement }
This grammar is entirely nonsensical, but it shows the problem: in
order to know whether to match GoodValue or BadValue, the parser
has to decide whether it is parsing a GoodStatement or a
BadStatement. But at the time where it has to do that, the next
token is a closing parenthesis, which doesn't yet tell it what it
needs to know.
The parser generator will complain about that (it is a “reduce/reduce”
conflict) unless we add the ~statement ambiguity markers at the
point where the reductions happen. The word (statement) doesn't have
any special meaning, beyond that it has to be the same for all the
conflicting positions, so that the parser generator knows that we are
explicitly annotating these two positions as ambiguous.
With the markers, the grammar compiles, and whenever the parser gets to the ambiguous position, it'll split its parse state, try both approaches, and then drop off one path when it sees that it fails to match on the next token.
In cases where the grammar is truly ambiguous—which means that multiple parses can continue to reach the same state or the end of the input—it becomes hard to predict which parse Lezer will pick as the final parse. To tip the scales in favor of a given variant, it is possible to add dynamic precedence for rules. This is done using prop notation (which we'll get back to later).
@top Program { (A | B)+ }
A[@dynamicPrecedence=1] { "!" ~ambig }
B { "!" ~ambig }
This will parse exclamation points as A, though they also match B,
because that rule has a higher dynamic precedence. Such a precedence
can have a value from -10 to 10. Negative numbers penalize branches
that include this rule, positive numbers give it a bonus.
Template Rules
It is possible to define parameterized rules, which can help avoid repetitive grammar rules. Such rules act as templates—they are copied for each set of parameters they are given.
Lezer uses angle brackets around parameters and arguments. You can define a parameterized rule like this:
commaSep<content> { "" | content ("," content)* }
(The empty string literal is treated as the empty sequence, or ε, matching nothing.)
After having defined the above rule, you can do commaSep<expression>
to match a comma-separated list of expressions.
When you pass arguments to a rule defined in the @tokens block,
those arguments will be interpreted as if they are part of a token
rule, even when you call it from a nonterminal rule.
Token Precedence
By default, tokens are only allowed to overlap (match some prefix of
each other) when they do not occur in the same place of the grammar or
they are both simple tokens, defined without use of repeat
operators. I.e. you can have tokens "+" and "+=", but not tokens
"+" and "+" "="+.
To explicitly specify that, in such a case, one of the tokens takes
precedence, you can add @precedence declarations to your @tokens
block.
@tokens {
Divide { "/" }
Comment { "//" ![\n]* }
@precedence { Comment, Divide }
}
By default, since Divide is a prefix of Comment, these would be
considered overlapping. The @precedence declaration states that
Comment overrides Divide when both match. You can have multiple
@precedence declarations in your @tokens block, and each
declaration can contain any number of tokens, breaking ties between
those tokens.
Token Specialization
You could try to handle things like keywords, which overlap with
identifiers, by specifying that they take precedence. But this is
awkward, because if you define a keyword as matching "new" and
having a higher precedence than identifiers, the input newest would
match as first the keyword "new" and then the identifier "est".
In addition, having dozens of keywords in your tokenizer automaton greatly increases its size and complexity. What many hand-written tokenizers do is first match an identifier, and then check whether its content corresponds to any keyword.
This is supported in Lezer with the @specialize operator. This
operator declares a new token given a base token and a string literal.
When specializations have been defined for a given token type, its
specialize table will be consulted every time it is read, and if its
content matches a specialization, it is replaced by the specialized
token.
NewExpression { @specialize<identifier, "new"> expression }
This rule will match things like new Array.
There is another operator similar to @specialize, called @extend.
Whereas specialized tokens replace the original token, extended
tokens allow both meanings to take effect, implicitly enabling GLR
when both apply. This can be useful for contextual keywords where it
isn't clear whether they should be treated as an identifier or a
keyword until a few tokens later.
External Tokens
The regular way to define tokens is usually enough, and the most convenient. But regular languages are notoriously limited in what they can match, so some things, like the multiple-character look-ahead needed to recognize the end of a comment, are awkward to express inside the grammar. And something like JavaScript's automatic semicolon insertion, which requires checking the content of preceding whitespace and looking at the character ahead, can't be expressed in the grammar at all.
Thus, Lezer allows you to declare some tokens as being external, read by pieces of JavaScript code that you provide. An external tokenizer might look like this:
@external tokens insertSemicolon from "./tokens" { insertedSemicolon }
This tells the parser generator that it should import the
insertSemicolon symbol from "./tokens", and use that as a
tokenizer which might yield the insertedSemicolon token.
Such a tokenizer should be an
ExternalTokenizer instance. It will
only be called when the current state allows one of the tokens it
defines to be matched.
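The module it is imported from might look roughly like this. This is only a sketch: the terms file name and the conditions for inserting a semicolon are simplified assumptions.

import {ExternalTokenizer} from "@lezer/lr"
import {insertedSemicolon} from "./lang.terms.js"

const newline = 10, braceR = 125 // "\n" and "}"

export const insertSemicolon = new ExternalTokenizer((input, stack) => {
  // Emit a zero-length token when the next character suggests the
  // statement has ended. A real tokenizer would check much more.
  if (input.next == newline || input.next == braceR || input.next < 0)
    input.acceptToken(insertedSemicolon)
}, {contextual: true, fallback: true})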
The order in which @external tokens declarations and the @tokens
block appear in the grammar file is significant. Those that appear
earlier will take precedence over those that appear later, meaning
that if they return a token, the others will not be tried.
If you want to make sure that a given group of external tokens is
never used together with some specific other token, because they
conflict, you can put a @conflict { tokenName } declaration in the
external token block.
It is also possible to define external
specialization logic. With a directive like
this, the given function (specializeIdent) will be called every time
an identifier token is read, and can return either a replacement
token or -1 to indicate that it doesn't specialize that value.
@external specialize {identifier} specializeIdent from "./tokens" {
keyword1,
keyword2
}
The tokens listed between the set of braces at the end provide the
tokens that the specializer might return. You can also write extend
instead of specialize to make this an extending specializer, where
both the original token and the specialized one are tried.
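The specializer itself is a plain function that receives the token's text and the current stack, and returns a term ID or -1. A sketch, again assuming a generated terms file and hypothetical keyword strings:

import {keyword1, keyword2} from "./lang.terms.js"

export function specializeIdent(value, stack) {
  if (value == "one") return keyword1 // hypothetical keyword strings
  if (value == "two") return keyword2
  return -1 // not special: keep the plain identifier token
}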
Context
Sometimes, for example when creating “indent” and “dedent” tokens in a Python parser, the parse needs to keep some state (such as the current indentation), and access that in external tokenizers.
In Lezer, this is done with a context. This is a value that is kept
alongside the parse, and updated for actions that the parser takes. It
is written in external code as a context
tracker, which is an object describing how to
create and update the value. A @context declaration in the grammar
enables context tracking.
@context trackIndent from "./helpers.js"
If the context is relevant for the validity of node reuse in
incremental parsing (for example, you can't reuse a Python block if the
surrounding indentation is different), the tracker should define a
hash function that
creates a numeric hash for a context value. These hashes are stored in
the tree nodes and must match the current context for the node to be
reused.
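A context tracker module might look roughly like this. It is a deliberately thin sketch: a real indentation tracker would compute new values when tokens are shifted, which is elided here.

import {ContextTracker} from "@lezer/lr"

export const trackIndent = new ContextTracker({
  start: 0, // the initial context value: indentation depth 0
  shift(context, term, stack, input) {
    return context // a real tracker would update the indentation here
  },
  // Stored with reused nodes; must match the current context for reuse.
  hash(context) { return context }
})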
Local Token Groups
Sometimes you want to define a token type to cover all text not matched by other tokens. An example would be the content of a string, where anything not matching an escape, an interpolation, or the closing quote should be treated as a content token. In simple cases, you can write a normal Lezer token for this, but a strictly regular language is rather poorly suited for, and in some cases entirely incapable of, expressing things like “match anything up to any of these tokens”.
Local token groups let you define a set of tokens that can be used in a given context (for example the content of a string or comment), along with a fallback token that should be used for anything else.
@local tokens {
stringEnd[@name='"'] { '"' }
StringEscape { "\\" _ }
@else stringContent
}
@skip {} {
String { '"' (stringContent | stringEscape)* stringEnd }
}
Such a local tokenizer only works if no other tokens (no skip
tokens, no literal tokens, and no tokens defined in other tokenizer
blocks) occur in the parse states where it is used. Hence you'll
almost always need a @skip {} block around the rules that use them,
and may have to redefine tokens that already exist (like '"' in the
example) in your local token block.
It is possible to reference rules defined in other token blocks from within your local token definitions, to avoid duplicating the notation for complex tokens that are also used in other contexts.
Inline Rules
Rules that are only going to be used once may be written entirely inside the rule that uses them. This is mostly useful to declare node names for sub-expressions. For example this definition of a statement rule (which itself does not create a tree node) gives names to each of its choices.
statement {
ReturnStatement { Return expression } |
IfStatement { If expression statement } |
ExpressionStatement { expression ";" }
}
Node Props
It is possible to add node props, extra metadata associated with tree nodes, directly in a grammar file. To do so, you write square brackets after the rule's name.
StartTag[closedBy="EndTag"] { "<" }
This will add the NodeProp.closedBy prop
to this node type, which provides information about the delimiters
wrapping a node. Values that consist of only letters can be written
without quotes; values like the one above must be quoted.
Most of the props defined as static properties of
NodeProp are available by default. You can also
import custom props using syntax like this:
@external prop myProp from "./props"
SomeRule[myProp=somevalue] { "ok" }
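The imported prop is an ordinary NodeProp value. A sketch of what "./props" might export:

import {NodeProp} from "@lezer/common"

// deserialize tells the parser generator how to turn the string written
// in the grammar ("somevalue") into the prop's value.
export const myProp = new NodeProp({deserialize: value => value})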
You may write as otherName after the prop name to rename it locally.
Inline rules, external token names, and @specialize/@extend
operators also allow square bracket prop notation to add props to the
nodes they define.
Pseudo-props, which are names that start with @, when they occur
between square brackets, are used to activate and configure various
types of additional functionality.
For example, @name can be used to set the node's name. This rule
would produce a node named x despite being itself called y. When a
name is explicitly specified like this, the node will be included in
the tree regardless of whether it is capitalized.
y[@name=x] { "y" }
Sometimes it is useful to insert the value of a rule parameter into a prop. You can use curly braces inside a prop value to splice in the content of an identifier or literal argument expression. This keyword helper rule produces a specialized token named after the keyword:
kw<word> { @specialize[@name={word}]<identifier, word> }
If you put a @detectDelim directive at the top level of your grammar
file, the parser generator will automatically detect rule delimiter
tokens, and create openedBy and
closedBy props for rules where it
finds them.
Literal Token Declarations
Literal tokens will, by default, not create a node. But it is possible to explicitly list those that you want in the tree in the tokens block, optionally with props.
@tokens {
"("
")"
"#"[someProp=ok]
// ... other token rules
}
Dialects
Sometimes it makes sense to define several similar languages in a single grammar, so that they can share code and be loaded as a single serialized parser. When doing that, you need a way to make parts of the grammar conditional, so that they can be turned on or off depending on which language you are parsing.
Lezer has a dialect feature that is meant to help with this. It allows you to make some tokens conditional on the dialect that's selected. It is used like this:
@dialects { comments }
@top Document { Word+ }
@skip { space | Comment }
@tokens {
Comment[@dialect=comments] { "//" ![\n]* }
Word { @asciiLetter+ }
}
A @dialects declaration provides the list of dialects supported by
the grammar. Individual tokens (as well as tokens produced by
specializing other tokens) can be annotated with a @dialect
pseudo-prop to indicate that they can only occur when that dialect is
active. Multiple dialects may be active at once.
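Which dialects are active is chosen when configuring the parser at run time. A small sketch, assuming the grammar above was built into "./lang.js":

import {parser} from "./lang.js"

// Enable the "comments" dialect for this parser instance.
const withComments = parser.configure({dialect: "comments"})
let tree = withComments.parse("hello // a comment")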
External tokenizers and specializers can access
the active dialects through the
Stack.dialectEnabled method, using
the dialect IDs exported from the terms file as
Dialect_[name] (for example
stack.dialectEnabled(Dialect_comments)), and base their decision on
whether to return a given token on that. This can also be useful when
you need to perform more complicated tests against dialects (such as
only returning a token when a dialect is not active, or when multiple
dialects are active).
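For example, an external tokenizer might consult the dialect like this (a sketch; the Comment token and the terms file name are assumptions):

import {ExternalTokenizer} from "@lezer/lr"
import {Comment, Dialect_comments} from "./lang.terms.js"

const slash = 47, newline = 10

export const comments = new ExternalTokenizer((input, stack) => {
  if (!stack.dialectEnabled(Dialect_comments)) return
  if (input.next == slash && input.peek(1) == slash) {
    // Consume everything up to the end of the line as one Comment token.
    while (input.next >= 0 && input.next != newline) input.advance()
    input.acceptToken(Comment)
  }
})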
Grammar Nesting
In some cases, such as JavaScript embedded in HTML or code snippets inside a literate programming notation, you want to make another parser responsible for parsing pieces of a document.
Lezer implements this as a post-processing pass on the tree. The
parseMixed utility scans the tree (or the
newly parsed parts of the tree in case of an incremental parse) for
nodes that should have other languages embedded in them.
If ranges that use another language are found, the appropriate parser
is run over them, and the resulting trees are attached to the main
tree using NodeProp.mounted. There are
two ways to mount subtrees:
- Regular mounts just replace the original node with the root of the inner tree when iterating through the outer tree. This is the easiest to work with, and usually preferable when the languages are strictly nested. For example, the content of a <script> tag in an HTML tree could be replaced with a JavaScript tree.
- Overlays replace just parts of the node with content from another tree. This is useful when the nesting isn't strictly hierarchical, such as between a templating language and its output language, where both have structure, but that structure may overlap in weird ways (<p>...<?if x?></p><p>...<?/if?></p>). When iterating through a tree normally, overlays are ignored. But you can enter them explicitly with the enter method.
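In code, this pass is set up by wrapping the outer parser. A sketch, assuming a hypothetical HTML-like grammar with a ScriptText node whose content should be parsed as JavaScript:

import {parseMixed} from "@lezer/common"
import {parser as htmlParser} from "./html-grammar.js"
import {parser as jsParser} from "@lezer/javascript"

const mixedParser = htmlParser.configure({
  wrap: parseMixed(node => {
    // Regular mount: parse the content of ScriptText nodes as JavaScript.
    // Returning null leaves other nodes alone.
    return node.name == "ScriptText" ? {parser: jsParser} : null
  })
})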
Building a Grammar
The @lezer/generator package comes with a command-line
tool lezer-generator that converts a grammar
file into a JavaScript module that can be loaded
to get access to the parser object.
(This functionality is also available through a programming interface.)
For example, we could run the tool as
lezer-generator lang.grammar -o lang.js
If there are no problems with the grammar specified in lang.grammar,
this will write two files, lang.js and lang.terms.js.
The first is the main grammar file. It depends on @lezer/lr
and on any files specified in external token or
other external declarations. It exports a binding parser that holds
an LRParser instance.
The terms file contains, for every external token and every rule that
either is a tree node and doesn't have parameters, or has the
@export pseudo-prop, an export with the same name as the term that
holds the term's ID. When you define an external tokenizer, you'll
usually need to import the token IDs for the tokens you want to
recognize from this file. (Since there'll be quite a lot of
definitions in the file for regular-sized grammars, you might want to
use a
tree-shaking
bundler to remove the ones you do not need.)
These are the command-line options supported by lezer-generator:
- --output or -o can be used to specify the place where the tool should write its output. By default, it will write to its standard output stream, but since it generates two files, that will mean you don't get a term definition file.
- --cjs can be used to make the tool output CommonJS modules (the default is to output ES modules).
- --typeScript makes the tool emit a TypeScript file instead of plain JavaScript.
- --names can be given to make the tool include term names in the output, to help debugging. This adds quite a lot of bytes to the output file, and should probably not be used in production.
- --export can be used to change the name of the exported binding that holds the LRParser instance. It defaults to parser.
- --noTerms, when given, tells the tool to not write a terms file, even though an output path was given.
Running the Parser
The simplest way to parse a file is to call
parse on the parser you import from
the generated file. This will always succeed and return a
Tree instance.
Sometimes you want to take advantage of the fact that LR parsing
happens in discrete steps, and have control over when the steps are
performed. You can create a parse object,
representing an in-progress parse, with the
startParse method. You then repeatedly
call advance to perform the next
action, until you decide you have parsed enough or the method returns
a tree to indicate that it has finished parsing.
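A sketch of such a loop, using the parser generated from the example grammar earlier (the module name is an assumption):

import {parser} from "./lang.js"

let parse = parser.startParse("(100-(2+4))")
let tree
while (!(tree = parse.advance())) {
  // advance() does a bit of work and returns null until the parse is
  // done. You could inspect parse.parsedPos here and, for example, stop
  // early with parse.stopAt(...) if the input is very large.
}
console.log(tree.toString()) // something like Program(BinaryExpression(...))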
Such a parse context holds a collection of Stack
instances, each of which represents an actual single parse state. When
the parse is unambiguous, there'll be one stack. When it is trying out
multiple options in parallel, there'll be more than one. You don't
usually need to concern yourself with these, but an external
tokenizer gets access to a stack and can
query it to, for example, ask if the current state accepts a given
token.
Parsing consumes an Input, which abstracts access
to a string-like data structure. You may simply pass a string to
parse, in which case the library will do
the wrapping itself, but if you have your data in some kind of data
structure more involved than a string, you'll need to write your own
class that implements Input, so that Lezer can
read directly from your data structure, without copying its content
into a flat string.
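Such a class only needs a handful of members. A sketch of an Input over an array of string chunks (a stand-in for whatever structure your data actually lives in):

class ChunkedInput {
  constructor(chunks) {
    this.chunks = chunks
    this.length = chunks.reduce((len, ch) => len + ch.length, 0)
    this.lineChunks = false // chunks are arbitrary, not one-per-line
  }
  // Return the text from `from` to the end of the chunk containing it.
  chunk(from) {
    let pos = 0
    for (let ch of this.chunks) {
      if (from < pos + ch.length) return ch.slice(from - pos)
      pos += ch.length
    }
    return ""
  }
  read(from, to) {
    let result = "", pos = 0
    for (let ch of this.chunks) {
      if (pos >= to) break
      if (pos + ch.length > from)
        result += ch.slice(Math.max(0, from - pos), to - pos)
      pos += ch.length
    }
    return result
  }
}

// Usage: parser.parse(new ChunkedInput(["(1+", "2)"]))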
External tokenizers get access to a wrapper around the input, and can call a method to confirm a token.
Working with Trees
Syntax trees produced by Lezer are stored in a format that's optimized for compactness and efficient reuse during incremental parsing. Using this structure directly for other purposes can be very awkward.
Firstly, the tree contains nodes that don't correspond to named rules
in the grammar, but indicate repetitions from + or * operators.
These are essential for incremental parsing, but usually uninteresting
in any other context.
Secondly, chunks of small nodes are stored together in compact arrays of 16-bit numbers, with each node only encoding its type id, start position, end position, and the index at which its children end. This helps save memory, but it is not a format you want to be directly interacting with.
That is why syntax trees provide two abstractions to help you inspect
their structure: SyntaxNode, which is an object
providing convenient access to a given node, its children, and its
parent, and TreeCursor, which provides a
similar service in a mutable and somewhat more efficient form, for
bulk iteration.
Syntax Nodes
Each node instance knows its node type, its
position in the tree, its
parent node, and its backing structure,
which can be used to access its children. To get the top node from a
tree, use its topNode getter.
Nodes come with various getters, like
nextSibling and
firstChild, which allow you to get a
node object for nearby nodes.
You can use a tree's resolve method to get
the inner node at, before, or after a given document position. This,
in combination with iterating up the parent nodes of the result, is
often a useful way to figure out what syntactic constructs exist at a
given position.
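For example, this sketch lists the constructs around a document position pos by resolving to the innermost node there and walking up through its parents:

let node = tree.resolve(pos, 1) // 1: prefer nodes that start at pos
for (let cur = node; cur; cur = cur.parent)
  console.log(`${cur.name} (${cur.from}-${cur.to})`)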
Tree Cursors
Because syntax nodes allocate a new object for every node visited, they are very convenient for small-scale tree analysis (such as looking at a given position in the tree and its parent nodes), but too wasteful when iterating over large amounts of nodes.
The cursor abstraction allows you to move through
the tree without dealing with the intricacies of its data structures.
You can create a cursor into a tree by calling its
cursor method, optionally providing a position
that the cursor should move to.
Such cursors are always focused on a node, and will tell you its type and position. From that node, you can move in various directions. Somewhat like with a DOM node, you can move a cursor up to its parent, down to its first or last child, or to a sibling.
These motion methods will, if the motion is possible, update the cursor and return true. When they return false, there is no node in the requested direction, and the cursor state will not have changed.
By combining these motions, you can move through the tree in various
ways. The cursors have next and
prev helper methods that implement full
pre-order traversal. So you could do something like this to inspect
every node in a tree:
let cursor = tree.cursor()
do {
console.log(`Node ${cursor.name} from ${cursor.from} to ${cursor.to}`)
} while (cursor.next())
Note that unlike JavaScript
iterators,
on which you have to call next to get their first value, tree
cursors start out already pointing at the first value (they always
point at a node, there is no special start state), so they are a bit
more awkward to put in a for loop, and fit better with do/while
loops or loops with an if (!cur.next()) break at the end.
Node Groups
Because Lezer does not, unlike most parsers, organize a node's children beyond the order in which they appear in the source, it can be difficult to find the children you are interested in.
To provide at least some help with this, nodes can be assigned a
group. If you, for example, assign all
expression-type nodes the Expression group, you can find the
elements of an array literal node by gathering all children that
belong to that group.
SyntaxNodes provide two methods,
getChild and
getChildren that help with this.
They can get the first child (or all children) that has a given node
type or group.
let elements = arrayNode.getChildren("Expression")
In addition, they allow you to specify that you are only interested in children that occur after, before, or between other node types. If you have, for example, an index expression as...
IndexExpression { expression "[" expression "]" }
Assuming that the brackets are visible in the tree, you could get the
array and index parts through node.getChild("Expression", null, "[")
and node.getChild("Expression", "[", "]") respectively.
To make assigning groups a bit less annoying, since such groups are
often already organized in a parent rule, for a rule where each choice
produces only a single named node, you can use the @isGroup
pseudo-prop in your grammar to add a group to all those named nodes.
statement[@isGroup=Statement] {
IfStatement |
ForStatement |
ExpressionStatement |
declaration
}
Assuming declaration itself only has choices that produce a single
named node, this'll assign all those, along with the named nodes
directly mentioned in the rule, to the Statement group.