A nice parser combinator library for Kotlin JVM, JS, and Multiplatform projects
val booleanGrammar = object : Grammar<BooleanExpression>() {
val id by token("\\w+")
val not by token("!")
val and by token("&")
val or by token("\\|")
val ws by token("\\s+", ignore = true)
val lpar by token("\\(")
val rpar by token("\\)")
val term by
(id use { Variable(text) }) or
(-not * parser(this::term) map { Not(it) }) or
(-lpar * parser(this::rootParser) * -rpar)
val andChain by leftAssociative(term, and) { l, _, r -> And(l, r) }
override val rootParser by leftAssociative(andChain, or) { l, _, r -> Or(l, r) }
}
val ast = booleanGrammar.parseToEnd("a & !b | b & (!a | c)")
Add the JCenter repository:
repositories {
maven { setUrl("https://dl.bintray.com/hotkeytlt/maven") }
}
Then, in Kotlin/JVM projects:
dependencies {
compile 'com.github.h0tk3y.betterParse:better-parse-jvm:0.4.0-alpha-3'
}
Note: for version 0.3.5 and below, use better-parse instead of better-parse-jvm.
In Kotlin/JS projects (since 0.4.0-alpha-3):
dependencies {
compile 'com.github.h0tk3y.betterParse:better-parse-js:0.4.0-alpha-3'
}
In Kotlin Multiplatform projects (since 0.4.0-alpha-3):
dependencies {
commonMainApi 'com.github.h0tk3y.betterParse:better-parse-metadata:0.4.0-alpha-3'
/* Note: adjust the below examples to your targets set. You may need to:
* replace the prefixes: if your JVM target is `myJvm6`, use `myJvm6MainApi` instead of `jvmMainApi`
* remove the dependencies for the targets you don't have: if you don't target Linux x64, remove `linuxX64MainApi`
* add the targets not listed below; note that the artifact IDs contain the lowercased preset name, for example,
use `better-parse-androidnativearm32` for your target from the androidNativeArm32 preset
*/
jvmMainApi 'com.github.h0tk3y.betterParse:better-parse-jvm:0.4.0-alpha-3'
jsMainApi 'com.github.h0tk3y.betterParse:better-parse-js:0.4.0-alpha-3'
mingwX64MainApi 'com.github.h0tk3y.betterParse:better-parse-mingwx64:0.4.0-alpha-3'
linuxX64MainApi 'com.github.h0tk3y.betterParse:better-parse-linuxx64:0.4.0-alpha-3'
}
A simpler way is possible: if you enable the experimental Gradle metadata, add just a single dependency:
dependencies {
commonMainApi 'com.github.h0tk3y.betterParse:better-parse-multiplatform:0.4.0-alpha-3'
}
Note: this version of better-parse-multiplatform is published with Gradle 4.10.2. Future Gradle versions may fail to consume this dependency due to the experimental status of the metadata.
Like many other language recognition tools, better-parse abstracts away from raw character input by
pre-processing it with a Tokenizer that matches Tokens by their patterns (regular expressions) against an input sequence.
A Tokenizer turns an input such as an InputStream or a String into a Sequence<TokenMatch>, where each match carries its position in the input.
One way to create a Tokenizer is to first define the Tokens to be matched:
val id = Token("\\w+")
val cm = Token(",")
val ws = Token("\\s+", ignore = true)
A Token can be ignored by setting its ignore = true. An ignored token can still be matched explicitly, but if another token is expected, the ignored one is just dropped from the sequence.
val tokenizer = DefaultTokenizer(listOf(id, cm, ws))
Note: the order of the tokens matters in some cases, because the tokenizer tries to match them in exactly this order. For instance, if Token("a") is listed before Token("aa"), the latter will never be matched. Be careful with keyword tokens!
val tokenMatches: Sequence<TokenMatch> = tokenizer.tokenize("hello, world") // Other types of input are supported as well.
A more convenient way of defining tokens is described in the Grammar section.
It is possible to provide a custom implementation of a Tokenizer.
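The ordering rule from the note above can be illustrated with a standalone sketch that does not use better-parse at all. It assumes a simplified model in which the tokenizer tries the patterns in list order at each position and takes the first one that matches; `TokenDef`, `Lexeme`, and `firstMatchTokenize` are hypothetical names, and the real DefaultTokenizer may differ in its details.

```kotlin
// Standalone sketch; TokenDef, Lexeme and firstMatchTokenize are hypothetical names,
// not part of better-parse.
data class TokenDef(val name: String, val pattern: Regex, val ignore: Boolean = false)
data class Lexeme(val name: String, val text: String, val position: Int)

fun firstMatchTokenize(defs: List<TokenDef>, input: String): List<Lexeme> {
    val result = mutableListOf<Lexeme>()
    var pos = 0
    while (pos < input.length) {
        // Try the definitions in list order; the first one that matches at `pos` wins.
        var hit: Pair<TokenDef, String>? = null
        for (def in defs) {
            val m = def.pattern.find(input, pos)
            if (m != null && m.range.first == pos && m.value.isNotEmpty()) {
                hit = def to m.value
                break
            }
        }
        val (def, text) = hit ?: error("No token matches at position $pos")
        // Ignored tokens are consumed but not reported.
        if (!def.ignore) result += Lexeme(def.name, text, pos)
        pos += text.length
    }
    return result
}

val tokA = TokenDef("a", Regex("a"))
val tokAa = TokenDef("aa", Regex("aa"))

// With `a` listed first, `aa` is never produced.
val shortFirst = firstMatchTokenize(listOf(tokA, tokAa), "aaa").map { it.name }
// Listing the longer token first gives it a chance to match.
val longFirst = firstMatchTokenize(listOf(tokAa, tokA), "aaa").map { it.name }
```

Under this model, keyword tokens would be listed before a general identifier token for the same reason.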
A Parser<T> is an object that accepts an input sequence (a sequence of tokens, Sequence<TokenMatch>) and
tries to convert some (from none to all) of its items into a T. In better-parse, parsers are also used
as building blocks to create new parsers by combining them.
When a parser tries to process the input, there are two possible outcomes:
- If it succeeds, it returns Parsed<T> containing the T result and the remainder: Sequence<TokenMatch> that it left unprocessed. The latter can then be, and often is, passed to another parser.
- If it fails, it reports the failure by returning an ErrorResult, which provides detailed information about the failure.
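The two outcomes can be modelled as a small sealed hierarchy. The following is a standalone sketch over a list of strings; `Outcome`, `Success`, `Failure`, `MiniParser`, and `literal` are hypothetical names that only mimic the shape described above and are not the library's own Parsed<T> and ErrorResult classes.

```kotlin
// Standalone sketch; these types are hypothetical stand-ins, not better-parse classes.
sealed class Outcome<out T>
data class Success<T>(val value: T, val remainder: List<String>) : Outcome<T>()
data class Failure(val message: String) : Outcome<Nothing>()

// A parser wraps a function from the remaining input to an outcome.
class MiniParser<out T>(val parse: (List<String>) -> Outcome<T>)

// Succeeds if the input starts with `expected`, consuming exactly one item.
fun literal(expected: String): MiniParser<String> = MiniParser<String> { input ->
    if (input.firstOrNull() == expected) Success(expected, input.drop(1))
    else Failure("Expected '$expected' at the start of $input")
}

val tokens = listOf("a", "b", "c")
val ok = literal("a").parse(tokens)       // succeeds, remainder is [b, c]
val failed = literal("b").parse(tokens)   // fails, the input starts with "a"

// The remainder left by one parser is the input of the next one.
val next = (ok as Success<String>).let { literal("b").parse(it.remainder) }
```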
A very basic parser to start with is a Token itself: when given an input Sequence<TokenMatch>, it succeeds if the sequence starts
with the match of this token itself (possibly, skipping some ignored tokens) and returns that TokenMatch, also excluding it
(and, possibly, some ignored tokens) from the remainder.
val a = Token("a+")
val b = Token("b+")
val tokenMatches = DefaultTokenizer(listOf(a, b)).tokenize("aabbaaa")
val result = a.tryParse(tokenMatches) // contains the match for "aa" and the remainder with "bbaaa" in it
Simpler parsers can be combined to build a more complex parser, from tokens to terms and to the whole language.
There are several kinds of combinators included in better-parse:
- map, use, asJust
  The map combinator takes a successful result of another parser and applies a transforming function to it. Error results are returned unchanged.
  val id = Token("\\w+")
  val aText = a map { it.text } // Parser<String>, returns the matched text from the input sequence
  A parser for objects of a custom type can be created with map:
  val variable = a map { JavaVariable(name = it.text) } // Parser<JavaVariable>.
  - someParser use { ... } is a map equivalent that takes a function with receiver instead. Example: id use { text }.
  - foo asJust bar can be used to map a parser to some constant value.
- optional(...)
  Given a Parser<T>, tries to parse the sequence with it, but returns a null result if the parser failed, and thus never fails itself:
  val p: Parser<T> = ...
  val o = optional(p) // Parser<T?>
- and, and skip(...)
  The tuple combinator arranges the parsers in a sequence, so that the remainder of the first one goes to the second one and so on. If all the parsers succeed, their results are merged into a Tuple. If any parser fails, its ErrorResult is returned by the combinator.
  val a: Parser<A> = ...
  val b: Parser<B> = ...
  val aAndB = a and b // This is a `Parser<Tuple2<A, B>>`
  val bAndBAndA = b and b and a // This is a `Parser<Tuple3<B, B, A>>`
  You can skip(...) components in a tuple combinator: the parsers will be called just as well, but their results won't be included in the resulting tuple:
  val bbWithoutA = skip(a) and b and skip(a) and b and skip(a) // Parser<Tuple2<B, B>>
  If all the components in an and chain are skipped except for one Parser<T>, the resulting parser is Parser<T>, not Parser<Tuple1<T>>.
  To process the resulting Tuple, use the aforementioned map and use. These parsers are equivalent:
  - val fCall = id and skip(lpar) and id and skip(rpar) map { (fName, arg) -> FunctionCall(fName, arg) }
  - val fCall = id and lpar and id and rpar map { (fName, _, arg, _) -> FunctionCall(fName, arg) }
  - val fCall = id and lpar and id and rpar use { FunctionCall(t1, t3) }
  - val fCall = id * -lpar * id * -rpar use { FunctionCall(t1, t2) } (see operators below)
  There are Tuple classes up to Tuple16 and the corresponding and overloads.
  There are operator overloads for a more compact definition of and chains:
  - a * b is equivalent to a and b.
  - -a is equivalent to skip(a).
  With these operators, the parser a and skip(b) and skip(c) and d can also be defined as a * -b * -c * d.
- or
  The alternative combinator tries to parse the sequence with the parsers it combines one by one until one succeeds. If all the parsers fail, the returned ErrorResult is an AlternativesFailure instance that contains all the failures from the parsers.
  The result type for the combined parsers is the least common supertype (which is possibly Any).
  val expr = const or variable or fCall
- zeroOrMore(...), oneOrMore(...), N times, N timesOrMore, N..M times
  These combinators transform a Parser<T> into a Parser<List<T>>, invoking the parser several times and failing if there were not enough matches.
  val modifiers = zeroOrMore(functionModifier)
  val rectangleParser = 4 times number map { (a, b, c, d) -> Rect(a, b, c, d) }
- separated(term, separator), separatedTerms(term, separator), leftAssociative(...), rightAssociative(...)
  Combines the two parsers, invoking them in turn and thus parsing a sequence of term matches separated by separator matches.
  The result is a Separated<T, S> which provides the matches of both parsers (note that there is one more term than there are separators) and can also be reduced in either direction.
  val number: Parser<Int> = ...
  val sumParser = separated(number, plus) use { reduce { a, _, b -> a + b } }
  The leftAssociative and rightAssociative combinators do exactly this, but they take the reducing operation as they are built:
  val term: Parser<Term>
  val andChain = leftAssociative(term, andOperator) { l, _, r -> And(l, r) }
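To make the mechanics of the combinators listed above more concrete, here is a standalone sketch that re-creates a few of them over a plain list of strings. It is not how better-parse implements them: the library works on Sequence<TokenMatch>, and every name below (`Res`, `Ok`, `Err`, `P`, `item`, `mapResult`, `then`, `orElse`, `opt`, `leftAssoc`) is hypothetical.

```kotlin
// Standalone sketch over a list of strings. All names are hypothetical and are not
// part of better-parse.
sealed class Res<out T>
data class Ok<T>(val value: T, val rest: List<String>) : Res<T>()
data class Err(val message: String) : Res<Nothing>()

class P<out T>(val parse: (List<String>) -> Res<T>)

// Matches a single input item that satisfies `test`.
fun item(name: String, test: (String) -> Boolean): P<String> = P<String> { input ->
    val head = input.firstOrNull()
    if (head != null && test(head)) Ok(head, input.drop(1)) else Err("expected $name")
}

// map: transform a successful result; errors pass through unchanged.
infix fun <A, B> P<A>.mapResult(f: (A) -> B): P<B> = P<B> { input ->
    when (val r = this.parse(input)) {
        is Ok -> Ok(f(r.value), r.rest)
        is Err -> r
    }
}

// and: run `this`, then `next` on the remainder, and pair both results.
infix fun <A, B> P<A>.then(next: P<B>): P<Pair<A, B>> = P<Pair<A, B>> { input ->
    when (val first = this.parse(input)) {
        is Err -> first
        is Ok -> when (val second = next.parse(first.rest)) {
            is Err -> second
            is Ok -> Ok(first.value to second.value, second.rest)
        }
    }
}

// or: try `this`; if it fails, try `other` on the same input.
infix fun <T> P<T>.orElse(other: P<T>): P<T> = P<T> { input ->
    val first = this.parse(input)
    if (first is Ok) first else other.parse(input)
}

// optional: never fails; yields null without consuming input when `p` fails.
fun <T> opt(p: P<T>): P<T?> = P<T?> { input ->
    when (val r = p.parse(input)) {
        is Ok -> r
        is Err -> Ok(null, input)
    }
}

// leftAssociative: term (op term)*, folded from the left while parsing.
fun <T> leftAssoc(term: P<T>, op: P<String>, combine: (T, String, T) -> T): P<T> = P<T> { input ->
    when (val first = term.parse(input)) {
        is Err -> first
        is Ok -> {
            var acc = first.value
            var rest = first.rest
            while (true) {
                val o = op.parse(rest)
                if (o !is Ok) break
                val t = term.parse(o.rest)
                if (t !is Ok) break
                acc = combine(acc, o.value, t.value)
                rest = t.rest
            }
            Ok(acc, rest)
        }
    }
}

val number: P<Int> = item("number") { s -> s.isNotEmpty() && s.all { it.isDigit() } } mapResult { it.toInt() }
val plus = item("+") { it == "+" }
val minus = item("-") { it == "-" }
val sum = leftAssoc(number, plus orElse minus) { l, op, r -> if (op == "+") l + r else l - r }

val result = sum.parse(listOf("10", "-", "3", "+", "4"))  // folds as (10 - 3) + 4
val signed = opt(minus) then number
val negative = signed.parse(listOf("-", "5"))
val positive = signed.parse(listOf("5"))
```

The subtraction in the example is there to show why the fold direction matters: a right fold over the same input would compute 10 - (3 + 4) instead.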
As a convenient way of defining the grammar of a language, there is an abstract class Grammar that collects the by-delegated
properties into a Tokenizer automatically, and also behaves as a composition of the Tokenizer and the rootParser.
Note: a Grammar also collects by-delegated Parser<T> properties so that they can be accessed as
declaredParsers along with the tokens. As a matter of good style, declare the parsers inside a Grammar by delegation as well.
interface Item
class Number(val value: Int) : Item
class Variable(val name: String) : Item
object ItemsParser : Grammar<List<Item>>() {
val num by token("\\d+")
val word by token("[A-Za-z]+")
val comma by token(",\\s+")
val numParser by num use { Number(text.toInt()) }
val varParser by word use { Variable(text) }
override val rootParser by separatedTerms(numParser or varParser, comma)
}
val result: List<Item> = ItemsParser.parseToEnd("one, 2, three, 4, five")
To use a parser that has not been constructed yet, reference it with parser { someParser } or parser(this::someParser):
val term by
constParser or
variableParser or
(-lpar and parser(this::term) and -rpar)
A Grammar implementation can override the tokenizer property to provide a custom implementation of Tokenizer.
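The reason a recursive rule needs a deferred reference such as parser { someParser } can be shown with a standalone sketch. It assumes a much simpler model than the library's: a `Rule` here is a hypothetical class that returns the end position of a match or null, and `lazyRule` is a hypothetical helper that looks its target up only at parse time.

```kotlin
// Standalone sketch; Rule, lazyRule, ch, then and orElse are hypothetical names,
// not part of better-parse.
class Rule(val parse: (String, Int) -> Int?)  // returns the end position, or null on failure

// Defers evaluating `ref` until parse time, so it may point at a rule that is still being built.
fun lazyRule(ref: () -> Rule): Rule = Rule { input, pos -> ref().parse(input, pos) }

fun ch(c: Char): Rule = Rule { input, pos -> if (input.getOrNull(pos) == c) pos + 1 else null }

infix fun Rule.then(next: Rule): Rule =
    Rule { input, pos -> this.parse(input, pos)?.let { end -> next.parse(input, end) } }

infix fun Rule.orElse(other: Rule): Rule =
    Rule { input, pos -> this.parse(input, pos) ?: other.parse(input, pos) }

object Parens {
    // nested := "x" | "(" nested ")"
    // `nested` cannot be used directly in its own initializer because it has no value yet;
    // the lambda postpones the access until the first parse.
    val nested: Rule = ch('x') orElse (ch('(') then lazyRule { nested } then ch(')'))
}

fun matches(input: String): Boolean = Parens.nested.parse(input, 0) == input.length
```

With this definition, `matches("((x))")` succeeds while `matches("((x)")` does not, because the closing parenthesis of the outer level is missing.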
A Parser<T> can be converted to a Parser<SyntaxTree<T>>, where a SyntaxTree<T> contains, along with the parsed T,
the children syntax trees, a reference to the parser, and the positions in the input sequence.
This can be done with parser.liftToSyntaxTreeParser().
This can be used for syntax highlighting and for inspecting the resulting tree in case the parsed result does not contain the full syntactic structure.
For convenience, a Grammar can also be lifted to one that parses a SyntaxTree with
grammar.liftToSyntaxTreeGrammar().
val treeGrammar = booleanGrammar.liftToSyntaxTreeGrammar()
val tree = treeGrammar.parseToEnd("a & !b | c -> d")
assertTrue(tree.parser == booleanGrammar.implChain)
val firstChild = tree.children.first()
assertTrue(firstChild.parser == booleanGrammar.orChain)
assertTrue(firstChild.range == 0..9)
There are optional arguments for customizing the transformation:
- LiftToSyntaxTreeOptions
  - retainSkipped: whether the resulting syntax tree should include skipped and components;
  - retainSeparators: whether the separators parsed by the Separated combinator should be included;
- structureParsers: defines the parsers that are retained in the syntax tree; the nodes with parsers that are not in this set are flattened so that their children are attached to their parents in their place.
  For Parser<T>, the default is null, which means no nodes are flattened.
  In case of Grammar<T>, structureParsers defaults to the grammar's declaredParsers.
- transformer: a strategy to transform non-built-in parsers. If you define your own combinators and want them to be lifted to syntax tree parsers, pass a LiftToSyntaxTreeTransformer that will be called on the parsers. When a custom combinator nests another parser, a transformer implementation should call default.transform(...) on that parser.
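The flattening rule described for structureParsers can be sketched independently of the library. In the sketch below, `Node` and `flatten` are hypothetical, parsers are identified by plain strings, and the behavior shown is only an interpretation of the description above, not better-parse's SyntaxTree implementation.

```kotlin
// Standalone sketch; Node and flatten are hypothetical names, not part of better-parse.
data class Node(val parser: String, val children: List<Node> = emptyList())

// Keeps the nodes whose parser is in `structure`. Any other node is replaced by its
// (recursively flattened) children, which end up attached to the nearest retained ancestor.
fun flatten(node: Node, structure: Set<String>?): List<Node> {
    if (structure == null) return listOf(node)  // null means that nothing is flattened
    val kids = node.children.flatMap { flatten(it, structure) }
    return if (node.parser in structure) listOf(node.copy(children = kids)) else kids
}

val tree = Node("or", listOf(
    Node("and", listOf(Node("term", listOf(Node("id"))), Node("term", listOf(Node("id"))))),
    Node("term", listOf(Node("id")))
))

// Only "or" and "id" are structure parsers, so "and" and "term" nodes disappear
// and the three "id" leaves become direct children of "or".
val flat = flatten(tree, setOf("or", "id")).single()
```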
See SyntaxTreeDemo.kt for an example of working with syntax trees.
- A boolean expressions parser that constructs a simple AST: BooleanExpression.kt
- An integer arithmetic expressions evaluator: ArithmeticsEvaluator.kt
- A toy programming language parser: (link)
- A sample JSON parser by silmeth: (link)
See the benchmarks repository h0tk3y/better-parse-benchmark and feel free to contribute.