The pigeon command generates parsers based on a parsing expression grammar (PEG). Its grammar and syntax are inspired by the PEG.js project, while the implementation is loosely based on the "parsing expression grammar for C# 3.0" article. It parses Unicode text encoded in UTF-8.
- Performance improvements (5-10x faster than the original version):
  - Removed `parser.state`, because it is very slow. `Memoized` can work together with the `Optimize` option.
  - More memory efficient: fewer memory allocations.
  - Generates cleaner code with fewer lines.
- `parseSeqExpr` only collects values that are explicitly returned, improving performance:
  - Example 1: `e <- items:(__ Integer __)+ EOF { return items }` returns an empty array `[]` since no values are explicitly marked for collection.
  - Example 2: `e <- items:(__ i:Integer __ { return i })+ EOF { return items }` returns `[1, 2, 3]` because integers are explicitly collected.
  - This is more efficient than the original version, which would return `[[nil, 1, nil], [nil, 2, nil], [nil, 3, nil]]` with unnecessary nil values.
- String capture:
  - `expr <- val:<anotherExpr> { fmt.Println(val.(string)) }` // the string value is in `val`
  - `expr <- val:<(A '=' B)> { fmt.Println(val.(string)) }` // the string value is in `val`, e.g. "A=B"
- Use code to control matching behavior (`andCodeExpr` and `notCodeExpr`):
  - `expr <- &{ return c.data.AllowNumber } [0-9]+` // only matches digits if `c.data.AllowNumber` is true
  - `expr <- val:<[0-9]+> &{ return val.(string) == "123" } { return val.(string) }` // only succeeds if the matched string equals "123"
  - Tip: don't enable memoization if you control matching behavior with code.
- Logical `and/or` match:
  - `expr <- &&testExpr testExpr` // if `testExpr` succeeds but matches nothing (e.g. `testExpr <- 'A'*`), `&&testExpr` returns false.
- Multiple peg files supported:
  - Run `pigeon -o script1.peg.go script1.peg` to generate a normal parser.
  - Run `pigeon -grammar-only -grammar-name=g2 -run-func-prefix="_s2_" -o script2.peg.go script2.peg` to generate grammar-only code in the same package.
  - Use it with `newParser("filename", "expr").parse(g2)`.
- Simplified `actionExpr`:
  - The original version required every action to return two values (val, error), but errors are rarely used, so this fork simplifies the return values.
  - Examples:
    - `expr <- [0-9]+ { fmt.Println(expr) }` is fine in this fork and returns nothing.
    - `expr <- "true" { return 1 }` if you want to return something.
  - Add an error manually: `expr <- "if" { p.addErr(errors.New("keyword is not allowed")) }` is equivalent to `expr <- "if" { return nil, errors.New("keyword is not allowed") }` in the original pigeon.
- Provide a struct (`ParserCustomData`) to embed, to replace the `globalStore`:
  - You must define a struct `ParserCustomData` in your module.
  - Access data through `c.data`, for example: `expr <- { fmt.Println(c.data.MyOption) }`
  - `globalState` is removed.
- Removed `ParseFile` and `ParseReader`, and renamed `Parse` and all options to lowercase (issue, branch feat/rename-exported-api):
  - `ParseReader` converts an io.Reader to bytes and then invokes `parse`, which doesn't make sense.
  - The function `Parse` and all options (`MaxExpressions`, `Entrypoint`, `Statistics`, `Debug`, `Memoize`, `AllowInvalidUTF8`, `Recover`, `GlobalStore`, `InitState`) were exposed to users of the module. I think exposing them is not a good idea.
- Skip "actionExpr" while looking ahead (issue, branch feat/skip-code-expr-while-looking-ahead):
  - See details in the issue.
  - `*{}`/`&{}`/`!{}` are not skipped.
- ActionExpr refactored (issue, branch refactor/actionExpr):
  - Unlimited ActionExpr (CodeExpr): a grammar like `expr <- firstPart:[0-9]+ { fmt.Println(firstPart) } secondPart:[a-z]+ { fmt.Println(firstPart, secondPart) }` is allowed in this fork.
  - You can access the parser in an ActionExpr: `expr <- { fmt.Println(p) }`
  - `stateCodeExpr(#{})` was removed.
- `position` is removed from the generated code:
  - It produced a lot of diffs in version control.
  - You can keep it by setting `SetRulePos` to true and rebuilding.
- Added the `-optimize-ref-expr-by-index` option:
  - An option to tweak `RefExpr`, the most frequently used expr in the parser.
  - About 10% faster with this option.
- Removed the `-support-left-recursion` option:
  - It's not used much, so I removed it to make maintenance easier.
- Removed the `-optimize-grammar` option:
  - It has bugs, and its effect is not significant.
- Removed the `-optimize-basic-latin` option:
  - There is no evidence to suggest that it is an optimization.
- `charClassMatcher`/`anyMatcher`/`litMatcher` no longer return bytes, for performance reasons:
  - Use string capture or `c.text` instead.
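
To make the `parseSeqExpr` item above concrete, here is a minimal Go sketch that compares the two result shapes quoted in its examples. It does not use pigeon itself: `fromOriginalShape` and `fromForkShape` are hypothetical helpers written only for illustration, and the literal values stand in for what a parser would hand back for the three integers.

```go
package main

import "fmt"

// fromOriginalShape extracts the integers from the shape the original
// version would return: one []any per repetition, with nil placeholders
// for the unlabeled whitespace matches around the integer.
func fromOriginalShape(items any) []int {
	var out []int
	for _, item := range items.([]any) {
		seq := item.([]any)
		out = append(out, seq[1].(int)) // the integer sits between two nils
	}
	return out
}

// fromForkShape handles the shape produced when each repetition
// explicitly returns the integer: a flat []any of ints.
func fromForkShape(items any) []int {
	var out []int
	for _, item := range items.([]any) {
		out = append(out, item.(int))
	}
	return out
}

func main() {
	original := []any{
		[]any{nil, 1, nil},
		[]any{nil, 2, nil},
		[]any{nil, 3, nil},
	}
	fork := []any{1, 2, 3}

	fmt.Println(fromOriginalShape(original)) // [1 2 3]
	fmt.Println(fromForkShape(fork))         // [1 2 3]
}
```

With the flat shape, the action code needs one type assertion per element instead of an extra slice unpacking step, and the parser allocates no inner slices.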
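
The tip about not enabling memoization when code controls matching can be illustrated with a toy model. The sketch below is not pigeon's implementation: `toyParser`, its `memo` table, and `allowNumber` are hypothetical stand-ins, assuming only that a memoizing parser caches a rule's result per input position. Once a result is cached, a later change to the state that the predicate reads goes unnoticed.

```go
package main

import "fmt"

// toyParser is a hypothetical stand-in for a memoizing parser.
type toyParser struct {
	input       string
	allowNumber bool         // external state read by the code predicate
	memo        map[int]bool // cached rule result, keyed by input position
	useMemo     bool
}

// matchDigits models `expr <- &{ return allowNumber } [0-9]+` at pos.
// Only the first digit is checked, which is enough for this illustration.
func (p *toyParser) matchDigits(pos int) bool {
	if p.useMemo {
		if ok, found := p.memo[pos]; found {
			return ok // cached: the predicate is not evaluated again
		}
	}
	ok := p.allowNumber && pos < len(p.input) &&
		p.input[pos] >= '0' && p.input[pos] <= '9'
	if p.useMemo {
		p.memo[pos] = ok
	}
	return ok
}

// run tries the rule twice at position 0, flipping the state in between.
func run(useMemo bool) (before, after bool) {
	p := &toyParser{input: "42", memo: map[int]bool{}, useMemo: useMemo}
	before = p.matchDigits(0) // predicate is false here
	p.allowNumber = true      // state changes between the two attempts
	after = p.matchDigits(0)
	return before, after
}

func main() {
	fmt.Println(run(false)) // false true
	fmt.Println(run(true))  // false false (stale cached result)
}
```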
go install github.com/fy0/pigeon@latest
This will install or update the package, and the pigeon command will be installed in your $GOBIN directory. Neither this package nor the parsers generated by this command require any third-party dependency, unless such a dependency is used in the code blocks of the grammar.
pigeon [options] [PEG_GRAMMAR_FILE]
By default, the input grammar is read from stdin and the generated code is printed to stdout. You may save it in a file using the -o flag.
GitHub user @mna created the original package in April 2015, and @breml is the original package's maintainer as of May 2017.
Given the following grammar:
{
//nolint:unreachable
package main

type ParserCustomData struct {
}

var ops = map[string]func(int, int) int {
    "+": func(l, r int) int {
        return l + r
    },
    "-": func(l, r int) int {
        return l - r
    },
    "*": func(l, r int) int {
        return l * r
    },
    "/": func(l, r int) int {
        return l / r
    },
}

func toAnySlice(v any) []any {
    if v == nil {
        return nil
    }
    return v.([]any)
}

func eval(first, rest any) int {
    l := first.(int)
    restSl := toAnySlice(rest)
    for _, v := range restSl {
        restExpr := toAnySlice(v)
        r := restExpr[1].(int)
        op := restExpr[0].(string)
        l = ops[op](l, r)
    }
    return l
}
}

Input <- expr:Expr EOF {
    return expr
}

Expr <- _ first:Term rest:( _ op:AddOp _ r:Term { return []any{op, r} })* _ {
    return eval(first, rest)
}

Term <- first:Factor rest:( _ op:MulOp _ r:Factor { return []any{op, r} })* {
    return eval(first, rest)
}

Factor <- '(' expr:Expr ')' {
    return expr
} / integer:Integer {
    return integer
}

AddOp <- ( '+' / '-' ) {
    return string(c.text)
}

MulOp <- ( '*' / '/' ) {
    return string(c.text)
}

Integer <- '-'? [0-9]+ {
    v, err := strconv.Atoi(string(c.text))
    if err != nil {
        p.addErr(err)
    }
    return v
}

_ "whitespace" <- [ \n\t\r]*

EOF <- !.
The generated parser can parse simple arithmetic operations, e.g.:
18 + 3 - 27 * (-18 / -3)
=> -141
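
To see how the labeled values flow into `eval`, the sketch below copies `ops`, `toAnySlice`, and `eval` from the grammar's initializer and feeds them hand-built values in the shape the actions return (`[]any{op, r}` per repetition). It does not run the generated parser; the nested calls in `main` are an assumption about the order in which the rules would fire for the input above.

```go
package main

import "fmt"

// ops, toAnySlice and eval are copied from the grammar's initializer block.
var ops = map[string]func(int, int) int{
	"+": func(l, r int) int { return l + r },
	"-": func(l, r int) int { return l - r },
	"*": func(l, r int) int { return l * r },
	"/": func(l, r int) int { return l / r },
}

func toAnySlice(v any) []any {
	if v == nil {
		return nil
	}
	return v.([]any)
}

// eval folds the (operator, operand) pairs in rest onto first, left to right.
func eval(first, rest any) int {
	l := first.(int)
	for _, v := range toAnySlice(rest) {
		restExpr := toAnySlice(v)
		r := restExpr[1].(int)
		op := restExpr[0].(string)
		l = ops[op](l, r)
	}
	return l
}

func main() {
	// (-18 / -3): a Term with first = -18 and one MulOp repetition.
	inner := eval(-18, []any{[]any{"/", -3}})
	// 27 * (...): a Term with first = 27 and one MulOp repetition.
	product := eval(27, []any{[]any{"*", inner}})
	// 18 + 3 - product: an Expr with two AddOp repetitions.
	result := eval(18, []any{[]any{"+", 3}, []any{"-", product}})
	fmt.Println(result) // -141
}
```

Because `Term` is evaluated before `Expr` folds its `AddOp` pairs, multiplication and division bind tighter than addition and subtraction without any explicit precedence table.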
More examples can be found in the examples/ subdirectory.
See the package documentation for detailed usage.
See the CONTRIBUTING.md file.
The BSD 3-Clause license. See the LICENSE file.
- performance: Create another version of `parseOneOrMoreExpr`/`parseZeroOrMoreExpr` that does not collect results, and choose between the two depending on whether the expr is labeled. A bit faster.
- performance: Remove `pushV` and `popV`. A bit faster.
- performance: In `parseCharClassMatcher`, the variable `start` can be removed in most cases. It causes a lot of small memory allocations.
- performance: Remove Wrap functions if they are not needed.
- performance: Too many `any`; can we remove `parseExpr`?
- String capture inside a predicate expr does not work: `&( alist:<("a")*> &{ fmt.Println(alist) } )`
- Automatically remove `return nil` if it is unreachable.