Syntax Analysis
Compiler construction
Introduction to syntax analysis
• The parser (syntax analyzer) receives the source
code in the form of tokens from the lexical
analyzer and performs syntax analysis, which
creates a tree-like intermediate representation
depicting the grammatical structure of the
token stream.
• Syntax analysis is also called parsing.
• It involves analyzing the structure of source code
to ensure it adheres to the grammatical rules of
the language.
• A typical representation is an abstract syntax tree,
where:
• Each interior node represents an operation
• The children of the node represent the
arguments of the operation
Syntactic Analysis
• Input: sequence of tokens from scanner
• Output: abstract syntax tree
• Actually,
• parser first builds a parse tree
• AST is then built by translating the parse
tree
• parse tree rarely built explicitly; it is
implicitly determined by how the parser
manipulates its stack
Introduction to syntax analysis
Parse Tree/Abstract Syntax Tree
(AST)
• Parse Tree: A tree representation
of the syntactic structure of the
input based on the grammar rules.
• Abstract Syntax Tree (AST): A
simplified version of the parse tree
that abstracts away certain
syntactic details, focusing instead
on the logical structure of the code.
Example
• Source Code
• 4*(2+3)
• Parser input
• NUM(4) TIMES LPAR NUM(2) PLUS
NUM(3) RPAR
• Parser output (AST):

          *
        /   \
   NUM(4)    +
           /   \
      NUM(2)   NUM(3)
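A tree like this can be encoded with small node classes. The sketch below is illustrative; the names Num, BinOp, and evaluate are my own, not from the slides:

```python
from dataclasses import dataclass

# Interior nodes are operations; children are the operands.
@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str      # "*" or "+"
    left: object
    right: object

# AST for 4 * (2 + 3): the parentheses leave no trace in the tree;
# the grouping is captured by the tree shape alone.
ast = BinOp("*", Num(4), BinOp("+", Num(2), Num(3)))

def evaluate(node):
    """Evaluate the expression tree bottom-up."""
    if isinstance(node, Num):
        return node.value
    l, r = evaluate(node.left), evaluate(node.right)
    return l * r if node.op == "*" else l + r

print(evaluate(ast))  # → 20
```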
Another example
• Source Code
• if (x == y) { a=1; }
• Parser input
• IF LPAR ID EQ ID RPAR LBR ID AS INT
SEMI RBR
• Parser output (AST):

        IF-THEN
        /     \
      ==       =
     /  \     /  \
   ID    ID  ID   INT
Introduction to syntax analysis
Example:
For a simple expression like a + b * c, the syntax
analysis might involve:
• Tokenizing: a, +, b, *, c
• Using grammar rules to recognize:
An expression consists of terms and
operators.
The multiplication operator has higher
precedence than the addition operator.
Constructing an AST in which * appears
below +, since b * c is grouped first.
Syntax Analysis Analogy
Syntax analysis for natural languages
• Identify the function of each word
• Recognize if a sentence is grammatically
correct
• Example: I gave Ali the card.
Position of Syntax Analyzer
Overview
Main Task: Take a token sequence from the
scanner and verify that it is a syntactically
correct program.
Secondary Tasks:
Process declarations and set up symbol table
information accordingly, in preparation for semantic
analysis.
Construct a syntax tree in preparation for
intermediate code generation.
Context-free Grammars
• A context-free grammar for a language
specifies the syntactic structure of
programs in that language.
• Components of a grammar:
• a finite set of tokens (obtained from the
scanner);
• a set of variables representing “related”
sets of strings, e.g., declarations,
statements, expressions.
• a set of rules that show the structure of
these strings.
• an indication of the “top-level” set of
strings, i.e., the start symbol.
Context-free Grammars: Definition
Formally, a context-free grammar G is a 4-
tuple G = (V, T, P, S), where:
• V is a finite set of variables (or
nonterminals). These describe sets of
“related” strings.
• T is a finite set of terminals (i.e., tokens).
• P is a finite set of productions, each of the
form
• A → α
• where A ∈ V is a variable, and α ∈ (V ∪
T)* is a sequence of terminals and
nonterminals.
Context-free Grammars: An Example
A grammar for palindromic bit-strings:
G = (V, T, P, S), where:
• V = { S, B }
• T = {0, 1}
• P = { S → B,
S → ε,
S → 0 S 0,
S → 1 S 1,
B → 0,
B → 1
}
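A grammar like this is easy to encode directly as data. A minimal sketch (the representation and the naive membership test are my own; () stands for ε):

```python
# G = (V, T, P, S) for palindromic bit-strings, encoded as plain data.
V = {"S", "B"}
T = {"0", "1"}
S = "S"
# P: each nonterminal maps to its list of right-hand sides;
# a RHS is a tuple of symbols, and () encodes the empty string eps.
P = {
    "S": [("B",), (), ("0", "S", "0"), ("1", "S", "1")],
    "B": [("0",), ("1",)],
}

def derivable(w, sym="S"):
    """Naive membership test: can `sym` derive the string w?
    Exponential in general; fine only for tiny examples."""
    if sym in T:
        return w == sym
    return any(matches(w, rhs) for rhs in P[sym])

def matches(w, rhs):
    # Try all ways to split w among the symbols of rhs.
    if not rhs:
        return w == ""
    head, rest = rhs[0], rhs[1:]
    return any(derivable(w[:i], head) and matches(w[i:], rest)
               for i in range(len(w) + 1))

print(derivable("0110"))  # palindrome → True
print(derivable("01"))    # not a palindrome → False
```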
Context-free Grammars: Terminology
• Derivation: Suppose that
• α and β are strings of grammar symbols,
and
• A → γ is a production.
• Then, αAβ ⇒ αγβ (“αAβ derives αγβ”).
• ⇒ : “derives in one step”
• ⇒* : “derives in 0 or more steps”
• α ⇒* α (0 steps)
• α ⇒* β if α ⇒ γ and γ ⇒* β (≥ 1 steps)
Derivations: Example
• Grammar for palindromes: G = (V, T,
P, S),
• V = {S},
• T = {0, 1},
• P = { S → 0S0 | 1S1 | 0 | 1 | ε }.
• A derivation of the string 10101:
• S
• ⇒ 1S1 (using S → 1S1)
• ⇒ 10S01 (using S → 0S0)
• ⇒ 10101 (using S → 1)
Leftmost and Rightmost Derivations
• A leftmost derivation is one where, at each step,
the leftmost nonterminal is replaced.
• (analogous for rightmost derivation)
• Example: a grammar for arithmetic expressions:
• E → E + E | E * E | id
• Leftmost derivation:
• E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒
id + id * E ⇒ id + id * id
• Rightmost derivation:
• E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒
E + id * id ⇒ id + id * id
Context-free Grammars: Terminology
• The language of a grammar G =
(V,T,P,S) is
• L(G) = { w | w ∈ T* and S ⇒* w }.
• The language of a grammar
contains only strings of terminal
symbols.
• Two grammars G1 and G2 are
equivalent if
• L(G1) = L(G2).
Parse Trees
• A parse tree is a tree representation of a
derivation.
• Constructing a parse tree:
• The root is the start symbol S of the grammar.
• If the next derivation step rewrites a
nonterminal X using the production
X → Y1 … Yn, then the node for X is given
the children Y1, …, Yn.
Approaches to Parsing
• Top-down parsing:
• attempts to figure out the derivation for
the input string, starting from the start
symbol.
• Bottom-up parsing:
• starting with the input string, attempts to
“derive in reverse” and end up with the
start symbol;
• forms the basis for parsers obtained from
parser-generator tools such as yacc and
bison.
Top-down Parsing
• “top-down:” starting with the start symbol
of the grammar, try to derive the input
string.
• Parsing process: use the current state of the
parser, and the next input token, to guide
the derivation process.
• Implementation: use a finite state
automaton augmented with a runtime stack
(“pushdown automaton”).
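As a concrete sketch of this idea, here is a hand-written top-down (recursive-descent) parser for the balanced-parenthesis grammar S → ( S ) S | ε. The grammar and code are illustrative, not from the slides; the current input token guides which production to expand:

```python
def parse_balanced(s):
    """Recursive-descent recognizer for S -> ( S ) S | eps.
    `pos` plays the role of the parser's input pointer."""
    pos = 0

    def S():
        nonlocal pos
        # Lookahead decides the production: '(' means use S -> ( S ) S,
        # anything else means use S -> eps.
        if pos < len(s) and s[pos] == "(":
            pos += 1          # match '('
            S()               # parse inner S
            if pos >= len(s) or s[pos] != ")":
                raise SyntaxError("expected ')' at position %d" % pos)
            pos += 1          # match ')'
            S()               # parse trailing S
        # else: S -> eps, consume nothing

    try:
        S()
        return pos == len(s)  # accept only if all input is consumed
    except SyntaxError:
        return False

print(parse_balanced("(()())"))  # → True
print(parse_balanced("(()"))     # → False
```

The call stack of the nested `S()` calls is exactly the runtime stack of the pushdown automaton mentioned above.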
Bottom-up Parsing
• “bottom-up:” work backwards from the
input string to obtain a derivation for it.
• Parsing process: use the parser state to
keep track of:
• what has been seen so far, and
• given this, what the rest of the input
might look like.
• Implementation: use a finite state
automaton augmented with a runtime stack
(“pushdown automaton”).
Parsing: Top-down vs. Bottom-up
Parsing Problems: Ambiguity
• A grammar G is ambiguous if some string in L(G)
has more than one parse tree.
• Equivalently: if some string in L(G) has more than
one leftmost (rightmost) derivation.
• Example: The grammar
• E → E + E | E * E | id
• is ambiguous, since “id+id*id” has multiple
parses.
Dealing with Ambiguity
1. Transform the grammar to an equivalent
unambiguous grammar.
2. Use disambiguating rules along with the
ambiguous grammar to specify which
parse to use.
Comment: It is not possible to determine
algorithmically whether:
• Two given CFGs are equivalent;
• A given CFG is ambiguous.
Removing Ambiguity: Operators
• Basic idea: use additional nonterminals to
enforce associativity and precedence:
• Use one nonterminal for each precedence
level:
• E → E * E | E + E | id
• needs 2 nonterminals (2 levels of
precedence).
• Modify productions so that the “lower precedence”
nonterminal recurses in the direction of associativity:
• E → E + E becomes E → E + T (+ is
left-associative)
Example
• Original grammar:
• E → E * E | E / E | E + E | E – E |
( E ) | id
• precedence levels: { *, / } > { +, – }
• associativity: *, /, +, – are all left-
associative.
• Transformed grammar:
• E → E + T | E – T | T (precedence
level for: +, –)
• T → T * F | T / F | F (precedence
level for: *, /)
• F → ( E ) | id
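One way to see the effect of the layered grammar: a parser following E → E + T | E – T | T and T → T * F | T / F | F groups operators with the intended precedence and left-associativity. A sketch in evaluator form (the left recursion is handled with a loop, a standard technique and my own choice here, not from the slides):

```python
import re

def evaluate(expr):
    """Evaluate +, -, *, / over integers using the layered grammar
    E -> E+T | E-T | T ;  T -> T*F | T/F | F ;  F -> (E) | num."""
    tokens = re.findall(r"\d+|[()+\-*/]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def F():
        nonlocal pos
        if peek() == "(":
            pos += 1          # skip '('
            v = E()
            pos += 1          # skip ')'
            return v
        v = int(tokens[pos]); pos += 1
        return v

    def T():  # the loop realizes a left-associative chain of * and /
        nonlocal pos
        v = F()
        while peek() in ("*", "/"):
            op = tokens[pos]; pos += 1
            v = v * F() if op == "*" else v // F()   # integer division
        return v

    def E():  # the loop realizes a left-associative chain of + and -
        nonlocal pos
        v = T()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            v = v + T() if op == "+" else v - T()
        return v

    return E()

print(evaluate("1+2*3"))   # → 7   (* binds tighter than +)
print(evaluate("8-3-2"))   # → 3   (left-associative: (8-3)-2)
```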
Bottom-up parsing: Approach
1. Preprocess the grammar to compute some
info about it.
(FIRST and FOLLOW sets)
2. Use this info to construct a pushdown
automaton for the grammar:
• the automaton uses a table (“parsing
table”) to guide its actions;
• constructing a parser amounts to
constructing this table.
FIRST Sets
Defn: For any string of grammar symbols α,
FIRST(α) = { a | a is a terminal and α ⇒*
aβ }.
If α ⇒* ε then ε is also in FIRST(α).
Example: E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E′) = { +, ε }
FIRST(T′) = { *, ε }
Computing FIRST Sets
Given a sequence of grammar symbols α:
if α is a terminal or α = ε then FIRST(α)
= {α}.
if α is a nonterminal A with productions
A → α1 | … | αn then:
• FIRST(α) = FIRST(α1) ∪ … ∪ FIRST(αn).
if α is a sequence of symbols Y1 … Yk
then:
• for i = 1 to k do:
– add each a ∈ FIRST(Yi), such that a ≠
ε, to FIRST(α).
– if ε ∉ FIRST(Yi) then break;
• if ε is in each of FIRST(Y1), …, FIRST(Yk),
then add ε to FIRST(α).
Computing FIRST sets: cont’d
• For each nonterminal A in the grammar,
initialize FIRST(A) = ∅.
• repeat {
• for each nonterminal A in the grammar {
• compute FIRST(A); /* as described
previously */
•}
• } until there is no change to any FIRST set.
Example (FIRST Sets)
X → YZ | a
Y → b | ε
Z → c | ε
• X → a, so add a to FIRST(X).
• X → YZ, b ∈ FIRST(Y), so add b to FIRST(X).
• Y → ε, i.e. ε ∈ FIRST(Y), so add non-ε symbols from
FIRST(Z) to FIRST(X).
• ► add c to FIRST(X).
• ε ∈ FIRST(Y) and ε ∈ FIRST(Z), so add ε to FIRST(X).
• Final: FIRST(X) = { a, b, c, ε }.
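The fixpoint computation of FIRST sets can be written compactly. A sketch for the example grammar above; the representation is my own, with "" standing for ε:

```python
# Fixpoint computation of FIRST sets for the example grammar
# X -> YZ | a ;  Y -> b | eps ;  Z -> c | eps.
# "" stands for eps; each RHS is a tuple of symbols, () encodes eps.
EPS = ""
grammar = {
    "X": [("Y", "Z"), ("a",)],
    "Y": [("b",), ()],
    "Z": [("c",), ()],
}
nonterminals = set(grammar)

def first_sets(grammar):
    FIRST = {A: set() for A in grammar}

    def first_of_seq(seq):
        # FIRST of a sequence Y1 ... Yk, per the rules on the slide.
        result = set()
        for Y in seq:
            if Y not in nonterminals:          # terminal: done
                result.add(Y)
                return result
            result |= FIRST[Y] - {EPS}
            if EPS not in FIRST[Y]:            # Yi cannot vanish: stop
                return result
        result.add(EPS)                        # every Yi can derive eps
        return result

    changed = True
    while changed:                             # iterate to a fixpoint
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_seq(rhs)        # first_of_seq(()) == {EPS}
                if not new <= FIRST[A]:
                    FIRST[A] |= new
                    changed = True
    return FIRST

FIRST = first_sets(grammar)
print(FIRST["X"])  # contains a, b, c and "" (eps), matching the slide
```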
FOLLOW Sets
Definition: Given a grammar G = (V, T, P, S),
for any nonterminal A ∈ V:
• FOLLOW(A) = { a ∈ T | S ⇒* αAaβ for
some α, β }.
i.e., FOLLOW(A) contains those terminals
that can appear after A in something
derivable from the start symbol S.
• If S ⇒* αA then $ is also in FOLLOW(A).
($ ≡ EOF, “end of input.”)
Example:
E → E + E | id
FOLLOW(E) = { +, $ }.
Computing FOLLOW Sets
Given a grammar G = (V, T, P, S):
1. add $ to FOLLOW(S);
2. repeat {
• for each production A → αBβ in P, add
every non-ε symbol in FIRST(β) to
FOLLOW(B).
• for each production A → αBβ in P, where
ε ∈ FIRST(β), add everything in
FOLLOW(A) to FOLLOW(B).
• for each production A → αB in P, add
everything in FOLLOW(A) to FOLLOW(B).
} until no change to any FOLLOW set.
Example (FOLLOW Sets)
X → YZ | a
Y → b | ε
Z → c | ε
• X is start symbol: add $ to FOLLOW(X);
• X → YZ, so add everything in FOLLOW(X) to FOLLOW(Z).
• ► add $ to FOLLOW(Z).
• X → YZ, so add every non-ε symbol in FIRST(Z) to
FOLLOW(Y).
• ► add c to FOLLOW(Y).
• X → YZ and ε ∈ FIRST(Z), so add everything in FOLLOW(X)
to FOLLOW(Y).
• ► add $ to FOLLOW(Y).
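These rules also mechanize directly. A sketch for the same example grammar (again "" stands for ε; the FIRST sets are hard-coded from the earlier slide to keep the sketch short):

```python
# FOLLOW-set computation for X -> YZ | a ; Y -> b | eps ; Z -> c | eps.
EPS = ""          # "" encodes eps; a RHS is a tuple of symbols
grammar = {
    "X": [("Y", "Z"), ("a",)],
    "Y": [("b",), ()],
    "Z": [("c",), ()],
}
start = "X"
nonterminals = set(grammar)
# FIRST sets as computed on the previous slide.
FIRST = {"X": {"a", "b", "c", EPS}, "Y": {"b", EPS}, "Z": {"c", EPS}}

def first_of_seq(seq):
    result = set()
    for Y in seq:
        f = FIRST[Y] if Y in nonterminals else {Y}
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)       # the whole sequence can derive eps
    return result

def follow_sets():
    FOLLOW = {A: set() for A in nonterminals}
    FOLLOW[start].add("$")                     # rule 1
    changed = True
    while changed:                             # iterate to a fixpoint
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, B in enumerate(rhs):
                    if B not in nonterminals:
                        continue
                    beta = rhs[i + 1:]
                    fb = first_of_seq(beta)
                    new = fb - {EPS}           # non-eps FIRST(beta)
                    if EPS in fb:              # beta empty or nullable
                        new |= FOLLOW[A]
                    if not new <= FOLLOW[B]:
                        FOLLOW[B] |= new
                        changed = True
    return FOLLOW

print(follow_sets())  # FOLLOW(X)={$}, FOLLOW(Y)={c,$}, FOLLOW(Z)={$}
```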
Shift-reduce Parsing
• An instance of bottom-up parsing
• Basic idea: repeat
1. in the string being processed, find a
substring α such that A → α is a
production;
2. replace the substring α by A (i.e., reverse
a derivation step).
until we get the start symbol.
• Technical issues: Figuring out
1. which substring to replace; and
2. which production to reduce with.
Shift-reduce Parsing: Example
Grammar: S → aABe
A → Abc | b
B→d
Input: abbcde
aAbcde (reduced using A → b)
aAde (reduced using A → Abc)
aABe (reduced using B → d)
S (reduced using S → aABe)
Shift-Reduce Parsing: cont’d
• Need to choose reductions carefully:
• abbcde → aAbcde → aAbcBe → …
• doesn’t work.
• A handle of a string s is a substring β
such that:
• β matches the RHS of a rule A → β;
and
• replacing β by A (the LHS of the
rule) represents a step in the
reverse of a rightmost derivation of
s.
Shift-reduce Parsing: Implementation
• Data Structures:
• a stack, its bottom marked by ‘$’.
Initially empty.
• the input string, its right end marked by
‘$’. Initially w.
• Actions:
• repeat
• Shift some k (k ≥ 0) symbols from the
input string onto the stack, until a
handle β appears on top of the stack.
• Reduce β to the LHS of the appropriate
production.
• until ready to accept.
• Acceptance: when the input is empty and the
stack contains only the start symbol (above $).
Example
Stack (→)   Input     Action
$           abbcde$   shift
$a          bbcde$    shift
$ab         bcde$     reduce: A → b
$aA         bcde$     shift
$aAb        cde$      shift
$aAbc       de$       reduce: A → Abc
$aA         de$       shift
$aAd        e$        reduce: B → d
$aAB        e$        shift
$aABe       $         reduce: S → aABe
$S          $         accept

Grammar: S → aABe
         A → Abc | b
         B → d
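The trace above can be replayed mechanically. The sketch below is driven by a precomputed action script (in a real parser this decision comes from the parse table, covered next); the encoding is my own:

```python
# Grammar: S -> aABe ; A -> Abc | b ; B -> d.   Input: abbcde.
# The action script mirrors the table above: "s" = shift one symbol,
# and a (lhs, rhs) pair means "reduce the rhs on top of the stack to lhs".
script = ["s", "s", ("A", "b"), "s", "s", ("A", "Abc"),
          "s", ("B", "d"), "s", ("S", "aABe")]

def run(tokens, script):
    stack, inp = [], list(tokens)
    for act in script:
        if act == "s":                       # shift
            stack.append(inp.pop(0))
        else:                                # reduce
            lhs, rhs = act
            assert "".join(stack[-len(rhs):]) == rhs, "handle not on top"
            del stack[-len(rhs):]            # pop the handle ...
            stack.append(lhs)                # ... and push the LHS
        print("".join(stack).ljust(8), "".join(inp))
    return stack == ["S"] and not inp        # accept?

print(run("abbcde", script))  # → True
```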
Conflicts
• Can’t decide whether to shift or to reduce ―
both seem OK (“shift-reduce conflict”).
• Example: S → if E then S | if E then
S else S | …
• Can’t decide which production to reduce
with ― several may fit (“reduce-reduce
conflict”).
• Example: Stmt → id ( args ) | Expr
• Expr → id ( args )
LR Parsing
• A kind of shift-reduce parsing. An LR(k)
parser:
• scans the input L-to-R;
• produces a Rightmost derivation (in
reverse); and
• uses k tokens of lookahead.
• Advantages:
• very general and flexible, and handles a
wide class of grammars;
• efficiently implementable.
• Disadvantages:
• difficult to implement by hand (use
parser-generator tools instead).
LR Parsing: Schematic
• The driver program is the same for all LR
parsers (SLR(1), LALR(1), LR(1), …). Only
the parse table changes.
• Different LR parsing algorithms involve
different tradeoffs between parsing power and
parse table size.
LR Parsing: the parser stack
• The parser stack holds strings of the form
• s0 X1s1 X2s2 … Xmsm (sm is on top)
• where si are parser states and Xi are grammar
symbols.
• (Note: the Xi and si always come in pairs, with
the state component si on top.)
• A parser configuration is a pair
• ⟨stack contents, unexpended input⟩
LR Parsing: Roadmap
• LR parsing algorithm:
• parse table structure
• parsing actions
• Parse table construction:
• viable prefix automaton
• parse table construction from this
automaton
• improving parsing power: different LR
parsing algorithms
LR Parse Tables
• The parse table has two parts: the action
function and the goto function.
• At each point, the parser’s next move is
given by action[sm, ai], where:
• sm is the state on top of the parser stack,
and
• ai the next input token.
• The goto function is used only during
reduce moves.
LR Parser Actions: shift
• Suppose:
• the parser configuration is s0 X1s1 … Xmsm,
ai … an, and
• action[sm, ai] = ‘shift sn’.
• Effects of shift move:
• push the next input symbol ai; and
• push the state sn
• New configuration: s0 X1s1 … Xmsm ai sn, ai+1 … an
LR Parser Actions: reduce
• Suppose:
• the parser configuration is s0 X1s1 … Xmsm, ai …
an, and
• action[sm, ai] = ‘reduce A → β’.
• Effects of reduce move:
• pop n states and n grammar symbols off the
stack (2n symbols total), where n = |β|.
• suppose the (newly uncovered) state on top of
the stack is t, and goto[t, A] = u.
• push A, then u.
• New configuration: s0 X1s1 … Xm−n sm−n A u, ai … an
LR Parsing Algorithm
1. set ip to the start of the input string w$.
2. while TRUE do:
1. let s = state on top of parser stack, a = input
symbol pointed at by ip.
2. if action[s,a] == ‘shift t’ then: (i) push the input
symbol a on the stack, then the state t; (ii)
advance ip.
3. if action[s,a] == ‘reduce A → β’ then: (i) pop
2*|β| symbols off the stack; (ii) suppose t is the
state that now gets uncovered on the stack; (iii)
push the LHS grammar symbol A and the state
u = goto[t, A].
4. if action[s,a] == ‘accept’ then accept;
5. else signal a syntax error.
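The algorithm above can be sketched as a table-driven loop. The action/goto tables below were built by hand for the grammar S → 0 S 1 | ε (the table construction itself appears later); the state numbering and encoding are my own:

```python
# Grammar: (1) S -> 0 S 1   (2) S -> eps, augmented with S' -> S.
# Hand-built SLR(1) tables. Entries: ("s", t) = shift to state t,
# ("r", lhs, n) = reduce by a production with |rhs| = n, ("acc",) = accept.
action = {
    (0, "0"): ("s", 2), (0, "1"): ("r", "S", 0), (0, "$"): ("r", "S", 0),
    (1, "$"): ("acc",),
    (2, "0"): ("s", 2), (2, "1"): ("r", "S", 0), (2, "$"): ("r", "S", 0),
    (3, "1"): ("s", 4),
    (4, "1"): ("r", "S", 3), (4, "$"): ("r", "S", 3),
}
goto = {(0, "S"): 1, (2, "S"): 3}

def lr_parse(w):
    """LR driver loop: the stack holds states; symbols are implicit."""
    inp = list(w) + ["$"]
    stack = [0]                       # start in state s0
    while True:
        s, a = stack[-1], inp[0]
        act = action.get((s, a))
        if act is None:
            return False              # syntax error
        if act[0] == "s":             # shift: push state, advance input
            stack.append(act[1])
            inp.pop(0)
        elif act[0] == "r":           # reduce: pop |rhs| states, then goto
            _, lhs, n = act
            del stack[len(stack) - n:]
            stack.append(goto[(stack[-1], lhs)])
        else:
            return True               # accept

print(lr_parse("0011"))  # → True
print(lr_parse("001"))   # → False
```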
LR parsing: Viable Prefixes
• Goal: to be able to identify handles, and so
produce a rightmost derivation in reverse.
• Given a configuration s0 X1s1 … Xmsm, ai … an:
• X1 X2 … Xm ai … an is obtainable from S by a
rightmost derivation.
• X1 X2 … Xm is called a viable prefix.
• The set of viable prefixes of a grammar is
recognizable using a finite automaton.
• This automaton is used to recognize handles.
Viable Prefix Automata
• An LR(0) item of a grammar G is a
production of G with a dot “•” somewhere in
the RHS.
• Example: The rule A → a A b gives these
LR(0) items:
• A → • a A b
• A → a • A b
• A → a A • b
• A → a A b •
• Intuition: ‘A → α • β’ denotes that:
• we’ve seen something derivable from α;
and
• it would be legal to see something
derivable from β.
Overall Approach
Given a grammar G with start symbol S:
• Construct the augmented grammar by
adding a new start symbol S′ and a new
production S′ → S.
• Construct a finite state automaton whose
start state is labeled by the LR(0) item
S′ → • S.
• Use this automaton to construct the
parsing table.
Viable Prefix NFA for LR(0) items
• Each state is labeled by an LR(0) item. The initial
state is labeled S′ → • S.
• Transitions:
1. On a symbol X: A → α • X β goes to
A → α X • β, where X is a terminal
or nonterminal.
2. On ε: A → α • X β goes to X → • γ,
where X is a nonterminal, and X → γ is a
production.
Viable Prefix NFA:
Example
Grammar:
S → 0 S 1
S → ε
(figure: the NFA of LR(0) items for this grammar)
Viable Prefix NFA → DFA
• Given a set of LR(0) items I, the set closure(I) is
constructed as follows:
• repeat
• add every item in I to closure(I);
• if A → α • B β ∈ closure(I) and B is a
nonterminal, then for each production B → γ,
add the item B → • γ to closure(I).
• until no new items can be added to
closure(I).
• Intuition:
• A → α • B β ∈ closure(I) means something
derivable from B is legal at this point, so
anything derivable from a production of B
is legal as well.
Viable Prefix NFA → DFA (cont’d)
• Given a set of LR(0) items I, the set goto(I,X) is
defined as
• goto(I, X) = closure({ A → α X • β | A → α • X β
∈ I })
• Intuition:
• if A → α • X β ∈ I then (a) we’ve seen something
derivable from α; and (b) something derivable
from X would be legal at this point.
• Suppose we now see something derivable from
X.
• The parser should “go to” a state where (a)
we’ve seen something derivable from αX; and (b)
something derivable from β would be legal.
Example
Let I0 = { S′ → • S }.
I1 = closure(I0) = { S′ → • S, /* from I0 */
S → • 0 S 1, S → • }
goto(I1, 0) = closure( { S → 0 • S 1 } )
= { S → 0 • S 1, S → • 0 S 1, S → • }
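closure and goto can be computed directly over item sets. A sketch reproducing the example above; items are encoded as (lhs, rhs, dot) tuples, a representation of my own choosing:

```python
# Grammar S -> 0 S 1 | eps, augmented with S' -> S.
# An item (lhs, rhs, dot) puts the dot before position `dot` of rhs.
grammar = {"S'": [("S",)], "S": [("0", "S", "1"), ()]}
nonterminals = set(grammar)

def closure(items):
    """Add B -> .gamma for every item A -> alpha . B beta in the set."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in nonterminals:
                B = rhs[dot]
                for gamma in grammar[B]:
                    item = (B, gamma, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(items, X):
    """Advance the dot over X in every item where that is possible."""
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved)

I1 = closure({("S'", ("S",), 0)})
print(sorted(I1))           # I1 = { S' -> .S,  S -> .0S1,  S -> . }
print(sorted(goto(I1, "0")))  # { S -> 0.S1,  S -> .0S1,  S -> . }
```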
Viable Prefix DFA for LR(0) Items
1. Given a grammar G with start symbol S, construct
the augmented grammar with new start symbol S′
and new production S′ → S.
2. C = { closure({ S′ → • S }) }; // C = a set of sets of
items = set of parser states
3. repeat {
for each set of items I C {
for each grammar symbol X {
if ( goto(I,X) ≠ ∅ && goto(I,X) ∉ C ) {
// new state
add goto(I,X) to C;
}
}
}
} until no change to C;
SLR(1) Parse Table Construction I
Given a grammar G with start symbol S:
• Construct the augmented grammar G′
with start symbol S′.
• Construct the set of states {I0, I1, …, In}
for the Viable Prefix DFA for the
augmented grammar G′.
• Each DFA state Ii corresponds to a parser
state si.
• The initial parser state s0 corresponds to
the DFA state I0 obtained from the item
S′ → • S.
SLR(1) Parse Table Construction II
• Parsing action for parser state si:
• action table entries:
• if DFA state Ii contains an item A → α • a β,
where a is a terminal, and goto(Ii, a) = Ij : set
action[i, a] = shift j.
• if DFA state Ii contains an item A → α •, where
A ≠ S′: for each b ∈ FOLLOW(A), set action[i,
b] = reduce A → α.
• if state Ii contains the item S′ → S • : set
action[i, $] = accept.
• goto table entries:
• for each nonterminal A, if goto(Ii, A) = Ij, then
goto[i, A] = j.
• any entry not defined by these steps is an
error state.
SLR(1) Shortcomings
• SLR(1) parsing uses reduce actions too
liberally. Because of this it fails on many
reasonable grammars.
• Example (simple pointer assignments):
S → R | L = R
L → * R | id
R → L
The SLR parse table has a state { S → L • = R,
R → L • }, and FOLLOW(L) = { =, $ }:
a shift-reduce conflict on “=”.
Improving LR Parsing
• SLR(1) parsing weaknesses can be
addressed by incorporating lookahead into
the LR items in parser states.
• The lookahead makes it possible to
remove some “spurious” reduce actions
in the parse table.
• The LALR(1) parsers produced by bison
and yacc incorporate such lookahead
items.
• This improves parsing power, but at the
cost of larger parse tables.
Error Handling
Possible reactions to lexical and syntax errors:
• ignore the error. Unacceptable!
• crash, or quit, on first error. Unacceptable!
• continue to process the input. No code
generation.
• attempt to repair the error: transform an
erroneous program into a similar but legal
input.
• attempt to correct the error: try to guess
what the programmer meant. Not
worthwhile.
Error Reporting
• Error messages should refer to the source
program.
• prefer “line 11: X redefined” to “conflict
in hash bucket 53”
• Error messages should, as far as possible,
indicate the location and nature of the error.
• avoid “syntax error” or “illegal
character”
• Error messages should be specific.
• prefer “x not declared in function foo”
to “missing declaration”
• They should not be redundant.
Error Recovery
• Lexical errors: pass the illegal character to
the parser and let it deal with the error.
• Syntax errors: “panic mode error
recovery”
• Essential idea: skip part of the input
and pretend as though we saw
something legal, then hope to be able to
continue.
• Pop the stack until we find a state s such
that goto[s,A] is defined for some
nonterminal A.
• discard input tokens until we find some
token a that can legitimately follow A;
then push goto[s, A] and resume parsing.