Introduction to Parsing
Ambiguity and Removing Ambiguity
Outline
• Regular languages revisited
• Parser overview
• Context-free grammars (CFG’s)
• Derivations
• Ambiguity
Compiler Design 1 (2011) 2
Languages and Automata
• Formal languages are very important in CS
– Especially in programming languages
• Regular languages
– The weakest formal languages widely used
– Many applications
• We will also study context-free languages
Limitations of Regular Languages
Intuition: A finite automaton that runs
long enough must repeat states
• A finite automaton cannot remember the number
of times it has visited a particular state
• This is because a finite automaton has finite memory
– Only enough to store in which state it is
– Cannot count, except up to a finite limit
• Many languages are not regular
• E.g., the language of balanced parentheses is not
regular: { (^i )^i | i ≥ 0 }
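A minimal sketch of why counting matters, assuming the input is a plain Python string: the recognizer below accepts { (^i )^i | i ≥ 0 } by keeping two unbounded counters, which is the kind of memory a finite automaton does not have. The function name is hypothetical.

```python
def is_nested_parens(s: str) -> bool:
    """Accept exactly the strings (^i )^i for i >= 0."""
    i = 0
    opens = 0
    # Count the leading run of '('.
    while i < len(s) and s[i] == "(":
        opens += 1
        i += 1
    closes = 0
    # Count the run of ')' that follows.
    while i < len(s) and s[i] == ")":
        closes += 1
        i += 1
    # The whole input must be consumed and the two counts must agree.
    return i == len(s) and opens == closes
```

The counters can grow without limit, so no fixed number of states can replace them.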
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: parse tree of the program
Example
• If-then-else statement
if (x == y) then z = 1; else z = 2;
• Parser input
IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output
          IF-THEN-ELSE
         /     |      \
       ==      =       =
      /  \    /  \    /  \
    ID   ID  ID  INT ID  INT
Comparison with Lexical Analysis
Phase     Input                     Output
Lexer     Sequence of characters    Sequence of tokens
Parser    Sequence of tokens        Parse tree
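To make the lexer row concrete, here is a small sketch that turns the if-then-else example into the token sequence the parser receives. The token names follow the slide; the regular expressions and the function name are assumptions made for illustration.

```python
import re

# Hypothetical token specification. Order matters: keywords are tried
# before the general identifier pattern, and "==" before "=".
TOKEN_SPEC = [
    ("IF", re.compile(r"if\b")),
    ("THEN", re.compile(r"then\b")),
    ("ELSE", re.compile(r"else\b")),
    ("INT", re.compile(r"\d+")),
    ("ID", re.compile(r"[A-Za-z_]\w*")),
    ("==", re.compile(r"==")),
    ("=", re.compile(r"=")),
    ("(", re.compile(r"\(")),
    (")", re.compile(r"\)")),
    (";", re.compile(r";")),
]

def tokenize(text: str) -> list[str]:
    """Map a character sequence to a sequence of token names."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                tokens.append(name)
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens
```

Calling tokenize on the example statement yields the sequence IF ( ID == ID ) THEN ID = INT ; ELSE ID = INT ;, which is the parser's input.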
The Role of the Parser
• Not all sequences of tokens are programs . . .
• . . . Parser must distinguish between valid and
invalid sequences of tokens
• We need
– A language for describing valid sequences of tokens
– A method for distinguishing valid from invalid
sequences of tokens
Context-Free Grammars
• Many programming language constructs have a
recursive structure
• A STMT is of the form
if COND then STMT else STMT ,
or while COND do STMT , or
…
• Context-free grammars are a natural notation
for this recursive structure
CFGs (Cont.)
• A CFG consists of
– A set of terminals T
– A set of non-terminals N
– A start symbol S (a non-terminal)
– A set of productions
Assuming X ∈ N, the productions are of the form
X → ε , or
X → Y1 Y2 ... Yn where each Yi ∈ N ∪ T
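One possible way to write this four-part definition down as data is sketched below. The class and field names are hypothetical; an empty right-hand side stands for ε.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    terminals: frozenset
    nonterminals: frozenset
    start: str
    productions: tuple  # pairs (X, (Y1, ..., Yn)); () stands for epsilon

    def check(self) -> bool:
        """Verify the side conditions of the definition."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(y in symbols for y in rhs)
                for lhs, rhs in self.productions
            )
        )

# Example: S -> ( S ) | epsilon
PARENS = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("(", "S", ")")), ("S", ())),
)
```

The check method only validates the shape of the grammar; it says nothing about which strings the grammar generates.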
Notational Conventions
• In these lecture notes
– Non-terminals are written upper-case
– Terminals are written lower-case
– The start symbol is the left-hand side of the
first production
Examples of CFGs
A fragment of our example language (simplified):
STMT → if COND then STMT else STMT
     | while COND do STMT
     | id = int
Examples of CFGs (cont.)
Grammar for simple arithmetic expressions:
E → E * E
  | E + E
  | (E)
  | id
The Language of a CFG
Read productions as replacement rules:
X → Y1 ... Yn
Means X can be replaced by Y1 ... Yn
X→ε
Means X can be erased (replaced with empty
string)
Key Idea
(1) Begin with a string consisting of the start
symbol “S”
(2) Replace any non-terminal X in the string
by a right-hand side of some production
X → Y1 ... Yn
(3) Repeat (2) until there are no non-terminals in
the string
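The three steps can be sketched as follows, using the grammar S → (S) | ε. A sentential form is a list of symbols; the helper name and the constants WRAP and ERASE are hypothetical.

```python
def apply_production(form, index, production):
    """Replace the non-terminal at form[index] using production (lhs, rhs)."""
    lhs, rhs = production
    if form[index] != lhs:
        raise ValueError(f"symbol at {index} is {form[index]!r}, not {lhs!r}")
    return form[:index] + list(rhs) + form[index + 1:]

WRAP = ("S", ["(", "S", ")"])   # S -> ( S )
ERASE = ("S", [])               # S -> epsilon

form = ["S"]                                 # step (1): the start symbol
form = apply_production(form, 0, WRAP)       # ( S )
form = apply_production(form, 1, WRAP)       # ( ( S ) )
form = apply_production(form, 2, ERASE)      # ( ( ) ): no non-terminals left
result = "".join(form)
```

Each call is one application of step (2); the process stops once the form contains only terminals.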
The Language of a CFG (Cont.)
More formally, we write
X1 ... Xi ... Xn → X1 ... Xi−1 Y1 ... Ym Xi+1 ... Xn
if there is a production
Xi → Y1 ... Ym
The Language of a CFG (Cont.)
Write
X1 ... Xn →* Y1 ... Ym
if
X1 ... Xn → ... → ... → Y1 ... Ym
in 0 or more steps
The Language of a CFG
Let G be a context-free grammar with start
symbol S. Then the language of G is:
{ a1 ... an | S →* a1 ... an and every ai is a terminal }
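As an illustration of this definition, the sketch below enumerates the strings of L(G) up to a length bound by searching over sentential forms. It is a brute-force search rather than a parsing algorithm, and it assumes that every derivation cycle adds at least one terminal so that the search terminates. The function name and the dictionary encoding of the grammar are hypothetical.

```python
from collections import deque

def language_up_to(productions, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start.

    productions maps each non-terminal to a list of right-hand sides (tuples).
    """
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the left-most non-terminal, if any.
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            results.add(form)          # only terminals remain: a member of L(G)
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            n_terminals = sum(1 for s in new if s not in productions)
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(("".join(f) for f in results), key=lambda w: (len(w), w))

PARENS = {"S": [("(", "S", ")"), ()]}   # S -> ( S ) | epsilon
```

For this grammar and a bound of 6, the search finds the empty string, (), (()) and ((())).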
Terminals
• Terminals are called so because there are no
rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language
Examples
L(G) is the language of the CFG G
Strings of balanced parentheses: { (^i )^i | i ≥ 0 }
Two grammars:
S → (S)            S → (S)
S → ε       OR       | ε
Example
A fragment of our example language (simplified):
STMT → if COND then STMT
     | if COND then STMT else STMT
     | while COND do STMT
     | id = int
COND → (id == id)
     | (id != id)
Example (Cont.)
Some elements of our language
id = int
if (id == id) then id = int else id = int
while (id != id) do id = int
while (id == id) do while (id != id) do id = int
if (id != id) then if (id == id) then id = int else id = int
Arithmetic Example
Simple arithmetic expressions:
E → E + E | E * E | (E) | id
Some elements of the language:
id            id + id
(id)          id * id
(id) * id     id * (id)
Notes
The idea of a CFG is a big step.
But:
• Membership in a language is just “yes” or “no”;
we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFG’s (e.g., yacc)
More Notes
• Form of the grammar is important
– Many grammars generate the same language
– Parsing tools are sensitive to the grammar
Note: Tools for regular languages (e.g., lex/ML-Lex)
are also sensitive to the form of the regular
expression, but this is rarely a problem in practice
Derivations and Parse Trees
A derivation is a sequence of productions
S → ... → ... → ...
A derivation can be drawn as a tree
– Start symbol is the tree’s root
– For a production X → Y1 ... Yn add children Y1 ... Yn
to node X
Derivation Example
• Grammar
E → E+E | E *E | (E) | id
• String
id * id + id
Derivation Example (Cont.)
Derivation:
E
→ E + E
→ E * E + E
→ id * E + E
→ id * id + E
→ id * id + id
Parse tree:
        E
      / | \
     E  +  E
   / | \   |
  E  *  E  id
  |     |
  id    id
Notes on Derivations
• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes
• An in-order traversal of the leaves is the
original input
• The parse tree shows the association of
operations, the input string does not
Leftmost and Rightmost Derivations
• The example is a left-most derivation
– At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
– At each step, replace the right-most non-terminal
E
→ E + E
→ E + id
→ E * E + id
→ E * id + id
→ id * id + id
Derivations and Parse Trees
• Note that right-most and leftmost
derivations have the same parse tree
• The difference is just in the order in
which branches are added
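A short sketch of this point: starting from one fixed parse tree of id * id + id, the function below lists the sentential forms obtained by always expanding the left-most or the right-most remaining node. The tuple encoding of the tree and the function name are assumptions made for illustration.

```python
# A parse tree node is (symbol, [children]); a leaf is a plain terminal string.
TREE = ("E", [
    ("E", [("E", ["id"]), "*", ("E", ["id"])]),
    "+",
    ("E", ["id"]),
])

def derivation(tree, leftmost=True):
    """List the sentential forms produced by expanding tree nodes in order."""
    form = [tree]  # items are unexpanded nodes or terminal strings
    steps = []
    while True:
        steps.append(" ".join(x if isinstance(x, str) else x[0] for x in form))
        positions = [i for i, x in enumerate(form) if not isinstance(x, str)]
        if not positions:
            return steps
        i = positions[0] if leftmost else positions[-1]
        form[i:i + 1] = form[i][1]   # replace the node by its children

left = derivation(TREE, leftmost=True)
right = derivation(TREE, leftmost=False)
```

Both lists start at E and end at id * id + id; they differ only in the intermediate forms, that is, in the order in which the branches of the same tree are added.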
Summary of Derivations
• We are not just interested in whether
s ∈ L(G)
– We need a parse tree for s
• A derivation defines a parse tree
– But one parse tree may have many derivations
• Left-most and right-most derivations are
important in parser implementation
Ambiguity
• What is an ambiguous grammar?
• A CFG is ambiguous if there exists more than one
derivation tree (parse tree) for some input string.
• Equivalently, some string has more than one left-most
derivation, or more than one right-most derivation. Having one
left-most and one right-most derivation for a string is normal
and does not by itself indicate ambiguity.
• This creates uncertainty about how to parse certain
strings, leading to multiple interpretations.
• Grammar
E → E + E | E * E |( E ) | int
• String
int * int + int
Ambiguity (Cont.)
This string has two parse trees
        E                      E
      / | \                  / | \
     E  +  E                E  *  E
   / | \   |                |   / | \
  E  *  E int              int E  +  E
  |     |                      |     |
 int   int                    int   int
Ambiguity (Cont.)
• A grammar is ambiguous if it has more
than one parse tree for some string
– Equivalently, there is more than one right-most or
left-most derivation for some string
• Ambiguity is bad
– Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
– Arithmetic expressions
– IF-THEN-ELSE
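One way to observe ambiguity on a concrete string is to count its parse trees. The sketch below does this by brute force over token spans. It assumes a grammar without ε-productions and without cycles of unit productions, and all names in it are hypothetical.

```python
from functools import lru_cache

def count_parse_trees(grammar, start, tokens):
    """Count the parse trees that derive tokens from start.

    grammar maps non-terminals to lists of right-hand sides (tuples).
    Assumption: no epsilon productions and no cycles of unit productions,
    so every symbol covers at least one token and the recursion terminates.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def sym(symbol, i, j):
        # Number of trees rooted at symbol that yield tokens[i:j].
        if symbol not in grammar:                       # terminal
            return int(j == i + 1 and tokens[i] == symbol)
        return sum(seq(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        # Number of ways the symbol sequence rhs yields tokens[i:j].
        if len(rhs) == 1:
            return sym(rhs[0], i, j)
        # The first symbol takes tokens[i:k]; each later one needs >= 1 token.
        return sum(
            sym(rhs[0], i, k) * seq(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return sym(start, 0, len(tokens))

AMBIGUOUS = {"E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("int",)]}
n_trees = count_parse_trees(AMBIGUOUS, "E", "int * int + int".split())  # 2
```

A count above 1 for any string shows that the grammar is ambiguous. A count of 1 on a few test strings does not prove that it is unambiguous.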
Grammar: S → aSbS | bSaS | ε
The string abab has two parse trees:
          S                           S
      / |   | \                   / |   | \
     a  S   b  S                 a  S   b  S
      / | | \  |                    |  / | | \
     b  S a  S ε                    ε a  S b  S
        |    |                           |    |
        ε    ε                           ε    ε
Grammar:
E -> E + E
E -> E * E
E -> id
Input string: id + id * id
The leftmost derivation can be done in two ways:
1. E -> E + E              1. E -> E * E
2.   -> id + E             2.   -> E + E * E
3.   -> id + E * E         3.   -> id + E * E
4.   -> id + id * E        4.   -> id + id * E
5.   -> id + id * id       5.   -> id + id * id
For the given input string we get two leftmost derivations, and therefore
two parse trees. We need to eliminate the ambiguity in the grammar.
Dealing with Ambiguity
There are several ways to handle ambiguity
Modifying Grammar Rules:
Change the production rules to ensure a unique parse tree for
each valid string.
E → T + E | T
T → int * T | int | ( E )
This rewritten grammar enforces precedence of * over +
Operator Precedence and Associativity:
Define the precedence and associativity of operators explicitly.
Modifying Grammar
E → E + E | E * E | (E) | id
This grammar is ambiguous because the expression id + id * id can have
multiple parse trees, leading to different interpretations:
(id + id) * id versus id + (id * id). A string such as id + id + id is
also ambiguous (left-associative vs. right-associative grouping).
E → E + T | T
T → T * F | F
F → (E) | id
In this grammar:
• + has lower precedence than *.
• + is left-associative.
• * is left-associative.
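The effect of the layered grammar can be seen in a small recursive-descent sketch. Because recursive descent cannot follow a left-recursive rule directly, each left-recursive production is implemented here as a loop that folds operands to the left. That is an implementation choice of this sketch, not something the slides prescribe. Function names are hypothetical and the input is assumed to be a list of tokens.

```python
def parse_expr(tokens):
    """Parse with E -> E + T | T, T -> T * F | F, F -> ( E ) | id.

    Returns a nested tuple (op, left, right) or the string 'id'.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}, got {peek()!r}")
        pos += 1

    def factor():                       # F -> ( E ) | id
        if peek() == "(":
            expect("(")
            node = expr()
            expect(")")
            return node
        expect("id")
        return "id"

    def term():                         # T -> T * F | F
        node = factor()
        while peek() == "*":
            expect("*")
            node = ("*", node, factor())
        return node

    def expr():                         # E -> E + T | T
        node = term()
        while peek() == "+":
            expect("+")
            node = ("+", node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r} at position {pos}")
    return tree
```

For id + id * id the multiplication ends up below the addition, and id + id + id groups to the left, matching the three properties listed above.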
Ambiguity: The Dangling Else
• Consider the following grammar
S → if C then S
  | if C then S else S
  | OTHER
• This grammar is also ambiguous
The Dangling Else: Example
• The expression
if C1 then if C2 then S3 else S4
has two parse trees
      if                         if
    / |  \                      /  \
  C1  if  S4                  C1    if
     /  \                         / |  \
   C2    S3                     C2  S3  S4
• Typically we want the second form
The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar
S → MIF /* all then are matched */
  | UIF /* some then are unmatched */
MIF → if C then MIF else MIF
  | OTHER
UIF → if C then S
  | if C then MIF else UIF
• Describes the same set of strings
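The rule "else matches the closest unmatched then" can also be sketched as a hand-written parser that consumes an else greedily right after parsing a then-branch. This is an illustration of the rule, not an implementation of the MIF/UIF grammar itself. It assumes that conditions are single tokens such as C1 and that any other token is an OTHER statement; the function name is hypothetical.

```python
def parse_stmt(tokens):
    """Parse S -> if C then S [else S] | OTHER into nested tuples."""
    pos = 0

    def take(expected=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def stmt():
        tok = take()
        if tok in ("then", "else"):
            raise SyntaxError(f"unexpected {tok!r}")
        if tok != "if":
            return tok                          # OTHER
        cond = take()
        take("then")
        then_branch = stmt()
        # Greedy choice: an else seen here belongs to this, the innermost, if.
        if pos < len(tokens) and tokens[pos] == "else":
            take("else")
            return ("if", cond, then_branch, stmt())
        return ("if", cond, then_branch)

    tree = stmt()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

On if C1 then if C2 then S3 else S4 the result attaches S4 to the inner if, which is the form that is typically wanted.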
The Dangling Else: Example Revisited
• The expression if C1 then if C2 then S3 else S4
      if                         if
     /  \                      / |  \
   C1    if                  C1  if  S4
       / |  \                   /  \
     C2  S3  S4               C2    S3
• A valid parse tree         • Not valid because the
  (for a UIF)                  then expression is
                               not a MIF
Ambiguity
• No general techniques for handling ambiguity
• Impossible to convert automatically an
ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the
grammar
– Sometimes allows more natural definitions
– We need disambiguation mechanisms
Precedence and Associativity Declarations
• Instead of rewriting the grammar
– Use the more natural (ambiguous) grammar
– Along with disambiguating declarations
• Most tools allow precedence and associativity
declarations to disambiguate grammars
• Examples …
Associativity Declarations
• Consider the grammar E → E + E | int
• Ambiguous: two parse trees of int + int + int
        E                      E
      / | \                  / | \
     E  +  E                E  +  E
   / | \   |                |   / | \
  E  +  E int              int E  +  E
  |     |                      |     |
 int   int                    int   int
• Left associativity declaration: %left +
Precedence Declarations
• Consider the grammar E → E + E | E * E | int
– And the string int + int * int
        E                      E
      / | \                  / | \
     E  *  E                E  +  E
   / | \   |                |   / | \
  E  +  E int              int E  *  E
  |     |                      |     |
 int   int                    int   int
• Precedence declarations:
%left +
%left *
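One way to honor such declarations in a hand-written parser is precedence climbing, sketched below. Tools such as yacc use the declarations differently, to resolve conflicts in their generated tables, so this is an analogy rather than a description of what they do. The table OPERATORS is a hypothetical encoding of "%left +" followed by "%left *", where later declarations bind tighter.

```python
# (precedence, associativity) per operator; higher precedence binds tighter.
OPERATORS = {"+": (1, "left"), "*": (2, "left")}

def parse(tokens, operators=OPERATORS):
    """Precedence-climbing parser for E -> E op E | int."""
    pos = 0

    def atom():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "int":
            raise SyntaxError(f"expected 'int' at position {pos}")
        pos += 1
        return "int"

    def expr(min_prec):
        nonlocal pos
        left = atom()
        while pos < len(tokens) and tokens[pos] in operators:
            op = tokens[pos]
            prec, assoc = operators[op]
            if prec < min_prec:
                break
            pos += 1
            # A left-associative operator forbids its own level on the right.
            right = expr(prec + 1 if assoc == "left" else prec)
            left = (op, left, right)
        return left

    tree = expr(1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return tree
```

With this table, int + int * int parses with + at the root, and int + int + int groups to the left. Changing an entry to "right" flips the grouping for that operator.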
Grammar
1. X -> X - X
2. X -> var | const
Here var can be any variable, and const can be any constant value. The
string a - b - c has two leftmost derivations (the three occurrences of
var stand for a, b and c):
1. X -> X - X                 1. X -> X - X
2.   -> X - X - X             2.   -> var - X
3.   -> var - X - X           3.   -> var - X - X
4.   -> var - var - X         4.   -> var - var - X
5.   -> var - var - var       5.   -> var - var - var
In the first derivation the left X of step 1 is expanded to X - X, which
groups the string as (a - b) - c. In the second derivation the right X is
expanded, which groups it as a - (b - c).
For example, if we take the values a = 2, b = 3 and c = 4:
a - b - c = 2 - 3 - 4 = -5
In the first derivation tree, the expression is evaluated as:
(a - b) - c = (2 - 3) - 4 = -1 - 4 = -5
In the second derivation tree: a - (b - c) = 2 - (3 - 4) = 2 - (-1) = 3
Observe that the two parse trees do not give the same value, so they
have different meanings. In this example the first derivation tree,
(a - b) - c, is the intended parse tree. When the same operator occurs
twice in an expression, the expression must be evaluated according to
the associativity of that operator, and subtraction is left-associative.
Grammar:
E -> E + E
E -> E * E
E -> id
Input string: id + id * id
The leftmost derivation can be done in two ways:
1. E -> E + E              1. E -> E * E
2.   -> id + E             2.   -> E + E * E
3.   -> id + E * E         3.   -> id + E * E
4.   -> id + id * E        4.   -> id + id * E
5.   -> id + id * id       5.   -> id + id * id
If id = 2: id + id * id = 2 + 2 * 2 = 6
First derivation: id + (id * id) = 2 + (2 * 2) = 2 + 4 = 6
Second derivation: (id + id) * id = (2 + 2) * 2 = 4 * 2 = 8
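The arithmetic in the two worked examples can be checked by evaluating each parse tree directly. The sketch below encodes a tree as (op, left, right) with variable names at the leaves. The encoding and the names are assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, env):
    """Evaluate a parse tree given as (op, left, right) or a leaf name."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))

# a - b - c with a = 2, b = 3, c = 4
env = {"a": 2, "b": 3, "c": 4}
left_grouping = evaluate(("-", ("-", "a", "b"), "c"), env)     # (a - b) - c
right_grouping = evaluate(("-", "a", ("-", "b", "c")), env)    # a - (b - c)

# id + id * id with id = 2
ids = {"id": 2}
plus_at_root = evaluate(("+", "id", ("*", "id", "id")), ids)   # id + (id * id)
times_at_root = evaluate(("*", ("+", "id", "id"), "id"), ids)  # (id + id) * id
```

The four results are -5, 3, 6 and 8, matching the values computed by hand above. The same string yields different values depending on which parse tree is chosen.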