
Lecture 04

The document provides an introduction to parsing, focusing on the role of parsers in distinguishing valid sequences of tokens in programming languages. It discusses context-free grammars (CFGs), their structure, and the concept of ambiguity in grammars, including examples of ambiguous expressions and methods to resolve such ambiguities. Additionally, it highlights the importance of operator precedence and associativity in defining unambiguous grammars.


Introduction to Parsing

Ambiguity and Removing Ambiguity


Outline

• Regular languages revisited

• Parser overview

• Context-free grammars (CFG’s)

• Derivations

• Ambiguity

Compiler Design 1 (2011) 2


Languages and Automata

• Formal languages are very important in CS


– Especially in programming languages

• Regular languages
– The weakest formal languages widely used
– Many applications

• We will also study context-free languages



Limitations of Regular Languages

Intuition: A finite automaton that runs long enough must repeat states
• A finite automaton cannot remember the number
of times it has visited a particular state
• because a finite automaton has finite memory
– Only enough to store in which state it is
– Cannot count, except up to a finite limit
• Many languages are not regular
• E.g., the language of balanced parentheses is not
regular: { (^i )^i | i ≥ 0 }
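The counting argument can be made concrete. The following is a minimal sketch, assuming a hypothetical helper named `is_paren_power`, that recognises { (^i )^i | i ≥ 0 } with an integer counter that has no fixed upper bound. That unbounded counter is the resource a finite automaton does not have.

```python
def is_paren_power(s: str) -> bool:
    """Return True iff s is i '(' characters followed by i ')' characters, i >= 0."""
    pos = 0
    opens = 0
    # Count the leading '(' characters; this counter has no fixed upper limit.
    while pos < len(s) and s[pos] == "(":
        opens += 1
        pos += 1
    closes = 0
    # Count the ')' characters that follow.
    while pos < len(s) and s[pos] == ")":
        closes += 1
        pos += 1
    # Accept only if the whole string was consumed and the two counts agree.
    return pos == len(s) and opens == closes
```

For example, `is_paren_power("((()))")` holds, while `is_paren_power("(()")` does not.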


The Functionality of the Parser

• Input: sequence of tokens from lexer

• Output: parse tree of the program



Example

• If-then-else statement
if (x == y) then z = 1; else z = 2;
• Parser input
IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

           IF-THEN-ELSE
          /     |      \
        ==      =       =
       /  \    /  \    /  \
     ID   ID  ID  INT ID  INT

Comparison with Lexical Analysis

Phase     Input                     Output

Lexer     Sequence of characters    Sequence of tokens

Parser    Sequence of tokens        Parse tree
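To illustrate the lexer row of the table, here is a small sketch of a lexer for the if-then-else example. The token names IF, THEN, ELSE, ID and INT come from the parser input shown earlier; the remaining names (`EQ`, `ASSIGN`, `LPAREN`, `RPAREN`, `SEMI`), the regular expressions and the function `lex` are assumptions made for this illustration.

```python
import re

# Token classes, tried in this order. Keywords come before ID so that
# "if" is not lexed as an identifier.
TOKEN_SPEC = [
    ("IF", r"if\b"),
    ("THEN", r"then\b"),
    ("ELSE", r"else\b"),
    ("INT", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("EQ", r"=="),
    ("ASSIGN", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMI", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def lex(text):
    """Turn a sequence of characters into a sequence of token names."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":       # whitespace produces no token
            tokens.append(m.lastgroup)
        pos = m.end()
    return tokens
```

Calling `lex("if (x == y) then z = 1; else z = 2;")` yields the token sequence that the parser would then consume.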


The Role of the Parser

• Not all sequences of tokens are programs . . .


• . . . Parser must distinguish between valid and
invalid sequences of tokens

• We need
– A language for describing valid sequences of tokens
– A method for distinguishing valid from invalid
sequences of tokens



Context-Free Grammars

• Many programming language constructs have a
recursive structure

• A STMT is of the form
if COND then STMT else STMT, or
while COND do STMT, or
...

• Context-free grammars are a natural notation
for this recursive structure


CFGs (Cont.)

• A CFG consists of
– A set of terminals T
– A set of non-terminals N
– A start symbol S (a non-terminal)
– A set of productions

Assuming X ∈ N, the productions are of the form
X → ε, or
X → Y1 Y2 ... Yn where each Yi ∈ N ∪ T
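The four components of a CFG map directly onto a small data structure. The sketch below is one possible encoding, not a standard one: the class name `CFG`, its field names and the `check` method are assumptions for illustration. An empty right-hand side stands for ε.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    terminals: frozenset
    nonterminals: frozenset
    start: str
    productions: tuple  # pairs (X, (Y1, ..., Yn)); an empty tuple encodes X → ε

    def check(self) -> bool:
        """Verify the conditions from the definition of a CFG."""
        if self.start not in self.nonterminals:
            return False                      # S must be a non-terminal
        if self.terminals & self.nonterminals:
            return False                      # T and N must be disjoint
        symbols = self.terminals | self.nonterminals
        return all(
            lhs in self.nonterminals and all(y in symbols for y in rhs)
            for lhs, rhs in self.productions
        )

# Example grammar: S → ( S ) | ε
PARENS = CFG(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("(", "S", ")")), ("S", ())),
)
```

`PARENS.check()` returns True; a grammar whose production mentions an undeclared symbol fails the check.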


Notational Conventions

• In these lecture notes


– Non-terminals are written upper-case
– Terminals are written lower-case
– The start symbol is the left-hand side of the
first production



Examples of CFGs

A fragment of our example language (simplified):

STMT → if COND then STMT else STMT
     | while COND do STMT
     | id = int


Examples of CFGs (cont.)

Grammar for simple arithmetic expressions:

E → E * E
  | E + E
  | ( E )
  | id


The Language of a CFG

Read productions as replacement rules:

X → Y1 ... Yn
Means X can be replaced by Y1 ... Yn
X → ε
Means X can be erased (replaced with the empty string)


Key Idea

(1) Begin with a string consisting of the start symbol “S”
(2) Replace any non-terminal X in the string
by the right-hand side of some production
X → Y1 ... Yn
(3) Repeat (2) until there are no non-terminals in
the string


The Language of a CFG (Cont.)

More formally, we write

X1 ... Xi ... Xn → X1 ... Xi−1 Y1 ... Ym Xi+1 ... Xn

if there is a production

Xi → Y1 ... Ym


The Language of a CFG (Cont.)

Write
X1 ... Xn →* Y1 ... Ym
if
X1 ... Xn → ... → ... → Y1 ... Ym
in 0 or more steps


The Language of a CFG

Let G be a context-free grammar with start
symbol S. Then the language of G is:

{ a1 ... an | S →* a1 ... an and every ai is a terminal }
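The definition of L(G) suggests a brute-force way to list short strings of the language: explore every string reachable from S and keep those made only of terminals. The sketch below does this breadth-first. The names `GRAMMAR` and `language_up_to` are assumptions, and the pruning rule assumes that repeated rewriting cannot grow a string indefinitely without adding terminals, which holds for the example grammar but not for every CFG.

```python
from collections import deque

GRAMMAR = {"S": [["(", "S", ")"], []]}  # S → ( S ) | ε

def language_up_to(grammar, start, max_len):
    """Collect every terminal string of length <= max_len derivable from start."""
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        form = queue.popleft()
        nts = [i for i, sym in enumerate(form) if sym in grammar]
        if not nts:
            words.add("".join(form))      # only terminals left: a word of L(G)
            continue
        i = nts[0]  # expanding the left-most non-terminal still reaches every word
        for rhs in grammar[form[i]]:
            new = form[:i] + tuple(rhs) + form[i + 1:]
            n_terminals = sum(sym not in grammar for sym in new)
            # Terminals are permanent, so a form with too many can be dropped.
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words
```

For the parentheses grammar, `language_up_to(GRAMMAR, "S", 4)` gives the empty string, `()` and `(())`.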


Terminals

• Terminals are called so because there are no
rules for replacing them

• Once generated, terminals are permanent

• Terminals ought to be tokens of the language


Examples

L(G) is the language of the CFG G

Strings of balanced parentheses: { (^i )^i | i ≥ 0 }

Two grammars:

S → ( S )        OR        S → ( S )
S → ε                        | ε


Example

A fragment of our example language (simplified):

STMT → if COND then STMT
     | if COND then STMT else STMT
     | while COND do STMT
     | id = int
COND → (id == id)
     | (id != id)


Example (Cont.)

Some elements of our language

id = int
if (id == id) then id = int else id = int
while (id != id) do id = int
while (id == id) do while (id != id) do id = int
if (id != id) then if (id == id) then id = int else id = int
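Membership of such strings can be checked mechanically. The following is a rough sketch of a recursive-descent recognizer for the STMT/COND fragment; recursive descent is a technique chosen here for illustration and is not introduced in these slides. The function names `tokenize` and `accepts` are assumptions, the tokenizer relies on whitespace around `=`, `==` and `!=` as in the examples, and an `else` is taken whenever one is available. The recognizer only answers yes or no and builds no parse tree.

```python
def tokenize(text):
    # Pad parentheses so that a plain split() separates them from identifiers.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def accepts(text):
    toks = tokenize(text)

    def peek(i):
        return toks[i] if i < len(toks) else None

    def expect(i, *words):
        # Consume the given terminals in order, or fail.
        for w in words:
            if peek(i) != w:
                raise SyntaxError(f"expected {w!r} at token {i}")
            i += 1
        return i

    def cond(i):                       # COND → (id == id) | (id != id)
        i = expect(i, "(", "id")
        if peek(i) not in ("==", "!="):
            raise SyntaxError(f"expected comparison at token {i}")
        return expect(i + 1, "id", ")")

    def stmt(i):                       # returns the index just past one STMT
        if peek(i) == "if":
            i = stmt(expect(cond(i + 1), "then"))
            # Optional else branch, taken whenever an else follows.
            return stmt(i + 1) if peek(i) == "else" else i
        if peek(i) == "while":
            return stmt(expect(cond(i + 1), "do"))
        return expect(i, "id", "=", "int")

    try:
        return stmt(0) == len(toks)    # the whole input must be one STMT
    except SyntaxError:
        return False
```

All five strings listed above are accepted, while a string such as `if (id == id) id = int` is rejected because `then` is missing.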



Arithmetic Example

Simple arithmetic expressions:

E → E + E | E * E | ( E ) | id
Some elements of the language:

id              id + id
(id)            id * id
(id) * id       id * (id)
Notes

The idea of a CFG is a big step.
But:

• Membership in a language is just “yes” or “no”;
we also need the parse tree of the input

• Must handle errors gracefully

• Need an implementation of CFG’s (e.g., yacc)


More Notes

• Form of the grammar is important
– Many grammars generate the same language
– Parsing tools are sensitive to the grammar

Note: Tools for regular languages (e.g., lex/ML-Lex)
are also sensitive to the form of the regular
expression, but this is rarely a problem in practice


Derivations and Parse Trees

A derivation is a sequence of productions

S → ... → ... → ...

A derivation can be drawn as a tree
– Start symbol is the tree’s root
– For a production X → Y1 ... Yn add children Y1 ... Yn to node X


Derivation Example

• Grammar

E → E+E | E *E | (E) | id
• String

id * id + id



Derivation Example (Cont.)

E
→ E + E
→ E * E + E
→ id * E + E
→ id * id + E
→ id * id + id

Parse tree:

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id

Notes on Derivations

• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes

• An in-order traversal of the leaves is the
original input

• The parse tree shows the association of
operations; the input string does not


Leftmost and Rightmost Derivations

• The example is a left-most derivation
– At each step, replace the left-most non-terminal

• There is an equivalent notion of a right-most derivation

E
→ E + E
→ E + id
→ E * E + id
→ E * id + id
→ id * id + id

Derivations and Parse Trees

• Note that right-most and left-most
derivations have the same parse tree

• The difference is just in the order in
which branches are added
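The claim that both derivations share one parse tree can be checked by reading the two derivations off the same tree. In the sketch below the tree encoding (a node is a pair of a non-terminal and its list of children, a leaf is a terminal string) and the function `derivation` are assumptions for illustration.

```python
# A parse tree node is (non_terminal, [children]); a leaf is a terminal string.
ID = ("E", ["id"])
TREE = ("E", [("E", [ID, "*", ID]), "+", ID])   # the tree for id * id + id

def derivation(tree, leftmost=True):
    """List the strings obtained by expanding one tree node at a time."""
    frontier = [tree]
    forms = []
    while True:
        forms.append(" ".join(n if isinstance(n, str) else n[0] for n in frontier))
        pending = [i for i, n in enumerate(frontier) if not isinstance(n, str)]
        if not pending:
            return forms                     # only terminals remain
        i = pending[0] if leftmost else pending[-1]
        frontier[i:i + 1] = frontier[i][1]   # replace the node by its children
```

`derivation(TREE, leftmost=True)` reproduces the left-most derivation and `derivation(TREE, leftmost=False)` the right-most one; only the order in which nodes are expanded differs.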


Summary of Derivations

• We are not just interested in whether s ∈ L(G)
– We need a parse tree for s

• A derivation defines a parse tree
– But one parse tree may have many derivations

• Left-most and right-most derivations are
important in parser implementation


Ambiguity

• What is an ambiguous grammar?
• A CFG is ambiguous if there exists more than one
derivation tree for some input string.
• Equivalently, some string has more than one left-most
derivation (or more than one right-most derivation).
Having one left-most and one right-most derivation is not
ambiguity: every derivable string has both, and they
share the same parse tree.
• Ambiguity creates uncertainty about how to parse certain
strings, leading to multiple interpretations.

• Grammar
E → E + E | E * E | ( E ) | int

• String
int * int + int
Ambiguity (Cont.)

This string has two parse trees

First tree:

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  int
    |     |
   int   int

Second tree:

        E
      / | \
     E  *  E
     |   / | \
    int E  +  E
        |     |
       int   int


Ambiguity (Cont.)

• A grammar is ambiguous if it has more
than one parse tree for some string
– Equivalently, there is more than one right-most or
left-most derivation for some string
• Ambiguity is bad
– Leaves meaning of some programs ill-defined
• Ambiguity is common in programming languages
– Arithmetic expressions
– IF-THEN-ELSE
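Ambiguity of a specific string can be tested by counting its parse trees. The sketch below does so for the grammar E → E + E | E * E | ( E ) | int only; the function name `count_parse_trees` and the approach of counting over token ranges with memoisation are choices made for this illustration. A count above 1 for any string shows that the grammar is ambiguous.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Number of parse trees for tokens under E → E + E | E * E | ( E ) | int."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways E can derive toks[i:j].
        if j - i == 1:
            return 1 if toks[i] == "int" else 0          # E → int
        total = 0
        if j - i >= 3 and toks[i] == "(" and toks[j - 1] == ")":
            total += count(i + 1, j - 1)                 # E → ( E )
        for k in range(i + 1, j - 1):
            if toks[k] in ("+", "*"):                    # E → E op E, split at k
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))
```

For `int * int + int` the count is 2, matching the two trees; for a single `int` it is 1.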


S -> aSbS | bSaS | ε
S S
/\ /\
a S b S
/\ /\
b S a S
/\ /\
a S b S
/\ /\
b S a S
| |
(empty) (empty)
Grammar:
E -> E + E
E -> E * E
E -> id

Input string: id + id * id

The leftmost derivation can be done in two ways:

First derivation      Second derivation
1. E -> E + E         1. E -> E * E
2. id + E             2. E + E * E
3. id + E * E         3. id + E * E
4. id + id * E        4. id + id * E
5. id + id * id       5. id + id * id

For the given input string, we get two leftmost derivations and
therefore two parse trees. We need to eliminate the ambiguity in the grammar.
Dealing with Ambiguity

There are several ways to handle ambiguity

Modifying Grammar Rules:
Change the production rules to ensure a unique parse tree for
each valid string.
E → T + E | T
T → int * T | int | ( E )
This grammar enforces precedence of * over +

Operator Precedence and Associativity:
Define the precedence and associativity of operators explicitly.



Modifying Grammar

E → E + E | E * E | (E) | id
This grammar is ambiguous because the expression id + id * id can have
multiple parse trees, leading to different interpretations: it can be
grouped as (id + id) * id or as id + (id * id).

E → E + T | T
T → T * F | F
F → (E) | id

In this grammar:
• + has lower precedence than *.
• + is left-associative.
• * is left-associative.
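The effect of the layered grammar can be seen in a small parser. The sketch below follows E → E + T | T, T → T * F | F, F → ( E ) | id, with one swapped-in technique stated plainly: the left-recursive rules are written as loops, since a direct recursive call for E → E + T would never terminate. The function name `parse` and the nested-tuple output (an abstract tree of operators, not the full parse tree) are assumptions.

```python
def parse(tokens):
    """Parse a token list under E → E + T | T, T → T * F | F, F → ( E ) | id."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {peek()!r}")
        pos += 1

    def factor():                      # F → ( E ) | id
        if peek() == "(":
            advance("(")
            node = expr()
            advance(")")
            return node
        advance("id")
        return "id"

    def term():                        # T → T * F | F, written as a loop
        node = factor()
        while peek() == "*":
            advance("*")
            node = ("*", node, factor())   # builds left-nested groups
        return node

    def expr():                        # E → E + T | T, written as a loop
        node = term()
        while peek() == "+":
            advance("+")
            node = ("+", node, term())     # builds left-nested groups
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

`parse("id + id * id".split())` groups the multiplication first, and `parse("id + id + id".split())` nests to the left, which reflects the three properties listed above.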
Ambiguity: The Dangling Else
• Consider the following grammar

S → if C then S
  | if C then S else S
  | OTHER

• This grammar is also ambiguous



The Dangling Else: Example

• The expression
if C1 then if C2 then S3 else S4
has two parse trees

First form (else belongs to the outer if):

        if
      / | \
    C1  if  S4
       /  \
      C2   S3

Second form (else belongs to the inner if):

       if
      /  \
    C1    if
        / | \
      C2  S3 S4

• Typically we want the second form

The Dangling Else: A Fix

• else matches the closest unmatched then

• We can describe this in the grammar

S   → MIF        /* all then are matched */
    | UIF        /* some then are unmatched */
MIF → if C then MIF else MIF
    | OTHER
UIF → if C then S
    | if C then MIF else UIF

• Describes the same set of strings
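The rule that an else matches the closest unmatched then can also be applied directly while parsing. The sketch below mirrors that rule rather than the MIF/UIF grammar itself. The function name `parse_stmt`, the tuple output and the treatment of any single non-keyword token as a condition or as OTHER are assumptions for illustration.

```python
def parse_stmt(tokens):
    """Parse if/then/else statements, attaching each else to the nearest then."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        pos += 1
        return tok

    def stmt():
        if peek() != "if":
            tok = take()                  # OTHER: a single non-keyword token
            if tok in ("then", "else"):
                raise SyntaxError(f"unexpected keyword {tok!r}")
            return tok
        take("if")
        cond = take()
        take("then")
        then_branch = stmt()
        if peek() == "else":              # the innermost open if claims the else
            take("else")
            return ("if", cond, then_branch, stmt())
        return ("if", cond, then_branch)

    tree = stmt()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

For `if C1 then if C2 then S3 else S4` the result is `("if", "C1", ("if", "C2", "S3", "S4"))`, the form in which the else belongs to the inner if.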


The Dangling Else: Example Revisited

• The expression if C1 then if C2 then S3 else S4

• A valid parse tree (for a UIF):

       if
      /  \
    C1    if
        / | \
      C2  S3 S4

• Not valid because the then expression is not a MIF:

        if
      / | \
    C1  if  S4
       /  \
      C2   S3


Ambiguity

• No general techniques for handling ambiguity

• Impossible to convert automatically an
ambiguous grammar to an unambiguous one

• Used with care, ambiguity can simplify the
grammar
– Sometimes allows more natural definitions
– We need disambiguation mechanisms


Precedence and Associativity Declarations

• Instead of rewriting the grammar
– Use the more natural (ambiguous) grammar
– Along with disambiguating declarations

• Most tools allow precedence and associativity
declarations to disambiguate grammars

• Examples …


Associativity Declarations

• Consider the grammar E → E + E | int

• Ambiguous: two parse trees of int + int + int

Left-associative tree:

          E
        / | \
       E  +  E
     / | \   |
    E  +  E  int
    |     |
   int   int

Right-associative tree:

        E
      / | \
     E  +  E
     |   / | \
    int E  +  E
        |     |
       int   int

• Left associativity declaration: %left +


Precedence Declarations

• Consider the grammar E → E + E | E * E | int
– And the string int + int * int

First tree (+ grouped first):

          E
        / | \
       E  *  E
     / | \   |
    E  +  E  int
    |     |
   int   int

Second tree (* grouped first):

        E
      / | \
     E  +  E
     |   / | \
    int E  *  E
        |     |
       int   int

• Precedence declarations: %left +
                           %left *
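The idea of keeping the ambiguous grammar and adding declarations can be imitated with a table of precedence levels. The sketch below uses precedence climbing, a different technique from what yacc-style tools do internally (they use the declarations to resolve conflicts in their parse tables), but it selects the same trees. The table `OPS`, which stands in for `%left +` followed by `%left *`, and the function `parse` are assumptions.

```python
# Hypothetical table standing in for the declarations `%left +` then `%left *`:
# a later declaration gets a higher level, and both operators are left-associative.
OPS = {"+": (1, "left"), "*": (2, "left")}

def parse(tokens, ops=OPS):
    """Parse E → E op E | int, disambiguated by the precedence table `ops`."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def atom():
        nonlocal pos
        if peek() != "int":
            raise SyntaxError(f"expected 'int', found {peek()!r}")
        pos += 1
        return "int"

    def expr(min_prec):
        nonlocal pos
        left = atom()
        # Keep absorbing operators whose level is at least min_prec.
        while peek() in ops and ops[peek()][0] >= min_prec:
            op = peek()
            prec, assoc = ops[op]
            pos += 1
            # For a left-associative operator the right operand may only
            # contain operators of a strictly higher level.
            right = expr(prec + 1 if assoc == "left" else prec)
            left = (op, left, right)
        return left

    tree = expr(1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With the default table, `int + int * int` is grouped as `int + (int * int)` and `int + int + int` as `(int + int) + int`. Passing a table that marks `+` as `"right"` flips the second grouping.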
Grammar
1. X -> X - X
2. X -> var/const
Here var can be any variable, and const can be any constant value. The
string a - b - c has two leftmost derivations:

First derivation        Second derivation
1. X -> X - X           1. X -> X - X
2. X - X - X            2. var - X - X
3. var - var - var      3. a - var - var
4. a - b - c            4. a - b - c

For example, if we take the values a = 2, b = 3 and c = 4:
a - b - c = 2 - 3 - 4 = -5
In the first derivation tree, according to the order of substitution,
the expression will be evaluated as:
(a - b) - c = (2 - 3) - 4 = -1 - 4 = -5
In the second derivation tree:
a - (b - c) = 2 - (3 - 4) = 2 - (-1) = 3
Observe that the two parse trees do not give the same value; they
have different meanings. In the above example, the first derivation
tree, (a - b) - c, is the intended parse tree. The expression contains
the same operator twice, so according to mathematical rules it must be
evaluated based on the associativity of the operator, and subtraction
is left-associative.
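The two groupings can be evaluated side by side. In this small sketch the function names `eval_left` and `eval_right` are assumptions; each takes the operand values in source order.

```python
from functools import reduce

def eval_left(values):
    """Evaluate v1 - v2 - ... - vn as ((v1 - v2) - ...) - vn."""
    return reduce(lambda acc, v: acc - v, values)

def eval_right(values):
    """Evaluate v1 - v2 - ... - vn as v1 - (v2 - (... - vn))."""
    # Fold from the right: start with vn and subtract the running result
    # from each earlier value in turn.
    return reduce(lambda acc, v: v - acc, reversed(values))
```

With a = 2, b = 3, c = 4, `eval_left([2, 3, 4])` gives -5 and `eval_right([2, 3, 4])` gives 3, the two values computed above.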
Grammar:
E -> E + E
E -> E * E
E -> id

Input string: id + id * id

The leftmost derivation can be done in two ways:

First derivation      Second derivation
1. E -> E + E         1. E -> E * E
2. id + E             2. E + E * E
3. id + E * E         3. id + E * E
4. id + id * E        4. id + id * E
5. id + id * id       5. id + id * id

If id = 2: id + id * id = 2 + 2 * 2 = 6

First derivation:  id + (id * id) = 2 + (2 * 2) = 2 + 4 = 6
Second derivation: (id + id) * id = (2 + 2) * 2 = 4 * 2 = 8
