G52CMP COMPILERS
Syntax Analysis
Lecture 5
1
Learning objectives
• To describe the process of checking the syntax of a
program
• To describe the representation of the syntax of a
programming language using a context-free grammar
• To understand the acceptance of a sentence for a
grammar using derivation
• To understand the problem of ambiguity in grammars
and how to resolve it
• To describe the top-down and bottom-up parsing
processes in accepting sentences for a given grammar
2
Syntax Analysis

[Compiler pipeline diagram: Source Program → Lexical Analyser →
Syntax Analyser → Semantic Analyser → Intermediate Code Generator →
Code Optimiser → Code Generator → Target Program, with the Symbol
Table Manager and Error Handler interacting with every phase]
3
Grouping of Tokens
4
Where is the Syntax Analysis?
if (b == 0) a = b;

Lexical Analysis or Scanner:
if ( b == 0 ) a = b ;

Syntax Analysis or Parsing:

[Abstract syntax tree, or parse tree without token names:
if at the root, with children == (over b and 0) and = (over a and b);
an inorder traversal recovers the source]
5
The Role of the Parser
[Diagram: the source program enters the Lexical Analyzer, which hands
tokens to the Parser on request (“get next token”); the Parser builds
a syntax tree for the Semantic Analyzer, which produces an
intermediate representation; all phases consult the symbol table]
6
Parsing Analogy (for English language)
• Syntax analysis for natural languages
• Recognize whether a sentence is grammatically correct
• Identify the function of each word
“I gave him the book”

[Parse tree: sentence → subject (I), verb (gave), indirect object (him),
object = noun phrase → article (the), noun (book)]
7
Parsing Overview
• Goal – determine if the input token stream satisfies
the syntax of the program
• What do we need to do this?
– An expressive way to describe the syntax
– A mechanism that determines if the input token
stream satisfies the syntax description
• For lexical analysis
– Regular expressions describe tokens
– Finite automata provide the mechanism for recognising
tokens in the input stream
8
Use Regular Expressions
• REs can expressively describe tokens
– Easy to implement via DFAs
• So just use them to describe the syntax of a
programming language ???
– NO! – They are not powerful enough to express nested
constructs (blocks, expressions, statements), such as
detecting balanced braces:
{ { { { {
{{} {} {{} { }}}
...
} } } } }
9
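The limitation above can be made concrete: recognising balanced braces needs an unbounded counter, which a DFA (finitely many states) cannot provide. A minimal Python sketch (illustrative, not from the slides):

```python
def balanced(s: str) -> bool:
    """Check balanced braces with a counter, something no DFA can do:
    a DFA has finitely many states, but nesting depth is unbounded."""
    depth = 0
    for ch in s:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth < 0:          # a '}' with no matching '{'
                return False
    return depth == 0              # every '{' must be closed
```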
Context-Free Grammars
// production rules or grammar:
S → a S a
S → T
T → b T b
T → ε

• Consist of 4 components/properties:
– Terminal symbols = tokens or ε
– Nonterminal symbols = syntactic variables
– Start symbol S = a special nonterminal
– Productions of the form LHS → RHS
• LHS = single nonterminal
• RHS = string of terminals and nonterminals
• Specify how nonterminals may be expanded
• Language generated by a grammar is the set of strings of
terminals derived from the start symbol by repeatedly
applying the productions
– L(G) = language generated by grammar G
10
Context-Free Grammars continued
A set of terminals: basic symbols from which
sentences are formed
A set of nonterminals: syntactic categories
denoting sets of sentences
A set of productions: rules specifying how the
terminals and nonterminals can be combined to
form sentences
The start symbol: a distinguished nonterminal
denoting the language
11
Notational Conventions
To avoid always having to state that "these are the
terminals”, "these are the nonterminals”, and so on, we
shall employ the following notational conventions with
regard to grammars throughout the remainder of this
subject
Terminals: id, +, -, *, /, (, )
Nonterminals: expr, op
Productions:
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → + | - | * | /
The start symbol: expr
12
Notational Conventions continued
1. These symbols are terminals:
i) Lower-case letters early in the alphabet such as a, b, c.
ii) Operator symbols such as +, -, etc.
iii) Punctuation symbols such as parentheses, comma, etc.
iv) The digits 0, 1, . . . , 9.
v) Boldface strings such as id or if.
2. These symbols are nonterminals:
i) Upper-case letters early in the alphabet such as A, B, C.
ii) The letter S, when it appears, is usually the start symbol.
iii) Lower-case italic names such as expr or stmt.
13
Notational Conventions continued
3. Upper-case letters late in the alphabet, such as X, Y, Z,
represent grammar symbols, that is, either nonterminals
(normally) or terminals.
4. Lower-case letters late in the alphabet, chiefly u, v, . . . , z,
represent strings of terminals.
5. Lower-case Greek letters, α, β, γ, …, represent strings of
grammar symbols. Thus, a generic production could be written
as A → α, indicating that there is a single nonterminal A on
the left of the arrow (the left side of the production) and a
string of grammar symbols α to the right of the arrow (the
right side of the production).
14
Notational Conventions continued
6. If A → α1, A → α2, . . . , A → αk are all productions with A on
the left (we call them A-productions),
we may write A → α1 | α2 | … | αk. We call α1, α2, . . . , αk the
alternatives for A.
7. The left side of the first production is the start symbol.
E → E A E | ( E ) | - E | id
A → + | - | * | /
15
Derivations
A derivation step is an application of a production as a
rewriting rule:
E ⇒ - E
A sequence of derivation steps
E ⇒ - E ⇒ - ( E ) ⇒ - ( id )
is called a derivation of “- ( id )” from E
The symbol ⇒* denotes: derives in zero or more steps
The symbol ⇒+ denotes: derives in one or more steps
E ⇒* - ( id )        E ⇒+ - ( id )
1. α ⇒* α for any string α // reflexive relation
2. If α ⇒* β and β ⇒ γ, then α ⇒* γ // transitive relation
16
Example

• Grammar for the balanced-parentheses language:
– S → ( S ) S
– S → ε // Why is the final S required ?
• 1 nonterminal: S
• 2 terminals: “(”, “)”
• Start symbol: S
• 2 productions
• If the grammar accepts a string, there is a derivation of that
string using the productions. For example:
– S ⇒ (S)S ⇒ (S) ⇒ ((S)S) ⇒ (()S) ⇒ (())
– S ⇒ (S)S ⇒ ((S)S)S ⇒ ((S)S)(S)S ⇒* (())()
17
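The grammar above can be recognised directly by a small top-down procedure, with one function for the nonterminal S. A sketch (function names are illustrative, not from the slides):

```python
def parse_S(s: str, i: int = 0) -> int:
    """Recognise  S -> ( S ) S | epsilon  starting at index i.
    Returns the index just past the S that was matched."""
    if i < len(s) and s[i] == '(':
        j = parse_S(s, i + 1)            # inner S
        if j >= len(s) or s[j] != ')':
            raise SyntaxError("expected ')'")
        return parse_S(s, j + 1)         # trailing S
    return i                             # S -> epsilon

def accepts(s: str) -> bool:
    """A string is accepted if a single S covers all of it."""
    try:
        return parse_S(s) == len(s)
    except SyntaxError:
        return False
```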
More on CFGs
• Shorthand notation – vertical bar for multiple
productions:
S → a S a | T
T → b T b | ε
• CFGs are powerful enough to express the syntax of
most programming languages
• Derivation = successive application of productions
starting from S
• Acceptance? = determine whether there is a derivation for
an input token stream
18
RE is a Subset of CFG
Can inductively build a grammar for each RE:

ε        S → ε
a        S → a
R1 R2    S → S1 S2
R1 | R2  S → S1 | S2
R1*      S → S1 S | ε

Note:
G1 = grammar for R1, with start symbol S1
G2 = grammar for R2, with start symbol S2
19
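The inductive construction in this table can be written down directly. The sketch below assumes a small hypothetical tuple-based RE AST; the production shapes mirror the table above:

```python
import itertools

_fresh = itertools.count()  # supplies fresh start-symbol names

def re_to_cfg(re):
    """Build (start_symbol, productions) for an RE given as a tuple AST:
    ('eps',), ('sym', a), ('cat', r1, r2), ('alt', r1, r2), ('star', r).
    A production is (lhs, rhs_list); an empty rhs_list means epsilon."""
    S = f"S{next(_fresh)}"
    kind = re[0]
    if kind == 'eps':
        return S, [(S, [])]                          # S -> epsilon
    if kind == 'sym':
        return S, [(S, [re[1]])]                     # S -> a
    if kind == 'cat':                                # R1 R2
        S1, P1 = re_to_cfg(re[1])
        S2, P2 = re_to_cfg(re[2])
        return S, P1 + P2 + [(S, [S1, S2])]          # S -> S1 S2
    if kind == 'alt':                                # R1 | R2
        S1, P1 = re_to_cfg(re[1])
        S2, P2 = re_to_cfg(re[2])
        return S, P1 + P2 + [(S, [S1]), (S, [S2])]   # S -> S1 | S2
    if kind == 'star':                               # R1*
        S1, P1 = re_to_cfg(re[1])
        return S, P1 + [(S, [S1, S]), (S, [])]       # S -> S1 S | eps
    raise ValueError(f"unknown RE node: {kind!r}")
```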
Context-Free Languages
A context-free language L(G) is the language defined by
a context-free grammar G
A string α of terminals is in L(G) if and only if S ⇒+ α;
α is called a sentence of G
If S ⇒* α, where α may contain nonterminals, then we
call α a sentential form of G
E ⇒ - E ⇒ - ( E ) ⇒ - ( id )
G1 is equivalent to G2 if L(G1) = L(G2)
20
A Parser
[Diagram: a token stream s (from the lexer) and a context-free
grammar G are the inputs to the Parser, which answers “Yes” if
s is in L(G), “No” otherwise, and reports error messages]
Syntax analyzers (parsers) = CFG acceptors which also
output the corresponding derivation when the token stream
is accepted
Various kinds of parsing: LL(k), LR(k), SLR, LALR
21
Left- & Right-most Derivations
Each derivation step needs to choose
– a nonterminal to rewrite
– a production to apply
Consider the following grammar:
E → E A E | ( E ) | - E | id
A → + | - | * | /
A leftmost derivation always chooses the
leftmost nonterminal to rewrite:
E ⇒lm - E ⇒lm - ( E ) ⇒lm - ( E A E ) ⇒lm - ( E + E )
⇒lm - ( id + E ) ⇒lm - ( id + id )
A rightmost derivation always chooses the
rightmost nonterminal to rewrite:
E ⇒rm - E ⇒rm - ( E ) ⇒rm - ( E A E ) ⇒rm - ( E + E )
⇒rm - ( E + id ) ⇒rm - ( id + id ) 22
Parse Trees
* A parse tree is a graphical representation of a derivation that
filters out the order of choosing nonterminals for rewriting
* Many derivations may correspond to the same parse tree, but
every parse tree has associated with it a unique leftmost and a
unique rightmost derivation

// leftmost derivation          // rightmost derivation
E ⇒lm - E                       E ⇒rm - E
  ⇒lm - ( E )                     ⇒rm - ( E )
  ⇒lm - ( E + E )                 ⇒rm - ( E + E )
  ⇒lm - ( id + E )                ⇒rm - ( E + id )
  ⇒lm - ( id + id )               ⇒rm - ( id + id )

[Parse tree: E ⇒ - E; that E ⇒ ( E ); the inner E ⇒ E + E;
both remaining Es ⇒ id]
23
Parse Tree vs Abstract Syntax Tree
S → E + S | E
E → number | ( S )

Derive: (1 + 2 + (3 + 4)) + 5

[Parse tree: S ⇒ E + S; the E covers ( 1 + 2 + ( 3 + 4 ) ) via
E ⇒ ( S ) and further expansions, and the S covers 5]

• Parse tree is a tree representation of the derivation
• Leaves of the tree are terminals
• Internal nodes are non-terminals
• NO information about the order of the derivation steps

[AST: + at the root, over (+ over 1, (+ over 2, (+ over 3 and 4)))
and 5]

• The AST (abstract syntax tree) discards unneeded information:
a more compact format; an inorder traversal recovers the expression
24
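The contrast between the two trees can be made concrete: an AST for this expression keeps only numbers and '+' nodes, and evaluating it is a simple recursion. A sketch (class names are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class Num:                  # a leaf: a number
    value: int

@dataclass
class Add:                  # an internal node: left + right
    left: object
    right: object

def evaluate(node):
    """Evaluate an AST; the parentheses and E/S bookkeeping of the
    parse tree are gone, only the structure that matters remains."""
    if isinstance(node, Num):
        return node.value
    return evaluate(node.left) + evaluate(node.right)

# AST for (1 + 2 + (3 + 4)) + 5
ast = Add(Add(Num(1), Add(Num(2), Add(Num(3), Num(4)))), Num(5))
```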
Derivation Order
• Can choose to apply productions in any order:
select a nonterminal and
substitute the RHS of a production
• Two standard orders: leftmost and rightmost
• Consider the following grammar:
S → E + S | E
E → number | ( S )
• Leftmost derivation
– In the string, find the leftmost nonterminal and apply a
production to it
– E + S ⇒ 1 + S
• Rightmost derivation
– Same, but find the rightmost nonterminal
– E + S ⇒ E + E + S
25
Leftmost & Rightmost Derivation
Grammar:  S → E + S | E    E → number | ( S )
Derive: (1 + 2 + (3 + 4)) + 5

Leftmost derivation:            Rightmost derivation:
S ⇒ E + S                       S ⇒ E + S
  ⇒ ( S ) + S                     ⇒ E + E
  ⇒ ( E + S ) + S                 ⇒ E + 5
  ⇒ ( 1 + S ) + S                 ⇒ ( S ) + 5
  ⇒ ( 1 + E + S ) + S             ⇒ ( E + S ) + 5
  ⇒ ( 1 + 2 + S ) + S             ⇒ ( E + E + S ) + 5
  ⇒ ( 1 + 2 + E ) + S             ⇒ ( E + E + E ) + 5
  ⇒ ( 1 + 2 + ( S ) ) + S         ⇒ ( E + E + ( S ) ) + 5
  ⇒ ( 1 + 2 + ( E + S ) ) + S     ⇒ ( E + E + ( E + S ) ) + 5
  ⇒ ( 1 + 2 + ( 3 + S ) ) + S     ⇒ ( E + E + ( E + E ) ) + 5
  ⇒ ( 1 + 2 + ( 3 + E ) ) + S     ⇒ ( E + E + ( E + 4 ) ) + 5
  ⇒ ( 1 + 2 + ( 3 + 4 ) ) + S     ⇒ ( E + E + ( 3 + 4 ) ) + 5
  ⇒ ( 1 + 2 + ( 3 + 4 ) ) + E     ⇒ ( E + 2 + ( 3 + 4 ) ) + 5
  ⇒ ( 1 + 2 + ( 3 + 4 ) ) + 5     ⇒ ( 1 + 2 + ( 3 + 4 ) ) + 5
Result: Same parse tree, same productions chosen, but in different order
26
Ambiguous Grammar
* A grammar is ambiguous if it produces more than
one parse tree for some sentence

E → E + E
  | E * E
  | id

Derive: id + id * id

E ⇒ E + E                E ⇒ E * E
  ⇒ id + E                 ⇒ E + E * E
  ⇒ id + E * E             ⇒ id + E * E
  ⇒ id + id * E            ⇒ id + id * E
  ⇒ id + id * id           ⇒ id + id * id

[Two parse trees: one with + at the root and * in its right subtree,
one with * at the root and + in its left subtree]
27
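Ambiguity is not just cosmetic: the two parse trees suggest different evaluations of the same sentence. A small illustration (the tuple encoding is an assumption, not from the slides):

```python
def eval_tree(t):
    """Evaluate a tree given as an int leaf or (op, left, right)."""
    if isinstance(t, int):
        return t
    op, left, right = t
    if op == '+':
        return eval_tree(left) + eval_tree(right)
    return eval_tree(left) * eval_tree(right)

# The two parse trees for 1 + 2 * 3 under E -> E + E | E * E | id:
plus_at_root = ('+', 1, ('*', 2, 3))   # evaluates as 1 + (2 * 3)
star_at_root = ('*', ('+', 1, 2), 3)   # evaluates as (1 + 2) * 3
```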
Resolving Ambiguity
* Use disambiguating rules to throw away undesirable
parse trees
* Rewrite grammars by incorporating disambiguating
rules into grammars
28
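For expression grammars, the usual rewriting encodes precedence and associativity in extra nonterminals. The classic textbook rewriting (not from these slides) of E → E + E | E * E | id is:

```
E → E + T | T        // '+' has lowest precedence, left-associative
T → T * F | F        // '*' binds tighter than '+'
F → ( E ) | id
```

Under this grammar, id + id * id has exactly one parse tree, with + at the root.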
Example
The dangling-else grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
Which parse tree for if E1 then if E2 then S1 else S2 ?

[Parse tree 1: stmt ⇒ if E then stmt else stmt, with the then-branch
stmt ⇒ if E then stmt (the else matches the outer then)]

[Parse tree 2: stmt ⇒ if E then stmt, with the inner
stmt ⇒ if E then stmt else stmt (the else matches the inner then)]
29
Incorporating Disambiguating Rules
into Grammars
* Rule: match each “else” with the closest previous unmatched
“then”
* Remove undesired state transitions in the pushdown automaton
Which parse tree for if E1 then if E2 then S1 else S2 ?

[Rejected: stmt ⇒ if E then stmt else stmt, with the then-branch
stmt ⇒ if E then stmt (the else would match the outer then)]

[Chosen: stmt ⇒ if E then stmt, with the inner
stmt ⇒ if E then stmt else stmt (the else matches the closest then)]
30
Grammar Rewriting
Consider the ambiguous grammar below:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
Incorporating disambiguating rules into a grammar:
stmt → m_stmt
     | unm_stmt
m_stmt → if expr then m_stmt else m_stmt
       | other
unm_stmt → if expr then stmt
         | if expr then m_stmt else unm_stmt
Try to derive by yourself:
if E then if E then S else S
31
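With the rewritten grammar, the sentence has a derivation only where the else attaches to the inner then:

```
stmt ⇒ unm_stmt
     ⇒ if expr then stmt
     ⇒ if expr then m_stmt
     ⇒ if expr then if expr then m_stmt else m_stmt
     ⇒* if E then if E then S else S
```

Attaching the else to the outer then would need stmt ⇒ if expr then stmt else stmt, which no production in the rewritten grammar allows.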
Different kinds of Parsing Algorithms
• Two big groups of algorithms can be distinguished:
– bottom-up strategies
– top-down strategies
• Analogy: parsing of the “Micro-English” grammar below:

Sentence ::= Subject Verb Object .
Subject  ::= I | A Noun | The Noun
Object   ::= me | a Noun | the Noun
Noun     ::= cat | bat | rat
Verb     ::= like | is | see | sees

The cat sees a rat.      The rat sees me.    I like a cat.
The cat sees the rat.    The rat like me.    I see the rat.
                         I sees a rat.
                         I like the rat.
// Shhh… I don’t like rat. 32
Bottom up parsing
The parse tree “grows” from the bottom (leaves) up to the top (root).

[Tree for “The cat sees a rat .”: the words are grouped first into
Noun, Verb and Noun, then into Subject and Object, and finally into
Sentence at the root]
33
Top-down parsing
The parse tree is constructed starting at the top (root) down to the
bottom (leaves).

[Tree for “The cat sees a rat .”: Sentence is expanded first into
Subject Verb Object, then Subject and Object into their Nouns, until
the words at the leaves are matched]
34
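Top-down parsing of the Micro-English grammar can be sketched as a recursive-descent parser, with one procedure per nonterminal (an illustrative sketch, not from the slides):

```python
class Parser:
    """Recursive-descent recognizer for the Micro-English grammar."""

    def __init__(self, words):
        self.words = words
        self.pos = 0

    def next(self):
        return self.words[self.pos] if self.pos < len(self.words) else None

    def eat(self, *expected):
        if self.next() in expected:
            self.pos += 1
            return
        raise SyntaxError(f"expected one of {expected}, got {self.next()}")

    def sentence(self):              # Sentence ::= Subject Verb Object .
        self.subject()
        self.eat("like", "is", "see", "sees")   # Verb
        self.object_()
        self.eat(".")
        if self.pos != len(self.words):
            raise SyntaxError("trailing words")

    def subject(self):               # Subject ::= I | A Noun | The Noun
        if self.next() == "I":
            self.eat("I")
        else:
            self.eat("A", "The")
            self.eat("cat", "bat", "rat")       # Noun

    def object_(self):               # Object ::= me | a Noun | the Noun
        if self.next() == "me":
            self.eat("me")
        else:
            self.eat("a", "the")
            self.eat("cat", "bat", "rat")       # Noun

def accepts(sentence: str) -> bool:
    try:
        Parser(sentence.replace(".", " .").split()).sentence()
        return True
    except SyntaxError:
        return False
```

Note that, like the grammar itself, this parser happily accepts “The rat like me.”: it checks Micro-English syntax, not real English agreement.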
Quick Review
• Syntactic analysis
– Lexical analysis
• Group letters into words (or group characters into
tokens)
• Use regular expressions and deterministic FSMs
– Grammar transformations before parsing, coming soon
• Left-recursion removal
• Left-factoring
• Substitution
– Parsing = structural analysis of program
• Group words into sentences, paragraphs, and
documents (or tokens into expressions, commands, and
programs)
• Top-down Parsing
• Bottom-up Parsing
35
END
36