CS 335: Syntax Analysis
Swarnendu Biswas
Semester 2022-2023-II
CSE, IIT Kanpur
Content influenced by many excellent references, see References slide for acknowledgements.

An Overview of Compilation
[Figure: phases of a compiler. source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program, with a symbol table and an error handler alongside the phases]
CS 335 Swarnendu Biswas
Parser Interface
[Figure: source program → Lexical Analyzer → (token) → Syntax Analyzer → (parse tree) → Rest of Front End → IR. The syntax analyzer sends "get next token" requests to the lexical analyzer; a symbol table sits alongside.]

Need for Checking Syntax
• Given an input program, scanner generates a stream of tokens classified according to the syntactic category
• The parser determines if the input program, represented by the token stream, is a valid sentence in the programming language
• The parser attempts to build a derivation for the input program, using a grammar for the programming language
  • If the input stream is a valid program, parser builds a valid model for later phases
  • If the input stream is invalid, parser reports the problem and diagnostic information to the user
Syntax Analysis
• Given a programming language grammar 𝐺 and a stream of tokens 𝑠, parsing tries to find a derivation in 𝐺 that produces 𝑠
• In addition, a syntax analyser
  i. Forwards the information as IR to the next compilation phases
  ii. Handles errors if the input string is not in 𝐿(𝐺)

Context-Free Grammars
• A context-free grammar (CFG) 𝐺 is a quadruple (𝑇, 𝑁𝑇, 𝑆, 𝑃)
𝑇 Set of terminal symbols (also called words) in the language 𝐿(𝐺). A terminal symbol is a word that can occur in a sentence, and corresponds to a syntactic category returned by the scanner.
𝑁𝑇 Set of nonterminal symbols that appear in the productions of 𝐺. Nonterminals are syntactic variables that provide abstraction and structure in the productions.
𝑆 Goal or start symbol of the grammar 𝐺. 𝑆 represents the set of sentences in 𝐿(𝐺).
𝑃 Set of productions (or rules) in 𝐺. Each rule in 𝑃 is of the form 𝑁𝑇 → (𝑇 ∪ 𝑁𝑇)∗.
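The quadruple (𝑇, 𝑁𝑇, 𝑆, 𝑃) maps directly onto a small data structure. The following is a minimal sketch, not a prescribed representation: the class name `Grammar`, its field names, and the `is_well_formed` check are assumptions made for illustration, and the grammar instance is the expression grammar used in this deck's examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # T
    nonterminals: frozenset   # NT
    start: str                # S
    productions: tuple        # P: pairs (lhs, rhs), where rhs is a tuple of symbols

    def is_well_formed(self):
        """Check that S is in NT, that T and NT are disjoint,
        and that every rule has the form NT -> (T U NT)*."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.productions
            )
        )


# Expr -> ( Expr ) | Expr Op name | name ;  Op -> + | - | × | ÷
EXPR_GRAMMAR = Grammar(
    terminals=frozenset({"(", ")", "+", "-", "×", "÷", "name"}),
    nonterminals=frozenset({"Expr", "Op"}),
    start="Expr",
    productions=(
        ("Expr", ("(", "Expr", ")")),
        ("Expr", ("Expr", "Op", "name")),
        ("Expr", ("name",)),
        ("Op", ("+",)),
        ("Op", ("-",)),
        ("Op", ("×",)),
        ("Op", ("÷",)),
    ),
)

print(EXPR_GRAMMAR.is_well_formed())
```

A grammar whose start symbol is missing from 𝑁𝑇, or whose rules mention an undeclared symbol, fails the check.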
Definitions
• A derivation is a sequence of rewriting steps that begins with the grammar 𝐺's start symbol and ends with a sentence in the language
    𝑆 ⇒⁺ 𝑤 where 𝑤 ∈ 𝐿(𝐺)
• At each point during the derivation process, the string is a collection of terminal or nonterminal symbols
    𝛼𝐴𝛽 → 𝛼𝛾𝛽 if 𝐴 → 𝛾
  • Such a string is called a sentential form if it occurs in some step of a valid derivation
  • A sentential form can be derived from the start symbol in zero or more steps

Example of a CFG
CFG
𝐸𝑥𝑝𝑟 → ( 𝐸𝑥𝑝𝑟 )
  | 𝐸𝑥𝑝𝑟 𝑂𝑝 name
  | name
𝑂𝑝 → + | − | × | ÷

Derivation of (𝒂 + 𝒃) × 𝒄
𝐸𝑥𝑝𝑟 → 𝐸𝑥𝑝𝑟 𝑂𝑝 name
  → 𝐸𝑥𝑝𝑟 × name
  → (𝐸𝑥𝑝𝑟) × name
  → (𝐸𝑥𝑝𝑟 𝑂𝑝 name) × name
  → (𝐸𝑥𝑝𝑟 + name) × name
  → (name + name) × name

Parse Tree
[Figure: parse tree for (𝒂 + 𝒃) × 𝒄. The root 𝐸𝑥𝑝𝑟 has children 𝐸𝑥𝑝𝑟, 𝑂𝑝 (×), and name; the child 𝐸𝑥𝑝𝑟 expands to ( 𝐸𝑥𝑝𝑟 ), whose inner 𝐸𝑥𝑝𝑟 expands to 𝐸𝑥𝑝𝑟 𝑂𝑝 name with 𝑂𝑝 → + and 𝐸𝑥𝑝𝑟 → name.]
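The rewriting step 𝛼𝐴𝛽 → 𝛼𝛾𝛽 can be replayed mechanically. The sketch below assumes sentential forms are stored as tuples of symbol strings; the helper `rewrite` and the `STEPS` list are illustrative names, and the steps follow the derivation of (𝒂 + 𝒃) × 𝒄 under the expression grammar.

```python
def rewrite(form, position, lhs, rhs):
    """Apply the production lhs -> rhs to the symbol at `position`
    of a sentential form (alpha A beta becomes alpha gamma beta)."""
    if form[position] != lhs:
        raise ValueError(
            f"expected {lhs} at position {position}, found {form[position]}"
        )
    return form[:position] + rhs + form[position + 1:]


# Each step: (position of the nonterminal to rewrite, lhs, rhs)
STEPS = [
    (0, "Expr", ("Expr", "Op", "name")),
    (1, "Op", ("×",)),
    (0, "Expr", ("(", "Expr", ")")),
    (1, "Expr", ("Expr", "Op", "name")),
    (2, "Op", ("+",)),
    (1, "Expr", ("name",)),
]

form = ("Expr",)
forms = [form]            # every entry is a sentential form
for position, lhs, rhs in STEPS:
    form = rewrite(form, position, lhs, rhs)
    forms.append(form)

for f in forms:
    print(" ".join(f))
```

The last form contains only terminals, so it is a sentence of the language; every earlier entry in `forms` is a sentential form.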
Parse Tree
• A parse tree is a graphical representation of a derivation
  • Root is labeled with the start symbol 𝑆
  • Each internal node is a nonterminal, and represents the application of a production
  • Leaves are labeled by terminals and constitute a sentential form, read from left to right, called the yield or frontier of the tree
• Parse tree filters out the order in which productions are applied to replace nonterminals, and just represents the rules applied

Derivations
• At each step during derivation, we have two choices to make
  1. Which nonterminal to rewrite?
  2. Which production rule to pick?
• A leftmost derivation rewrites the leftmost nonterminal at each step, denoted by 𝛼 ⇒_𝑙𝑚 𝛽
  • Every leftmost derivation can be written as 𝑤𝐴𝛾 ⇒_𝑙𝑚 𝑤𝛿𝛾
• Rightmost (or canonical) derivation rewrites the rightmost nonterminal at each step, denoted by 𝛼 ⇒_𝑟𝑚 𝛽
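The two choices (which nonterminal, which production) can be made explicit in code. This is a sketch under assumptions: `derive` fixes the first choice to leftmost or rightmost, and the caller supplies the second as a list of right-hand sides. The names `GRAMMAR`, `derive`, and the choice lists are illustrative, and the input is name + name × name under the deck's expression grammar.

```python
GRAMMAR = {
    "Expr": {("(", "Expr", ")"), ("Expr", "Op", "name"), ("name",)},
    "Op": {("+",), ("-",), ("×",), ("÷",)},
}


def derive(start, choices, leftmost=True):
    """Rewrite the leftmost (or rightmost) nonterminal with each chosen
    right-hand side in turn and return every sentential form produced."""
    form = [start]
    forms = [tuple(form)]
    for rhs in choices:
        positions = [i for i, s in enumerate(form) if s in GRAMMAR]
        if not positions:
            raise ValueError("no nonterminal left to rewrite")
        i = positions[0] if leftmost else positions[-1]
        if rhs not in GRAMMAR[form[i]]:
            raise ValueError(f"{form[i]} -> {' '.join(rhs)} is not a production")
        form[i:i + 1] = rhs
        forms.append(tuple(form))
    return forms


# The same five productions, applied in two different orders.
LEFTMOST_CHOICES = [
    ("Expr", "Op", "name"), ("Expr", "Op", "name"), ("name",), ("+",), ("×",),
]
RIGHTMOST_CHOICES = [
    ("Expr", "Op", "name"), ("×",), ("Expr", "Op", "name"), ("+",), ("name",),
]

leftmost = derive("Expr", LEFTMOST_CHOICES, leftmost=True)
rightmost = derive("Expr", RIGHTMOST_CHOICES, leftmost=False)

for lm, rm in zip(leftmost, rightmost):
    print(f"{' '.join(lm):32} {' '.join(rm)}")
```

Both derivations end in the same sentence and use the same productions, so they correspond to the same parse tree; only the order of application differs.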
Leftmost Derivation
𝐸𝑥𝑝𝑟 → 𝐸𝑥𝑝𝑟 𝑂𝑝 name
  → (𝐸𝑥𝑝𝑟) 𝑂𝑝 name
  → (𝐸𝑥𝑝𝑟 𝑂𝑝 name) 𝑂𝑝 name
  → (name 𝑂𝑝 name) 𝑂𝑝 name
  → (name + name) 𝑂𝑝 name
  → (name + name) × name

Parse Tree
[Figure: parse tree for (name + name) × name. The root 𝐸𝑥𝑝𝑟 has children 𝐸𝑥𝑝𝑟, 𝑂𝑝 (×), and name; the child 𝐸𝑥𝑝𝑟 expands to ( 𝐸𝑥𝑝𝑟 ), whose inner 𝐸𝑥𝑝𝑟 expands to 𝐸𝑥𝑝𝑟 𝑂𝑝 name.]

Ambiguous Grammars
• A grammar 𝐺 is ambiguous if some sentence in 𝐿(𝐺) has more than one rightmost (or leftmost) derivation
• An ambiguous grammar can produce multiple derivations and parse trees for the same sentence
𝑆𝑡𝑚𝑡 → if 𝐸𝑥𝑝𝑟 then 𝑆𝑡𝑚𝑡
  | if 𝐸𝑥𝑝𝑟 then 𝑆𝑡𝑚𝑡 else 𝑆𝑡𝑚𝑡
  | 𝐴𝑠𝑠𝑖𝑔𝑛
Ambiguous Dangling-Else Grammar
if 𝐸𝑥𝑝𝑟1 then if 𝐸𝑥𝑝𝑟2 then 𝐴𝑠𝑠𝑖𝑔𝑛1 else 𝐴𝑠𝑠𝑖𝑔𝑛2
Parse tree 1 (else attached to the inner if)
𝑆𝑡𝑚𝑡
  if 𝐸𝑥𝑝𝑟1 then 𝑆𝑡𝑚𝑡
    if 𝐸𝑥𝑝𝑟2 then 𝑆𝑡𝑚𝑡 else 𝑆𝑡𝑚𝑡
      𝐴𝑠𝑠𝑖𝑔𝑛1 (then branch), 𝐴𝑠𝑠𝑖𝑔𝑛2 (else branch)
Parse tree 2 (else attached to the outer if)
𝑆𝑡𝑚𝑡
  if 𝐸𝑥𝑝𝑟1 then 𝑆𝑡𝑚𝑡 else 𝑆𝑡𝑚𝑡
    then branch: if 𝐸𝑥𝑝𝑟2 then 𝑆𝑡𝑚𝑡, with 𝑆𝑡𝑚𝑡 → 𝐴𝑠𝑠𝑖𝑔𝑛1
    else branch: 𝐴𝑠𝑠𝑖𝑔𝑛2

Dealing with Ambiguous Grammars
• Compilers use parse trees to interpret the meaning of the expressions during later stages
• Ambiguous grammars are problematic for compilers since multiple parse trees can give rise to multiple interpretations
• Fixing ambiguous grammars
  i. Transform the grammar to remove the ambiguity
  ii. Include rules to disambiguate during derivations (e.g., associativity and precedence)
Fixing the Ambiguous Dangling-Else Grammar
• In most programming languages, an else is matched with the closest unmatched then
𝑆𝑡𝑚𝑡 → if 𝐸𝑥𝑝𝑟 then 𝑆𝑡𝑚𝑡
  | if 𝐸𝑥𝑝𝑟 then 𝑇ℎ𝑒𝑛𝑆𝑡𝑚𝑡 else 𝑆𝑡𝑚𝑡
  | 𝐴𝑠𝑠𝑖𝑔𝑛
𝑇ℎ𝑒𝑛𝑆𝑡𝑚𝑡 → if 𝐸𝑥𝑝𝑟 then 𝑇ℎ𝑒𝑛𝑆𝑡𝑚𝑡 else 𝑇ℎ𝑒𝑛𝑆𝑡𝑚𝑡
  | 𝐴𝑠𝑠𝑖𝑔𝑛

Derivation with Fixed Dangling-Else Grammar
if 𝐸𝑥𝑝𝑟1 then if 𝐸𝑥𝑝𝑟2 then 𝐴𝑠𝑠𝑖𝑔𝑛1 else 𝐴𝑠𝑠𝑖𝑔𝑛2
𝑆𝑡𝑚𝑡 → if 𝐸𝑥𝑝𝑟 then 𝑆𝑡𝑚𝑡
  → if 𝐸𝑥𝑝𝑟 then if 𝐸𝑥𝑝𝑟 then 𝑇ℎ𝑒𝑛𝑆𝑡𝑚𝑡 else 𝑆𝑡𝑚𝑡
  → if 𝐸𝑥𝑝𝑟 then if 𝐸𝑥𝑝𝑟 then 𝑇ℎ𝑒𝑛𝑆𝑡𝑚𝑡 else 𝐴𝑠𝑠𝑖𝑔𝑛
  → if 𝐸𝑥𝑝𝑟 then if 𝐸𝑥𝑝𝑟 then 𝐴𝑠𝑠𝑖𝑔𝑛 else 𝐴𝑠𝑠𝑖𝑔𝑛
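One way to check that the rewritten grammar removes the ambiguity for the dangling-else statement is to count parse trees under both grammars. The sketch below is a simple memoized counter and not a parsing algorithm from this lecture; `count_parses` is a hypothetical helper that assumes the grammar has no empty productions and no cycles of unit productions, and it treats `expr` and `assign` as terminals for brevity.

```python
from functools import lru_cache


def count_parses(grammar, start, tokens):
    """Count the parse trees that derive `tokens` from `start`.
    Assumes no empty productions and no cycles of unit productions."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(sym, i, j):
        if sym not in grammar:                       # terminal symbol
            return 1 if j - i == 1 and tokens[i] == sym else 0
        return sum(count_seq(rhs, i, j) for rhs in grammar[sym])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        if not rhs:
            return 1 if i == j else 0
        first, rest = rhs[0], rhs[1:]
        total = 0
        # `first` covers tokens[i:k]; each remaining symbol needs >= 1 token.
        for k in range(i + 1, j - len(rest) + 1):
            left = count_symbol(first, i, k)
            if left:
                total += left * count_seq(rest, k, j)
        return total

    return count_symbol(start, 0, len(tokens))


AMBIGUOUS = {
    "Stmt": (
        ("if", "expr", "then", "Stmt"),
        ("if", "expr", "then", "Stmt", "else", "Stmt"),
        ("assign",),
    ),
}
FIXED = {
    "Stmt": (
        ("if", "expr", "then", "Stmt"),
        ("if", "expr", "then", "ThenStmt", "else", "Stmt"),
        ("assign",),
    ),
    "ThenStmt": (
        ("if", "expr", "then", "ThenStmt", "else", "ThenStmt"),
        ("assign",),
    ),
}

TOKENS = "if expr then if expr then assign else assign".split()
print(count_parses(AMBIGUOUS, "Stmt", TOKENS))  # 2
print(count_parses(FIXED, "Stmt", TOKENS))      # 1
```

The ambiguous grammar yields two trees for the statement, while the fixed grammar yields only the tree that attaches the else to the inner if.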
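Interpreting the Meaning of Programs
CFG
𝐸𝑥𝑝𝑟 → (𝐸𝑥𝑝𝑟)
  | 𝐸𝑥𝑝𝑟 𝑂𝑝 name
  | name
𝑂𝑝 → + | − | × | ÷

Rightmost derivation of 𝒂 + 𝒃 × 𝒄
𝐸𝑥𝑝𝑟 → 𝐸𝑥𝑝𝑟 𝑂𝑝 name
  → 𝐸𝑥𝑝𝑟 × name
  → 𝐸𝑥𝑝𝑟 𝑂𝑝 name × name
  → 𝐸𝑥𝑝𝑟 + name × name
  → name + name × name
[Figure: parse tree for 𝒂 + 𝒃 × 𝒄. The root 𝐸𝑥𝑝𝑟 has children 𝐸𝑥𝑝𝑟, 𝑂𝑝 (×), and name; the child 𝐸𝑥𝑝𝑟 expands to 𝐸𝑥𝑝𝑟 𝑂𝑝 name with 𝑂𝑝 → +.]
How do we evaluate the expression?

Associativity
𝑠𝑡𝑟𝑖𝑛𝑔 → 𝑠𝑡𝑟𝑖𝑛𝑔 + 𝑠𝑡𝑟𝑖𝑛𝑔 | 𝑠𝑡𝑟𝑖𝑛𝑔 − 𝑠𝑡𝑟𝑖𝑛𝑔 | 0 | 1 | 2 | … | 9
9−5+2
[Figure: two parse trees for 9−5+2]
Tree 1, (9 − 5) + 2 (desired): 𝑠𝑡𝑟𝑖𝑛𝑔 → 𝑠𝑡𝑟𝑖𝑛𝑔 + 𝑠𝑡𝑟𝑖𝑛𝑔, where the left 𝑠𝑡𝑟𝑖𝑛𝑔 → 𝑠𝑡𝑟𝑖𝑛𝑔 − 𝑠𝑡𝑟𝑖𝑛𝑔 (9, 5) and the right 𝑠𝑡𝑟𝑖𝑛𝑔 → 2
Tree 2, 9 − (5 + 2): 𝑠𝑡𝑟𝑖𝑛𝑔 → 𝑠𝑡𝑟𝑖𝑛𝑔 − 𝑠𝑡𝑟𝑖𝑛𝑔, where the left 𝑠𝑡𝑟𝑖𝑛𝑔 → 9 and the right 𝑠𝑡𝑟𝑖𝑛𝑔 → 𝑠𝑡𝑟𝑖𝑛𝑔 + 𝑠𝑡𝑟𝑖𝑛𝑔 (5, 2)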
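The two parse trees for 9−5+2 lead to different values, which is why the choice of tree matters. A minimal sketch, assuming trees are encoded as nested tuples `(operator, left, right)` with integer leaves (an encoding chosen here for illustration):

```python
def evaluate(tree):
    """Evaluate a tree of the form (op, left, right) with integer leaves."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs


LEFT_GROUPED = ("+", ("-", 9, 5), 2)    # (9 - 5) + 2
RIGHT_GROUPED = ("-", 9, ("+", 5, 2))   # 9 - (5 + 2)

print(evaluate(LEFT_GROUPED))   # 6
print(evaluate(RIGHT_GROUPED))  # 2
```

Only the left-grouped tree matches the usual left-associative reading of − and +.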
Associativity
• If an operand has operators on both sides, the side on which the operator takes this operand determines the associativity of that operator
  • E.g., +, -, *, and / are left associative and ^ and = are right associative
• Grammar to generate strings with right associative operators
  𝑟𝑖𝑔ℎ𝑡 → 𝑙𝑒𝑡𝑡𝑒𝑟 = 𝑟𝑖𝑔ℎ𝑡 | 𝑙𝑒𝑡𝑡𝑒𝑟
  𝑙𝑒𝑡𝑡𝑒𝑟 → 𝑎 | 𝑏 | … | 𝑧

Parse Tree for Right Associative Grammars
a = b = c
𝑟𝑖𝑔ℎ𝑡
  𝑙𝑒𝑡𝑡𝑒𝑟 (a)
  =
  𝑟𝑖𝑔ℎ𝑡
    𝑙𝑒𝑡𝑡𝑒𝑟 (b)
    =
    𝑟𝑖𝑔ℎ𝑡
      𝑙𝑒𝑡𝑡𝑒𝑟 (c)
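The right-recursive rule 𝑟𝑖𝑔ℎ𝑡 → 𝑙𝑒𝑡𝑡𝑒𝑟 = 𝑟𝑖𝑔ℎ𝑡 | 𝑙𝑒𝑡𝑡𝑒𝑟 translates almost literally into a recursive function, and the recursion is what makes the tree grow to the right. This is a sketch under assumptions: tokens are separated by spaces, and `parse_right` and `parse_assignment` are illustrative names.

```python
def parse_right(tokens, pos=0):
    """right -> letter = right | letter. Returns (tree, next position)."""
    if pos >= len(tokens) or not (
        len(tokens[pos]) == 1 and "a" <= tokens[pos] <= "z"
    ):
        raise SyntaxError(f"expected a letter at position {pos}")
    letter = tokens[pos]
    if pos + 1 < len(tokens) and tokens[pos + 1] == "=":
        # right -> letter = right: recurse on everything after '='
        rest, nxt = parse_right(tokens, pos + 2)
        return ("=", letter, rest), nxt
    return letter, pos + 1                 # right -> letter


def parse_assignment(text):
    tokens = text.split()
    tree, nxt = parse_right(tokens)
    if nxt != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[nxt]!r}")
    return tree


print(parse_assignment("a = b = c"))  # ('=', 'a', ('=', 'b', 'c'))
```

The nested tuple groups b = c first, so the assignment to a uses the result of the inner assignment.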
Encode Precedence into the Grammar
𝑆𝑡𝑎𝑟𝑡 → 𝐸𝑥𝑝𝑟
𝐸𝑥𝑝𝑟 → 𝐸𝑥𝑝𝑟 + 𝑇𝑒𝑟𝑚 | 𝐸𝑥𝑝𝑟 − 𝑇𝑒𝑟𝑚 | 𝑇𝑒𝑟𝑚
𝑇𝑒𝑟𝑚 → 𝑇𝑒𝑟𝑚 × 𝐹𝑎𝑐𝑡𝑜𝑟 | 𝑇𝑒𝑟𝑚 ÷ 𝐹𝑎𝑐𝑡𝑜𝑟 | 𝐹𝑎𝑐𝑡𝑜𝑟
𝐹𝑎𝑐𝑡𝑜𝑟 → ( 𝐸𝑥𝑝𝑟 ) | num | name
(priority increases from the 𝐸𝑥𝑝𝑟 rules down to the 𝐹𝑎𝑐𝑡𝑜𝑟 rules)

Corresponding Parse Tree
𝒂−𝒃+𝒄
𝑆𝑡𝑎𝑟𝑡 → 𝐸𝑥𝑝𝑟
  → 𝐸𝑥𝑝𝑟 + 𝑇𝑒𝑟𝑚
  → 𝐸𝑥𝑝𝑟 + 𝐹𝑎𝑐𝑡𝑜𝑟
  → 𝐸𝑥𝑝𝑟 + name
  → 𝐸𝑥𝑝𝑟 − 𝑇𝑒𝑟𝑚 + name
  → 𝐸𝑥𝑝𝑟 − 𝐹𝑎𝑐𝑡𝑜𝑟 + name
  → 𝐸𝑥𝑝𝑟 − name + name
  → 𝑇𝑒𝑟𝑚 − name + name
  → 𝐹𝑎𝑐𝑡𝑜𝑟 − name + name
  → name − name + name
[Figure: parse tree for 𝒂−𝒃+𝒄. The root 𝐸𝑥𝑝𝑟 has children 𝐸𝑥𝑝𝑟, +, and 𝑇𝑒𝑟𝑚; the child 𝐸𝑥𝑝𝑟 has children 𝐸𝑥𝑝𝑟, −, and 𝑇𝑒𝑟𝑚; each remaining 𝐸𝑥𝑝𝑟 and 𝑇𝑒𝑟𝑚 reduces through 𝐹𝑎𝑐𝑡𝑜𝑟 to name.]
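The layered 𝐸𝑥𝑝𝑟 / 𝑇𝑒𝑟𝑚 / 𝐹𝑎𝑐𝑡𝑜𝑟 grammar can be exercised with a small hand-written parser. The sketch below uses recursive descent, a technique not covered in this section, and replaces each left-recursive rule with a loop so that operators still group to the left. The names `tokenize`, `Parser`, and `parse` are illustrative, and ASCII `*` and `/` stand in for × and ÷.

```python
import re


def tokenize(text):
    # Characters outside this pattern are silently skipped in this sketch.
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # Expr -> Expr + Term | Expr - Term | Term (left recursion as a loop)
        tree = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            tree = (op, tree, self.term())
        return tree

    def term(self):
        # Term -> Term * Factor | Term / Factor | Factor
        tree = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            tree = (op, tree, self.factor())
        return tree

    def factor(self):
        # Factor -> ( Expr ) | num | name
        token = self.advance()
        if token == "(":
            tree = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return tree
        if token is not None and (token[0].isalnum() or token[0] == "_"):
            return token
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("a - b + c"))   # ('+', ('-', 'a', 'b'), 'c')
print(parse("a + b * c"))   # ('+', 'a', ('*', 'b', 'c'))
```

Because `term` is called from inside `expr`, multiplication and division are grouped before addition and subtraction, which mirrors how the grammar places 𝑇𝑒𝑟𝑚 below 𝐸𝑥𝑝𝑟.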
Types of Parsers
Top-down
• Starts with the root and grows the parse tree toward the leaves
Bottom-up
• Starts with the leaves and grows the parse tree toward the root
Universal
• More general algorithms, but inefficient to use in production compilers

Programming Errors
• Common sources of programming errors
  • Lexical errors, e.g., illegal characters and missing quotes around strings
  • Syntactic errors, e.g., misspelled keywords, misplaced semicolons, or extra or missing braces
  • Semantic errors, e.g., type mismatches between operators and operands, undeclared variables
  • Logical errors
• The scanner cannot deal with all errors, e.g., it will mark misspelled keywords as IDs
Goals in Error Handling
i. Report errors accurately
ii. Recover from the error and detect subsequent errors
iii. Add minimal overhead to the compilation of correct programs
• Report the source location where the error is detected, chances are the actual error location is close by

Error Recovery Strategies in the Parser
Panic-mode recovery
• Parser discards input symbols until a synchronizing token is found, restarts processing from the synchronizing token
• Synchronizing tokens are usually delimiters (e.g., ; or })
Phrase-level recovery
• Perform local correction on the remaining input (e.g., replace comma by semicolon)
• Can go into an infinite loop because of wrong correction, or the error may have occurred before it is detected
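Panic-mode recovery can be sketched on a toy language. Everything specific here is an assumption made for illustration: statements have the hypothetical form `name = number ;`, tokens arrive as a list of strings, and `parse_statements` reports an error at the start of the offending statement instead of at the exact token.

```python
SYNC_TOKENS = {";", "}"}


def parse_statements(tokens):
    """Parse statements of the form `name = number ;`, using
    panic-mode recovery to continue after a syntax error."""
    parsed, errors = [], []
    pos = 0
    while pos < len(tokens):
        stmt = tokens[pos:pos + 4]
        if (
            len(stmt) == 4
            and stmt[0].isidentifier()
            and stmt[1] == "="
            and stmt[2].isdigit()
            and stmt[3] == ";"
        ):
            parsed.append((stmt[0], int(stmt[2])))
            pos += 4
            continue
        errors.append(f"syntax error in statement starting at token {pos}")
        # Panic mode: discard input until a synchronizing token is found.
        while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
            pos += 1
        pos += 1  # resume right after the synchronizing token
    return parsed, errors


tokens = "x = 1 ; y = = 2 ; z = 3 ;".split()
parsed, errors = parse_statements(tokens)
print(parsed)   # [('x', 1), ('z', 3)]
print(errors)   # one error, for the statement starting at token 4
```

The parser always consumes at least one token after an error, so this sketch cannot loop forever, and the statement for z is still checked even though the one for y is malformed.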
Handling Errors in the Parser
Error productions
• Augment the grammar with productions that generate erroneous constructs
• Works only for common mistakes and complicates the grammar
Global correction
• Given an incorrect input string 𝑥 and grammar 𝐺, find a parse tree for a related string 𝑦 such that the number of modifications (i.e., insertions, deletions, and changes) of tokens required to transform 𝑥 into 𝑦 is as small as possible

Context-Free vs Regular Grammar
• CFGs are more powerful than REs
  • Every regular language is context-free, but not vice versa
  • We can create a CFG for every NFA that simulates some RE
• Language that can be described by a CFG but not by a RE
  𝐿 = {𝑎ⁿ𝑏ⁿ | 𝑛 ≥ 1}
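The language 𝐿 = {𝑎ⁿ𝑏ⁿ | 𝑛 ≥ 1} needs matched counts of a and b, which a recursive rule captures directly. The sketch below assumes the grammar 𝑆 → a 𝑆 b | a b, which is one possible CFG for 𝐿 and is not given in the slides; `matches_anbn` is an illustrative name.

```python
def matches_anbn(s):
    """Recognize L = {a^n b^n | n >= 1} by following S -> a S b | a b."""
    if s == "ab":
        return True                        # S -> a b
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        return matches_anbn(s[1:-1])       # S -> a S b
    return False


for candidate in ["ab", "aabb", "aaabbb", "aab", "abab", ""]:
    print(repr(candidate), matches_anbn(candidate))
```

Each recursive call strips one a from the front and one b from the back, so the nesting depth of the recursion plays the role of the counter that a finite automaton lacks.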
Limitations of Syntax Analysis
• Cannot determine whether
  i. A variable has been declared before use
  ii. A variable has been initialized
  iii. Variables are of types on which operations are allowed
  iv. Number of formal and actual arguments of a function match
• These limitations are handled during semantic analysis

References
• A. Aho et al. Compilers: Principles, Techniques, and Tools, 2nd edition, Chapters 2 and 4.
• K. Cooper and L. Torczon. Engineering a Compiler, 2nd edition, Chapter 3.