
SYLLABUS

PRINCIPLES OF COMPILER DESIGN

CHAPTER I: INTRODUCTION TO COMPILING


Compilers − Analysis of the source program − Phases of compiler − Cousins of the
compiler − Grouping of phases − Compiler construction tools.

CHAPTER II: LEXICAL ANALYSIS


Role of Lexical Analyzer − Input Buffering − Specification and recognition of Tokens
− Finite Automata − Regular Expression to FA − Optimization of DFA-based
pattern matchers − Tools for generating Lexical Analyzers.

CHAPTER III: SYNTAX ANALYSIS


Role of the parser − Writing Grammars − Context Free Grammars − Top Down
Parsing − Recursive Descent parsing − Predictive Parsing − Bottom-up Parsing −
Shift reduce Parsing − Operator Precedence Parsing − LR Parsers − SLR Parser −
Canonical LR Parser − LALR Parser − Tool for Parser.

CHAPTER IV: INTERMEDIATE CODE GENERATION


Intermediate languages − Declarations − Assignment Statements − Boolean
Expressions − Flow control Statements − Back patching − Procedure calls.

CHAPTER V: CODE GENERATION & CODE OPTIMIZATION


Issues in the design of code generator − The target Machine − Basic Blocks and Flow
Graphs − A Simple code generator − DAG Representation of Basic Blocks −
Introduction to Optimization − Principal sources of optimization − Optimization of
basic blocks − Peephole optimization − Case Study: One-pass Compiler

*********

CONTENTS
CHAPTER – I: INTRODUCTION TO COMPILERS 1.1 – 1.64
1.1 Introduction 1.1

1.1.1 Translator 1.1

1.1.2 Language Processing System 1.4

1.1.3 Analysis and Synthesis Model 1.6

1.2 The Phases of the Compiler 1.7

1.3 Parser 1.8

1.4 Semantic Routines 1.8

1.5 Compiler Construction Tools 1.10

1.6 Introduction to Lexical Analysis 1.11

1.7 Role of Lexical Analyzer 1.22

1.7.1 Lexical Errors 1.23

1.7.2 Token, Patterns, Lexemes 1.23

1.8 Input Buffering 1.24

1.9 Specification of Tokens 1.25

1.10 Recognition of Token 1.30

1.11 Language for Specifying Lexical Analyzer 1.33

1.11.1 Lex Program to Count Total Number of Tokens 1.41

1.12 Finite Automata 1.42

1.12.1 Role of Finite Automata in Lexical Analysis 1.43

1.12.2 Types of Finite Automata 1.45

1.12.3 NFA with ∈ Closure 1.46

1.12.4 Conversion from NFA with ε to DFA 1.47

1.13 Regular Expressions to Finite Automata 1.58

1.14 Minimizing DFA 1.63



CHAPTER – II: SYNTAX ANALYSIS 2.1 – 2.80

2.1 Introduction 2.1

2.2 Role of parser 2.1

2.3 Context – free Grammar 2.3

2.4 Writing a Grammar 2.20

2.5 Ambiguous Grammar 2.21

2.6 Error Handling 2.25

2.7 Top Down Parsing 2.26

2.7.1 Recursive Descent Parsing 2.26

2.7.2 Predictive Parsing 2.28

2.7.2.1 Construction of LL(1) Parser 2.34

2.8 Bottom up Parsing 2.36

2.8.1 Concept of Shift Reduce Parsing 2.41

2.8.2 Operator Precedence Parser 2.46

2.9 LR Parser 2.50

2.9.1 Simple LR Parsing (SLR) 2.58

2.9.2 Canonical LR Parsing (CLR) 2.61

2.9.3 LALR 2.70

2.10 Comparison of LR Parsers 2.73

2.11 Error Handling and Recovery in Syntax Analyzer 2.74

2.12 YACC 2.76

2.12.1 YACC Specification 2.79



CHAPTER – III: INTERMEDIATE CODE GENERATION 3.1 – 3.36

3.1 Syntax Directed Definitions 3.1

3.2 Either 3.2

3.3 Dependency Graphs 3.4

3.3.1 Syntax Tree 3.7

3.3.2 Three Address Code 3.9

3.3.2.1 Implementation of Three Address Code 3.10

3.3.2.2 Types of Three Address Code 3.14

3.4 Declarations 3.23

3.5 Translation of Expressions 3.25

3.6 Type Checking 3.32

3.6.1 Type Expressions 3.32

3.7 Type Conversion 3.35

CHAPTER – IV: RUN-TIME ENVIRONMENT


AND CODE GENERATION 4.1 – 4.22

4.1 Introduction 4.1

4.2 Source Language Issues 4.3

4.3 Storage Organization 4.4

4.4 Storage Allocation Strategies 4.5

4.4.1 Static Allocation 4.5

4.4.2 Stack Allocation of Space 4.5

4.4.3 Heap Allocation 4.7

4.5 Access to non-local Data on the Stack 4.8

4.6 Activation Record 4.10

4.7 Parameter Passing 4.12



4.8 Issues of Code Generator 4.14

4.9 Target Machine Description 4.17

4.9.1 Instruction Costs 4.18

4.10 Design of a Simple Code Generator 4.19

CHAPTER – V: CODE OPTIMIZATION 5.1 – 5.30

5.1 Introduction to Code Optimization 5.1

5.2 Principal Sources of Optimisation 5.4

5.2.1 Function preserving transformations examples 5.5

5.2.2 Common Sub-expressions Elimination 5.5

5.2.3 Copy Propagation 5.6

5.2.4 Strength Reduction 5.6

5.2.5 Dead Code Eliminations 5.7

5.2.6 Loop Optimizations 5.7

5.3 Peep-hole Optimization 5.9

5.4 Dag Representation of Basic Blocks 5.13

5.4.1. Construction of DAG 5.15

5.5 Optimization of Basic Blocks 5.19

5.6 Global Data Flow Analysis 5.22

5.7 Efficient Data Flow Analysis 5.23

5.7.1 Redundant Common Sub Expression Elimination 5.23

5.7.2 Copy Propagation 5.25

5.7.3 Induction Variable 5.27

Question Paper Q.1 – Q.8

Index I.1 – I.2


CHAPTER – I

INTRODUCTION TO COMPILERS

1.1 INTRODUCTION:
1.1.1 Translator:
★ A translator is system software that converts a program written in one
language into another language.
★ During translation, any syntax errors encountered are reported to the user.

Source program in one language → Translator → Object program in another language
(error messages are reported by the translator)
Fig. 1.1

Machine Language:
Computers are made of electronic components and can understand only two
states of these electronic devices:
1. ON state
2. OFF state
In the early days of computing, programming was done only in the machine
understandable language, i.e., using only the binary digits
0 (OFF state) and
1 (ON state)

★ Programming using binary numbers is difficult, and debugging a program
written in such a low level language is also very difficult.

★ Therefore to make the programming task simpler, the next level of


programming called Assembly Language was developed.

Example:

To perform the computation 10 + 2, the machine level coding may be
10001010 (for 10)   10001011 (for +)   10100010 (for 2)

Assembly Language:

In Assembly level coding, the operations to be performed by a computer are
represented using English-like words called mnemonics. This reduces the length of
the program, and programming becomes easier when compared to machine level coding.

Example:

To perform the computation 10 + 2, the Assembly level coding may be

ADD 10, 2.

★ But, in the case of large programs, coding using mnemonics is also difficult.
This resulted in the development of the next higher level of programming.

High Level Language:

In High Level Language coding, programs are written using English-like
statements. This makes programming easier.

Example:

To perform the computation 10 + 2, the high level coding may be

C = 10 + 2

Need for Translation:

Program in human understandable High Level Language (or) Assembly Language → Translator → Program in machine understandable Low Level Language
(error messages are reported by the translator)
Fig. 1.2
★ Programming can be in any form either in Assembly Language (or) in High
level language.

★ However, the machine can understand only 0's and 1's (Low Level Language).
Thus, software is necessary to convert programs written in High Level
Language / Assembly Language into Low Level Language. Such software is
called a translator.
COMPILER:

Source program in HLL (e.g. C, C++) → Compiler → Object program in LLL
(error messages are reported by the compiler)
Fig. 1.3

★ A compiler is a type of translator that converts programs written in High


Level Language (HLL) to Low Level Language (LLL).

★ During compilation, if any errors are encountered, the compiler displays


them as error messages.

ANALYSIS OF SOURCE PROGRAM:

The source program is analyzed by the compiler to check (for the syntax errors)
whether the program is up to the standard of the corresponding programming
language or not.

This analysis is performed in three different stages:


1. Lexical Analysis

2. Hierarchical Analysis
3. Semantic Analysis
Lexical Analysis: In this stage, the source program is read character by character
from left to right and grouped into collections of characters called TOKENS.
Hierarchical Analysis: In this stage, tokens are grouped hierarchically into a nested
structure called a SYNTAX TREE, which is checked for correct syntax.

Semantic Analysis: In this stage, the hierarchical structure is checked for its
meaning (for example, verifying the data types of the variables).
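As a small illustration (using the assignment statement that this chapter analyzes in detail later), consider the input position = initial + rate * 60. Lexical analysis groups the characters into the tokens id1, =, id2, +, id3, *, 60; hierarchical analysis builds a syntax tree with = at the root, + beneath it and * beneath +; and semantic analysis checks, for example, that the integer 60 may be combined with the floating point operands, inserting a conversion where the language permits it.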
1.1.2 Language Processing System:
We have learnt that any computer system is made of hardware and software.
The hardware understands a language which humans cannot understand. So we
write programs in a high-level language, which is easier for us to understand and
remember. These programs are then fed into a series of tools and OS components to
get the desired code that can be used by the machine. This is known as the Language
Processing System.

High Level Language → Pre-Processor → Pure HLL → Compiler → Assembly Language → Assembler → Relocatable Machine Code → Loader/Linker → Absolute Machine Code
Fig. 1.4

The high-level language is converted into binary language in various phases. A


compiler is a program that converts high-level language to assembly language.
Similarly, an assembler is a program that converts the assembly language to
machine-level language.

Let us first understand how a program, using C compiler, is executed on a host


machine.

★ User writes a program in C language (high-level language).

★ The C compiler compiles the program and translates it to assembly program


(low-level language).

★ An assembler then translates the assembly program into machine code


(object).

★ A linker tool is used to link all the parts of the program together for
execution (executable machine code).

★ A loader loads all of them into memory and then the program is executed.

Before diving straight into the concepts of compilers, we should understand a


few other tools that work closely with compilers.

Preprocessor:

A preprocessor, generally considered as a part of compiler, is a tool that produces


input for compilers. It deals with macro-processing, augmentation, file inclusion,
language extension, etc.

Interpreter:

An interpreter, like a compiler, translates high-level language into low-level
machine language. The difference lies in the way they read the source code or input.
A compiler reads the whole source code at once, creates tokens, checks semantics,
generates intermediate code and then target code, and may involve many passes. In
contrast, an interpreter reads a statement from the input, converts it to an
intermediate code, executes it, then takes the next statement in sequence. If an
error occurs, an interpreter stops execution and reports it, whereas a compiler reads
the whole program even if it encounters several errors.

Assembler:

An assembler translates assembly language programs into machine code. The


output of an assembler is called an object file, which contains a combination of
machine instructions as well as the data required to place these instructions in
memory.

Linker:

Linker is a computer program that links and merges various object files together
in order to make an executable file. All these files might have been compiled by
separate assemblers. The major task of a linker is to search and locate referenced
module/routines in a program and to determine the memory location where these
codes will be loaded, making the program instruction to have absolute references.

Loader:

Loader is a part of operating system and is responsible for loading executable


files into memory and executing them. It calculates the size of a program (instructions
and data) and creates memory space for it. It initializes various registers to initiate
execution.

Cross-compiler:

A compiler that runs on one machine and produces target code for a different
machine is called a cross-compiler. A compiler that takes the source code of one
programming language and translates it into the source code of another programming
language is called a source-to-source compiler.

1.1.3. Analysis and Synthesis Model:

Analysis phase of compiler:

The analysis phase reads the source program, splits it into tokens and constructs
an intermediate representation of the source program. It also checks for and reports
the syntax and semantic errors of the source program.

It collects information about the source program and prepares the symbol table,
which is used throughout the compilation process. This is also called the front end
of a compiler.

Synthesis phase of compiler:

It takes the output of the analysis phase (the intermediate representation and
symbol table) and produces the target machine level code. This is also called the
back end of a compiler.

Source Program → Analysis Phase (Front End) → Synthesis Phase (Back End) → Machine Code
The analysis phase constructs (1) the intermediate representation and (2) the symbol table; error checking is performed in both the front end and the back end.
Fig. 1.5: Analysis and Synthesis Model


1.2 THE PHASES OF THE COMPILER:

Source Program (character stream) → Scanner → tokens → Parser → syntactic structure → Semantic Routines → intermediate representation → Optimizer → Code Generator → Target Machine Code
Fig. 1.6

Scanner:

★ The scanner begins the analysis of the source program by reading the input
character by character and grouping characters into individual words and
symbols (tokens). Related notations and tools:
★ RE − Regular Expressions
★ NFA − Non-deterministic Finite Automata
★ DFA − Deterministic Finite Automata
★ LEX − the LEX tool


1.3 PARSER:

★ Given a formal syntax specification (typically as a Context Free Grammar
[CFG]), the parser reads tokens and groups them into units as specified by
the productions of the CFG being used.
★ As a syntactic structure is recognized, the parser either calls the corresponding
semantic routines directly or builds a syntax tree.
★ CFG − Context Free Grammar
★ BNF − Backus-Naur Form
★ Grammar Analysis Algorithms (GAA)
★ LL, LR, SLR, LALR parsers.
★ YACC.
1.4 SEMANTIC ROUTINES:

It performs two functions:
★ Check the static semantics of each construct.
★ Do the actual translation.
The semantic routines are the heart of a compiler.

★ Syntax Directed Translation.

★ Semantic Processing Techniques.

★ IR (Intermediate Representation).

Optimizer:
★ The IR code generated by the semantic routines is analyzed and transformed
into functionally equivalent but improved IR code.
★ This phase can be very complex & slow.
★ Peephole optimization.
★ Loop optimization, register allocation, code scheduling.

• Register & Temporary Management.


• Peephole optimization.
Code Generator:
★ Interpretive code generation.
★ Generating code from Tree/DAG.
★ Grammar based code Generator.
The structure of a compiler:
Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator (analysis phase) → Code optimizer → Code generator → Target machine code (synthesis phase).
Symbol table management and error detection and handling interact with all the phases.
Fig. 1.7

Compiler writing Tools:


★ Compiler Generators or Compiler-Compilers
Eg: Scanner & Parser Generators
Eg: YACC, LEX
1.5. COMPILER CONSTRUCTION TOOLS:
1. Software development tools are available to implement one or more compiler
phases.

(i) Scanner Generators (LEX & Flex)


(ii) Parser Generators (YACC & Bison)
(iii) Syntax − directed translation engines.
(iv) Automatic code generator.
(v) Data flow engines.
The cycle of constructions:
RE → NFA → DFA → Minimal DFA
Skeleton source program → Preprocessor → Source program → Compiler → Target assembly program → Assembler → Relocatable object code → Linker (together with libraries and relocatable object files) → Absolute machine code
Fig. 1.8

1.6. INTRODUCTION TO LEXICAL ANALYSIS:

Compilers:

★ A compiler translates a program written in one language (the source
language) to an equivalent program in another language (the target language).
★ Source language − a high level language like Java, C, Fortran, etc.
★ Target language − machine language (or) code that a computer processor
understands.

★ Source code

• Optimized for human readability.
• Expressive (matches our notion of languages).
• Redundant, to help avoid programming errors.
★ Machine code

• optimized for hardware


• redundancy is reduced
• lacks readability.
How to translate?

★ High level and machine languages differ in level of abstraction.

★ Machine level deal with memory locations, registers.


Goals of translation:

★ Good performance for generated code

• Size of hand written code and compiled machine code for same program.
• Better compiler − generates smaller code.
★ If a compiler produces a code which is 20 − 30% slower than the handwritten
code then it is acceptable.

★ Compilation time must be proportional to the program size.

★ Maintainable code.

★ High level of abstraction.


★ Correctness.

• all valid programs must compile correctly.


• generate correct machine code.
• complexity − amount of optimization done.
How to translate easily?
★ Translate in steps − each step handles a reasonably simple, logical and well
defined task.

★ Design a series of program representations.


★ Intermediate representations should be amenable for program manipulation
of various kinds (type checking, optimization, code generation, ...)

★ Language processing system.

Source program → Preprocessor → Modified source program → Compiler → Target assembly program → Assembler → Relocatable machine code → Linker/Loader (together with library files and relocatable object files) → Target machine code
Fig. 1.9

Compiler Vs Interpreter:

Compiler | Interpreter
★ Takes the entire program as input. | Takes a single instruction as input.
★ Errors are displayed after the entire program is checked. | Errors are displayed for every instruction interpreted (if any).
★ Takes a large amount of time to analyze the source code. | Takes less time to analyze the source code.
★ Overall execution time is comparatively faster. | Overall execution time is slower.
Eg: C, C++ use compilers. | Eg: Python, Ruby use interpreters.

Structure of a compiler:

★ Two parts (front end and back end) − Analysis & Synthesis
★ Analysis part → breaks the source program into constituent pieces and imposes
a grammatical structure on them.
• Uses this structure to create an intermediate representation.
• Provides informative messages if the source program is syntactically ill formed
(or) semantically unsound.
• Collects information about the source program and stores it in a data structure
called the symbol table.

Phases of a compiler:
Character stream → Lexical Analyzer → token stream → Syntax Analyzer → syntax tree → Semantic Analyzer → syntax tree → Intermediate Code Generator → intermediate representation → Machine-Independent Code Optimizer → intermediate representation → Code Generator → target machine code → Machine-Dependent Code Optimizer → target machine code
Fig. 1.10

★ Synthesis part (back end)


• Constructs desired target program from intermediate representations and
the information in the symbol table.

★ Symbol table − stores information about the entire source program and is used
by all phases of the compiler.
★ Some compilers have a machine-independent optimization phase between the
front end and back end in order to produce a better target program. It is an
optional phase.

★ Lexical Analyzer (Scanning)
• The first phase of the compiler. It reads a stream of characters and groups
the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces a token of the form
(token-name, attribute-value), which is passed to the next phase, syntax
analysis.
• token-name: an abstract symbol; attribute-value: points to an entry in the
symbol table for this token.
• position = initial + rate * 60 ⇒ an assignment statement, mapped to the
sequence of tokens ⟨id, 1⟩ ⟨=⟩ ⟨id, 2⟩ ⟨+⟩ ⟨id, 3⟩ ⟨*⟩ ⟨60⟩, where the token
names =, + and * are abstract symbols for the assignment, addition and
multiplication operators, id is the abstract symbol for an identifier, and 60
is a constant.

★ Syntax Analysis (parsing)

• It uses the tokens produced by the lexical analyzer to create a tree-like
intermediate representation that depicts the grammatical structure of the
token stream.

★ Syntax tree:
• Interior node represents the operation.

Fig. 1.11: syntax tree with = at the root; its children are ⟨id,1⟩ and +; the children of + are ⟨id,2⟩ and *; the children of * are ⟨id,3⟩ and 60.

• The children of a node represent the arguments of the operation.
• Syntax tree: interior node − operation; children − arguments of the operation.
• Context free grammars are used to specify the grammatical structure of
programming languages and are used for constructing efficient syntax
analyzers.
• Syntax directed definitions specify the translation of programming language
constructs.

★ Semantic Analysis
• It checks the source program for semantic consistency with the language
definition.

• Semantic analysis uses syntax tree and information in the symbol table.
• Gathers type information and saves it in syntax tree (or) symbol table
for subsequent use during intermediate code generation.

• Type checking − checks that each operator has matching operands.
• Coercions (conversions) are done, if they are permitted by the language
specification.
• If position, initial and rate are floating point numbers, then the lexeme 60
is converted to a floating point number.

Three address code:
t1 := inttofloat (60)
t2 := id3 * t1
t3 := id2 + t2
id1 := t3
Fig. 1.12: the corresponding annotated syntax tree, with inttofloat applied to 60 beneath the * node.

★ Intermediate Code Generation

• Generates explicit low-level (or) machine-like intermediate representation


(program for abstract machine).

• Properties of a good IR: easy to produce, and easy to translate into the target machine.
• Eg: Three-address code (at most 3 operands per instruction). Each operand can act like a register.

★ Code optimization

• Improves the intermediate code so that better target code results [shorter,
faster, or consuming less power].

t1 : = id3 ∗ 60.0

id1 : = id2 + t1

• In case of Optimizing compilers − significant amount of time is spent on


this phase.

• Machine dependent & Machine independent optimizations.

• Improves the running time of target program.

★ Code Generation

• Takes an intermediate representation of the source program and maps


into the target language.

• Registers (or) Memory locations are selected for each variables used by
the program.

• Crucial aspect − judicious assignment of registers to hold variables.

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
The F in each instruction tells us that it deals with floating point numbers.



• Storage allocation decisions are made either during intermediate code


generation (or) during code generation.

★ Symbol table management

• Records variable names and information about various attributes of each


name.

• name, type, scope, number and types of argument, method of passing


type returned.

• Symbol table is a data structure containing a record for each variable


name with fields for the attributes of the name.

• Storage and retrieval from the record should be quick

Symbol Table
1 | position | ...
2 | initial  | ...
3 | rate     | ...
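A minimal C sketch of such a symbol table is given below; the record fields, the fixed table size and the linear-search lookup are illustrative assumptions made for this sketch, not the book's prescribed implementation.

#include <stdio.h>
#include <string.h>

/* One record per name; the attribute fields shown are only examples. */
struct symrec {
    char name[32];      /* the lexeme, e.g. "position" */
    char type[16];      /* e.g. "float"                */
    int  scope;         /* block nesting level         */
};

static struct symrec table[100];
static int nsyms = 0;

/* Return the index of name, inserting a new record if it is not yet present. */
int install_id(const char *name)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                 /* quick retrieval of an existing record */
    if (nsyms == 100)
        return -1;                    /* table full (sketch only) */
    strncpy(table[nsyms].name, name, sizeof table[nsyms].name - 1);
    return nsyms++;
}

int main(void)
{
    /* entries 0, 1, 2 correspond to position, initial and rate above */
    printf("%d %d %d\n", install_id("position"), install_id("initial"), install_id("rate"));
    return 0;
}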

★ Grouping of phases into passes
• Phases are a logical organization of a compiler.
• Several phases may be grouped together into a pass.
• The front end phases − lexical analysis, syntax analysis, semantic analysis
and intermediate code generation − are typically grouped into one pass.
• Code optimization is an optional pass.



• A back end pass consists of code generation for a particular target machine.
• Compilers can be produced for different source languages for one target
machine by combining different front ends with that machine's back end.
• Compilers for different target machines can be produced by combining one
front end with back ends for the different target machines.

★ Compiler Construction Tools

• Use specialized languages for specifying and implementing specific


components & use quite sophisticated algorithms.

★ Parser generators produce syntax analyzers from a grammatical description
of a programming language.

• Scanner generator − produce lexical analyzers from a regular expression


description of the tokens of a language.

• Syntax directed translation engine − produce collection of routines for


walking a parse tree and generating intermediate code.

• Code generator − produce a code generator from a collection of rules for


translating each operation of the intermediate language into the machine
language (target machine) language.

• Data flow analysis engine is a key part of code optimization, which


gathers information about how values are transmitted from one part of
a program to each other part.

• Compiler construction toolkits are integrated sets of routines for constructing
the various phases of a compiler.

Phases of Compiler:
Input: a = b + c * 60 (character stream)
Lexical analyzer → token stream: id1 = id2 + id3 * 60
Syntax analyzer → syntax tree: = at the root with children id1 and +; + has children id2 and *; * has children id3 and 60
Semantic analyzer → semantic tree: the same tree with inttofloat applied to 60
Intermediate code generator → intermediate code:
t1 := inttofloat (60)
t2 := id3 * t1
t3 := id2 + t2
id1 := t3
Code optimizer → optimized code:
t1 := id3 * 60.0
id1 := id2 + t1
Code generator → machine code:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The symbol table holds entries for a, b and c and is consulted by every phase.
Fig. 1.13

Cousins of the compiler:

→ Preprocessors

→ Assemblers

→ Two-pass assembler

→ Loaders & link editors

★ Preprocessors

• Produce input to compilers.


• Macro processing (shorthands for longer constructs).
• File Inclusion (headerfiles)
• Rational preprocessors (macros for constructs like while, if − statements
where none exist in the programming language itself − augment older
languages).

• Language extensions (eg: database query language embedded in C) − add


capability to the language.

★ Assemblers

• assembly code − mnemonic version of machine code − names are used


instead of binary codes.

• produces relocatable machine code that can be passed directly to the
loader/link-editor.

★ Two-pass assembler

• simplest form of assembler.


• pass: reading an input file once.

• first pass: identifiers are found, stored in symbol table and are assigned
storage locations.

• second pass: translates each operation code into sequence of bits,


identifies location into address.

• output: relocatable machine code.



★ Loaders and Link Editors


• Loader: the relocatable machine code is altered and the altered instructions
and data are placed in memory at the proper locations.

• Link editor: makes a single program from several files of relocatable


machine code − external references (code of one file refers to location in
another file) are maintained in the symbol table.
1.7. ROLE OF LEXICAL ANALYZER:
★ First phase of a compiler.
★ Reads the input characters and produces a sequence of tokens for syntax
analysis (the parser).
★ Interaction of the lexical analyzer with the parser:

Fig. 1.14: the lexical analyzer reads the source program and returns a token whenever the parser issues a get-next-token request; both consult the symbol table.
★ Secondary tasks − stripping out comments and white space (blanks, tabs,
newlines).
• Correlating error messages from the compiler with the source program
(e.g. associating an error message with its line number).
• Implementing preprocessor functions.
★ Sometimes split into two phases: scanning − performs the simple tasks;
lexical analysis − performs the more complex operations.

Issues in lexical analysis:

★ Simpler design: separation of lexical analysis from syntax analysis.


★ Compiler efficiency is improved: specialized buffering techniques can be used
for reading input characters and processing tokens.
★ Compiler portability is enhanced.

1.7.1. Lexical Errors:

★ Few errors are discernible at the lexical level alone.
eg: fi (a = = f (x))
fi − a misspelling of the keyword if (or) an undeclared function identifier.
★ Panic mode recovery: delete successive characters from the remaining input
until a well-formed token is found.
★ Other error recovery actions:
• deleting an extraneous character
• inserting a missing character
• replacing an incorrect character by a correct character
• transposing two adjacent characters
★ Minimum distance error correction: too costly to implement.
1.7.2. Token, Patterns, Lexemes:

★ Token: a name for a class of lexical units; tokens appear as terminal symbols
in the grammar.
★ Pattern: the rule associated with a token.
★ Lexeme: a sequence of characters in the source program that is matched by
the pattern for a token.
eg: const pi = 3.1416

Token | Sample lexemes | Informal description of pattern
const | const | const
if | if | if
relation | <, <=, =, <>, >, >= | < or <= or = or <> or > or >=
id | pi, count, D2 | letter followed by letters and digits
literal | "core dumped" | any characters between " and " except "

★ token − terminal symbols in the grammar.
★ tokens − keywords, operators, identifiers, constants, literal strings,
punctuation symbols (e.g. parentheses, commas, semicolons).
★ If keywords are not reserved, then the lexical analyzer must distinguish
between a keyword and a user defined identifier (eg. PL/I).

Attributes for tokens:

★ When more than one pattern matches a lexeme, the lexical analyzer must
provide additional information about the particular lexeme that matched to
the subsequent phases of the compiler.

★ Tokens influence parsing decisions where Attributes influence the translation


of tokens.

★ Line number on which the identifier first appears and its lexeme are stored
in symbol table.

1.8. INPUT BUFFERING:

★ Three general approaches to the implementation of a lexical analyzer.

• Use a lexical analyzer generator (Lex compiler) from a regular expression


based specification generator provides routines for reading and buffering
the input.

• Write the lexical analyzer in a conventional systems programming


language, using the I/O facilities of that language to read the input.

• Write the lexical analyzer in assembly language and explicitly manage the
reading of input.
★ The speed of lexical analysis is a concern in compiler design (the
harder-to-implement approaches often yield faster lexical analyzers).

★ Buffer pairs
• The buffer is divided into two N-character halves, where N is the number of
characters in one disk block (e.g. 1024 or 4096).
Code to advance the forward pointer:
if forward at end of first half then begin
    reload second half
    forward = forward + 1
end
else if forward at end of second half then begin
    reload first half
    move forward to beginning of first half
end
else forward = forward + 1

★ Sentinels − a special character (eof) that cannot be part of the source program;
each buffer half holds a sentinel character at its end.
Lookahead code with sentinels:
forward = forward + 1
if forward points to eof then begin
    if forward at end of first half then begin
        reload second half
        forward = forward + 1
    end
    else if forward at end of second half then begin
        reload first half
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
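The two-buffer scheme above can be sketched in C roughly as follows. The buffer size N, the use of '\0' as the sentinel and the helper names are assumptions made for this illustration; only the forward-advance logic is shown (no lexeme retraction), and the sketch assumes the input itself contains no NUL bytes.

#include <stdio.h>

#define N 4096                       /* one disk block per buffer half          */
#define SENTINEL '\0'                /* stands in for the book's "eof" sentinel */

static char buf[2 * N + 2];          /* two halves, each followed by a sentinel slot */
static char *forward;                /* lookahead pointer                       */
static FILE *src;

/* Read up to N characters into one half and close it with the sentinel. */
static void reload(char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* Advance forward by one character; reload a half whenever its end sentinel is hit. */
static int next_char(void)
{
    for (;;) {
        char c = *forward++;
        if (c != SENTINEL)
            return (unsigned char)c;
        if (forward == buf + N + 1)              /* sentinel at end of first half  */
            reload(buf + N + 1);                 /* reload second half             */
        else if (forward == buf + 2 * N + 2) {   /* sentinel at end of second half */
            reload(buf);
            forward = buf;                       /* wrap back to the first half    */
        } else
            return EOF;                          /* sentinel inside a half: real end of input */
    }
}

int main(void)
{
    src = fopen("input.txt", "r");               /* hypothetical input file */
    if (!src) return 1;
    reload(buf);
    forward = buf;
    for (int c; (c = next_char()) != EOF; )
        putchar(c);                              /* a real scanner would build lexemes here */
    fclose(src);
    return 0;
}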

1.9. SPECIFICATION OF TOKENS:

★ Regular Expression − Important notation for specifying pattern

− Serve as names for sets of strings

− Extended into pattern directed languages for


lexical analysis

★ Strings and Languages
• An alphabet (or character class) denotes any finite set of symbols (letters
and characters).
• { 0, 1 } − the binary alphabet
• ASCII and EBCDIC − computer alphabets.
• A string over some alphabet is a finite sequence of symbols drawn from that
alphabet.
• sentence and word are synonyms for the term string.
• |S| − the length of string S, the number of occurrences of symbols in S.
• The empty string ∈ has length 0.


Terms for parts of a string (eg: S = banana):
prefix of S − zero or more trailing symbols deleted (e.g. ban)
suffix of S − zero or more leading symbols deleted (e.g. nana)
substring of S − obtained by deleting a prefix and a suffix of S (e.g. nan)
subsequence of S − obtained by deleting zero or more, not necessarily contiguous, symbols (e.g. baan)
For every string S, both S itself and ∈ are prefixes, suffixes and substrings of S.
A proper prefix, suffix or substring of S is a non-empty string x, with the above property, such that x ≠ S.

• Language − any set of strings over some fixed alphabet.

• x = dog, y = house
xy → the concatenation of x and y → doghouse.
• The empty string is the identity element under concatenation:
S∈ = ∈S = S
• If concatenation is considered as a product, then exponentiation of strings is
defined by S^0 = ∈ and S^i = S^(i−1) S for i > 0,
so S^1 = S, S^2 = SS, S^3 = SSS, … (e.g. if S = ab then S^3 = ababab).

★ Operations on languages
Operation | Definition
Union of L and M: L ∪ M | L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M: LM | LM = { st | s is in L and t is in M }
Kleene closure of L: L* | L* = ∪ (i = 0 to ∞) L^i  (zero or more concatenations of L)
Positive closure of L: L+ | L+ = ∪ (i = 1 to ∞) L^i  (one or more concatenations of L)
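For instance (an illustrative example, not one from the text), if L = { a, b } and M = { c, d } then
L ∪ M = { a, b, c, d }
LM = { ac, ad, bc, bd }
L* = { ∈, a, b, aa, ab, ba, bb, aaa, … } (all strings of a's and b's, including ∈)
L+ = LL* (all non-empty strings of a's and b's).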

★ Regular Expressions
• Built out of simpler regular expressions using a set of defining rules.
• Each regular expression r denotes a language L(r).
• Let Σ = { a, b }
The regular expression a|b denotes the set { a, b }
The regular expression (a|b)(a|b) denotes the set { aa, ab, ba, bb }
The regular expression a* denotes { ∈, a, aa, aaa, … }
The regular expression (a|b)* denotes the set of all strings of a's and b's; the same set is denoted by (a*b*)*
The regular expression a|a*b denotes the set containing the string a and all strings consisting of zero or more a's followed by a b.

Basis of the definition
• ∈ is a regular expression that denotes { ∈ }.
• If a is a symbol in Σ, then a is a regular expression that denotes { a }.
Inductive step:
• If r and s are regular expressions denoting the languages L(r) and L(s), then
(i) (r)|(s) is a regular expression denoting L(r) ∪ L(s)
(ii) (r)(s) is a regular expression denoting L(r) L(s)
(iii) (r)* is a regular expression denoting (L(r))*
(iv) (r) is a regular expression denoting L(r)
• A language denoted by a regular expression is said to be a regular set.
• Unnecessary parentheses can be avoided in regular expressions if the following
conventions are used:
the unary operator * has the highest precedence and is left associative;
concatenation has the second highest precedence and is left associative;
| has the lowest precedence and is left associative.
Under these conventions, (a)|((b)*(c)) may be written a|b*c.
• If two regular expressions r and s denote the same language, then r = s;
eg: (a|b) = (b|a).

Algebraic properties of regular expressions
Axiom | Description
r|s = s|r | | is commutative
r|(s|t) = (r|s)|t | | is associative
(rs)t = r(st) | concatenation is associative
r(s|t) = rs|rt ;  (s|t)r = sr|tr | concatenation distributes over |
∈r = r ;  r∈ = r | ∈ is the identity element for concatenation
r* = (r|∈)* | relation between * and ∈
r** = r* | * is idempotent

★ Regular definitions
• Give names to regular expressions.
• A sequence of definitions of the form
d1 → r1
d2 → r2
…
dn → rn
where each di is a distinct name and each ri is a regular expression over the symbols in
Σ ∪ { d1, d2, …, di−1 }
(the basic symbols and the previously defined names).
eg: letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter (letter | digit)*
digits → digit digit*
optional-fraction → . digits | ∈
optional-exponent → (E (+ | − | ∈) digits) | ∈
num → digits optional-fraction optional-exponent
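As a quick check of these definitions (the lexeme 6.336E4 is an assumed example, not one from the text), num matches 6.336E4 as follows: digits matches 6, optional-fraction matches .336, and optional-exponent matches E4, with the sign part matching ∈.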

★ Notational shorthands
• + : one or more instances
r+ = rr*  (positive closure operator)
r* = r+ | ∈  (Kleene closure)
• ? : zero or one instance
r? is shorthand for r | ∈
digit → 0 | 1 | … | 9
digits → digit+
optional-fraction → (. digits)?
optional-exponent → (E (+ | −)? digits)?
num → digits optional-fraction optional-exponent
• character classes
[a − z] denotes the regular expression a | b | … | z
[A − Z a − z] [A − Z a − z 0 − 9]* ⇒ identifiers

★ Non-regular sets
• Some languages cannot be described by any regular expression.
• Regular expressions cannot be used to describe balanced or nested constructs;
these can be specified by context free grammars.
• Repeating strings cannot be described by regular expressions nor by
context-free grammars:
{ wcw | w is a string of a's and b's }
• Regular expressions can be used to denote only a fixed number of repetitions
(or) an unspecified number of repetitions of a given construct.

1.10. RECOGNITION OF TOKEN:

Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ∈
expr → term relop term | term
term → id | num
if, then, else, relop, id, num → terminals which generate the sets of strings given by
the following regular definitions:
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter (letter | digit)*
num → digit+ (. digit+)? (E (+|−)? digit+)?

• Assume lexemes are separated by white space:
delim → blank | tab | newline
ws → delim+

Regular expression patterns for tokens
Regular expression | Token name | Attribute value
ws | − | −
if | if | −
then | then | −
else | else | −
id | id | pointer to table entry
num | num | pointer to table entry
< | relop | LT
<= | relop | LE
= | relop | EQ
<> | relop | NE
> | relop | GT
>= | relop | GE

★ Transition diagrams
• An intermediate step in the construction of a lexical analyzer.
• A transition diagram is a stylized flowchart.
• It depicts the actions that take place when the lexical analyzer is called by
the parser to get the next token.
• It keeps track of information about the characters seen so far.
• circles → states (s)
arrows → edges
other → any character that is not indicated by any of the other edges.
• Deterministic ⇒ no symbol can match the labels of two edges leaving one state.

Transition diagram for relational operators:
Fig. 1.15: start state 0; on < go to state 1, then on = accept LE (state 2), on > accept NE (state 3), and on any other character retract and accept LT (state 4, marked *); on = accept EQ (state 5); on > go to state 6, then on = accept GE (state 7), and on any other character retract and accept GT (state 8, marked *).
Transition diagram for identifiers and keywords:
Fig. 1.16: start state 9; on a letter go to state 10, which loops on letter or digit; on any other character retract (state 11, marked *) and return (gettoken(), install_id()), where install_id() returns a pointer to the symbol table entry.

Transition diagram for unsigned numbers:
Fig. 1.17: three diagrams are shown. States 12-19 accept numbers with a fraction and an exponent (digits, then ., then digits, then E with an optional + or −, then digits); states 20-24 accept numbers with a fraction only; states 25-27 accept plain integers. The states marked * retract the last character read.
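The diagram of Fig. 1.15 translates almost directly into code. The C sketch below simulates it on standard input; the enum constant names and the use of ungetc() to model the retracting states (marked *) are assumptions of this sketch rather than the book's notation.

#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE };

/* Simulate the transition diagram of Fig. 1.15.
 * Returns the relop attribute value, or -1 if the input does not start with <, = or >.
 * States marked * in the diagram retract one character, modelled here with ungetc(). */
int get_relop(FILE *in)
{
    int c = fgetc(in);
    if (c == '<') {                       /* state 1 */
        c = fgetc(in);
        if (c == '=') return LE;          /* state 2: return (relop, LE) */
        if (c == '>') return NE;          /* state 3: return (relop, NE) */
        ungetc(c, in);  return LT;        /* state 4: other, retract, return (relop, LT) */
    }
    if (c == '=') return EQ;              /* state 5: return (relop, EQ) */
    if (c == '>') {                       /* state 6 */
        c = fgetc(in);
        if (c == '=') return GE;          /* state 7: return (relop, GE) */
        ungetc(c, in);  return GT;        /* state 8: other, retract, return (relop, GT) */
    }
    ungetc(c, in);                        /* not a relational operator at all */
    return -1;
}

int main(void)
{
    printf("relop attribute = %d\n", get_relop(stdin));
    return 0;
}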

1.11. LANGUAGE FOR SPECIFYING LEXICAL ANALYZER:

Non-Regular Languages:

Some languages cannot be represented by any regular expression. They are
called non-regular languages. For example:
1. Balanced (or) nested constructs.
2. Two arbitrary numbers cannot be compared.
3. Palindromes cannot be checked.

LEX (Lexical Analyzer Generator):

(i) The construction of scanner for a new language takes much time and
hence to automate the construction of scanners for the new
programming language, several built in tools have been developed.

(ii) ‘LEX’ is such a tool used to generate lexical analyzer for a variety of
languages.

(iii) The input to the LEX compiler is a program in the LEX language:
Fig. 1.18: the LEX source program (lex.l) is fed to the LEX compiler, producing lex.yy.c (a tabular representation of a transition diagram together with a routine to recognize lexemes); lex.yy.c is compiled by the C compiler into a.out; the input stream is then fed to a.out, which produces the sequence of tokens.
(iv) LEX Specifications:
A LEX program consists of 3 parts:
Declarations
% %

Translation rules

% %

Auxiliary procedures

(v) Declarations:
This includes declaration of variables, regular definitions and manifest constant,
identifiers that is declared to represent a constant.
Note: A manifest constant is an identifier that is declared to represent a
constant.
(vi) Translation Rules:
These are statements of the form,
p1 { ACTION1 }

p2 { ACTION2 }

pn { ACTIONn }

where each pi represents a Regular Expression

eg: { if } { return (IF) ; }

{ id } { yylval = install − id ( ) ; return (ID) ; }

Actioni → Program fragment describing the action taken by lexical analyzer


when the pattern pi matches the lexeme.

(vii) Auxiliary procedures:

Those procedures that are needed by the actions are specified in this part.

(viii) The lookahead operator:
In LEX, a pattern can be written as r1/r2, where
r1, r2 → regular expressions
r1/r2 → match a string for r1 only if it is followed by a string for r2,
and '/' is the lookahead operator.
Eg:
DO / ({letter}|{digit})* = ({letter}|{digit})*,
With this, the lexical analyzer will look ahead in its input buffer for a sequence of
letters and digits, followed by an equal sign, followed by letters and digits, followed
by a comma.
If this matches, it will treat 'DO' as a keyword.
Otherwise, it will be matched as part of a variable name.
DO5I = 1.25 → DO5I − variable
DO5I = 1, 25 → DO − keyword

(ix) LEX program to identify constants, variables, keywords & relational operators.

MODEL OF LEX COMPILER:

Fig. 1.19: the LEX specification is fed to the LEX compiler, which produces a transition table; the FA simulator scans lexemes in the input buffer (using a lookahead pointer) and consults the (DFA) transition table.
%{ /* definitions of manifest constants
LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        { return (IF); }
then      { return (THEN); }
else      { return (ELSE); }
{id}      { yylval = install_id();  return (ID); }
{number}  { yylval = install_num(); return (NUMBER); }
"<"       { yylval = LT; return (RELOP); }
"<="      { yylval = LE; return (RELOP); }
"="       { yylval = EQ; return (RELOP); }
">"       { yylval = GT; return (RELOP); }
">="      { yylval = GE; return (RELOP); }
"<>"      { yylval = NE; return (RELOP); }

%%

install_id ( )
{
    /* procedure to install the lexeme, whose first character is pointed to by yytext
       and whose length is yyleng, into the symbol table and return a pointer thereto */
}
install_num ( )
{
    /* similar procedure to install a number into the symbol table */
}
Some variables used in this program are:

1. yylval − stores the lexical value returned


2. yytext − pointer to first character of lexeme.
3. yyleng − length of the lexeme

Implementation of lex based on FA:


(a) NDFA:
★ Build the NFA for a composite pattern p1 | p2 | … | pn.
{ First create NFAs for the individual patterns pi. Then create a start state S0. The combined NFA is obtained by connecting S0 to the start states of the individual NFAs with ε-transitions. }
Fig. 1.20: start state S0 with ε-transitions to N(p1), N(p2) and N(p3).

★ At each step, the sequence of steps that the combined NFA will be in after
seeing each input character should be constructed.

★ If there is an accepting state in current set of states, record the current


input position & the pattern pi that gets matched.

★ Continue the transitions until termination. Upon termination, retract the


forward pointer to the position at which last match occurred.

★ If no pattern matches, then transfer control to error recovery routine.

★ If more than one match occurs, then return the pattern that appears first.

(b) Deterministic FA:

Convert the NFA into DFA. If more than one accepting state occurs, then the
pattern that appears first has the priority.

LEX program to count the number of spaces, characters, lines and words:

/* declaration part */
%{
int c = 0, w = 0, l = 0, s = 0;
%}

%%
  /* translation rules */
[\n]        { l++; s++; }
[ \t]       { s++; }
[^ \t\n]+   { w++; c += yyleng; }
%%

int main (int argc, char *argv[])
{
    if (argc == 2)
    {
        yyin = fopen (argv[1], "r");
        yylex ();
        printf ("\n number of spaces = %d", s);
        printf ("\n characters = %d", c);
        printf ("\n lines = %d", l);
        printf ("\n words = %d\n", w);
    }
    else
        printf ("error");
    return 0;
}

Program to implement a lexical analyzer:

%{
/* optional */
#include <stdio.h>
%}

/* regular definitions */
letter     [a-zA-Z]
digit      [0-9]
operators  [+*/=]

%%
"#include<stdio.h>"|void|main|int|float|char|printf|while|do|for|if|else   { printf ("keyword"); }
{letter}({letter}|{digit})*    { printf ("variable"); }
{digit}+                       { printf ("number"); }
{operators}+                   { printf ("special operator"); }
%%

int main (int argc, char *argv[])
{
    yyin = fopen (argv[1], "r");
    yylex ();
    return 0;
}

yylex ()  − the function we have to invoke to start the process; lex will scan the file pointed to by yyin.
yyleng    − returns the length of the string matched, held in yytext.
yyout     − pointer to the file where the output is to be kept after scanning the whole file.
yylval    − for each keyword recognized, the token value generated is kept in yylval.


1.11.1. Lex program to count total number of tokens:


%{
int n = 0;
%}

/* rule section */
%%

  /* count number of keywords */
"while"|"if"|"else"      { n++; printf ("\t keyword : %s", yytext); }
"int"|"float"            { n++; printf ("\t keyword : %s", yytext); }

  /* count number of identifiers */
[a-zA-Z_][a-zA-Z0-9_]*   { n++; printf ("\t identifier : %s", yytext); }

  /* count number of operators */
"<="|"=="|"="|"++"|"-"|"*"|"+"   { n++; printf ("\t operator : %s", yytext); }

  /* count number of separators */
[(){}|,;]                { n++; printf ("\t separator : %s", yytext); }

  /* count number of floats */
[0-9]*"."[0-9]+          { n++; printf ("\t float : %s", yytext); }

  /* count number of integers */
[0-9]+                   { n++; printf ("\t integer : %s", yytext); }

.  ;

%%

int main ()
{
    yylex ();
    printf ("\n total no. of tokens = %d\n", n);
    return 0;
}

1.12. FINITE AUTOMATA:


★ Finite automata are used to recognize patterns.
★ It takes the string of symbol as input and changes its state accordingly.
When the desired symbol is found, then the transition occurs.

★ At the time of transition, the automata can either move to the next state
or stay in the same state.

★ Finite automata have two states, Accept state or Reject state. When the
input string is processed successfully, and the automata reached its final
state, then it will accept.
Formal Definition of FA:
A finite automaton is a collection of 5-tuple (Q, Σ , δ , q0, F), where:
1. Q: finite set of states

2. Σ: finite set of the input symbol


3. q0: initial state
4. F: final state

5. δ: Transition function
Finite Automata Model: Finite automata can be represented by input tape and
finite control.

Input tape: It is a linear tape having some number of cells. Each input symbol is
placed in each cell.

Finite control: The finite control decides the next state on receiving particular input
from input tape. The tape reader reads the cells one by one from left to right, and
at a time only one input symbol is read.

Fig. 1.21: Finite automata model − an input tape containing the symbols a b c a b b a, one per cell; the tape reader reads the input symbols one at a time and passes them to the finite control.
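The finite control of this model is naturally implemented as a table-driven loop. The C sketch below simulates a small DFA over { 0, 1 } that accepts strings ending in 01; the particular automaton and the table layout are assumed examples chosen only to illustrate the simulation, not something defined in the text.

#include <stdio.h>

/* States: 0 (start), 1 (last symbol was 0), 2 (string ends in 01, accepting). */
static const int delta[3][2] = {
    /*        on 0  on 1 */
    /* q0 */ { 1,    0 },
    /* q1 */ { 1,    2 },
    /* q2 */ { 1,    0 },
};
static const int accepting[3] = { 0, 0, 1 };

/* Read the input string cell by cell, exactly as the tape reader does. */
int dfa_accepts(const char *s)
{
    int state = 0;                        /* initial state q0 */
    for (; *s; s++) {
        if (*s != '0' && *s != '1')
            return 0;                     /* symbol outside the alphabet: reject */
        state = delta[state][*s - '0'];   /* one move of the finite control      */
    }
    return accepting[state];              /* accept iff we stop in a final state */
}

int main(void)
{
    printf("%d %d\n", dfa_accepts("1101"), dfa_accepts("10"));   /* prints 1 0 */
    return 0;
}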

1.12.1. Role of Finite Automata in Lexical Analysis:


Frontend Overview
Fig. 1.22: Compiler frontend − program source → lexical analysis → token sequence → syntax analysis → parse tree → semantical analysis → abstract syntax tree; all phases consult the symbol table.
★ Lexical Analysis: Identify atomic language constructs.
Each type of construct is represented by a token.
(e.g. 3.14 → FLOAT, if → IF, a → ID).

★ Syntax Analysis: Checks if the token sequence is correct with respect


to the language specification.

★ Semantical Analysis: Checks type relations + consistency rules.


(e.g. if type (lhs) = type (rhs) in an assignment lhs = rhs).

Each step involves a transformation from a program representation to another.


Lexical Analysis Overview:
Fig. 1.23: a lexical specification (regular expressions) drives the scanner/tokenizer (finite automata), which converts the character sequence
if (x > 1) { x = x + 1; } else { x = x - 1; }
into the token sequence IF, LP, ID, RelOp, RP, LB; ID, ASSGN, ID, AddOp, ID, SC; RB, ELSE, LB; ID, ASSGN, ID, AddOp, ID, SC; RB.

★ Input program representation: Character sequence

★ Output program representation: Token sequence

★ Analysis specification: Regular expressions



★ Recognizing (abstract) machine: Finite Automata


★ Implementation: Finite Automata
Lexical Analysis Specification
Token | Patterns | Action
WS | (blank|tab|newline)+ | skip
... | ... | ...
IF | if | genToken();
... | ... | ...
RelOP | < | <= | = | >= | > | genToken(); addAttr();
... | ... | ...
ID | [a−zA−Z][a−zA−Z1−9]* | genToken(); updateSymTab();
... | ... | ...

★ In theory: (ρ1|ρ2|...|ρn)*, where ρi are the above patterns, defines all


lexically correct programs. ⇒ the set of lexically correct programs is a
regular language.

★ In practice: we recognize each pattern individually.


★ This type of specification is input to AntLR in Practical Assignment 1.
String recognition using Finite Automata
A finite automaton is an abstract machine that can be used to identify strings
specified by regular expressions.

Fig. 1.24: initial state 0; on a go to state 1 and on b go to state 2; from states 1 and 2, b leads to state 3; state 3 loops on a; from state 3, b leads to the final state 4.

★ The above finite automata recognizes the pattern (a|b) ba*b.


★ Every input character (a or b) causes a transition from one state to another.
★ A string is accepted if we end up in a final state once every character has been
processed.

★ No possible transition ⇒ rejection ⇒ the string is not part of the language.


1.12.2. Types of Finite Automata:
There are two types of finite automata:
1. DFA (deterministic finite automata)

2. NFA (non-deterministic finite automata)

Fig. 1.25: Finite Automata are classified into Deterministic Finite Automata (DFA) and Non-deterministic Finite Automata (NFA).
1. DFA:

DFA refers to deterministic finite automata. Deterministic refers to the


uniqueness of the computation. In the DFA, the machine goes to one state only for
a particular input character. DFA does not accept the null move.

2. NFA:

NFA stands for non-deterministic finite automata. For a particular input it can
move to any of several states. It can accept the null move.

Few points about DFA and NFA:

1. Every DFA is NFA, but NFA is not DFA.

2. There can be multiple final states in both NFA and DFA.

3. DFA is used in Lexical Analysis in Compiler.

4. NFA is more of a theoretical concept.



1.12.3. NFA with ∈ Closure:


Non-deterministic Finite Automata (NFA):
An NFA is a finite automaton where, in some cases, when a single input is given
in a single state, the machine can go to more than one state, i.e. some of the moves
cannot be uniquely determined by the present state and the present input symbol.
An NFA can be represented as M = { Q, Σ , δ , q0, F }

Q → Finite non-empty set of states.
Σ → Finite non-empty set of input symbols.
δ → Transition function.
q0 → Beginning (initial) state.
F → Final state.
NFA with ε (null) move: If any finite automaton contains an ε (null) move or
transition, then that finite automaton is called an NFA with ∈ moves.

Example: Consider the following NFA with ∈ moves (Fig. 1.26), whose transition state table is:
STATES | 0 | 1 | epsilon
A | B, C | A | B
B | − | B | C
C | C | C | −

Epsilon (∈) closure: The epsilon closure for a given state X is the set of states which
can be reached from the state X with only ε (null) moves, including the state X
itself. In other words, the ε-closure for a state can be obtained by taking the union of
the ε-closures of the states which can be reached from X with a single ε move. For
the NFA above:
∈ closure (A) : {A, B, C}
∈ closure (B) : {B, C}
∈ closure (C) : {C}
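The closure computation is a small graph search. The C sketch below computes the same three closures by depth-first search; encoding the states A, B, C as 0, 1, 2 and the ε-move table are choices made for this sketch.

#include <stdio.h>

#define NSTATES 3                      /* A = 0, B = 1, C = 2 */

/* eps[s][t] = 1 if there is an epsilon move from s to t (A --e--> B, B --e--> C). */
static const int eps[NSTATES][NSTATES] = {
    { 0, 1, 0 },
    { 0, 0, 1 },
    { 0, 0, 0 },
};

/* Mark every state reachable from s using only epsilon moves, including s itself. */
static void eps_closure(int s, int in_closure[NSTATES])
{
    if (in_closure[s]) return;
    in_closure[s] = 1;
    for (int t = 0; t < NSTATES; t++)
        if (eps[s][t])
            eps_closure(t, in_closure);
}

int main(void)
{
    const char *name = "ABC";
    for (int s = 0; s < NSTATES; s++) {
        int cl[NSTATES] = { 0 };
        eps_closure(s, cl);
        printf("eps-closure(%c) = {", name[s]);
        for (int t = 0; t < NSTATES; t++)
            if (cl[t]) printf(" %c", name[t]);
        printf(" }\n");                  /* prints {A B C}, {B C}, {C} as above */
    }
    return 0;
}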

1.12.4. Conversion from NFA with ε to DFA:

Non-deterministic finite automata (NFA) is a finite automata where for some


cases when a specific input is given to the current state, the machine goes to multiple
states or more than 1 states. It can contain ε move. It can be represented as
M = { Q, Σ , δ , q0, F }.

Where

1. Q : finite set of states

2. Σ : finite set of the input symbol

3. q0 : initial state

4. F : final state

5. δ : Transition function

NFA with ∈ move: If any FA contains an ε transition or move, the finite automaton
is called an NFA with ∈ moves.
ε-closure: the ε-closure for a given state A means the set of states which can be reached
from state A with only ε (null) moves, including the state A itself.
Steps for converting an NFA with ε to a DFA:

Step 1: Take the ε-closure of the starting state of the NFA as the starting state
of the DFA.
Step 2: For each input symbol, find the set of states that can be reached from the
present state, i.e. the union of the transition values and their ε-closures for each
state of the NFA present in the current state of the DFA.
Step 3: If we found a new state, take it as current state and repeat step 2.
Step 4: Repeat Step 2 and Step 3 until there is no new state present in the transition
table of DFA.
Step 5: Mark the states of DFA as a final state which contains the final state of
NFA.

Problem 1: Convert the NFA with ε into its equivalent DFA.
Fig. 1.27: start state q0 with ε-moves to q1 and q2; q1 goes to q3 on 0; q2 goes to q3 on 1; q3 goes to q4 on 1.

Solution:

Let us obtain ε-closure of each state.

1. ε-closure {q0} = {q0, q1, q2}

2. ε-closure {q1} = {q1}

3. ε-closure {q2} = {q2}

4. ε-closure {q3} = {q3}

5. ε-closure {q4} = {q4}

Now, let ε-closure {q0} = {q0, q1, q2} be state A. Hence

δ′ (A, 0) = ε−closure { δ ((q0, q1, q2), 0) }

= ε−closure { δ (q0, 0) ∪ δ (q1, 0) ∪ δ (q2, 0) }

= ε−closure { q3 }

= { q3 } call it as state B.

δ′ (A, 1) = ε−closure { δ ((q0, q1, q2), 1) }

= ε−closure {δ ((q0, 1) ∪ δ (q1, 1) ∪ δ (q2, 1) }

= ε−closure { q3 }

= { q3 } = B.

The partial DFA will be:
Fig. 1.28: state A goes to state B on both 0 and 1.
Now, for state B:
δ′ (B, 0) = ε−closure { δ (q3, 0) } = φ
δ′ (B, 1) = ε−closure { δ (q3, 1) }
= ε−closure { q4 }
= { q4 }, call it state C.
For state C:
δ′ (C, 0) = ε−closure { δ (q4, 0) } = φ
δ′ (C, 1) = ε−closure { δ (q4, 1) } = φ

The DFA will be:
Fig. 1.29: state A goes to state B on both 0 and 1, and state B goes to state C on 1.
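The construction carried out by hand above can also be mechanized. The C sketch below encodes the NFA of Problem 1 with bitmasks and prints exactly the DFA transitions derived above (A --0--> B, A --1--> B, B --1--> C); the bitmask encoding and the output format are choices of this sketch.

#include <stdio.h>

/* NFA of Problem 1: states q0..q4 are bits 0..4.
 * Moves: q0 -e-> q1, q0 -e-> q2, q1 -0-> q3, q2 -1-> q3, q3 -1-> q4. */
enum { NQ = 5, NSYM = 2 };

static const unsigned eps_move[NQ] = { 0x06, 0, 0, 0, 0 };   /* q0 -> {q1, q2} */
static const unsigned move[NQ][NSYM] = {
    { 0,    0    },          /* q0 */
    { 0x08, 0    },          /* q1 on 0 -> q3 */
    { 0,    0x08 },          /* q2 on 1 -> q3 */
    { 0,    0x10 },          /* q3 on 1 -> q4 */
    { 0,    0    },          /* q4 */
};

/* epsilon-closure of a set of states, iterated to a fixed point */
static unsigned closure(unsigned set)
{
    unsigned prev;
    do {
        prev = set;
        for (int q = 0; q < NQ; q++)
            if (set & (1u << q))
                set |= eps_move[q];
    } while (set != prev);
    return set;
}

/* Step 2 of the construction: union of moves on symbol a, then take the closure. */
static unsigned dfa_move(unsigned set, int a)
{
    unsigned out = 0;
    for (int q = 0; q < NQ; q++)
        if (set & (1u << q))
            out |= move[q][a];
    return closure(out);
}

int main(void)
{
    unsigned states[32];
    int n = 0;
    states[n++] = closure(1u << 0);              /* state A = eps-closure{q0} */
    for (int i = 0; i < n; i++)
        for (int a = 0; a < NSYM; a++) {
            unsigned t = dfa_move(states[i], a);
            if (t == 0) continue;                /* no move on this symbol */
            int j;
            for (j = 0; j < n && states[j] != t; j++) ;
            if (j == n) states[n++] = t;         /* a new DFA state was found (step 3) */
            printf("%c --%d--> %c\n", 'A' + i, a, 'A' + j);
        }
    return 0;
}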

OPTIMIZATION OF DFA – BASED PATTERN MATCHERS:

Outline:

★ Important states of an NFA.

★ Functions computed from the syntax tree.



★ Computing nullable, first pos & last pos.

★ Computing follow pos.

★ Converting a R.E Directly to a DFA.

★ Minimizing the number of states of a DFA.

★ State minimization of a lexical analyzers.

★ Trading time for a space in DFA simulation.

First Algorithm:

★ Constructs a DFA directly from a R.E.

★ Without constructing an intermediate NFA.

★ With fewer states.

Second Algorithm:

★ Minimizes the number of states of any DFA.

★ Combines states having the same future behavior O(n* log (n))

Third Algorithm:

★ Produces more compact representations of transition tables than the standard
two-dimensional table.

Important states of an NFA:

★ A state of an NFA is important if it has a non-ε out-transition.

★ The notion is used when computing ε − closure (move (T, a)) − the set of states
reachable from T on input a: the set move (s, a) is non-empty only if the state
s is important.

★ Two sets of NFA states can be identified (treated as the same) if they
• have the same important states, and
• either both have accepting states or neither does.



★ Important states
• The initial states created in the basis part of Thompson's construction, one
for a particular symbol position in the R.E., are important.
Fig. 1.30: basis NFA for a symbol a − start state i with an a-transition to the accepting state F.
• They correspond to particular operands in the RE.

★ Thompson's algorithm for constructing an NFA:
• produces only one accepting state, which is non-important (it has no
out-transitions).
★ We concatenate a unique right end marker # to the regular expression:
• The accepting state of the NFA for r becomes an important state in the (r) #
NFA.
• Any state of the (r) # NFA with a transition on # must be an accepting
state.

Syntax Tree:

★ Important states correspond to the positions in the R.E. that hold symbols of
the alphabet.

★ A regular expression is represented by a syntax tree:
• leaves correspond to operands.
• interior nodes correspond to operators:
(a) cat-node − concatenation operator (dot)
(b) or-node − union operator (|)
(c) star-node − star operator (*)

Syntax Tree example: (a/b)* abb #

Fig. 1.31: Syntax tree for (a/b)* abb # − the leaves a, b, a, b, b, # carry positions 1−6; cat-nodes are represented as circles.

Representation Rules:

★ Syntax tree leaves are labeled by ε or by an alphabet symbol.

★ To each leaf which is not ε we attach a unique integer.

• the position of the leaf

• the position of its symbol.

★ a symbol may have several positions

• symbol ‘a’ has positions 1 & 3.

★ Positions in the syntax tree correspond to NFA important-states.


Thompson-constructed NFA for (a/b)* abb #

Fig. 1.32: NFA for (a/b)* abb # built by Thompson's construction.

★ The important states are numbered.

★ Other states are represented by letters.

★ The correspondence between numbered states in the NFA and the positions
in the syntax tree.

Functions computed for the syntax Tree:

★ In order to construct a DFA directly from the R.E.

• to build the syntax tree.


• to compute 4 functions referring to (r) #:
(a) nullable

(b) first pos

(c) last pos

(d) follow pos

Nullable (n):

True for syntax tree node n iff the sub-expression represented by n.

(i) has ε in its language.

(ii) can be made null or the empty string.

(iii) even if it can also represent other strings



First pos (n):


Set of positions in the n-rooted subtree that correspond to the first symbol of
at least one string in the language of the subexpression rooted at n.

Last pos (n):


Set of positions in the n-rooted subtree that correspond to the last symbol of
at least one string in the language of the subexpression rooted at n.

Follow pos (n):


follow pos (p), for a position p, is the set of positions q such that there exists some string

x = a1 a2 … an in L ( (r) # )

such that, for some i, there is a way to explain the membership of x in L ( (r) # ) by
matching ai to position p of the syntax tree and ai + 1 to position q.

Example:
1. nullable (n) = false

2. first pos (n) = {1, 2, 3}

3. last pos (n) = {3}

4. follow pos (n) = {1, 2, 3}

Fig. 1.33: Subtree used in the example, with the leaf positions marked.

Computing nullable, first pos & last pos:

node n                      nullable (n)           first pos (n)                      last pos (n)

A leaf labeled ε            true                   φ                                  φ

A leaf with position i      false                  {i}                                {i}

An or-node n = c1 | c2      nullable (c1) or       first pos (c1) ∪ first pos (c2)    last pos (c1) ∪ last pos (c2)
                            nullable (c2)

A cat-node n = c1 ⋅ c2      nullable (c1) and      if nullable (c1) then              if nullable (c2) then
                            nullable (c2)          first pos (c1) ∪ first pos (c2)    last pos (c1) ∪ last pos (c2)
                                                   else first pos (c1)                else last pos (c2)

A star-node n = c1*         true                   first pos (c1)                     last pos (c1)

First pos and Last pos Example:


Fig. 1.34: Syntax tree for (a/b)* abb # annotated at each node with first pos (to the left of the node) and last pos (to the right); the root has first pos {1, 2, 3} and last pos {6}.

Computing Follow pos:

★ A position of a Regular Expression can follow another position in 2 ways:

1. if n is a cat node c1 ⋅ c2 (rule 1)


for every position i in last pos (c1) all positions in first pos (c2) are in
follow pos (i).

2. if n is a star-node (rule 2)
if i is a position in last pos (n) then all positions in first pos (n) are in
follow pos (i).

Applying rule 1 & 2 for previous syntax tree to find follow pos:

Applying rule 1

follow pos (1) includes {3}

follow pos (2) includes {3}

follow pos (3) includes {4}

follow pos (4) includes {5}

follow pos (5) includes {6}

Applying rule 2

follow pos (1) includes {1, 2}

follow pos (2) includes {1, 2}
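The four functions and the two follow pos rules can be written as a small Python sketch (illustrative only; the tuple encoding of the syntax tree used here is an assumption): a leaf is ('leaf', position) with position None for ε, and interior nodes are ('or', c1, c2), ('cat', c1, c2) and ('star', c1).

# Sketch of nullable, first pos, last pos and follow pos over a syntax tree.
def nullable(n):
    kind = n[0]
    if kind == 'leaf':
        return n[1] is None
    if kind == 'or':
        return nullable(n[1]) or nullable(n[2])
    if kind == 'cat':
        return nullable(n[1]) and nullable(n[2])
    return True                                        # star-node

def firstpos(n):
    kind = n[0]
    if kind == 'leaf':
        return set() if n[1] is None else {n[1]}
    if kind == 'or':
        return firstpos(n[1]) | firstpos(n[2])
    if kind == 'cat':
        c1, c2 = n[1], n[2]
        return firstpos(c1) | firstpos(c2) if nullable(c1) else firstpos(c1)
    return firstpos(n[1])                              # star-node

def lastpos(n):
    kind = n[0]
    if kind == 'leaf':
        return set() if n[1] is None else {n[1]}
    if kind == 'or':
        return lastpos(n[1]) | lastpos(n[2])
    if kind == 'cat':
        c1, c2 = n[1], n[2]
        return lastpos(c1) | lastpos(c2) if nullable(c2) else lastpos(c2)
    return lastpos(n[1])                               # star-node

def followpos(n, table):
    """Fill table (dict: position -> set of positions) by rules 1 and 2."""
    kind = n[0]
    if kind == 'cat':                                  # rule 1
        for i in lastpos(n[1]):
            table.setdefault(i, set()).update(firstpos(n[2]))
    elif kind == 'star':                               # rule 2
        for i in lastpos(n):
            table.setdefault(i, set()).update(firstpos(n))
    for child in n[1:]:
        if isinstance(child, tuple):
            followpos(child, table)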

Converting a R.E directly to a DFA:

Input:

Regular Expression r

Output:

DFA D that recognizes L (r).

Method:

★ to build the syntax tree T from (r) #

★ to compute nullable, first pos, last pos, follow pos.



★ to build
• D states the set of DFA states
(i) start state of D is first pos (no), where no is the root of T.

(ii) accepting states = those containing the # end marker symbol.

• Dtran the transition function for D.


Example for r = (a ⁄ b) ∗ abb.
A = firstpos (no) = {1,2,3}
D tran [A, a] =

followpos (1) ∪ followpos (3) = {1,2,3,4} = B

D tran [A, b] =
followpos (2) = {1,2,3} = A
D tran [B, a] =

followpos (1) ∪ followpos (3) = B

D tran [B, b] =

followpos (2) ∪ followpos (4) = {1,2,3,5} = C

D tran [C, a] =

followpos (1) ∪ followpos (3) = B

D tran [C, b] =
followpos (2) ∪ followpos (5) = {1,2,3,6} = D

D tran [D, a] =
followpos (1) ∪ followpos (3) = B
D tran [D, b] =
followpos (2) ∪ followpos (6) = {1,2,3} = A

Minimized DFA:

Fig. 1.35: DFA for (a/b)* abb #: A = {1,2,3} (start), B = {1,2,3,4}, C = {1,2,3,5}, D = {1,2,3,6} (accepting); transitions A --a--> B, A --b--> A, B --a--> B, B --b--> C, C --a--> B, C --b--> D, D --a--> B, D --b--> A.
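The construction of Dstates and Dtran just carried out by hand can be sketched in Python (an illustrative sketch; the follow pos table and the pos_symbol map below transcribe the values computed above for (a/b)* abb #):

# Sketch of building the DFA directly from followpos for (a|b)*abb#.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
pos_symbol = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b', 6: '#'}
alphabet = {'a', 'b'}                         # '#' is only the end marker

start = frozenset({1, 2, 3})                  # firstpos of the root of the syntax tree
dstates, unmarked, dtran = {start}, [start], {}
while unmarked:
    S = unmarked.pop()
    for a in alphabet:
        U = set()
        for p in S:
            if pos_symbol[p] == a:            # positions in S that hold symbol a
                U |= followpos[p]
        U = frozenset(U)
        if U:
            dtran[(S, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)

accepting = {S for S in dstates if 6 in S}    # states containing the position of '#'
# e.g. dtran[(frozenset({1,2,3}), 'a')] == {1,2,3,4}, reproducing state B above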

Minimizing the Number of states of a DFA:


Equivalent automata:

Node n Followpos (n)


1 {1,2,3}
2 {1,2,3}
3 {4}
4 {5}
5 {6}
6 φ

Fig. 1.36: Directed graph of the follow pos relation on positions 1−6.

1.13. REGULAR EXPRESSIONS TO FINITE AUTOMATA:


Problems:
1. Convert the R.E aa*/bb* to NFA using Thompson’s construction.

Solution:

r = a a∗ | b b∗

with subexpressions  r1 = a,  r2 = a∗,  r3 = b,  r4 = b∗,  r5 = r1 r2,  r6 = r3 r4,  r7 = r5 | r6
R1 : Fig. 1.37 − NFA for the subexpression a

R2 : Fig. 1.38 − NFA for a∗

R3 : Fig. 1.39 − NFA for b

R4 : Fig. 1.40 − NFA for b∗

R5 : Fig. 1.41 − NFA for a a∗ (R1 concatenated with R2)

R6 : Fig. 1.42 − NFA for b b∗ (R3 concatenated with R4)

R7 : Fig. 1.43 − complete NFA for a a∗ | b b∗ (union of R5 and R6), with numbered states; state 0 is the start state and 11 the accepting state.
NFA to DFA using subset construction:

Method:

Computation of ε − closure:

ε − closure (0) = { 0, 1, 6 }

ε − closure (1) = { 1 }

ε − closure (2) = { 2, 3, 5, 11 }

ε − closure (3) = { 3 }

ε − closure (4) = { 3, 4, 5, 11 }

ε − closure (5) = { 5, 11 }

ε − closure (6) = { 6 }

ε − closure (7) = { 7, 8, 10, 11 }

ε − closure (8) ={8}

ε − closure (9) = { 8, 9, 10, 11 }

ε − closure (10) = { 10, 11 }

ε − closure (11) = { 11 }

Construction of DFA:

ε − closure (0) = { 0, 1, 6 } → A

δ (A, a) = { 2 }

δ (A, b) = { 7 }

ε − closure ( δ (A, a)) = ε − closure (2)

= { 2, 3, 5, 11 } → B

δ (B, a) = { 4 }

δ (B, b) = { }

ε − closure ( δ (A, b)) = ε − closure (7)

= { 7, 8, 10, 11 } → C

δ (C, a) = { }

δ (C, b) = { 9 }

ε − closure ( δ (B, a)) = ε − closure (4)

= { 3, 4, 5, 11 } → D
ε − closure ( δ (B, b)) = ε − closure { }

= { }

δ (D, a) = { 4 }

δ (D, b) = { }

ε − closure ( δ (C, b)) = ε − closure ( 9 )

= { 8, 9, 10, 11 } → E

δ (E, a) = { }

δ (E, b) = { 9 }

Transition Table:

I/P’s
State
a b

A B C

B D −

C − E

D D −

E − E

ε − closure ( δ (D, a)) = ε − closure ( 4 )

= { 3, 4, 5, 11 } → D

ε − closure ( δ (E, b)) = ε − closure ( 9 )

= { 8, 9, 10, 11 } → E
DFA Diagram:

Fig. 1.44: A --a--> B, A --b--> C, B --a--> D, C --b--> E, D --a--> D, E --b--> E; states B, C, D and E are accepting.
1.14. MINIMIZING DFA:

P = { A, B, C, D, E }

Initial partition: non-accepting states and accepting states

= { {A}  {B, C, D, E} }

Splitting the accepting group into equivalent states {B, D} and {C, E}:

= { {A}  {B, D}  {C, E} }

Fig. 1.45

Transition Table

I/P’s
State
a b

A B C

B B −

C − C
Minimized DFA diagram:

Fig. 1.46: A --a--> B, B --a--> B, A --b--> C, C --b--> C; B and C are accepting.
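The minimization just performed can be sketched in Python as iterative partition refinement. This is an illustrative sketch; the DFA of Fig. 1.44 is hard-coded as an assumed encoding, and missing transitions are treated as the error state.

# Sketch of DFA state minimization by partition refinement.
states   = {'A', 'B', 'C', 'D', 'E'}
finals   = {'B', 'C', 'D', 'E'}
alphabet = ('a', 'b')
delta    = {('A','a'):'B', ('A','b'):'C', ('B','a'):'D',
            ('C','b'):'E', ('D','a'):'D', ('E','b'):'E'}

def minimize(states, finals, alphabet, delta):
    partition = [frozenset(finals), frozenset(states - finals)]
    changed = True
    while changed:
        changed, new_partition = False, []
        for group in partition:
            # states stay together only if, on every symbol, they move
            # into the same group of the current partition
            buckets = {}
            for s in group:
                key = tuple(next((i for i, g in enumerate(partition)
                                  if delta.get((s, a)) in g), None)
                            for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(frozenset(b) for b in buckets.values())
            if len(buckets) > 1:
                changed = True
        partition = new_partition
    return partition

print(minimize(states, finals, alphabet, delta))
# expected groups: {A}, {B, D}, {C, E}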

*********
CHAPTER – II

SYNTAX ANALYSIS

2.1. INTRODUCTION:
Syntax Analysis:
★ Syntax of a Programming Language can be described by context − free
grammars (or) BNF (Backus − Naur Form).

★ Grammar − precise
easy to understand

syntactic specification

★ Efficient parsers can be built from grammars to check whether the input is syntactically well formed.

2.2. ROLE OF PARSER:

Fig. 2.1: Position of the parser: the lexical analyzer supplies tokens to the parser on request (get next token); the parser builds the parse tree and passes an intermediate representation to the rest of the front end; both consult the symbol table.

★ Parser − reports any syntax error.

★ Universal parsing methods − the Cocke−Younger−Kasami algorithm and Earley's algorithm − can parse any grammar.

★ Top-down (LL) parsers trace out a leftmost derivation and build the parse tree from the top (root) down to the leaves; recursive descent & predictive parsing are examples.

★ Bottom-up (LR) parsers trace out a rightmost derivation in reverse and build the parse tree from the leaves up to the root.

★ Input − scanned from left to right one symbol at a time.

★ Sub-classes of grammars − LL (for top-down parsers) and LR (for bottom-up parsers).

★ Syntax error recovery strategies − panic mode, phrase-level recovery.

★ Syntax error Handling

lexical − misspelling of identifiers, keywords, operators

syntactic − arithmetic exp. with unbalanced parentheses

semantic − operator applied to an incompatible operand

logical − infinitely recursive call

★ Error handler should

★ report the presence of errors clearly & accurately, and

★ recover from each error quickly enough to be able to detect subsequent errors.

Error productions:

★ useful when we have a good idea of the common errors.

★ error diagnostics are generated using the error productions.



Global correction:
★ minimal sequence of changes to obtain globally least cost correction.
★ transform x to y with minimal changes (insert, delete, change).
Role of a parser:
★ Every programming language has rules that describe the syntactic structure
of structured programs.

★ Parsing or syntax analysis is the phase of the compiler that checks whether the
statements of the program are as per the conventions of the language and it converts
the sequence of tokens into a syntax tree (or) parse tree.

Fig. 2.2: Syntax analysis takes a sequence of tokens and produces a syntax (or parse) tree.

★ Given the tokens produced by the lexical analyzer phase, the syntax analyzer
checks whether the statements are as per the constructs of the source programming
language & if correct it constructs the syntax tree, otherwise it displays an
error.

The issues needs to be addressed while constructing syntax analyzer are

1. Method of describing the possible constructs of the programming language (CFG).

2. Mechanism to check whether the states of the input stream are as per the
constructs (parsing)

3. Mechanism of error recovery.

2.3. CONTEXT – FREE GRAMMAR:

★ if E then S1 else S2 is a statement

E − expression

S1 , S2 − statement

★ cannot be expressed using regular expressions.

★ grammar production

statement → if expression then statement else statement

★ A CFG consists of terminals, non-terminals, a start symbol and productions.

★ Error handling shouldn’t significantly slow down the processing of correct programs.

★ Viable – prefix property: Detect an error as soon as they see a prefix of


input that is not a prefix of any string in the language.

★ Report the place in source program where an error is detected because the
actual error occurred within the previous few tokens.

★ Spurious errors: errors not made by the programmer but introduced by the
changes made to the parser state during error recovery.

★ Error recovery strategies:


(i) panic mode − discard input symbols one by one until a synchronizing token
(a delimiter such as ;) is found.

(ii) phrase − level recovery − perform a local correction on the remaining input:

delete an extra semicolon;

insert a missing semicolon;

replace a prefix of the remaining input by some string that allows the parser to
continue.

★ terminal − basic symbol from which strings are formed (token)


eg: if, then, else.

★ non-terminal − denote set of strings syntactic variables


eg: expr, stmt.

★ start symbol − distinguished one non-terminal

★ set of strings − language defined by a grammar.

★ productions − manner in which terminals & non-terminals can


be combined to form strings.

Eg:

Following productions defines simple expressions

expr → expr op expr

expr → (expr)

expr → − expr

expr → id

op → +

op → −

op → *

op → /

op → ^

terminal symbols are

id + − ∗ ⁄ ^( )

non terminal symbols are

expr op

start symbol

expr

★ Notational Conventions
terminals
→ lower case letters such as a, b, c, ...

→ operator symbols such as +, − , …

→ punctuation symbols ( ) ....

→ digits 0, 1, ...

→ bold face such as id (or) if

non terminals
→ upper case A, B, C ....

→ lower case italic names expr (or) stmt

→ S which when it appear is start symbol.


★ Using shorthands the grammar can be written as

E → E A E | ( E ) | − E | id

A → + | − | ∗ | ⁄ | ↑

E & A ⇒ non terminals

E → start symbol

★ Derivation

Production is treated as a re-writing rule in which the non terminal on the left
is replaced by the string on the right side of the production

eg: E ⇒ − E ⇒ − ( E ) ⇒ − (id)

α1 ⇒ α2 means α1 derives α2

⇒    means derives in one step

⇒∗   means derives in zero or more steps

⇒+   means derives in one (or) more steps

α ⇒∗ α for any string α

if α ⇒∗ β and β ⇒∗ γ then α ⇒∗ γ

α & β are strings of grammar symbols

★ G − Grammar

L (G) − Language generated by G

String in L (G) may contain only terminal symbols of G.

A string of terminals w is in L (G) if S ⇒+ w

w − sentence of G
Context free language: a language that can be generated by a grammar.

If two grammars generate the same language, the grammars are said to be
equivalent. If S ⇒∗ α, where α may contain non-terminals, then we say that α is a
sentential form of G. A sentence is a sentential form with no non-terminals.

★ Left most derivation: the leftmost non-terminal in any sentential form is replaced
at each step, written α ⇒lm β. If S ⇒∗lm α then α is a left-sentential form of the grammar.


Eg:
E ⇒lm − E ⇒lm − (E) ⇒lm − (E + E) ⇒lm − (id + E) ⇒lm − (id + id)

★ right most derivation (canonical derivations) right most non-terminal is


replaced at each step.

★ Parse tree
• Graphical representation for a derivation that filters out the choice
regarding replacement order.
• Each interior node − non-terminal A
• Children of node is labeled from left to right by the symbols in the right
side of the production by which this A was replaced in the derivative.
• Leaves are labeled by non-terminals (or) terminals and read from left to
right, they constitute a sentential form, called the yield (or) frontier of
the tree.
Parse tree for − (id + id)
E

_
E

( E )

E + E

id id
Fig. 2.3

Eg: id + id ∗ id

E ⇒ E+E E ⇒ E * E

⇒ id + E ⇒ E+E∗E

⇒ id + E ∗ E ⇒ id + id ∗ E

⇒ id + id ∗ id ⇒ id + id ∗ id

E E

E + E E * E

id E * E E + E

id id id id id
Fig. 2.4
Context – Free Grammar:

CFG is a notation used to specify the syntax of programming language & it


consists of,

Terminals – Basic symbols which we used to form the strings.

Non-Terminals – Syntactic variables that denote set of strings. These


define set of strings that helps to define the language
generated by the grammar.

Start symbol – A distinguished non-terminal which define set of


strings accepted by grammar.

Productions – Specify the ways through which the terminals &


non-terminals can be combined to form strings.

Example:

expr → expr op expr

expr → id

op → +

op → −

★ Terminals – id, +, −

★ Non-Terminals – expr, op

★ start symbol – expr

★ productions – 4 productions (given)

Notational Rules:

The notational conventions to be used in defining the CFG are as follows:

1. Terminals:

(i) Lower − case letters that appear at the beginning of the alphabet sequence
such as a, b, c, ....

(ii) Operator symbols such as + , − , ....

(iii) Punctuation symbols such as parentheses, commas etc.,

(iv) Digits 0, 1, ..., 9

(v) Bold face strings such as id, if ....

2. Non-Terminals:

(i) Upper-case letters that appear at the beginning of the alphabet sequence such
as A, B, C ....

(ii) The letter ‘S’, when appears is usually the start symbol.

(iii) Lower − case italic names such as expr, stmt, ....

3. Grammar symbols:

Upper − case letters that appear at the end in the alphabet sequence such as
X, Y, Z represent a grammar symbol that is either terminals or non-terminals.

4. Strings of Terminals:

Lower case letters that appear at the end of the alphabet sequence such as u,
v, ..., z represent strings of terminals.

5. String of Grammar symbols:

Lower − case Greek symbols such as α , β , γ represent string of grammar


symbols. Thus, a production can be written as, A → α where,

A → Non-terminal on the left

α → String of grammar symbols on the right.

6. Rule:

If A → α1 , A → α2 , …, A → αn are productions, then it can be re-written as,

A → α1 ⁄ α2 ⁄ .... ⁄ αn.

7. Start symbol:

Unless specified, the non-terminal on the left side of first production is the start
symbol.

Derivations:

★ Derivation gives an exact description of the top down construction of a parse
tree.

★ Each production is treated as a re-writing rule in which the non-terminal


on the left is replaced by the string on the right side of the production.

★ Notations

• ⟶    Derives in one step.

• ⟶∗   Derives in zero or more steps.

• ⟶+   Derives in one or more steps.

★ Language

Given a grammar ‘G’ with the start symbol ‘S’, the ⟶+ relation can be used
to define L (G), the language generated by G.
Strings:

Strings in L (G) contain only terminals. ‘w’ is in L (G) if S ⟶+ w, where w is a
string of terminals (or) a sentence of G.

★ Context free language:

A language that can be generated by a grammar is said to be a context free


language.

★ Equivalent context free languages:

If 2 CFG generate the same language, then the grammars are said to be
equivalent.

★ Sentential form:

If S ⟶∗ α, where α may contain non-terminals, then α is called the sentential
form of G.

★ Left most Derivations:

Derivations in which only the left most non-terminal in any sentential form
is replaced at each step are called left most derivations.

If α → β by a step in which only the left most non-terminal in α is replaced, it can
be written as α ⟶lm β.

★ Left – sentential form:

α derives β by a left most derivation is denoted α ⟶∗lm β. If S ⟶∗lm α, then α is a
left-sentential form of the grammar.

★ Right most Derivations (canonical derivations):

Derivations in which only the right most non-terminal in any sentential form
is replaced at each step are called right most derivations.

If α → β by a step in which only the right most non-terminal in α is replaced, it can
be written as α ⟶rm β.
★ Right – sentential form:

α derives β by a right most derivation is denoted α ⟶∗rm β. If S ⟶∗rm α, then α
is a right-sentential form of the grammar.

Example:

1. Consider the grammar,

S → (L) ⁄ a

L → L, S ⁄ S

Construct the left most & right most derivation for the string (a, a).

Left most derivation               Right most derivation

S ⇒lm (L)                          S ⇒rm (L)

  ⇒lm (L, S)                         ⇒rm (L, S)

{left sentential form}             {right sentential form}

  ⇒lm (S, S)                         ⇒rm (L, a)

  ⇒lm (a, S)                         ⇒rm (S, a)

  ⇒lm (a, a)                         ⇒rm (a, a)

Parse Trees:

A parse tree is a graphical representation of a derivation. The leaves of the parse tree


are labeled by non-terminals or terminals and read from left to right, they constitute
a sentential form, called the yield or frontier of the tree. Features of a parse tree
are,

★ Every interior node is labeled by some non-terminal ‘A’.


★ Children of node ‘A’ are labeled from left to right, by symbols in right side
of production by which this ‘A’ was replaced.

★ Leaves of parse tree are labeled by non-terminal or terminals.


Example:
1. Consider the grammar
S → (L) / a
L → L, S/S

The parse tree for the input string (a, a) is as follows:

Fig. 2.5: Parse tree for (a, a): the root S has children ( L ); L has children L , S; the left L derives a (through S) and the right S derives a.

Ambiguity:

A grammar that produces more than one parse tree for same sentence is said
to be ambiguous. An ambiguous grammar is one that produces more than one left
most derivation or more than one rightmost derivation for the same sentence.

Example:

1. Check whether the grammar is ambiguous or not.

E → E + E

E → E * E

E → a

Consider a string “a + a ∗ a”:


Left most derivation (I)           Left most derivation (II)

E ⇒lm E + E                        E ⇒lm E ∗ E

  ⇒lm a + E                          ⇒lm E + E ∗ E

  ⇒lm a + E ∗ E                      ⇒lm a + E ∗ E

  ⇒lm a + a ∗ E                      ⇒lm a + a ∗ E

  ⇒lm a + a ∗ a                      ⇒lm a + a ∗ a

Since for a single input sentence 2 left most derivations are there, the grammar
is said to be ambiguous.

Writing a CFG from Regular Expression:

For each state ‘i’ of the NFA, create a non-terminal Ai.

1. If state i has a transition to state j on symbol a, introduce the production.

Ai → a Aj

2. If state i goes to state j on input ε, introduce the production

Ai → Aj

3. If i is an accepting state, introduce Ai → ∈.

4. If i is a start state, make Ai to be the start symbol of the grammar.
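The four rules can be mechanised with a short Python sketch (illustrative; the list-of-triples encoding of the automaton of Fig. 2.6 is an assumption, and '' is used for ε-transitions):

# Sketch of rules 1-4: turning an automaton into a right-linear grammar.
transitions = [(0, 'a', 0), (0, 'b', 0), (0, 'a', 1), (1, 'b', 2), (2, 'b', 3)]
start, accepting = 0, {3}

productions = []
for i, a, j in transitions:                   # rule 1 (rule 2 when a == '')
    rhs = f"A{j}" if a == '' else f"{a} A{j}"
    productions.append((f"A{i}", rhs))
for i in accepting:                           # rule 3
    productions.append((f"A{i}", "epsilon"))
# rule 4: A0, the start state's non-terminal, is the start symbol
for lhs, rhs in productions:
    print(f"{lhs} -> {rhs}")
# prints A0 -> a A0, A0 -> b A0, A0 -> a A1, A1 -> b A2, A2 -> b A3, A3 -> epsilon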

Example:
1. Write the CFG for the following automata:

Fig. 2.6: Automaton with a self-loop on a, b at state 0 and transitions 0 --a--> 1 --b--> 2 --b--> 3 (state 3 accepting).

Solution:
Corresponding Grammar is,

A0 → a A0 ⁄ b A0 ⁄ a A1

A1 → b A2

A2 → b A3

A3 → ε

2. Write the CFG for the following automata.


Fig. 2.7: Automaton with ε-moves 0 --ε--> 1 and 0 --ε--> 3, transitions 1 --a--> 2 and 3 --b--> 4, and loops on a at 2 and on b at 4 (states 2 and 4 accepting).
Solution:
Corresponding Grammar is,

A0 → A1 ⁄ A3

A1 → a A2

A2 → a A2 ⁄ ∈

A3 → b A4

A4 → b A4 ⁄ ε.

3. Write the CFG for the following automata:


b
b

a b b
0 1 2 3
a
a a

Fig. 2.8
2.16 COMPILER DESIGN

Solution:

A0 → b A0 ⁄ a A1

A1 → a A1 ⁄ b A2

A2 → b A3 ⁄ a A1

A3 → a A0 ⁄ a A1 ⁄ ε.

Left – Recursive Grammar:

A grammar is said to be left recursive if it has a non-terminal ‘A’ such that
there is a derivation

A ⟶+ A α for some string ‘α’, i.e., the same non-terminal on the left side
appears as the first symbol on the right side.

Example:

The grammar, E → E + E

E → id is left recursive since the same non-terminal ‘E’ appears on left as


well as the first position of right side in the first production.

Elimination of Left Recursion:

Algorithm for eliminating immediate left recursion:

1. First join the A-productions together as,

A → A α1 ⁄ A α2 ⁄ … ⁄ A αm ⁄ β1 ⁄ β2 ⁄ … ⁄ βn.

where, no βi begins with ‘A’

2. Replace these A − productions by the set of statements,

A → β1 A′ ⁄ β2 A′ ⁄ … ⁄ βn A′

A′ → α1 A′ ⁄ α2 A′ ⁄ … ⁄ αm A′ ⁄ ε.

Algorithm for eliminating all left recursion:

This algorithm will work only if it has,

1. No cycles

2. No ε − productions.
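The two-step rule for immediate left recursion can be sketched in Python (illustrative only; the list-of-symbol-lists encoding of the productions and the function name are assumptions):

# Sketch of removing immediate left recursion from the productions of one non-terminal A.
def remove_immediate_left_recursion(A, productions):
    """productions: list of right-hand sides for A, each a list of symbols."""
    alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == A]        # A -> A alpha_i
    betas  = [rhs      for rhs in productions if not rhs or rhs[0] != A]    # A -> beta_j
    if not alphas:
        return {A: productions}                                             # nothing to do
    A1 = A + "'"                                                            # the new non-terminal A'
    return {
        A:  [beta + [A1] for beta in betas],                                # A  -> beta_j A'
        A1: [alpha + [A1] for alpha in alphas] + [['epsilon']],             # A' -> alpha_i A' | epsilon
    }

# Example: E -> E + T | T  becomes  E -> T E',  E' -> + T E' | epsilon
print(remove_immediate_left_recursion('E', [['E', '+', 'T'], ['T']]))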

Problems:

1. Eliminate left recursion from the following grammar.

E → E +T⁄T

T → T∗ F⁄F

F → (E) ⁄ id

Solution:

E → E +T⁄T

A → A α⁄β α=+T, β=T

⇓ replaced by

A → β A′
A′ → α A′ ⁄ ε
E → TE′
E′ → + T E′ ⁄ ε
T → T∗ F⁄F
T → FT′
T′ → ∗ FT′ ⁄ ε

∴ E → TE′

E′ → + TE′ ⁄ ε

T → FT′

T′ → ∗ FT′ ⁄ ε

F → (E) ⁄ id.

2. Eliminate left recursion from the following grammar.

S → Aa ⁄ b .... (1)

A → Ac ⁄ sd .... (2)

Solution:

Sub eq. (1) in (2)

α1 = c , α2 = ad , β = bd
S → Aa ⁄ b

A → A c ⁄ A ad ⁄ bd

Elimination of left recursion

S → Aa ⁄ b

A → bd A′

A′ → C A′ ⁄ ad A′ ⁄ ε

3. Eliminate left recursion from the following grammar.

S → a / ^ / (T)

T → T , S ⁄ S          ( α = , S    β = S )

Solution:

Eliminating left recursion.

S → a / ^ / (T)

T → ST′

T′ → , ST′ ⁄ ε.

Left Factoring:

If more than one production is available to expand a non-terminal ‘A’ and if


the selection is not clear, the grammar is said to posses left-factoring.

If A → α β1 ⁄ α β2 are 2 ‘A’ productions, left factored productions becomes.

A → α A′

A′ → β1 ⁄ β2

To each non-terminal ‘A’, find the longest prefix ‘α’ common to two or more of
its alternatives.

Replace all the A-productions

A → α β1 ⁄ α β2 ⁄ .... ⁄ α βn ⁄ γ

By

A → α A′ ⁄ γ

A′ → β1 ⁄ β2 ⁄ … βn
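This transformation can also be sketched in Python. It is an illustrative, simplified sketch (the helper names are assumptions): it factors only when every alternative shares the common prefix; the separate γ alternatives of the rule above would be kept unchanged.

# Sketch of left factoring the A-productions.
def common_prefix(rhss):
    """Longest common prefix (as a list of symbols) of all right-hand sides."""
    prefix = []
    for symbols in zip(*rhss):
        if len(set(symbols)) == 1:
            prefix.append(symbols[0])
        else:
            break
    return prefix

def left_factor(A, rhss):
    alpha = common_prefix(rhss)
    if not alpha:
        return {A: rhss}                           # nothing to factor
    A1 = A + "'"
    tails = [rhs[len(alpha):] or ['epsilon'] for rhs in rhss]
    return {A: [alpha + [A1]], A1: tails}          # A -> alpha A',  A' -> beta_1 | ... | beta_n

# Example: S -> a A b c | a A b  becomes  S -> a A b S',  S' -> c | epsilon
print(left_factor('S', [['a', 'A', 'b', 'c'], ['a', 'A', 'b']]))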

Problems:

1. Left Factor the following grammar.

S → i E t S e S ⁄ i E t S ⁄ a

E → b

Elimination of left factoring formula

A → α A′ ⁄ γ

A′ → β1 ⁄ β2 … ⁄ βn

Solution:

Grammar after left factoring:

S → i E t S S′ ⁄ a

S′ → e S ⁄ ε

E →b

2. Left factor the following grammar.

S → a Abc ⁄ a Ab ⁄ d
A → e
Solution:

Grammar after left factoring:

S → a A b S′ ⁄ d
S′ → c ⁄ ∈
A → e

3. Left factor the following grammar.

S → Ced ⁄ Cdb

C → db

Solution:

Grammar after left factoring:

S → C S′

S′ → ed ⁄ db

C → db

2.4. WRITING A GRAMMAR:

★ Grammars are capable of describing most, but not all, of the syntax of
programming languages.
eg: Requirement that identifiers be declared before they are used cannot be
described by a context − free grammar.

★ Regular Expression Vs Context free grammars R.E (a/b)* abb

A0 → a A0 ⁄ b A0 ⁄ a A1

A1 → b A2

A2 → b A3

A3 → ∈

★ Every regular set is a context free language.

★ Why use regular expression to define lexical syntax of a language?

• Lexical rules of a lang. are frequently quite simple (we do not need a notation
as powerful as grammars)

• Provide easier and concise to understand notation for tokens than


grammars.

• Constructed automatically from reg. exp. than from arbitrary grammars.



★ Regular Expression for describing structure of lexical constructs such as


identifiers, constants, keywords...

★ Grammars are useful in describing nested structures such as balanced


parentheses, matching begin end, if-then-else, ...

★ Verifying the language generated by a grammar:

• every string generated by G is in L


• every string in L can indeed be generated by G.
eg: Consider the grammar

S → ( S ) S ⁄ ∈

every sentence derivable from S is balanced


S ⇒ (S) S ⇒∗ (x) S ⇒∗ (x) y

derivations of x and y from S take fewer than n steps.

2.5. AMBIGUOUS GRAMMAR:

Ambiguity:

★ A grammar that produces more than one parse tree for some sentence is said
to be ambiguous.

★ It produces more than one leftmost (or) more than one rightmost derivation
for the same sentence.

★ Eliminating Ambiguity:

• Sometimes an ambiguous grammar can be rewritten to eliminate the


ambiguity.

• eg: dangling - else grammar

stmt → if expr then stmt |

if expr then stmt else stmt |

any other stmt
if E1 then S1 else if E2 then S2 else S3

Fig. 2.9: Parse tree for the statement above.
if E1 then if E2 then S1 else S2

Fig. 2.10: Parse tree for the statement above, with the else matched to the closest then.

• general rule is “match each else with closest previous unmatched then”.
• unambiguous grammar

stmt → matched - stmt | unmatched - stmt

matched - stmt → if expr then matched - stmt else matched - stmt | other

unmatched - stmt → if expr then stmt | if expr then matched - stmt else unmatched - stmt

★ Elimination of left recursion


★ A grammar is left recursive if it has a non-terminal A such that there is a
derivation

A ⇒+ A α for some string α.

★ Top down parsing methods cannot handle left - recursive grammars.


★ A → A α | β can be replaced by the non-left-recursive productions
A → β A′

A′ → α A′ | ∈

eg: E → E + T | T

T → T ∗ F | F

F → (E) | id

Eliminate immediate left recursion

E → T E′

E′ → + T E′ | ∈

T → F T′

T′ → ∗ F T′ | ∈

F → (E) | id

to remove immediate left recursion

A → A α1  A α2  …  A αm  β1  β2  …  βn

where no βi begins with an A

A → β1 A′  B2 A′  …  βn A′

A′ → α1 A′  α2 A′  …  αm A′  ∈

But it doesn’t eliminate left recursion involving derivations of 2 (or) more steps.

eg: S → Aab

A → Ac  Sd  ∈

non terminal S is left recursive because

S ⇒ A a ⇒ Sol a
2.24 COMPILER DESIGN

but it is not immediate left recursive.


Algorithm – Eliminating left recursion

Input − Grammar G with no cycles (or) ∈ productions

Output − Equivalent grammar with no left recursion.

arrange the non-terminals in some order A1 , A2 … An

For i = 1 to n do begin

For j = 1 to i − 1 do begin

replace each production of the form Ai → Aj γ

by the productions Ai → δ1 γ  δ2 γ  …  δK γ

where Aj → δ1  δ2  …  δK are all current Aj productions.

end

eliminate the immediate left recursion among the Ai productions.

end

eg: S → Aa⁄b

A → Ac ⁄ S

After Elimination of left Recursion

S → Aa ⁄ b

A → SA′

A′ → cA′ ⁄ ε

★ Left factoring

• It is a grammar transformation that is useful for producing a grammar


suitable for predictive parsing.
eg: Stmt → if expr then stmt else stmt


 if expr then stmt

i.e. A → α β1  α β2

Left factored is

A → α A′

A′ → β1  β2

Grammar

A → α β1  α β2  …  α βn  γ

then

A → α A′  γ

A′ → β1  β2  … βn

Dangling else problem

eg: S → iEtSiEtseSa

E → b

Left factored

S → i E t S S′  a

S′ → e S  ε

E → b.

2.6. ERROR HANDLING:

★ An LR parser detects an error, when it consults the action table and finds
that there is no entry for the given state and input symbol.

★ Error can never be detected by consulting the goto table.

★ Error detection in LR parsing exhibits the valid prefix property: the error is
detected as soon as the prefix of the input seen so far is not a valid prefix.

★ Canonical LR parser will not make a single reduction before announcing the
error.

★ SLR and LALR can make several reductions before detections an error but
will not make a single shift before detecting an error.

★ Error recovery schemes in LR parsers.

Panic mode − discards symbols

Phrase level recovery − error handling routine.

2.7. TOP DOWN PARSING

Types of Parsers:

Parsing is a technique that is used to check whether the statements of the input
stream are as per the syntax of the programming language or not.

Parser is a program that takes a string of token as its input & produces a
parse tree if the string is accepted by the grammar otherwise, an error.

Two types of parsers are:

1. Top-down parsing

2. Bottom-up parsing.

Top-down parsing

For a given input string, these parsers construct the parse tree from the root to
the leaves if the statement is accepted by the source language; otherwise they generate
an error.

Various top - down parsing techniques are

1. Recursive descent parsing

2. Predictive parsing.

2.7.1. Recursive Descent Parsing:

★ Construct parse tree from root & create nodes of parse tree in pre-order.

★ Special case − predictive parsing − no backtracking is required.

★ Repeated scans of the input (involves backtracking)

eg: S → cAd

A → ab  a string w = c a d
Fig. 2.11: Successive steps of top-down parsing of w = cad: the tree for S, the tree after expanding S → cAd, the tree after trying A → ab, and the tree after backtracking and using A → a.
The most general form of top-down parsing that involves back-tracking is
recursive descent parsing.
To check whether a given string is accepted or not i.e., to construct a parse
tree, create child node in preorder.

Example:
1. Consider the following grammar
S → c Ad
A → ab ⁄ a
Construct parse tree for the string cad.
Solution:
Step 1:
S (starting node) Input pointer : cad

Step 2:
S

c A d
Fig. 2.12 Input pointer : cad

Step 3:
S

c A d

a b
Fig. 2.13 Input pointer : cad

Step 4:
‘b’ is compared against ‘d’ & hence found to be wrong. Hence go back to ‘A’ and
check for an alternative.

c A d
Fig. 2.14
Input pointer : cad

Step 5:
S

c A d

a
Fig. 2.15

Input pointer : cad

“String accepted”
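The backtracking behaviour just traced can be sketched as a tiny Python program for this particular grammar (an illustrative sketch; the input is a plain string and each function returns the position reached, or None on failure — these conventions are assumptions of the example):

# Sketch of a backtracking recursive-descent parser for S -> c A d, A -> a b | a.
def parse_A(s, i):
    if s[i:i+2] == 'ab':          # try the first alternative  A -> a b
        return i + 2
    if s[i:i+1] == 'a':           # backtrack and try          A -> a
        return i + 1
    return None

def parse_S(s, i=0):
    if s[i:i+1] != 'c':
        return None
    j = parse_A(s, i + 1)
    if j is None or s[j:j+1] != 'd':
        return None
    return j + 1

word = 'cad'
print("accepted" if parse_S(word) == len(word) else "rejected")   # accepted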

2.7.2. Predictive Parsing:

★ Eliminates left recursion

⇒ so no backtracking is required

★ Left factors the grammar

Translation diagram for predictive parsers:

For each non terminal A do the following

(i) create an initial and final (return) state

(ii) for each production

A → X1 X2 … Xn create a path from initial to final state with edges labeled


X1 , X2 , … Xn
eg:

Fig. 2.16: Transition diagrams for the non-terminals E, E′, T, T′ and F of the expression grammar.
Simplification of E′:

Applying the ε-edge of E′ turns the + T path into a loop; substituting the simplified E′ into the diagram for E gives a T edge followed by a loop on + T. T′ and T are simplified in the same way, giving a loop on ∗ F.

Fig. 2.17
Non-recursive predictive passing:
★ Maintain stack explicitly rather than implicitly via recursive calls.
★ Key problem → determining the production to be applied for a non-terminal

Fig. 2.18: Model of a non-recursive predictive parser: an input buffer (e.g. a + b $), a stack (X Y … $), the predictive parsing program, the parsing table M and an output stream.
★ Input buffer − the string to be parsed followed by $ (the endmarker marking
the end of the string).

★ Stack − sequence of grammar symbols with $ on the bottom.

★ Parsing table − two dimensional array M [A, a]

where A − nonterminal
a − terminal

★ The program considers X, the symbol on top of the stack and a, the current
input symbol. These two symbols determine the action of the parser.

There are 3 possibilities:

(i) If X = a = $, parser halts & announces successful completion of parsing.

(ii) If X = a ≠ $, the parser pops X off the stack & advances the input pointer to
the next input symbol.

(iii) If X is a non-terminal, the program consults entry M [X, a] of the parsing
table M. This entry will be either an X-production of the grammar (or) an
error entry. If, for example, M [X, a] = { X → UVW }, the parser replaces X
on top of the stack by WVU (with U on top). As output we shall assume that
the parser just prints the production used; any other code could be executed
here. If M [X, a] = error, the parser calls an error recovery routine.

★ Parsing table M for grammar

Non I/P symbol


terminal
id + * ( ) $

E E → TE′ E → TE′

E′ E′ → + T E′ E′ → ∈ E′ → ∈

T T → F T′ T → FT′

T′ T′ → ∈ T′ → ∗ F T′ T′ → ∈ T′ → ∈

F F → id F → (E)

Stack Input Output

$E id + id ∗ id $

$ E′ T id + id ∗ id $ E → TE′

$ E′ T′ F id + id ∗ id $ T → F T′

$ E′ T′ id id + id ∗ id $ F → id

$ E′ T′ + id ∗ id $

$ E′ + id ∗ id $ T′ → ∈

$ E′ T + + id ∗ id $ E′ → + TE′

$ E′ T id ∗ id $

$ E′ T′ F id ∗ id $ T → FT′

$ E′ T′ id id ∗ id $ F → id

$ E′ T′ ∗ id $

$ E′ T′ F ∗ ∗ id $ T′ → ∗ F T′

$ E′T′ F id $

$ E′ T′ id id $ F → id

$ E′ T′ $

$ E′ $ T′ → ∈

$ $ E′ → ∈

Predictive parsing program:

Set ip to point to the first symbol of w$;

repeat

let X be the top stack symbol and a the symbol pointed to by ip;

if X is a terminal or $ then

if X = a then

pop X from the stack & advance ip

else error()

else /* X is a non-terminal */

if M [X, a] = X → Y1 Y2 .... Yk then begin

pop X from the stack;

push Yk , Yk − 1 , … , Y1 onto the stack, with Y1 on top;

output the production X → Y1 Y2 … Yk

end

else error()

until X = $ /* stack is empty */
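The same driver can be sketched as a short Python program. This is an illustrative sketch; the dictionary M below transcribes the parsing table for the expression grammar given above, and the tokens are assumed to be supplied as a list ending with '$'.

# Sketch of the table-driven (non-recursive) predictive parsing loop.
M = {
    ('E',  'id'): ['T', "E'"],    ('E',  '('): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
    ('T',  'id'): ['F', "T'"],    ('T',  '('): ['F', "T'"],
    ("T'", '+'): [], ("T'", '*'): ['*', 'F', "T'"], ("T'", ')'): [], ("T'", '$'): [],
    ('F',  'id'): ['id'],         ('F',  '('): ['(', 'E', ')'],
}
terminals = {'id', '+', '*', '(', ')', '$'}

def predictive_parse(tokens):
    stack, i = ['$', 'E'], 0
    while stack:
        X, a = stack.pop(), tokens[i]
        if X in terminals:
            if X != a:
                return False                    # terminal mismatch
            i += 1                              # match: advance the input pointer
        elif (X, a) in M:
            stack.extend(reversed(M[(X, a)]))   # push Yk ... Y1 with Y1 on top
            print(X, '->', ' '.join(M[(X, a)]) or 'epsilon')
        else:
            return False                        # error entry
    return i == len(tokens)

print(predictive_parse(['id', '+', 'id', '*', 'id', '$']))   # True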

Predictive parsing:

★ The drawback with the recursive descent parsing is that it requires


backtracking & hence difficult to implement in systems.

★ A special form of recursive - descent parsing that needs no backtracking is


called as predictive parsing.

Two forms of predictive parsing are

(i) Recursive predictive parsing.

(ii) Non-recursive predictive parsing.

Recursive predictive parsing:

★ A recursive predictive parsing program based on a TD tries to match


terminal symbols against the input. When the parsing has to follow an edge
labeled a non-terminal, it makes a recursive procedure call to the
corresponding TD.

TD’s:

(a) For each NT, there will be a diagram.

(b) The labels of edges are labeled tokens & NT’s.

(c) A transition on a token means that the transition has to be taken if that token
is the next input symbol.

(d) A transition on a NT ‘A’ is a call of the procedure for ‘A’.



Steps for construction:


(a) Eliminate left recursion from the grammar.

(b) Left factor the grammar.

(c) For each non-terminal ‘A’ do,

(i) Create an initial & final state

(ii) For each production

A → X1 X2 … Xn, create a path from the initial to the final state, with edges labeled
X1, X2, …, Xn

Working of predictive parser:


(a) Parser begins in the start state for the start symbol.

(b) If after some actions, it is in state ‘s’ with an edge labeled by terminal ‘a’ to
state ‘t’ & if the next input symbol is ‘a’, then the parser moves the input pointer
one position right & goes to state ‘t’.
(c) If the edge is labeled by a non-terminal ‘A’, the parser goes to the start state for ‘A’,
without moving the input pointer. If it reaches the final state for ‘A’, it
immediately goes to state ‘t’.
(d) If there is an edge from ‘s’ to ‘t’ labeled ε, then from state ‘s’ the parser goes
to ‘t’ without advancing the input pointer.

2.7.2.1. Construction of LL(1) Parser:


Construct predictive parsing table for the following grammar and hence
check whether the grammar is LL(1) or not
S → L=R
S → R
L → ∗R
L → id
R → L
Solution:
Combine all the A productions,

S → L=R⁄R
L → ∗ R ⁄ id
R →L
Step 1: No left recursion

Step 2: No left factoring

Step 3: Computation of FIRST:

FIRST (S) = FIRST of (L) = { ∗ , id }

FIRST (R) = FIRST (L) = { ∗ , id }

FIRST (L) = { ∗ , id }

Step 4: Computation of FOLLOW:

FOLLOW (S) = { $ }

FOLLOW (L) = { = , $ }

FOLLOW (R) = { = , $ }

Step 5:

(a) Construction of parsing table:

Rule 1:

For each terminal ‘a’ contained in FIRST (A), add A → X to M [A, a] in parsing
table if X derives ‘a’ as the first symbol

M [S, ∗] = S → L = R

M [S, id] = S → L = R

M [L, ∗] = L → ∗R

M [L, id] = L → id

M [R, ∗] = R → L

M [R, id] = R → L

(b) Rule 2:

If FIRST (A) contain null production for each terminal ‘b’ in FOLLOW (A), add
this production (A → null) to M [A, b] in parsing table.

Step 6:

Parsing Table

Non-terminals /
Terminals        *                    id                   =        $

S                S → L = R            S → L = R            −        −
                 S → R                S → R

L                L → ∗ R              L → id               −        −

R                R → L                R → L                −        −

Since the entries M [S, ∗] and M [S, id] are multiply defined (both S → L = R and
S → R belong there, because FIRST of both right sides is { ∗ , id }), the table has
conflicting entries and the grammar is not LL(1).
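The table-filling rule and the conflict check can be sketched in Python (illustrative only; the triples below, containing each production together with the FIRST set of its right side as computed above, are an assumed encoding):

# Sketch of filling the LL(1) table for the grammar above and detecting conflicts.
productions = [
    ('S', ['L', '=', 'R'], {'*', 'id'}),
    ('S', ['R'],           {'*', 'id'}),
    ('L', ['*', 'R'],      {'*'}),
    ('L', ['id'],          {'id'}),
    ('R', ['L'],           {'*', 'id'}),
]

table, conflicts = {}, []
for lhs, rhs, first in productions:
    for a in first:                      # rule 1: A -> X goes into M[A, a] for each a in FIRST(X)
        if (lhs, a) in table:
            conflicts.append((lhs, a))
        table.setdefault((lhs, a), []).append(rhs)

print("conflicting entries:", conflicts)
# M[S, *] and M[S, id] are multiply defined, so the grammar is not LL(1)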

2.8. BOTTOM UP PARSING:

Bottom up evaluation of s-attributed definitions:

★ Translator or S-attributed definition can be implemented with the help of


LR parser generator.

★ Uses a stack to hold information about sub trees that have been parsed

        State      val
        X          X.x
        Y          Y.y
top →   Z          Z.z

(each state entry is a pointer to the LR(1) parsing table)

eg:     Production                      Code fragment

L→En print (val [trp])

E → E1 + T val [ntop] = val [top – 2] + val [top]

E→T

T → T1 ∗ F val [n top] = val [top − 2] * val [top]

T→F

F → (E) val [n top] = val [top − 1]

F → digit

n top = top − r + 1

r → no. of symbols in right side of the production.

Moves made by translator on input 3 ∗ 5 + 4 n

Input State Val Productions

3∗5+4n − −

∗5+4n 3 3

∗5+4n F 3 F → digit

∗5+4n T 3 T → F

5 + 4n T * 3−

+ 4n T∗5 3−5

+ 4n T∗F 3−5 F → digit



+ 4n T 15 T → T * F

+ 4n E 15 E → T

4n E+ 15 −

n E+4 15 − 4

n E+F 15 − 4 F → digit

n E+T 15 − 4 T → F

n E 19 E→E+T

En 19

L 19 L→En

Bottom-up Parsing:

Parsers that construct the parse tree from the leaves to the root for a given
input string are said to be Bottom-up parsing i.e. the input string is reduced to the
start symbol. If it can be reduced to a start symbol then the string is said to be
accepted otherwise, not.

Various types of bottom-up parsers are:

★ Shift − reduce parser.

★ Operator − precedence parser

★ LR parsers

• SLR parser.

• CLR parser.

• LALR parser.

Shift – reduce parser:

★ The general form of bottom-up parser is shift − reduce parser & hence
parsing of the input string is made from leaves to the root.

★ This parsing attempts to construct a parse tree for an input string beginning
at the leaves & working up towards the root.

★ At each reduction step, a particular substring matching the right side of a


production is replaced by the symbol on the left of that production. If the
substring chosen correctly at each step, a right most derivation is traced out
in reverse.

Example:

Consider the grammar:

S → a AB e

A → Abc ⁄ b

B → d

Construct the reduction for the input sentence abbcde.

Top-down parsing Bottom-up parsing

Initial config. ($ S w $) Initial config. ( $ , w $)

Final config. ( $ $) Final config ($ S, $)

Reduction: a bb cde

a Abc de { A → b }

a A d e { A → Abc }

a A Be { B → d }

S { S → a ABe }

Handle:

A handle of a string is a substring that matches the right side of a production


and whose reduction to the non-terminal on the left side of the production represents
one step along the reverse of a right most derivation.

A handle of a right-sentential form γ is a production A → β and a position of γ
where the string β may be found and replaced by A to produce the previous right-
sentential form in a rightmost derivation of γ.

Example:

Consider the following grammar:

E → E+E⁄E∗E
E → (E) ⁄ id
Identify the handles in the derivation of id + id ∗ id.

E ⇒rm E + E

  ⇒rm E + E ∗ E

  ⇒rm E + E ∗ id

  ⇒rm E + id ∗ id

  ⇒rm id + id ∗ id

(the handle in each right-sentential form is the substring most recently introduced, underlined in the original figure)

HANDLE PRUNING:

A right most derivation in reverse can be obtained by handle pruning. Reducing


β to A in α β γ is called handle pruning.

Example:

E → E+E⁄E∗E

E → (E) ⁄ id.

and the input string (id + id): Obtain the handle pruning for this sentence.

Right – sentential form Handle Reducing Production

(id + id) id E → id

(E + id) id E → id

(E + E) E+E E →E+E

(E) (E) E → (E)

E − −

VIABLE PREFIXES:
The set of prefixes of right − sentential forms that appear on the stack of a
shift-reduce parser are called viable prefixes.

Problems with handle pruning:


The problems that have to be solved in handle pruning are:

(a) Location of sub string to be reduced in a right − sentential form.

(b) Choosing a production in case of there is more than one production with that
sub string on the right side.

2.8.1. Concept of Shift Reduce Parsing:


Stack Implementation of Shift – reduce parsing:

★ Two problems to be solved:

• Locate the substring to be reduced in a right sentential form.


• Determine what production to choose in case there is more than one
production with that sub string on the right side.

★ Use stack to hold grammar symbols.


★ Input buffer to hold the string w to be parsed.

★ $ − bottom of stack & right end of input.


★ Initial

Stack Input

$ w$

★ Parser operates by shifting zero (or) more input symbols on to the stack
until a handle β is on top of the stack.

★ Parser reduces β to the left side of the appropriate production.

★ Parser repeats this cycle until an error is detected (or) stack contains start
symbol & input is empty (ie) stack i ⁄ p
$S $

★ 4 possible actions a shift-reduce parser can make.

1. shift, 2. reduce, 3. accept, 4. error

★ Shift − next input symbol is shifted onto the top of the stack.

★ reduce − the right end of the handle is at the top of the stack; the parser locates
the left end of the handle within the stack & replaces the handle by the left
side of the production.

★ Accept − successful completion.

★ Error − syntax error discovered & calls an error recovery routine.

Stack                I/P                       Action

$                    id1 + id2 ∗ id3 $         shift

$ id1                + id2 ∗ id3 $             reduce by E → id

$ E                  + id2 ∗ id3 $             shift

$ E +                id2 ∗ id3 $               shift

$ E + id2            ∗ id3 $                   reduce by E → id

$ E + E              ∗ id3 $                   shift

$ E + E ∗            id3 $                     shift

$ E + E ∗ id3        $                         reduce by E → id

$ E + E ∗ E          $                         reduce by E → E ∗ E

$ E + E              $                         reduce by E → E + E

$ E                  $                         accept
★ Actions must be chosen so that the shift − reduce parser works correctly ⇒ 2
techniques − operator precedence & LR parsers.

★ Viable prefixes − set of prefixes of right sentential farms that can appear
on the stack of a shift − reduce parser.

★ Conflicts design shift − reduce parsing

Shift / reduce conflict

• cannot decide whether to shift (or) to reduce.


reduce / reduce conflict

• cannot decide which of several reductions to make.


Technically, these grammars are not in the LR (k) class of grammars ⇒ non-LR
grammars. k − no. of symbols of lookahead on the input; LR (1) − one symbol of
lookahead.

★ Ambiguous grammar can never be LR.


eg: dangling − else grammar

stmt → if expr then stmt

if expr then stmt else stmt

other

Stack

... if expr then stmt input else ... $

Shift / reduce conflict

★ Stack content & next input symbol are not sufficient to determine which
production should be used in a reduction.

Stack implementation of shift – reduce parser:

A shift − reduce parser has an,

1. Input Buffer that contains the string to be parsed, followed by $, a symbol


used as a right-end marker to indicate the end of the string.

2. Stack that contains a sequence of grammar symbol with $ on the stack at


the bottom, indicating the bottom of the stack.
3. Configuration: A configuration of a shift reduce parser is a pair whose first
component is the stack content and second component is the unexpended input,

( $ XYZ , ai ai + 1 … an $ )

Initial configuration

($,w$)

Final configuration

($S,$)

ACTIONS:

Shift − The next input symbol is shifted on to the top of the stack.

Reduce − The parser knows the right end of the handle is on the top of the
stack. It then locates the left end of the handle within the stack
and decides the non-terminal for replacement.

Accept − Parser announces successful completion of parsing.

Error − Discovers that a syntax error has occurred and calls an error
recovery routine.

Steps in shift - reduce parsing:

1. The parser operates by shifting zero or more i/p symbols onto the stacks until
a handle β is on the top of the stack.

2. The parser that reduces β to the left side of appropriate production.

3. This is repeated until it has detected an error or until the stack reaches the
final configuration.

Problems:

1. Consider the following grammar

S → CC

C → cC

C → d

Check whether the input string ccdd is accepted or not



Solution:

Stack        I/P Buffer       Action

$            ccdd $           shift

$ c          cdd $            shift

$ cc         dd $             shift

$ ccd        d $              reduce C → d

$ ccC        d $              reduce C → cC

$ cC         d $              reduce C → cC

$ C          d $              shift

$ Cd         $                reduce C → d

$ CC         $                reduce S → CC

$ S          $                accept

2. Consider the following grammar:

E → E+T⁄T

T → T ∗F⁄F

F → (E)  id

check whether the input string (id ∗ id) + id is accepted or not.

Conflicts in shift – reduce parsing:

Two types of conflicts occur in shift − reduce parsing:

★ Shift / Reduce conflict:

Upon reaching a configuration in which, knowing the stack contents & next
input symbol, the parser cannot decide whether to shift or to reduce.

★ Reduce / Reduce conflict:

Upon reaching a configuration in which, the parser cannot decide which of the
several reductions to make.

Shift / Reduce conflict:

An ambiguous grammar may give rise to a shift / reduce conflict.

Example:

Consider the following grammar:

stmt → if expr then stmt /


if expr then stmt else stmt / other.

If at some stage, the stack configuration is

($ ... if expr then stmt, else ... $)

At this stage, the parser cannot decide whether the stack top has to be reduced
or the next input symbol has to be shifted.

Reduce / Reduce conflict:

Consider the grammar:

stmt → id ( parameter-list ) / expr = expr

parameter-list → parameter-list , parameter / parameter

parameter → id

expr → id ( expr-list ) / id

expr-list → expr-list , expr / expr

Consider the input string id (id, id)

Let the configuration be, ( $ … id ( id ,   , id ) … $ )

At this stage, the id on the stack has to be reduced, but since more than one
production has id on its right side, it leads to a confusion (reduce / reduce conflict).

2.8.2. Operator Precedence Parser:

Precedence Relations:

Bottom-up parsers for a large class of context-free grammars can be easily


developed using operator grammars.

Operator Grammars have the property that no production right side is


empty or has two adjacent non-terminals.
Consider:

E → E op E | id

op → + | ∗

This is not an operator grammar, but:

E → E + E | E ∗ E | id        is an operator grammar.

This parser relies on the following three precedence relations:

Relation Meaning

a <⋅ b          a yields precedence to b

a=⋅b a has the same precedence as b

a⋅>b a takes precedence over b

id + * $
id ⋅> ⋅> ⋅>
+ <⋅ ⋅> <⋅ ⋅>
* <⋅ ⋅> ⋅> ⋅>
$ <⋅ <⋅ <⋅ ⋅>

Precedence Table
Example: The input string:

id1 + id2 ∗ id3

After inserting precedence relations becomes:

$ < ⋅ id1 ⋅ > + < ⋅ id2 ⋅ > ∗ < ⋅ id3 ⋅ > $

Basic Principle:
Having precedence relations allows identifying handles as follows:

1. Scan the string from left until seeing ⋅ > and put a pointer.

2. Scan backwards the string from right to left until seeing < ⋅

3. Everything between the two relations < ⋅ and ⋅ > forms the handle

4. Replace handle with the head of the production.


Operator Precedence Parsing Algorithm:

Initialize: Set ip to point to the first symbol of the input string w$

Repeat: Let b be the topmost terminal on the stack, a the input symbol pointed to by ip

if (a is $ and b is $)

return        /* accept */

else if b < ⋅ a or b = ⋅ a then        /* shift */

push a onto the stack

advance ip to the next input symbol

else if b ⋅ > a then        /* reduce */

repeat

c ← pop the stack

until (the topmost terminal on the stack is related by < ⋅ to c)

else error

end
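A simplified Python sketch of this loop is shown below (illustrative only; it keeps just the terminals on the stack, treats the ($, $) pair as accept, and only checks acceptance rather than building a tree — all of these are simplifying assumptions):

# Sketch of the operator-precedence parsing loop for the table above.
prec = {}
for a, row in zip(('id', '+', '*', '$'),
                  (('',  '>', '>', '>'),     # id row
                   ('<', '>', '<', '>'),     # +  row
                   ('<', '>', '>', '>'),     # *  row
                   ('<', '<', '<', ''))):    # $  row
    for b, rel in zip(('id', '+', '*', '$'), row):
        if rel:
            prec[(a, b)] = rel

def op_precedence_parse(tokens):
    stack = ['$']
    tokens = tokens + ['$']
    i = 0
    while True:
        top, cur = stack[-1], tokens[i]     # topmost stack terminal, current input
        if top == '$' and cur == '$':
            return True                      # accept
        rel = prec.get((top, cur))
        if rel in ('<', '='):                # shift
            stack.append(cur)
            i += 1
        elif rel == '>':                     # reduce: pop the handle
            popped = stack.pop()
            while stack and prec.get((stack[-1], popped)) != '<':
                popped = stack.pop()
        else:
            return False                     # error

print(op_precedence_parse(['id', '+', 'id', '*', 'id']))   # True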

Making Operator Precedence Relations

The operator precedence parsers usually do not store the precedence table with
the relations; rather they are implemented in a special way.

Operator precedence parsers use precedence functions that map terminal


symbols to integers, and so the precedence relations between the symbols are
implemented by numerical comparison.

Algorithm for Constructing Precedence Functions:

1. Create function symbols fa and ga for each grammar terminal a and for the end of string
symbol.

2. Partition the symbols in groups so that fa and gb are in the same group if
a = ⋅ b (there can be symbols in the same group even if they are not connected
by this relation).
3. Create a directed graph whose nodes are the groups; then for each pair of symbols
a and b do: place an edge from the group of gb to the group of fa if a < ⋅ b;
otherwise, if a ⋅ > b, place an edge from the group of fa to that of gb.

4. If the constructed graph has a cycle then no precedence functions exist. When
there are no cycles, let f (a) and g (a) be the lengths of the longest paths from the
groups of fa and ga respectively.

Example: Consider the following table

id + * $

id ⋅> ⋅> ⋅>

+ <⋅ ⋅> <⋅ ⋅>

* <⋅ ⋅> ⋅> ⋅>

$ <⋅ <⋅ <⋅ ⋅>

Using the algorithm leads to the following graph:

Fig. 2.19: Graph for computing the precedence functions, with nodes gid, fid, f∗, g∗, g+, f+, f$ and g$.

From which we extract the following precedence functions:

id + ∗ $

f 4 2 4 0

g 5 1 3 0

2.9. LR PARSER:

★ Efficient, bottom-up syntax analysis technique.

LR (k):   L − left-to-right scanning of the input;
          R − constructing a rightmost derivation in reverse;
          k − number of input symbols of lookahead used in making parsing decisions.

★ If (k) is omitted then k is assumed to be 1.

★ To recognize virtually all PL constructs for which CFGs can be written.

★ General non-backtracking shift-reduce parser.

★ The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers.

★ Detect syntactic error during left to right scan of the input.

★ Too much work to construct an LR parser by hand, so a specialized tool −
an LR parser generator (YACC) − is used.

★ SLR − Simple LR − easiest to implement but least powerful.

CLR − Canonical LR − most powerful & most expensive.

LALR − Lookahead LR − intermediate in power & expense.


★ LR parsing algorithm.

Fig. 2.20: Model of an LR parser: an input buffer a1 … ai … an $, a stack holding s0 X1 s1 … Xm sm (with sm on top), the LR parsing program, and a parsing table with action and goto parts; the parser produces output.

Xi − grammar symbol

Si − state symbol ⇒ summarizes the information contained in the stack below it.

★ Combination of state (sm) symbol contained in stack & (ai) current i/p symbol
are used to index the parsing table & determine shift − reduce parsing
decision [ sm , ai ].

★ Parsing action table entry for state sm and input ai can be:

• shift s, where S is a state

• reduce by a grammar production A → β

• accept

• error

★ Function goto takes a state & a grammar symbol as arguments and produces a state.

★ The goto function of a parsing table constructed from a grammar G using the SLR,
LALR or CLR method is the transition function of a DFA that recognizes the viable
prefixes of G.
★ A configuration of an LR parser is a pair

( s0 X1 s1 X2 s2 … Xm sm ,  ai ai + 1 … an $ )

whose first component is the stack content and second component is the unexpended input.

This configuration represents the right-sentential form

X1 X2 … Xm ai ai + 1 … an

in the same way as a shift-reduce parser would; only the presence of states on the stack is
new.

★ Four types of moves


(i) If action [sm , ai ] = shift S,

(S0 X1 S1 X2 S2 … Xm Sm ai S, ai + 1 … an $ )

Parser has shifted both current symbol ai & next state S which is given in
action [ sm , ai ] on to the stack; ai + 1 becomes the current i/p symbol.

(ii) If action [sm , ai ] = reduce A → β

( S0 X1 S1 X2 S2 … Xm − r Sm − r As , ai ai + 1 … an $ )

where s = goto [ sm − r , A] and r is the length of right side of production.

Here the parser first popped 2r symbols off the stack (r state symbols & r grammar
symbols), exposing state sm − r. The parser then pushed both A, the left side
of the production, & s, the entry for goto [ sm − r , A], onto the stack.

(iii) If action [sm , ai ] = accept, parsing is completed.

(iv) If action [sm , ai ] = error, the parser has discovered an error & calls an error
recovery routine.

★ Algorithm (LR parsing)

Input − Input string w & LR parsing table with for action and goto
for a grammar G.

Output − If w is in L (G), a bottom-up parse for w; otherwise, an error indication.

Method − Initially the parser has s0, the initial state, on its stack & the input buffer holds w $.

Set ip to point to the first symbol of w $; repeat forever begin

let s be the state on top of the stack and a the symbol pointed to by ip;

if action [s, a] = shift s′ then begin

push a then s′ on top of the stack;

advance ip to the next i/p symbol

end

else if action [s, a] = reduce A → β then begin

pop 2 ∗ | β | symbols off the stack;

let s′ be the state now on top of the stack;

push A then goto [s′ , A] on top of the stack;

output the production A → β

end

else if action [s, a] = accept then

return

else error ( )

end
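The driver loop can be sketched in Python (an illustrative sketch; the action table, goto table and the production list are assumed to be supplied, e.g. from the SLR table for the expression grammar shown below):

# Sketch of the LR parsing driver.
def lr_parse(tokens, action, goto_, productions):
    """action[(state, terminal)] is ('shift', s), ('reduce', prod_no) or ('accept',);
       productions[prod_no] = (lhs, length_of_right_side)."""
    stack = [0]                            # state stack; state 0 is the initial state
    i = 0
    while True:
        s, a = stack[-1], tokens[i]
        entry = action.get((s, a))
        if entry is None:
            return False                   # blank (error) entry
        if entry[0] == 'shift':
            stack.append(entry[1])         # push the new state (the symbol itself need not be kept)
            i += 1
        elif entry[0] == 'reduce':
            lhs, r = productions[entry[1]]
            del stack[len(stack) - r:]     # pop r states (2*r symbols conceptually)
            stack.append(goto_[(stack[-1], lhs)])
            print('reduce by production', entry[1], '(', lhs, ')')
        else:                              # 'accept'
            return True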

Follow (E) = {$, +, ) }

Follow (T) = { *, $, +, ) }

Follow (F) = { *, $, +, ) }

Grammar:

(1) E → E + T

(2) E → T

(3) T → T ∗ F

(4) T → F

(5) F → (E)

(6) F → id

Parsing table for expression grammar

action goto
State
id + * ( ) $ E T F
0 S5 S4 1 2 3
1 S6 acc
2 r2 S7 r2 r2
3 r4 r4 r4 r4
4 S5 S4 8 2 3
5 r6 r6 r6 r6
6 S5 S4 9 3
7 S5 S4 10
8 S6 S11
9 r1 S7 r1 r1
10 r3 r3 r3 r3
11 r5 r5 r5 r5

Si means shift and stack state i


rj means reduce by production numbered j
acc means accept

blank means error

Stack                  Input               Action

0                      id ∗ id + id $      shift

0 id 5                 ∗ id + id $         reduce F → id

0 F 3                  ∗ id + id $         reduce T → F

0 T 2                  ∗ id + id $         shift

0 T 2 ∗ 7              id + id $           shift

0 T 2 ∗ 7 id 5         + id $              reduce F → id

0 T 2 ∗ 7 F 10         + id $              reduce T → T ∗ F

0 T 2                  + id $              reduce E → T

0 E 1                  + id $              shift

0 E 1 + 6              id $                shift

0 E 1 + 6 id 5         $                   reduce F → id

0 E 1 + 6 F 3          $                   reduce T → F

0 E 1 + 6 T 9          $                   reduce E → E + T

0 E 1                  $                   accept

Parse tree

Fig. 2.21: Parse tree for id ∗ id + id.

★ LR parsers don’t have to scan the entire stack to know when the handle
appears on top; the state symbol on top of the stack contains all the
information they need.

★ The goto function of the LR parsing table is essentially a finite automaton.

★ A grammar that can be parsed by an LR parser examining up to k i/p symbols
of lookahead on each move is called an LR (k) grammar.

Comparison of LR Parsers:

Fig. 2.22: Relative power of LR parsers: LR(0) ⊂ SLR ⊂ LALR ⊂ CLR.
LR Parser:

This is a bottom-up syntax analysis technique that can be used to parse a class
of CFG’s. LR parser is actually called LR(K) parsing where,

L → Left - right scanning.

R → Right most derivation.

K → Number of i/p symbols look a head that are used in making


parsing decisions.

An LR parser has an,

1. Input Buffer that contains the string to be parsed, followed by $, a symbol


used as a right end marker to indicate the end of the string.

2. Stack is used to hold strings of the form

S0 X1 S1 X2 … Xm − 1 Sm − 1 Xm Sm

where,

Sm is at the top of the stack.

where,

X1 , X2 … Xm − 1 , Xm are grammar symbols.

and S0 , S1 , S2 … Sm − 1, Sm are states.



3. Parsing Table:

The parsing program consists of 2 parts.

(i) The parsing action function “action”.

Action can be of 4 values:

1. Shift S

2. Reduce by a production A → β

3. Accept.

4. Error.

(ii) Goto function “goto”.

This takes a state and a grammar symbol as arguments & produces a state.

4. Configuration:

A configuration of a LR parser is a pair whose 1st component is the stack.


Content & second component is an unexpended input.

(S0 X1 S1 X2 … Xm Sn , ai ai + 1 … an $ )

Initial configuration:

(S0 , w $ )

5. Actions:

The action taken by the parser can be of the following types:

★ action [ Sm , ai ] = shift S

Shift ai onto the stack along with state ‘S’. The configuration becomes,

[S0 X1 S1 X2 … Xn Sm ai S, ai + 1 … an $ ]

★ Action [ Sm , ai ] = Reduce A → β

Parser executes a reduce move entering the configuration,

[S0 X1 S1 X2 … Xm − r Sm − r A S, ai ai + 1 , … an $ ]

where,

S = goto [Sm − r , A]

r = length of β

Number of symbols popped = 2 ∗ r.

★ action [ Sm , ai ] = Accept

parsing is completed

★ action [ Sm , ai ] = Error

parser calls an error recovery routine.

LR (0) item:
An LR (0) item of a grammar ‘G’ is a production of G with a dot at some position
of the right side.

2.9.1. Simple LR Parsing (SLR):


Construction SLR parsing table:

★ A grammar for which an SLR parser can be constructed is said to be an


SLR grammar.

★ LR (0) item, in short item, of a grammar G is a production of G with a dot


at some position of the right side.

Thus A → XYZ yields the 4 items

A → • XYZ

A → X • YZ

A → XY • Z

A → XYZ •

Production A → ∈ generates only one item A → • .

★ Item can be represented by pair of integers, first giving the no. of production
& second the position of the dot.

★ Construct from the grammar, DFA to recognize viable prefixes.


★ One collection of sets of LR (0) items, which we call the canonical LR (0)
collection, provides the basis for constructing SLR parsers.

★ To construct canonical LR (0) collection of a grammar, define augmented


grammar & closure and goto.

★ If G is a grammar with start symbol S, then G′, the augmented grammar


for G, is G with a new start symbol S′ and production S′ → S.
Acceptance occurs only when the parser is about to reduced by S′ → S

Consider the augmented expression grammar

E′ → E

E → E + T ⁄ T

T → T ∗ F ⁄ F

F → (E) ⁄ id

If I is the set of one item { [ E′ → • E] }, then closure (I) contains the items
constructed from I by the 2 rules.

1. Initially, every item in I is added to closure (I).

2. If A → α • B β is in closure (I) and B → γ is a production, then add the item B → • γ to closure (I), if it is not already there.

Rule 1:

E′ → • E

Rule 2:

E → •E+T

E → •T

T → •T∗F

T → •F

F → • (E)

F → • id

★ Items fall into two classes: kernel items and non-kernel items.

Computation of closure:

function closure (I)

begin

    J := I;

    repeat

        for each item A → α • B β in J and each production B → γ of G

            such that B → • γ is not in J do

                add B → • γ to J

    until no more items can be added to J;

    return J

end
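A direct transcription of this procedure in Python is sketched below (illustrative only; an item is represented as a (head, body, dot) triple and the grammar as a dict mapping each nonterminal to its list of alternative right-hand sides):

    def closure(items, grammar):
        # items   -- set of (head, body, dot), body a tuple of grammar symbols
        # grammar -- dict: nonterminal -> list of alternative bodies (tuples)
        result = set(items)
        changed = True
        while changed:
            changed = False
            for head, body, dot in list(result):
                if dot < len(body) and body[dot] in grammar:   # dot before a nonterminal B
                    B = body[dot]
                    for gamma in grammar[B]:
                        item = (B, tuple(gamma), 0)            # B -> . gamma
                        if item not in result:
                            result.add(item)
                            changed = True
        return frozenset(result)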

★ Kernel items: the initial item S′ → • S and all items whose dots are not at the left end.

★ Non-kernel items: all other items, which have their dots at the left end.

goto operation:

★ goto (I, X) where I is a set of items and X is a grammar symbol.

★ goto (I, X) is defined to be the closure of the set of all items


[ A → α X • β ] such that A → α • X β is in I.

eg: If I is set of two items

{ [ E′ → E • ] , [E → E • + T] } then goto (I, +) consists of

E → E+•T

T → •T∗F

T → •F

F → • (E)

F → • id
SYNTAX ANALYSIS 2.61

2.9.2. Canonical LR Parsing (CLR):

Construction of canonical LR(0) items:

★ To construct canonical collection of sets of LR(0) items for an augmented


grammar G′, the algorithm is shown below

procedure items (G′);

begin

    C := { closure ( { [ S′ → • S ] } ) };

    repeat

        for each set of items I in C and each grammar symbol X

            such that goto (I, X) is not empty and not in C do

                add goto (I, X) to C

    until no more sets of items can be added to C

end
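Continuing the sketch, goto and the canonical collection follow the same pattern (this reuses the closure function and the item/grammar representation from the sketch above; the start item and the list of grammar symbols are passed in by the caller):

    def goto_(I, X, grammar):
        # Move the dot over X in every item of I, then take the closure.
        moved = {(head, body, dot + 1)
                 for head, body, dot in I
                 if dot < len(body) and body[dot] == X}
        return closure(moved, grammar) if moved else frozenset()

    def canonical_collection(start_item, symbols, grammar):
        # Build the canonical collection C of sets of LR(0) items.
        C = [closure({start_item}, grammar)]
        changed = True
        while changed:
            changed = False
            for I in list(C):
                for X in symbols:
                    J = goto_(I, X, grammar)
                    if J and J not in C:
                        C.append(J)
                        changed = True
        return C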

Canonical LR(0) collection for grammar:

I0 : E′ → • E

E → •E+T

E → •T

T → •T∗F

T → •F

F → • (E)

F → • id

I1 (goto (I0, E))

E′ → E •

E → E•+T

I2 (goto (I0 , T))

E → T•

T → T•∗F
2.62 COMPILER DESIGN

I3 goto (I0 , F)

T → F•

I4 goto (I0 , ( )

F → ( • E)

E → •E+T

E → •T

T → •T∗F

T → •F

F → • (E)

F → • id

I5 goto (I0 , id)

F → id ⋅

I6 goto (I1 , +)

E → E+•T

T → •T∗F

T → •F

F → • (E)

F → • id

I7 goto ( I2 , ∗ )

T → T∗•F

F → • (E)

F → • id

I8 goto ( I4 , E)

F → (E •)

E → E•+T
SYNTAX ANALYSIS 2.63

I9 goto (I6 , T)

E → E+T•

T → T•∗F
I10 goto ( I7 , F)

T → T∗F•
I11 goto ( I8 , )

F → (E) •

Fig. 2.23: Transition diagram of the DFA D of LR(0) item sets I0 – I11 (the edges are the goto transitions on E, T, F, +, ∗, (, ), and id)

Transition diagram of DFA D for viable prefixes:

goto (I7 , id) = I5

goto (I4 , T) = I2

goto (I4 , F) = I3

goto (I4 , ( ) = I4

goto (I4 , id) = I5

goto (I6 , F) = I3

goto (I6 , ( ) = I4

goto (I6 , id) = I5

goto (I7 , ( ) = I4

goto (I8 , + ) = I6

goto (I9 , ∗ ) = I7

Algorithm − Constructing an SLR parsing table

Input − An augmented grammar G′

Output − The SLR parsing table functions goto & action for G′

Method

1. Construct C = { I0 , I1 … In } the collection of sets of LR(0) items for G′

2. State i is constructed from Ii. The parsing actions for state i are determined
as follows:

(a) If [A → α • a β ] is in Ii and goto (Ii , a) = Ij then set action [i, a] to “shift j”. Here ‘a’ must be a terminal.

(b) If [A → α • ] is in Ii then set action [i, a] to “reduce A → α” for all a in FOLLOW (A); here A may not be S′.

(c) If [ S′ → S • ] is in Ii then set action [i, $ ] to “accept”.

If any conflicting actions are generated by the above rules, then the grammar
is not SLR(1).

The algorithm fails to produce a parser in this case.

3. The goto transactions for state i are constructed for all nonterminals A using
the rule.

If goto (Ii , A) = Ij then

goto [i, A] = j

4. All entries not defined by rules (2) & (3) are made “error”.

5. The initial state of the parser is the one constructed from the set of items
containing [S′ → • S].
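The rules above translate almost line for line into code. The sketch below fills the ACTION and GOTO tables from a canonical collection; it is illustrative only and assumes the collection C, a goto map, the FOLLOW sets, a numbering of the productions, and the terminal/nonterminal alphabets are supplied. A conflict means the grammar is not SLR(1):

    def build_slr_table(C, goto_map, follow, prod_number, start_sym, terminals, nonterminals):
        # C           -- list of LR(0) item sets; state i corresponds to C[i]
        # goto_map    -- dict: (item_set, symbol) -> item_set
        # follow      -- dict: nonterminal -> set of terminals (including '$')
        # prod_number -- dict: (head, body) -> production number
        action, goto_table = {}, {}

        def set_action(i, a, act):
            if action.get((i, a), act) != act:
                raise ValueError("conflict in state %d on %r: not SLR(1)" % (i, a))
            action[(i, a)] = act

        for i, I in enumerate(C):
            for head, body, dot in I:
                if dot < len(body) and body[dot] in terminals:    # rule 2(a): shift
                    j = C.index(goto_map[(I, body[dot])])
                    set_action(i, body[dot], ('shift', j))
                elif dot == len(body) and head == start_sym:      # rule 2(c): accept
                    set_action(i, '$', ('accept',))
                elif dot == len(body):                            # rule 2(b): reduce
                    for a in follow[head]:
                        set_action(i, a, ('reduce', prod_number[(head, body)]))
            for A in nonterminals:                                # rule 3: goto entries
                if (I, A) in goto_map:
                    goto_table[(i, A)] = C.index(goto_map[(I, A)])
        return action, goto_table                                 # rule 4: absent entries are errors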

Every SLR(1) grammar is unambiguous, but there are many unambiguous


grammars that are not SLR (1).

eg: Consider the grammar with productions

1. S → L = R

2. S → R

3. L → ∗ R

4. L → id

5. R → L

Augmented grammar: S′ → S

LR (0) items: closure of { S′ → • S }

S′ → • S

S → • L = R

S → • R

L → • ∗ R

L → • id

R → • L
2.66 COMPILER DESIGN

I0 S′ → • S

S → •L=R

S → •R

L → • ∗ R

L → • id

R → • L

I1 goto (I0 , S)

S′ → S •

I2 goto (I0 , L)

S → L•=R

R → L•

I3 goto (I0 , R)

S → R•

I4 goto (I0 , ∗)

L → ∗•R

R → •L

L → •∗R

L → • id

I5 goto (I0 , id)

L → id •

I6 goto (I2 , = )

S → L=•R

R → •L

L → •∗R

L → • id
SYNTAX ANALYSIS 2.67

I7 goto (I4 , R)

L → ∗R•

I8 goto (I4 , L)

R → L•
goto (I4 , ∗) = I4
goto (I4 , id) = I5

I9 goto (I6 , R)

S → L=R•

goto (I6 , L) = I8
goto (I6 , ∗) = I4
goto (I6 , id) = I5
Follow (S) = { $ }
Follow (L) = { = , $ }
Follow (R) = { $ , = }

                    action                                  goto
 State      =          ∗          id         $          S      L      R

   0                  s4         s5                    1      2      3

   1                                        acc

   2     s6 ⁄ r5                            r5

   3                                        r2

   4                  s4         s5                           8      7

   5      r4                                r4

   6                  s4         s5                           8      9

   7      r3                                r3

   8      r5                                r5

   9                                        r1

action [2, = ] is s6 ⁄ r5

Even though the given grammar is not ambiguous, there is a shift–reduce conflict in state 2; the SLR parser is not powerful enough for this grammar.

Tutorial:

S → AS ⁄ b

A → SA ⁄ a

Construct SLR parse table for grammar. Show the actions of the parser for the
i/p string “abab”.

Canonical LR (CLR) parsing table:

★ More information is added with each state of an LR parser.

★ If A → α • is there in the itemset then the set of i/p symbols that can follow
a handle α for which there is a possible reduction to A will be added to
items in item set.

★ The extra information is incorporated into the state by redefining items to


include a terminal symbol as a second component.

★ The general form of items becomes


[A → α • β , a] where A → α β is a production and a is a terminal (or) $

This is called LR(1) item − look ahead of the item.

Constriction of the sets of LR(1) items:

Input − Augment Grammar G′

Output − Set of LR(1) items

Method − closure & goto, items (G′)

Closure (I)

For each item [A → α • B β , a] in I each production B → γ in G′ and each


terminal b in FIRST (β a) such that [ B → • γ , b] is not in I do add [B → • γ , b] to
I.

goto (I, X):

Let J be the set of items [A → α X • β , a] such that [A → α • X β , a] is in I


return closure (J).
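In code, the LR(1) closure differs from the LR(0) version only in how lookaheads are computed. A sketch in the same style as before (illustrative; items carry a fourth lookahead component, and a FIRST function over sequences of grammar symbols is assumed to be available):

    def closure_lr1(items, grammar, first):
        # items -- set of (head, body, dot, lookahead)
        # first -- function: sequence of grammar symbols -> set of terminals
        result = set(items)
        changed = True
        while changed:
            changed = False
            for head, body, dot, a in list(result):
                if dot < len(body) and body[dot] in grammar:
                    B = body[dot]
                    beta_a = list(body[dot + 1:]) + [a]      # the sequence beta a
                    for gamma in grammar[B]:
                        for b in first(beta_a):              # each terminal b in FIRST(beta a)
                            item = (B, tuple(gamma), 0, b)
                            if item not in result:
                                result.add(item)
                                changed = True
        return frozenset(result)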
SYNTAX ANALYSIS 2.69

item (G′):
Same as SLR but initially

C = { closure ( { [S′ → • S, $ ] } ) };

Eg:
1. S → CC

2. C → bC

3. C → d

FIRST (C) = { b , d }

FIRST (S) = { b , d }

I0

S′ → • S, $

S → • CC, $

C → • bC, b ⁄ d

C → • d, b ⁄ d

I1 goto (I0 , S)

S′ → S • , $

I2 goto (I0 , C)

S → C • C , $

C → • bC , $

C → • d , $
2.70 COMPILER DESIGN

I3 goto (I0 , b)

C → b • C, b ⁄ d

C → • bC , b ⁄ d

C → • d,b⁄d

I4 goto (I0 , d)

C → d •,b⁄d

I5 goto (I2 , C)

S → cC • , $

I6 goto (I2 , b)

C → b • C, $

C → • bC, $

C → • d, $

I7 goto (I2 , d)

C → d•,$

I8 goto (I3 , C)

C → bC • , b ⁄ d

goto (I3 , b) = − I3

goto (I3 , d) = I4

2.9.3. LALR:

LALR Parser:

★ The CLR parser avoids conflicts in the parse table. But it produces more
no. of states when compared to SLR parser. Hence it occupies more space.

★ So LALR parser can be used. Here the tables detained are smaller than
CLR parse tables & also efficient as CLR parsers.
SYNTAX ANALYSIS 2.71

★ LALR parse tables are constructed from LR(1) collection of items.

★ LR(1) items that have same productions but different lookaheads are
combined to form a single set of items (It means that these items result in
the same state of the DFA)

eg:

I4 = goto (I0 , d)

= { C → d • , b ⁄ d }

I7 = goto (I2 , d)

= { C → d • , $ }

I4 & I7 differ only in look-aheads so combined as single state I47

I3 = goto (I0 , b)

= { C → b • C , b ⁄ d

    C → • bC , b ⁄ d

    C → • d , b ⁄ d }

I6 = goto (I2 , b)

= { C → b • C , $

    C → • bC , $

    C → • d , $ }

I3 & I6 differ only in lookaheads, so they are combined as the single state I36.

I8 = goto (I3 , C)

= { C → bC • , b ⁄ d }

I9 = goto (I6 , C)

= { C → bC • , $ }

I8 & I9 differ only in lookaheads, so they are combined as the single state I89.

                action                     goto
 State      b         d         $        S      C

   0       s36       s47                 1      2

   1                           acc

   2       s36       s47                        5

  36       s36       s47                       89

  47       r3        r3        r3

   5                           r1

  89       r2        r2        r2

★ This merger of states can never produce shift-reduce conflict. However it can
produce a reduce-reduce conflict.

LALR parse table construction algorithm:

1. Construct C = { I0 , I1 , … In } the collection of sets of LR(1) items for G′


(augmented grammar).

2. For each core present among the sets of LR(1) items, find all sets having that core and replace these sets by their union.

3. Let C′ = { J0 , J1 , … Jm } be the resulting sets of LR(1) items. The parsing actions for state i are constructed from Ji. If there is a conflict, then the grammar is not LALR (1).

4. For the goto table entries: if J is the union of LR(1) item sets, say J = I1 ∪ I2 ∪ … ∪ Ik, then goto (J, X) = K, where K is the union of all the sets of items having the same core as goto (I1 , X).
SYNTAX ANALYSIS 2.73

5. All other entries are error entries.

Tutorial: Construct the SLR, CLR and LALR parsers for the grammar given below.

S → a ⁄ ^ ⁄ (R)

T → S, T ⁄ S

R → T

Valid i/p − (a, ^)

Invalid i/p − a ^ (a)

2.10. COMPARISON OF LR PARSERS:

It’s a time to compare SLR, LALR and LR parser for the common factors such
as size, class of CFG, efficiency and cost in terms of time and space.

Sr. No.  SLR parser                           LALR parser                            Canonical LR parser

1.       SLR parser is smallest in size.      The LALR and SLR parsers have          The canonical LR parser is
                                              the same size.                         largest in size.

2.       It is the easiest method, based      This method is applicable to a         This method is more powerful
         on the FOLLOW function.              wider class of grammars than SLR.      than SLR and LALR.

3.       This method exposes fewer            Most of the syntactic features of      This method exposes more
         syntactic features than the          a language are expressed in LALR.      syntactic features of a language
         LR parsers.                                                                 than SLR and LALR.

4.       Error detection is not               Error detection is not immediate       Immediate error detection is
         immediate in SLR.                    in LALR.                               done by the LR parser.

5.       It requires less time and space      The time and space complexity is       The time and space complexity
         complexity.                          more in LALR, but efficient            is more for the canonical LR
                                              methods exist for constructing         parser.
                                              LALR parsers directly.

Graphical representation for the class of LR family is as given below:

LR (0) ⊂ SLR ⊂ LALR (1) ⊂ LR (1)

Fig. 2.24: Classification of grammars


2.11. ERROR HANDLING AND RECOVERY IN SYNTAX ANALYZER:

An efficient parser should not terminate on a parse error.

★ It must recover to parse the rest of the input and check for subsequent
errors.

★ For one-line inputs, the routine yyparse() can be made to return 1 on error, and the caller then invokes yyparse() again.

YACC program error handling:

★ Parser must be capable of detecting the error as soon as it encounters, i.e.,


when an input stream does not match the rules in grammar.

★ If there is an error-handling subroutine in the grammar file, the parser can


allow for entering the data again, ignoring the bad data or initiating a
cleanup and recovery action.

★ When the parser finds an error, it may need to reclaim parse tree storage,
delete or alter symbol table entries and set switches to avoid generating
further output.

★ Error handling routines are used to restart the parser to continue its process
even after the occurrence of error.

★ Tokens following the error get discarded to restart the parser.

★ The YACC command uses a special token name error, for error handling.
The token is placed at places where error might occur so that it provides a
recovery subroutine.
SYNTAX ANALYSIS 2.75

★ To prevent subsequent occurrence of errors, the parser remains in error state


until it processes three tokens following an error.

★ The input is discarded and no message is produced, if an error occurred


while the parser remains in error state.

(eg.) stat : error ‘ ; ’

★ The above rule tells the parser that when there is an error, it should ignore
the token and all following tokens until it finds the next semicolon.

★ It discards all the tokens after the error and before the next semicolon.
★ Once semicolon is found, the rule is reduced by parser and cleanup action
associated with that rule will be performed.

Providing for error correction:


The input errors can be corrected by entering a line in the data stream again.

input   : error '\n'
          {
            printf ("Reenter last line: ");
          }
          input
          {
            $$ = $4;
          } ;

The YACC statement yyerrok is used to indicate that error recovery is complete. This statement leaves the error state and begins processing normally.

input   : error '\n'
          {
            yyerrok;
            printf ("Reenter last line: ");
          }
          input
          {
            $$ = $4;
          } ;

Clearing the Lookahead token:

★ When an error occurs, the lookahead token becomes the token at which the
error was detected.

★ The lookahead token must be changed if the error recovery action includes
code to find the correct place to start processing again.

★ To clear the lookahead token, the error-recovery action issues the following
statement: yyclearin;

★ To assist in error handling, macros can be placed in YACC actions.


Macros for error handling

YYERROR Causes the parser to initiate error handling.

YYABORT Causes the parser to return with a value of 1.

YYACCEPT Causes the parser to return with a value of 0.

YYRECOVERING() Returns a value of 1 if a syntax error has been


detected and the parser has not yet fully recovered.

2.12. YACC:

★ Yet Another compiler compiler.

★ GNU equivalent of Yacc is bison.

★ Translates any grammar that describes a language into a parser for that
language.

★ Written in BNF (Backus Naur Form).

★ YACC specifications use the .y extension.

★ The Yacc compiler is invoked using

$ yacc < options > < filename ending with .y >

1. Generation of tokens using Lex.

2. Specification of the grammar by including the following:

★ Writing the grammar & the actions to be taken in a .y file.

★ Writing a lexical analyzer to process the input.

Construct predictive parse table for the following grammar


S → a  ↑  (T)

T → T, s ⁄ S
T → ST′
T′ → , ST′ ⁄ ∈

first (s) = { a, ↑ , c }
first (T) = { a, ↑ , c }
first (T′) = { , , ∈ }

I/P string (a, a)


s → a
s → ↑
s → (T)
follow (S) = { , , $ , ) }

follow (T) = { ) }
follow (T′) = { ) }

Fig. 2.25: Parse tree for (a, a) — S derives ( T ); T derives S T′, the first S derives a, T′ derives , S T′, the second S derives a, and the last T′ derives ∈

                                      I/P symbol
Non-terminal      a            ↑            (            )            ,              $

S                 S → a        S → ↑        S → (T)

T                 T → ST′      T → ST′      T → ST′

T′                                                       T′ → ∈       T′ → , ST′

Stack              I/P             O/P

$ S                (a, a) $

$ ) T (            (a, a) $        S → (T)

$ ) T              a, a ) $

$ ) T′ S           a, a ) $        T → ST′

$ ) T′ a           a, a ) $        S → a

$ ) T′             , a ) $

$ ) T′ S ,         a ) $           T′ → , ST′

$ ) T′ S           a ) $           S → a

$ ) T′ a           a ) $

$ ) T′             ) $

$ )                ) $             T′ → ∈

$                  $

Construct an SLR parser for the grammar

S → a ⁄ ^ ⁄ (R)

T → S, T ⁄ S

R → T

Augmented grammar: S′ → S

1. S → a
2. S → ^
3. S → (R)
4. T → S, T
5. T → S
6. R → T

Consider the grammar

S → aSbS ⁄ bSaS ⁄ ∈

and the input string ababbaab. A leftmost derivation begins:

S ⇒lm aSbS ⇒lm abSaSbS ⇒lm abaSbSaSbS ⇒lm ababSaSbSaSbS ⇒lm ababbSaSaSbSaSbS

TOOL FOR PARSER:


YACC (YET ANOTHER COMPILER COMPILER)

(i) Manual construction of a parser takes much time; hence, to automate the construction, many built-in tools are available.

(ii) YACC is one such tool, used to generate a parser (syntax analyzer) for a variety of languages; it generates an LALR parsing table.

(iii) The input to the YACC compiler is a program written in the YACC language.

YACC specification (translate.y) → YACC compiler → y.tab.c

y.tab.c → C compiler → a.out

input string → a.out → syntax tree

Fig. 2.26

2.12.1. YACC Specification:

(iv) A YACC pgm consists of 3 parts:

Declarations

% %

Translation rules

% %

Supporting procedures

(v) Declarations:

This includes declaration of ordinary C declarations, temporaries and the token


formats. Tokens declared in this section can be used in the subsequent translation
and C-routines section.
2.80 COMPILER DESIGN

(vi) Translation Rules:

For each production of the form

< left side > → < alt 1 > | < alt 2 > | ... | < alt n >

the translation rule will be of the form,

< left side > : < alt 1 >   { semantic Action 1 }

              | < alt 2 >   { semantic Action 2 }

              . . . . . . .

              | < alt n >   { semantic Action n }

              ;

Conventions used in writing the Rules:

1. Terminal symbols: symbols represented within single quotes like ‘c’.

2. Non-Terminals: unquoted strings of letters and digits that have not been
declared to be a token.

3. If a production has more than one alternative on its right hand side, they
should be separated by a vertical bar.

4. The translation rule of each production has to be ended with a semicolon.

5. The symbol on the left hand side of first translation rule should be the start
symbol.

6. A YACC semantic rule is nothing but a C statement which has to be executed


when the corresponding production is matched.

7. Symbols used in the semantic action are,

$$ − Attribute value associated with the non-terminal on the left.

$i − Value associated with ith grammar symbol on the right.

*********
CHAPTER – III

INTERMEDIATE CODE
GENERATION

3.1. SYNTAX DIRECTED DEFINITIONS:

★ Generalization of CFG in which each grammar symbol has an associated set


of attributes (synthesized & inherited attributes).

★ Attribute can represent anything we choose − string, number, type, memory


location, ...

★ Value of an attribute at a parse-tree node is defined by semantic rule


associated with the production used at that node.

★ Synthesized attribute − value is computed from values of attribute at


children of that node in parse tree.

★ Inherited attribute is computed from values of siblings and parent of that


node.

★ Semantic rules − dependency between attribute.

★ Dependency graph − evaluation order for semantic rule.

★ Semantic rule: side effect − printing (or) updating a variable.


3.2 COMPILER DESIGN

★ Annotated parse tree − parse tree showing values of attributes at each node.

★ Annotating / decorating parse tree − process of computing the attribute


values at the node.

★ Form of a syntax − directed definition.

Each grammar production A → α

with set of semantic rules of form

b := f (c1 , c2 , … ck)

where f is a function, and either:

1. b is a synthesized attribute of A and c1 , c2 , … ck are attributes


belonging to the grammar symbols of the production.

2. b is an inherited attribute of one of the grammar symbols on the right


side of the production and c1 , c2 … − attributes belonging to grammar
symbols of the production

b depends on c1 , c2 … ck

★ Syntax directed definition of a simple desk calculator

L − line E − Expr T − Term F − Factor

Productions Semantic rules

L → En print (E • val)

E → E1 + T E • val = E1 • val + T • val

E→T E • val : = T • val

T → T1 ∗ F T • val = T1 • val ∗ F • val

T→F T • val = F • val

F → (E) F • val : = E• val

F → digit F • val : = digit • lex val


For the production L → En, the semantic rule is just a procedure that prints, as output, the value of the arithmetic expression generated by E; it uses a dummy attribute for the nonterminal L.

★ Terminals have synthesized attributes only, supplied by the lexical analyzer.

★ The start symbol has no inherited attributes.

★ An S-attributed definition uses synthesized attributes exclusively.

• It can be evaluated bottom-up.

• It suits an LR parser generator.
Annotated parse tree for 3 * 5 + 4n = 19

Fig. 3.1: Annotated parse tree for 3 ∗ 5 + 4 n — the digits give F.val = 3, 5 and 4; T.val = 15 for 3 ∗ 5; E.val = 15, and E.val = 19 at the root production L → E n
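The synthesized val attributes shown in Fig. 3.1 can be computed bottom-up over the tree. A small illustrative Python sketch (the node encoding is mine, not from the text):

    def eval_val(node):
        # A node is either ('digit', lexval) or (op, left, right), op in {'+', '*'},
        # mirroring the rules for F -> digit, T -> T * F and E -> E + T.
        if node[0] == 'digit':
            return node[1]                       # F.val := digit.lexval
        op, left, right = node
        l, r = eval_val(left), eval_val(right)
        return l + r if op == '+' else l * r     # E.val / T.val rules

    tree = ('+', ('*', ('digit', 3), ('digit', 5)), ('digit', 4))   # 3 * 5 + 4
    print(eval_val(tree))                                           # 19, as in Fig. 3.1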

★ Inherited attributes − convenient for expressing the dependence of a


programming language construct on the context in which it appears.

• e.g., to keep track of whether an identifier appears on the left or right side of an assignment, in order to decide whether the address (or) the value of the identifier is needed.
3.4 COMPILER DESIGN

eg:
Syntax directed definition with inherited attributed L • in

D − Declaration
T − Type
in − inherited attribute

Production Semantic rules

D → TL L • in : = T • type

T → int T • type : = integer

T → real T • type : = real

L → L1 , id L1 • in : = L • in

L → id add type (id • entry, L • in)

add type (id • entry, L • in)

real id1 , id2 , id3

Fig. 3.2: Parse tree for real id1 , id2 , id3 — T.type = real, and L.in = real is inherited down the list of identifiers id1, id2 and id3

3.3. DEPENDENCY GRAPHS:

★ The inter dependencies among the inherited and synthesized attributes at


the nodes in a parse tree can be depicted by a directed graph called
dependency graph.
INTERMEDIATE CODE GENERATION 3.5

★ for each node n in the parse tree do

       for each attribute a of the grammar symbol at n do

           construct a node in the dependency graph for a;

   for each node n in the parse tree do

       for each semantic rule b := f (c1 , c2 , … ck)
               associated with the production used at n do

           for i := 1 to k do

               construct an edge from the node for ci to the node for b;

★ Introduce a dummy synthesized attribute b for each semantic rule that


consists of a procedure call.

★ A • a = f (X • x, Y • y) is a semantic rule for A → XY

Fig. 3.3: Dependency edges from X.x and Y.y to A.a

★ X • i = g (A • a, Y • y) is a semantic rule for A → XY

Fig. 3.4: Dependency edges from A.a and Y.y to X.i

★ E → E1 + E2 with E • val = E1 • val + E2 • val

Fig. 3.5: Dependency edges from E1.val and E2.val to E.val

Fig. 3.6: Dependency graph for real id1 , id2 , id3 — nodes 1, 2, 3 are the entry attributes of id1, id2 and id3; node 4 is T.type; nodes 5, 7 and 9 are the inherited L.in attributes; nodes 6, 8 and 10 are the dummy attributes of the addtype calls

Topological sort of a DAG (directed acyclic graph): an ordering m1 , m2 , … mk of the nodes of the graph such that edges go from earlier to later nodes. It gives a valid order in which the semantic rules can be evaluated:

1. id1 • entry

2. id2 • entry

3. id3 • entry

4. a4 = real

5. a5 = a4

6. addtype (id3 • entry, a5)

7. a7 = a5

8. addtype (id2 • entry, a7)

9. a9 = a7

10. addtype (id1 • entry, a9)

★ Methods for evaluating semantic rules:

• Parse-tree method − at compile time, an evaluation order is obtained from a topological sort of the dependency graph; the graph must not contain a cycle.

• Rule-based method − the evaluation order is fixed at compiler-construction time by analysing the semantic rules.

• Oblivious method − the evaluation order is forced by the parsing method, without considering the semantic rules.

3.3.1. Syntax Tree:


Construction of syntax trees:
★ A grammar suitable for parsing may not reflect the natural hierarchical structure of the constructs in the language.

★ The order in which the nodes in a parse tree are considered may not match
the order in which information about a construct becomes available.

★ Syntax tree
• Condensed form of parse tree
• Useful for representing language constructs.
• S → if B then S1 else S2.
if - then - else

B S1 S2

Fig. 3.7
• Operators & keywords don’t appear as leaves

Fig. 3.8: Parse tree and syntax tree for 3 ∗ 5 + 4 — the syntax tree has + at the root, with a ∗ node over the leaves 3 and 5 as its left child and the leaf 4 as its right child

• Constructing syntax trees for expressions:
3.8 COMPILER DESIGN

★ Similar to translation of expressions to postfix form.


★ Each node is a record with several fields: the operator is the label of the node, and the remaining fields are pointers to the nodes for the operands.

Functions to create nodes:

mknode (op, left, right)

mkleaf (id, entry) → entry points to the symbol-table entry for the identifier

mkleaf (num, val)

eg: a − 4 + C (bottom up construction)

P1 = mk leaf (id, entry a)

P2 = mk leaf (num, 4)

P3 = mk node ( − , P1 , P2)

P4 = mk leaf (id, entry c)

P5 = mk node ( + , P3 , P4)

Fig. 3.9: Syntax tree for a − 4 + c — the root + has a − node (over the leaf id pointing to the entry for a and the leaf num 4) as its left child, and the leaf id pointing to the entry for c as its right child
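A minimal Python rendering of these constructor functions (illustrative only — the text's versions build records with pointer fields rather than tuples):

    def mknode(op, left, right):
        return ('node', op, left, right)     # interior node: operator label + operand pointers

    def mkleaf(kind, value):
        return (kind, value)                 # leaf: ('id', entry) or ('num', val)

    # Bottom-up construction of the syntax tree for a - 4 + c, as in Fig. 3.9
    p1 = mkleaf('id', 'entry_a')             # 'entry_a' stands for the symbol-table entry of a
    p2 = mkleaf('num', 4)
    p3 = mknode('-', p1, p2)
    p4 = mkleaf('id', 'entry_c')
    p5 = mknode('+', p3, p4)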
INTERMEDIATE CODE GENERATION 3.9

★ Syntax directed definition for construction of syntax tree.

Production Semantic rule

E → E1 + T E • nptr = mk node (‘+’, E1 • nptr, T • nptr)

E → E1 − T E • nptr = mk node (‘−’, E1 • nptr, T • nptr)

E→T E • nptr = T • nptr

T → (E) T • nptr = E • nptr

T → id T • nptr = mk leaf (id, id • entry)

T → num T • nptr = mk leaf (num, num • val)

a − 4 + c

Fig. 3.10: Construction of the syntax tree for a − 4 + c — the nptr attributes of E and T point to the nodes of the syntax tree of Fig. 3.9

3.3.2. Three Address Code:

Three address code is a type of intermediate code which is easy to generate


and can be easily converted to machine code. It makes use of at most three addresses
and one operator to represent an expression and the value computed at each
instruction is stored in temporary variable generated by compiler. The compiler
decides the order of operation given by three address code.
3.10 COMPILER DESIGN

General representation: a = b op c
Where a, b or c represents operands like names, constants or compiler generated
temporaries and op represents the operator.

Example 1: Convert the expression a ∗ − (b + c) into three address code.

t1 = b + c

t2 = uminus t1

t3 = a ∗ t2

Example 2: Write three address code for following code

for (i = 1; i < = 10; i ++)

a[i] = x * 5;

i=1

L : t1 = x ∗ 5

t2 = & a

t3 = sizeof(int)

t4 = t3 ∗ i

t5 = t2 + t4

*t5 = t1

i=i+1

if i < = 10 goto L

3.3.2.1. Implementation of Three Address Code:


There are 3 representations of three address code namely

1. Quadruple

2. Triples

3. Indirect Triples
INTERMEDIATE CODE GENERATION 3.11

1. Quadruple:
It is structure with consist of 4 fields namely op, argl, arg2 and result. op
denotes the operator and arg1 and arg2 denotes the two operands and result is used
to store the result of the expression.

Operator
Source 1
Source 2
Destination

Advantage:

★ Easy to rearrange code for global optimization.


★ One can quickly access value of temporary variables using symbol table.
Disadvantage:

★ Contain lot of temporaries.


★ Temporary variable creation increases time and space complexity.
Example: Consider expression a = b ∗ − c + b ∗ − c.

The three address code is:

t1 = uminus c
t2 = b ∗ t1
t3 = uminus c
t4 = b ∗ t3
t5 = t2 + t4
a = t5

# Op Arg1 Arg2 Result


(0) uminus c t1
(1) * t1 b t2
(2) uminus c t3
(3) * t3 b t4
(4) + t2 t4 t5
(5) = t5 a

Fig. 3.11: Quadruple representation
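The quadruples of Fig. 3.11 can be produced mechanically with a temporary generator and an emit routine. A small illustrative sketch (the function and variable names here are mine):

    quads = []            # list of (op, arg1, arg2, result)
    count = 0

    def newtemp():
        global count
        count += 1
        return 't%d' % count

    def emit(op, arg1, arg2, result):
        quads.append((op, arg1, arg2, result))
        return result

    # a = b * -c + b * -c
    t1 = emit('uminus', 'c', None, newtemp())
    t2 = emit('*', 'b', t1, newtemp())
    t3 = emit('uminus', 'c', None, newtemp())
    t4 = emit('*', 'b', t3, newtemp())
    t5 = emit('+', t2, t4, newtemp())
    emit('=', t5, None, 'a')

    for i, q in enumerate(quads):
        print(i, q)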


3.12 COMPILER DESIGN

2. Triples:

This representation doesn’t make use of extra temporary variable to represent


a single operation instead when a reference to another triple’s value is needed, a
pointer to that triple is used. So, it consist of only three fields namely op, arg1 and
arg2.

Operator

Source 1

Source 2

Disadvantage:

★ Temporaries are implicit and difficult to rearrange code.

★ It is difficult to optimize because optimization involves moving intermediate


code. When a triple is moved, any other triple referring to it must be updated
also. With help of pointer one can directly access symbol table entry.

Example: Consider expression a = b ∗ − c + b ∗ − c

# Op Arg1 Arg2

(0) uminus c

(1) * (0) b

(2) uminus c

(3) * (2) b

(4) + (1) (3)

(5) = a (4)

Fig. 3.12: Triples representation

3. Indirect Triples:

This representation makes use of pointer to the listing of all references to


computations which is made separately and stored. Its similar in utility as compared
to quadruple representation but requires less space than it. Temporaries are implicit
and easier to rearrange code.
INTERMEDIATE CODE GENERATION 3.13

Example: Consider expression a = b ∗ − c + b ∗ − c

List of pointers to table

# Op Arg1 Arg2 # Statement

(14) uminus c (0) (14)

(15) * (14) b (1) (15)

(16) uminus c (2) (16)

(17) * (16) b (3) (17)

(18) + (15) (17) (4) (18)

(19) = a (18) (5) (19)

Fig. 3.13: Indirect Triples representation

Example: Write quadruple, triples and indirect triples for following expression:
(x + y) ∗ (y + z) + (x + y + z)

Example: The three address code is:

t1 = x + y

t2 = y + z

t3 = t1 ∗ t2

t4 = t1 + z

t5 = t3 + t4

# Op Arg1 Arg2 Result

(1) + x y t1

(2) + y z t2

(3) * t1 t2 t3

(4) + t1 z t4

(5) + t3 t4 t5

Fig. 3.14: Quadruple representation


3.14 COMPILER DESIGN

# Op Arg1 Arg2

(1) + x y

(2) + y z

(3) * (1) (2)

(4) + (1) z

(5) + (3) (4)

Fig. 3.15: Triples representation

List of pointers to table

# Op Arg1 Arg2 # Statement

(14) + x y (1) (14)

(15) + y z (2) (15)

(16) * (14) (15) (3) (16)

(17) + (14) z (4) (17)

(18) + (16) (17) (5) (18)

Fig. 3.16: Indirect Triples representation

3.3.2.2. Types of Three Address Code:

Translation of Assignment Statements

In syntax-directed translation, the assignment statement mainly deals with expressions. The expression can be of type real, integer, array or record.

Consider the grammar

1. S → id: = E

2. E → E1 + E2

3. E → E1 * E2

4. E → (E1)

5. E → id
INTERMEDIATE CODE GENERATION 3.15

The translation scheme of above grammar is given below:

Production rule Semantic actions

S → id : =E {p = look_up(id.name);
If p ≠ nil then
Emit (p = E.place)
Else
Error;
}

E → E1 + E2 {E.place = newtemp();
Emit (E.place = E1.place ‘+’ E2.place)
}

E → E1 * E2 {E.place = newtemp();
Emit (E.place = E1.place ‘*’ E2.place)
}

E → (E1) {E.place = E1.place}

E → id {p = look_up(id.name);
If p ≠ nil then
E.place = p
Else
Error;
}

★ The p returns the entry for id.name in the symbol table.


★ The Emit function is used for appending the three address code to the output
file. Otherwise it will report an error.
★ The newtemp() is a function used to generate new temporary variables.
★ E.place holds the value of E.

Boolean expressions

Boolean expressions have two primary purposes. They are used for computing
the logical values. They are also used as conditional expression using if-then-else or
while-do.
Consider the grammar
1. E → E OR E
2. E → E AND E
3. E → NOT E
3.16 COMPILER DESIGN

4. E → (E)
5. E → id relop id
6. E → TRUE
7. E → FALSE

The relop denotes a relational operator such as <, ≤, >, ≥.

AND and OR are left-associative. NOT has the highest precedence, then AND, and lastly OR.

Production rule Semantic actions

E → E1 OR E2 {E.place = newtemp();
Emit {E.place ‘:=’ E1.place ‘OR’ E2.place}
}

E → E1 AND E2 {E.place = newtemp();
Emit {E.place ‘:=’ E1.place ‘AND’ E2.place}
}

E → NOT E1 {E.place = newtemp();


Emit {E.place ‘:=’ ‘NOT’ E1.place}
}

E → (E1) {E.place = E1.place}

E → id1 relop id2 {E.place = newtemp();

Emit {‘if’ id1.place relop.op id2.place ‘goto’
nextstat + 3};
EMIT {E.place ‘:=’ ‘0’}
EMIT {‘goto’ nextstat + 2}
EMIT {E.place ‘:=’ ‘1’}
}

E → TRUE {E.place := newtemp();


Emit {E.place ‘:=’ ‘1’}
}

E → FALSE {E.place := newtemp();


Emit {E.place ‘:=’ ‘0’}
}

The EMIT function is used to generate the three-address code and the newtemp( ) function is used to generate the temporary variables.

The rule for E → id1 relop id2 uses nextstat, which gives the index of the next three-address statement in the output sequence.

Here is the example which generates the three address code using the above
translation scheme:

p > q AND r < s OR u > v

100: if p > q goto 103

101: t1 := 0

102: goto 104

103: t1 := 1

104: if r < s goto 107

105: t2 := 0

106: goto 108

107: t2 := 1

108: if u > v goto 111

109: t3 := 0

110: goto 112

111: t3 := 1

112: t4 := t1 AND t2

113: t5 := t4 OR t3

Statements that alter the flow of control

The goto statement alters the flow of control. If we implement goto statements
then we need to define a LABEL for a statement. A production can be added for
this purpose:

1. S → LABEL : S

2. LABEL → id

In this production system, semantic action is attached to record the LABEL and its
value in the symbol table.
3.18 COMPILER DESIGN

Following grammar used to incorporate structure flow-of-control constructs:

1. S → if E then S
2. S → if E then S else S
3. S → while E do S
4. S → begin L end
5. S → A
6. L → L ; S
7. L → S

Here, S is a statement, L is a statement-list, A is an assignment statement and


E is a Boolean-valued expression.

Translation scheme for statement that alters flow of control

★ We introduce the marker non-terminal M as in case of grammar for Boolean


expression.

★ This M is put before statement in both if then else. In case of while-do, we


need to put M before E as we need to come back to it after executing S.

★ In case of if-then-else, if we evaluate E to be true, first S will be executed.


★ After this we should ensure that, instead of the second S, the code after the if-then-else will be executed. Then we place another non-terminal marker N after the first S.

The grammar is as follows:


1. S → if E then M S

2. S → if E then M1 S1 N else M2 S2

3. S → while M E do M S

4. S → begin L end

5. S → A

6. L → L ; M S

7. L → S

8. M → ∈

9. N → ∈
INTERMEDIATE CODE GENERATION 3.19

The translation scheme for this grammar is as follows:

Production rule Semantic actions

S → if E then M S1 BACKPATCH {E.TRUE, M.QUAD}


S.NEXT = MERGE {E.FALSE, S1.NEXT}

S → if E then M1 S1 N          BACKPATCH {E.TRUE, M1.QUAD}

else M2 S2                     BACKPATCH {E.FALSE, M2.QUAD}
                               S.NEXT = MERGE {S1.NEXT, N.NEXT, S2.NEXT}

S → while M1 E do M2 S1 BACKPATCH {S1.NEXT, M1.QUAD}


BACKPATCH {E.TRUE, M2.QUAD}
S.NEXT = E.FALSE
GEN {goto M1.QUAD}

S → begin L end S.NEXT = L.NEXT

S → A S.NEXT = makelist ()

L → L1 ; M S BACKPATCH {L1.NEXT, M.QUAD}


L.NEXT = S.NEXT

L → S L.NEXT = S.NEXT

M → ∈ M.QUAD = NEXTQUAD

N → ∈ N.NEXT = MAKELIST {NEXTQUAD}


GEN {goto_}

Postfix Translation
In a production A → α, the translation rule of A.CODE consists of the concatenation of the CODE translations of the non-terminals in α in the same order as the non-terminals appear in α. Productions can be factored to achieve postfix form.

Postfix translation of while statement


The production
1. S → while M1 E do M2 S1

Can be factored as:


1. S → C S1
2. C → W E do
3. W → while
3.20 COMPILER DESIGN

A suitable transition scheme would be

Production Rule Semantic Action

W → while         W.QUAD = NEXTQUAD

C → W E do        C.QUAD = W.QUAD
                  BACKPATCH {E.TRUE, NEXTQUAD}
                  C.FALSE = E.FALSE

S → C S1          BACKPATCH {S1.NEXT, C.QUAD}
                  S.NEXT = C.FALSE
                  GEN {goto C.QUAD}

Postfix translation of for statement


The production

1. S → for L = E1 step E2 to E3 do S1
Can be factored as

1. F → for L
2. T → F = E1 by E2 to E3 do
3. S → T S1
Array references in arithmetic expressions
Elements of arrays can be accessed quickly if the elements are stored in a block
of consecutive location. Array can be one dimensional or two dimensional.
For one dimensional array:
1. A: array [low..high] of the ith elements is at:
2. base + (i-low)*width → i*width + (base - low*width)
Multi-dimensional arrays:
Row major or column major forms:
★ Row major: a[1,1], a[1,2], a[1,3], a[2,1], a[2,2], a[2,3]
★ Column major: a[1,1], a[2,1], a[1,2], a[2,2], a[1,3], a[2,3]
★ In row major form, the address of a[i1, i2] is
★ base + ((i1 − low1) ∗ (high2 − low2 + 1) + i2 − low2) ∗ width
The production:
1. S → L : = E
2. E → E+E
3. E → (E)
INTERMEDIATE CODE GENERATION 3.21

4. E → L
5. L → Elist]
6. L → id
7. Elist → Elist, E
8. Elist → id

Production Rule Semantic Action


S → L : = E {if L.offset = null then emit(L.place ‘:=’ E.place)
else EMIT {L.place‘[’L.offset ‘]’ ‘:=’ E.place};
}
E → E+E {E.place := newtemp;
EMIT {E.place ‘:=’ E1.place ‘+’ E2.place};
}
E → (E) {E.place := E1.place;}
E → L {if L.offset = null then E.place = L.place
else {E.place = newtemp;
EMIT {E.place ‘:=’ L.place ‘[’ L.offset ‘]’};
}
}
L → Elist ] {L.place = newtemp; L.offset = newtemp;
EMIT {L.place ‘:=’ c{Elist.array}};
EMIT {L.offset ‘:=’ Elist.place ‘*’ width{Elist.array};
}
L → id {L.place = lookup{id.name};
L.offset = null;
}
Elist → Elist, E {t := newtemp;
m := Elist1.ndim + 1;
EMIT {t ‘:=’ Elist1.place ‘*’ limit{Elist1.array, m}};
EMIT {t, ‘:=’ t‘+’ E.place};
Elist.array = Elist1.array;
Elist.place := t;
Elist.ndim := m;
}
Elist → id[E {Elist.array := lookup{id.name};
Elist.place := E.place
Elist.ndim :=1;
}
3.22 COMPILER DESIGN

Where:
ndim denotes the number of dimensions.

limit(array, i) function returns the upper limit along with the dimension of array

width(array) returns the number of byte for one element of array.

Procedures call
Procedure is an important and frequently used programming construct for a
compiler. It is used to generate good code for procedure calls and returns.

Calling sequence:
The translation for a call includes a sequence of actions taken on entry and
exit from each procedure. Following actions take place in a calling sequence:
★ When a procedure call occurs then space is allocated for activation record.
★ Evaluate the argument of the called procedure.
★ Establish the environment pointers to enable the called procedure to access
data in enclosing blocks.
★ Save the state of the calling procedure so that it can resume execution after
the call.
★ Also save the return address. It is the address of the location to which the
called routine must transfer after it is finished.
★ Finally generate a jump to the beginning of the code for the called procedure.
Let us consider a grammar for a simple procedure call statement
1. S → call id (Elist)
2. Elist → Elist, E
3. Elist → E

A suitable transition scheme for procedure call would be:

Production Rule Semantic Action


S → call id(Elist) for each item p on QUEUE do
GEN (param p)
GEN (call id.PLACE)
Elist → Elist, E append E.PLACE to the end of QUEUE
Elist → E initialize QUEUE to contain only
E.PLACE

Queue is used to store the list of parameters in the procedure call.


INTERMEDIATE CODE GENERATION 3.23

3.4. DECLARATIONS:

When we encounter declarations, we need to lay out storage for the declared
variables. For every local name in a procedure, we create a ST(Symbol Table) entry
containing:

1. The type of the name

2. How much storage the name requires

The production:

1. D → integer, id

2. D → real, id

3. D → D1, id

A suitable transition scheme for declarations would be:

Production rule Semantic action

D → integer, id ENTER {id.PLACE, integer}


D.ATTR = integer

D → real, id ENTER {id.PLACE, real}


D.ATTR = real

D → D1, id ENTER {id.PLACE, D1.ATTR}


D.ATTR = D1.ATTR

ENTER is used to make the entry into symbol table and ATTR is used to trace the
data type.

Case Statements:

Switch and case statement is available in a variety of languages. The syntax of


case statement is as follows:

switch E

begin

case V1:S1

case V2:S2
3.24 COMPILER DESIGN

case Vn-1: Sn-1

default: Sn

end

The translation scheme for this shown below:

Code to evaluate E into T


goto TEST

L1: code for S1

goto NEXT

L2: code for S2

goto NEXT

Ln-1: code for Sn-1

goto NEXT

Ln: code for Sn

goto NEXT

TEST: if T = V1 goto L1

if T = V2 goto L2

if T = Vn-1 goto Ln-1

goto Ln

NEXT:

★ When switch keyword is seen then a new temporary T and two new labels
test and next are generated.

★ When the case keyword occurs then for each case keyword, a new label Li
is created and entered into the symbol table. The value of Vi of each case
constant and a pointer to this symbol-table entry are placed on a stack.

3.5. TRANSLATION OF EXPRESSIONS:

1. Operations Within Expressions

2. Incremental Translation

3. Addressing Array Elements

4. Translation of Array References

The rest of this chapter explores issues that arise during the translation of expressions and statements. We begin in this section with the translation of expressions into three-address code. An expression with more than one operator, like a + b ∗ c, will translate into instructions with at most one operator per instruction. An array reference A[i][j] will expand into a sequence of three-address instructions that calculate an address for the reference. We shall consider type checking of expressions and the use of boolean expressions to direct the flow of control through a program.

1. Operations within Expressions

The syntax-directed definition builds up the three-address code for an


assignment statement S using attribute code for S and attributes addr and code for
an expression E. Attributes S.code and E.code denote the three-address code for S
and E, respectively. Attribute E.addr denotes the address that will hold the value of
E. Recall that an address can be a name, a constant, or a compiler-generated
temporary.
3.26 COMPILER DESIGN

PRODUCTION           SEMANTIC RULES

S → id = E ;         S.code = E.code ||
                     gen (top.get (id.lexeme) ‘=’ E.addr)

E → E1 + E2          E.addr = new Temp ()
                     E.code = E1.code || E2.code ||
                     gen (E.addr ‘=’ E1.addr ‘+’ E2.addr)

  | − E1             E.addr = new Temp ()
                     E.code = E1.code ||
                     gen (E.addr ‘=’ ‘minus’ E1.addr)

  | ( E1 )           E.addr = E1.addr
                     E.code = E1.code

  | id               E.addr = top.get (id.lexeme)
                     E.code = ‘’

Fig. 3.17: Three-address code for expressions

Consider the last production, E → id, in the syntax-directed definition. When an expression is a single identifier, say x, then x itself holds the value of the expression. The semantic rules for this production define E.addr to point to the symbol-table entry for this instance of id. Let top denote the current symbol table. The function top.get retrieves the entry when it is applied to the string representation id.lexeme of this instance of id. E.code is set to the empty string.

When E → (E1), the translation of E is the same as that of the subexpression E1. Hence, E.addr equals E1.addr, and E.code equals E1.code.

The operators + and unary − are representative of the operators in a typical language. The semantic rules for E → E1 + E2 generate code to compute the value of E from the values of E1 and E2. Values are computed into newly generated temporary names. If E1 is computed into E1.addr and E2 into E2.addr, then E1 + E2 translates into t = E1.addr + E2.addr, where t is a new temporary name. E.addr is set to t. A sequence of distinct temporary names t1, t2, … is created by successively executing new Temp().

For convenience, we use the notation gen (x ‘=’ y ‘+’ z) to represent the three-address instruction x = y + z. Expressions appearing in place of variables like x, y, and z are evaluated when passed to gen, and quoted strings like ‘=’ are taken literally. Other three-address instructions will be built up similarly. In syntax-directed definitions, gen builds an instruction and returns it. In translation schemes, gen builds an instruction and incrementally emits it by putting it into the stream of generated instructions; in both cases gen is applied to a combination of expressions and strings.

When we translate the production E → E1 + E2, the semantic rules build up E.code by concatenating E1.code, E2.code, and an instruction that adds the values of E1 and E2. The instruction puts the result of the addition into a new temporary name for E, denoted by E.addr.

The translation of E → − E1 is similar. The rules create a new temporary for E


and generate an instruction to perform the unary minus operation.

Finally, the production S → id = E ; generates instructions that assign the value of expression E to the identifier id. The semantic rule for this production uses function top.get to determine the address of the identifier represented by id, as in the rule for E → id. S.code consists of the instructions to compute the value of E into an address given by E.addr, followed by an assignment to the address top.get (id.lexeme) for this instance of id.

Example: The syntax-directed definition in Fig. 3.17 translates the assignment statement a = b + − c; into the three-address code sequence

t1 = minus c

t2 = b + t1

a = t2

2. Incremental Translation

Code attributes can be long strings, so they are usually generated incrementally,
instead of building up E.code, we can arrange to generate only the new three-address
instructions, as in the translation scheme. In the incremental approach, gen not only
constructs a three-address instruction, it appends the instruction to the sequence of
instructions generated so far. The sequence may either be retained in memory for
further processing, or it may be output incrementally.

The translation scheme generates the same code as the syntax-directed


definition. With the incremental approach, the code attribute is not used, since there is a single sequence of instructions that is created by successive calls to gen. For example, the semantic rule for E → E1 + E2 simply calls gen to generate an add instruction; the instructions to compute E1 into E1.addr and E2 into E2.addr have already been generated.
3.28 COMPILER DESIGN

The approach can also be used to build a syntax tree. The new semantic action for E → E1 + E2 creates a node by using a constructor instead of generating an instruction:

E → E1 + E2 { E.addr = new Node (‘+’, E1.addr, E2.addr); }

Here, attribute addr represents the address of a node rather than a variable or
constant.

S → id = E ;    { gen (top.get (id.lexeme) ‘=’ E.addr); }

E → E1 + E2     { E.addr = new Temp ();
                  gen (E.addr ‘=’ E1.addr ‘+’ E2.addr); }

  | − E1        { E.addr = new Temp ();
                  gen (E.addr ‘=’ ‘minus’ E1.addr); }

  | ( E1 )      { E.addr = E1.addr; }

  | id          { E.addr = top.get (id.lexeme); }

Fig. 3.18: Generating three-address code for expressions incrementally


3. Addressing Array Elements
Array elements can be accessed quickly if they are stored in a block of
consecutive locations. In C and Java, array elements are numbered 0, 1, ..., n = 1,
for an array with n elements. If the width of each array element is w, then the
iih element of array A begins in location

base + i × w .... (1)

where base is the relative address of the storage allocated for the array. That is,
base is the relative address of A [0].

The formula (1) generalizes to two or more dimensions. In two dimensions, we write A[i1][i2] in C and Java for element i2 in row i1. Let w1 be the width of a row and let w2 be the width of an element in a row. The relative address of A[i1][i2] can then be calculated by the formula

base + i1 × w1 + i2 × w2 .... (2)

In k dimensions, the formula is

base + i1 × w1 + i2 × w2 + … + ik × wk .... (3)

where wj , for 1 ≤ j ≤ k, is the generalization of w1 and w2 in (2).
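For instance, with the 2 × 3 integer array used in the example later in this section (element width w2 = 4, so each row has width w1 = 3 × 4 = 12), formula (2) places A[1][2] at base + 1 × 12 + 2 × 4 = base + 20.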

Alternatively, the relative address of an array reference can be calculated in terms of the numbers of elements nj along dimension j of the array and the width w = wk of a single element of the array. In two dimensions (i.e., k = 2 and w = w2), the location for A[i1][i2] is given by

base + (i1 × n2 + i2) × w .... (4)

In k dimensions, the following formula calculates the same address as (3):

base + ( … ((i1 × n2 + i2) × n3 + i3) … × nk + ik) × w .... (5)

More generally, array elements need not be numbered starting at 0. In a


one-dimensional array, the array elements are numbered low, low + 1, … , high and
base is the relative address of A[low]. Formula (1) for the address of A [i] is replaced
by:
base + (i − low) × w .... (6)

The expressions (1) and (6) can both be rewritten as i × w + c, where the subexpression c = base − low × w can be precalculated at compile time. Note that c = base when low is 0. We assume that c is saved in the symbol table entry for A, so the relative address of A[i] is obtained by simply adding i × w to c.
Compile-time precalculation can also be applied to address calculations for
elements of multidimensional arrays; However, there is one situation where we cannot
use compile-time precalculation: when the array’s size is dynamic. If we do not know
the values of low and high (or their generalizations in many dimensions) at compile
time, then we cannot compute constants such as c. Then, formula like (6) must be
evaluated as they are written, when the program executes.
The above address calculations are based on row-major layout for arrays, which
is used in C and Java. A two-dimensional array is normally stored in one of two
forms, either row-major (row-by-row) or column-major (column-by-column). The layout
of a 2 × 3 array A in (a) row-major form and (b) column-major form. Column-major
form is used in the Fortran famil of languages.

A[1, 1] A[1, 1]
First row A[1, 2] A[2, 1] First column
A[1, 3] A[1, 2]
Second column
A[2, 1] A[2, 2]
Second row A[2, 2] A[1, 3]
Third column
A[2, 3] A[2, 3]
(a) Row Major (b) Column Major

Fig. 3.19: Layouts for a two-dimensional arraw


3.30 COMPILER DESIGN

We can generalize row- or column-major form to many dimensions. The generalization of row-major form is to store the elements in such a way that, as we scan down a block of storage, the rightmost subscripts appear to vary fastest, like the numbers on an odometer. Column-major form generalizes to the opposite arrangement, with the leftmost subscripts varying fastest.

4. Translation of Array References

The chief problem in generating code for array references is to relate the address
calculation formulas in Section to a grammar for array references. Let nonterminal
L generate an array name followed by a sequence of index expressions.

L → L [ E ]  id [ E ]

As in C and Java, assume that the lowest-numbered array element is 0.

Let us calculate addresses based on widths, using the formula (3), rather than
on numbers of elements, as in (5). The translation scheme generates three-address
code for expressions with array references. It consists of the productions and semantic
actions, together with productions involving nonterminal L.

S → id = E ;    { gen (top.get (id.lexeme) ‘=’ E.addr); }

  | L = E ;     { gen (L.array.base ‘[’ L.addr ‘]’ ‘=’ E.addr); }

E → E1 + E2     { E.addr = new Temp ();
                  gen (E.addr ‘=’ E1.addr ‘+’ E2.addr); }

  | id          { E.addr = top.get (id.lexeme); }

  | L           { E.addr = new Temp ();
                  gen (E.addr ‘=’ L.array.base ‘[’ L.addr ‘]’); }

L → id [ E ]    { L.array = top.get (id.lexeme);
                  L.type = L.array.type.elem;
                  L.addr = new Temp ();
                  gen (L.addr ‘=’ E.addr ‘*’ L.type.width); }

  | L1 [ E ]    { L.array = L1.array;
                  L.type = L1.type.elem;
                  t = new Temp ();
                  L.addr = new Temp ();
                  gen (t ‘=’ E.addr ‘*’ L.type.width);
                  gen (L.addr ‘=’ L1.addr ‘+’ t); }

Fig. 3.20: Semantic actions for array references


INTERMEDIATE CODE GENERATION 3.31

Nonterminal L has three synthesized attributes:

1. L.addr denotes a temporary that is used while computing the offset for the array reference by summing the terms ij × wj in (3).

2. L.array is a pointer to the symbol-table entry for the array name. The base address of the array, say, L.array.base, is used to determine the actual l-value of an array reference after all the index expressions are analyzed.

3. L.type is the type of the subarray generated by L. For any type t, we assume that its width is given by t.width. We use types as attributes, rather than widths, since types are needed anyway for type checking. For any array type t, suppose that t.elem gives the element type.

The production S → id = E; represents an assignment to a nonarray variable,


which is handled as usual. The semantic action for S → L = E; generates an indexed
copy instruction to assign the value denoted by expression E to the location denoted
by the array reference L. Recall that attribute L.array gives the symbol-table entry
for the array. The array’s base address − the address of its 0th element − is given
by L.array.base. Attribute L.addr denotes the temporary that holds the offset for the
array reference generated by L. The location for the array reference is therefore
L.array.base [L.addr]. The generated instruction copies the r-value from address
E.addr into the location for L.

Productions E → E1 + E2 and E → id are the same as before. The semantic action for the new production E → L generates code to copy the value from the location denoted by L into a new temporary. This location is L.array.base [L.addr], as discussed above for the production S → L = E;. Again, attribute L.array gives the array name, and L.array.base gives its base address. Attribute L.addr denotes the temporary that holds the offset. The code for the array reference places the r-value at the location designated by the base and offset.

Example: Let a denote a 2 × 3 array of integers, and let c, i, and j all denote integers. Then, the type of a is array (2, array (3, integer)). Its width w is 24, assuming that the width of an integer is 4. The type of a[i] is array (3, integer), of width w1 = 12. The type of a[i][j] is integer.

An annotated parse tree for the expression c + a [i][j].

The expression is translated into the sequence of three-address instructions in


Figure. As usual, we have used the name of each identifier to refer to its symbol
table entry.
3.32 COMPILER DESIGN

3.6. TYPE CHECKING:


A compiler must check that the source program follows both the syntactic and semantic conventions of the source language. This checking, called static checking, detects and reports programming errors.
Examples of static checks:

1. Type checks − A compiler should report an error if an operator is applied to an incompatible operand. Example: if an array variable and a function variable are added together.

2. Flow-of-control checks − Statements that cause flow of control to leave a construct must have some place to which to transfer the flow of control. Example: a break statement that is not enclosed within a while or switch statement.

token stream → parser → syntax tree → type checker → syntax tree → intermediate code generator → intermediate representation

Fig. 3.21: Position of type checker

A type checker verifies that the type of a construct matches that expected by
its context. For example: arithmetic operator mod in Pascal requires integer operands,
so a type checker verifies that the operands of mod have type integer. Type
information gathered by a type checker may be needed when code is generated.

Type Systems

The design of a type checker for a language is based on information about the
syntactic constructs in the language, the notion of types, and the rules for assigning
types to language constructs.

For example: “If both operands of the arithmetic operators of +, − and * are of
type integer, then the result is of type integer”.

3.6.1. Type Expressions:

The type of a language construct will be denoted by a “type expression.” A type


expression is either a basic type or is formed by applying an operator called a type
constructor to other type expressions. The sets of basic types and constructors depend
on the language to be checked. The following are the definitions of type expressions:
INTERMEDIATE CODE GENERATION 3.33

1. Basic types such as boolean, char, integer, real are type expressions.
A special basic type, type_error, will signal an error during type checking;
void denoting “the absence of a value” allows statements to be checked.

2. Since type expressions may be named, a type name is a type expression.

3. A type constructor applied to type expressions is a type expression.

Constructors include:

Arrays: If T is a type expression then array (I,T) is a type expression denoting the
type of an array with elements of type T and index set I.

Products: If T1 and T2 are type expressions, then their Cartesian product T1 × T2


is a type expression.

Records: The difference between a record and a product is that the names. The
record type constructor will be applied to a tuple formed from field names and field
types.

For example:

type row = record

address: integer;

lexeme: array [1..15] of char

end;

var table: array [1..101] of row;

declares the type name row, representing the type expression record ((address × integer) × (lexeme × array (1..15, char))), and the variable table to be an array of records of this type.

Pointers: If T is a type expression, then pointer(T) is a type expression denoting


the type “pointer to an object of type T”.

For example, var p: ↑ row declares variable p to have type pointer (row).

Functions: A function in programming languages maps a domain type D to a range


type R. The type of such function is denoted by the type expression D → R
3.34 COMPILER DESIGN

4. Type expressions may contain variables whose values are type expressions.

Fig. 3.22: Tree representation for char × char → pointer (integer) — the root is the function-type arrow, whose left child is × with two char leaves and whose right child is pointer with an integer leaf

Type systems

A type system is a collection of rules for assigning type expressions to the


various parts of a program. A type checker implements a type system. It is specified
in a syntax-directed manner. Different type systems may be used by different
compilers or processors of the same language.

Static and Dynamic Checking of Types

Checking done by a compiler is said to be static, while checking done when the
target program runs is termed dynamic. Any check can be done dynamically, if the
target code carries the type of an element along with the value of that element.

Sound type system

A sound type system eliminates the need for dynamic checking for type errors, because it allows us to determine statically that these errors cannot occur when the target program runs. That is, if a sound type system assigns a type other than type_error to a program part, then type errors cannot occur when the target code for the program part is run.

Strongly typed language

A language is strongly typed if its compiler can guarantee that the programs
it accepts will execute without type errors.

Error Recovery

Since type checking has the potential for catching errors in program, it is
desirable for type checker to recover from errors, so it can check the rest of the
input. Error handling has to be designed into the type system right from the start;
the type checking rules must be prepared to cope with errors.
INTERMEDIATE CODE GENERATION 3.35

3.7. TYPE CONVERSION:

Type conversion is the process of converting one type to another. We will


consider one example to understand the type conversion.

Consider an expression f + i, where f is a float identifier and i is an integer
identifier, and the two must be added. The representations of integer and float differ
within the computer, and different machine instructions are used for integer and
float operations. The compiler therefore needs to convert one of the operands so that
both operands of the operation have the same type.

Coercions:

There are two types of conversions: Implicit conversion and explicit


conversion.

If the conversion is done automatically by the compiler then it is called implicit


conversion. The implicit conversions are also called as coercions. In this type of
conversion there is no loss of information for example integer can be converted to
float but not vice versa. That means in conversion from integer to float there is no
loss of information but for converting the information from float to integer there is
a loss.

The conversion is said to be explicit if the programmer specifically writes


something for converting one type to another.

For example:

int xyz;
float p;

p = (float) xyz;

The identifier xyz is type-cast, and this is how an explicit conversion from int to
float takes place.

All conversions in Ada are explicit while C supports implicit conversions. (It
converts ASCII characters to integers in arithmetic expression.)

Type checking rules for coercion from integer to float are as given below:

E → num          E.type := integer
E → num.num      E.type := float
E → id           E.type := look_up(id.entry)
E → E1 op E2     E.type := if E1.type = integer and E2.type = integer
                               then integer
                           else if E1.type = integer and E2.type = float
                               then float
                           else if E1.type = float and E2.type = integer
                               then float
                           else if E1.type = float and E2.type = float
                               then float
                           else type_error

The necessary conversion from int to float takes place implicitly. The function
look_up returns the type saved in the symbol table for the corresponding id entry.
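A minimal C sketch of the rule for E → E1 op E2 (the names and representation here are assumptions chosen for illustration, not the book's code):

#include <stdio.h>

enum type { T_INT, T_FLOAT, T_ERROR };

enum type result_type(enum type t1, enum type t2)
{
    if (t1 == T_INT && t2 == T_INT)
        return T_INT;
    if ((t1 == T_INT || t1 == T_FLOAT) && (t2 == T_INT || t2 == T_FLOAT))
        return T_FLOAT;                  /* the integer operand is coerced to float */
    return T_ERROR;                      /* neither rule applies: type error        */
}

int main(void)
{
    printf("%d\n", result_type(T_INT, T_FLOAT));   /* prints 1, i.e. T_FLOAT */
    return 0;
}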
Consider an array A of floats that is initialized to 1 as follows:

Implicit type conversion

for (i = 0; i < n; i++)
    A[i] = 1;

This code takes 4.8 microseconds to execute.

Explicit type conversion

for (i = 0; i < n; i++)
    A[i] = 1.0;

This code takes 5.4 nanoseconds to execute. In the explicit version the conversion of
the constant to a float is done at compile time rather than on every iteration at run
time, which gives a great improvement in the run time of the object program.
Thus semantic analysis is the phase in which declarative statements are analysed.
The analysis of a declarative statement involves two activities: (i) type, name and
scope analysis, and (ii) entry of type, length and access-control information into the
symbol table. We have seen how the type for a corresponding statement is decided.
*********
CHAPTER – IV

RUN-TIME ENVIRONMENT
AND CODE GENERATION

4.1. INTRODUCTION:
Run-time environments (or systems):

Manage the activation of procedures as the program goes into execution.

★ Primary issues

• Is recursion possible?
• What parameter passing mechanisms are used?
  (call by reference, call by value)
• How are references to non-local names resolved?
• How are dynamic data structures created?
Formal parameters − declared in procedure definition

Actual parameters − values / variable passed while calling a function.



Activation tree:
★ Each node represents an activation of a procedure.
★ root − activation of main program.
★ node for ‘a’ is a parent to node for ‘b’ iff control flows from activation ‘a’ to
activation ‘b’.

★ node for ‘a’ is to left of the node for ‘b’ iff life time of ‘a’ occurs before life
time of ‘b’.
Control stack: Keep track of currently active activations

push − procedure call


pop − procedure returns
Declaration: syntactic construct that associates some information with a name.

Scoping rules: determine where in a program a declaration applies.

Data objects: Corresponds to storage location that can hold values.

Even if a name is declared once, the same name can denote different data
objects at run time.
Environment: Function mapping from names (x) to storage locations (s). The
association is a binding (i.e. x is bound to S)

State: Function mapping storage locations to the values held in those locations.

(map l-values to r-values)

name  →  storage  →  value
(a name has an l-value, i.e. a storage location; the location holds an r-value)

Fig. 4.1

★ Assignments change the state but not the environment.

eg: pi = 3.14

Value of pi is changed but not its location.



Static view                      Dynamic view

definition of a procedure        activation of the procedure
declaration of a name            binding of the name
scope of a declaration           lifetime of a binding

4.2. SOURCE LANGUAGE ISSUES:

Procedures:

A procedure definition is a declaration that associates an identifier with a


statement. The identifier is the procedure name, and the statement is the procedure
body. For example, the following is the definition of procedure named readarray:

procedure readarray;
   var i : integer;
   begin
      for i := 1 to 9 do read(a[i])
   end;

When a procedure name appears within an executable statement, the procedure


is said to be called at that point.

Activation trees:

An activation tree is used to depict the way control enters and leaves activations.
In an activation tree,

1. Each node represents an activation of a procedure.

2. The root represents the activation of the main program.

3. The node for a is the parent of the node for b if and only if control flows
from activation a to b.

4. The node for a is to the left of the node for b if and only if the lifetime of
a occurs before the lifetime of b.
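For example (a small illustrative C program, not taken from the text), the following run produces the activation tree sketched in the comments:

#include <stdio.h>

void q(void) { printf("q\n"); }
void p(void) { q(); }

int main(void)          /*  activation tree:           */
{                       /*          main               */
    p();                /*          /  \               */
    q();                /*         p    q              */
    return 0;           /*         |                   */
}                       /*         q                   */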

Control stack:

A control stack is used to keep track of live procedure activations. The idea is
to push the node for a activation onto the control stack as the activation begins and
to pop the node when the activation ends. The contents of the control stack are
related to paths to the root of the activation tree. When node n is at the top of
control stack, the stack contains the nodes along the path from n to the root.

The Scope of a Declaration:


A declaration is a syntactic construct that associates information with a name.
Declarations may be explicit, such as:
var i : integer ;
or they may be implicit; for example, in some languages any variable name starting
with I is assumed to denote an integer. The portion of the program to which a
declaration applies is called the scope of that declaration.
Binding of names:
Even if each name is declared once in a program, the same name may denote
different data objects at run time. “Data object” corresponds to a storage location
that holds values. The term environment refers to a function that maps a name to
a storage location. The term state refers to a function that maps a storage location
to the value held there. When an environment associates storage location s with a
name x, we say that x is bound to s. This association is referred to as a binding of
x.
         environment              state
name  →  storage  →  value

Fig. 4.2: Two-stage mapping from names to values

4.3. STORAGE ORGANIZATION:

1. Division of memory into different areas.

2. Management of activation records.

3. Layout of local data.

1. Subdivision of run-time memory:

code area − holds the target code

static area − data whose absolute addresses can be determined at compile time

stack area − activation records of procedures

heap area − data created at run time
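One common picture of this subdivision (the exact arrangement is machine- and compiler-dependent) is:

    Code
    Static data
    Heap          (grows toward the free area)
    Free memory
    Stack         (grows toward the free area)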



4.4. STORAGE ALLOCATION STRATEGIES:

The different storage allocation strategies are:

1. Static allocation − lays out storage for all data objects at compile time.

2. Stack allocation − manages the run-time storage as a stack.

3. Heap allocation − allocates and deallocates storage as needed at run time
from a data area known as the heap.

4.4.1. Static Allocation:

In static allocation, names are bound to storage as the program is compiled, so


there is no need for a run-time support package. Since the bindings do not change
at run time, every time a procedure is activated, its names are bound to the same
storage locations. Therefore values of local names are retained across activations of
a procedure.

That is, when control returns to a procedure the values of the locals are the
same as they were when control left the last time. From the type of a name, the
compiler decides the amount of storage for the name and decides where the activation
records go. At compile time, we can fill in the addresses at which the target code
can find the data it operates on.
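A small C illustration (not from the text): a local declared static is statically allocated, so its value is retained across activations, while an ordinary local is not:

#include <stdio.h>

void count_calls(void)
{
    static int calls = 0;   /* statically allocated: retained across activations */
    int temp = 0;           /* stack allocated: re-created for every activation  */
    calls++;
    temp++;
    printf("calls = %d, temp = %d\n", calls, temp);
}

int main(void)
{
    count_calls();          /* prints calls = 1, temp = 1 */
    count_calls();          /* prints calls = 2, temp = 1 */
    return 0;
}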

4.4.2. Stack Allocation of Space:

All compilers for languages that use procedures, functions or methods as units
of user-defined actions manage at least part of their run-time memory as a stack.
Each time a procedure is called, space for its local variables is pushed onto a stack,
and when the procedure terminates, that space is popped off the stack.

Calling sequences:

Procedures called are implemented in what is called as calling sequence, which


consists of code that allocates an activation record on the stack and enters information
into its fields. A return sequence is similar to code to restore the state of machine
so the calling procedure can continue its execution after the call. The code in calling
sequence is often divided between the calling procedure (caller) and the procedure it
calls (callee).

When designing calling sequences and the layout of activation records, the
following principles are helpful:

★ Values communicated between caller and callee are generally placed at the
beginning of the callee’s activation record, so they are as close as possible
to the caller’s activation record.

★ Fixed-length items are generally placed in the middle. Such items include the
control link, the access link and the machine-status fields.

★ Items whose size may not be known early enough are placed at the end of
the activation record. The most common example is a dynamically sized array, where
the value of one of the callee's parameters determines the length of the array.

★ We must locate the top-of-stack pointer judiciously. A common approach is to
have it point to the end of the fixed-length fields in the activation record. Fixed-length
data can then be accessed by fixed offsets, known to the intermediate-code generator,
relative to the top-of-stack pointer.

The calling sequence and its division between caller and callee are as follows:

★ The caller evaluates the actual parameters.

★ The caller stores a return address and the old value of top_sp into the
callee’s activation record. The caller then increments the top_sp to the
respective positions.

★ The callee saves the register values and other status information.

★ The callee initializes its local data and begins execution.


A suitable, corresponding return sequence is:

★ The callee places the return value next to the parameters.

★ Using the information in the machine-status field, the callee restores top_sp
and other registers, and then branches to the return address that the caller
placed in the status field.

★ Although top_sp has been decremented, the caller knows where the return
value is, relative to the current value of top_sp; the caller therefore may
use that value.

Variable length data on stack:

The run-time memory management system must deal frequently with the
allocation of space for objects, the sizes of which are not known at the compile time,
but which are local to a procedure and thus may be allocated on the stack. The
reason to prefer placing objects on the stack is that we avoid the expense of garbage

collecting their space. The same scheme works for objects of any type if they are
local to the procedure called and have a size that depends on the parameters of the
call.
4.4.3. Heap Allocation:
Stack allocation strategy cannot be used if either of the following is possible:

1. The values of local names must be retained when an activation ends.

2. A called activation outlives the caller.

Heap allocation parcels out pieces of contiguous storage, as needed for activation
records or other objects. Pieces may be deallocated in any order, so over time
the heap will consist of alternate areas that are free and in use.

Fig. 4.3: Records for live activations need not be adjacent in the heap. In the
activation tree, s calls r and then q(1, 9); the heap holds activation records for s,
for r (retained after its activation ends) and for q(1, 9), each with its control link.

★ The record for an activation of procedure r is retained when the activation


ends.

★ Therefore, the record for the new activation q (1, 9) cannot follow that for s
physically.

★ If the retained activation record for r is deallocated, there will be free space
in the heap between the activation records for s and q.

4.5. ACCESS TO NON-LOCAL DATA ON THE STACK:

★ Block
★ Lexical scope
• Without nested procedures
• With nested procedures
★ Dynamic Scope
Block
★ A block is a statement containing its own local data declarations.
★ In C, a block has the syntax
{Declarations statements}
★ A characteristics of blocks is their nesting structure.
★ Delimiters mark the beginning and end of a block.
★ Delimiters ensure that one block is either independent of another, or is
nested inside the other.
Blocks in a C program (B0 is the body of main; B1 is nested in B0; B2 and B3 are
nested in B1):

main()
{                                    /* B0 */
    int a = 0;
    int b = 0;
    {                                /* B1 */
        int b = 1;
        {                            /* B2 */
            int a = 2;
            printf("%d %d\n", a, b);
        }
        {                            /* B3 */
            int b = 3;
            printf("%d %d\n", a, b);
        }
        printf("%d %d\n", a, b);
    }
    printf("%d %d\n", a, b);
}

Declaration     Scope
int a = 0;      B0 − B2
int b = 0;      B0 − B1
int b = 1;      B1 − B3
int a = 2;      B2
int b = 3;      B3

Nesting depth:
★ The nesting depth of a procedure is used to implement lexical scope.
★ Let the name of the main program be at nesting depth 1, and add 1 to the
nesting depth as we go from an enclosing to an enclosed procedure.

★ In the quicksort example, the procedure quicksort (line 11) is at nesting
depth 2.

★ The procedure partition (line 13) is at nesting depth 3 (the occurrences of
a, v and i on lines 15-17 have nesting depths 1, 2 and 3 respectively).

Access links:

★ A direct implementation of lexical scope for nested procedures is obtained


by adding a pointer called an access link to each activation record.

★ If procedure p is nested immediately within q in the source text, then the


access link in an activation record for p points to the access link in the
record for the most recent activation of q.

★ Snapshots of the run-time stack during an execution of the program are shown
in the following figure.

Access links for finding storage for non-locals:

The four snapshots (a)-(d) show the run-time stack as s (with locals a, x) calls
q(1, 9), which calls q(1, 3) (each q holding k, v), which calls p(1, 3) (holding i, j),
which in turn calls e(1, 3); every activation record carries an access link used to
locate storage for non-local names.

Fig. 4.4

Dynamic Scope:

★ A new activation inherits the existing bindings of nonlocal names to storage.

★ Approaches : 1. Deep 2. Shallow

★ Deep access − dispense with access links and use the control link to search
down the stack, looking for the first activation record containing storage for the
nonlocal name.

★ The term deep access comes from the fact that the search may go deep into
the stack.

★ The depth to which the search may go depends on the input to the program
and cannot be determined at compile time.

Shallow access:

★ The idea is to hold the current value of each name in statically allocated
storage.

★ When a new activation of a procedure p occurs, a local name n in p takes


over the storage statically allocated for n.

★ The previous value of n can be saved in the activation record for p and
must be restored when the activation of p ends.

4.6. ACTIVATION RECORD:

Activation record contains 7 fields

★ Local variables

★ Parameters / temporaries

★ Return address

★ Saved registers / saved machine status

★ Static link / access link (pascal)

★ Dynamic link / control link

★ Returned value
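The seven fields above can be pictured roughly as a C struct (an illustrative sketch only; field names, sizes and their exact order vary between compilers):

struct activation_record {
    int   actual_params[4];      /* values communicated between caller and callee */
    int   returned_value;
    void *return_address;
    int   saved_registers[8];    /* saved machine status                          */
    void *access_link;           /* static link, for non-local names              */
    void *control_link;          /* dynamic link, points to the caller's record   */
    int   locals_and_temps[16];  /* local variables and temporaries               */
};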

eg:

#include <stdio.h>

int g_var;

float avg(int a[100]);
float std(int a[100]);

int main()
{
    int a[100];
    for (int i = 0; i < 100; i++)
        scanf("%d", &a[i]);
    float avgval = avg(a);
    float stddev = std(a);
    printf("%f %f", avgval, stddev);
    return 0;
}

float avg(int a[100])
{
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += a[i];
    float av = sum / 100.0;
    return av;
}

float std(int a[100])
{
    float sum = 0;
    float av = avg(a);
    for (int i = 0; i < 100; i++)
        sum += (a[i] - av) * (a[i] - av);
    float st = sum / 100;
    return st;
}

Each call of main, avg and std pushes one activation record, with the fields listed
above, onto the run-time stack.

4.7. PARAMETER PASSING:

→ Pass by value

→ Pass by reference

→ Pass by value − restore

→ Pass by name

Pass by value:

★ Copy of the value of the actual parameter is passed onto the formal
parameter.

★ Caller copies the r-value of the actual into the called method’s activation
record.

★ Changes to a formal have no effect on the actual.

void f (int a)
{
    printf ("a value %d", a);   /* prints 5  */
    a = 10;
    printf ("a value %d", a);   /* prints 10 */
}

★ Copying of large objects is expensive and time consuming.

void g ( )
{
    int b;
    b = 5;
    printf ("b value %d", b);   /* prints 5 */
    f (b);
    printf ("b value %d", b);   /* still 5: the change to the formal did not affect b */
}

Pass by reference:

★ Also known as call-by-address or call-by-location.

★ Caller passes to the called procedure, a pointer to the storage address of


each actual parameter.

★ If actual parameter is a name (or) an expression having l-value, then l-value


itself is passed.

★ If the actual parameter is an expression like ‘a + b’ (or) ‘2’ that have no


l-value, then the expression is evaluated in a new location & the address
of that location is passed

    ...
    f (&b);
    ...

void f (int *a)
{
    ...
    *a = 10;
    ...
}

★ Parameter access is slow due to the use of indirection.


Pass by value-restore:

★ Hybrid between call-by-value & call-by-reference.

★ Also called as copy in copy out (or) value − result.

★ Works as call by reference except when aliases are used.

★ Changing actuals that do not have l-values, a side effect possible with
call-by-reference, is avoided in this method.

Pass by name:
★ Every call statement is replaced by the body of the called method.
★ Each occurrence of a formal parameter in the called method is replaced with
the corresponding argument. It is replaced by the actual text of the
argument, not its value.
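A rough C analogy (illustration only, since C itself has no pass by name): a macro substitutes the argument text into the body, much as pass by name does:

#include <stdio.h>

#define SQUARE(x) ((x) * (x))      /* the "body"; x is replaced by the argument text */

int main(void)
{
    int i = 2;
    int s = SQUARE(i + 1);         /* expands to ((i + 1) * (i + 1)), i.e. 9          */
    printf("%d\n", s);
    return 0;
}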

4.8. ISSUES OF CODE GENERATOR:


Code Generation:
The final phase in compiler model is the code generator. It takes as input an
intermediate representation of the source program and produces as output an
equivalent target program. The code generation techniques presented below can be
used whether or not an optimizing phase occurs before code generation.

source program → front end → intermediate code → code optimizer → intermediate code → code generator → target program
(all phases consult the symbol table)

Fig. 4.5: Position of code generator


Issues in the design of a code generator:

The following issues arise during the code generation phase:

1. Input to code generator

2. Target program

3. Memory management

4. Instruction selection

5. Register allocation

6. Evaluation order

1. Input to code generator:


The input to the code generation consists of the intermediate representation of
the source program produced by front end, together with information in the symbol
table to determine run-time addresses of the data objects denoted by the names in
the intermediate representation.

Intermediate representation can be:

(a) Linear representations such as postfix notation.

(b) Three-address representations such as quadruples.

(c) Virtual machine representations such as stack machine code.

(d) Graphical representations such as syntax trees and DAGs.

Prior to code generation, the source program must have been scanned, parsed and
translated into intermediate representation by the front end, along with the necessary
type checking. Therefore, the input to code generation is assumed to be error-free.

2. Target program:
The output of the code generator is the target program. The output may be:

(a) Absolute machine language
★ It can be placed in a fixed memory location and executed immediately.

(b) Relocatable machine language
★ It allows subprograms to be compiled separately.

(c) Assembly language
★ Code generation is made easier.
3. Memory management:

★ Names in the source program are mapped to addresses of data objects in


run-time memory by the front end and code generator.

★ It makes use of symbol table, that is, a name in a three-address statement


refers to a symbol-table entry for the name.

★ Labels in three-address statements have to be converted to addresses of


instructions. For example.

j: goto i generates a jump instruction as follows:

★ if i < j, a backward jump instruction with target address equal to location


of code for quadruple i is generated.

★ if i > j, the jump is forward. We must store on a list for quadruple i the
location of the first machine instruction generated for quadruple j. When i
is processed, the machine locations for all instructions that forward jumps
to i are filled.

4. Instruction selection:

★ The instructions of target machine should be complete and uniform.

★ Instruction speeds and machine idioms are important factors when efficiency
of target program is considered.

★ The quality of the generated code is determined by its speed and size.

★ For example, the three-address statements (a) can be translated into the target
code (b) as shown below:

(a)  a := b + c
     d := a + e

(b)  MOV b, R0
     ADD c, R0
     MOV R0, a
     MOV a, R0
     ADD e, R0
     MOV R0, d

Here the fourth instruction MOV a, R0 is redundant, since R0 already holds the
value of a; a better instruction selector would avoid generating it.

5. Register allocation:

★ Instructions involving register operands are shorter and faster than those
involving operands in memory. The use of registers is subdivided into two
subproblems:

1. Register allocation − the set of variables that will reside in registers at a


point in the program is selected.

2. Register assignment − the specific register in which each such variable will
reside is picked.

3. Certain machines require even-odd register pairs for some operands and
results. For example, consider a division instruction of the form: D x, y

where x, the dividend, occupies the even register of an even/odd register pair and
y is the divisor. After the division the even register holds the remainder and the
odd register holds the quotient.

6. Evaluation order

The order in which the computations are performed can affect the efficiency of
the target code. Some computation orders require fewer registers to hold intermediate
results than others.

4.9. TARGET MACHINE DESCRIPTION:

★ The target computer is a byte-addressable machine with four bytes to a word

and n general purpose registers, R0 , R1 , ..., Rn − 1.

★ It has two-address instructions of the form

op source, destination

(source and destination − data fields)

★ The source and destination fields are not long enough to hold memory
addresses, so certain bit patterns in these fields specify that words following
an instruction contain operands and/or addresses.

★ The source and destination of an instruction are specified by combining


registers and memory locations with address modes.

★ contents(a) denotes the contents of the register or memory address
represented by a.

★ The address modes together with their assembly-language forms and


associated costs are as follows:

MODE                 FORM    ADDRESS                     ADDED COST

absolute             M       M                           1
register             R       R                           0
indexed              c(R)    c + contents(R)             1
indirect register    *R      contents(R)                 0
indirect indexed     *c(R)   contents(c + contents(R))   1

★ A memory location M or a register R represents itself when used as a source


or destination

MOV R0, M

stores the contents of register R0 into memory location M.

★ An address offset c from the value in register R is written as c (R). Thus,


MOV 4(R0), M

stores the value

contents (4 + contents (R0))

into memory location M.

★ Indirect versions of the last two modes are indicated by prefix *. Thus,
MOV * 4(R0), M

stores the value

contents(contents(4 + contents(R0)))

into memory location M.

★ A final address mode allows the source to be a constant:

MODE      FORM    ADDRESS    ADDED COST

literal   #c      c          1

★ Thus, the instruction


MOV #1, R0

loads the constants 1 into register R0.

4.9.1. Instruction Costs:

★ The cost of an instruction is one plus the costs associated with the source
and destination address modes.

★ This cost corresponds to the length (in words) of the instruction.


★ Address modes involving registers have cost zero.

★ While those with a memory location or literal have cost one.

★ For most instructions, the time taken to fetch an instruction from memory
exceeds the time spent executing the instruction.

★ Therefore, by minimizing the instruction length, it minimizes the time taken


to perform the instruction as well.

★ Some examples:

• MOV R0 , R1 − instruction cost = 1, since it occupies only one word


of memory.

• MOV R5 , M − instruction cost = 2, since the address of memory location


M is in the word following the instruction.

• ADD #1, R3 − instruction cost = 2, since the constant 1 must appear in


the next word following the instruction.

• SUB 4(R0), *12(R1) − instruction cost = 3, since the constants 4 and 12

must be stored in the two words following the instruction.
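As a rough worked example of these costs, two alternative code sequences for a := b + c (with b and c in memory) both occupy six words:

MOV b, R0        cost = 2
ADD c, R0        cost = 2        total = 6
MOV R0, a        cost = 2

MOV b, a         cost = 3
ADD c, a         cost = 3        total = 6

If b and c were already held in registers, an instruction such as ADD Rj, Ri (cost 1) could be used instead, which is why good register usage improves both size and speed.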

4.10. DESIGN OF A SIMPLE CODE GENERATOR:


Next-use Information:

★ If the name in a register is no longer needed, then we remove the name


from the register and the register can be used to store some other names.

Input: Basic block B of three-address statements.

Output: At each statement i: x = y op z, we attach to i the liveness and next-uses


of x, y and z.

Method: We start at the last statement of B and scan backwards.

1. Attach to statement i the information currently found in the symbol table

regarding the next use and liveness of x, y and z.

2. In the symbol table, set x to “not live” and “no next use”.

3. In the symbol table, set y and z to “live”, and next-uses of y and z to i.



Symbol Table

Names    Liveness    Next-use

x        not live    no next-use
y        live        i
z        live        i

A Simple Code Generator:

★ A code generator generates target code for a sequence of three address


statements and effectively uses registers to store operands of the statements.

★ For example, consider the three-address statement a := b + c, with the result
to be left in register Ri. It can have the following code sequences:

ADD Rj, Ri                 Cost = 1   (b in Ri, c in Rj)

(or)

ADD c, Ri                  Cost = 2   (b in Ri, c in memory)

(or)

MOV c, Rj                  Cost = 3   (b in Ri, c in memory, Rj free)
ADD Rj, Ri

Register and Address Descriptors:

★ A register descriptor is used to keep track of what is currently in each


registers. The register descriptors show that initially all the registers are
empty.

★ An address descriptor stores the location where the current value of the
name can be found at run time.

A code-generation algorithm:

The algorithm takes as input a sequence of three-address statements constituting


a basic block. For each three-address statement of the form x : = y op z, perform the
following actions:

1. Invoke a function getreg to determine the location L where the result of the
computation y op z should be stored.
2. Consult the address descriptor for y to determine y’, the current location of
y. Prefer the register for y’ if the value of y is currently both in memory and
a register. If the value of y is not already in L, generate the instruction MOV
y’, L to place a copy of y in L.
3. Generate the instruction OP z’, L where z’ is a current location of z. Prefer
a register to a memory location if z is in both. Update the address descriptor
of x to indicate that x is in location L. If x is in L, update its descriptor and
remove x from all other descriptors.
4. If the current values of y or z have no next uses, are not live on exit from
the block, and are in registers, alter the register descriptor to indicate that,
after execution of x : = y op z, those registers will no longer contain y or z.

Generating Code for Assignment Statements:

★ The assignment d: = (a − b) + (a − c) + (a − c) might be translated into the


following three-address code sequence:
Code sequence for the example is:
t:=a−b
u:=a−c
v:=t+u
d:=v+u
with d live at the end.
Code Sequence

Statement      Code Generated    Register Descriptor     Address Descriptor

                                 registers empty
t := a − b     MOV a, R0         R0 contains t           t in R0
               SUB b, R0
u := a − c     MOV a, R1         R0 contains t           t in R0
               SUB c, R1         R1 contains u           u in R1
v := t + u     ADD R1, R0        R0 contains v           v in R0
                                 R1 contains u           u in R1
d := v + u     ADD R1, R0        R0 contains d           d in R0
               MOV R0, d                                 d in R0 and memory

Generating Code for Indexed Assignments:


The table shows the code sequences generated for the indexed assignment
a: = b[i] and a[i]:=b

Statements       Code Generated      Cost

a := b[i]        MOV b(Ri), a        2

a[i] := b        MOV b, a(Ri)        3

Generating Code for Pointer Assignments:


The table shows the code sequences generated for the pointer assignments
a : = ∗p and ∗p : = a

Statements Code Generated Cost

a : = ∗p MOV *Rp, a 2

*p : = a MOV a, *Rp 2

Generating Code for Conditional Statements:

Statement              Code

if x < y goto z        CMP x, y
                       CJ< z        /* jump to z if condition code is negative */

x := y + z             MOV y, R0
if x < 0 goto z        ADD z, R0
                       MOV R0, x
                       CJ< z

*********
CHAPTER – V

CODE OPTIMIZATION

5.1. INTRODUCTION TO CODE OPTIMIZATION:

The code optimization in the synthesis phase is a program transformation


technique, which tries to improve the intermediate code by making it consume fewer
resources (i.e. CPU, Memory) so that faster-running machine code will result.
Compiler optimizing process should meet the following objectives:

★ The optimization must be correct, it must not, in any way, change the
meaning of the program.

★ Optimization should increase the speed and performance of the program.

★ The compilation time must be kept reasonable.

★ The optimization process should not delay the overall compiling process.
When to Optimize?

Optimization of the code is often performed at the end of the development stage
since it reduces readability and adds code that is used to increase the performance.

Types of Code Optimization: The optimization process can be broadly classified


into two types:

1. Machine Independent Optimization: This code optimization phase


attempts to improve the intermediate code to get a better target code as
the output. The part of the intermediate code which is transformed here does
not involve any CPU registers or absolute memory locations.

2. Machine Dependent Optimization: Machine-dependent optimization is done


after the target code has been generated and when the code is transformed
according to the target machine architecture. It involves CPU registers and
may have absolute memory references rather than relative references.
Machine-dependent optimizers try to take maximum advantage of the
memory hierarchy.

Code Optimization is done in the following different ways:

1. Compile Time Evaluation:


(i) A = 2∗(22.0 ⁄ 7.0)∗r
Perform 2*(22.0/7.0)*r at compile time.

(ii) x = 12.4

y = x ⁄ 2.3

Evaluate x/2.3 as 12.4/2.3 at compile time.

2. Variable Propagation:

//Before Optimization

c = a ∗ b

x = a

...

d = x ∗ b + 4

//After Optimization

c = a ∗ b

x = a

...

d = a ∗ b + 4

Hence, after variable propagation, a∗b and x∗b are recognised as the same
expression, exposing a common sub-expression.

3. Dead code elimination: Variable propagation often turns an assignment
statement into dead code.

c = a ∗ b

x = a

...

d = a ∗ b + 4

//After eliminating the now-dead assignment x = a:

c = a ∗ b

...

d = a ∗ b + 4

4. Code Motion:

★ Reduce the evaluation frequency of expression.


★ Bring loop invariant statements out of the loop.
a = 200;
while (a > 0)
{
    b = x + y;
    if (a % b == 0)
        printf("%d", a);
}

//This code can be further optimized as

a = 200;
b = x + y;
while (a > 0)
{
    if (a % b == 0)
        printf("%d", a);
}

5. Induction Variable and Strength Reduction:

★ An induction variable is used in loop for the following kind of assignment


i = i + constant.

★ Strength reduction means replacing the high strength operator by the low
strength.

i = 1;
while (i < 10)
{
    y = i ∗ 4;
    i = i + 1;
}

//After strength reduction

t = 4;
while (t < 40)
{
    y = t;
    t = t + 4;
}

5.2. PRINCIPAL SOURCES OF OPTIMISATION:

A transformation of a program is called local if it can be performed by looking


only at the statements in a basic block; otherwise, it is called global. Many
transformations can be performed at both the local and global levels. Local
transformations are usually performed first.

Function-Preserving Transformations:

There are a number of ways in which a compiler can improve a program without
changing the function it computes.

5.2.1. Function preserving transformations examples:

★ Common sub expression elimination


★ Copy propagation,
★ Dead-code elimination
★ Constant folding
The other transformations come up primarily when global optimizations are
performed. Frequently, a program will include several calculations of the offset in an
array. Some of the duplicate calculations cannot be avoided by the programmer
because they lie below the level of detail accessible within the source language.

5.2.2. Common sub-expressions elimination:

★ An occurrence of an expression E is called a common sub-expression if E


was previously computed, and the values of variables in E have not changed
since the previous computation. We can avoid recomputing the expression if
we can use the previously computed value.

★ For example
t1: = 4*i

t2: = a [t1]

t3: = 4*j

t4: = 4*i

t5: = n

t6: = b [t4] + t5

The above code can be optimized using the common sub-expression elimination
as

t1: = 4*i

t2: = a [t1]

t3: = 4*j

t5: = n

t6: = b [t1] + t5

The common sub expression t4: = 4*i is eliminated as its computation is already
in t1 and the value of i is not been changed from definition to use.

5.2.3. Copy Propagation:


Assignments of the form f : = g called copy statements, or copies for short. The
idea behind the copy-propagation transformation is to use g for f, whenever possible
after the copy statement f: = g. Copy propagation means use of one variable instead
of another. This may not appear to be an improvement, but as we shall see it gives
us an opportunity to eliminate x.
★ For example:

x = Pi;
A = x ∗ r ∗ r;

The optimization using copy propagation can be done as follows:

A = Pi ∗ r ∗ r;

Here the variable x is eliminated.
5.2.4. Strength Reduction:
Strength reduction replaces expensive operations by equivalent cheaper ones
on the target machine. Certain machine instructions are considerably cheaper than
others and can often be used as special cases of more expensive operators. For
example, x2 is invariably cheaper to implement as x∗x than as a call to an
exponentiation routine. Fixed-point multiplication or division by a power of two is
cheaper to implement as a shift. Floating-point division by a constant can be
implemented as multiplication by constant, which may be cheaper.
B1:   i := m − 1
      j := n
      t1 := 4 ∗ n
      v := a[t1]

B2:   i := i + 1
      t2 := 4 ∗ i
      t3 := a[t2]
      if t3 < v goto B2

B3:   j := j − 1
      t4 := 4 ∗ j
      t5 := a[t4]
      if t5 > v goto B3

B4:   if i >= j goto B6

B5:   x := t3                 B6:   x := t3
      a[t2] := t5                   t14 := a[t1]
      a[t4] := x                    a[t2] := t14
      goto B2                       a[t1] := x

Fig. 5.1: B5 and B6 after common subexpression elimination



5.2.5. Dead Code Eliminations:


A variable is live at a point in a program if its value can be used subsequently;
otherwise, it is dead at that point. A related idea is dead or useless code, statements
that compute values that never get used. While the programmer is unlikely to
introduce any dead code intentionally, it may appear as the result of previous
transformations.

Example:
i = 0;

if (i == 1)

    a = b + 5;

Here, ‘if’ statement is dead code because this condition will never get satisfied.

Constant folding:
Deducing at compile time that the value of an expression is a constant and
using the constant instead is known as constant folding. One advantage of copy
propagation is that it often turns the copy statement into dead code.

For example,

a=3.14157/2 can be replaced by

a=1.570 thereby eliminating a division operation.

5.2.6. Loop Optimizations:


In loops, especially in the inner loops, programs tend to spend the bulk of their
time. The running time of a program may be improved if the number of instructions
in an inner loop is decreased, even if we increase the amount of code outside that
loop.

Three techniques are important for loop optimization:

★ Code motion, which moves code outside a loop;


★ Induction-variable elimination, which we apply to eliminate or replace
induction variables in inner loops.

★ Reduction in strength, which replaces an expensive operation by a cheaper

one, such as a multiplication by an addition.

B1:   i := m − 1
      j := n
      t1 := 4 ∗ n
      v := a[t1]

B2:   i := i + 1
      t2 := 4 ∗ i
      t3 := a[t2]
      if t3 < v goto B2

B3:   j := j − 1
      t4 := 4 ∗ j
      t5 := a[t4]
      if t5 > v goto B3

B4:   if i >= j goto B6

B5:   t6 := 4 ∗ i              B6:   t11 := 4 ∗ i
      x := a[t6]                     x := a[t11]
      t7 := 4 ∗ i                    t12 := 4 ∗ i
      t8 := 4 ∗ j                    t13 := 4 ∗ n
      t9 := a[t8]                    t14 := a[t13]
      a[t7] := t9                    a[t12] := t14
      t10 := 4 ∗ j                   t15 := 4 ∗ n
      a[t10] := x                    a[t15] := x
      goto B2

Fig. 5.2: Flow graph

Code Motion:

An important modification that decreases the amount of code in a loop is code


motion. This transformation takes an expression that yields the same result
independent of the number of times a loop is executed (a loop-invariant computation)
and places the expression before the loop. Note that the motion “before the loop”
assumes the existence of an entry for the loop. For example, evaluation of limit-2 is
a loop-invariant computation in the following while-statement:

while (i < = limit-2) /* statement does not change limit*/

Code motion will result in the equivalent of

t=limit-2;

while (i<=t) /* statement does not change limit or t */



Induction Variables:

Loops are usually processed inside out. For example consider the loop around
B3. Note that the values of j and t4 remain in lock-step; every time the value of j
decreases by 1, that of t4 decreases by 4 because 4*j is assigned to t4. Such identifiers
are called induction variables.

When there are two or more induction variables in a loop, it may be possible
to get rid of all but one, by the process of induction-variable elimination. For the
inner loop around B3 in Fig. 5.2 we cannot get rid of either j or t4 completely; t4
is used in B3 and j in B4.

However, we can illustrate reduction in strength and illustrate a part of the


process of induction-variable elimination. Eventually j will be eliminated when the
outer loop of B2-B5 is considered.

Example:

As the relationship t4 = 4∗j surely holds after such an assignment to t4,
and t4 is not changed elsewhere in the inner loop around B3, it follows that just
after the statement j := j − 1 the relationship t4 = 4∗j − 4 must hold. We may therefore
replace the assignment t4 := 4∗j by t4 := t4 − 4. The only problem is that t4 does not
have a value when we enter block B3 for the first time. Since we must maintain the
relationship t4 = 4∗j on entry to block B3, we place an initialization of t4 at the
end of the block where j itself is initialized, shown by the dashed addition to block
B1 in Fig. 5.1.

The replacement of a multiplication by a subtraction will speed up the object
code if multiplication takes more time than addition or subtraction, as is the case
on many machines.

5.3. PEEP-HOLE OPTIMIZATION:

A statement-by-statement code-generations strategy often produces target code


that contains redundant instructions and suboptimal constructs. The quality of such
target code can be improved by applying “optimizing” transformations to the target
program.

A simple but effective technique for improving the target code is peep-hole
optimization, a method for trying to improving the performance of the target program
by examining a short sequence of target instructions (called the peep-hole) and
replacing these instructions by a shorter or faster sequence, whenever possible.

The peephole is a small, moving window on the target program. The code in
the peep-hole need not be contiguous, although some implementations do require this.
It is characteristic of peep-hole optimization that each improvement may spawn
opportunities for additional improvements.

Characteristics of peep-hole optimizations:

★ Redundant-instruction elimination

★ Flow-of-control optimizations

★ Algebraic simplifications

★ Use of machine idioms

★ Unreachable code elimination

Redundant Loads and Stores:

If we see the instructions sequence

1. MOV R0, a

2. MOV a, R0

We can delete instructions (2) because whenever (2) is executed. (1) will ensure
that the value of a is already in register R0. If (2) had a label we could not be sure
that (1) was always executed immediately before (2) and so we could not remove (2).

Unreachable Code:

Another opportunity for peephole optimizations is the removal of unreachable


instructions. An unlabeled instruction immediately following an unconditional jump
may be removed. This operation can be repeated to eliminate a sequence of
instructions. For example, for debugging purposes, a large program may have within
it certain segments that are executed only if a variable debug is 1. In C, the source
code might look like:

#define debug 0

....

If ( debug ) {

Print debugging information

}

In the intermediate representation the if-statement may be translated as:

    if debug = 1 goto L1
    goto L2
L1: print debugging information
L2: ....                                          (a)

One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter
what the value of debug, (a) can be replaced by:

    if debug ≠ 1 goto L2
    print debugging information
L2: ....                                          (b)

Since debug is set to 0 at the beginning of the program, constant propagation
replaces (b) by:

    if 0 ≠ 1 goto L2
    print debugging information
L2: ....                                          (c)

As the argument of the first statement of (c) evaluates to a constant true, it can be
replaced by goto L2. Then all the statements that print debugging aids are manifestly
unreachable and can be eliminated one at a time.

Flows-of-Control Optimizations:

The unnecessary jumps can be eliminated in either the intermediate code or the
target code by the following types of peep-hole optimizations. We can replace the
jump sequence.

goto L1

....

L1: goto L2 (d)

by the sequence

goto L2

....

L1: goto L2

If there are now no jumps to L1, then it may be possible to eliminate the statement
L1: goto L2 provided it is preceded by an unconditional jump. Similarly, the sequence

if a < b goto L1

....

L1: goto L2 (e)

can be replaced by

If a < b goto L2

....

L1: goto L2

★ Finally, suppose there is only one jump to L1 and L1 is preceded by an
unconditional goto. Then the sequence

    goto L1
    ....
L1: if a < b goto L2
L3:                                               (f)

may be replaced by

    if a < b goto L2
    goto L3
    ....
L3:

While the number of instructions in (e) and (f) is the same, we sometimes skip
the unconditional jump in (f), but never in (e). Thus (f) is superior to (e) in execution
time.

Algebraic Simplification:

There is no end to the amount of algebraic simplification that can be attempted


through peephole optimization. Only a few algebraic identities occur frequently
enough that it is worth considering implementing them. For example, statements
such as

x : = x + 0 or

x:=x∗1

are often produced by straight forward intermediate code-generation algorithms, and


they can be eliminated easily through peephole optimization.

Reduction in Strength:

Reduction in strength replaces expensive operations by equivalent cheaper ones


on the target machine. Certain machine instructions are considerably cheaper than
others and can often be used as special cases of more expensive operators.

For example, x2 is invariably cheaper to implement as x∗x than as a call to an


exponentiation routine. Fixed-point multiplication or division by a power of two is
cheaper to implement as a shift. Floating-point division by a constant can be
implemented as multiplication by constant, which may be cheaper.

X2 → X∗X

Use of Machine Idioms:

The target machine may have hardware instructions to implement certain


specific operations efficiently. For example, some machines have auto-increment and
auto-decrement addressing modes. These add or subtract one from an operand before
or after using its value. The use of these modes greatly improves the quality of code
when pushing or popping a stack, as in parameter passing. These modes can also
be used in code for statements like i : = i + 1.

i:=i+1 → i++

i:=i−1 → i− −

5.4. DAG REPRESENTATION OF BASIC BLOCKS:

★ Directed acyclic graph (DAGs) are useful data structures for implementing
transformations on basic blocks.

★ A DAG gives a picture of how the value computed by each statement in a


basic block is used in subsequent statements of the block.

★ Constructing a DAG from three-address statements is a good way of

• determining common subexpressions within block,

• determining which names used inside the block but evaluated outside the
block, and

• determining which statements of the block could have their computed


value used outside the block.

★ A DAG for a basic block is a directed acyclic graph with the following labels
on nodes:
1. Leaves are labeled by unique identifiers, either variable names or
constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels.

★ Each node of a flow graph can be represented by a dag, since each node of
the flow graph stands for a basic block.
Example:
(1) t1 : = 4 ∗ i
(2) t2 : = a [ t1 ]
(3) t3 : = 4 ∗ i
(4) t4 : = b [ t3 ]
(5) t5 : = t2 ∗ t4
(6) t6 : = prod + t5
(7) prod : = t6
(8) t7 : = i + 1
(9) i : = t7

Three − address code for Block B2


The corresponding DAG is shown in Fig. 5.3.

Fig. 5.3: DAG for the block above. The leaves are a, b, 4, i0, prod0 and 1; the interior
nodes are ∗ (labelled t1, t3), [ ] (t2), [ ] (t4), ∗ (t5), + (t6, prod) and + (t7, i); a <= node
compares i with 20 for the block's conditional jump back to statement (1).



5.4.1. Construction of DAG:

INPUT: A basic block.

Output: A DAG for the basic block containing the following information:

1. A label for each node. For leaves the label is an identifier, and for interior
nodes, an operator symbol.

2. For each node a list of attached identifiers.

METHOD:

★ Assume function node (identifier), returns the most recently created node
associated with identifier.

★ The DAG construction process is to do the following steps (1) through (3)
for each statement of the block.

★ Initially, assume there are no nodes and node is undefined for all arguments.

★ Suppose the "current" three-address statement is either case (i) x := y op z,

(ii) x := op y, or (iii) x := y.

★ Refer to the above cases (i), (ii) and (iii).
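The following compressed C sketch (illustrative only; the data structures and names are assumptions, and the per-node identifier lists of the full algorithm are not maintained) shows the heart of the construction for statements of the form x := y op z: reuse an existing node with the same operator and children, otherwise create a new interior node.

#include <stdio.h>

#define MAXN 100
struct dagnode { char op; int left, right; } dag[MAXN];
int nnodes = 0;                     /* nodes are numbered 1..nnodes            */
int node_of[128];                   /* node currently denoting each name       */

static int current_node(char id)    /* node(id): create a leaf if none exists  */
{
    if (node_of[(int)id] == 0) {
        nnodes++;
        dag[nnodes].op = id;        /* leaves are labelled by the identifier   */
        dag[nnodes].left = dag[nnodes].right = 0;
        node_of[(int)id] = nnodes;
    }
    return node_of[(int)id];
}

static void stmt(char x, char y, char op, char z)   /* process x := y op z     */
{
    int ny = current_node(y), nz = current_node(z), n;
    for (n = 1; n <= nnodes; n++)                    /* common subexpression?   */
        if (dag[n].op == op && dag[n].left == ny && dag[n].right == nz)
            break;
    if (n > nnodes) {                                /* no: create a new node   */
        nnodes = n;
        dag[n].op = op; dag[n].left = ny; dag[n].right = nz;
    }
    node_of[(int)x] = n;            /* x now labels node n                      */
    printf("%c := %c %c %c  ->  node %d\n", x, y, op, z, n);
}

int main(void)
{
    stmt('a', 'b', '+', 'c');       /* creates an interior + node               */
    stmt('d', 'b', '+', 'c');       /* reuses it: common subexpression          */
    return 0;
}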

BACKPATCHING:

★ The easiest way to implement the syntax directed definitions is to use


two passes. (for boolean expressions and flow-of-control).

• First, construct a syntax tree for the input,

• Then walk the tree in depth-first order, computing the translations


given in the definition.

★ The main problem with generating code in a single pass is that the labels to
which control must go are not known at the time the jump statements are
generated.

★ Each such statement is put on a list of goto statements whose labels
will be filled in when the proper label can be determined. This
subsequent filling in of labels is called backpatching.

★ In this section we see how backpatching can be used to generate code for
boolean expressions and flow-of-control statements in one pass.

★ The translation generated will be of the same form as non-backpatching,


except for the manner in which the labels are generated.

★ For specificity, generate quadruples into quadruple array.

★ Labels will be indices into this array.

★ Three functions are used to manipulate lists of labels:

1. makelist(i) creates a new list containing only i, an index into the array
of quadruples; makelist returns a pointer to the list it has made.

2. merge(p1, p2) concatenates the list pointed to by p1 and p2, and returns
a pointer to the concatenated list.

3. backpatch(p, i) inserts i as the target label for each of the statements


on the list pointed to by p.
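A minimal C sketch of these three functions (the representation of quadruples and label lists here is assumed for illustration, not prescribed by the text):

#include <stdlib.h>

struct node { int quad_index; struct node *next; };   /* list of quadruple indices */
struct quad { int target; } quad[1000];               /* simplified quadruple array */

struct node *makelist(int i)                 /* new list containing only i          */
{
    struct node *p = malloc(sizeof *p);
    p->quad_index = i;
    p->next = NULL;
    return p;
}

struct node *merge(struct node *p1, struct node *p2)  /* concatenate two lists      */
{
    if (p1 == NULL) return p2;
    struct node *p = p1;
    while (p->next != NULL) p = p->next;
    p->next = p2;
    return p1;
}

void backpatch(struct node *p, int i)        /* insert i as the target label         */
{
    for (; p != NULL; p = p->next)
        quad[p->quad_index].target = i;
}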

Boolean Expressions:

★ Construction of a translation scheme suitable for producing quadruple for


boolean expressions during bottom-up parsing.

★ Insert a marker nonterminal M into the grammar to cause a semantic action

to pick up, at appropriate times, the index of the next quadruple to be
generated.

★ The grammar:

(1) E → E1 or M E2

(2)   | E1 and M E2

(3)   | not E1

(4)   | (E1)

(5)   | id1 relop id2

(6)   | true

(7)   | false

(8) M → ε

★ Synthesized attributes truelist and falselist of nonterminal E are used to


generate jumping code for boolean expressions.

★ As code is generated for E, jumps to the true and false exits are left
incomplete, with the label field unfilled.

★ These incomplete jumps are placed on lists pointed to by E-truelist and


E-falselist.

★ The semantic actions reflect the considerations mentioned above.

★ Consider the production:

E → E1 and M E2

★ If E1 is false, then E is also false, so the statements on E1.falselist become
part of E.falselist.

★ If E1 is true, the target for the statements on E1.truelist must be the beginning
of the code generated for E2.

★ This target is obtained using the marker nonterminal M.

★ Attribute M.quad records the number of the first statement of


E2.code.

★ With the production M → ε associate the semantic action.

{ M.quad : = nextquad }

where the variable nextquad holds the index of the next quadruple to follow.

★ This value will be backpatched onto E1.truelist when the remainder of the
production E → E1 and M E2 has been seen.

★ The translation scheme is as follows:

(1) E → E1 or M E2     { backpatch(E1.falselist, M.quad);
                         E.truelist := merge(E1.truelist, E2.truelist);
                         E.falselist := E2.falselist }

(2) E → E1 and M E2    { backpatch(E1.truelist, M.quad);
                         E.truelist := E2.truelist;
                         E.falselist := merge(E1.falselist, E2.falselist) }

(3) E → not E1         { E.truelist := E1.falselist;
                         E.falselist := E1.truelist }

(4) E → (E1)           { E.truelist := E1.truelist;
                         E.falselist := E1.falselist }

(5) E → id1 relop id2  { E.truelist := makelist(nextquad);
                         E.falselist := makelist(nextquad + 1);
                         emit('if' id1.place relop.op id2.place 'goto _');
                         emit('goto _') }

(6) E → true           { E.truelist := makelist(nextquad);
                         emit('goto _') }

(7) E → false          { E.falselist := makelist(nextquad);
                         emit('goto _') }

(8) M → ε              { M.quad := nextquad }

★ The semantic action (5) generates two statements, a conditional goto and an
unconditional one.

★ Neither has its target filled in.

★ The index of the first generated statement is made into a list, and E.truelist
is given a pointer to that list.

★ The second generated statement goto_ is also made into a list and given to
E.falselist.

5.5. OPTIMIZATION OF BASIC BLOCKS:

There are two types of basic block optimizations. They are:

★ Structure-Preserving Transformations.

★ Algebraic Transformations.

Structure-Preserving Transformations:

The primary Structure-Preserving Transformation on basic blocks are:

★ Common sub-expression elimination.

★ Dead code elimination.



★ Renaming of temporary variables.

★ Interchange of two independent adjacent statements.

Common sub-expression elimination:

Common sub expressions need not be computed over and over again. Instead
they can be computed once and kept in store from where it’s referenced.

Example:

(1) a: = b + c

(2) b: = a − d

(3) c: = b + c

(4) d: = a − d

The first and third statements compute b + c, and the second and fourth compute
a − d. Since neither a nor d changes between the second and fourth statements, the
fourth statement can simply reuse the value already held in b. (The third statement
cannot reuse the first computation of b + c, because b is redefined by the second
statement.)

The basic block can be transformed to

a := b + c

b := a − d

c := b + c

d := b

Dead code elimination:

It is possible that a large amount of dead (useless) code may exist in the
program. This might be especially caused when introducing variables and procedures
as part of construction or error-correction of a program − once declared and defined,
one forgets to remove them in case they serve no purpose. Eliminating these will
definitely optimize the code.

Renaming of temporary variables:

A statement t := b + c, where t is a temporary name, can be changed to u := b + c,
where u is another temporary name, with all uses of t changed to u. In this way a
basic block is transformed into an equivalent block called a normal-form block.

Interchange of two independent adjacent statements:

★ Two statements

t1:=b+c

t2:=x+y

can be interchanged or reordered in its computation in the basic block when value
of t1 does not affect the value of t2.

Algebraic Transformations:

Algebraic identities represent another important class of optimizations on basic


blocks. This includes simplifying expressions or replacing expensive operation by
cheaper ones i.e. reduction in strength. Another class of related optimizations is
constant folding. Here we evaluate constant expressions at compile time and replace
the constant expressions by their values. Thus the expression 2*3.14 would be
replaced by 6.28.

The relational operators <=, >=, <, >, = and ≠ sometimes generate unexpected
common sub-expressions. Associative laws may also be applied to expose common
sub-expressions. For example, if the source code has the assignments

a := b + c
e := c + d + b

the following intermediate code may be generated:

a := b + c
t := c + d
e := t + b

Example:

x:=x+0 can be removed

x:=y∗∗2 can be replaced by a cheaper statement x:=y∗y

The compiler writer should examine the language specification carefully to


determine what rearrangements of computations are permitted, since computer
arithmetic does not always obey the algebraic identities of mathematics. Thus, a
compiler may evaluate x∗y−x∗z as x*(y-z) but it may not evaluate a+ (b−c) as
(a+b)−c.

5.6. GLOBAL DATA FLOW ANALYSIS:

★ To optimize the code efficiently, the compiler collects information about the
whole program and distributes this information to each block of the flow graph.
This process is known as data-flow analysis.

★ Certain optimizations can only be achieved by examining the entire program;

they cannot be achieved by examining just a portion of the program.

★ For this kind of optimization, use-definition chaining is one particular problem.

★ Here, for each use of a variable, we try to find out which definitions
of that variable can apply at that statement.

Based on the local information a compiler can perform some optimizations. For
example, consider the following code:

1. x = a + b;

2. x = 6 ∗ 3

★ In this code, the first assignment to x is useless. The value computed for x
there is never used in the program.

★ At compile time the expression 6*3 will be computed, simplifying the second
assignment statement to x = 18;

Some optimization needs more global information. For example, consider the following
code:

1. a = 1;

2. b = 2;

3. c = 3;

4. if (....) x = a + 5;

5. else x = b + 4;

6. c = x + 1;

In this code, the initial assignment to c at line 3 is useless, and the expression x + 1
at line 6 can be simplified to 7. But it is less obvious how a compiler can discover
these facts by looking only at one or two consecutive statements. A more global
analysis is required, so that the compiler knows the following things at each point
in the program:

★ Which variables are guaranteed to have constant values

★ Which variables will be used before being redefined.

Data flow analysis is used to discover this kind of property. The data flow
analysis can be performed on the program’s control flow graph (CFG).

The control flow graph of a program is used to determine those parts of a


program to which a particular value assigned to a variable might propagate.
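As a concrete instance (the standard reaching-definitions formulation, stated here for reference rather than taken from the section above), data flow analysis computes, for every basic block B of the CFG, the sets of definitions that can reach it by solving equations of the form

in[B]  = ∪ out[P]   over all predecessors P of B
out[B] = gen[B] ∪ (in[B] − kill[B])

where gen[B] is the set of definitions generated in B and kill[B] is the set of definitions of the same names that B overwrites.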

5.7. EFFICIENT DATA FLOW ANALYSIS:

5.7.1. Redundant Common Sub expression Elimination:

We have already discussed what a common sub-expression is, and we have seen
how to eliminate such an expression while performing local optimization. In this
section we learn how to perform global optimization using transformations such
as common sub-expression elimination, copy propagation and induction-variable
elimination.

The available expressions allow us to determine whether an expression at a point p
in a flow graph is a common sub-expression. Using the following algorithm we can
eliminate common sub-expressions.

Algorithm − Algorithm for global common expression elimination.

Input − A flow graph with available expression.

Output − A flow graph after eliminating common sub-expression.

Method − For every statement s of the form a := b + c such that b + c is an
         available expression (i.e. b + c is available at the beginning of the
         block containing s and neither b nor c is defined prior to
         statement s), perform the following steps.

1. Discover the evaluations of b + c that reach the block containing statement s.

2. Create a new variable m.

3. Replace each statement d := b + c found in step 1 by

   m := b + c

   d := m

4. Replace statement s by a := m.

Let us apply this algorithm and perform global common sub-expression


elimination.

Example: Consider a flow graph as given below.

Step 1

t1 := 4 ∗ k             We will discover the evaluations of 4 ∗ k;
t2 := a[t1]             the expression 4 ∗ k is an available expression

t5 := 4 ∗ k
t6 := a[t5]

Step 2 and 3

m  := 4 ∗ k             (12)
t1 := m
t2 := a[t1]             (15)

t5 := m                 (12) can be assigned to t5
t6 := a[t5]             (15) can be assigned to t6

                        This ultimately avoids re-computation of 4 ∗ k

Step 4

Now if we assign value numbers to common sub-expression then,

(12) : = 4 * k

(15) : = a[(12)]

t5 : = (12)

t6 : = (15)
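
At the source level the effect of the algorithm on this example can be pictured
with the small C program below; the array contents and the value of k are
assumptions made only so the fragment can run, and both versions print the same
values, showing that the rewrite avoids recomputing 4 ∗ k without changing the
meaning:

#include <stdio.h>

int main(void)
{
    int a[64];
    int k = 3;
    for (int j = 0; j < 64; j++) a[j] = 100 + j;

    /* Before the transformation: 4 * k is evaluated in two places. */
    int t1 = 4 * k, t2 = a[t1];
    int t5 = 4 * k, t6 = a[t5];

    /* After global common sub-expression elimination: the value is
       computed once into m and simply copied where it is needed.   */
    int m  = 4 * k;
    int u1 = m,     u2 = a[u1];
    int u5 = m,     u6 = a[u5];

    printf("before: t2=%d t6=%d\n", t2, t6);
    printf("after : u2=%d u6=%d\n", u2, u6);
    return 0;
}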

5.7.2. Copy Propagation:

An assignment of the form a := b is called a copy statement. The idea behind
the copy propagation transformation is to use b for a wherever possible after
the copy statement a := b. Let us see the algorithm for copy propagation.

Algorithm: Copy propagation

Input: A flow graph with use-definition (i.e. ud) chains giving the definitions reaching block B.



The flow graph should also contain the set of copy statements x := y that reach
block B along every path, with no assignment to either x or y along any of those
paths. We also need du chains (i.e. definition-use chains) so that the uses of
every definition can be obtained.

Output: A graph after applying copy propagation transformation.

Method: For each copy statement s of the form x := y, perform the following steps.

1. Determine the statements in which x is used. These statements should be
   reachable from the definition of x in statement s.

2. There should not be any definition of x or y occurring between statement s
   and a use of x found in step (1).

3. If s satisfies the condition mentioned in step (2), then remove s and replace
   all uses of x found in (1) by y.

For example:

Step 1 and 2

x: = t3 This is a copy statement, (def)

a[t1] : = t2

a[t4] : = x Use

y:=x+3 Use

a[t5] : = y

Since the values of t3 and x are not altered along the path from the definition,
we will replace x by t3 and then eliminate the copy statement.

After replacing x by t3:

x := t3
a[t1] := t2
a[t4] := t3
y := t3 + 3
a[t5] := y

Eliminating the copy statement x := t3 then gives:

a[t1] := t2
a[t4] := t3
y := t3 + 3
a[t5] := y
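
The same substitution can be sketched as a small local pass in C. The
three-address statement representation below is an assumption made only for this
illustration; after a copy x := y it replaces later uses of x by y inside one
block, leaving removal of the now-dead copy to a separate dead-code elimination
step:

#include <stdio.h>
#include <string.h>

/* A tiny three-address statement: res := arg1 op arg2.
   op == ' ' marks a plain copy  res := arg1.            */
struct Stmt { char res[8], arg1[8], op, arg2[8]; };

int main(void)
{
    /* a1, a4, a5 stand for the array references a[t1], a[t4], a[t5]. */
    struct Stmt code[] = {
        { "x",  "t3", ' ', ""  },    /* x  := t3   (the copy)  */
        { "a1", "t2", ' ', ""  },    /* a[t1] := t2            */
        { "a4", "x",  ' ', ""  },    /* a[t4] := x             */
        { "y",  "x",  '+', "3" },    /* y  := x + 3            */
        { "a5", "y",  ' ', ""  },    /* a[t5] := y             */
    };
    int n = (int)(sizeof code / sizeof code[0]);

    /* Local copy propagation: after a copy x := y, and as long as neither
       x nor y is redefined, replace later uses of x by y.               */
    for (int i = 0; i < n; i++) {
        if (code[i].op != ' ' || code[i].arg2[0] != '\0')
            continue;                          /* not a plain copy        */
        const char *x = code[i].res, *y = code[i].arg1;
        for (int j = i + 1; j < n; j++) {
            if (strcmp(code[j].res, x) == 0 || strcmp(code[j].res, y) == 0)
                break;                         /* x or y redefined: stop  */
            if (strcmp(code[j].arg1, x) == 0) strcpy(code[j].arg1, y);
            if (strcmp(code[j].arg2, x) == 0) strcpy(code[j].arg2, y);
        }
    }

    for (int i = 0; i < n; i++) {
        if (code[i].op == ' ')
            printf("%s := %s\n", code[i].res, code[i].arg1);
        else
            printf("%s := %s %c %s\n", code[i].res, code[i].arg1,
                   code[i].op, code[i].arg2);
    }
    return 0;
}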

5.7.3. Induction Variable:

A variable i is called an induction variable of a loop L if, every time it
changes its value inside L, it is incremented or decremented by some constant
amount.

For example, in the loop for i := 1 to 10, i is an induction variable.

While eliminating induction variables first of all we have to identify all the
induction variables.

Generally induction variables come in following forms

a:=i∗b

a:=b∗i

a:=i±b

a:=b±i

where b is a constant and i is an induction variable, basic or otherwise.

If i is a basic induction variable, then a is in the family of i; the value of a
depends on the definition of i.

For example, for a := i ∗ b the triple for a is (i, b, 0). We will understand
this concept of writing triples with the help of a block.

Block B2 has the basic induction variable i because i gets incremented by 1 on
each iteration of loop L. The family of i contains t2 because there is an
assignment t2 := 4 ∗ i. Hence the triple for t2 is (i, 4, 0).

B2
i  := i + 1
t2 := 4 ∗ i
t3 := a[t2]
if t3 < 10 goto B2

In the triple (i, 4, 0), i is the induction variable and 4 matches the constant b.

Let us see an algorithm for elimination of induction variables.

Algorithm − Elimination of induction variables.

Input − A loop L with reaching definition information, loop invariant


computation and live variable information.

Output − A flow graph without induction variables.

Method:

1. Find an induction variable j in the family of a basic induction variable i,
   with triple (i, c, d). Consider a test of the form

   if i relop x goto B

   where i is the basic induction variable and x is not an induction variable.
   Replace the test by

   t := c ∗ x

   t := t + d

   if j relop t goto B

   where t is a new temporary.

Finally, delete all assignments to the eliminated induction variables from the
loop L, because these induction variables will now be useless.

For example

B1
i  := m + 3
k  := n
t1 := 4 ∗ n
v  := a[t1]

B2
i  := i + 1
t2 := 4 ∗ i
t3 := a[t2]
if t3 < v goto B2

B3
k  := k − 1
t4 := 4 ∗ k
t5 := a[t4] + 10
if t5 < v goto B3

B4
if i >= k goto B6

B5        B6

Here i in B2 and k in B3 are two induction variables, because their values get
changed at each iteration. We will create new temporary variables r1 and r2 to
which the induction variables i and k are assigned.

Fig. 5.5

The flow graph can be rewritten as:

B1
i  := m + 3
k  := n
t1 := 4 ∗ n
v  := a[t1]
r1 := 4 ∗ i        } note these newly
r2 := 4 ∗ k        } introduced variables

B2
i  := i + 1        Note that now we can simply perform
r1 := r1 + 4       r1 + 4 to get the effect of the next
t2 := r1           value of 4 ∗ i
t3 := a[t2]
if t3 < v goto B2

B3
k  := k − 1        Note that now we can simply perform
r2 := r2 − 4       r2 − 4 to get the effect of the next
t4 := r2           value of 4 ∗ k
t5 := a[t4] + 10
if t5 < v goto B3

B4
if i >= k goto B6

B5        B6

Fig. 5.6: Flow graph with strength reduction
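
The same idea can be written out at the source level. The C program below is
only an assumed illustration of what the compiler does internally: instead of
recomputing a multiple of the loop index (the analogue of 4 ∗ i), it keeps a
running pointer that is bumped by one element on every iteration (the analogue
of r1 := r1 + 4):

#include <stdio.h>

int main(void)
{
    int a[10], b[10];
    for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 2 * i; }

    /* Original loop: the offset 4*i (on a machine with 4-byte ints) is an
       induction variable derived from i and is recomputed on every pass.  */
    long sum1 = 0;
    for (int i = 0; i < 10; i++)
        sum1 += a[i] + b[i];

    /* After strength reduction: keep running pointers and add the element
       size each time instead of multiplying the loop index.               */
    long sum2 = 0;
    int *pa = a, *pb = b;
    for (int i = 0; i < 10; i++) {
        sum2 += *pa + *pb;
        pa++;                      /* analogue of r1 := r1 + 4 */
        pb++;                      /* analogue of r2 := r2 + 4 */
    }

    printf("sum1 = %ld, sum2 = %ld\n", sum1, sum2);
    return 0;
}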

*********
LABORATORY

EX. NO: 01

DATE:

DEVELOP A LEXICAL ANALYZER


TO RECOGNIZE A FEW PATTERNS IN C

AIM:

To Write a C program to develop a lexical analyzer to recognize a few patterns


in C.

ALGORITHM:

Step 1 : Start the program

Step 2 : Include the header files.

Step 3 : Allocate memory for the variable by dynamic memory allocation


function.

Step 4 : Use the file accessing functions to read the file.

Step 5 : Get the input file from the user.

Step 6 : Separate all the file contents as tokens and match it with the
functions.

Step 7 : Define all the keywords in a separate file and name it as key.c

Step 8 : Define all the operators in a separate file and name it as oper.c

Step 9 : Give the input program in a file and name it as input.c

Step 10 : Finally print the output after recognizing all the tokens.

Step 11 : Stop the program.



PROGRAM:
#include<stdio.h>
#include<conio.h>
#include<ctype.h>
#include<string.h>
void main()
{
FILE *fi,*fo,*fop,*fk;
int flag=0,i=1;
char c,t,a[15],ch[15], file[20];
clrscr();
printf(“\n Enter the File Name:”);
scanf(“%s”,&file);
fi=fopen(file, “r”);
fo=fopen(“inter.c”, “w”);
fop=fopen(“oper.c”,"r");
fk=fopen(“key.c”,"r");
c=getc(fi);
while(!feof(fi))
{
if(isalpha(c)||isdigit(c)||c=='['||c==']'||c=='.')
fputc(c,fo);
else
{
if(c==‘\n’)
fprintf(fo,"\t$\t");
else fprintf(fo,"\t%c\t",c);
}
c=getc(fi);
}

fclose(fi);
fclose(fo);
fi=fopen(“inter.c”,"r");
printf(“\n Lexical Analysis”);
fscanf(fi,"%s",a);
printf(“\n Line: %d\n”,i++);
while (!feof(fi))
{
if(strcmp(a,"$")==0)
{
printf(“\n Line: %d \n”,i++);
fscanf(fi,"%s",a);
}
fscanf(fop,"%s",ch);
while(!feof(fop))
{
if(strcmp(ch,a)==0)
{
fscanf(fop,"%s",ch);
printf(“\t\t%s\t:\t%s\n”,a,ch);
flag=1;
} fscanf(fop, “%s”,ch);
}
rewind(fop);
fscanf(fk,"%s",ch);
while(!feof(fk))
{
if(strcmp(ch,a)==0)
{
fscanf(fk,"%s",ch);

printf(“\t\t%s\t:\tKeyword\n”,a);
flag=1;
}fscanf(fk,"%s",ch);
}
rewind(fk);
if(flag==0)
{
if(isdigit(a[0]))
printf(“\t\t%s\t:\tConstant\n”,a);
else
printf(“\t\t%s\t:\tIdentifier\n”,a);
}flag=0;
fscanf(fi,"%s",a); }
getch();
}

Key.C:
int
void
main
char
if
for
while
else
printf
scanf
FILE
Include
stdio.h
conio.h
iostream.h

Oper.C:
( openpara
) closepara
{ openbrace
} closebrace
< lesser
> greater
“ doublequote
‘ singlequote
: colon
; semicolon
# preprocessor
= equal
== assign
% percentage
^ bitwise
& reference
* star
+ add
− sub
\ backslash
/ slash

Input.C:
#include “stdio.h”
#include “conio.h”
void main()
{
int a=10,b,c;
a=b*c;
getch();
}

OUTPUT:

Enter the file name : input.c

LEXICAL ANALYSIS

Line : 1

# : preprocessor

include : keyword

“ : doublequote

stdio.h : keyword

“ : doublequote

Line : 2

# : preprocessor

include : keyword

“ : doublequote

stdio.h : keyword

“ : doublequote

Line : 3

void : keyword

main : keyword

( : openpara

) : closepara

Line : 4

{ : openbrace

Line : 5

int : keyword

a : identifier

= : equal

10 : constant

. : identifier

b : identifier

. : identifier

c : identifier

: : semicolon

Line : 6

a : identifier

= : equal

b : identifier

* : star

c : identifier

; : semicolon

Line : 7

getch : identifier

( : openpara

) : closepara

; : semicolon

Line : 8

) : closebrace

Line : 9

$ : identifier

Result:

Thus the above program for developing the lexical analyzer and recognizing a
few patterns in C is executed successfully and the output is verified.

*********

EX. NO: 02
DATE:

IMPLEMENTATION OF SYMBOL TABLE USING C


AIM:
To write a program for implementing Symbol Table using C.

ALGORITHM:

Step 1 : Start the program for performing insert, display, delete, search and
modify options in the symbol table.

Step 2 : Define the structure of the Symbol Table.

Step 3 : Enter the choice for performing the operations in the symbol Table.

Step 4 : If the entered choice is 1, search the symbol table for the symbol to
be inserted. If the symbol is already present, it displays “Duplicate
Symbol”. Else, insert the symbol and the corresponding address in the
symbol table.

Step 5 : If the entered choice is 2, the symbols present in the symbol table
are displayed.

Step 6 : If the entered choice is 3, the symbol to be deleted is searched in the


symbol table.

Step 7 : If it is not found in the symbol table it displays “Label Not found”.
Else, the symbol is deleted.

Step 8 : If the entered choice is 5, the symbol to be modified is searched in the


symbol table.

PROGRAM CODE:
#include<stdio.h>
#include<ctype.h>
#include<stdlib.h>
#include<string.h>
#include<math.h>
void main()
{

int i=0,j=0,x=0,n;
void *p,*add[5];
char ch,srch,b[15],d[15],c;
printf(“Expression terminated by $:”);
while((c=getchar())!=‘$’)
{
b[i]=c;
i++;
}
n=i-1;
printf(“Given Expression:”);
i=0;
while(i<=n)
{
printf(“%c”,b[i]);
i++;
}
printf(“\n Symbol Table\n”);
printf(“Symbol \t addr \t type”);
while(j<=n)
{
c=b[j];
if(isalpha(toascii(c)))
{
p=malloc(c);
add[x]=p;
d[x]=c;
printf(“\n%c \t%d \t identifier\n”,c,p);
x++;
j++;

}
else
{
ch=c;
if(ch=='+'||ch=='-'||ch=='*'||ch=='=')
{
p=malloc(ch);
add[x]=p;
d[x]=ch;
printf(“\n %c \t%d \t operator\n”,ch,p);
x++;
j++;
}}}}

OUTPUT:
Expression terminated by $:A+B+C=D$
Given Expression: A+B+C=D
Symbol Table

Symbol Addr type


A 25731088 identifier
+ 25731168 operator
B 25731232 identifier
+ 25731312 operator
C 25731376 identifier
= 25731456 operator
D 25731536 identifier

RESULT:

Thus the program for symbol table has been executed successfully.

*********

EX. NO: 03
DATE:

IMPLEMENTATION OF A LEXICAL ANALYZER


USING LEX TOOL

AIM:
To write a program for implementing a Lexical analyzer using LEX tool in Linux
platform.

ALGORITHM:

Step 1 : Lex program contains three sections: definitions, rules, and user
subroutines. Each section must be separated from the others by a line
containing only the delimiter, %%. The general format for LEX tool is
as follows: definitions %% rules %% user_subroutines.
Step 2 : In definition section, the variables make up the left column, and their
definitions make up the right column. Any C statements should be
enclosed in %{..}%. Identifier is defined such that the first letter of an
identifier is alphabet and remaining letters are alphanumeric.
Step 3 : In rules section, the left column contains the pattern to be recognized
in an input file to yylex(). The right column contains the C program
fragment executed when that pattern is recognized. The various
patterns are keywords, operators, new line character, number, string,
identifier, beginning and end of block, comment statements,
preprocessor directive statements etc.
Step 4 : Each pattern may have a corresponding action, that is, a fragment of
C source code to execute when the pattern is matched.
Step 5 : When yylex() matches a string in the input stream, it copies the
matched text to an external character array, yytext, before it executes
any actions in the rules section.
Step 6 : In user subroutine section, main routine calls yylex(). yywrap() is used
to get more input.
Step 7 : The lex command uses the rules and actions contained in the file to
generate a program, lex.yy.c, which can be compiled with the cc
command. That program can then receive input, break the input into
the logical pieces defined by the rules in the file, and run the
program fragments contained in the actions in the file.

PROGRAM:
%{
int COMMENT=0;
%}
identifier [a-zA-Z][a-zA-Z0-9]*
%%
#.* {printf("\n%s is a preprocessor directive",yytext);}
int |
float |
char |
double |
while |
for |
struct |
typedef |
do |
if |
break |
continue |
void |
switch|
return |
else |
goto {printf(“\n\t%s is a keyword”,yytext);}
"/*" {COMMENT=1; printf("\n\t%s is a COMMENT",yytext);}
{identifier}\( {if(!COMMENT)printf(“\nFUNCTION\n\t%s”,yytext);}
\{ {if(!COMMENT)printf(“\n BLOCK BEGINS”);}
\} {if(!COMMENT)printf("\n BLOCK ENDS");}
{identifier}(\[[0-9]*\])? {if(!COMMENT) printf("\n %s IDENTIFIER",yytext);}
\".*\" {if(!COMMENT)printf(“\n\t %s is a STRING”,YYTEXT);}

[0-9]+ {if(!COMMENT) printf("\n %s is a NUMBER ",yytext);}

\)(\:)? {if(!COMMENT)printf(“\n\t”);ECHO;printf(“\n”);}

\( ECHO;
= {if(!COMMENT)printf(“\n\t %s is an ASSIGNMENT OPERATOR”,yytext);}

\<= |

\>= |

\< |
== |

\> {if(!COMMENT) printf(“\n\t%s is a RELATIONAL OPERATOR”,yytext);}

%%
int main(int argc, char **argv)
{
FILE *file;
file=fopen("output.c","r");
if(!file)
{
printf("could not open the file");
exit(0);
}

yyin=file;

yylex();
printf(“\n”);

return(0);
}

int yywrap()
{

return(1);

}

INPUT:
/*output.c*/
#include<stdio.h>
#include<conio.h>
void main()
{
int a,b,c;
a=1;
b=2;
c=a+b;
printf(“Sum:%d”,c);
}
OUTPUT:
To Compile and Run:
lex filename.l
cc filename.yy.c
./a.out

#include<stdio.h> is a preprocessor directive


#include<conio.h> is a preprocessor directive
void is a keyword
FUNCTION
main(
)

BLOCK BEGINS

int is a keyword
a IDENTIFIER,
b IDENTIFIER,
c IDENTIFIER;

a IDENTIFIER

= is an ASSIGNMENT OPERATOR

1 is a NUMBER ;

b IDENTIFIER

= is an ASSIGNMENT OPERATOR

2 is a NUMBER ;

c IDENTIFIER

= is an ASSIGNMENT OPERATOR

a IDENTIFIER.

b IDENTIFIER;

FUNCTION

printf(

"Sum:%d" is a STRING,

c IDENTIFIER

BLOCK ENDS

RESULT:

Thus the program for implementation of Lexical Analyzer using LEX tool has
been executed successfully.

*********

EX. NO: 04

DATE:

IMPLEMENTATION OF AN ARITHMETIC CALCULATOR


USING LEX & YACC

AIM:

To write a program for implementing an arithmetic calculator for computing the


given expression using semantic rules of the YACC and LEX Tool.

ALGORITHM:

Step 1 : A YACC source program has three parts as follows:


Declaration %% Translation Rules %% Supporting C Routines
Step 2 : Declarations Section: This section contains entries that:
(i) Include standard I/O header file.
(ii) Define global variables.
(iii) Define the list rule as the place to start processing.
(iv) Define the tokens used by the parser.
(v) Define the operators and their precedence.
Step 3 : Rules Section: The rules section defines the rules that parse the input
stream. Each rule of a grammar production and the associated semantic
action.
Step 4 : Programs Section: The programs section contains the following
subroutines. Because these subroutines are included in this file, it is
not necessary to use the YACC library when processing this file.
Step 5 : Main − The required main program that calls the yyparse() subroutine
to start the program.
Step 6 : yyerror(s) − This error-handling subroutine only prints a syntax error
message.
Step 7 : yywrap() − The wrap-up subroutine that returns a value of 1 when the
end of input occurs. The calc.lex file contains include statements for
standard I/O and for y.tab.h, which is generated when we use the
−d flag with the YACC command. The y.tab.h file contains
definitions for the tokens that the parser program uses.
Step 8 : calc.lex contains the rules to generate these tokens from the input
stream.

PROGRAM:
LEX:
%{
#include<stdio.h>
#include “y.tab.h”
extern int yylval;
%}
%%
[0-9]+ {
yylval=atoi(yytext);
return NUMBER;
}
[\t];
[\n] return 0;
. return yytext[0];
%%
int yywrap()
{
return 1;
}

YACC:
%{
#include<stdio.h>
int flag=0;
%}
%token NUMBER
%left ‘+’ ‘-’
%left ‘*’‘/’‘%’
%left ‘(’’)’
%%

ArithmeticExpression: E{
printf(“\nResult=%d\n”,$$);
return 0;
};
E:E‘+’E {$$=$1+$3;}
|E‘-’E {$$=$1-$3;}
|E‘*’E {$$=$1*$3;}
|E‘/’E {$$=$1/$3;}
|E‘%’E {$$=$1%$3;}
|‘(’E‘)’ {$$=$2;}
| NUMBER {$$=$1;};
%%
void main()
{
printf("\nEnter Any Arithmetic Expression which can have operations Addition,
Subtraction, Multiplication, Division, Modulus and Round brackets:\n");
yyparse();
if(flag==0)
printf("\nEntered arithmetic expression is Valid\n\n");
}
void yyerror()
{
printf("\nEntered arithmetic expression is Invalid\n\n");
flag=1;
}

OUTPUT:

Enter Any Arithmetic Expression which can have operations Addition, subtraction.
Multiplication, Divison, Modulus and Round brackets:
((5+6+10+4+5)/5)×2

Result=0

Entered arithmetic expression is valid

virusovirus.desktop:-/Desktop/syedviruss ./a.out
Enter Any Arithmetic Expression which can have operations Addition, Subtraction,
Multiplication, Division, Modulus and Round brackets:
(9=0)

Entered arithmetic expression is invalid

Result:
The above C program to implement a calculator using LEX and YACC was
successfully executed and verified.

*********

EX. NO: 05

DATE:

GENERATE THREE ADDRESS CODE FOR A


SIMPLE PROGRAM USING LEX & YACC

AIM:

To Generate three address code for a simple program using Lex & YACC

ALGORITHM:

Step 1 : Declaration of header files specially y.tab.h which contains declaration


for Letter, Digit, expr.

Step 2 : End declaration section by %%

Step 3 : Match regular expression.

Step 4 : If match found then convert it into char and store it in yylval.p where
p is pointer declared in YACC.

Step 5 : Return token.

Step 6 : If input contains new line character (\n) then return 0.

Step 7 : If the input contains any other character, then return yytext[0].

Step 8 : End rule-action section by %%.

Step 9 : Declare main function.

PROGRAM:

%{
#include <stdio.h>
#include <stdlib.h>
#include "y.tab.h"
%}
%%

[0-9]+ {yylval.dval=yytext[0];return NUM;}


[ \t];
\n return 0;
. {return yytext[0];}
%%
void yyerror(char *str)
{
printf("\n Invalid Character...");
}
int main()
{
printf(“Enter Expression x=>”);
yyparse();
return(0);
}
*****************three.y*******************
%{
#include <stdio.h>
int yylex(void);
char p=‘A’-1;
%}
%union
{
char dval;
}
%token NUM
%left ‘+’ ‘-’
%left ‘*’ ‘/’
%nonassoc UMINUS

%type <dval> S
%type <dval> E
%%
S : E {printf("\n x = %c\n",$1);}
;
E : NUM {}
| E '+' E {p++; printf("\n %c = %c + %c",p,$1,$3);$$=p;}
| E '-' E {p++; printf("\n %c = %c - %c",p,$1,$3);$$=p;}
| E '*' E {p++; printf("\n %c = %c * %c",p,$1,$3);$$=p;}
| E '/' E {p++; printf("\n %c = %c / %c",p,$1,$3);$$=p;}
| '(' E ')' {$$=$2;}
| '-' E %prec UMINUS {p++;printf("\n %c = -%c",p,$2);$$=p;}
;
%%

OUTPUT:
Enter Expression x => 1+2-3*3/1+4*5
A = 1+2
B = 3*3
C = B/1
D = A-C
E = 4*5
F = D+E
X = F
[a40@localhost ~]$ ./a.out
Enter Expression x => 1+2*(3+4)/5
A = 3+4
B = 2*A
C = B/5
D = 1+C
X = D

[40@localhost ~]$ ./a.out


Enter Expression x => 1+2*(-3+-6/1)*3
A = -3
B = -6
C = B/1
D = A+C
E = 2*D
F = E*3
G = 1+F
X = G

RESULT:
The above program to Generate three address code for a simple program using
Lex & Yacc was successfully executed and verified.

*********

EX. NO: 06
DATE:

IMPLEMENTATION OF SIMPLE CODE


OPTIMIZATION TECHNIQUES
AIM:
To write a C program to implement Code Optimization Techniques.

ALGORITHM:
1. Start

2. Create an input file which contains three address code.

3. Open the file in read mode.

4. If the file pointer returns NULL, exit the program else go to 5.

5. Scan the input symbol from the left to right.

Common Sub expression elimination


6. Store the first expression in a string.

7. Compare the string with the other expressions in the file.

8. If there is a match, remove the expression from the input file.

9. Perform these steps 5 to 8 for all the input symbols in the file.

Dead code Elimination


10. Scan the input symbol from the file from left to right.

11. Get the operand before the operator from the three address code.

12. Check whether the operand is used in any other expression in the three
address code.

13. If the operand is not used, then eliminate the complete expression from the
three address code else go to 14.

14. Perform steps 11 to 13 for all the operands in the three address code till end
of file is reached.

15. Stop.

PROGRAM:

#include<stdio.h>
#include<conio.h>
#include<string.h>
struct op
{
char l;
char r[20];
}
op[10],pr[10];
void main()
{
int a,i,k,j,n,z=0,m,q;
char *p,*l;
char temp,t;
char *tem;
clrscr();
printf(“Enter the Number of Values:”);
scanf(“%d”,&n);
for(i=0;i<n;i++)
{
printf(“left: ”);
op[i].l=getche();
printf(“\tright: ”);
scanf(“%s”,op[i].r);
}
printf(“Intermediate Code\n”);
for(i=0;i<n;i++)
{

printf(“%c=”,op[i].l);
printf(“%s\n”,op[i].r);
}
for(i=0;i<n-1;i++)
{
temp=op[i].l;
for(j=0;j<n;j++)
{
p=strchr(op[j].r,temp);
if(p) {
pr[z].l=op[i].l;
strcpy(pr[z].r,op[i].r);
z++;
}
}
}
pr[z].l=op[n-1].l;
strcpy(pr[z].r,op[n-1].r);
z++;
printf("\nAfter Dead Code Elimination\n");
for(k=0;k<z;k++) {
printf(“%c\t=”,pr[k].l);
printf(“%s\n”,pr[k].r);
}
for(m=0;m<z;m++) {
tem=pr[m].r;
for(j=m+1;j<z;j++)
{
p=strstr(tem,pr[j].r);
if(p)

{
t=pr[j].l;
pr[j].l=pr[m].l;
for(i=0;i<z;i++)
{
l=strchr(pr[i].r,t) ;
if(l)
{
a=l-pr[i].r;
printf(“pos: %d”,a);
pr[i].r[a]=pr[m].l;
} }
} }
}
printf(“Eliminate Common Expression\n”);
for(i=0;i<z;i++) {
printf(“%c\t=”,pr[i].l);
printf(“%s\n”,pr[i].r);
}
for(i=0;i<z;i++) {
for(j=i+1;j<z;j++) {
q=strcmp(pr[i].r,pr[j].r);
if((pr[i].l==pr[j].l)&&!q) {
pr[i].l=‘\0’;
strcpy(pr[i].r,"");
}
}
}
printf(“Optimized Code\n”);
for(i=0;i<z;i++) {

if(pr[i].l!='\0') {

printf(“%c=”,pr[i].l);
printf(“%s\n”,pr[i].r);

}
}

getch();
}

OUTPUT:

Enter the Number of Values:5


left: a right: 9

left: b right: c+d


left: e right: c+d

left: f right: b+c


left: r right: f

Intermediate Code

a=9
b=c+d

e=c+d
f=b+e

r=f

After Dead Code Elimination


b =c+d

e =c+d
f =b+c

r =f

pos: 2Eliminate Common Expression


b =c+d
b =c+d
f =b+b
r =f

Optimized Code
b=c+d
f=b+b
r=f

*********

EX. NO: 07

DATE:

IMPLEMENT THE BACK END OF THE COMPILER

AIM:

To implement the back end of the compiler which takes the three address code
and produces the 8086 assembly language instructions that can be assembled and
run using a 8086 assembler. The target assembly instructions can be simple move,
add, sub, jump. Also simple addressing modes are used.

ALGORITHM:

Step 1 : Start the program

Step 2 : Open the source file and store the contents as quadruples.

Step 3 : Check the operator in each quadruple: if it is an arithmetic operator,
generate code for it; if it is an assignment operator, generate code
for it; else perform unary minus on register C.

Step 4 : Write the generated code into output definition of the file in outp.c

Step 5 : Print the output.

Step 6 : Stop the program.

PROGRAM:

#include<stdio.h>


//#include<conio.h>

#include<string.h>

void main()
{

char icode[10][30],str[20],opr[10];
int i=0;
//clrscr();
printf(“\n Enter the set of intermediate code (terminated by exit):\n”);
do
{
scanf(“%s”,icode[i]);
} while(strcmp(icode[i++], “exit”)!=0);
printf(“\n target code generation”);
printf(“\n****************”);
i=0;
do
{
strcpy(str,icode[i]);
switch(str[3])
{
case ‘+’:
strcpy(opr,"ADD");
break;
case ‘-’:
strcpy(opr,"SUB");
break;
case ‘*’:
strcpy(opr,"MUL");
break;
case ‘/’:
strcpy(opr,"DIV");
break;
}
printf(“\n\tMov %c,R%d”,str[2],i);

printf(“\n\t%s%c,R%d”,opr,str[4],i);
printf(“\n\tMov R%d,%c”,i,str[0]);
}while(strcmp(icode[++i],"exit")!=0);
//getch();
}

OUTPUT:

Enter the set of intermediate code (terminated by exit):

d=2/3
c=4/5
a=2*e
exit

target code generation


***********************
Mov 2,R0
DIV3,R0
Mov R0,d
Mov 4,R1
DIV5,R1
Mov R1,c
Mov 2,R2
MULe,R2
Mov R2,a

RESULT:

Thus the program to translate three address code into 8086 assembly
instructions has been implemented and executed successfully.

*********
B.E./B.Tech. DEGREE EXAMINATION, APRIL/MAY 2017
Sixth Semester
Computer Science and Engineering
CS6660 − COMPILER DESIGN
(Common to : Information Technology)
(Regulations 2013)

Time : Three Hours Maximum : 100 Marks


Answer ALL questions (10 × 2 = 20 Marks)
PART – A

1. Define the two parts of compilation.


2. List the cousins of the compiler.
3. Write a regular expression for an identifier and number.

4. What are the various parts in LEX program?


5. Eliminate the left recursion for the grammar A → Ac | Aad | bd.
6. What are the various conflicts that occur during shift reduce parsing?

7. What do you mean by binding of names?


8. Mention the rules for type checking.
9. What is a basic block?

10. What do you mean by copy propagation?

PART – B (5 × 16 = 80 Marks)

11. (a) What are the phases of the compiler? Explain the phases in detail. Write
down the output of each phase for the expression a : = b + c ∗ 60. (16)

(Or)

(b) (i) Explain briefly about compiler construction tools. (6)

(ii) Describe in detail about cousins of the compiler. (4)

(iii) Draw the transition diagram for relational operators and unsigned
numbers. (6)

12. (a) Convert the Regular Expression abb (a / b)* to DFA using direct method
and minimize it. (16)

(Or)

(b) (i) Differentiate between lexeme, token and pattern. (6)

(ii) What are the issues in lexical analysis? (4)

(iii) Draw the transition diagram for relational operators and unsigned
numbers. (6)

13. (a) Construct a predictive parsing table for the grammar

S → (L) | a

L → L, S | S

and show whether the following string will be accepted or not: (a,(a,(a,a))). (16)

(Or)

(b) Consider the following Grammar

E → E + T | T

T → T F | F

F → F ∗ | a | b

Construct the SLR parsing table for the above grammar. (16)

14. (a) What are the different storage allocation strategies? (16)

(Or)

(b) (i) Explain in detail about Specification of a simple type checker (10)

(ii) Explain about the parameter passing. (6)

15. (a) Discuss the various issues in design of Code Generator. (16)

(Or)

(b) (i) Explain in detail about optimization of Basic Blocks. (8)



(ii) Construct the DAG for the following Basic Block. (8)

1. t1 : = 4∗i

2. t2 : = a [t1]

3. t3 : = 4∗i

4. t4 : = b[t3]

5. t5 : = t2∗t4

6. t6 : = prod+t5

7. prod : = t6

8. t7 : = i+1

9. i : = t7

10. if i < = 20 goto (1).

*********

B.E./B.Tech. DEGREE EXAMINATION, NOVEMBER/DECEMBER 2017


Sixth Semester
Computer Science and Engineering
CS6660 − COMPILER DESIGN
(Common to : Information Technology)
(Regulations 2013)

Time : Three Hours Maximum : 100 Marks


Answer ALL questions (10 × 2 = 20 Marks)
PART – A

1. What is an interpreter?
2. What do you mean by Cross-Compiler?
3. What is the role of lexical analysis phase?

4. Define Lexeme.
5. Draw syntax tree for the expression a=b ∗ − c+b∗ − c.
6. What are the three storage allocation strategies?

7. Differentiate NFA and DFA.


8. Compare syntax tree and parse tree.
9. Draw the DAG for the statement a = (a∗b+c)−(a∗b+c).

10. What are the properties of optimizing compilers?

PART – B (5 × 16 = 80 Marks)

11. (a) What are compiler construction tools? Write note on each Compiler
Construction tool.

(Or)

(b) Explain in detail the various phases of compilers with an example.

12. (a) (i) Discuss the issues involved in designing Lexical Analyzer.

(ii) Draw NFA for the regular expression ab*/ab.

(Or)

(b) Write an algorithm to convert NFA to DFA and minimize DFA. Give an
example.

13. (a) Explain the LR parsing algorithm with an example.

(Or)

(b) Explain the non-recursive implementation of predictive parsers with the


help of the grammar.

E → E + T | T
T → T ∗ F | F
F → (E) | id

14. (a) Explain the specification of simple type checker for statements,
expressions and functions.

(Or)

(b) Explain about runtime storage management.

15. (a) Discuss the issues in code generation with examples.

(Or)

(b) Explain briefly about the principal sources of optimization.

*********

B.E./B.Tech. DEGREE EXAMINATION, NOVEMBER/DECEMBER 2018


Sixth Semester
Computer Science and Engineering
CS6660 − COMPILER DESIGN
(Common to : Information Technology)
(Regulations 2013)

(Also common to PTCS 6660 – Compiler Design – for B.E. (Part-Time)


Fifth Semester – Computer Science and Engineering – Regulations 2014)

Time : Three Hours Maximum : 100 Marks

Answer ALL questions (10 × 2 = 20 Marks)


PART – A

1. Recall the two basic parts of a compilation process.


2. How is a source code translated to machine code?

3. State the rules to define regular expression.


4. Construct a regular expression for the language L = {w ∈ {a, b}* | w ends in abb}.

5. What are the different strategies by which a parser can recover from a syntactic error?
6. Define LR(0) item.
7. List three kinds of intermediate representation.

8. When procedure call occurs, what are the steps taken?


9. State the problems in code generation.
10. Define common sub expression.

PART – B (5 × 16 = 80 Marks)

11. (a) Write short notes about:

(i) Compiler Construction Tools. (7)

(ii) Lexeme, token and pattern. (6)

(Or)

(b) Discuss in detail about the operations of compiler which transforms the
source program from one representation into another. Illustrate the output
for the input: a = (b + c) ∗ (b + c) ∗ 2. (13)

12. (a) Write briefly about :

(i) The role of Lexical analyzer with the possible error Recovery actions. (5)

(ii) Recognition and specification of tokens. (8)

(Or)

(b) Construct the minimized DFA for the regular expression


(0 + 1) ∗ (0 + 1) 01. (13)

13. (a) Show that the following grammar

S → Aa | bAc | dc | bda
A → a

is LALR(1) but not SLR(1). (13)

(Or)

(b) Show that the following grammar

S → Aa | bAc | Bc | bBa
A → d
B → d

is LR(1) but not LALR(1). (13)

14. (a) Apply the S-attributed definition and construct syntax trees for a simple
expression grammar involving only the binary operators + and −. As usual,
these operators are at the same precedence level and are jointly left
associative. All non-terminals have one synthesized attribute node, which
represents a node of the syntax tree.

Production: E → E1 + T, E → T, T → (E), T → id | num. (13)

(Or)

(b) Discuss in detail about:

(i) Storage allocation strategies. (7)

(ii) Parameter passing methods. (6)

15. (a) Discuss in detail about optimization of basic blocks. (13)

(b) Explain in detail about issues in the design of a code generator. (13)

PART C – (1 × 15 = 15 marks)

16. (a) Suppose we have a production A → B C D. Each of the four non-terminals


has two attributes: s, which is synthesized, and i, which is inherited. For
each set of rules below, check whether the rules are consistent with (i) an
S-attributed definition, (ii) an L-attributed definition (iii) any evaluation
order at all.

(1) A.s = B.i + C.i

(2) A.s = B.i + C.s and D.i = A.i + B.s

(3) A.s = B.s + D.s

(4) A.s = D.i


B.i = A.s + C.s
C.i = B.s
D.i = B.i + C.i.

(Or)

(b) Construct a Syntax-Directed Translation scheme that translates arithmetic


expression from infix into postfix notation. (15)

*********
INDEX
A
Activation Record, 4.10
Ambiguous Grammar, 2.21
Analysis and Synthesis Model, 1.6
Automata, 1.42

B
Bottom up Parsing, 2.36

C
Canonical LR Parsing (CLR), 2.61
Comparison of LR Parsers, 2.73
Compiler Construction Tools, 1.10
Concept of Shift Reduce Parsing, 2.41
Construction of DAG, 5.15
Construction of LL(1) Parser, 2.34
Context-free Grammar, 2.3
Copy Propagation, 5.6, 5.25

D
Dead Code Eliminations, 5.7
Declarations, 3.23
Dependency Graphs, 3.4
Design of a Simple Code Generator, 4.19

E
Efficient Data Flow Analysis, 5.23
Either, 3.2
Error Handling, 2.25

F
Function preserving transformations examples, 5.5

G
Global Data Flow Analysis, 5.22

H
Heap Allocation, 4.7

I
Implementation of Three Address Code, 3.10
Induction Variable, 5.27
Input Buffering, 1.24
Instruction Costs, 4.18
Introduction to Code Optimization, 5.1
Introduction to Lexical Analysis, 1.11
Issues of Code Generator, 4.14

L
LALR, 2.70
Language Processing System, 1.4
Language for Specifying Lexical Analyzer, 1.33
Lex Program to Count Total Number of Tokens, 1.41
Lexical Errors, 1.23
Loop Optimizations, 5.7
LR Parser, 2.50

M
Minimizing DFA, 1.63

N
NFA with ∈ Closure, 1.46

O
Operator Precedence Parser, 2.46
Optimization of Basic Blocks, 5.19

P
Parameter Passing, 4.12
Parser, 1.8
Peep-hole Optimization, 5.9
Predictive Parsing, 2.28
Principal Sources of Optimisation, 5.4

R
Recognition of Token, 1.30
Recursive Descent Parsing, 2.26
Redundant Common Sub Expression Elimination, 5.23
Regular Expressions to Finite Automata, 1.58
Role of parser, 2.1
Role of Lexical Analyzer, 1.22

S
Semantic Routines, 1.8
Simple LR Parsing (SLR), 2.58
Source Language Issues, 4.3
Specification of Tokens, 1.25
Stack Allocation of Space, 4.5
Static Allocation, 4.5
Storage Allocation Strategies, 4.5
Storage Organization, 4.4
Strength Reduction, 5.6
Syntax Directed Definitions, 3.1
Syntax Tree, 3.7

T
Target Machine Description, 4.17
The Phases of the Compiler, 1.7
Three Address Code, 3.9
Token, Patterns, Lexemes, 1.23
Top Down Parsing, 2.26
Translation of Expressions, 3.25
Translator, 1.1
Type Checking, 3.32
Type Expressions, 3.32
Type Conversion, 3.35
Types of Three Address Code, 3.14
Types of Finite Automata, 1.45

W
Writing a Grammar, 2.20

Y
YACC, 2.76
YACC Specification, 2.79

*********