Notes - Compiler Design
Course Content: Theory: Introduction to Compiler, Phases and passes, Bootstrapping,
Regular grammar and regular language, Context-free grammars, capabilities of CFG,
Implementation of lexical analyzers, lexical-analyzer generator, ambiguous grammar,
BNF notation, YACC and LEX, Run-time environment.
Introduction to Parsers, Types of Parsers, Derivations: LMD, RMD and
parse tree; Syntax trees, CNF and GNF normal forms, Recursive grammar,
Left factoring, First and Follow symbols, LL(1) parser and
Construction of LL(1) parsing table, Recursive descent parser.
Introduction to Compiler
o A compiler is a translator that converts a high-level language into machine language.
o High-level language code is written by a developer, while machine language can be understood
by the processor.
o The compiler also reports errors in the source program to the programmer.
o The main purpose of a compiler is to translate code written in one language into another without
changing the meaning of the program.
o When you execute a program written in a high-level programming language, execution
happens in two parts.
o In the first part, the source program is compiled and translated into the object program (low-
level language).
o In the second part, the object program is translated into the target program through the
assembler.
Fig: Execution process of source program in Compiler
Phases; Passes
Compiler Phases
The compilation process is a sequence of phases. Each phase takes the source
program in one representation and produces output in another representation, taking its
input from the output of the previous phase.
The various phases of a compiler are:
Fig: phases of compiler
Lexical Analysis:
The lexical analyzer phase is the first phase of the compilation process. It takes source code as input,
reads the source program one character at a time, and converts it into meaningful lexemes. The lexical
analyzer represents these lexemes in the form of tokens.
Syntax Analysis
Syntax analysis is the second phase of the compilation process. It takes tokens as input and generates
a parse tree as output. In the syntax analysis phase, the parser checks whether the expression formed
by the tokens is syntactically correct.
Semantic Analysis
Semantic analysis is the third phase of the compilation process. It checks whether the parse tree
follows the rules of the language. The semantic analyzer keeps track of identifiers, their types and
expressions. The output of the semantic analysis phase is the annotated syntax tree.
Intermediate Code Generation
In intermediate code generation, the compiler translates the source code into an intermediate
code. The intermediate code lies between the high-level language and the machine language, and
it should be generated in such a way that it can easily be translated into the
target machine code.
Code Optimization
Code optimization is an optional phase. It is used to improve the intermediate code so that the
output of the program could run faster and take less space. It removes the unnecessary lines of the
code and arranges the sequence of statements in order to speed up the program execution.
Code Generation
Code generation is the final stage of the compilation process. It takes the optimized intermediate
code as input and maps it to the target machine language. Code generator translates the
intermediate code into the machine code of the specified computer.
Example:
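As an illustration (the classic textbook example; the exact target instructions and register names below are only indicative), consider how the statement position = initial + rate * 60 might pass through the phases:
Lexical analysis produces the token stream id1 = id2 + id3 * 60, where id1, id2 and id3 are symbol-table entries for position, initial and rate.
Syntax analysis builds a syntax tree for the assignment of the expression id2 + id3 * 60 to id1.
Semantic analysis checks the types and inserts a conversion: id1 = id2 + id3 * inttofloat(60).
Intermediate code generation emits three-address code:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code optimization simplifies this to:
t1 = id3 * 60.0
id1 = id2 + t1
Code generation then maps it to target code such as:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1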
Compiler Passes
A pass is a complete traversal of the source program. A compiler may traverse the source
program in a single pass or in multiple passes.
Multi-pass Compiler
o A multi-pass compiler processes the source code of a program several times.
o In the first pass, the compiler reads the source program, scans it, extracts the tokens and
stores the result in an output file.
o In the second pass, the compiler reads the output file produced by the first pass, builds the
syntactic tree and performs syntactic analysis. The output of this pass is a file that
contains the syntactic tree.
o In the third pass, the compiler reads the output file produced by the second pass and checks
whether the tree follows the rules of the language. The output of this semantic analysis pass is
the annotated syntax tree.
o Further passes continue in this way until the target output is produced.
One-pass Compiler
o A one-pass compiler traverses the program only once. The one-pass compiler
passes only once through the parts of each compilation unit and translates each part into its
final machine code.
o In a one-pass compiler, as each line of source is processed, it is scanned and its tokens are
extracted.
o Then the syntax of each line is analyzed and the tree structure is built. After the semantic
checks, the code is generated.
o The same process is repeated for each line of code until the entire program is compiled.
Bootstrapping
o Bootstrapping is widely used in compiler development.
o Bootstrapping is used to produce a self-hosting compiler. A self-hosting compiler is a
compiler that can compile its own source code.
o A bootstrap compiler is used to compile the compiler; the compiled compiler can then be
used to compile everything else, as well as future versions of itself.
A compiler can be characterized by three languages:
1. Source Language
2. Target Language
3. Implementation Language
The T-diagram represents a compiler as C[S, I, T]: it translates Source language S into Target language T and is itself implemented in language I.
Follow these steps to produce a compiler for a new language L on machine A:
1. Create a compiler C[S, A, A] for a subset S of the desired language L, written in language A, so that it
runs on machine A.
2. Create a compiler C[L, S, A] for the full language L, written in the subset S of L.
3. Compile C[L, S, A] using the compiler C[S, A, A] to obtain C[L, A, A]. C[L, A, A] is a compiler for language L, which runs
on machine A and produces code for machine A.
The process described by the T-diagrams is called bootstrapping.
Regular grammar and regular language
Regular Grammar :
A grammar is regular if it has rules of the form A -> a, A -> aB, or A -> ε, where ε is a
special symbol denoting the empty (NULL) string.
Regular Languages :
A language is regular if it can be expressed in terms of a regular expression.
Closure Properties of Regular Languages
Union :
If L1 and L2 are two regular languages, their union L1 ∪ L2 will also be regular.
For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}; L3 = L1 ∪ L2 = {a^n ∪ b^n | n ≥ 0} is also regular.
Intersection :
If L1 and L2 are two regular languages, their intersection L1 ∩ L2 will also be regular.
For example, L1 = {a^m b^n | n ≥ 0 and m ≥ 0} and L2 = {a^m b^n ∪ b^n a^m | n ≥ 0 and m ≥ 0};
L3 = L1 ∩ L2 = {a^m b^n | n ≥ 0 and m ≥ 0} is also regular.
Concatenation :
If L1 and L2 are two regular languages, their concatenation L1.L2 will also be regular.
For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}; L3 = L1.L2 = {a^m b^n | m ≥ 0 and n ≥ 0} is also regular.
Kleene Closure :
If L1 is a regular language, its Kleene closure L1* will also be regular.
For example, L1 = (a ∪ b); L1* = (a ∪ b)*.
Complement :
If L(G) is a regular language, its complement L'(G) will also be regular. The complement
of a language can be found by subtracting the strings which are in L(G) from the set of all possible
strings. For example, L(G) = {a^n | n > 3}; L'(G) = {a^n | n <= 3}.
Context free grammars, capabilities of CFG
What is Context Free Grammar?
A context-free grammar (CFG) is a formal grammar used to generate every
possible string of a given formal language.
A context-free grammar G is defined by a 4-tuple:
G = (V, T, P, S)
Here,
G refers to the grammar, which consists of a set of production rules. It is used to
generate the strings of a language.
T refers to the finite set of terminal symbols. Lower-case letters are used to denote them.
V refers to the finite set of nonterminal symbols. Capital letters are used to denote them.
P refers to the set of production rules, which are used to replace a nonterminal
symbol (on the production's left side) in a string with terminals and nonterminals (present
on the production's right side).
S refers to the start symbol, which is used to derive the string.
The start symbol is used in a CFG to derive the string. A string is derived by
repeatedly replacing a nonterminal by the right-hand side of one of its productions, until
all the nonterminals have been replaced by terminal symbols.
Capabilities of CFG
The CFG comes with various capabilities, such as:
Most programming-language constructs can be described using a context-free
grammar.
An efficient parser can be constructed automatically if the grammar is well designed.
Grammars suitable for expressions can be constructed using operator
associativity and precedence information.
Context-free grammars can describe nested structures, such as balanced
parentheses, matching begin-end blocks, and corresponding if-then-else statements (see the
example grammar below).
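For instance, a small illustrative grammar for balanced parentheses is:
S → ( S ) S | ε
and a sketch of nested if-then-else statements is:
stmt → if expr then stmt else stmt | if expr then stmt | other
(These example productions are textbook illustrations, not taken from a particular language definition.)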
Implementation of lexical analyzers.
### Lexical Analyzers - Short Notes
#### Definition
A **Lexical Analyzer** (or **Lexer**) is a
fundamental component of a compiler or interpreter
that converts a sequence of characters from source
code into a sequence of tokens. Each token represents
a meaningful element such as keywords, identifiers,
literals, and operators.
#### Functions of Lexical Analyzer
1. **Tokenization**: Breaking the input source code into
tokens.
2. **Removing Whitespaces and Comments**: Ignoring
irrelevant characters and comments.
3. **Error Detection**: Identifying invalid tokens or
sequences.
#### Components
- **Input Buffer**: Stores the source code to be
analyzed.
- **Scanner**: Reads characters from the input buffer
and groups them into tokens.
- **Symbol Table**: Maintains information about
identifiers (variables, functions, etc.).
#### Tokens
Tokens are the smallest units of meaningful data,
categorized as:
- **Keywords**: Reserved words with special meaning
(e.g., `if`, `else`, `while`).
- **Identifiers**: Names given to variables, functions,
etc.
- **Literals**: Constant values (e.g., numbers, strings).
- **Operators**: Symbols representing operations (e.g.,
`+`, `-`, `*`, `/`).
- **Punctuation**: Symbols that organize the code
structure (e.g., `;`, `,`, `{`, `}`).
#### Lexical Analysis Process
1. **Reading Input**: Characters are read from the
source code.
2. **Pattern Matching**: Sequences of characters are
matched against patterns defined by regular expressions.
3. **Token Generation**: Matched sequences are
converted into tokens.
4. **Token Stream**: A stream of tokens is generated
for the syntax analyzer (parser).
#### Example
Consider the source code snippet: `int x = 10;`
1. **Input**: `int x = 10;`
2. **Tokenization**:
- `int` → Keyword
- `x` → Identifier
- `=` → Operator
- `10` → Literal
- `;` → Punctuation
3. **Output**: Token stream: [Keyword(int),
Identifier(x), Operator(=), Literal(10), Punctuation(;)]
#### Tools and Techniques
- **Regular Expressions**: Used to define patterns for
tokens.
- **Finite State Machines (FSMs)**: Automata to
recognize patterns.
- **Lex/Flex**: Tools for generating lexical analyzers.
#### Error Handling
- **Unrecognized Tokens**: Report unexpected
characters.
- **Malformed Tokens**: Detect and report tokens that
do not conform to the expected pattern.
#### Benefits
- **Simplifies Parsing**: Reduces complexity by
breaking down the input into manageable tokens.
- **Error Reporting**: Provides early detection of
lexical errors.
#### Implementation Steps
1. **Define Token Patterns**: Using regular expressions.
2. **Create FSM**: For recognizing token patterns.
3. **Generate Lexer**: Using tools like Lex/Flex or
manually coding the FSM (a minimal hand-coded sketch follows this list).
4. **Integrate Lexer**: Combine with parser and other
components of the compiler.
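A minimal hand-coded sketch of these steps in C is shown below. It recognizes a small illustrative token set; the keyword list, token names, and use of standard input are assumptions for the example, not taken from any particular compiler.
```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Tiny hand-written lexer: reads characters from stdin and prints one token per line. */

static const char *keywords[] = { "if", "else", "while", "int", "return" };

static int is_keyword(const char *s) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

/* Scan one token; returns 0 at end of input. */
static int next_token(void) {
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n') c = getchar();   /* skip whitespace */
    if (c == EOF) return 0;

    char lexeme[64];
    int len = 0;

    if (isalpha(c) || c == '_') {                    /* identifier/keyword state */
        while ((isalnum(c) || c == '_') && len < 63) { lexeme[len++] = (char)c; c = getchar(); }
        lexeme[len] = '\0';
        if (c != EOF) ungetc(c, stdin);
        printf("%s: %s\n", is_keyword(lexeme) ? "KEYWORD" : "IDENTIFIER", lexeme);
    } else if (isdigit(c)) {                         /* number state */
        while (isdigit(c) && len < 63) { lexeme[len++] = (char)c; c = getchar(); }
        lexeme[len] = '\0';
        if (c != EOF) ungetc(c, stdin);
        printf("NUMBER: %s\n", lexeme);
    } else if (strchr("+-*/=<>", c)) {
        printf("OPERATOR: %c\n", c);
    } else if (strchr(";,(){}", c)) {
        printf("PUNCTUATION: %c\n", c);
    } else {
        printf("UNKNOWN: %c\n", c);
    }
    return 1;
}

int main(void) {
    /* e.g., the input "int x = 10;" yields KEYWORD, IDENTIFIER, OPERATOR, NUMBER, PUNCTUATION */
    while (next_token())
        ;
    return 0;
}
```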
By understanding these key points, one can grasp the
essential role and functioning of lexical analyzers in the
compilation process.
### Lexical Analyzer Generator - Short Notes
#### Definition
A **Lexical Analyzer Generator** is a tool that
automatically generates a lexical analyzer (lexer) from a
set of specified token patterns. The most common
examples of these tools are **Lex** and **Flex**.
#### Purpose
The primary purpose is to simplify the process of
creating a lexer by allowing the programmer to define
patterns using regular expressions, which the tool then
converts into a working lexer.
#### Key Concepts
1. **Tokens**: Basic elements recognized by the lexer
(keywords, identifiers, literals, operators).
2. **Patterns**: Regular expressions that define how
tokens are recognized.
3. **Actions**: Code snippets executed when a pattern
is matched.
#### Workflow
1. **Define Token Patterns**: Use regular expressions
to specify patterns for tokens in a source file.
2. **Generate Lexer**: Run the lexical analyzer
generator tool to produce the lexer code.
3. **Integrate Lexer**: Use the generated lexer in your
compiler or interpreter to process source code.
#### Example: Using Lex/Flex
1. **Lex/Flex Input File Structure**:
- **Definitions**: Declare variables and include files.
- **Rules**: Define patterns and corresponding actions.
- **User Code**: Additional C/C++ code to be
included.
2. **Sample Lex/Flex Input File** (`example.l`):
```lex
%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("NUMBER: %s\n", yytext); }
[a-zA-Z]+ { printf("WORD: %s\n", yytext); }
[ \t\n] { /* Ignore whitespace */ }
. { printf("UNKNOWN: %s\n", yytext); }
%%
int main() {
    yylex();
    return 0;
}
```
3. **Generating the Lexer**:
- Run the command: `flex example.l`
- This generates `lex.yy.c`, the C source file for the
lexer.
4. **Compiling the Lexer**:
- Compile the generated C file: `gcc lex.yy.c -o lexer -lfl`
- This produces an executable `lexer`.
5. **Running the Lexer**:
- Execute the lexer: `./lexer`
- Input text through standard input, and the lexer will
tokenize it based on the defined patterns.
#### Benefits
- **Automation**: Automates the tedious process of
writing a lexer manually.
- **Efficiency**: Generates efficient code for pattern
matching.
- **Simplicity**: Simplifies the definition of complex
token patterns using regular expressions.
#### Common Tools
- **Lex**: Traditional lexical analyzer generator for
Unix-based systems.
- **Flex**: An improved version of Lex, providing
faster and more powerful lexing capabilities.
#### Summary
Lexical analyzer generators like Lex and Flex are
powerful tools that streamline the creation of lexical
analyzers. By defining token patterns using regular
expressions and actions for each pattern, these tools
generate efficient lexers that can be easily integrated into
compilers or interpreters.
Ambiguous Grammar
Definition
An ambiguous grammar is a type of grammar that can generate the same string in a
language in multiple ways, resulting in multiple parse trees or derivations.
Example
Consider the grammar:
1. E → E + E
2. E → E * E
3. E → ( E )
4. E → id
For the input string id + id * id, the grammar can produce multiple parse trees:
Parse tree 1: E → E + E → id + (id * id)
Parse tree 2: E → E * E → (id + id) * id
Issues
Ambiguity makes it difficult to create parsers that correctly interpret the language.
Parsing ambiguity can lead to errors in compiler design and program interpretation.
BNF notation
BNF Notation
BNF stands for Backus-Naur Form. It is used to write a formal representation of a context-free
grammar. It is also used to describe the syntax of a programming language.
BNF notation is basically just a variant of a context-free grammar.
In BNF, productions have the form:
1. leftside → definition
Where leftside ∈ (Vn ∪ Vt)+ and definition ∈ (Vn ∪ Vt)*. In BNF, the leftside contains only one non-
terminal.
We can define several productions with the same leftside. The alternatives are separated by
the vertical bar symbol "|".
For example, a grammar may contain the following productions:
1. S → aSa
2. S → bSb
3. S→c
In BNF, we can represent above grammar as follows:
1. S → aSa| bSb| c
YACC
o YACC stands for Yet Another Compiler Compiler.
o YACC provides a tool to produce a parser for a given grammar.
o YACC is a program designed to compile an LALR(1) grammar.
o It is used to produce the source code of the syntactic analyzer of the language described
by the LALR(1) grammar.
o The input of YACC is the rule or grammar and the output is a C program.
These are some points about YACC:
Input: A CFG- file.y
Output: A parser y.tab.c (yacc)
o The output file "file.output" contains the parsing tables.
o The file "file.tab.h" contains declarations.
o The parser is invoked by calling yyparse().
o The parser expects a function called yylex() to supply tokens.
The basic operational sequence is as follows:
1. gram.y: a file containing the desired grammar in YACC format.
2. YACC: the YACC program processes the grammar file.
3. y.tab.c: the C source program created by YACC.
4. C compiler: compiles y.tab.c.
5. The executable file (e.g., a.out) that will parse the grammar given in gram.y.
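As an illustrative sketch (not a complete translator), a minimal YACC input file for a small expression grammar might look like the following. The token name NUMBER and the actions are assumptions for the example, and a companion yylex() returning NUMBER tokens and single-character operators would normally be generated with Lex.
```yacc
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
%}

%token NUMBER
%left '+' '-'
%left '*' '/'

%%
input : /* empty */
      | input expr '\n'   { printf("= %d\n", $2); }
      ;

expr  : NUMBER            { $$ = $1; }
      | expr '+' expr     { $$ = $1 + $3; }
      | expr '*' expr     { $$ = $1 * $3; }
      | '(' expr ')'      { $$ = $2; }
      ;
%%

int main(void) { return yyparse(); }
```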
LEX
o Lex is a program that generates a lexical analyzer. It is used with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
o Lex reads the input specification and produces, as output, C source code that implements
the lexical analyzer.
The function of Lex is as follows:
o First, a program lex.l is written in the Lex language. The Lex compiler then processes the
lex.l program and produces a C program lex.yy.c.
o Finally, the C compiler compiles the lex.yy.c program and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
Lex file format
A Lex program is separated into three sections by %% delimiters. The format of a Lex source
file is as follows:
1. { definitions }
2. %%
3. { rules }
4. %%
5. { user subroutines }
Definitions include declarations of constant, variable and regular definitions.
Rules define statements of the form p1 {action1} p2 {action2} .... pn {actionn}.
Where pi describes a regular expression and actioni describes what action the
lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. The subroutines can be compiled
separately and loaded with the lexical analyzer.
Run-Time Environment
Definition
The run-time environment is the part of a system that provides the necessary support
for executing programs, including memory management, input/output operations, and
system calls.
Components
1. Memory Organization:
   1. Stack: for function calls and local variables.
   2. Heap: for dynamically allocated memory.
   3. Static Data: for global variables.
   4. Code Segment: for executable code.
2. Storage Allocation:
   1. Static Allocation: memory allocated at compile time.
   2. Dynamic Allocation: memory allocated at run time (e.g., using malloc).
3. Calling Conventions:
   1. Function Prolog/Epilog: code to set up and tear down stack frames.
   2. Parameter Passing: mechanism to pass arguments to functions.
4. Garbage Collection:
   1. Automatic memory management to reclaim memory occupied by objects no longer
in use.
Example
In C, the run-time environment supports features like:
Function Calls: Stack frames are created for each function call.
Dynamic Memory: Using functions like malloc and free to manage heap memory.
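A minimal C sketch illustrating these areas (the variable and function names are illustrative, not from any particular runtime):
```c
#include <stdio.h>
#include <stdlib.h>

int counter = 0;                       /* static data segment: global variable */

int square(int n) {                    /* each call gets its own stack frame */
    int result = n * n;                /* stack: local variable */
    return result;
}

int main(void) {
    int *buf = malloc(10 * sizeof *buf);   /* heap: dynamic allocation at run time */
    if (buf == NULL) return 1;
    buf[0] = square(5);                /* argument passed per the calling convention */
    counter++;
    printf("%d %d\n", buf[0], counter);
    free(buf);                         /* C has no garbage collector; heap memory is freed explicitly */
    return 0;
}
```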
By understanding these key components, one can effectively work with lexical
analyzers, parsers, and run-time environments in compiler design and language
processing.
Parsers
Introduction
A **parser** is a component of a compiler or interpreter that
takes input in the form of tokens and constructs a parse tree,
representing the syntactic structure of the source code.
#### Types of Parsers
1. **Top-Down Parsers**: Start from the root and proceed
towards the leaves.
- Examples: LL parser, Recursive Descent parser
2. **Bottom-Up Parsers**: Start from the leaves and proceed
towards the root.
- Examples: LR parser, Shift-Reduce parser
### Derivations: LMD, RMD, and Parse Tree
1. **Leftmost Derivation (LMD)**: Replaces the leftmost
nonterminal at each step.
2. **Rightmost Derivation (RMD)**: Replaces the rightmost
nonterminal at each step.
3. **Parse Tree**: A tree representation of the syntactic
structure of the input, showing how the start symbol of the
grammar derives the input string.
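For example, with the productions E → E + E | E * E | id (the ambiguous expression grammar used earlier), the string id + id * id can be derived as:
- LMD: E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
- RMD: E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
Both of these derivations correspond to the parse tree whose root uses E → E + E.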
### Syntax Trees
- A more abstract representation of the parse tree that omits
some intermediate nodes, focusing on the structure of
expressions and statements.
### CNF and GNF Normal Form
1. **Chomsky Normal Form (CNF)**: A grammar is in CNF if
all production rules are of the form A → BC or A → a, where A, B, and C are nonterminals
and a is a terminal.
2. **Greibach Normal Form (GNF)**: A grammar is in GNF if
all production rules are of the form A → aα, where a is a terminal and α is a (possibly empty)
sequence of nonterminals.
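For example (a small illustrative grammar), the productions S → AB, A → a, B → b are in CNF, while S → aSB | b, B → b are in GNF.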
### Recursive Grammar
- A grammar is recursive if a nonterminal can be eventually
rewritten as itself through one or more production rules.
### Left Factoring
- A technique that transforms a grammar by factoring out common prefixes in
production rules, so that a predictive parser can decide which production to apply.
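For example, the productions A → αβ1 | αβ2 share the common prefix α and are left-factored into A → αA', A' → β1 | β2. Concretely, stmt → if expr then stmt else stmt | if expr then stmt becomes stmt → if expr then stmt stmt', stmt' → else stmt | ε.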
### First and Follow Symbols
1. **First**: The set of terminals that begin the strings derivable
from a nonterminal.
2. **Follow**: The set of terminals that can appear immediately
to the right of a nonterminal in some sentential form.
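For example, for the standard expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id, the sets are:
- FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
- FIRST(E') = { +, ε }
- FIRST(T') = { *, ε }
- FOLLOW(E) = FOLLOW(E') = { ), $ }
- FOLLOW(T) = FOLLOW(T') = { +, ), $ }
- FOLLOW(F) = { +, *, ), $ }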
### LL(1) Parser and Construction of LL(1) Parsing Table
- **LL(1) Parser**: A type of top-down parser that uses one
lookahead token to decide the parsing action.
- **Parsing Table**: Constructed using the First and Follow
sets to determine which production to use for each combination
of nonterminal and lookahead symbol.
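Using the FIRST and FOLLOW sets above, the LL(1) parsing table for that expression grammar is (blank cells indicate errors):

| Nonterminal | id | + | * | ( | ) | $ |
|---|---|---|---|---|---|---|
| E | E → T E' | | | E → T E' | | |
| E' | | E' → + T E' | | | E' → ε | E' → ε |
| T | T → F T' | | | T → F T' | | |
| T' | | T' → ε | T' → * F T' | | T' → ε | T' → ε |
| F | F → id | | | F → ( E ) | | |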
### Recursive Descent Parser
- A top-down parser implemented using a set of recursive
procedures where each procedure corresponds to a nonterminal
in the grammar.
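A minimal recursive-descent sketch in C for the same LL(1) expression grammar; the single character 'i' stands in for an id token, and the whole program is an illustration rather than a production parser:
```c
#include <stdio.h>
#include <stdlib.h>

/* Grammar: E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> ( E ) | i */
static const char *input;     /* the token/character stream */

static void error(void) { printf("syntax error at '%c'\n", *input); exit(1); }
static void match(char c) { if (*input == c) input++; else error(); }

static void E(void);

static void F(void) {
    if (*input == '(') { match('('); E(); match(')'); }
    else if (*input == 'i') match('i');                 /* 'i' stands for an id token */
    else error();
}
static void Tprime(void) {
    if (*input == '*') { match('*'); F(); Tprime(); }   /* otherwise ε */
}
static void T(void) { F(); Tprime(); }
static void Eprime(void) {
    if (*input == '+') { match('+'); T(); Eprime(); }   /* otherwise ε */
}
static void E(void) { T(); Eprime(); }

int main(void) {
    input = "i+i*i";                                    /* sample input: id + id * id */
    E();
    if (*input == '\0') printf("accepted\n"); else error();
    return 0;
}
```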
### LR Parsers and Parsing Table
1. **LR Parser**: A type of bottom-up parser that reads input
from left to right and produces a rightmost derivation in reverse.
2. **Parsing Table**: Consists of action and goto tables used to
control the parsing process.
### Shift-Reduce Parsing
- A bottom-up parsing technique where shifts read the next input
token onto the stack and reduces combine elements of the stack
according to production rules.
### LR(0) Parser
- A type of LR parser with no lookahead, used for simple
grammars.
### SLR(1) Parser
- A simple LR parser that uses the FOLLOW sets of nonterminals to decide
reductions, resolving some of the conflicts that an LR(0) parser cannot.
### LR(1) Items
- Items used in LR parsing that include lookahead symbols to
provide more context during parsing.
### Canonical LR Parser
- A more powerful LR parser that uses a complete set of LR(1)
items to handle a wider range of grammars.
### LALR Parsing Tables
- **LALR (Look-Ahead LR)**: Combines states with the same
core but different lookahead sets to reduce the size of the
parsing table while maintaining accuracy.
### Operator Precedence Grammar
- A grammar where the precedence and associativity of
operators are explicitly defined to resolve conflicts.
### Syntax Directed Definitions (SDD)
- An approach that associates attributes with grammar symbols
and semantic rules with productions to specify the translation of
a language.
### Implementation of Syntax Directed Translators
- Translators that use SDDs to perform translations during
parsing, typically by extending parsing tables with semantic
actions.
### S-Attributed and L-Attributed Definitions of SDT
1. **S-Attributed Definitions**: Only synthesized attributes are
used.
2. **L-Attributed Definitions**: Both synthesized and inherited
attributes are used, with restrictions to maintain efficiency.
### Intermediate Code Generation
- Generates a platform-independent code representation during
compilation.
#### Postfix Notation
- An arithmetic expression notation where operators follow their
operands.
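For example, the infix expression a + b * c becomes a b c * + in postfix, and (a + b) * c becomes a b + c *.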
#### Parse Trees and Syntax Trees
- Represent the syntactic structure of the source code at different
abstraction levels.
#### Three Address Codes
1. **Quadruple**: A four-field structure (operator, argument1,
argument2, result) representing an intermediate code instruction.
2. **Triple**: A three-field structure (operator, argument1,
argument2) where the result is implied by the position.
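For example, for the statement a = b + c * d, the three-address code and its two representations could be written as follows (the temporary names t1, t2 are illustrative):

Three-address code:
t1 = c * d
t2 = b + t1
a = t2

Quadruples:

| # | Operator | Arg1 | Arg2 | Result |
|---|---|---|---|---|
| (0) | * | c | d | t1 |
| (1) | + | b | t1 | t2 |
| (2) | = | t2 | | a |

Triples:

| # | Operator | Arg1 | Arg2 |
|---|---|---|---|
| (0) | * | c | d |
| (1) | + | b | (0) |
| (2) | = | a | (1) |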
### Translation of Assignment Statements
- Converts high-level assignment statements into intermediate
code.
### Statements that Alter the Flow of Control
- Includes constructs like loops and conditionals, translated into
intermediate code with appropriate control flow mechanisms.
### Postfix Translation
- Direct translation of expressions into postfix notation during
parsing.
### Translation with a Top-Down Parser
- Uses a top-down approach to translate source code into
intermediate code.
### Code Generation
#### Design Issues
- Considerations include the target language, optimization goals,
and resource constraints.
#### Target Language
- The machine language or assembly language for which the
code is generated.
#### Addresses in the Target Code
- Methods for addressing memory locations, registers, and other
resources in the generated code.
#### Basic Blocks and Flow Graphs
1. **Basic Blocks**: Sequence of instructions with a single
entry point and a single exit point.
2. **Flow Graphs**: Directed graphs representing the control
flow between basic blocks.
### Optimization of Basic Blocks
- Techniques to improve the efficiency of code within basic
blocks.
### Code Optimization
#### Machine-Independent Optimizations
- General techniques applicable to any target architecture, such
as constant folding and common subexpression elimination.
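For example, constant folding replaces x = 2 * 3.14 * r with x = 6.28 * r at compile time, and common subexpression elimination rewrites
d = (a + b) * c; e = (a + b) * f;
as
t = a + b; d = t * c; e = t * f;
so that a + b is computed only once.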
#### Loop Optimization
- Techniques to improve the performance of loops, such as loop
unrolling and loop invariant code motion.
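For example, loop-invariant code motion rewrites
for (i = 0; i < n; i++) a[i] = x * y + i;
as
t = x * y; for (i = 0; i < n; i++) a[i] = t + i;
and loop unrolling turns
for (i = 0; i < 100; i++) s += a[i];
into
for (i = 0; i < 100; i += 2) { s += a[i]; s += a[i + 1]; }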
#### DAG Representation of Basic Block
- A Directed Acyclic Graph (DAG) representing expressions
within a basic block to optimize code generation.
### Global Data-Flow Analysis
- Analyzes the flow of data across the entire program to
optimize code and improve performance.
By understanding these key topics and concepts, you will be
well-prepared for your end-term exam in compiler design and
language processing.