
Chapter 3
Lexical Analysis and Lexical Analyzer Generators
The Reason Why Lexical Analysis is a Separate Phase

The lexical analyzer converts a stream of characters into a
stream of tokens.
Simplicity: conventions about “words” are often different from
conventions about “sentences”.
Efficiency: the word-identification problem has a much more
efficient solution than the sentence-identification problem.
Interaction of the Lexical Analyzer with the Parser

[Diagram: the Source Program feeds the Lexical Analyzer, which
returns a (Token, tokenval) pair to the Parser on each
GetNextToken request; both components report errors and consult
the Symbol Table.]
The Role of the Lexical Analyzer

It reads the input characters of the source program, groups
them into lexemes (a lexeme is a sequence of characters in the
source code that forms a meaningful unit and matches the
pattern defined for a particular token), and produces as output
a sequence of tokens, one for each lexeme in the source
program.
It strips out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in
the input).
Attributes of Tokens

The lexical analyzer turns the input

  y := 31 + 28*x

into the token stream

  <id, “y”> <assign, > <num, 31> <‘+’, > <num, 28> <‘*’, > <id, “x”>

Each pair consists of a token (the parser’s lookahead) and,
where needed, a token attribute tokenval.
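
To make the token/attribute pairing concrete, a scanner might
represent such pairs with a small struct. This is a minimal
sketch, not code from the slides; the names Token and TokenKind
are hypothetical.

#include <stdio.h>

typedef enum { TK_ID, TK_NUM, TK_ASSIGN, TK_PLUS, TK_MUL } TokenKind;

typedef struct {
    TokenKind kind;      /* the token class the parser looks at */
    const char *lexeme;  /* attribute for ids, e.g. "y"; NULL otherwise */
    int value;           /* attribute for numbers, e.g. 31 */
} Token;

int main(void) {
    Token t = { TK_NUM, NULL, 31 };  /* the pair <num, 31> above */
    printf("kind=%d value=%d\n", (int)t.kind, t.value);
    return 0;
}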
Tokens, Patterns, and Lexemes

A token is a classification of lexical units
– For example: id and num
Lexemes are the specific character strings that make up a token
– For example: abc and 123
Patterns are rules describing the set of lexemes belonging to a
token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
Examples of Tokens

  Token   Pattern                                  Sample lexemes
  if      the characters i, f                      if
  relop   < | <= | <> | > | >= | =                 <, <=, =
  id      letter followed by letters and digits    abc, x1
  num     non-empty sequence of digits             123, 31

Lexical Errors

It is hard for a lexical analyzer to tell, without the aid of
other components, that there is a source-code error. E.g., in

  fi (x == f(x)) …

a lexical analyzer cannot tell whether fi is a misspelling of
the keyword if or an undeclared function identifier; it simply
returns an id token. It is probably the parser that handles the
error due to the transposition of the letters.
The lexical analyzer can, however, detect an error when no
pattern for any token matches a prefix of the remaining input.
General idea of input buffering

Two pointers into the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current
lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
The string of characters between the two pointers is the
current lexeme.
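
A minimal sketch of the two-pointer scheme in C (illustrative
only; the buffer size and the helper names advance and
accept_lexeme are assumptions):

#include <string.h>

#define BUFSIZE 4096

static char buf[BUFSIZE];   /* input buffer */
static int lexemeBegin = 0; /* start of the current lexeme */
static int forward = 0;     /* scans ahead looking for a match */

/* Advance the forward pointer and return the next character. */
static char advance(void) {
    return buf[forward++];
}

/* On a match, the lexeme is buf[lexemeBegin..forward-1]:
   copy it out and restart both pointers at the next character. */
static void accept_lexeme(char *out, int outsize) {
    int len = forward - lexemeBegin;
    if (len >= outsize) len = outsize - 1;
    memcpy(out, buf + lexemeBegin, len);
    out[len] = '\0';
    lexemeBegin = forward;
}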
Specification of Patterns for Tokens: Definitions

An alphabet Σ is a finite set of symbols (characters).
A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
A language is a specific set of strings over some fixed
alphabet Σ.
Specification of Patterns for Tokens: String Operations

Terms for Parts of Strings
1. A prefix of string s is any string obtained by removing zero
or more symbols from the end of s. For example, ban, banana and
ϵ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero
or more symbols from the beginning of s. For example, nana,
banana and ϵ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any
suffix from s. For instance, banana, nan, and ϵ are substrings
of banana.
Terms for Parts of Strings

4. The proper prefixes, suffixes, and substrings of a string s
are those prefixes, suffixes, and substrings, respectively, of
s that are not ϵ and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or
more not necessarily consecutive positions of s. For example,
baan is a subsequence of banana.
Specification of Patterns for Tokens: Language Operations

Union (the set of all strings in L or M):
  L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation (the set of all strings formed by concatenating a
string from L with a string from M):
  LM = { xy | x ∈ L and y ∈ M }
Exponentiation:
  L^0 = { ε };  L^i = L^(i-1) L
Kleene closure (the set of all strings formed by concatenating
zero or more strings from L, including the empty string ε):
  L* = ∪ i=0..∞ L^i
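
A short worked example (not from the slides): let L = {a, b}
and M = {0, 1}. Then L ∪ M = {a, b, 0, 1}, LM = {a0, a1, b0,
b1}, L^2 = {aa, ab, ba, bb}, and L* = {ε, a, b, aa, ab, ba, bb,
aaa, …}.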
Specification of Patterns for Tokens: Regular Expressions

Regular expressions are a powerful notation used to define
patterns for tokens. They are built recursively using the
following rules.
Basis symbols:
– ε is a regular expression denoting the language containing
only the empty string, { ε }
– a is a regular expression denoting the language containing
only the string a, { a }
If r and s are regular expressions denoting languages L(r) and
L(s) respectively, then
– r | s is a regular expression denoting L(r) ∪ L(s)
– rs is a regular expression denoting L(r)L(s)
– r* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)
Algebraic Laws for Regular Expressions

  r | s = s | r                (| is commutative)
  r | (s | t) = (r | s) | t    (| is associative)
  r(st) = (rs)t                (concatenation is associative)
  r(s | t) = rs | rt           (concatenation distributes over |)
  (s | t)r = sr | tr
  εr = rε = r                  (ε is the identity for concatenation)
  r* = (r | ε)*                (ε is guaranteed in a closure)
  r** = r*                     (* is idempotent)

Specification of Patterns for Tokens: Regular Definitions

If Σ is an alphabet of basic symbols, then a regular definition
is a sequence of definitions of the form:

  d1 → r1
  d2 → r2
  …
  dn → rn

where:
Each di is a new symbol, not in Σ and not the same as any other
of the d's, and each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}.
Specification of Patterns for Tokens: Regular Definitions

A regular definition assigns a name to a regular expression,
providing a convenient way to define and reuse complex
patterns.
Example:

  letter → A | B | … | Z | a | b | … | z
  digit  → 0 | 1 | … | 9
  id     → letter ( letter | digit )*

Regular definitions cannot be recursive.

Specification of Patterns for Tokens: Notational Shorthand

Regular expression variants:
One or more instances:  r+ = rr* = r*r, and r* = r+ | ε
Zero or one instance:   r? = r | ε
Character classes:      [a-z] = a | b | c | … | z
                        [abc] = a | b | c
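
As a quick illustration (not from the slides): with
digit = [0-9], the pattern digit+ (\. digit+)? matches 42 and
3.14 but not .5, because digit+ requires at least one digit
before the optional fractional part, and the ? makes the
fraction as a whole optional.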
Regular Definitions and Grammars

Grammar:
  stmt → if expr then stmt
       | if expr then stmt else stmt
       | ε
  expr → term relop term
       | term
  term → id
       | num

Regular definitions:
  if    → if
  then  → then
  else  → else
  relop → < | <= | <> | > | >= | =
  id    → letter ( letter | digit )*
  num   → digit+ (. digit+)? ( E (+ | -)? digit+ )?
Regular Definitions and Grammars

Token ws is different from the other tokens in that, when we
recognize it, we do not return it to the parser, but rather
restart the lexical analysis from the character that follows
the whitespace. It is the following token that gets returned to
the parser.

  blank   → b
  tab     → ^T
  newline → ^M
  delim   → blank | tab | newline
  ws      → delim+
Tokens, Patterns & Attribute Values

[Table of tokens, patterns, and attribute values omitted in the
original.]

Coding Regular Definitions in Transition Diagrams

relop → < | <= | <> | > | >= | =

[Transition diagram for relop:
  start state 0: on <, go to state 1
    state 1: on =, go to state 2, return(relop, LE)
             on >, go to state 3, return(relop, NE)
             on other, go to state 4*, return(relop, LT)
  state 0: on =, go to state 5, return(relop, EQ)
  state 0: on >, go to state 6
    state 6: on =, go to state 7, return(relop, GE)
             on other, go to state 8*, return(relop, GT)
  (* marks states that retract the forward pointer)]

id → letter ( letter | digit )*

[Transition diagram for id:
  start state 9: on letter, go to state 10;
  state 10 loops on letter or digit; on other, go to state 11*,
  return(gettoken(), install_id())]
What Else Does the Lexical Analyzer Do?

All keywords / reserved words are matched as ids.
After the match, the symbol table or a special keyword table is
consulted.
The keyword table contains string versions of all keywords and
their associated token values:

  if     15
  then   16
  begin  17
  ...    ...

When a match is found, the token is returned along with its
symbolic value, e.g., (“then”, 16).
If a match is not found, then an ordinary id is assumed.
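
A minimal sketch in C of such a keyword-table lookup
(illustrative only; the constant ID and the function name
keyword_token are assumptions):

#include <string.h>

#define ID 300  /* hypothetical token value for ordinary identifiers */

static const struct { const char *name; int token; } keywords[] = {
    { "if", 15 }, { "then", 16 }, { "begin", 17 },
};

/* Return the keyword's token value, or ID if the lexeme is not
   a keyword and is therefore an ordinary identifier. */
int keyword_token(const char *lexeme) {
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].name) == 0)
            return keywords[i].token;
    return ID;
}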


Coding Regular Definitions in Transition Diagrams: Code

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      /* whitespace: stay in state 0 and restart the lexeme */
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;   /* other: end of identifier */
      break;
    …
} } }

int fail()
{ /* decides the next start state (diagram) to try */
  forward = lexeme_beginning;
  switch (start) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover();  break;
  default: /* compiler error */
  }
  return start;
}
The Lex and Flex Scanner Generators

Scanner generators are tools that automatically create lexical
analyzers (scanners) from user-provided rules defined using
regular expressions.
Lex is the original Unix tool, while Flex is its faster,
modern, free open-source counterpart, which generates C source
code for the scanner.
These scanners then read input streams and break them into
tokens according to the given rules.

Creating a Lexical Analyzer with Lex and Flex

  lex source program lex.l → lex (or flex)  → lex.yy.c
  lex.yy.c                 → C compiler     → a.out
  input stream             → a.out          → sequence of tokens

a.out is a working lexical analyzer that can take a stream of
input characters and produce a stream of tokens.

Lex Specification

A Lex specification consists of three parts:

  C declarations in %{ %}
  %%
  translation rules
  %%
  user-defined auxiliary procedures

The declarations section includes declarations of variables,
manifest constants (identifiers declared to stand for a
constant, e.g., the name of a token), and regular definitions.
The auxiliary procedures are functions used in actions; they
can be compiled separately and loaded with the lexical
analyzer.
The translation rules are of the form:

  pattern1 { action1 }
  pattern2 { action2 }
  …

Each pattern is a regular expression; each action is a fragment
of code, typically written in C.
Regular Expressions in Lex
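
For reference (standard Lex notation, stated here rather than
taken from the slide's table): . matches any character except
newline; * means zero or more repetitions of the preceding
expression; + one or more; ? zero or one; [ ] defines a
character class; ^ anchors a pattern at the beginning of a
line; $ at the end of a line; \ escapes a metacharacter; "…"
matches the quoted text literally; {name} expands a regular
definition; | separates alternatives.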


Predefined Functions and Variables of Lex

yyin: the input stream pointer (it points to the input file to
be scanned or tokenized); the default input of the default
main() is stdin.
yylex(): the main entry point for Lex; it reads the input
stream, generates tokens, and returns zero at the end of the
input stream. It is called to invoke the lexer (or scanner),
and each time yylex() is called the scanner continues
processing the input from where it last left off.
yytext: a buffer that holds the input characters that actually
match the pattern (i.e., the lexeme); in effect, a pointer to
the matched string.
yyleng: the length of the lexeme.
yylval: holds the token's attribute value, shared with the
parser.
yyval: a variable used internally by yacc-generated parsers.
yyout: the output stream pointer (it points to the file where
output is written); the default output of the default main() is
stdout.
yywrap(): called by Lex when the input is exhausted (at EOF);
the default yywrap() always returns 1, meaning there is no more
input.
yymore(): appends the next matched text to the current yytext
instead of replacing it.
yyless(k): retains the first k characters of yytext and pushes
the rest back onto the input.
yyparse(): the entry point of a yacc-generated parser; it calls
yylex() to obtain tokens and builds the parse tree.
Example Lex Specification 1

%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("%s\n", yytext); }
.|\n   { }
%%
main()
{ yylex();
}

yytext contains the matching lexeme; the call to yylex() in
main() invokes the lexical analyzer. To build and run:

  lex spec.l
  gcc lex.yy.c -ll
  ./a.out < spec.l
Example Lex Specification 2

%{
#include <stdio.h>
int ch = 0, wd = 0, nl = 0;
%}
delim [ \t]+
%%
\n       { ch++; wd++; nl++; }
^{delim} { ch+=yyleng; }
{delim}  { ch+=yyleng; wd++; }
.        { ch++; }
%%
main()
{ yylex();
  printf("%8d%8d%8d\n", nl, wd, ch);
}

delim [ \t]+ is a regular definition; the lines between the two
%% markers are the translation rules.
Example Lex Specification 3

%{
#include <stdio.h>
%}
digit  [0-9]
letter [A-Za-z]
id     {letter}({letter}|{digit})*
%%
{digit}+ { printf("number: %s\n", yytext); }
{id}     { printf("ident: %s\n", yytext); }
.        { printf("other: %s\n", yytext); }
%%
main()
{ yylex();
}
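
A quick sanity check (hypothetical input, not from the slides):
given the line x1 + 42, this scanner prints ident: x1 and
number: 42, plus other: lines for the + sign and the blanks,
since the catch-all . rule matches them one character at a
time. The newline matches no rule and is simply echoed by Lex's
default action.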
Example Lex Specification 4

%{ /* definitions of manifest constants */
#define LT (256)
…
%}
delim  [ \t\n]
ws     {delim}+
letter [A-Za-z]
digit  [0-9]
id     {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws}     { }
if       {return IF;}
then     {return THEN;}
else     {return ELSE;}
{id}     {yylval = install_id(); return ID;}
{number} {yylval = install_num(); return NUMBER;}
"<"      {yylval = LT; return RELOP;}
"<="     {yylval = LE; return RELOP;}
"="      {yylval = EQ; return RELOP;}
"<>"     {yylval = NE; return RELOP;}
">"      {yylval = GT; return RELOP;}
">="     {yylval = GE; return RELOP;}

Each rule returns a token to the parser; yylval carries the
token attribute. install_id() installs yytext as an identifier
in the symbol table.
Finite Automata

These are essentially graphs, like transition diagrams, with a
few differences:
They are recognizers: they simply answer “yes” or “no” about
each possible input string.
Nondeterministic finite automata (NFA) have no restrictions on
the labels of their edges. A symbol can label several edges out
of the same state, and ϵ, the empty string, is a possible
label.
Deterministic finite automata (DFA) have, for each state and
for each symbol of the input alphabet, exactly one edge with
that symbol leaving that state.
Design of a Lexical Analyzer Generator

Translate regular expressions to an NFA, then translate the NFA
to an efficient DFA (the second step is optional):

  regular expressions → NFA → DFA

The NFA can be simulated directly to recognize tokens, or the
DFA can be simulated instead.
Nondeterministic Finite Automata

An NFA consists of a finite set of states S, a set of input
symbols Σ, a transition function that maps state-symbol pairs
to sets of states, a start state s0, and a set of accepting
(final) states F.

Transition Graph

An NFA can be diagrammatically represented by a labeled
directed graph called a transition graph.

[Diagram: NFA for (a|b)*abb; state 0 has self-loops on a and b,
then 0 -a→ 1 -b→ 2 -b→ 3]

  S = {0,1,2,3}
  Σ = {a,b}
  s0 = 0
  F = {3}
Transition Table

The transition table for the NFA above:

  State | a     | b
    0   | {0,1} | {0}
    1   | ∅     | {2}
    2   | ∅     | {3}
    3   | ∅     | ∅

The Language Defined by an NFA

An NFA accepts a string x if and only if there is some path in
the transition graph from the start state to an accepting state
whose edge labels spell out x (ϵ-labels may be used freely).
The language defined by an NFA is the set of strings it
accepts.
Design of a Lexical Analyzer Generator: RE to NFA to DFA

Lex specification with regular expressions:

  p1 { action1 }
  p2 { action2 }
  …
  pn { actionn }

[Diagram: a single start state s0 has ϵ-transitions into the
NFAs N(p1), N(p2), …, N(pn); each N(pi)'s accepting state is
tagged with actioni. Subset construction then yields a DFA.]
From Regular Expression to NFA (Thompson's Construction)

[Diagrams of the construction:
  ϵ:      start i -ϵ→ f
  a:      start i -a→ f
  r1|r2:  a new start state i with ϵ-edges into N(r1) and
          N(r2), and ϵ-edges from their accepting states to a
          new final state f
  r1r2:   N(r1) followed in series by N(r2)
  r*:     a new start i and final f, with ϵ-edges i→f (bypass),
          i into N(r), N(r) out to f, and N(r) back to its own
          start (loop)]
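
A short worked example (not on the slides): to build an NFA for
a*b, first construct N(a) as i -a→ f, wrap it in the r* gadget
(new start and final states, ϵ-edges for bypass and loop), then
put N(b) in series after it. The result accepts b, ab, aab, …,
with ϵ-moves gluing the pieces together, which is why Thompson
NFAs are easy to build but benefit from later conversion to a
DFA.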
Combining the NFAs of a Set of Regular Expressions

  a    { action1 }
  abb  { action2 }
  a*b+ { action3 }

[Component NFAs:
  a:    1 -a→ 2
  abb:  3 -a→ 4 -b→ 5 -b→ 6
  a*b+: 7 -b→ 8, with a self-loop on 7 labeled a and a
        self-loop on 8 labeled b]

[Combined NFA: a new start state 0 with ϵ-edges to states 1, 3,
and 7]
Simulating the Combined NFA: Example 1

[Combined NFA as above, with accepting states 2 (action1),
6 (action2), and 8 (action3)]

On input a a b a, the sets of NFA states are:

  {0,1,3,7} -a→ {2,4,7} -a→ {7} -b→ {8} -a→ none

Must find the longest match: continue until no further moves
are possible. The last nonempty set {8} contains an accepting
state, so action3 is executed.

Simulating the Combined NFA: Example 2

On input a b b a, the sets of NFA states are:

  {0,1,3,7} -a→ {2,4,7} -b→ {5,8} -b→ {6,8} -a→ none

The last nonempty set {6,8} contains the accepting states 6
(action2) and 8 (action3). When two or more accepting states
are reached, the first action given in the Lex specification is
executed, so action2 is executed here.

Deterministic Finite Automata

A deterministic finite automaton is a special case of an NFA:
– No state has an ϵ-transition
– For each state s and input symbol a there is at most one edge
labeled a leaving s
Each entry in the transition table is a single state
– At most one path exists to accept a string
– The simulation algorithm is simple
Example DFA

A DFA that accepts (a|b)*abb:

[Diagram: start 0 -a→ 1 -b→ 2 -b→ 3 (accepting);
 0 -b→ 0; 1 -a→ 1; 2 -a→ 1; 3 -a→ 1; 3 -b→ 0]
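
This DFA is small enough to drive directly from its transition
table. The following is an illustrative sketch, not from the
slides; the function name accepts is hypothetical.

#include <stdio.h>

/* Transition table for the DFA accepting (a|b)*abb.
   Rows are states 0..3; column 0 is input a, column 1 is b. */
static const int delta[4][2] = {
    /* state 0 */ { 1, 0 },
    /* state 1 */ { 1, 2 },
    /* state 2 */ { 1, 3 },
    /* state 3 */ { 1, 0 },
};

/* Return 1 if the string of a's and b's ends in state 3. */
int accepts(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0; /* not in Σ */
        state = delta[state][*s == 'b'];
    }
    return state == 3;
}

int main(void) {
    printf("%d %d\n", accepts("abb"), accepts("aabbb")); /* 1 0 */
    return 0;
}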
Conversion of an NFA into a DFA

The subset construction algorithm converts an NFA into a DFA
using:

  ϵ-closure(s) = {s} ∪ {t | s can reach t via ϵ-edges}
  ϵ-closure(T) = ∪ (for s in T) ϵ-closure(s)
  move(T,a) = {t | s -a→ t for some s ∈ T}

The algorithm produces:
Dstates, the set of states of the new DFA, consisting of sets
of states of the NFA
Dtran, the transition table of the new DFA
ϵ-closure and move Examples

[Using the combined NFA for a, abb, a*b+:]

  ϵ-closure({0}) = {0,1,3,7}
  move({0,1,3,7},a) = {2,4,7}
  ϵ-closure({2,4,7}) = {2,4,7}
  move({2,4,7},a) = {7}
  ϵ-closure({7}) = {7}
  move({7},b) = {8}
  ϵ-closure({8}) = {8}
  move({8},a) = ∅

So on input a a b a the simulation visits {0,1,3,7}, {2,4,7},
{7}, {8}, and then fails. These functions are also used to
simulate NFAs.
Simulating an NFA using ϵ-closure and move

  S := ϵ-closure({s0})
  Sprev := ∅
  a := nextchar()
  while S ≠ ∅ do
    Sprev := S
    S := ϵ-closure(move(S,a))
    a := nextchar()
  end do
  if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return “yes”
  else return “no”
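
One compact way to run this algorithm is to encode state sets
as bitmasks. The sketch below is illustrative only; it
hard-codes the combined NFA for a, abb, a*b+ from the earlier
slides, and the names Set, eps_closure, and move are
assumptions.

#include <stdio.h>

enum { NSTATES = 9 };
typedef unsigned short Set;            /* bit i = NFA state i */

/* ϵ-edges and labeled edges of the combined NFA */
static const Set eps[NSTATES]  = { 1<<1 | 1<<3 | 1<<7, 0,0,0,0,0,0,0,0 };
static const Set on_a[NSTATES] = { 0, 1<<2, 0, 1<<4, 0, 0, 0, 1<<7, 0 };
static const Set on_b[NSTATES] = { 0, 0, 0, 0, 1<<5, 1<<6, 0, 1<<8, 1<<8 };

static Set eps_closure(Set T) {
    Set R = T, old;
    do {                               /* iterate until stable */
        old = R;
        for (int s = 0; s < NSTATES; s++)
            if (R & (1 << s)) R |= eps[s];
    } while (R != old);
    return R;
}

static Set move(Set T, char a) {
    const Set *edge = (a == 'a') ? on_a : on_b;
    Set R = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1 << s)) R |= edge[s];
    return R;
}

int main(void) {
    const Set accepting = 1<<2 | 1<<6 | 1<<8;  /* states 2, 6, 8 */
    Set S = eps_closure(1 << 0);               /* {0,1,3,7} */
    Set last = S;
    for (const char *p = "abb"; *p && S; p++) {
        S = eps_closure(move(S, *p));
        if (S) last = S;                       /* last nonempty set */
    }
    printf(last & accepting ? "yes\n" : "no\n"); /* abb -> yes */
    return 0;
}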
The Subset Construction Algorithm

  initially, ϵ-closure(s0) is the only state in Dstates, and it
  is unmarked
  while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a do
      U := ϵ-closure(move(T,a))
      if U is not in Dstates then
        add U as an unmarked state to Dstates
      end if
      Dtran[T,a] := U
    end do
  end do
Subset Construction Example 1

[NFA for (a|b)*abb built by Thompson's construction, with
states 0-10: 0 -ϵ→ 1 and 7; 1 -ϵ→ 2 and 4; 2 -a→ 3; 4 -b→ 5;
3 and 5 -ϵ→ 6; 6 -ϵ→ 1 and 7; 7 -a→ 8 -b→ 9 -b→ 10]

Dstates:
  A = {0,1,2,4,7}
  B = {1,2,3,4,6,7,8}
  C = {1,2,4,5,6,7}
  D = {1,2,4,5,6,7,9}
  E = {1,2,4,5,6,7,10}

[Resulting DFA: start A; A -a→ B, A -b→ C; B -a→ B, B -b→ D;
C -a→ B, C -b→ C; D -a→ B, D -b→ E; E -a→ B, E -b→ C;
E is the accepting state]
Subset Construction Example 2

[Combined NFA for a (action1), abb (action2), a*b+ (action3),
as on the earlier slides]

Dstates:
  A = {0,1,3,7}
  B = {2,4,7}
  C = {8}
  D = {7}
  E = {5,8}
  F = {6,8}

[Resulting DFA: start A; A -a→ B, A -b→ C; B -a→ D, B -b→ E;
C -b→ C; D -a→ D, D -b→ C; E -b→ F; F -b→ C.
Accepting: B (action1), C and E (action3), and F (action2,
since action2 precedes action3 in the specification)]
Minimizing the Number of States of a DFA

[Example: in the DFA for (a|b)*abb from Example 1, states A and
C behave identically and can be merged. Minimized DFA:
start AC; AC -a→ B, AC -b→ AC; B -a→ B, B -b→ D;
D -a→ B, D -b→ E; E -a→ B, E -b→ AC; E is the accepting state]
From Regular Expression to DFA Directly

The “important states” of an NFA are those with a non-ϵ
out-transition: if move({s},a) ≠ ∅ for some symbol a, then s is
an important state.
The subset construction algorithm uses only the important
states when it determines ϵ-closure(move(T,a)).

From Regular Expression to


DFA Directly (Algorithm)
Augment the regular expression r with
a special end symbol # to make
accepting states important: the new
expression is r#
Construct a syntax tree for r#
Traverse the tree to construct
functions nullable, firstpos, lastpos,
and followpos
59

From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

[Syntax tree of (a|b)*abb#: a chain of concatenation nodes
whose rightmost leaves, bottom to top, are a, b, b, and #; the
leftmost subtree is a closure (*) node over an alternation (|)
of the leaves a and b. Position numbers for the leaves: a=1,
b=2 (under the |), a=3, b=4, b=5, #=6.]
From Regular Expression to DFA Directly: Annotating the Tree

nullable(n): true if the subtree at node n generates a language
including the empty string
firstpos(n): the set of positions that can match the first
symbol of a string generated by the subtree at node n
lastpos(n): the set of positions that can match the last symbol
of a string generated by the subtree at node n
followpos(i): the set of positions that can follow position i
in the tree
From Regular Expression to DFA Directly: Annotating the Tree

  Node n            nullable(n)       firstpos(n)            lastpos(n)
  Leaf ε            true              ∅                      ∅
  Leaf i            false             {i}                    {i}
  | node (children  nullable(c1) or   firstpos(c1) ∪         lastpos(c1) ∪
  c1, c2)           nullable(c2)      firstpos(c2)           lastpos(c2)
  • node (children  nullable(c1) and  if nullable(c1) then   if nullable(c2) then
  c1, c2)           nullable(c2)      firstpos(c1) ∪         lastpos(c1) ∪
                                      firstpos(c2)           lastpos(c2)
                                      else firstpos(c1)      else lastpos(c2)
  * node (child c1) true              firstpos(c1)           lastpos(c1)
From Regular Expression to DFA Directly: Syntax Tree of (a|b)*abb#

[Annotated tree, with firstpos to the left and lastpos to the
right of each node:

  {1,2,3} • {6}
    {1,2,3} • {5}        {6} # {6}  (position 6)
      {1,2,3} • {4}      {5} b {5}  (position 5)
        {1,2,3} • {3}    {4} b {4}  (position 4)
          {1,2} * {1,2} (nullable)  {3} a {3}  (position 3)
            {1,2} | {1,2}
              {1} a {1}  (position 1)
              {2} b {2}  (position 2)  ]
From Regular Expression to DFA Directly: followpos

  for each node n in the tree do
    if n is a cat-node with left child c1 and right child c2 then
      for each i in lastpos(c1) do
        followpos(i) := followpos(i) ∪ firstpos(c2)
      end do
    else if n is a star-node then
      for each i in lastpos(n) do
        followpos(i) := followpos(i) ∪ firstpos(n)
      end do
    end if
  end do
From Regular Expression to DFA Directly: Algorithm

  s0 := firstpos(root), where root is the root of the syntax tree
  Dstates := {s0}, with s0 unmarked
  while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a do
      let U be the union of followpos(p) for all positions p in
      T such that the symbol at position p is a
      if U is not empty and not in Dstates then
        add U as an unmarked state to Dstates
      end if
      Dtran[T,a] := U
    end do
  end do
From Regular Expression to DFA Directly: Example

  Node  followpos
  1     {1, 2, 3}
  2     {1, 2, 3}
  3     {4}
  4     {5}
  5     {6}
  6     ∅

[Resulting DFA: start {1,2,3}; every state goes to {1,2,3,4}
on a; {1,2,3} -b→ {1,2,3}; {1,2,3,4} -b→ {1,2,3,5};
{1,2,3,5} -b→ {1,2,3,6}; {1,2,3,6} -b→ {1,2,3};
{1,2,3,6} is the accepting state]

Time-Space Tradeoffs

  Automaton  Space (worst case)  Time (worst case)
  NFA        O(|r|)              O(|r| × |x|)
  DFA        O(2^|r|)            O(|x|)
