Lexical Analysis and
Design of Lexical Analyzer
Lexical Analysis
• Input is scanned completely to identify the tokens
• Tokens (Logical unit)
– Identifier, Keywords, operators etc.
Specification of Tokens
– Strings and Languages
• A finite sequence of symbols is called a string
• A set of strings over some alphabet is called a language
– Operations on Languages
• Concatenation:
– L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union
– L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Kleene Closure
– L* = ∪ i≥0 L^i
• Positive Closure
– L+ = ∪ i≥1 L^i
– Regular Expressions
Regular Expression
• Notation for representing Tokens
• Ex: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*
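The id pattern above can be checked by hand with a few lines of C. This is a minimal sketch, not a generated scanner: the function name is_identifier is hypothetical, and it assumes the default "C" locale, where isalpha accepts exactly A-Z and a-z.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the whole string s matches
   letter ( letter | digit )*, and 0 otherwise. */
int is_identifier(const char *s)
{
    assert(s != NULL);
    /* The first symbol must be a letter. */
    if (!isalpha((unsigned char)s[0]))
        return 0;
    /* The remaining symbols: zero or more letters or digits. */
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}
```

For instance, is_identifier("x1") returns 1, while is_identifier("1x") and is_identifier("") return 0.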
The Reason Why Lexical
Analysis is a Separate Phase
• Simplifies the design of the compiler
– Without a separate lexer, LL(1) or LR(1) parsing with one
token of lookahead would not be possible, since a single token
can span several characters
• Provides efficient implementation
– Systematic techniques to implement lexical analyzers
by hand or automatically from specifications
– Stream buffering methods to scan input
• Improves portability
– Non-standard symbols and alternate character
encodings can be normalized (e.g. trigraphs)
Interaction of the Lexical
Analyzer with the Parser
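[Diagram]
Source program → Lexical analyzer → (token, tokenval) → Parser
Parser → "get next token" → Lexical analyzer
Both the lexical analyzer and the parser report errors
Both the lexical analyzer and the parser access the symbol table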
Attributes of Tokens
y := 31 + 28*x → Lexical analyzer →
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
→ Parser
Each pair is <token, tokenval>, where tokenval is the token attribute
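A hand-written scanner for just the tokens in this example could look like the sketch below. The enum names, the tokinfo struct, and the function next_token are illustrative assumptions, not part of the slides; the attribute is stored as an integer for num and as the lexeme text for id.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token classes for the example; the names are illustrative assumptions. */
enum token { TOK_ID, TOK_ASSIGN, TOK_NUM, TOK_PLUS, TOK_STAR, TOK_ERROR, TOK_EOF };

struct tokinfo {
    enum token token;  /* the token class */
    int numval;        /* attribute (tokenval) when token is TOK_NUM */
    char lexeme[32];   /* attribute (tokenval) when token is TOK_ID */
};

/* Scans one token starting at *pp and advances *pp past it. */
struct tokinfo next_token(const char **pp)
{
    struct tokinfo t = { TOK_EOF, 0, "" };
    assert(pp != NULL && *pp != NULL);
    const char *p = *pp;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                                   /* skip white space */
    if (*p == '\0') {
        t.token = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {   /* id: letter (letter | digit)* */
        size_t n = 0;
        while (isalnum((unsigned char)*p)) {
            if (n + 1 < sizeof t.lexeme)
                t.lexeme[n++] = *p;            /* overlong names are truncated */
            p++;
        }
        t.lexeme[n] = '\0';
        t.token = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {   /* num: digit+ */
        while (isdigit((unsigned char)*p))
            t.numval = t.numval * 10 + (*p++ - '0');
        t.token = TOK_NUM;
    } else if (p[0] == ':' && p[1] == '=') {
        t.token = TOK_ASSIGN;
        p += 2;
    } else if (*p == '+') {
        t.token = TOK_PLUS;
        p++;
    } else if (*p == '*') {
        t.token = TOK_STAR;
        p++;
    } else {
        t.token = TOK_ERROR;                   /* unknown character */
        p++;
    }
    *pp = p;
    return t;
}
```

Calling next_token repeatedly on "y := 31 + 28*x" yields id, assign, num, +, num, *, id and then the end-of-input marker, mirroring the "get next token" request issued by the parser.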
Tokens, Patterns, and Lexemes
• A token is a classification of lexical units
– For example: id and num
• Lexemes are the specific character strings that
make up a token
– For example: abc and 123
• Patterns are rules describing the set of lexemes
belonging to a token
– For example: “letter followed by letters and digits” and
“non-empty sequence of digits”
Specification of Patterns for
Tokens: Definitions
• An alphabet Σ is a finite set of symbols
(characters)
• A string s is a finite sequence of symbols
from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over
some fixed alphabet Σ
Specification of Patterns for
Tokens: String Operations
• The concatenation of two strings x and y is
denoted by xy
• The exponentiation of a string s is defined
by
s^0 = ε
s^i = s^(i-1) s for i > 0
note that sε = εs = s
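String exponentiation amounts to repeated concatenation, which the following sketch makes concrete. The function name str_power is hypothetical; it returns heap memory that the caller must free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated string holding s^i,
   that is, s concatenated with itself i times. s^0 is the empty string.
   Returns NULL if allocation fails; the caller frees the result. */
char *str_power(const char *s, size_t i)
{
    assert(s != NULL);
    size_t len = strlen(s);
    char *out = malloc(len * i + 1);
    if (out == NULL)
        return NULL;
    for (size_t k = 0; k < i; k++)
        memcpy(out + k * len, s, len);   /* append one more copy of s */
    out[len * i] = '\0';                 /* with i = 0 this leaves the empty string */
    return out;
}
```

For example, str_power("ab", 3) produces "ababab" and str_power("ab", 0) produces the empty string.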
Specification of Patterns for
Tokens: Language Operations
• Union
L ∪ M = {s | s ∈ L or s ∈ M}
• Concatenation
LM = {xy | x ∈ L and y ∈ M}
• Exponentiation
L^0 = {ε}; L^i = L^(i-1) L
• Kleene closure
L* = ∪ i=0,…,∞ L^i
• Positive closure
L+ = ∪ i=1,…,∞ L^i
Specification of Patterns for
Tokens: Regular Expressions
• Basis symbols:
– ε is a regular expression denoting language {ε}
– a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting
languages L(r) and M(s) respectively, then
– r|s is a regular expression denoting L(r) ∪ M(s)
– rs is a regular expression denoting L(r)M(s)
– r* is a regular expression denoting L(r)*
– (r) is a regular expression denoting L(r)
• A language defined by a regular expression is
called a regular set
Specification of Patterns for
Tokens: Regular Definitions
• Regular definitions introduce a naming
convention:
d1 → r1
d2 → r2
…
dn → rn
where each ri is a regular expression over
Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted in ri to
obtain an equivalent set of definitions
Specification of Patterns for
Tokens: Regular Definitions
• Example:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
• Regular definitions are not recursive:
digits → digit digits | digit   wrong!
Specification of Patterns for
Tokens: Notational Shorthand
• The following shorthands are often used:
r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
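The num definition has three parts: a mandatory digit sequence, an optional fraction, and an optional exponent. The sketch below follows that structure directly. The names digits and is_num are hypothetical helpers, and the function matches a whole string rather than a prefix of an input stream.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Consumes digit+ starting at s. Returns a pointer just past the digits,
   or NULL if s does not start with a digit. */
static const char *digits(const char *s)
{
    if (!isdigit((unsigned char)*s))
        return NULL;
    while (isdigit((unsigned char)*s))
        s++;
    return s;
}

/* Returns 1 if the whole string s matches
   digit+ (. digit+)? ( E (+|-)? digit+ )?  and 0 otherwise. */
int is_num(const char *s)
{
    assert(s != NULL);
    const char *p = digits(s);          /* mandatory integer part */
    if (p == NULL)
        return 0;
    if (*p == '.') {                    /* optional fraction */
        p = digits(p + 1);
        if (p == NULL)
            return 0;
    }
    if (*p == 'E') {                    /* optional exponent */
        p++;
        if (*p == '+' || *p == '-')
            p++;
        p = digits(p);
        if (p == NULL)
            return 0;
    }
    return *p == '\0';                  /* nothing may follow */
}
```

Under this sketch "31", "3.14" and "6.02E+23" are accepted, while "12.", ".5" and "1E" are rejected because a digit+ part is missing.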
Regular Definitions and
Grammars
Grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num
Regular definitions
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
Coding Regular Definitions in
Transition Diagrams
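relop → < | <= | <> | > | >= | =
State   Input   Next state
0       <       1
0       =       5   return(relop, EQ)
0       >       6
1       =       2   return(relop, LE)
1       >       3   return(relop, NE)
1       other   4*  return(relop, LT)
6       =       7   return(relop, GE)
6       other   8*  return(relop, GT)
(start state is 0; * marks an accepting state that retracts the input by one character)
id → letter ( letter | digit )*
State   Input             Next state
9       letter            10
10      letter or digit   10
10      other             11*  return(gettoken(), install_id())
(start state is 9)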
Coding Regular Definitions in
Transition Diagrams: Code
token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1:
      …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …

/* Decides the next start state to check */
int fail()
{ forward = token_beginning;
  switch (start) {
  case 0: start = 9; break;
  case 9: start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */ break;
  }
  return start;
}
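The relop part of the transition diagram can be coded as a complete, self-contained function. The sketch below is one possible rendering, not the code from the slides: the name scan_relop and the NONE value are assumptions, and the input is a NUL-terminated string instead of the buffered nextchar() interface. Retracting in the "other" states is modeled by reporting a lexeme length one less than the number of characters read.

```c
#include <assert.h>
#include <stddef.h>

/* Attribute values for relop. NONE (an assumption of this sketch) means
   that no relational operator starts at the input position. */
enum relop { NONE, LT, LE, EQ, NE, GT, GE };

/* Follows states 0-8 of the relop transition diagram on string s.
   Stores the length of the matched lexeme in *len. */
enum relop scan_relop(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    int state = 0;
    size_t forward = 0;                 /* number of characters read so far */
    for (;;) {
        char c;
        switch (state) {
        case 0:
            c = s[forward++];
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NONE; }
            break;
        case 1:
            c = s[forward++];
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: *len = forward; return LE;
        case 3: *len = forward; return NE;
        case 4: *len = forward - 1; return LT;   /* retract one character */
        case 5: *len = forward; return EQ;
        case 6:
            c = s[forward++];
            state = (c == '=') ? 7 : 8;
            break;
        case 7: *len = forward; return GE;
        case 8: *len = forward - 1; return GT;   /* retract one character */
        }
    }
}
```

On the input "<a" the function reads two characters, reaches state 4, and reports LT with a lexeme length of 1, so the character a remains available for the next token.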
The Lex and Flex Scanner
Generators
• Lex and its newer cousin flex are scanner
generators
• Systematically translate regular definitions
into C source code for efficient scanning
• Generated code is easy to integrate in C
applications
Creating a Lexical Analyzer with
Lex and Flex
lex source program lex.l → lex or flex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
Design of a Lexical Analyzer
Generator
• Translate regular expressions to NFA
• Translate NFA to an efficient DFA
regular expressions → NFA → DFA (the NFA to DFA step is optional)
NFA: simulate the NFA to recognize tokens
DFA: simulate the DFA to recognize tokens
Nondeterministic Finite
Automata
• An NFA is a 5-tuple (S, Σ, δ, s0, F) where
S is a finite set of states
Σ is a finite set of symbols, the alphabet
δ is a mapping from S × Σ to a set of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states
Transition Graph
• An NFA can be diagrammatically
represented by a labeled directed graph
called a transition graph
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}
Edges: 0 --a--> 0, 0 --b--> 0, 0 --a--> 1, 1 --b--> 2, 2 --b--> 3
Transition Table
• The mapping δ of an NFA can be
represented in a transition table
δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
State   Input a   Input b
0       {0, 1}    {0}
1                 {2}
2                 {3}
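A transition table like this maps naturally onto a two-dimensional array. The sketch below stores each set of states as a bitmask (bit i set means state i is in the set) and decides acceptance by tracking every state the NFA could be in. The names delta and nfa_accepts are assumptions of the sketch, as is the choice of state 3 as the only accepting state, taken from the example NFA with F = {3}.

```c
#include <assert.h>
#include <stddef.h>

enum { NSTATES = 4, NSYMS = 2 };

/* delta[state][symbol]: symbol 0 is a, symbol 1 is b; 0 is the empty set. */
static const unsigned delta[NSTATES][NSYMS] = {
    { 1u << 0 | 1u << 1, 1u << 0 },   /* state 0: a -> {0,1}, b -> {0} */
    { 0,                 1u << 2 },   /* state 1: b -> {2} */
    { 0,                 1u << 3 },   /* state 2: b -> {3} */
    { 0,                 0       },   /* state 3: no moves */
};

/* Returns 1 if some path labeled x leads from state 0 to state 3. */
int nfa_accepts(const char *x)
{
    assert(x != NULL);
    unsigned cur = 1u << 0;                  /* start in {0} */
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b')
            return 0;                        /* symbol outside the alphabet */
        int sym = (*x == 'b');
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (cur & (1u << s))
                next |= delta[s][sym];       /* union of all possible moves */
        cur = next;
    }
    return (cur & (1u << 3)) != 0;           /* accept if state 3 is reachable */
}
```

With this table, "abb" and "babb" are accepted, while "ab" and "abba" are not.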
The Language Defined by an
NFA
• An NFA accepts an input string x if and only if
there is some path with edges labeled with
symbols from x in sequence from the start state to
some accepting state in the transition graph
• A state transition from one state to another on the
path is called a move
• The language defined by an NFA is the set of
input strings it accepts, such as (a|b)*abb for the
example NFA
Design of a Lexical Analyzer
Generator: RE to NFA to DFA
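Lex specification with regular expressions:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
NFA: a new start state s0 has an ε-transition to each of
N(p1), N(p2), …, N(pn); the accepting state of N(pi) is
associated with actioni
Subset construction then converts the NFA into a DFA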
From Regular Expression to NFA
(Thompson’s Construction)
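ε:      start → i --ε--> f
a:      start → i --a--> f
r1|r2:  start → i --ε--> N(r1) --ε--> f
                i --ε--> N(r2) --ε--> f
r1r2:   start → i, N(r1), N(r2), f in sequence
r*:     start → i --ε--> N(r) --ε--> f,
        plus an ε-edge from i to f and an ε-edge from the end of N(r) back to its start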
Combining the NFAs of a Set of
Regular Expressions
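a { action1 }:     start → 1 --a--> 2
abb { action2 }:   start → 3 --a--> 4 --b--> 5 --b--> 6
a*b+ { action3 }:  start → 7 --b--> 8, with 7 --a--> 7 and 8 --b--> 8
Combined NFA: a new start state 0 with ε-edges to 1, 3 and 7:
0 --ε--> 1 --a--> 2
0 --ε--> 3 --a--> 4 --b--> 5 --b--> 6
0 --ε--> 7 --b--> 8, with 7 --a--> 7 and 8 --b--> 8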
Simulating the Combined NFA
Example 1
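Combined NFA:
0 --ε--> 1 --a--> 2 (action1)
0 --ε--> 3 --a--> 4 --b--> 5 --b--> 6 (action2)
0 --ε--> 7 --b--> 8 (action3), with 7 --a--> 7 and 8 --b--> 8
Input: a a b a
{0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --a--> none
Result: state 8 is accepting, so action3 is executed
Must find the longest match:
Continue until no further moves are possible
When last state is accepting: execute action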
Simulating the Combined NFA
Example 2
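Combined NFA:
0 --ε--> 1 --a--> 2 (action1)
0 --ε--> 3 --a--> 4 --b--> 5 --b--> 6 (action2)
0 --ε--> 7 --b--> 8 (action3), with 7 --a--> 7 and 8 --b--> 8
Input: a b b a
{0,1,3,7} --a--> {2,4,7} --b--> {5,8} --b--> {6,8} --a--> none
Result: state 6 gives action2 and state 8 gives action3
When two or more accepting states are reached, the
first action given in the Lex specification is executed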
Deterministic Finite Automata
• A deterministic finite automaton is a special case
of an NFA
– No state has an ε-transition
– For each state s and input symbol a there is at most one
edge labeled a leaving s
• Each entry in the transition table is a single state
– At most one path exists to accept a string
– Simulation algorithm is simple
Example DFA
A DFA that accepts (a|b)*abb
States 0, 1, 2, 3; start state 0; accepting state 3
State   a   b
0       1   0
1       1   2
2       1   3
3       1   0
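Simulating a DFA needs only one current state and one table lookup per input symbol. The sketch below assumes the usual four-state DFA for (a|b)*abb, with start state 0 and accepting state 3; the names dtran and dfa_accepts are assumptions, as is the exact transition table, which is spelled out in the comments.

```c
#include <assert.h>
#include <stddef.h>

/* dtran[state][symbol]: symbol 0 is a, symbol 1 is b.
   Assumed transitions of the four-state DFA for (a|b)*abb. */
static const int dtran[4][2] = {
    { 1, 0 },   /* state 0: a -> 1, b -> 0 */
    { 1, 2 },   /* state 1: a -> 1, b -> 2 */
    { 1, 3 },   /* state 2: a -> 1, b -> 3 */
    { 1, 0 },   /* state 3: a -> 1, b -> 0 */
};

/* Returns 1 if the DFA ends in accepting state 3 after reading x. */
int dfa_accepts(const char *x)
{
    assert(x != NULL);
    int state = 0;                           /* start state */
    for (; *x != '\0'; x++) {
        if (*x != 'a' && *x != 'b')
            return 0;                        /* symbol outside the alphabet */
        state = dtran[state][*x == 'b'];     /* exactly one next state */
    }
    return state == 3;
}
```

Because each table entry is a single state, the loop does constant work per character, in contrast to an NFA simulation that has to carry a set of states.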
Conversion of an NFA into a
DFA
• The subset construction algorithm converts an
NFA into a DFA using:
ε-closure(s) = {s} ∪ {t | s →ε … →ε t}
ε-closure(T) = ∪ s∈T ε-closure(s)
move(T,a) = {t | s →a t and s ∈ T}
• The algorithm produces:
Dstates is the set of states of the new DFA
consisting of sets of states of the NFA
Dtran is the transition table of the new DFA
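The subset construction can be sketched compactly when sets of NFA states are stored as bitmasks. The example below hard-codes a nine-state NFA for the three patterns a, abb and a*b+ (state 0 has ε-edges to 1, 3 and 7); that choice of NFA, the array names, and the MAXD bound are assumptions of the sketch. Dstates and Dtran correspond to the arrays dstates and dtran, with -1 standing for a missing transition.

```c
#include <assert.h>
#include <stddef.h>

enum { NSTATES = 9, NSYMS = 2, MAXD = 64 };

/* eps[s]: states reachable from s by one ε-edge (only state 0 has any). */
static const unsigned eps[NSTATES] = { 1u << 1 | 1u << 3 | 1u << 7 };
/* delta[s][sym]: sym 0 is a, sym 1 is b. */
static const unsigned delta[NSTATES][NSYMS] = {
    { 0,       0       },   /* 0 */
    { 1u << 2, 0       },   /* 1 --a--> 2 */
    { 0,       0       },   /* 2 */
    { 1u << 4, 0       },   /* 3 --a--> 4 */
    { 0,       1u << 5 },   /* 4 --b--> 5 */
    { 0,       1u << 6 },   /* 5 --b--> 6 */
    { 0,       0       },   /* 6 */
    { 1u << 7, 1u << 8 },   /* 7 --a--> 7, 7 --b--> 8 */
    { 0,       1u << 8 },   /* 8 --b--> 8 */
};

/* All states reachable from T using ε-edges only (T itself included). */
unsigned eps_closure(unsigned T)
{
    unsigned result = T, prev;
    do {
        prev = result;
        for (int s = 0; s < NSTATES; s++)
            if (result & (1u << s))
                result |= eps[s];
    } while (result != prev);
    return result;
}

/* All states reachable from some state in T by one edge labeled sym. */
unsigned move(unsigned T, int sym)
{
    unsigned result = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s))
            result |= delta[s][sym];
    return result;
}

/* Fills dstates and dtran and returns the number of DFA states. */
int subset_construction(unsigned dstates[MAXD], int dtran[MAXD][NSYMS])
{
    assert(dstates != NULL && dtran != NULL);
    int n = 0;
    dstates[n++] = eps_closure(1u << 0);
    for (int i = 0; i < n; i++) {            /* dstates[i] is the next unmarked state */
        for (int sym = 0; sym < NSYMS; sym++) {
            unsigned U = eps_closure(move(dstates[i], sym));
            int j = -1;                      /* -1: no transition (empty set) */
            if (U != 0) {
                for (j = 0; j < n && dstates[j] != U; j++)
                    ;                        /* look for U among known states */
                if (j == n) {
                    assert(n < MAXD);
                    dstates[n++] = U;        /* new DFA state */
                }
            }
            dtran[i][sym] = j;
        }
    }
    return n;
}
```

For this NFA the construction finds six DFA states: {0,1,3,7}, {2,4,7}, {8}, {7}, {5,8} and {6,8}.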
ε-closure and move Examples
Combined NFA:
0 --ε--> 1 --a--> 2
0 --ε--> 3 --a--> 4 --b--> 5 --b--> 6
0 --ε--> 7 --b--> 8, with 7 --a--> 7 and 8 --b--> 8
ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅
Input: a a b a
{0,1,3,7} --a--> {2,4,7} --a--> {7} --b--> {8} --a--> none
Also used to simulate NFAs
Simulating an NFA using
ε-closure and move
S := ε-closure({s0})
Sprev := ∅
a := nextchar()
while S ≠ ∅ do
    Sprev := S
    S := ε-closure(move(S,a))
    a := nextchar()
end do
if Sprev ∩ F ≠ ∅ then
    execute action in Sprev
    return “yes”
else return “no”
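The simulation loop translates almost line by line into C. The sketch below runs it on a nine-state NFA for the patterns a, abb and a*b+ (actions 1, 2 and 3); the NFA encoding, the names simulate and action_of, and the use of the string terminator as end of input are assumptions. It follows the loop literally, so only the last nonempty state set is checked for accepting states.

```c
#include <assert.h>
#include <stddef.h>

enum { NSTATES = 9 };

/* Sets of states are bitmasks. Only state 0 has ε-edges (to 1, 3 and 7). */
static const unsigned eps[NSTATES]  = { 1u << 1 | 1u << 3 | 1u << 7 };
static const unsigned on_a[NSTATES] = { 0, 1u << 2, 0, 1u << 4, 0, 0, 0, 1u << 7, 0 };
static const unsigned on_b[NSTATES] = { 0, 0, 0, 0, 1u << 5, 1u << 6, 0, 1u << 8, 1u << 8 };
/* action_of[s]: action number of accepting state s, or 0 if s is not accepting. */
static const int action_of[NSTATES] = { 0, 0, 1, 0, 0, 0, 2, 0, 3 };

static unsigned eps_closure(unsigned T)
{
    unsigned result = T, prev;
    do {
        prev = result;
        for (int s = 0; s < NSTATES; s++)
            if (result & (1u << s))
                result |= eps[s];
    } while (result != prev);
    return result;
}

static unsigned move(unsigned T, char a)
{
    const unsigned *tab = (a == 'a') ? on_a : (a == 'b') ? on_b : NULL;
    unsigned result = 0;
    if (tab == NULL)
        return 0;                  /* end of input or unknown symbol: no moves */
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s))
            result |= tab[s];
    return result;
}

/* Returns the number of the action to execute, or 0 for "no". */
int simulate(const char *input)
{
    assert(input != NULL);
    unsigned S = eps_closure(1u << 0), Sprev = 0;
    size_t pos = 0;
    char a = input[pos];
    while (S != 0) {
        Sprev = S;
        S = eps_closure(move(S, a));
        if (a != '\0')
            a = input[++pos];      /* nextchar(), stopping at the terminator */
    }
    /* If several accepting states are in Sprev, the earliest action wins. */
    int best = 0;
    for (int s = 0; s < NSTATES; s++)
        if ((Sprev & (1u << s)) && action_of[s] != 0 &&
            (best == 0 || action_of[s] < best))
            best = action_of[s];
    return best;
}
```

On the input "aaba" the state sets are {0,1,3,7}, {2,4,7}, {7}, {8} and then the empty set, so action 3 is chosen; on "abba" the last nonempty set is {6,8}, and action 2 wins because it comes first in the specification.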
Minimizing the Number of States
of a DFA
[Figure: a DFA with states A, B, C, D, E over the inputs a and b (left) and the equivalent minimized DFA with states A, B, D, E (right)]