Lexical Analysis
Dr. Alok Kumar
Department of Computer Science
and Engineering
UIET, CSJM University, Kanpur
Topics Covered
Role of Lexical Analyzer
Tokens, Patterns, Lexemes
Lexical Errors and Recovery
Specification of Tokens
Recognition of Tokens
Finite Automata
Tool lex
Conclusion
2 Lexical Analysis- Dr. Alok Kumar
Lexical analyzer
• The main task of the lexical analyzer is to read the input characters of the source program and produce tokens.
• "Get next token" is a command sent from the parser to the lexical analyzer.
• On receiving this command, the lexical analyzer scans the input until it finds the next token.
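This pull model can be sketched in a few lines of C. The token names and the toy scanner below are illustrative assumptions, not the slides' code; the point is only that the parser repeatedly asks for the next token and the scanner advances through the input:

```c
#include <ctype.h>

/* Hypothetical sketch: the parser pulls tokens one at a time. */
typedef enum { TOK_ID, TOK_NUM, TOK_EOF } TokenName;

typedef struct {
    TokenName name;
    const char *start;  /* first character of the lexeme */
    int length;         /* lexeme length */
} Token;

/* get_next_token scans *src past blanks and returns the next token,
   advancing *src so the parser can call it again. */
Token get_next_token(const char **src) {
    const char *p = *src;
    while (*p == ' ') p++;                      /* skip whitespace */
    Token t = { TOK_EOF, p, 0 };
    if (isdigit((unsigned char)*p)) {
        t.name = TOK_NUM;
        while (isdigit((unsigned char)p[t.length])) t.length++;
    } else if (isalpha((unsigned char)*p)) {
        t.name = TOK_ID;
        while (isalnum((unsigned char)p[t.length])) t.length++;
    }
    *src = p + t.length;
    return t;
}
```

Each call consumes exactly one lexeme, which is precisely the "get next token" interaction described above.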
Role of lexical analyzer
(figure: the lexical analyzer reads the source program and returns tokens to the parser on "get next token" requests)
Why separate lexical analysis from parsing?
Simplicity of design
Improving compiler efficiency
Enhancing compiler portability
Tokens, Patterns and Lexemes
• A token is a pair – a token name and an optional token
value
• A pattern is a description of the form that the
lexemes of a token may take
• A lexeme is a sequence of characters in the source
program that matches the pattern for a token
Example
(figure: a table of example tokens with their patterns and sample lexemes)
Attributes for tokens
• E = M * C ** 2
– <id, pointer to symbol table entry for E>
– <assign-op>
– <id, pointer to symbol table entry for M>
– <mult-op>
– <id, pointer to symbol table entry for C>
– <exp-op>
– <number, integer value 2>
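The token/attribute pairs above can be written down directly as data. A minimal sketch in C (the type names and the symbol-table slots 0, 1, 2 for E, M, C are assumptions for illustration):

```c
/* A token is a pair <name, attribute>: identifiers carry a
   symbol-table index, numbers carry their integer value. */
typedef enum { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER } TokenName;

typedef struct {
    TokenName name;
    int attr;   /* symbol-table index for ID, value for NUMBER, unused otherwise */
} Token;

/* The stream the lexer would hand the parser for  E = M * C ** 2,
   assuming E, M, C occupy symbol-table slots 0, 1, 2. */
static const Token stream[] = {
    { ID, 0 }, { ASSIGN_OP, 0 }, { ID, 1 }, { MULT_OP, 0 },
    { ID, 2 }, { EXP_OP, 0 }, { NUMBER, 2 }
};
static const int stream_len = sizeof stream / sizeof stream[0];
```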
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
– fi (a == f(x)) …   (fi is a valid lexeme for an identifier, so the lexical analyzer cannot tell whether it is a misspelled if or an undeclared function name)
• However, it may be able to recognize errors like:
– d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining
input
• Replace a character by another character
• Transpose two adjacent characters
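Panic mode is the simplest of these strategies. A minimal sketch in C (the choice of which characters can start a token is an assumption for illustration):

```c
#include <ctype.h>

/* Panic-mode recovery: starting at pos, discard characters that no
   token pattern can match, and return the position where scanning
   may resume. Here we assume tokens start with letters, digits,
   or a blank. */
int panic_skip(const char *input, int pos) {
    while (input[pos] != '\0' &&
           !isalnum((unsigned char)input[pos]) &&
           input[pos] != ' ')
        pos++;          /* ignore the offending character */
    return pos;         /* scanning resumes here */
}
```

The other strategies (delete, insert, replace, transpose) each try a single-character repair instead of discarding input.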
Input Buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
– In C: after seeing -, =, or <, we must examine the next character to decide what token to return
– In Fortran: DO 5 I = 1.25 (blanks are insignificant, so this reads as the assignment DO5I = 1.25; it cannot be distinguished from a DO-loop header until the . or , is seen)
• We need to introduce a two-buffer scheme to handle large look-aheads safely
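A minimal sketch of the two-buffer scheme in C, under stated assumptions: each half holds N characters plus a sentinel slot, and the sentinel is '\0' rather than a real EOF character, so it doubles as the end-of-input mark. The struct and function names are illustrative:

```c
#include <string.h>

#define N 4             /* tiny half-buffer, for illustration only */
#define SENTINEL '\0'

typedef struct {
    char buf[2 * (N + 1)];  /* two halves, each with a sentinel slot */
    const char *src;        /* remaining input (stands in for a file) */
    int forward;            /* the forward (scanning) pointer */
} DoubleBuffer;

/* Fill one half from the input and terminate it with the sentinel. */
static void load_half(DoubleBuffer *db, int half) {
    int i, base = half * (N + 1);
    for (i = 0; i < N && *db->src; i++)
        db->buf[base + i] = *db->src++;
    db->buf[base + i] = SENTINEL;
}

void db_init(DoubleBuffer *db, const char *src) {
    db->src = src;
    db->forward = 0;
    load_half(db, 0);
}

/* Advance forward one character. Hitting a sentinel at a half
   boundary triggers a reload of the other half; a sentinel anywhere
   else means true end of input. */
char db_next(DoubleBuffer *db) {
    char c = db->buf[db->forward];
    if (c != SENTINEL) { db->forward++; return c; }
    if (db->forward == N) {                 /* end of first half */
        load_half(db, 1);
        db->forward = N + 1;
    } else if (db->forward == 2 * N + 1) {  /* end of second half */
        load_half(db, 0);
        db->forward = 0;
    } else {
        return SENTINEL;                    /* real end of input */
    }
    return db_next(db);
}
```

The sentinel is the point of the scheme: the end-of-half test and the end-of-input test collapse into a single comparison per character.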
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means of specifying regular languages
• Example: letter(letter | digit)*
• Each regular expression is a pattern specifying the form of strings
Regular Expressions
• Ɛ is a regular expression denoting the language L(Ɛ) = {Ɛ}, containing only the empty string
• If a is a symbol in ∑, then a is a regular expression with L(a) = {a}
• If r and s are two regular expressions with languages L(r) and L(s), then
– r|s is a regular expression denoting the language L(r) ∪ L(s), containing all strings of L(r) and of L(s)
– rs is a regular expression denoting the language L(r)L(s), formed by concatenating each string of L(r) with each string of L(s)
– r* is a regular expression denoting (L(r))*, the set containing zero or more concatenated occurrences of strings of L(r)
– (r) is a regular expression denoting the same language L(r)
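The three operators can be exercised concretely with POSIX extended regular expressions (regcomp/regexec from <regex.h>). The helper below is a sketch; its name and the anchoring with ^ and $ (to force a whole-string match) are choices made for this example:

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole string s belongs to the language of the
   POSIX extended regular expression pattern, 0 otherwise. */
int matches(const char *pattern, const char *s) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                            /* bad pattern */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Union, concatenation, and closure then behave exactly as the definitions above say: a|b accepts a or b, ab accepts only ab, and (ab)* accepts zero or more copies of ab, including the empty string.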
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]
• Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ (letter_ | digit)*
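The identifier definition letter_ (letter_ | digit)* — the class form [A-Za-z_][A-Za-z0-9_]* — is simple enough to hand-code directly. A sketch (the function name is an assumption):

```c
#include <ctype.h>

/* Return 1 if s is a non-empty identifier matching
   [A-Za-z_][A-Za-z0-9_]*, 0 otherwise. */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                       /* must start with letter_ */
    for (int i = 1; s[i]; i++)
        if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
            return 0;                   /* rest must be letter_ or digit */
    return 1;
}
```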
Examples with ∑= {0,1}
• (0|1)*: All binary strings including the empty
string
• (0|1)(0|1)*: All nonempty binary strings
• 0(0|1)*0: All binary strings of length at least 2,
starting and ending with 0s
• (0|1)*0(0|1)(0|1)(0|1): All binary strings of length at least 4 in which the fourth symbol from the right is 0
• 0*10*10*10*: All binary strings possessing
exactly three 1s
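The last pattern has a direct procedural reading: a string belongs to 0*10*10*10* exactly when it is binary and contains exactly three 1s. A sketch of that equivalent check (function name assumed for illustration):

```c
/* Return 1 if s is a binary string with exactly three 1s —
   i.e. s is in the language of 0*10*10*10* — and 0 otherwise. */
int exactly_three_ones(const char *s) {
    int ones = 0;
    for (; *s; s++) {
        if (*s == '1') ones++;
        else if (*s != '0') return 0;   /* not over the alphabet {0,1} */
    }
    return ones == 3;
}
```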
Recognition of tokens
• Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | Ɛ
expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
• We also need to handle whitespaces:
ws -> (blank | tab | newline)+
Transition diagrams
Transition diagram for relop
Transition diagrams (cont.)
Transition diagram for reserved words and
identifiers
Transition diagram for unsigned numbers
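The unsigned-number diagram for digits (. digits)? (E [+-]? digits)? can be coded as a straight-line scanning function, each loop corresponding to a digit-consuming state. A sketch (function name assumed):

```c
#include <ctype.h>

/* Return 1 if the whole string s is an unsigned number matching
   digit+ (. digit+)? (E [+-]? digit+)?, 0 otherwise. */
int is_number(const char *s) {
    int i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;     /* need digit+ */
    while (isdigit((unsigned char)s[i])) i++;
    if (s[i] == '.') {                               /* optional fraction */
        i++;
        if (!isdigit((unsigned char)s[i])) return 0; /* . needs digits */
        while (isdigit((unsigned char)s[i])) i++;
    }
    if (s[i] == 'E') {                               /* optional exponent */
        i++;
        if (s[i] == '+' || s[i] == '-') i++;
        if (!isdigit((unsigned char)s[i])) return 0; /* E needs digits */
        while (isdigit((unsigned char)s[i])) i++;
    }
    return s[i] == '\0';                             /* accepting state */
}
```

In a real scanner the function would stop at the longest matching prefix and retract, rather than require the whole string to match.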
Architecture of a transition-diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: …
        …
        case 8:
            retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• A finite automaton consists of
– An input alphabet ∑
– A set of states S
– A start state n ∈ S
– A set of accepting states F ⊆ S
– A set of transitions state → state, each labeled with an input symbol
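These components map directly onto code: the transitions become a table indexed by state and input symbol. A sketch of one concrete DFA (the machine chosen here, accepting binary strings ending in 0, i.e. (0|1)*0, is an illustration and not from the slides):

```c
/* Transition table delta[state][symbol] for the DFA accepting (0|1)*0.
   States: 0 = start, 1 = accepting (last symbol read was 0). */
static const int delta[2][2] = {
    /* on '0'  on '1' */
    {     1,      0 },   /* from state 0 */
    {     1,      0 },   /* from state 1 */
};

/* Run the DFA over s; return 1 if it halts in an accepting state. */
int dfa_accepts(const char *s) {
    int state = 0;                    /* start state n */
    for (; *s; s++) {
        if (*s != '0' && *s != '1')
            return 0;                 /* symbol outside the alphabet */
        state = delta[state][*s - '0'];
    }
    return state == 1;                /* accepting states F = {1} */
}
```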
Lexical Analyzer Generator - Lex
Structure of Lex programs
declarations
%%
translation rules (each of the form: pattern {action})
%%
auxiliary functions
Example
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
…
Conclusion
• Words of a language can be specified using regular
expressions
• NFA and DFA can act as acceptors
• Regular expressions can be converted to NFA
• NFA can be converted to DFA
• The automated tool lex can be used to generate a lexical analyser for a language