
Lexical Analysis

Dr. Alok Kumar


Department of Computer Science
and Engineering
UIET, CSJM University, Kanpur
Topics Covered
• Role of Lexical Analyzer
• Tokens, Patterns, Lexemes
• Lexical Errors and Recovery
• Specification of Tokens
• Recognition of Tokens
• Finite Automata
• The Tool lex
• Conclusion

2 Lexical Analysis- Dr. Alok Kumar


Lexical analyzer
The main task of lexical analysis is to read the input characters of the source code and produce tokens.
"Get next token" is a command sent from the parser to the lexical analyzer. On receiving this command, the lexical analyzer scans the input until it finds the next token.
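The "get next token" interaction can be sketched in C. Everything below (the token type, the field names, the whitespace/identifier/number rules) is an illustrative invention for this sketch, not part of any real compiler's interface:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token representation; real scanners define many more kinds. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_EOF } TokenKind;
typedef struct { TokenKind kind; char text[32]; } Token;

static const char *input;   /* the scanner's view of the remaining source */

/* get_next_token: skip blanks, then group characters into one token.
 * This is the routine the parser calls each time it needs a token. */
Token get_next_token(void) {
    Token t = { TOK_EOF, "" };
    while (*input == ' ') input++;               /* skip whitespace */
    if (*input == '\0') return t;                /* end of input */
    size_t n = 0;
    if (isdigit((unsigned char)*input)) {
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*input) && n < 31)
            t.text[n++] = *input++;
    } else {
        t.kind = TOK_ID;
        while ((isalnum((unsigned char)*input) || *input == '_') && n < 31)
            t.text[n++] = *input++;
    }
    t.text[n] = '\0';
    return t;
}
```

On the input `"count 42 sum"`, three successive calls yield the lexemes `count`, `42` and `sum`; a fourth call returns `TOK_EOF`.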



Role of lexical analyzer



Why separate lexical
analysis and parsing?
• Simplicity of design
• Improved compiler efficiency
• Enhanced compiler portability



Tokens, Patterns and
Lexemes
• A token is a pair – a token name and an optional token
value
• A pattern is a description of the form that the
lexemes of a token may take
• A lexeme is a sequence of characters in the source
program that matches the pattern for a token



Example



Attributes for tokens
• E = M * C ** 2
– <id, pointer to symbol table entry for E>
– <assign-op>
– <id, pointer to symbol table entry for M>
– <mult-op>
– <id, pointer to symbol table entry for C>
– <exp-op>
– <number, integer value 2>



Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
– fi (a == f(x)) …
• However, it may be able to recognize errors like:
– d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence



Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent characters
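Panic mode, the simplest of these strategies, can be sketched in a few lines of C. The assumption below that tokens begin with a letter, digit or underscore is purely for illustration:

```c
#include <ctype.h>

/* Panic-mode sketch: after an unrecognized character, discard input
 * until a character that could begin a well-formed token. For this
 * illustration we assume tokens begin with a letter, digit or '_'. */
int panic_skip(const char **p) {
    int skipped = 0;
    while (**p != '\0' &&
           !(isalnum((unsigned char)**p) || **p == '_')) {
        (*p)++;                 /* throw the offending character away */
        skipped++;
    }
    return skipped;             /* how many characters were discarded */
}
```

Given a garbage prefix such as `"@#!"` before an identifier, the scanner resumes at the identifier after discarding three characters.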



Input Buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
– In C: we need to look past -, = or < to decide which token to return
– In Fortran: DO 5 I = 1.25
• We need a two-buffer scheme to handle large lookaheads safely
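The two-buffer scheme is usually described as a pair of buffer halves, each ended by a sentinel character; when the forward pointer reaches the sentinel of one half, the other half is reloaded. The sketch below is illustrative (tiny buffer size, a string standing in for the input file, invented names):

```c
#include <string.h>

/* Sketch of the two-buffer scheme: two halves of size N, each followed
 * by a sentinel ('\0' here). When the forward pointer hits the sentinel
 * at the end of one half, the other half is reloaded. N is kept tiny
 * for illustration; real lexers use sizes like 4096. */
#define N 8

static char buf[2 * (N + 1)];
static const char *src;          /* stands in for the input file */
static char *forward;

static void fill(char *half) {
    size_t n = 0;
    while (n < N && src[n] != '\0') n++;   /* read up to N characters */
    memcpy(half, src, n);
    src += n;
    half[n] = '\0';                        /* sentinel ends this half */
}

void init_buffers(const char *text) {
    src = text;
    fill(buf);
    forward = buf;
}

char next_char(void) {
    char c = *forward++;
    if (c == '\0') {
        if (forward == buf + N + 1) {              /* end of first half */
            fill(buf + N + 1);
            forward = buf + N + 1;
            c = *forward++;
        } else if (forward == buf + 2 * (N + 1)) { /* end of second half */
            fill(buf);
            forward = buf;
            c = *forward++;
        }
    }
    if (c == '\0') forward--;   /* true end of input: stay on the sentinel */
    return c;
}
```

Because each read only compares against the sentinel, the common path needs a single test per character rather than a bounds check plus an end-of-file check.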



Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means of specifying regular languages
• Example: letter (letter | digit)*
• Each regular expression is a pattern specifying the form of strings



Regular Expressions
• ε is a regular expression denoting the language L(ε) = {ε}, containing only the empty string
• If a is a symbol in Σ, then a is a regular expression with L(a) = {a}
• If r and s are regular expressions with languages L(r) and L(s), then
– r|s is a regular expression denoting the language L(r) ∪ L(s), containing all strings of L(r) and of L(s)
– rs is a regular expression denoting the language L(r)L(s), formed by concatenating each string of L(r) with each string of L(s)
– r* is a regular expression denoting (L(r))*, the set containing zero or more concatenations of strings of L(r)
– (r) is a regular expression denoting the same language L(r)



Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn

• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*



Extensions
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]

• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_ (letter_ | digit)*
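The definition of an identifier above can also be matched directly by a short hand-written routine. The helper below is an illustrative sketch (the function name is ours), not generated code:

```c
#include <ctype.h>
#include <stddef.h>

/* Recognize id -> letter_ (letter_ | digit)* directly in C.
 * Returns the length of the longest identifier prefix of s
 * (0 if s does not begin with an identifier). */
size_t match_id(const char *s) {
    if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                          /* must start with letter_ */
    size_t n = 1;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;                               /* (letter_ | digit)* */
    return n;
}
```

Returning the longest matching prefix mirrors the "longest match" rule that lexical analyzers apply when several lexemes are possible.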



Examples with ∑= {0,1}
• (0|1)*: All binary strings including the empty
string
• (0|1)(0|1)*: All nonempty binary strings
• 0(0|1)*0: All binary strings of length at least 2 that start and end with 0
• (0|1)*0(0|1)(0|1): All binary strings with at least three characters in which the third-last character is 0
• 0*10*10*10*: All binary strings containing exactly three 1s
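The last pattern can be checked mechanically: a DFA for 0*10*10*10* only needs to count 1s and enter a dead state after the fourth. A small C sketch (the function name is our own):

```c
/* Simulate the DFA for 0*10*10*10*: the state is the number of 1s
 * seen so far; a fourth 1 sends the automaton to a dead state. */
int exactly_three_ones(const char *s) {
    int ones = 0;
    for (; *s; s++) {
        if (*s == '1' && ++ones > 3)
            return 0;               /* dead state: too many 1s */
        /* a '0' leaves the state unchanged */
    }
    return ones == 3;               /* accept iff exactly three 1s */
}
```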
Recognition of tokens
• The starting point is the language grammar, from which we understand the tokens:

stmt -> if expr then stmt
      | if expr then stmt else stmt
expr -> term relop term
      | term
term -> id
      | number


Recognition of tokens
(cont.)
• The next step is to formalize the patterns:
digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
ws -> (blank | tab | newline)+
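Since the keyword patterns (if, then, else) also match the id pattern, one common implementation matches an identifier first and then checks a table of reserved words. A minimal sketch, with illustrative token codes:

```c
#include <string.h>
#include <stddef.h>

/* After matching letter(letter|digit)*, decide whether the lexeme is
 * a reserved word or an ordinary identifier by table lookup.
 * Token codes here are illustrative. */
enum { IF = 1, THEN, ELSE, ID };

int classify(const char *lexeme) {
    static const struct { const char *word; int code; } reserved[] = {
        { "if", IF }, { "then", THEN }, { "else", ELSE },
    };
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i].word) == 0)
            return reserved[i].code;
    return ID;    /* not reserved: an ordinary identifier */
}
```

This is the same idea as the transition diagram for reserved words and identifiers on the next slide: one diagram for all letter-initial lexemes, with the keyword/identifier decision deferred to a lookup.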
Transition diagrams
Transition diagram for relop



Transition diagrams
(cont.)
Transition diagram for reserved words and
identifiers

Transition diagram for unsigned numbers



Architecture of a transition-
diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch(state) {
        case 0: c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: …

        case 8: retract();
            retToken.attribute = GT;
            return(retToken);
        }
    }
}



Finite Automata
• Regular expressions = specification
• Finite automata = implementation

• A finite automaton consists of

– An input alphabet Σ
– A set of states S
– A start state n
– A set of accepting states F ⊆ S
– A set of transitions state → state
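These components are commonly implemented as a transition table indexed by state and input symbol. A minimal, self-contained C sketch for one illustrative DFA, accepting binary strings ending in "01" (the automaton and all names are our own example, not from the slides):

```c
/* Table-driven DFA simulation: one row per state, one column per
 * input symbol. This example DFA accepts binary strings ending
 * in "01"; accepting states F = {2}. */
enum { NSTATES = 3 };

static const int delta[NSTATES][2] = {
    /* on '0', on '1' */
    { 1, 0 },   /* state 0: start */
    { 1, 2 },   /* state 1: last character was 0 */
    { 1, 0 },   /* state 2: last two characters were "01" (accepting) */
};

int dfa_accepts(const char *s) {
    int state = 0;                       /* start state n = 0 */
    for (; *s; s++)
        state = delta[state][*s - '0'];  /* one transition per symbol */
    return state == 2;                   /* accept iff we end in F */
}
```

The simulation cost is one table lookup per input character, independent of how complicated the original regular expression was; this is why DFAs are the standard implementation of token recognizers.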



Lexical Analyzer
Generator - Lex



Structure of Lex programs

declarations
%%
translation rules   (of the form: pattern { action })
%%
auxiliary functions



Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
…
Conclusion
• Words of a language can be specified using regular
expressions
• NFA and DFA can act as acceptors
• Regular expressions can be converted to NFA
• NFA can be converted to DFA
• The automated tool lex can be used to generate a lexical analyser for a language

