Compilers
Lexical Analysis
• What is the goal?
if (i ==0)
z=0;
else
z=1;
• The input is just a string of characters:
• if (i ==0)\n\tz=0;\nelse\n\tz=1;
• Goal: Partition input string into substrings
• where the substrings are tokens
Token
• A token is a word: the smallest unit above individual letters.
• A token class is a minimal syntactic category.
• English: noun, verb, adjective …
• Programming language: Identifier, integer, keyword, whitespace, …
• Tokens correspond to sets of strings
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or …
• Whitespace: a non-empty sequence of blanks, newlines and tabs.
Contd…
• Tokens classify program substrings according to their role.
• The output of lexical analysis is a stream of tokens.
• The parser relies on token distinctions.
• An identifier is treated differently from a keyword.
Designing a lexical analyser
• Define a finite set of tokens
• Tokens describe all items of interest
• Choice of tokens depends on language, design of parser …
• Recall
• \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Useful tokens for this expression:
• Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;
• N.B., (, ), =, ; are tokens, not characters, here
• The next step is to describe which substrings belong to each token.
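The token list above can be turned into a small scanner. The following is only a minimal sketch in Python, not the implementation the slides have in mind: the name `TOKEN_SPEC`, the function `tokenize`, and the exact patterns are assumptions made for illustration, and the keyword set is limited to the two keywords the slides mention.

```python
import re

# Token classes from the slide, written as (name, pattern) pairs in Python's
# `re` syntax. The names and patterns are illustrative assumptions.
# Order matters: Keyword is tried before Identifier, and Relation before "=".
TOKEN_SPEC = [
    ("Whitespace", re.compile(r"[ \n\t]+")),
    ("Keyword",    re.compile(r"(?:if|else)\b")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Integer",    re.compile(r"[0-9]+")),
    ("Relation",   re.compile(r"==")),
    ("(",          re.compile(r"\(")),
    (")",          re.compile(r"\)")),
    ("=",          re.compile(r"=")),
    (";",          re.compile(r";")),
]

def tokenize(text):
    """Partition `text` into (token, lexeme) pairs, scanning left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            match = pattern.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            # No token class accepts the text at this position.
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
for token, lexeme in tokenize(source):
    print(f"{token:<10} {lexeme!r}")
```

Concatenating the lexemes gives back the original string, which is what "partition the input into substrings" means here.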
Implementation
• An implementation is responsible for two things:
• Recognizing the substrings that correspond to tokens.
• Returning the value or lexeme (substring) of each token.
• It first discards tokens that do not contribute to parsing:
• Whitespace and comments.
if (i ==0) //if clause
    z=0;
else /*else clause is located here*/
    z=1;
• With the comments discarded, the input is:
• if (i == 0)\n\tz=0;\nelse\n\tz=1;
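Discarding whitespace and comments can be sketched as follows. This is a hedged illustration only: the patterns for `//` and `/* */` comments, and the names `SKIP`, `WORD`, and `significant_lexemes`, are assumptions that do not come from the slides.

```python
import re

# Assumed patterns (not given in the slides): C-style line comments,
# block comments, and runs of blanks, newlines, and tabs.
SKIP = re.compile(r"//[^\n]*|/\*.*?\*/|[ \n\t]+", re.DOTALL)
# Everything the parser should see, for this small example only.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|==|[()=;]")

def significant_lexemes(text):
    """Return the lexemes of `text`, dropping whitespace and comments."""
    lexemes = []
    pos = 0
    while pos < len(text):
        skipped = SKIP.match(text, pos)
        if skipped:
            pos = skipped.end()      # discard; nothing is passed to the parser
            continue
        word = WORD.match(text, pos)
        if word is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        lexemes.append(word.group())
        pos = word.end()
    return lexemes

commented = "if (i ==0) //if clause\n\tz=0;\nelse /*else clause is located here*/\n\tz=1;"
plain = "if (i == 0)\n\tz=0;\nelse\n\tz=1;"
print(significant_lexemes(commented))
print(significant_lexemes(commented) == significant_lexemes(plain))
```

Both versions of the input produce the same lexemes, so the parser cannot tell that comments were ever present.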
Some examples
• C++
• Most tokens are easy to recognize.
• Template syntax: Foo<Bar>
• Stream syntax: cin >> var;
• Nested templates create a conflict: in Foo<Bar<Bazz>>, the closing >> looks like the stream operator.
• Is if one keyword or two variables, i and f?
• Is == one relational operator or two equal signs, = =?
Solution
• Left-to-right scan
• Lookahead is sometimes required.
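A hand-written scanner makes the role of lookahead visible. The sketch below is one possible way to do it, under the assumption that the longest possible lexeme is always preferred; the name `scan`, the keyword set, and the treatment of every other character as its own token are simplifications for illustration.

```python
import string

KEYWORDS = {"if", "else"}   # assumed keyword set, taken from the slides' examples
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def scan(text):
    """Left-to-right scan that peeks at the next character before deciding."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch in " \n\t":
            pos += 1                               # skip whitespace
        elif ch in LETTERS:
            # Keep reading while the lookahead character can extend the word,
            # so "if" is never split into the two variables "i" and "f".
            end = pos + 1
            while end < len(text) and (text[end] in LETTERS or text[end] in DIGITS):
                end += 1
            word = text[pos:end]
            tokens.append(("Keyword" if word in KEYWORDS else "Identifier", word))
            pos = end
        elif ch in DIGITS:
            end = pos + 1
            while end < len(text) and text[end] in DIGITS:
                end += 1
            tokens.append(("Integer", text[pos:end]))
            pos = end
        elif ch == "=":
            # One character of lookahead separates "==" from "=".
            if pos + 1 < len(text) and text[pos + 1] == "=":
                tokens.append(("Relation", "=="))
                pos += 2
            else:
                tokens.append(("=", "="))
                pos += 1
        else:
            tokens.append((ch, ch))                # simplification: any other character
            pos += 1
    return tokens

print(scan("if (i == 0) z = 1;"))
print(scan("iffy"))
```

Note that this longest-match rule alone does not settle the nested template case, where the two closing brackets must not be read as one operator.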
Regular languages
• Regular languages are one of several formalisms for specifying tokens.
• Regular languages are a simple and useful theory:
• Easy to understand
• Efficient implementation
• Definition: Let Σ be a set of characters. A language over Σ is a set of
strings of characters drawn from Σ.
Examples of languages
English
• Alphabet = characters
• Language = sentences
Programming language
• Alphabet = ASCII
• Language = programs
Notations
• Languages are sets of strings.
• Need some notation for specifying which sets we want
• The standard notation for regular languages is regular expressions.
Regular expressions
• Single character: ‘c’ = {“c”}
• Epsilon: ε = {“”}
• Union: A + B = {s | s ∈ A or s ∈ B}
• Concatenation: AB = {ab | a ∈ A and b ∈ B}
• Iteration: A* = ⋃_{i ≥ 0} A^i, where A^i = A…A (i times) and A^0 = {“”}
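Because each regular expression denotes a set of strings, the operations can be tried out directly on small finite sets. The following is a minimal sketch using Python sets; the function names are chosen for illustration, and since A* is infinite whenever A contains a non-empty string, `star_up_to` only builds a finite approximation up to a chosen bound.

```python
def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b

def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {s + t for s in a for t in b}

def power(a, i):
    """A^i: A concatenated with itself i times, with A^0 = {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result

def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0, A^1, ..., A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result

A = {"a"}
B = {"b", "c"}
print(sorted(union(A, B)))       # ['a', 'b', 'c']
print(sorted(concat(A, B)))      # ['ab', 'ac']
print(sorted(star_up_to(A, 3)))  # ['', 'a', 'aa', 'aaa']
```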
Regular expressions
• Definition: The regular expressions over Σ are the smallest set of
expressions including
• ε
• ‘c’ where c ∈ Σ
• A + B where A, B are rexp over Σ
• AB where A, B are rexp over Σ
• A* where A is a rexp over Σ
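The five cases of the definition map naturally onto a small data type with one class per case. The sketch below is an illustration under that assumption, with a deliberately simple matcher that tracks the positions where a match can end; the class and function names are hypothetical, and real lexer generators usually compile to finite automata instead.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Epsilon:
    pass

@dataclass(frozen=True)
class Char:
    c: str            # a single character of the alphabet

@dataclass(frozen=True)
class Union:
    left: object
    right: object

@dataclass(frozen=True)
class Concat:
    left: object
    right: object

@dataclass(frozen=True)
class Star:
    inner: object

def ends(rexp, s, start):
    """Return every position where a match of `rexp` starting at `start` can end."""
    if isinstance(rexp, Epsilon):
        return {start}
    if isinstance(rexp, Char):
        return {start + 1} if s.startswith(rexp.c, start) else set()
    if isinstance(rexp, Union):
        return ends(rexp.left, s, start) | ends(rexp.right, s, start)
    if isinstance(rexp, Concat):
        return {e for m in ends(rexp.left, s, start) for e in ends(rexp.right, s, m)}
    if isinstance(rexp, Star):
        reached = {start}          # zero iterations
        frontier = {start}
        while frontier:            # stops once no new end position appears
            step = {e for m in frontier for e in ends(rexp.inner, s, m)}
            frontier = step - reached
            reached |= frontier
        return reached
    raise TypeError(f"not a regular expression: {rexp!r}")

def matches(rexp, s):
    """True if the whole string `s` belongs to the language of `rexp`."""
    return len(s) in ends(rexp, s, 0)

digit = reduce(Union, [Char(d) for d in "0123456789"])
integer = Concat(digit, Star(digit))
print(matches(integer, "2024"))  # True
print(matches(integer, ""))      # False
```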
Examples
• Keywords: “else” or “if” or …
• ‘else’ + ‘if’ + …
• ‘else’ abbreviates ‘e’ ‘l’ ‘s’ ‘e’
• Integer: a non-empty string of digits
• digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
• integer = digit digit*
• Abbreviation: A⁺ = AA*
• Identifier: strings of letters or digits, starting with a letter
• letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’
• identifier = letter (letter + digit)*
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
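These definitions carry over almost directly to Python's `re` module, where union is written `|` instead of `+`. The following sketch is a translation made for illustration; the constant names are assumptions, and the whitespace pattern is built from the prose description since no expression is given for it.

```python
import re

# Character classes stand in for the long unions '0' + '1' + ... + '9'
# and 'A' + ... + 'Z' + 'a' + ... + 'z'.
DIGIT = "[0-9]"
LETTER = "[A-Za-z]"

KEYWORD = re.compile(r"else|if")                            # 'else' + 'if'
INTEGER = re.compile(f"{DIGIT}{DIGIT}*")                    # digit digit*
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")   # letter (letter + digit)*
# Assumed from the prose: one or more blanks, newlines, or tabs.
WHITESPACE = re.compile(r"(?: |\n|\t)(?: |\n|\t)*")

def belongs_to(pattern, text):
    """True if the whole string `text` is in the language of `pattern`."""
    return pattern.fullmatch(text) is not None

print(belongs_to(INTEGER, "007"))      # True
print(belongs_to(IDENTIFIER, "x1"))    # True
print(belongs_to(IDENTIFIER, "1x"))    # False
```

`fullmatch` is used so that the test is membership of the whole string in the language, not merely the presence of a matching prefix.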
Examples
• Phone Number
• +251-911-00 00 00
• Σ = digits ∪ {-, +, ‘ ’}
• Email Address
• [email protected]
• There are regular expressions everywhere.
• Everything discussed so far is syntax, not semantics (meaning).
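The phone number example can be checked with a short pattern. The slides give only the single number +251-911-00 00 00, so the format below is an assumption generalized from it, and the names `PHONE` and `is_phone_number` are hypothetical.

```python
import re

# Assumed format, generalized from the one example "+251-911-00 00 00":
# "+", three digits, "-", three digits, "-", then three pairs of digits
# separated by single blanks. {n} abbreviates n-fold concatenation.
PHONE = re.compile(r"\+[0-9]{3}-[0-9]{3}-[0-9]{2} [0-9]{2} [0-9]{2}")

def is_phone_number(text):
    """True if the whole string `text` matches the assumed phone format."""
    return PHONE.fullmatch(text) is not None

print(is_phone_number("+251-911-00 00 00"))  # True
print(is_phone_number("251-911-000000"))     # False
```

The pattern checks only the shape of the string, which is the syntax; whether the number actually belongs to anyone is a question of meaning that a regular expression cannot answer.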