flex
Ismaeel Alkrayyan
AI Department, 4th
lexical analyzer
Scanner:
• This is the first phase of a compiler.
• It reads the source text as a stream of characters and divides it into tokens by matching sequences of characters against patterns.
• It filters out comments and whitespace characters such as tab, space, and newline.
A quick tutorial on fLex 2
Tokens, Patterns, Lexemes
• Token: a group of characters with a logical meaning; a logical building block of the language. Examples: id, keyword, num.
• Pattern: a rule that describes which characters can be grouped into a token. It is expressed as a regular expression. The input stream of characters is matched against the patterns to identify tokens.
• Lexeme: the actual text (character sequence) that matches a pattern and is recognized as a token.
• For example, “int” is identified as the token keyword. Here “int” is the lexeme and keyword is the token.
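To make the token/pattern/lexeme distinction concrete, here is a minimal hand-written sketch in C. It is not what flex generates; the token names (TOK_KEYWORD and so on), the keyword list, and the function classify_lexeme are assumptions made up for illustration. Each pattern is coded by hand, and the function maps one lexeme to its token.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; the names are illustrative only. */
typedef enum { TOK_KEYWORD, TOK_ID, TOK_NUM, TOK_UNKNOWN } TokenType;

/* Classify one complete lexeme by checking it against three
   hand-coded patterns, tried in order. */
TokenType classify_lexeme(const char *lexeme)
{
    static const char *keywords[] = { "int", "if", "while", "return" };
    size_t i, len = strlen(lexeme);

    if (len == 0)
        return TOK_UNKNOWN;

    /* Pattern 1: a fixed list of keywords (assumed list). */
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return TOK_KEYWORD;

    /* Pattern 2: [0-9]+ */
    for (i = 0; i < len && isdigit((unsigned char)lexeme[i]); i++)
        ;
    if (i == len)
        return TOK_NUM;

    /* Pattern 3: [a-zA-Z][a-zA-Z0-9]* */
    if (isalpha((unsigned char)lexeme[0])) {
        for (i = 1; i < len && isalnum((unsigned char)lexeme[i]); i++)
            ;
        if (i == len)
            return TOK_ID;
    }
    return TOK_UNKNOWN;
}
```

With these assumptions, classify_lexeme("int") yields TOK_KEYWORD, while classify_lexeme("count1") yields TOK_ID: the lexeme is the text, the token is the category it falls into.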
flex : Overview
Scanner generators:
• Help write programs whose control flow is directed by instances of regular expressions in the input stream.
Input: a set of regular expressions + actions
-> flex (or lex) ->
Output: C code implementing a scanner:
• function: yylex()
• file: lex.yy.c
Using flex
• lex input spec (regexps + actions) -> flex -> file lex.yy.c, which defines yylex() { … }
• lex.yy.c is then passed to the compiler.
• The user supplies the driver code: main() {…} or parser() {…}
flex: input format
An input file has the following structure:
definitions
%% (required)
rules
%% (optional)
user code
Shortest possible legal flex input:
%%
Definitions
Options:
%option noyywrap
C code:
%{
#include <stdio.h>
#include <stdlib.h>
int line_count = 0;
%}
Flex definitions:
whitespace [ \t\v\f\r]+
newline [\n]
DIGIT [0-9]
CommentStart "/*"
ID [a-zA-Z][a-zA-Z0-9]*
%%
Rules
• The rules portion of the input contains a sequence of rules.
• Each rule has the form
pattern action
where:
• pattern describes a pattern to be matched on the input
• pattern must be un-indented
• action must begin on the same line.
Rules
%%
[0-9]+ {printf("%s is a number",yytext);}
{whitespace} {printf("whitespace encountered");}
{newline} {line_count++;}
. {printf("Mysterious character found");}
%%
In each rule, the left column is the pattern and the right column is the action.
Patterns are extended regular expressions.
Do not place any whitespace at the beginning of a pattern line.
“Start conditions” can be used to specify that a pattern matches only in specific situations.
Patterns
• Essentially, extended regular expressions.
• <<EOF>> to match “end of file”
• Character classes, written inside a bracket expression:
• [[:alpha:]], [[:digit:]], [[:alnum:]], [[:space:]].
• {name} where name was defined earlier.
• “Start conditions” can be used to specify that a pattern matches only in specific situations.
Regular Expressions
• The patterns at the heart of every flex scanner use a rich regular expression language.
• A regular expression is a pattern description written in a metalanguage, that is, a language that you use to describe what you want the pattern to match.
• The metalanguage uses standard text characters, some of which represent themselves and others of which represent patterns.
• All characters other than the metacharacters, including all letters and digits, match themselves.
Regular Expressions
Metacharacter, meaning, example:
• .  Matches any single character except the newline character (\n).
• \  Used to escape metacharacters and as part of the usual C escape sequences. Example: \n is a newline; \* is a literal asterisk.
• /  Trailing context, which means to match the regular expression preceding the slash, but only if it is followed by the regular expression after the slash. Only one slash is permitted per pattern. Example: 0/1 matches 0 in the string 01 but would not match anything in the string 0 or 02.
Regular Expressions
Metacharacter, meaning, example:
• ?  Zero or one occurrence of the preceding expression. Example: -?[0-9]+
• {}  To refer to an already defined name. Example: {whitespace}. Also to specify the number of occurrences. Example: 1{2}3{4}5{6}
• |  Or. Examples: a|b, faith|hope|charity
• ()  Group a series of regular expressions together. Example: (ab|cd)+
Regular Expressions
Metacharacter, meaning, example:
• ^  Within [], means any character except the following ones. Example: [^ab]. Otherwise, means start of line. Example: ^ab
• $  End of line. Example: 124$
• ""  Match the quoted text literally. Example: "^124$"
• <<EOF>>  End of file.
Regular Expressions
• A more complex pattern, for a number with:
• an optional sign,
• an optional decimal point,
• an optional exponent part.
• [-+]?([0-9]*\.?[0-9]+|[0-9]+\.)(E[-+]?[0-9]+)?
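To see what the number pattern accepts, it can help to code the same checks by hand. The sketch below is an illustration only, not the matcher flex builds: the function name match_number is an assumption, and it returns the length of the longest prefix that fits the pattern (optional sign, digits with an optional decimal point, optional E exponent), or 0 if nothing fits.

```c
#include <ctype.h>
#include <stddef.h>

/* Return the length of the longest prefix of s that matches the number
   pattern, or 0 if no prefix matches. Hypothetical helper. */
size_t match_number(const char *s)
{
    size_t i = 0, int_digits = 0, frac_digits = 0;

    if (s[i] == '+' || s[i] == '-')             /* optional sign */
        i++;

    while (isdigit((unsigned char)s[i])) {      /* digits before the point */
        i++;
        int_digits++;
    }
    if (s[i] == '.') {
        size_t j = i + 1;
        while (isdigit((unsigned char)s[j])) {  /* digits after the point */
            j++;
            frac_digits++;
        }
        /* Accept the point only if some digit sits next to it. */
        if (int_digits + frac_digits > 0)
            i = j;
    }
    if (int_digits + frac_digits == 0)
        return 0;                               /* no mantissa at all */

    if (s[i] == 'E') {                          /* optional exponent */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-')
            j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j]))
                j++;
            i = j;                              /* exponent is complete */
        }
    }
    return i;
}
```

Under these assumptions "12." and "+.5" are both accepted in full, a lone "." is rejected, and for "1E" only the leading "1" is matched because the exponent has no digits.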
Example
A flex program to read a file of (positive) integers and compute the average:
%{
#include <stdio.h>
#include <stdlib.h>
%}
dgt [0-9]
%%
{dgt}+ return atoi(yytext);
%%
int main(void)
{
    int val, total = 0, n = 0;
    while ( (val = yylex()) > 0 ) {
        total += val;
        n++;
    }
    if (n > 0) printf("ave = %d\n", total/n);
    return 0;
}
• Definitions: the %{ … %} block and the line “dgt [0-9]”, which defines a name for a digit.
• Rules: {dgt}+ uses that name to match a number and returns its value to the calling routine.
• User code: the driver main() (could instead have been in a separate file).
• char * yytext: a buffer that holds the input characters that actually match the pattern.
• Invoking the scanner: yylex(). Each time yylex() is called, the scanner continues processing the input from where it last left off. Returns 0 on end-of-file.
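The calling convention of yylex() (resume where the last call stopped, return 0 at end of input) can be imitated by hand. The following is a rough sketch under stated assumptions: next_int, scan_string, and average are hypothetical names, the input comes from a string rather than a file, and characters that are not digits are skipped silently instead of being echoed.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generated scanner: the cursor is kept
   between calls, the way yylex() remembers its input position. */
static const char *cursor = "";

void scan_string(const char *input) { cursor = input; }

/* Return the next integer in the input, or 0 at end of input. */
int next_int(void)
{
    char *end;
    long value;

    while (*cursor != '\0' && !isdigit((unsigned char)*cursor))
        cursor++;                 /* skip text that no rule returns */
    if (*cursor == '\0')
        return 0;                 /* end of input */
    value = strtol(cursor, &end, 10);
    cursor = end;                 /* next call resumes after this number */
    return (int)value;
}

/* Driver loop with the same shape as main() in the example. */
int average(const char *input)
{
    int val, total = 0, n = 0;

    scan_string(input);
    while ((val = next_int()) > 0) {
        total += val;
        n++;
    }
    return n > 0 ? total / n : 0;
}
```

One consequence of using the return value for both purposes: a literal 0 in the input is indistinguishable from end of input, so the loop stops there. This is why the example is restricted to positive integers.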
Avoiding compiler warnings
• If compiled using “gcc -Wall”, the previous flex file will generate
compiler warnings:
lex.yy.c: … : warning: `yyunput’ defined but not used
lex.yy.c: … : warning: `input’ defined but not used
• These can be removed using ‘%option’ declarations in the
first part of the flex input file:
%option nounput
%option noinput
Matching the Input (Handles Ambiguous Patterns)
• When more than one pattern can match the same input, the scanner behaves as follows:
• It matches the longest possible string every time it matches input.
• If multiple rules match the same longest string, the rule listed first in the flex input file is chosen.
• If no rule matches, the default is to copy the next character to stdout.
• The text that matched (the “token”) is copied to a buffer yytext.
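The two disambiguation rules (longest match first, then earliest rule) can be sketched directly in C. This is an illustration of the selection policy only; flex itself compiles all patterns into one automaton rather than trying them one by one. The matcher functions, the rule table, and pick_rule are assumptions made up for this sketch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical rule reports how many leading characters of s it matches. */
typedef size_t (*Matcher)(const char *s);

static size_t match_if(const char *s)      /* pattern: "if" */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)      /* pattern: [a-zA-Z][a-zA-Z0-9]* */
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_num(const char *s)     /* pattern: [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in file order: index 0 is listed first. */
static const Matcher rules[] = { match_if, match_id, match_num };

/* Return the index of the winning rule and store the match length in *len.
   Returns -1 when no rule matches; *len is then 1 (or 0 at end of input),
   mirroring the default of passing a single character through. */
int pick_rule(const char *s, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        size_t n = rules[i](s);
        if (n > best_len) {       /* strictly longer, so earlier rules win ties */
            best = i;
            best_len = n;
        }
    }
    if (best >= 0)
        *len = best_len;
    else
        *len = (s[0] != '\0') ? 1 : 0;
    return best;
}
```

On the input "if (x)" both the keyword rule and the identifier rule match two characters, so the keyword rule wins because it is listed first. On "iffy = 1" the identifier rule wins because it matches four characters.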
Matching the Input (cont’d)
Pattern to match C-style comments: /* … */
"/*"(.|\n)*"*/"
Input:
#include <stdio.h> /* definitions */
int main(int argc, char * argv[ ]) {
    if (argc <= 1) {
        printf("Error!\n"); /* no arguments */
    }
    printf("%d args given\n", argc);
    return 0;
}
Longest match: the pattern matches from the first /* (in “/* definitions */”) through the last */ (in “/* no arguments */”), so the code between the two comments is swallowed as part of one match.
Start Conditions
• Used to activate rules conditionally.
• Any rule prefixed with <S> will be activated only when the scanner is in start
condition S.
• Declaring a start condition S:
• in the definition section: %x S
• “%x” specifies “exclusive start conditions”: while S is active, only rules prefixed with <S> can match
• Putting the scanner into start condition S:
• action: BEGIN(S)
Start Conditions (cont’d)
• Example:
• <STRING>[^"]* { …match string body… }
• [^"] matches any character other than "
• The rule is activated only if the scanner is in the start condition STRING.
• INITIAL refers to the original state where no start conditions are
active.
• <*> matches all start conditions.
Using Start Conditions
• Start conditions let us explicitly simulate finite state machines.
• This lets us get around the “longest match” problem for C-style comments.
FSA for C comments:
• INITIAL -> S1 on /
• S1 -> S2 on *
• S2 -> S2 on non-*
• S2 -> S3 on *
• S3 -> S3 on *
• S3 -> S2 on non-{ /,* }
• S3 -> INITIAL on /
flex input:
%x S1 S2 S3
%%
"/" BEGIN(S1);
<S1>"*" BEGIN(S2);
<S2>[^*] ; /* stay in S2 */
<S2>"*" BEGIN(S3);
<S3>"*" ; /* stay in S3 */
<S3>[^*/] BEGIN(S2);
<S3>"/" BEGIN(INITIAL);
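The same finite state machine can be written as an explicit loop in C, which may make the role of each start condition easier to follow. This is a sketch, not flex output: strip_comments and the ST_ state names are hypothetical, and the handling of a '/' that is not followed by '*' (return to the initial state and keep the '/') is an added assumption. String literals that contain /* are not treated specially.

```c
#include <stddef.h>

/* States mirror the start conditions: INITIAL, S1 (seen '/'),
   S2 (inside a comment), S3 (seen '*' inside a comment). */
typedef enum { ST_INITIAL, ST_S1, ST_S2, ST_S3 } State;

/* Copy src to dst with C-style comments removed; dst must have room for
   strlen(src) + 1 bytes. Returns the number of characters written. */
size_t strip_comments(const char *src, char *dst)
{
    State state = ST_INITIAL;
    size_t out = 0;

    for (; *src != '\0'; src++) {
        char c = *src;
        switch (state) {
        case ST_INITIAL:
            if (c == '/') state = ST_S1;
            else dst[out++] = c;
            break;
        case ST_S1:
            if (c == '*') {
                state = ST_S2;         /* "/" followed by "*": comment starts */
            } else if (c == '/') {
                dst[out++] = '/';      /* emit the earlier '/', stay in S1 */
            } else {
                dst[out++] = '/';      /* assumption: not a comment after all */
                dst[out++] = c;
                state = ST_INITIAL;
            }
            break;
        case ST_S2:
            if (c == '*') state = ST_S3;
            break;                     /* any other character: stay in S2 */
        case ST_S3:
            if (c == '/') state = ST_INITIAL;
            else if (c != '*') state = ST_S2;
            break;                     /* '*': stay in S3 */
        }
    }
    if (state == ST_S1)
        dst[out++] = '/';              /* input ended right after a lone '/' */
    dst[out] = '\0';
    return out;
}
```

Because the machine leaves the comment state at the first "*/" it sees, two comments on different lines are removed separately and the code between them survives, which is exactly what the single longest-match pattern could not do.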
Putting it all together
• The scanner is implemented as a function
int yylex();
• the return value indicates the type of token found (encoded as a positive integer);
• the actual string matched is available in yytext.
• The scanner and parser need to agree on token type encodings
• let yacc generate the token type encodings
• yacc places these in a file y.tab.h (when run with -d)
• use #include "y.tab.h" in the definitions section of the flex input file.
• When compiling, link in the flex library using “-lfl”