Modern Compiler Design
T1 - Overview
Mooly Sagiv and Eran Yahav
School of Computer Science
Tel-Aviv University
http://www.cs.tau.ac.il/~yahave
1
Who
Eran Yahav
Schreiber Open-space
Tel: 6405358
Wednesday 14:00-16:00
http://www.cs.tau.ac.il/~yahave
2
What
Compiler: txt → exe
Source text → Frontend (analysis) → Semantic Representation → Backend (synthesis) → Executable code
3
Say What?
txt → Lexical Analysis → Syntax Analysis (Parsing) → AST → Symbol Table etc. → Inter. Rep. (IR) → Code Gen. → exe
Input: Turkish Coffee; output: executable code
4
How
txt → Lexical Analysis → Syntax Analysis (Parsing) → AST → Symbol Table etc. → Inter. Rep. (IR) → Code Gen. → exe
Input: Turkish Coffee; output: executable code
Tools: JLex, javaCup, Java, GC Lib, Assembler
5
How II
Groups of 3-4 students
Submit assignments on schedule
In case of doubt – ask questions
6
Why?
Useful techniques and algorithms
Lexical analysis / parsing
Semantic representation
…
Register allocation
Understand programming languages better
Understand internals of compilers
7
Today
Goals:
•Understand project scope
•Learn how to use JLex
8
Turkish Coffee
(extended) subset of Java
Main features
Object oriented
• Objects, virtual method calls, but no overloading
Strongly typed
• Primitives for int, boolean, string
• Reference types, array types
Dynamic allocation and Garbage Collection
• Heap allocation, automatic deallocation
Run-time checks
• Null references, array bounds, negative array size
• Adapted with permission from Cornell course material by Radu Rugina
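The three run-time checks listed above have direct counterparts in Java, which a short sketch can make concrete. Turkish Coffee itself has no exceptions, so how a failed check is reported there is not specified in these slides; the code below only shows what plain Java does in the same three situations. The class and method names (RuntimeChecksDemo, check, and so on) are made up for illustration.

```java
class RuntimeChecksDemo {
    // Runs an action and returns the simple name of the exception it raised, or "none".
    static String check(Runnable action) {
        try {
            action.run();
            return "none";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    // Dereferencing a null reference.
    static String nullReference() {
        return check(() -> { int[] a = null; int n = a.length; });
    }

    // Indexing past the end of an array.
    static String arrayBounds() {
        return check(() -> { int[] a = new int[2]; a[2] = 1; });
    }

    // Allocating an array with a negative size.
    static String negativeSize() {
        return check(() -> { int size = -1; int[] a = new int[size]; });
    }

    public static void main(String[] args) {
        System.out.println(nullReference());
        System.out.println(arrayBounds());
        System.out.println(negativeSize());
    }
}
```

A compiler for a language with these checks has to emit the equivalent tests itself in the generated code, since the target machine will not perform them.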
9
Good News
No “static” modifier
No interfaces
No method overloading
(but still allow overriding)
No exceptions
No packages
No multiple files to handle
10
Better News
Project to be implemented in Java
The Turkish Coffee language is still rich enough to do interesting things
11
Jumping into the water
/** Sort the array a[] in ascending order
 ** using an insertion sort.
 */
void sort(int a[], int size) {
    for (int i = 1; i < size; i++) {
        // a[0..i-1] is sorted
        // insert a[i] in the proper place
        int x = a[i];
        int j;
        for (j = i-1; j >= 0; --j) {
            if (a[j] <= x)
                break;
            a[j+1] = a[j];
        }
        // now a[0..j] are all <= x
        // and a[j+2..i] are > x
        a[j+1] = x;
    }
} // sort
12
Jumping into the water
class HelloTest {
    public static void main(String[] args) {
        Hello greeter = new Hello();
        greeter.speak();
    }
}
class Hello {
    void speak() {
        System.out.println("I know Java, really!");
    }
}
(see http://www.cs.wisc.edu/~solomon/cs537/java-tutorial.html)
13
Jumping into the water
class Pair { int x, y; }
C++                  Java
Pair origin;         Pair origin = new Pair();
Pair *p, *q, *r;     Pair p, q, r;
origin.x = 0;        origin.x = 0;
p = new Pair;        p = new Pair();
p->y = 5;            p.y = 5;
q = p;               q = p;
r = &origin;         N/A
(see http://www.cs.wisc.edu/~solomon/cs537/java-tutorial.html)
14
Jumping into the water
p = new Pair();
// ...
q = p;
// ...
delete p;
q -> x = 5; // oops!
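For contrast, here is a minimal Java sketch of the same aliasing scenario as the C++ fragment above. Java has no delete: an object is reclaimed by the garbage collector only once nothing refers to it, so the write through q stays valid. AliasDemo is a made-up name; Pair is the class from the earlier slide.

```java
class AliasDemo {
    static int run() {
        Pair p = new Pair();
        Pair q = p;   // q and p now refer to the same object
        p = null;     // dropping one reference does not free the object
        q.x = 5;      // safe: the object is still reachable through q
        return q.x;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

class Pair { int x, y; }
```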
15
Jumping into the water
Download recent SDK
Download JLex
Download javaCup
Use of Eclipse is recommended
Java
On-line tutorial
Books (e.g., Java Tutorial 519.836)
• Bruce Eckel’s Thinking in Java
16
Lexical Analysis with JLex
JLex – lexical analyzer generator
Input: spec file
Output: a lexical analyzer
A Java program
spec → JLex → .java → javac → Lexical analyzer
text → Lexical analyzer → tokens
17
JLex Spec File
User code
Copied directly to Java file
Possible source of javac errors down the road
%%
JLex directives
Define macros, state names
e.g., DIGIT= [0-9], LETTER= [a-zA-Z], YYINITIAL
%%
Lexical analysis rules
Optional state, regular expression, action
How to break input to tokens
Action when token matched
e.g., {LETTER}({LETTER}|{DIGIT})*
18
User Code
package TC.Lexer;
import TC.Error.*;
import TC.Parser.sym;
…
any lexer-helper Java code
…
19
JLex Directives
Directives - control JLex internals
• %char
• %line
• %class class-name
• %cup
State definitions
%state state-name
Macro definitions
Macro-name = regex
20
Regular Expressions
$        end of a line
.        (dot) any character except the newline
"..."    quoted string: characters are matched literally, ignoring any special meaning
{name}   macro expansion
*        zero or more repetitions
+        one or more repetitions
?        zero or one repetition
(...)    grouping within regular expressions
[...]    class of characters: any one character enclosed in the brackets
a-b      range of characters (inside a class)
[^...]   negated class: any one character not enclosed in the brackets
21
Example Macros
ALPHA=[A-Za-z_]
DIGIT=[0-9]
ALPHA_NUMERIC={ALPHA}|{DIGIT}
IDENT={ALPHA}({ALPHA_NUMERIC})*
NUMBER=({DIGIT})+
WHITE_SPACE=([\ \n\r\t\f])+
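To experiment with these macros outside JLex, they can be approximated with java.util.regex. This is only a sketch: Java's regex syntax is close to JLex's for these simple patterns but is not the same language, and the class name MacroDemo and its helper methods are made up for illustration. Macro expansion is imitated by plain string concatenation.

```java
import java.util.regex.Pattern;

class MacroDemo {
    // Each constant mirrors one macro from the slide.
    static final String ALPHA = "[A-Za-z_]";
    static final String DIGIT = "[0-9]";
    // Parenthesized so that the alternation stays grouped when it is reused.
    static final String ALPHA_NUMERIC = "(?:" + ALPHA + "|" + DIGIT + ")";
    static final Pattern IDENT = Pattern.compile(ALPHA + ALPHA_NUMERIC + "*");
    static final Pattern NUMBER = Pattern.compile(DIGIT + "+");
    static final Pattern WHITE_SPACE = Pattern.compile("[ \\n\\r\\t\\f]+");

    static boolean isIdent(String s) { return IDENT.matcher(s).matches(); }
    static boolean isNumber(String s) { return NUMBER.matcher(s).matches(); }
    static boolean isWhiteSpace(String s) { return WHITE_SPACE.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"x1", "_tmp", "1x", "042"}) {
            System.out.println(s + " ident=" + isIdent(s) + " number=" + isNumber(s));
        }
    }
}
```

Note that an identifier may start with an underscore but not with a digit, which follows directly from IDENT beginning with {ALPHA}.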
22
Lexical Analysis Rules
Rule structure
[states] regexp { action }
Priority goes to the rule matching the longest string
More than one match of the same length: priority goes to the rule appearing first!
Important: rules given in a JLex specification should match all possible input!
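The two priority rules above (longest match first, then earliest rule) can be imitated by hand. The sketch below is not how JLex works internally (JLex generates a finite automaton); it simply tries every rule at the current position with java.util.regex. The rule set, the token names and the class LongestMatchDemo are assumptions made for the example, and for these simple patterns a greedy regex match coincides with the longest match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Rules in specification order: {token name, regular expression}.
    static final String[][] RULES = {
        {"IF", "if"},
        {"IDENT", "[A-Za-z_][A-Za-z_0-9]*"},
        {"NUMBER", "[0-9]+"},
        {"WHITE_SPACE", "[ \\n\\r\\t\\f]+"},
        {"ERROR", "(?s)."}   // catch-all, so that every input is matched by some rule
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (String[] rule : RULES) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                // The strict '>' keeps the earlier rule when two lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule[0];
                }
            }
            if (bestName == null) {
                throw new IllegalStateException("no rule matches at offset " + pos);
            }
            if (!bestName.equals("WHITE_SPACE")) {
                tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if iffy 42"));
    }
}
```

On the input "if iffy 42", both IF and IDENT match "if" with length 2, so the earlier rule IF wins; for "iffy" IDENT matches 4 characters and wins on length.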
23
Action Body
Java code
Can use special methods and vars
yytext()
yyline,yychar (when enabled)
Lexer state transition
yybegin(state-name)
YYINITIAL
24
More on Lexer States
Tokenize differently according to context
// this condition checks if x > y
if (x>y) {…
}
Example
“if” is a keyword token when in program text
“if” is part of comment text when inside a comment
25
<YYINITIAL> {NUMBER} {
    return new Symbol(sym.NUMBER, new Token(yytext(), yyline, yychar));
}
<YYINITIAL> {WHITE_SPACE} { }
<YYINITIAL> "+" {
    return new Symbol(sym.PLUS, new Token(yytext(), yyline, yychar));
}
<YYINITIAL> "-" {
    return new Symbol(sym.MINUS, new Token(yytext(), yyline, yychar));
}
<YYINITIAL> "*" {
    return new Symbol(sym.TIMES, new Token(yytext(), yyline, yychar));
}
...
<YYINITIAL> "//" { yybegin(COMMENTS); }
<COMMENTS> [^\n] { }
<COMMENTS> [\n] { yybegin(YYINITIAL); }
<YYINITIAL> . { return new Symbol(sym.error, null); }
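The YYINITIAL/COMMENTS switching in the rules above can be mimicked with a small hand-written scanner, which may help in seeing what yybegin does. This is an illustrative sketch only, not the code JLex generates; the class LexerStateDemo and the method countIfKeywords are made-up names. It counts how often "if" occurs as a keyword, ignoring any "if" inside a // comment.

```java
class LexerStateDemo {
    enum State { YYINITIAL, COMMENTS }

    static int countIfKeywords(String input) {
        State state = State.YYINITIAL;
        int count = 0;
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (state == State.COMMENTS) {
                // Inside a comment everything is skipped; a newline ends the comment.
                if (c == '\n') {
                    state = State.YYINITIAL;   // plays the role of yybegin(YYINITIAL)
                }
                pos++;
            } else if (input.startsWith("//", pos)) {
                state = State.COMMENTS;        // plays the role of yybegin(COMMENTS)
                pos += 2;
            } else if (Character.isLetter(c) || c == '_') {
                // Scan a whole identifier-like word, then test whether it is the keyword.
                int start = pos;
                while (pos < input.length()
                        && (Character.isLetterOrDigit(input.charAt(pos)) || input.charAt(pos) == '_')) {
                    pos++;
                }
                if (input.substring(start, pos).equals("if")) {
                    count++;
                }
            } else {
                pos++;   // any other character is ignored in this sketch
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String program = "// this condition checks if x > y\nif (x>y) {\n}";
        System.out.println(countIfKeywords(program));
    }
}
```

For the program text from the earlier slide, only the "if" on the second line is counted; the one inside the comment is consumed while the scanner is in the COMMENTS state.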
26
Putting it all together –
count number of lines
File: lineCount
import java_cup.runtime.*;
%%
%cup
%{
  private int lineCounter = 0;
%}
%eofval{
  System.out.println("line number=" + lineCounter);
  return new Symbol(sym.EOF);
%eofval}
NEWLINE=\n
%%
<YYINITIAL>{NEWLINE} {
  lineCounter++;
}
<YYINITIAL>[^\n] { }
27
Putting it all together –
count number of lines
lineCount → JLex (java JLex.Main lineCount) → lineCount.java
lineCount.java, Main.java, sym.java → javac (javac *.java) → Lexical analyzer
text → Lexical analyzer → tokens
JLex and javaCup must be in the CLASSPATH
28
Running the Lexer
import java.io.*;
import java_cup.runtime.*;
public class Main {
    public static void main(String[] args) {
        Symbol currToken;
        try {
            FileReader txtFile = new FileReader(args[0]);
            Yylex scanner = new Yylex(txtFile);
            do {
                currToken = scanner.next_token();
                // do something with currToken
            } while (currToken.sym != sym.EOF);
        } catch (Exception e) {
            // keep the original exception as the cause so the failure can be diagnosed
            throw new RuntimeException("IO Error (brutal exit)", e);
        }
    }
}
(Just for testing Lexer as stand-alone program)
29
sym.java File
public class sym {
    public static final int EOF = 0;
    …
}
• Defines symbol constant ids
• Tells the parser which token the lexer returned
• Actual value doesn’t matter
• In the future, will be generated by javaCup
30
Common Pitfalls
Classpath
Path to executable
Define environment variables
JAVA_HOME
CLASSPATH
Note the use of . (dot) as part of package name / directory structure
e.g., JLex.Main
31
Assignment 1
class Token
At least - id, value, line
Should extend java_cup.runtime.Symbol
Numeric token Ids in sym.java
• Will be later generated by javaCup
class Compiler
class LexicalError
Don’t forget to regenerate the Lexer and recompile the Java code whenever you change the spec
You need to download and install both JLex and javaCup
32
Token Class
import java_cup.runtime.Symbol;
public class Token extends Symbol {
    public int id;
    public Object value;
    …
}
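The slide leaves the constructor and remaining members of Token open. One possible shape is sketched below, under stated assumptions: it reuses the sym and value fields that java_cup.runtime.Symbol already declares instead of adding new ones, and adds only the line number. So that the sketch compiles without CUP on the classpath, it includes a stand-in Symbol class with the same public fields and (int, Object) constructor; in the project that stand-in would be dropped in favour of the real import. The accessor names and TokenDemo are made up for illustration.

```java
class TokenDemo {
    public static void main(String[] args) {
        Token t = new Token(3, "42", 7);
        System.out.println(t);
    }
}

// Hypothetical Token layout: id and value live in the inherited fields, line is added here.
class Token extends Symbol {
    private final int line;

    public Token(int id, Object value, int line) {
        super(id, value);
        this.line = line;
    }

    public int getId() { return sym; }
    public int getLine() { return line; }

    @Override
    public String toString() {
        return line + ": " + sym + "(" + value + ")";
    }
}

// Stand-in for java_cup.runtime.Symbol, for this sketch only.
class Symbol {
    public int sym;
    public Object value;

    public Symbol(int id, Object value) {
        this.sym = id;
        this.value = value;
    }
}
```

Keeping the id in the inherited sym field means the parser generated by javaCup sees the same token id that the lexer assigned, without any copying.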
33
(some of the) JLex directives to be used
%cup (integrate with cup)
%line (count lines)
%char (count chars)
%type Token (pass type Token)
%class Lexer (gen. lexer class)
34
http://www.cs.tau.ac.il/~yahave
35