Input Buffering
We often have to look one or more characters beyond the next lexeme before we can be sure we have
the right lexeme.
As characters are read from left to right, each character is stored in the buffer to form a meaningful
token. We introduce a two-buffer scheme that handles large lookaheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs:
A buffer is divided into two N-character halves, as shown below.
Each buffer is of the same size N, and N is usually the number of characters on one disk block. E.g.,
1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character, represented by eof, marks
the end of the source file.
SPECIFICATION OF TOKENS:
Regular expressions are notation for specifying patterns.
Each pattern matches a set of strings.
Regular expressions will serve as names for sets of strings.
Strings and Languages
A string means a finite sequence of symbols. For example:
computer ( c, o, m, p, u, t, e, r)
CS6660 ( C, S, 6, 6, 6, 0)
101001 (1, 0)
Symbols are drawn from an alphabet. The term alphabet or character class denotes any finite set of symbols; e.g., the set {0, 1} is the binary alphabet.
The terms sentence and word are often used as synonyms for the term string.
The length of a string s, written |s|, is the number of occurrences of symbols in s.
The empty string, denoted by ε, has length zero.
The term language denotes any set of strings over some fixed alphabet.
Operations on Languages:
There are several operations that can be applied to languages:
Definitions of operations on languages L and M:
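For reference, the standard definitions of these operations (developed in more detail later in this material) are:
Union: L ∪ M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = L⁰ ∪ L¹ ∪ L² ∪ …, where L⁰ = {∈}
Positive closure: L⁺ = L¹ ∪ L² ∪ L³ ∪ …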
Regular Expressions:
Regular expressions allow us to define the sets of strings that form tokens.
For example, a Pascal identifier is formed by a letter followed by zero or more letters or digits:
letter ( letter | digit )*
A regular expression is formed using a set of defining rules.
Each regular expression r denotes a language L(r).
Order of evaluation of regular expressions:
As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs
of parentheses if we adopt the conventions that:
The unary operator * has highest precedence and is left associative.
Concatenation has second highest precedence and is left associative.
| has lowest precedence and is left associative.
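Under these conventions, for example, the regular expression (a) | ((b)*(c)) may be written simply as a | b*c; both denote the set of strings that are either a single a or zero or more b's followed by a c.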
RECOGNITION OF TOKENS:
We learn how to express pattern using regular expressions. Now, we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix
that is a lexeme matching one of the patterns.
Grammar for branching statements:
The terminals of the grammar, which are if, then, else, relop, id, and num, are the names of tokens as far as the lexical analyzer is concerned; the patterns for these tokens are described using regular definitions.
Advantages of Input Buffering
It usually takes just one test per character to determine whether the forward pointer has reached an eof (sentinel).
Further tests are run only when the sentinel is actually reached, i.e., at the end of a buffer half or at the end of the input.
Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.
Disadvantages of Input Buffering
This approach works effectively most of the time; however, the amount of lookahead is restricted.
In circumstances where the forward pointer must travel a distance greater than the buffer length, this restricted lookahead may make it difficult to detect tokens.
For example, in a language such as PL/I, the analyzer must wait until it sees the character that follows the right parenthesis before it can decide whether DECLARE is a keyword or an array name.
Introduction
Input buffering is an important technique in compiler design that helps to improve performance and reduce overhead, and it must be used carefully and appropriately to avoid problems.
To understand the input buffering in compiler design in detail, a few terms need to be understood. We will
discuss those terms before moving to input buffering in compiler design.
Lexical Analyser
The lexical analyzer's main purpose is to read the source program's input characters, arrange them into lexemes,
and output a sequence of tokens for each lexeme in the source program.
When the lexical analyzer finds a lexeme that forms an identifier, it must add it to the symbol table.
The lexical analyzer not only recognizes lexemes but also does pre-processing on the source text, such as
deleting comments and white spaces.
Lexeme
A lexeme is a sequence of characters in the source program that fits the pattern for a token and is recognized as
an instance of that token by the lexical analyzer.
Token
A Token is a pair that consists of a token name and a value for an optional attribute.
The token name is a symbol that denotes the type of lexical unit.
Lexeme    Token
= EQUAL_OP
* MULT_OP
, COMMA
( LEFT_PAREN
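For example, for a lexeme such as x the token could be written as the pair <id, pointer to the symbol-table entry for x>, and for the lexeme 10 as <number, 10>; the token names here are illustrative, and the attribute component is optional.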
Pattern
A pattern is a description of the various forms that a token's lexemes can take. The pattern for a keyword as a token is just the sequence of characters that forms the keyword itself.
One or more characters beyond the next lexeme may have to be examined to guarantee that the correct lexeme is identified.
Using sentinels to mark the buffer end is a widely used technique for speeding up the lexical analyzer, since it saves time checking for the ends of buffers while moving through the characters.
Buffer Pairs
It consists of two buffers, each of which has an N-character size and is alternately reloaded.
There are two pointers: lexemeBegin and forward.
lexemeBegin marks the start of the current lexeme, whose extent is yet to be determined.
forward scans ahead until it finds a match for a pattern.
When a lexeme is discovered, forward is set to the character at its right end; once the lexeme has been processed, lexemeBegin is set to the character immediately after the newly discovered lexeme.
The collection of characters between two points is the current lexeme.
Sentinels
Without sentinels, a check must be performed each time the forward pointer is moved, to guarantee that we have not moved off one half of the buffer; if we have, then the other half must be reloaded.
As a result, each advance of the forward pointer would require two tests: Test 1 checks for the end of a buffer half, and Test 2 determines which character has been read.
By extending each buffer half by one position to hold a sentinel character at its end, the sentinel reduces these two checks to one.
The sentinel is a special character that cannot be part of the source program. (The eof character serves as a natural sentinel, since it is also needed to mark the end of the input.)
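The following is a minimal C sketch of how this single sentinel test might look. It is illustrative only: the buffer halves buf1 and buf2, the fill_buffer() and next_char() helpers, and the use of '\0' as a stand-in for the eof sentinel are assumptions made for the example, not something prescribed by the text above.

#include <stdio.h>

#define N 4096            /* size of each buffer half                              */
#define SENTINEL '\0'     /* stand-in sentinel; assumes '\0' never occurs in text  */

static char buf1[N + 1], buf2[N + 1];   /* each half has one extra slot for the sentinel */
static char *forward;                   /* the forward (lookahead) pointer               */

/* Read up to N characters into one half and place the sentinel after them. */
static void fill_buffer(FILE *src, char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;                 /* marks end of the half, or end of the input */
}

/* Advance forward by one character; only when the sentinel is seen do we
   perform the extra tests to decide whether a half must be reloaded. */
static char next_char(FILE *src) {
    char c = *forward++;
    if (c == SENTINEL) {                            /* the single test per character          */
        if (forward == buf1 + N + 1) {              /* sentinel at the end of the first half  */
            fill_buffer(src, buf2);
            forward = buf2;
            c = *forward++;
        } else if (forward == buf2 + N + 1) {       /* sentinel at the end of the second half */
            fill_buffer(src, buf1);
            forward = buf1;
            c = *forward++;
        }
        /* otherwise the sentinel lies inside a half: real end of input */
    }
    return c;
}

int main(void) {
    long total = 0;
    fill_buffer(stdin, buf1);           /* prime the first half before scanning */
    forward = buf1;
    while (next_char(stdin) != SENTINEL)
        total++;                        /* a real scanner would run its pattern matching here */
    printf("read %ld characters\n", total);
    return 0;
}

With this arrangement, the common case costs a single comparison per character; the extra end-of-buffer tests run only when the sentinel is actually encountered.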
What is Input Buffering in compiler design?
Without buffering, the lexical analyzer would have to visit secondary memory each time it needs the next character, which takes a long time and is costly. As a result, the input strings are buffered before being examined by the lexical analyzer.
Lexical analysis reads the input string one character at a time from left to right to detect tokens.
To scan tokens, it employs two pointers.
The Begin Pointer (bptr) is a pointer that points to the start of the string to be read.
The Look Ahead Pointer (lptr) moves ahead in search of the end of the token.
Sample Example
Example: For the statement int a,b;
Both pointers begin at the start of the string that is saved in the buffer.
The Look Ahead Pointer examines the buffer until it finds the token.
Before the token ("int") can be identified, the character ("blank space") beyond the token ("int") must be
checked.
Both pointers will be set to the next token ('a') after processing the token ("int"), and this procedure will be
continued throughout the program.
The buffer is divided into two halves. When the Look Ahead pointer moves past the halfway point, the second half is filled with fresh characters to read. When it moves past the right end of the second half, the first half is refilled with new characters, and so on.
In the context of compiler design, input buffering is a crucial technique that optimizes the process of reading
and processing source code (or other input) efficiently. A compiler typically works by reading source code and
breaking it down into tokens, which are then analyzed and processed to generate the output. Efficiently reading
the input stream, especially when dealing with large programs, is important to avoid performance bottlenecks.
In a compiler, input buffering refers to the process of reading the source code (or input) in a way that
minimizes the number of I/O operations, making the lexical analysis (tokenization) phase more efficient.
Input buffering is especially important in the lexical analysis phase, where large amounts of source code are
read and tokenized.
Performance: Reading each character individually (e.g., via system calls or directly from the input
stream) is much slower than reading chunks of data at once.
Minimizing I/O Operations: Accessing memory or disk is relatively slow, so minimizing the frequency
of these accesses is essential for efficient processing.
Buffering Reduces Overhead: By reading data into a buffer and working with it in chunks, the
compiler can reduce the overhead of I/O operations, making the lexical analysis more efficient.
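As a rough, illustrative calculation: with a 4096-byte buffer, a 1 MB source file requires only about 1,048,576 / 4096 ≈ 256 buffer refills, compared with roughly a million individual reads if every character triggered its own read operation.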
1. Double Buffering:
o Double buffering is a common technique used in input buffering. It involves maintaining two
buffers of input data.
o One buffer is used to feed the lexical analyzer (scanner) while the other buffer is being filled
with more input data. When one buffer is exhausted, the roles of the buffers are switched.
o This allows the compiler to continuously read the input without needing to wait for new data,
thereby improving speed.
2. Buffering with a Lookahead Mechanism:
o Lexical analyzers often use lookahead techniques to help identify tokens more efficiently. For
example, if a language construct requires looking ahead at one or more characters (e.g.,
distinguishing between if and identifier), the buffer can store additional characters in advance to
aid this lookahead.
o A typical technique is to use a fixed-size buffer and implement a lookahead pointer that moves
across the buffer to identify the next token.
3. Buffer Size and Chunking:
o A common practice is to read in chunks (often 1024 or 4096 bytes) from the input file into the
buffer.
o The lexical analyzer then processes these chunks one character at a time, moving through the
buffer, until it reaches the end, at which point it refills the buffer with more data from the input
stream.
4. Finite State Machines (FSMs) with Buffering:
o The lexical analyzer can be implemented using a finite state machine (FSM) where each state
corresponds to a specific part of the lexing process. As the FSM moves through its states, it reads
characters from the input buffer and recognizes tokens. This allows it to efficiently handle
tokenization while minimizing the need for frequent I/O operations.
For example, consider the following C statement:
int x = 10;
The steps the lexical analyzer might follow using input buffering could look like this:
1. Buffer Fill: The input is read into a buffer, say BUFFER[1024], with the first few characters (e.g., int x =
10;).
2. Token Identification:
o The lexical analyzer starts reading from the buffer one character at a time.
o It first identifies the token int, then skips whitespace to identify x, and so on.
3. Buffer Refill: When the lexical analyzer reaches the end of the buffer, it refills it with new data from the
input, and the process continues.
Buffer 1 holds the first portion of the input (say, the first 1024 characters).
Buffer 2 is initially empty. As the lexical analyzer consumes characters from Buffer 1, Buffer 2 is filled
with more input data (typically the next 1024 characters).
When Buffer 1 is exhausted, the analyzer switches to Buffer 2 and starts consuming from it, while
Buffer 1 is refilled.
This technique ensures that the lexer can continuously work without waiting for more data.
1. Efficiency: Reduces the number of system calls and input/output operations, leading to faster
compilation times.
2. Smooth Tokenization: Helps the lexical analyzer work efficiently by ensuring that there is always data
available for tokenization, avoiding delays when input is being read from slower sources like files or
networks.
3. Lookahead Capability: Enables the compiler to use techniques like lookahead without slowing down
input handling.
Conclusion:
In compiler design, input buffering is a critical optimization technique that helps the lexical analyzer (scanner)
process source code efficiently. It reduces the overhead of frequent I/O operations, speeds up tokenization, and
provides a smooth flow of input data for further processing. Through methods like double buffering and
chunking, the compiler can efficiently handle large amounts of source code and deliver high performance
during the compilation process.
Pros of Input Buffering in Compiler Design:
1. Improved Performance:
o Faster Input Handling: By reducing the number of I/O operations (such as system calls to read
one character at a time), input buffering enables the lexical analyzer to process larger chunks of
input at once. This significantly speeds up the process of tokenizing the source code.
o Reduced Latency: Since data is read into the buffer in larger chunks, the compiler can
immediately start processing the data, reducing the wait time between reading the input and
performing analysis.
2. Optimized Use of Resources:
o Efficient Memory Use: Input buffering allows the compiler to store multiple characters in a
buffer at once, rather than performing frequent I/O operations, which helps in better utilizing
memory resources.
o CPU Efficiency: By allowing the CPU to work on processing tokens while waiting for more
input to be buffered, the overall efficiency of the compilation process is improved.
3. Continuous Data Flow:
o Prevents I/O Bottlenecks: Input buffering allows the lexical analyzer to have a steady stream of
input data without the need to frequently pause and refill from the input source (file, network,
etc.). This is especially important for large programs, where pausing for every token or character
could lead to significant performance hits.
4. Enables Lookahead Techniques:
o Lookahead for Tokenization: Buffers can store extra characters that are useful for lookahead
operations. In lexical analysis, lookahead is often required to distinguish between similar tokens
(e.g., distinguishing if from an identifier). Input buffering facilitates this lookahead by holding
additional characters in memory.
5. Minimized Disk/Network Access:
o Less Frequent Disk Access: When reading from a file or network, input buffering reduces the
number of accesses to slower storage media. This minimizes the cost of accessing data from disk
or remote servers, which can be significantly slower than working with data already loaded into
memory.
Cons of Input Buffering in Compiler Design:
1. Increased Complexity:
o Buffer Management Overhead: Implementing and managing buffering schemes, such as
double buffering or handling lookahead buffers, adds complexity to the design of the lexical
analyzer. It may require additional logic to handle cases like buffer exhaustion, switching
between buffers, or dealing with incomplete tokens at the buffer boundary.
o Error Handling Complexity: When errors occur near the boundary of buffers (e.g., at the end
of one buffer or the beginning of another), handling them appropriately can be tricky. For
example, if an incomplete token is encountered at the end of one buffer, it must be carefully
processed when the next buffer is read in.
2. Memory Overhead:
o Increased Memory Usage: Input buffers require additional memory to store the buffered data,
and this can be a concern in systems with limited memory resources. While the memory
requirement is typically small (a few kilobytes to megabytes), it can still become significant in
embedded systems or low-memory environments.
o Buffer Size Trade-Offs: The buffer size needs to be tuned for optimal performance. A larger
buffer may improve performance but increase memory consumption. A smaller buffer may
reduce memory usage but may result in more frequent buffer refills, potentially negating the
benefits of buffering.
3. Potential Overhead in Small Programs:
o Unnecessary for Small Input: In cases where the source code is small or the program being
compiled is simple, the overhead of buffering may not offer significant performance
improvements. The complexity added by the buffering mechanism might not justify its use if the
input size is small enough that frequent I/O operations are not a concern.
4. Buffering Latency in Certain Scenarios:
o Initial Delay for Buffer Refill: In scenarios where buffers need to be refilled frequently, there
might be a slight delay in the initial phase of reading input (especially if the input is being read
from disk or a network). This latency could cause a temporary slowdown, although it is often
compensated for by the long-term performance benefits of buffering.
o Buffering Overhead: The process of refilling buffers and managing them can add a slight
overhead to the system. While this overhead is generally small, it could become noticeable in
systems where every microsecond counts (e.g., real-time compilers or embedded systems).
5. Complicated Token Recognition at Buffer Boundaries:
o Difficulty at Buffer Edges: When a token is split across two buffers (i.e., part of the token is at
the end of one buffer and the rest is at the beginning of the next), extra care must be taken to
correctly identify the token. This can complicate the logic and make the lexical analyzer more
prone to errors.
Conclusion:
Input buffering is a valuable technique for improving the efficiency of lexical analysis in a compiler, especially
when dealing with large input files. However, it comes with trade-offs in terms of complexity and memory
usage. Whether the benefits outweigh the drawbacks depends on the specific use case, the size of the input, and
the target environment (e.g., embedded systems vs. high-performance servers)
Input buffering in compiler design refers to the technique of reading the source code (or input data) into
memory in chunks (i.e., buffers) rather than reading it character by character. This technique is used to optimize
the performance of the lexical analysis phase of a compiler, reducing the number of I/O operations and speeding
up the tokenization process.
Input buffering is crucial because it helps improve the efficiency of the lexical analyzer (scanner). Without
input buffering, the lexical analyzer would need to read one character at a time from the input, which can be
very slow, especially for large programs. Buffering allows the compiler to work with chunks of data,
significantly speeding up the process and reducing the number of I/O operations.
Improved performance: By reducing I/O operations, input buffering speeds up the process of reading
and tokenizing source code.
Efficient resource usage: It minimizes the overhead of frequently accessing the disk or network.
Continuous data flow: The compiler can continue processing without waiting for more data to be read
from input.
Lookahead capabilities: Buffering supports techniques like lookahead, which are important for
distinguishing between similar tokens (e.g., if vs. an identifier).
4. What are the different types of input buffering techniques used in compilers?
Single Buffering: A single buffer is used to hold a chunk of input. When it is exhausted, the next chunk
is read into the buffer.
Double Buffering: Two buffers are used. While one buffer is being processed, the other is being filled
with new data. This allows continuous processing without waiting for new data to be read in.
Circular Buffering: A buffer is treated as a circular structure, allowing the analyzer to wrap around and
reuse space, which can be helpful for optimizing memory usage.
Buffer 1 is filled with data from the input source (e.g., file, network).
The lexical analyzer processes characters from Buffer 1.
While Buffer 1 is being processed, Buffer 2 is filled with new data from the input.
When Buffer 1 is exhausted, the analyzer switches to Buffer 2, and the process repeats, ensuring
continuous input processing.
Lookahead refers to the ability of the lexical analyzer to examine future characters in the input stream to decide
how to tokenize the current character. Input buffering helps by holding additional characters in the buffer,
allowing the lexer to look ahead at the next character(s) to make decisions about token boundaries (e.g.,
distinguishing between if and an identifier).
Buffer boundary issues: When a token is split across two buffers, additional logic is required to
properly identify the token when it crosses buffer boundaries.
Memory overhead: Buffers consume memory, and larger buffers may require more memory resources.
This can be a concern for systems with limited memory.
Complexity: Managing buffers, especially when implementing techniques like double buffering or
circular buffering, increases the complexity of the lexical analyzer and the overall compiler design.
8. Does input buffering work the same way for all types of input sources?
Not necessarily. Input buffering is typically used for files or streams where the input is sequential and can be
read in chunks. However, if the input source involves real-time data (such as interactive input from a user or
live data streams), the buffering strategy might need to be adjusted to accommodate those constraints. For
example, buffering might be less relevant when the input is being processed in real-time.
For small programs or when the source code is relatively small, the overhead of setting up input buffering might
not provide a noticeable performance benefit. In such cases, the compiler may read the input character by
character without significant impact on performance. However, for larger projects, input buffering is essential
for efficiency.
10. What are the trade-offs when choosing buffer sizes for input buffering?
Larger buffers: Larger buffers can lead to fewer I/O operations and more efficient processing, but they
consume more memory.
Smaller buffers: Smaller buffers consume less memory but result in more frequent buffer refills, which
can reduce performance. Finding an optimal buffer size is often a balance between memory usage and
performance.
While input buffering generally improves performance, it can cause issues in specific scenarios:
Memory Usage: For very large programs or systems with limited memory, large buffers might lead to
excessive memory consumption.
Latency: If the buffer is large, it might take longer to fill the buffer, which could cause delays before
processing begins, although this is usually compensated by long-term performance gains.
Different compilers and languages implement input buffering in various ways. For instance:
In C/C++, standard I/O functions like fgetc(), fgets(), or getc() work with buffered streams by
default.
In Python, the built-in input() function uses input buffering, while file operations like open() can use
buffered or unbuffered modes.
In Lex/Flex (a tool for lexical analysis), input buffering is handled explicitly through buffer size
configuration and built-in buffer management.
While input buffering is highly beneficial for performance, it is not always necessary. For very small programs
or in environments where resources are limited (e.g., embedded systems), it might be acceptable to read input
directly without buffering. However, for most real-world compilers, especially those processing large
codebases, input buffering is crucial.
Conclusion
Input buffering in compiler design is a powerful optimization that significantly improves the speed and
efficiency of the lexical analysis phase. It helps minimize I/O operations and supports more efficient processing
of source code. However, it does introduce some complexity, memory usage concerns, and boundary handling
challenges that need to be carefully managed. The trade-offs and implementation choices largely depend on the
size of the program, the target environment, and the performance requirements of the compiler.
Specification of Tokens
Specification of tokens depends on the pattern of the lexeme. Here we will
be using regular expressions to specify the different types of patterns that
can actually form tokens.
Although regular expressions cannot specify every pattern that forms a token, they can describe almost all of the token patterns that occur in practice.
There are 3 specifications of tokens:
1. String
2. Language
3. Regular Expression
1. String
A string is a finite sequence of symbols. Several operations can be performed on strings; some of them are listed below.
2. Suffix of String
A suffix of a string consists of trailing symbols of the string; the empty string and the string itself also count as suffixes.
For Example: s = abcd
Suffix of the string abcd: ∈, d, cd, bcd, abcd
5. Substring of String
A substring of a string is obtained by deleting any prefix and any suffix from the string. For example, bc is a substring of abcd.
7. Subsequence of String
The subsequence of the string is obtained by eliminating zero or more (not
necessarily consecutive) symbols from the string.
8. Concatenation of String
If s and t are two strings, then st denotes concatenation.
If s = abc and t = def, then st = abcdef.
1. Union
Union is the most common set operation. Consider the two languages L
and M. Then the union of these two languages is denoted by:
L ∪ M = { s | s is in L or s is in M}
That means the string s from the union of two languages can either be
from language L or from language M.
If L = {a, b} and M = {c, d} Then L ∪ M = {a, b, c, d}
2. Concatenation
Concatenation links a string from one language to a string of another language in series, in all possible ways. The concatenation of two languages L and M is denoted by:
LM = { st | s is in L and t is in M }
If L = {a, b} and M = {c, d}, then LM = {ac, ad, bc, bd}.
3. Kleene Closure
The Kleene closure of a language L provides a set of strings obtained by concatenating L zero or more times. It is denoted by:
L* = L⁰ ∪ L¹ ∪ L² ∪ …, where L⁰ = {∈}
4. Positive Closure
The positive closure of a language L provides a set of strings obtained by concatenating L one or more times. It is denoted by:
L⁺ = L¹ ∪ L² ∪ L³ ∪ …
It is similar to the Kleene closure, except that the term L⁰ is excluded; i.e., ∈ is not in L⁺ unless ∈ is in L itself.
So, these are the four operations that can be performed on the languages in
the lexical analysis phase.
3. Regular Expression
A regular expression is a sequence of symbols used to specify lexeme
patterns. A regular expression is helpful in describing the languages that
can be built using operators such as union, concatenation, and closure over
the symbols.
The following rules define the regular expression over some alphabet Σ
and the languages denoted by these regular expressions.
1. ∈ is a regular expression that denotes the language L(∈) = {∈}, i.e., the language containing only the empty string.
2. If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a
language L(a). The language L(a) = {a} i.e. the language has only one string
of length one and the string holds ‘a’ in the first position.
3. If r and s are regular expressions denoting the languages L(r) and L(s), then:
(r) | (s) is a regular expression denoting L(r) ∪ L(s),
(r)(s) is a regular expression denoting L(r)L(s),
(r)* is a regular expression denoting (L(r))*, and
(r) is a regular expression denoting L(r).
Regular Definitions
For notational convenience, we may give names to regular expressions and use those names in later expressions. A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
………
dn → rn
Each di is a distinct name.
Each ri is a regular expression over the alphabet Σ U {d1, d2,. . . , di-1}.
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
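Using these names, for example, an identifier can then be defined as id → letter ( letter | digit )*.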
Recognition of Tokens
Tokens can be recognized with the help of finite automata, which can be represented in two ways:
1. Transition Table
2. Transition Diagram
1. Transition Table
It is a tabular representation that lists all possible transitions for each state and input symbol
combination.
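For instance, a transition table for the identifier pattern letter ( letter | digit )* might look like the following sketch (the state numbering is invented for the example):

State    letter    digit    other
0        1         -        -
1        1         1        2*

Here state 0 is the start state, state 1 is reached once a letter has been read, and state 2 is the accepting state; the * indicates that the last character read is not part of the lexeme, so the forward pointer must retract one position.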
EXAMPLE
Assume the following grammar fragment to generate a specific language
where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions,
and where letter and digit are defined as letter → [A-Za-z] and digit → [0-9].
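For reference, a standard version of this grammar fragment and of its regular definitions (following the classic dragon-book formulation; they are referred to above but not reproduced there) is:

stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num

letter → [A-Za-z]
digit → [0-9]
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
relop → < | <= | = | <> | > | >=
if → if
then → then
else → else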
For this language, the lexical analyzer will recognize the keywords if, then, and else, as
well as lexemes that match the patterns for relop, id, and number.
To simplify matters, we make the common assumption that keywords are also reserved
words: that is they cannot be used as identifiers.
The num represents the unsigned integer and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of nonnull
sequences of blanks, tabs, and newlines.
Our lexical analyzer will strip out white space. It will do so by comparing a string against the regular definition ws; in the standard formulation, ws → ( blank | tab | newline )+.
If a match for ws is found, the lexical analyzer does not return a token to the parser.
It is the following token that gets returned to the parser.
2. Transition Diagram
It is a directed labeled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.
Components of Transition Diagram
1. One state is labelled the Start State. It is the initial state of the transition diagram, where control resides before any input symbols have been read.
Here is the Finite Automata Transition Diagram for the Identifiers and Keywords.
Here is the Finite Automata Transition Diagram for recognizing white spaces.
Note:
These Finite Automata can be constructed using either the transition diagram or the transition
table representation. Both transition diagrams and transition tables serve the same purpose of
defining and representing the behavior of an FA. They provide different visual and structural
representations, allowing designers to choose the format that best suits their preferences or
requirements.
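As an illustration of how such a diagram can be hand-coded, the following C sketch implements the identifier diagram above over a simple in-memory buffer. The state numbers (9, 10, 11), the helper names, and the sample input are all made up for this example.

#include <ctype.h>
#include <stdio.h>

/* A self-contained sketch: the "input buffer" is just a string here. */
static const char *input = "count1 = 10;";
static int forward;

static int next_char(void) { return (unsigned char)input[forward++]; }
static void retract(void)  { forward--; }

/* States 9, 10 and 11 mirror the identifier transition diagram:
   9 = start, 10 = reading letters/digits, 11 = accept (with one-character retraction). */
static int is_identifier(int start, int *end) {
    int state = 9;
    forward = start;
    for (;;) {
        int c = next_char();
        switch (state) {
        case 9:
            if (isalpha(c)) state = 10;   /* first character must be a letter      */
            else return 0;                /* diagram fails; try another diagram    */
            break;
        case 10:
            if (isalnum(c)) state = 10;   /* stay while letters or digits are read */
            else {                        /* state 11: accept and retract          */
                retract();
                *end = forward;
                return 1;
            }
            break;
        }
    }
}

int main(void) {
    int end;
    if (is_identifier(0, &end))
        printf("identifier lexeme: %.*s\n", end, input);   /* prints "count1" */
    return 0;
}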
Lex in compiler design is a program used to generate scanners or lexical analyzers, also called
tokenizers. These tokenizers identify the lexical pattern in the input program and convert the input text
into the sequence of tokens. It is used with the YACC parser generator.
In this article, we will learn what Lex in compiler design is and its role in the compilation process. We
will also be learning about lexical analysis and terminologies related to it.
What is Lexical Analysis?
Compilation happens in multiple phases, and lexical analysis is the first phase of the compiler. It takes the preprocessed source code, which comes as the preprocessor's output.
The lexical analyzer generates a stream of tokens from the preprocessed source code by removing whitespace and comments, and it generates an error if it encounters an invalid token. It reads the stream of characters, groups them into legal tokens, and passes the data to the syntax analyzer when asked for it.
Uses of lexical analyzer
The following tasks are performed by a lexical analyzer-
The lexical analyzer removes the white spaces and comments from the source program.
It helps correlate error messages from the compiler with the source program.
Helps in the identification of the tokens.
A lexical analyzer reads the input characters from the source code.
Advantages of Lexical Analysis
Lexical analysis has the following advantages-
The lexical analysis allows the browsers to format and display a web page with the help of parsed data.
It is responsible for generating a compiled binary executable code.
The lexical analyzer generates a more efficient and specialized processor for the task.
Disadvantages of Lexical Analysis
Lexical analysis has the following disadvantages-
Additional runtime is required to create the symbol table and construct the token, generating overhead.
Debugging and developing the lexer and token description requires much effort.
A lot of time is required to read the source code and break it into tokens as the analyzer reads the
characters individually.
What is LEX in Compiler Design?
Lex is a tool or program that creates a lexical analyzer and helps us perform the task of lexical
analysis (it converts a character stream into tokens). The Lex tool is itself a compiler: the Lex compiler transforms the patterns given in its input into the code of a lexical analyzer.
Eric Schmidt and Mike Lesk initially developed the code for Lex, which was intended for Unix-based
systems.
Function of Lex
1. Lexical Analyzer Creation: The process begins by creating a program called lex.l using Lex's language. This program defines the rules and patterns for recognizing tokens in the source code.
2. Lex Compiler Execution: The lex.l program is then processed by the Lex compiler. This step generates a C program named lex.yy.c.
3. C Compiler Execution: The C compiler is then used to compile the generated lex.yy.c program. The result is an object program referred to as a.out.
4. Lexical Analysis: The a.out object program is essentially a lexical analyzer. When this program is run, it takes an input stream of source code and transforms it into a sequence of tokens based on the rules defined in the original lex.l program.
LEX Source Program
A LEX source program is a collection of instructions and patterns. These instructions and patterns are
written in the LEX programming language. LEX is a tool used for generating lexical analyzers. It
tokenizes input source code into meaningful units called tokens.
The LEX source program defines how these tokens are recognized and processed. It consists of
regular expressions that describe the patterns of tokens and corresponding actions to be taken when
those patterns are encountered in the source code. This program serves as a set of rules that
instructs LEX on how to break down a stream of characters from the input source code into tokens.
These tokens can represent keywords, identifiers, operators, and other language constructs. There
are two main components of this Lex source program:
1. Auxiliary Definitions: These are often located in the "Definitions Section" of the LEX source program. These are user-defined macros or variables that simplify the expression of regular expressions and actions in the program.
2. Translation Rules: These are commonly found in the "Rules Section" of the LEX source program. They
establish the mapping between patterns and actions. Each translation rule consists of a regular
expression pattern followed by the action to be executed when that pattern is matched in the input source
code
Working of Lex
The working of Lex as a lexical analyzer generator takes place in multiple steps. First, we create a file that describes the lexical analyzer to be generated. This file is written in the Lex language and has a .l extension. The Lex compiler converts this program into a C file called lex.yy.c. The C compiler then compiles this C file into an a.out executable. This a.out file is our working lexical analyzer, which will produce the stream of tokens based on the input text.
Lookahead Operator
An additional character is read by Lex to differentiate between other patterns of a token. This is done
by Lex by reading an extra character ahead of the valid lexeme.
However, sometimes, we want a particular pattern to be matched to the input only when certain other
characters follow it. Then, we may use a slash in a pattern to indicate the end of the part of the
pattern that matches the lexeme.
For example:
The lookahead operator refers to the additional character read by Lex to distinguish different patterns for a token. A lexical analyzer reads one character ahead of a valid lexeme and then retracts to produce a token.
In some languages, keywords are not reserved. So the statements
IF (A, B) = 10 and IF(condition) THEN
Results in conflict about whether to produce IF as an array name or a keyword. To resolve this, the
lex rule for the Keyword IF can be written as,
IF / \( .* \) {letter}
An example implementation of a Lexical Analyser using Lex
The following Lex Program counts the number of words-
Code
%{
#include<stdio.h>
#include<string.h>
int count = 0;
%}
/* Rules Section */
%%
([a-zA-Z0-9])+ { count++; /* count each word */ }
\n { printf("Number of words: %d\n", count); count = 0; /* print and reset the count at each newline */ }
. { ; /* ignore any other character */ }
%%
int yywrap(void){ return 1; }
int main()
{
// The function that starts the analysis
yylex();
return 0;
}
How to execute
Type lex lexfile.l
Type gcc lex.yy.c
Type ./a.exe (or ./a.out, depending on the platform)
Output
Explanation
In the definition section of the program, we have included the standard library for input-output
operations and string operations and a variable count to keep count of the number of words.
In the rules section, we have specified the regular expression ([a-zA-Z0-9])+, which matches any string formed by letters and numeric digits, such as “AYUSHI28”, “BOND007”, etc.
There is a rule for a newline too. When a newline character is encountered, the current count of
words is printed, and the counter is set to zero.
Frequently Asked Questions
How to make a compiler with lex?
Creating a compiler with Lex involves a few key steps. First, we need to define regular expressions
that describe the patterns of tokens in the source code. Then, we need to write a Lex specification file
where you associate these regular expressions with corresponding actions.
What is lex in compiler design with example?
Lex is a lexical analyzer generator tool used in compiler design to generate lexical analyzers. It takes
regular expressions as input and produces code for scanning input streams. For example, in lex,
defining rules to recognize keywords in a programming language.
What is the function of lex?
Lex generates lexical analyzers, which are programs that break input into tokens or lexemes based
on specified patterns. These lexemes serve as input for the parser. The function of lex is to automate
the generation of lexical analyzers from regular expressions.
What is the purpose of the lex tool?
The purpose of the lex tool is to simplify the process of creating lexical analyzers for compilers and
interpreters. It achieves this by allowing developers to specify token recognition patterns using regular
expressions, which lex then translates into efficient, executable code for tokenizing input streams.
What are the three basic sections of the lex program?
The three basic sections of the lex program are:
1. Definition Section: Defines macros and regular expression patterns.
2. Rules Section: Specifies patterns and corresponding actions.
3. C Code Section: Contains user-defined C code to be executed alongside lex-generated code.
What is Lex?
Auxiliary Definition
It denotes regular expressions of the form:
D1 = R1
D2 = R2
…
Dn = Rn
Where
Distinct Names (Di)→ Shortcut name of Regular Expression
Regular Expression (Ri)→ Notation to represent a collection of input symbols.
Example
Auxiliary Definitions for Identifiers −
letter = A | B | … | Z | a | b | … | z
digit = 0 | 1 | … | 9
Translation Rule for a Keyword −
begin {return 1;}
We can see that if the Lexical Analyzer is given the input "begin", it will recognize the token "begin" and the Lexical Analyzer will return 1 as the integer code to the parser.
Translation Rules for "Identifiers"
letter (letter + digit)* {Install ( );return 6}
If Lexical Analyzer is given the token which is an "identifier", then the Action taken by the
Lexical Analyzer is to install or store the name in the symbol table & return value 6 as
integer code to the parser.
In compiler design, Lex (short for Lexical Analyzer) tools are used to perform lexical analysis, which is the first phase of
a compiler. This phase involves reading the source code and breaking it into tokens, such as keywords, identifiers, literals,
and symbols.
🔧 What is Lex?
Lex is a lexical analyzer generator developed in the 1970s. You provide Lex with rules using regular expressions for
recognizing tokens, and it generates C code for a lexical analyzer.
%{
/* Definitions and header code (e.g., includes) */
%}
%%
/* Rules section: the "if" rule comes first so the keyword is not swallowed by the identifier rule */
"if" { printf("IF KEYWORD\n"); }
[0-9]+ { printf("NUMBER\n"); }
[a-zA-Z]+ { printf("IDENTIFIER\n"); }
%%
/* User code (main function etc.) */
int yywrap(void) { return 1; }
int main() {
yylex();
return 0;
}
Lex is often used with Yacc (Yet Another Compiler Compiler), which performs syntax analysis (parsing). Lex handles
tokenization, and Yacc builds a parse tree or abstract syntax tree.
⚙️Alternatives to Lex
✅ Summary
Tool Purpose
Lex Generates lexical analyzers
Flex Faster, improved Lex
Yacc Parser generator
ANTLR Lexer + parser for Java etc.
Advantage | Details
🔧 Automation of Lexer Generation | Saves time by automatically generating code for token recognition based on regex patterns.
⚙️ Integration with Yacc/Bison | Works seamlessly with parser generators like Yacc and Bison, making compiler development smoother.
🚀 Efficiency | Lex-generated scanners are fast and optimized for performance in handling large inputs.
🧩 Regular Expression Support | Makes it easy to define complex token patterns using standard regex syntax.
📦 Modularity | Keeps lexical analysis separate from parsing and other compiler stages, promoting clean design.
🆓 Open Source Versions (like Flex) | Tools like Flex are free, widely supported, and often faster than the original Lex.
Disadvantage | Details
📚 C Language Dependency | Lex outputs C code, so it ties your project to C/C++ unless additional wrappers are used.
🧠 Steep Learning Curve | Beginners may struggle with Lex’s syntax and concepts like regular expressions and states.
🔄 Limited Error Handling | Lex is not ideal for handling complex lexical errors or ambiguous inputs on its own.
💬 Not Very Flexible for Context-sensitive Lexing | Difficult to handle languages where token meaning depends heavily on context.
🧱 Old and Low-level | Compared to modern tools like ANTLR, Lex feels outdated and lower-level.
🧪 Debugging Can Be Tricky | Generated C code can be hard to debug, especially when combined with Yacc/Bison.
📝 Summary
Lex is a lexical analyzer generator that takes a set of token patterns (written as regular expressions) and
produces C code that can recognize those tokens in source code.
Lex generates a function called yylex(), which reads the input stream, matches it against the defined patterns, and
returns tokens to the parser.
%{
C declarations (headers, includes)
%}
%%
Rules section (regex patterns and actions)
%%
User code (main function, helper functions)
Lex uses regular expressions to define token patterns. When input text matches a pattern, Lex executes the
corresponding action (usually C code).
No. Lex is based on finite automata, so it works best with regular languages. For context-sensitive analysis, you usually
pass the token to a parser like Yacc.
Lex generates a C file, usually named lex.yy.c, which contains the yylex() function and supporting code.
Lex handles lexical analysis (tokenizing the input), and Yacc/Bison handles syntax analysis (parsing the token stream).
They work together in compiler front-ends.
The action part is written in C (or C++), which is inserted directly into the generated yylex() function.
Interpreters
Code formatters
Log file analyzers
Text processors
Scripting language front-ends
Only handles regular languages (not suitable for parsing nested structures).
Not good at handling complex, context-sensitive input.
Tied to C/C++ unless wrappers are used.
What is the main purpose of Lex in compiler design?
A) Syntax analysis
B) Semantic analysis
C) Lexical analysis
D) Code optimization
✅ Answer: C) Lexical analysis
3. Lex is most commonly used with which other tool for parsing?
A) Lint
B) GCC
C) ANTLR
D) Yacc
✅ Answer: D) Yacc
4. Which of the following does Lex use to define token patterns?
A) Context-free grammar
B) Backus-Naur Form (BNF)
C) Regular expressions
D) Finite state machines
✅ Answer: C) Regular expressions
LEX is a lexical analyzer generator tool used in compiler design to automatically generate a lexical analyzer (lexer or
scanner) from a set of regular expressions and actions. It's often used in conjunction with YACC (Yet Another Compiler
Compiler) for parsing.
LEX plays a crucial role in the front end of a compiler. Here are its key real-time applications:
LEX is used to break down source code into tokens like keywords, identifiers, operators, literals, etc.
Example:
int x = 10;
LEX identifies:
int → Keyword
x → Identifier
= → Assignment operator
10 → Integer literal
; → Delimiter
2. Syntax Highlighting
LEX-based tools are used in IDEs and editors (like VS Code, Eclipse, etc.) to tokenize code for syntax coloring.
3. Code Preprocessing
LEX can process macros, remove comments, handle conditional compilation (#ifdef, #endif) in C/C++
preprocessors.
4. Source Code Translation
In source-to-source compilers or transpilers, LEX helps read and reformat code for another language, e.g.,
translating Java to C++.
Beyond compilers: LEX is used in creating tools to parse log files, config files, or structured text (like JSON, CSV).
Languages like Python or Shell script interpreters use lexer generators like LEX to scan input code and feed it to
the parser.
When building DSLs (e.g., a language for describing data flows or AI models), LEX is used to implement the lexer
quickly.
LEX can detect lexical errors (invalid tokens or characters) and report them with line and column numbers—vital
for debugging.
Tools that analyze or manipulate code (e.g., static analyzers, linters) often rely on a lexer built with LEX to read
source code.
Answer:
LEX is a lexical analyzer generator used to create programs (lexers or scanners) that recognize tokens in source code
using regular expressions.
Answer:
LEX generates a C program (lex.yy.c) that performs lexical analysis — it reads input text and identifies tokens defined
in the LEX specification file.
Answer:
LEX takes a .l file (LEX source code), processes it to create lex.yy.c, which can then be compiled with gcc or cc to
produce an executable that performs scanning.
Answer:
yytext is a built-in LEX variable that holds the current matched text — the token that the lexer has just recognized.
Answer:
yylex() is the main function generated by LEX. It reads input and returns the next token to the parser or calling
function.
For example, a simple LEX rule that recognizes numbers:
[0-9]+ { printf("NUMBER\n"); }
The general structure of a LEX source file is:
Definitions (optional)
%%
Rules (pattern-action pairs)
%%
User-defined C code (optional)
Answer:
Yes! LEX and YACC are often used together: LEX produces the stream of tokens, and YACC consumes those tokens to perform syntax analysis (parsing).
9. What is lex.yy.c?
Answer:
It’s the C source file automatically generated by LEX after processing your .l file. This file contains the yylex()
function and all the logic to tokenize input.
Answer:
Traditional LEX generates C code. However, there are modern equivalents that target other languages (for example, ANTLR, which generates lexers and parsers for Java and other languages).
Answer:
Any language with well-defined token patterns — C, Java, Python, SQL, even custom DSLs.
Answer:
Yes. You can write rules in LEX to match and either ignore or handle whitespace, comments, and other non-token
content.
Answer:
Yes, especially in academic learning, compiler projects, interpreters, and text processing tools. While more modern
tools exist, LEX concepts remain foundational.
We have seen that a lexical analyzer can identify tokens with the help of regular expressions
and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to
the limitations of the regular expressions. Regular expressions cannot check balancing
tokens, such as parenthesis. Therefore, this phase uses context-free grammar (CFG), which
is recognized by push-down automata.
This implies that every regular grammar is also context-free, but there exist some problems that are beyond the scope of regular grammars. CFG is a helpful tool in describing the syntax of programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce
terminologies used in parsing technology.
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of
strings. The non-terminals define sets of strings that help define the language generated by
the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S); from where the production
begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal
(initially the start symbol) by the right side of a production, for that non-terminal.
Example
We take the problem of the palindrome language, which cannot be described by means of a regular expression. That is, L = { w | w = w^R } (the set of strings that read the same forwards and backwards) is not a regular language. But it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z, Q → N, Q → ε, Z → 0Q0, N → 1Q1 }
S = { Q }
This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101,
11111, etc.
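For instance, the even-length palindrome 0110 can be derived as Q ⇒ Z ⇒ 0Q0 ⇒ 0N0 ⇒ 01Q10 ⇒ 0110 (using Q → ε in the last step). Note that odd-length palindromes such as 00100 would additionally require productions of the form Q → 0 | 1.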
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production rules to
detect any errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program.
Parsers use error recovering strategies, which we will learn later in this chapter.
Derivation
A derivation is basically a sequence of production rule applications used to obtain the input string.
During parsing, we take two decisions for some sentential form of input: which non-terminal to replace, and which production rule to use for the replacement.
To decide the order in which non-terminals are replaced, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-
most derivation. The sentential form derived by the left-most derivation is called the left-
sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-
most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.
Example
Production rules:
E → E + E
E → E * E
E → id
Input string: id + id * id

Left-most derivation:
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id

Right-most derivation:
E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the
parse tree. Let us see this by an example from the last topic.
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Step 1: E ⇒ E * E
Step 2: E ⇒ E + E * E
Step 3: E ⇒ id + E * E
Step 4: E ⇒ id + id * E
Step 5: E ⇒ id + id * id
In a parse tree:
All leaf nodes are terminals.
All interior nodes are non-terminals.
An in-order traversal yields the original input string.
A parse tree depicts the associativity and precedence of operators. The deepest sub-tree is traversed first; therefore the operator in that sub-tree gets precedence over the operator in its parent nodes.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.
Example
E → E + E
E → E – E
E → id
For the string id + id – id, the above grammar generates two parse trees:
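As a textual stand-in for the two parse-tree figures, here are two distinct leftmost derivations of the same string, which is exactly what makes the grammar ambiguous:
E ⇒ E – E ⇒ E + E – E ⇒ id + E – E ⇒ id + id – E ⇒ id + id – id (grouping as (id + id) – id)
E ⇒ E + E ⇒ id + E ⇒ id + E – E ⇒ id + id – E ⇒ id + id – id (grouping as id + (id – id))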
A language for which every grammar is ambiguous is said to be inherently ambiguous.
Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand
is decided by the associativity of those operators. If the operation is left-associative, then the
operand will be taken by the left operator or if the operation is right-associative, the right
operator will take the operand.
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left-associative. If the expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
Operations like Exponentiation are right-associative, i.e., the order of evaluation in the same expression will be:
id op (id op id)
Precedence determines which operator is applied first when two different operators share an operand; for example, 2 + 3 * 4 is evaluated as
2 + (3 * 4)
because * has higher precedence than +.
Left Recursion
A grammar becomes left-recursive if it has any non-terminal A whose derivation contains A
itself as the left-most symbol. Left-recursive grammar is considered to be a problematic
situation for top-down parsers. Top-down parsers start parsing from the Start symbol, which
in itself is non-terminal. So, when the parser encounters the same non-terminal in its
derivation, it becomes hard for it to judge when to stop parsing the left non-terminal and it
goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
    A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents a string of non-terminals.
The production A => Aα | β is converted into the following productions:
A => βA'
A' => αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left
recursion.
Second method is to use the following algorithm, which should eliminate all direct and
indirect left recursions.
START
Arrange the non-terminals in some order: A1, A2, …, An.
for i = 1 to n:
    for j = 1 to i - 1:
        replace each production of the form Ai => Aj γ by Ai => δ1 γ | δ2 γ | … | δk γ,
        where Aj => δ1 | δ2 | … | δk are the current Aj productions
    eliminate the immediate left recursion among the Ai productions
END
Example
S => Aα | β
A => Aαd | βd
and then, remove immediate left recursion using the first technique.
A => βdA'
A' => αdA' | ε
Now none of the production has either direct or indirect left recursion.
Left Factoring
If more than one grammar production rule has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.
Example
A => αβ | αγ | …
Then it cannot determine which production to follow to parse the string as both productions
are starting from the same terminal (or non-terminal). To remove this confusion, we use a
technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix and the rest of the derivation is added by new productions.
Example
A => αA'
A' => β | γ | …
Now the parser has only one production per prefix which makes it easier to take decisions.
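A classic instance is the pair of productions stmt => if expr then stmt else stmt | if expr then stmt, which share the common prefix if expr then stmt; left factoring rewrites them as stmt => if expr then stmt stmt' with stmt' => else stmt | ε, so the parser can defer the choice until it sees whether an else follows.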
First and Follow Sets
An important part of parser table construction is to create FIRST and FOLLOW sets. These sets can provide the actual position of any terminal in the derivation. This is done to create the parsing table, where the decision is made to fill the cell T[A, t] with some production rule.
First Set
This set is created to know what terminal symbol is derived in the first position by a non-terminal. For example, if a non-terminal α has a production α → t β, then the terminal t is in FIRST(α).
Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal in
production rules. We do not consider what the non-terminal can generate but instead, we
see what would be the next terminal symbol that follows the productions of a non-terminal.
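As a small worked example (a hypothetical grammar, not taken from the text above): for the grammar
S → A a
A → b A | ε
FIRST(A) = { b, ε }, since A can begin with b or derive the empty string, and FOLLOW(A) = { a }, since a is the only terminal that can appear immediately to the right of A in any sentential form (from the production S → A a). FIRST(S) = { b, a }, because when A derives ε the first symbol of S's expansion is a.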
Checks such as whether a token is valid or whether it is declared before being used cannot be performed by the parser alone; these tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.
Lexeme: A lexeme is the lowest level syntactic unit of a language (e.g., total,
start).
A parser also checks that the input string is well-formed, and if not, rejects it.
The parser performs several important tasks in compiler design, such as detecting and reporting syntax errors and building the parse tree. Parsing techniques are divided into two groups:
Top-Down Parsing
Bottom-Up Parsing
Top-Down Parsing
In the top-down parsing construction of the parse tree starts at the root and then
proceeds towards the leaves.
1. Predictive Parsing:
A predictive parser can predict which production should be used to replace a specific input string. It uses a look-ahead pointer, which points to the next input symbols. Backtracking is not an issue with this parsing technique. It is known as an LL(1) parser.
2. Recursive Descent Parsing:
This parsing technique recursively parses the input to make a parse tree. It consists of several small functions, one for each non-terminal in the grammar.
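To make the idea concrete, here is a minimal, self-contained C sketch of a recursive-descent parser for the hypothetical toy grammar E → id E', E' → + id E' | ε (the token stream, function names, and grammar are invented for the example and are not part of the text above).

#include <stdio.h>
#include <string.h>

static const char *tokens[] = {"id", "+", "id", "+", "id", NULL};  /* pretend token stream */
static int pos;

/* Consume the next token if it matches t. */
static int match(const char *t) {
    if (tokens[pos] != NULL && strcmp(tokens[pos], t) == 0) { pos++; return 1; }
    return 0;
}

/* E' -> + id E' | epsilon */
static int e_prime(void) {
    if (match("+")) return match("id") && e_prime();
    return 1;                                    /* the epsilon alternative always succeeds */
}

/* E -> id E'  (one small function per non-terminal) */
static int expr(void) { return match("id") && e_prime(); }

int main(void) {
    if (expr() && tokens[pos] == NULL) printf("accepted\n");
    else printf("rejected\n");
    return 0;
}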
Bottom-Up Parsing
In bottom-up parsing in compiler design, the construction of the parse tree starts with the leaves and then proceeds towards the root. It is also called shift-reduce parsing. This type of parser is usually created with the help of software tools.
A parser should be able to detect and report any error found in the program. So, whenever an error occurs, the parser should be able to handle it and carry on parsing the remaining input. A program can have the following types of errors at various stages of the compilation process. There are five common error-recovery methods which can be implemented in the parser.
Panic-Mode recovery
When the parser encounters an error, this mode ignores the rest of the statement and does not process input from the erroneous token up to a delimiter, such as a semicolon. This is a simple error recovery method.
In this type of recovery method, the parser rejects input symbols one by one until a single designated group of synchronizing tokens is found. The synchronizing tokens are generally delimiters, such as a semicolon or a closing brace.
Phrase-Level Recovery
The compiler corrects the program by inserting or deleting tokens, which allows it to
proceed parsing from where it was. It performs correction on the remaining input:
it can replace a prefix of the remaining input with some string, and this helps the
parser to continue the process.
Error Productions
Error-production recovery augments the grammar of the language with productions that
generate the common erroneous constructs. The parser then performs error diagnostics
for such constructs.
Global Correction
The compiler should make as few changes as possible while processing an incorrect
input string. Given an incorrect input string a and a grammar G, the algorithm searches
for a parse tree for a related string b, such that the number of insertions, deletions,
and modifications of tokens needed to transform a into b is as small as
possible.
Grammar
A grammar is a set of structural rules which describe a language. Grammars
assign structure to sentences. The term also refers to the study of these
rules, a field that includes morphology, phonology, and syntax. Context-free grammar
is capable of describing much of the syntax of programming languages.
Each non-terminal symbol should appear to the left of the ::= in at least one production.
The goal (start) symbol should never appear to the right of the ::= of any
production.
A rule is recursive if its LHS appears in its RHS.
Notational Conventions
An optional element may be indicated by enclosing the element in
square brackets, [ … ]. An arbitrary sequence of instances of an element
can be indicated by enclosing the element in braces followed by an asterisk
symbol, { … }*.
A choice among alternatives may be indicated with the | symbol within a single rule.
Alternatives may be enclosed in parentheses, ( … ), when needed.
2. Nonterminals, e.g.:
expression → term
factor → id
Grammar Derivation
Grammar derivation is a sequence of grammar rule which transforms the start
symbol into the string. A derivation proves that the string belongs to the
grammar’s language.
Left-most Derivation
When the sentential form of input is scanned and replaced in left to right
sequence, it is known as left-most derivation. The sentential form which is
derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
When the sentential form of input is scanned and replaced in right-to-left sequence, it is
known as right-most derivation. The sentential form which is
derived from the rightmost derivation is known as the right-sentential form.
Summary
Syntax analysis is the second phase of the compiler design process, and it comes
after lexical analysis
The syntactical analyser helps you to apply rules to the code
Sentence, Lexeme, Token, Keywords and reserved words, Noise words,
Comments, Delimiters, Character set, Identifiers are some important terms used
in the Syntax Analysis in Compiler construction
The parser checks that the input string is well-formed and, if not, rejects it
Parsing techniques are divided into two different groups: Top-Down Parsing,
Bottom-Up Parsing
Lexical, syntactic, semantic, and logical errors are some common errors that occur
during the parsing process
A grammar is a set of structural rules which describe a language
Notational conventions symbol may be indicated by enclosing the element in
square brackets
A left-recursive grammar is a CFG that has at least one production of the form A → Aα
Grammar derivation is a sequence of grammar rule which transforms the start
symbol into the string
The syntax analyser mainly deals with recursive constructs of the language, while
the lexical analyser eases the task of the syntax analyser
A drawback of the syntax analyser is that, on its own, it can never determine whether a token
is valid or not
Compiler design has many functional modules; one of them is the parser, which takes the
output of the lexical analyzer (usually a list of tokens) and builds a parse tree. The main
responsibility of the parser is to confirm whether the grammar of the language can produce
the input string, and in doing so it carries out the analysis of syntax.
What is a Parser?
The parser is one of the phases of the compiler which takes a stream of tokens as input
and converts it into the corresponding Intermediate Representation (IR) with the help
of an existing grammar. The parser is also known as Syntax Analyzer.
Classification of Parser
Types of Parsers
The parser is mainly classified into two categories.
1. Top-down Parser
2. Bottom-up Parser
Top-Down Parser
Top-down parser is the parser that generates the parse tree for the given input string with
the help of grammar productions by expanding the non-terminals. It starts from the start
symbol and ends on the terminals. It uses leftmost derivation.
Further Top-down parser is classified into 2 types: 1) Recursive descent parser and 2)
non-recursive descent parser.
1. Recursive descent parser is also known as the Brute force parser or the backtracking
parser. It basically generates the parse tree by using brute force and backtracking
techniques.
2. Non-recursive descent parser is also known as LL(1) parser or predictive parser or
without backtracking parser or dynamic parser. It uses a parsing table to generate the
parse tree instead of backtracking.
Bottom-Up Parser
Bottom-up Parser is the parser that generates the parse tree for the given input string
with the help of grammar productions by reducing the terminals. It starts from the
terminals and ends at the start symbol. It uses the rightmost derivation in reverse
order.
Bottom-up parser is classified into two types:
LR parser: This is a bottom-up parser that generates the parse tree for the given string
by using unambiguous grammar. It follows the reverse of the rightmost derivation. LR
parser is classified into four types:
LR(0)
SLR(1)
LALR(1)
CLR(1)
Operator precedence parser: Operator Precedence Parser generates the parse tree from the
given grammar and string, but with the condition that no two consecutive non-terminals and
no epsilon ever appear on the right-hand side of any production. The operator
precedence parsing technique is applied to operator grammars.
What is Operator Grammar?
An operator precedence grammar is a context-free grammar that has the property that
no production has either:
An ε (empty string) on its right-hand side
Two adjacent non-terminals on its right-hand side.
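As a small illustration of these two properties, the following Python check (an addition to these
notes, using the same assumed dictionary form as the earlier sketches) tests whether a grammar
qualifies as an operator grammar.

def is_operator_grammar(grammar):
    # True if no right-hand side is empty/ε and no right-hand side has two adjacent non-terminals.
    for alternatives in grammar.values():
        for alt in alternatives:
            if not alt or alt == ["ε"]:
                return False
            for left, right in zip(alt, alt[1:]):
                if left in grammar and right in grammar:
                    return False
    return True

# E -> E + E | E * E | id is an operator grammar; E -> E E + is not.
print(is_operator_grammar({"E": [["E", "+", "E"], ["E", "*", "E"], ["id"]]}))   # True
print(is_operator_grammar({"E": [["E", "E", "+"], ["id"]]}))                    # False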
Conclusion
Parser is one of the important phases in compiler design that takes a stream of tokens as
input and converts it into the intermediate representation. If the programming code does
not follow the rules, then the compiler will generate an error message and the
compilation process will be stopped. Parser is classified into two types namely Top-down
Parser and Bottom-up Parser. Based on the requirements and the goals of the
programming language the appropriate parsing technique is used.
Syntax analyzers follow production rules defined by means of context-free grammar. The
way the production rules are implemented (derivation) divides parsing into two types : top-
down parsing and bottom-up parsing.
Top-down Parsing
When the parser starts constructing the parse tree from the start symbol and then tries to
transform the start symbol to the input, it is called top-down parsing.
Recursive descent parsing : It is a common form of top-down parsing. It is called
recursive as it uses recursive procedures to process the input. Recursive descent parsing
suffers from backtracking.
Backtracking : It means, if one derivation of a production fails, the syntax analyzer restarts
the process using different rules of same production. This technique may process the input
string more than once to determine the right production.
Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to
construct the parse tree up to the start symbol.
Example
Input string : a + b * c
Production rules:
S → E
E → E + T
E → E * T
E → T
T → id
Read the input and check if any production matches with the input:
a + b * c
T + b * c
E + b * c
E + T * c
E * c
E * T
E
S
What is Parsing in Compiler Design?
The process of transforming the data from one format to another is called Parsing. This process can
be accomplished by the parser. The parser is a component of the translator that helps to organise
linear text structure following the set of defined rules which is known as grammar.
Types of Parsing:
Top-down Parsing: When the parser generates the parse tree by expanding productions from the start
symbol, tracing out the left-most derivation of the input, it is called top-down parsing. Top-down parsing
initiates with the start symbol and ends on the terminals. Such parsing is also known as predictive parsing.
Bottom-up Parsing: The bottom-up parsing works just the reverse of the top-down parsing. It traces
the rightmost derivation of the input in reverse until it reaches the start symbol.
Shift-Reduce Parsing: Shift-reduce parsing works on two steps: Shift step and Reduce step.
Shift step: The shift step indicates the increment of the input pointer to the next input
symbol that is shifted.
Reduce Step: When the parser has a complete grammar rule (RHS) on top of its stack,
it replaces it with the corresponding LHS.
LR Parsing: LR parser is one of the most efficient syntax analysis techniques as it works
with context-free grammar. In LR parsing, L stands for left-to-right scanning of the input, and R stands
for the rightmost derivation constructed in reverse.
What is LR Parsing?
LR parser is one of the most efficient syntax analysis techniques as it works with context-free
grammar. In LR parsing, L stands for left-to-right scanning of the input, and R stands for the rightmost
derivation constructed in reverse.
Q2
A parser is a compiler component that is used to break the data coming from the lexical analysis
phase into smaller elements.
Q3
In compiler design, parsers are responsible for analyzing the syntax of the source code to ensure it follows the grammar
of the programming language. There are two main types of parsers based on the direction of parsing:
1. Top-Down Parsers
These start parsing from the start symbol and try to derive the input string by applying production rules.
Common types:
2. Bottom-Up Parsers
These start from the input string and try to reduce it to the start symbol using production rules.
Common types:
Shift-Reduce Parser
o Shifts input symbols onto a stack and reduces them to grammar rules.
Operator Precedence Parser
o Handles expressions with operators (like +, *, etc.) based on precedence and associativity.
LR Parser (Left-to-right, Rightmost derivation in reverse)
o Most powerful type of bottom-up parser.
o Several variants:
SLR (Simple LR)
LALR (Look-Ahead LR)
Canonical LR
Comparison at a Glance:
Top-down parsing is a strategy used in syntax analysis (parsing) of programming languages and formal
grammars. It starts from the start symbol and works down the grammar rules, attempting to produce the input
string. It's like trying to "guess" the structure of the input from the top of the grammar.
E → T + E | T
T → int
For the input int + int:
1. Start at E
2. Try rule E → T + E
3. T → int, matches int
4. Match +
5. E → T, T → int, matches the second int
✔️Success!
Top-down parsing is intuitive and commonly taught first in compiler design courses. Here is a
breakdown of its pros and cons:
✅ Pros of Top-Down Parsing
Simple to Implement: Especially recursive descent parsers, which map naturally to the grammar rules
using recursive functions.
Readable & Intuitive: Easy to understand and follow the logic, especially for smaller or simpler
grammars.
No Need for an Explicit Stack (in Recursive Descent): Uses the program's call stack, making it simpler
than some bottom-up approaches.
Leftmost Derivation: Produces leftmost derivations, which are helpful in certain theoretical and
practical contexts.
Easier Error Detection: Since parsing is done left-to-right and top-down, it is easier to detect errors
early.
❌ Cons of Top-Down Parsing
Can't Handle Left Recursion: Left-recursive rules (e.g., E → E + T) cause infinite recursion. These must
be rewritten.
Backtracking Can Be Inefficient: Basic top-down parsers may try many rules before finding the correct
one. This can be slow.
Limited Grammar Support: Not all context-free grammars are suitable; only those that are LL(k)
(often LL(1)).
Lookahead Limits: Most top-down parsers use a single-token lookahead, which cannot distinguish
between complex cases.
More Manual Work for Large Grammars: Predictive parsers may require FIRST/FOLLOW set calculations
and grammar rewriting to fit LL(1) format.
🔍 Summary
Top-down parsing is simple and intuitive but limited to LL(k) grammars; bottom-up parsing (such as LR
parsing) handles a wider range of grammars but is trickier to build.
Following is a list of frequently asked questions (FAQs) about Top-Down Parsing, especially useful for
students, exam prep, or interview revision:
Answer:
Top-down parsing is a parsing strategy that starts from the start symbol and tries to rewrite it into the input
string by applying grammar rules. It builds the parse tree from the top (start symbol) to the leaves
(terminals).
Answer:
Left recursion is when a non-terminal calls itself on the left side of a production, like E → E + T.
Top-down parsers can go into infinite recursion if left recursion isn't eliminated.
Answer:
By rewriting the grammar.
Example:
Original: A → Aα | β
Fixed: A → βA'
A' → αA' | ε
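The same rewriting can be sketched in code. This is an illustrative Python helper (an addition to
these notes) that removes only immediate left recursion, using the dictionary-of-alternatives
representation assumed in the earlier sketches.

def eliminate_left_recursion(nt, alternatives):
    # Split A -> Aα | β into A -> βA' and A' -> αA' | ε (immediate left recursion only).
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}
    new_nt = nt + "'"
    return {
        nt: [alt + [new_nt] for alt in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [["ε"]],
    }

# Example: E -> E + T | T  becomes  E -> T E',  E' -> + T E' | ε
print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))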
Answer:
Lookahead is how many tokens the parser checks in advance to decide which rule to apply.
In LL(1) parsing, it only uses 1 token of lookahead.
Answer:
They help build parsing tables for predictive parsers:
Answer:
No. Only LL(k) grammars (often just LL(1)) are suitable for top-down parsing.
Sure! Here’s a solid set of MCQs (Multiple Choice Questions) on Top-Down Parsing, perfect for practice,
quizzes, or interviews.
A) Leaf nodes
B) Bottom of the tree
C) Start symbol of the grammar
D) Input string
✅ Answer: D) LR parser
(LR parsers are bottom-up)
3. What kind of derivation does a top-down parser produce?
A) Rightmost derivation
B) Leftmost derivation
C) Bottom-up derivation
D) None
A) It is hard to implement
B) Cannot handle right recursion
C) Cannot handle left recursion
D) Requires full input in memory
✅ Answer: C) LL(1)
✅ Answer: C) Backtracking
(Unless it's predictive)
A) Stack
B) Queue
C) Tree
D) Hash Table
✅ Answer: A) Stack
(via function calls and call stack)
A) Right to left
B) Left to right
C) Bottom-up
D) Randomly
Real-Time Applications of Top-Down Parsing
While top-down parsing is a theoretical concept from compiler design, it does have
practical, real-world applications, especially in systems that analyze, interpret, or validate structured
input.
Application: Parsing source code (e.g., C, Java, Python) into a syntax tree.
How it uses top-down parsing:
o Recursive descent or predictive parsers process code by walking through grammar rules.
o Often used in small or educational compilers due to simplicity.
✅ Languages with LL grammars (e.g., Pascal, early Java subsets) can be parsed this way.
Application: Reading and validating config files (like .ini, .toml, or DSLs).
How:
o Small domain-specific languages are often parsed using recursive descent (top-down).
✅ Great fit for hand-written parsers in embedded systems or tools.
🧠 Summary Table:
Application Area | Use of Top-Down Parsing
Compilers | Parsing and syntax checking
Calculators | Evaluating arithmetic expressions
NLP | Grammar-based sentence parsing
SQL/Query Engines | Validating structured query syntax
Config Parsers | Processing structured config files
Education | Teaching language parsing and compiler design
Text Editors/IDEs | Real-time syntax parsing for features like linting
We have learnt in the last chapter that the top-down parsing technique parses the
input, and starts constructing a parse tree from the root node gradually moving
down to the leaf nodes. The types of top-down parsing are depicted below:
Recursive descent is a top-down parsing technique that constructs the parse tree
from the top and the input is read from left to right. It uses procedures for every
terminal and non-terminal entity. This parsing technique recursively parses the
input to make a parse tree, which may or may not require back-tracking. But the
grammar associated with it (if not left factored) cannot avoid back-tracking. A form
of recursive-descent parsing that does not require any back-tracking is known
as predictive parsing.
Back-tracking
Top- down parsers start from the root node (start symbol) and match the input
string against the production rules to replace them (if matched). To understand
this, take the following example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For an input string "read", a top-down parser will behave like this:
It will start with S from the production rules and will match its yield to the left-most
letter of the input, i.e. r. The first production of S (S → rXd) matches with it. So the
top-down parser advances to the next input letter (i.e. e). The parser tries to
expand non-terminal X and checks its production from the left (X → oa). It does not
match with the next input symbol. So the top-down parser backtracks to obtain the
next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is
accepted.
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability to predict
which production is to be used to replace the input string. The predictive parser
does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which
points to the next input symbols. To make the parser back-tracking free, the
predictive parser puts some constraints on the grammar and accepts only a class
of grammar known as LL(k) grammar.
Predictive parsing uses a stack and a parsing table to parse the input and generate
a parse tree. Both the stack and the input contain an end symbol $ to denote that
the stack is empty and the input is consumed. The parser refers to the parsing
table to take any decision on the input and stack element combination.
In recursive descent parsing, the parser may have more than one production to
choose from for a single instance of input, whereas in predictive parser, each step
has at most one production to choose. There might be instances where there is no
production matching the input string, making the parsing procedure to fail.
LL Parser
LL parser is denoted as LL(k). The first L in LL(k) is parsing the input from left to
right, the second L in LL(k) stands for left-most derivation and k itself represents
the number of look aheads. Generally k = 1, so LL(k) may also be written as LL(1).
LL Parsing Algorithm
We may stick to deterministic LL(1) for parser explanation, as the size of table
grows exponentially with the value of k. Secondly, if a given grammar is not LL(1),
then usually, it is not LL(k), for any given k.
Input:
    input string ω
    parsing table M for grammar G
Output:
    If ω is in L(G), then a left-most derivation of ω;
    error otherwise.
repeat
    let X be the top stack symbol and a the symbol pointed to by ip.
    if X ∈ Vt or X = $
        if X = a
            POP X and advance ip.
        else
            error()
        endif
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 ... Yk
            POP X
            PUSH Yk, Yk-1, ... Y1 /* Y1 on top */
            Output the production X → Y1 Y2 ... Yk
        else
            error()
        endif
    endif
until X = $ /* empty stack */
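A minimal Python sketch of the same table-driven LL(1) loop (an illustration added to these notes;
the grammar, table entries, and function names are assumptions):

def ll1_parse(tokens, table, start_symbol, terminals):
    """tokens: list of terminals ending with '$'.
       table[(nonterminal, terminal)] -> list of RHS symbols; a missing entry means error."""
    stack = ["$", start_symbol]          # start symbol on top of $
    ip = 0
    output = []                          # productions used, i.e. the left-most derivation
    while True:
        X, a = stack[-1], tokens[ip]
        if X in terminals or X == "$":
            if X != a:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
            stack.pop()
            if X == "$":                 # stack and input both exhausted: accept
                return output
            ip += 1
        else:                            # X is a non-terminal
            if (X, a) not in table:
                raise SyntaxError(f"no table entry for ({X}, {a})")
            rhs = table[(X, a)]
            output.append((X, rhs))
            stack.pop()
            for sym in reversed(rhs):    # push Yk ... Y1 so that Y1 ends up on top
                if sym != "ε":
                    stack.append(sym)

# Illustrative LL(1) grammar: E -> T E', E' -> + T E' | ε, T -> id
table = {("E", "id"): ["T", "E'"],
         ("E'", "+"): ["+", "T", "E'"], ("E'", "$"): ["ε"],
         ("T", "id"): ["id"]}
print(ll1_parse(["id", "+", "id", "$"], table, "E", {"id", "+"}))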
Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it
reaches the root node. Here, we start from a sentence and then apply production rules in
reverse manner in order to reach the start symbol. The image given below depicts the
bottom-up parsers available.
Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as
shift-step and reduce-step.
Shift step: The shift step refers to the advancement of the input pointer to the next input
symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted
symbol is treated as a single node of the parse tree.
Reduce step: When the parser finds a complete grammar rule (RHS) on the stack and replaces it with the
corresponding (LHS), it is known as reduce-step. This occurs when the top of the stack contains a handle.
To reduce, a POP function is performed on the stack which pops off the handle and replaces
it with LHS non-terminal symbol.
LR Parser
There are three widely used algorithms available for constructing an LR parser: SLR(1), canonical LR(1), and LALR(1).
LR Parsing Algorithm
token = next_token()
repeat forever
    s = top of stack
    if ACTION[s, token] = shift s' then PUSH token and s'; token = next_token()
    else if ACTION[s, token] = reduce A → β then POP 2|β| symbols; PUSH A and GOTO[top of stack, A]
    else if ACTION[s, token] = accept then return
    else error()
LL vs. LR
LL: Uses the stack for designating what is still to be expected.
LR: Uses the stack for designating what is already seen.
LL: Builds the parse tree top-down.
LR: Builds the parse tree bottom-up.
LL: Continuously pops a nonterminal off the stack and pushes the corresponding right-hand side.
LR: Tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding nonterminal.
LL: Reads the terminals when it pops one off the stack.
LR: Reads the terminals while it pushes them on the stack.
LL: Pre-order traversal of the parse tree.
LR: Post-order traversal of the parse tree.
Bottom-up parsing begins at the terminal symbols (leaf nodes) of a parse tree and ascends towards
the root node, combining smaller units into larger ones until the entire structure is formed. This
method is commonly used in syntax analysis to construct parse trees for input strings.
This blog covers the concept of bottom-up parsing, shift-reduce parsing, operator precedence
parsing, and table-driven LR parsing.
What is Bottom-up parsing?
This parser starts from the input string and works backward towards the start symbol
of the parse tree, using grammar productions in reverse to draw the input string's parse tree.
In this technique, parsing starts from the leaf node of the parse tree to the start symbol of the parse tree in
a bottom-up manner.
Bottom-up parsing attempts to construct a parse tree by reducing the input string, following the rightmost
derivation in reverse.
Bottom-up parsing starts with the input symbol and constructs the parse tree up to the start symbol using
a production rule to reduce the string to get the starting symbol.
Example
1. E → T
2. T→T*F
3. T → id
4. F→T
5. F → id
Parse Tree representation of input string "id * id" is as follows:
Stack          Input             Action
$              a1 - (a2 + a3)$   Shift a1
$S             - (a2 + a3)$      Shift -
$S - (S + S    )$                Shift )
$S             $                 Accept
There are two main categories in shift-reduce parsing as follows:
1. Operator-Precedence Parsing
2. LR-Parser
Operator Precedence Parsing
Operator precedence parsing is a category of the shift-reduce method of parsing. It is applied to a
class of grammars called operator grammars. An operator precedence grammar must have the following two properties:
No RHS of any production contains an ε.
No two non-terminals are adjacent in any RHS.
Operator precedence relations can only be fixed between the terminals of the grammar; the
non-terminals are generally ignored.
There are three operator precedence relations:
a ⋗ b means that terminal "a" has higher precedence than terminal "b."
a ⋖ b means that terminal "a" has lower precedence than terminal "b."
a ≐ b means that the terminals "a" and "b" both have the same precedence.
Precedence table
Following is a precedence table according to which grammar of the operator precedence parsing
works.
Parsing Action
The sequence in which the parsing action is performed in operator precedence parsing is:
At first, the $ symbol is added to both ends of the string.
Then the input string is scanned from left to right until the first ⋗ is encountered.
Then the string is scanned back towards the left over all equal-precedence (≐) relations until the first ⋖ is encountered.
The substring between the ⋖ and the ⋗ is the handle, which is reduced, and the process repeats.
Types of LR parser
LR algorithm
The LR algorithm requires input, output, a stack, and parsing tables. In all types of LR parsing, the input,
output, and stack are the same, but the parsing table is different. The input buffer indicates the end of the
input data and holds the string to be parsed; the string is followed by the symbol "$". A stack is used to
contain the sequence of grammar symbols, with the symbol $ at the bottom of the stack.
A parsing table is a two-dimensional array. It usually contains two parts: the action
part and the go-to part.
LR (1) Parsing
The various steps involved in the LR (1) Parsing are as follows:
At first, context-free grammar is written for the given input.
The ambiguity of the grammar is checked.
Augment production is added in the given grammar.
A canonical collection of LR (1) items is created.
A DFA (state transition diagram) over the item sets is drawn.
An LR (1) parsing table is created.
Augment Grammar
Augmented grammar will be generated if we add one more production to the given grammar G. It helps
the parser identify when to stop the parsing and announce the acceptance of the input.
LL vs. LR
LL (Left-to-Right, Leftmost Derivation) and LR (Left-to-Right, Rightmost Derivation in reverse) are two types of
parsing techniques used in compiler construction and formal language theory. LL parsing is a top-down
(predictive) technique, while LR parsing belongs to the bottom-up (shift-reduce) family. Here's a comparison
between LL and LR parsing:
Feature LL Parsing LR Parsing
Type Top-down parsing Bottom-up parsing
Derivation Leftmost Rightmost
Strategy Predictive Shift-reduce
Complexity Simpler, easier to understand and implement More complex, requires parsing tables
Power Limited power, handles simpler grammars More powerful, handles wider range of grammars
Lookahead Often requires more lookahead symbols Often requires fewer lookahead symbols
Error Handling Less efficient at error recovery More efficient at error recovery
Example Recursive descent parsing LR(0), SLR(1), LALR(1), CLR(1) parsing
Frequently Asked Questions
How does bottom-up parsing work?
Bottom-up parsers work by "shifting" symbols onto a stack till the stack top contains a right-hand side
of a production, and then the stack is reduced by replacing the right-hand side of production with the
left-hand side.
Write a drawback of the LR parser?
A major drawback of an LR parser is that it is too much work to construct an LR parser by hand.
What are the algorithms for constructing an LR parser?
There are mainly three such algorithms:
SLR(1) - Simple LR Parser
LR(1) - LR Parser
LALR(1) - Look-Ahead LR Parser
What is the reduce step in shift-reduce parsing?
When the parser finds a complete right-hand side (RHS) of a grammar rule on the stack and replaces it with the LHS, it is a reduce step.
What are the features of a simple LR Parser?
It works on the smallest class of grammars among the LR parsers, and its computations are simple and fast.
Conclusion
In this article, we have discussed the concepts of bottom-up parsing, shift-reduce Parsing, operator
precedence parsing, and table-driven LR parsing and frequently asked questions.
Bottom-Up parsing is applied in the syntax analysis phase of the compiler. Bottom-up
parsing parses the stream of tokens from the lexical analyzer. And after parsing the input
string it generates a parse tree.
The bottom-up parser builds a parse tree from the leaf nodes and proceeds towards the
root node of the tree. In this section, we will be discussing bottom-up parsing along with its
types.
What is Bottom-Up Parsing?
Bottom-up parsing parses the stream of tokens from the lexical analyzer, and if there is no
error in the input string it constructs a parse tree as output. The bottom-up parsing starts
building parse trees from the leaf nodes i.e., from the bottom of the tree.
And gradually proceeds upwards to the root of the parse tree.
Bottom-up parsers can be created for the large class of LR grammars, since bottom-
up parsing corresponds to the process of reducing the string to the starting symbol of the
grammar.
Step of Reducing a String:
The reduction process is just the reverse of derivation that we have seen in top-down
parsing. Thus, the bottom-up parsing derives the input string in reverse.
Shift Reduce Parsing
Shift reduce parsing is the most general form of bottom-up parsing. Here we have an input
buffer that holds the input string that is scanned by the parser from left to right. There is
also a stack that is used to hold the grammar symbols during parsing.
The bottom of the stack and the right end of the input buffer are marked with $. Initially,
before the parsing starts:
The input buffer holds the input string provided by the lexical analyzer.
The stack is empty.
As the parser parses the string from left to right then it shifts zero or more input symbols
onto the stack.
The parser continues to shift input symbols onto the stack until the top of the stack holds a
substring that matches the production body of a nonterminal in the grammar.
Then the substring is replaced, or reduced, by the appropriate nonterminal.
The parser continues shift-reducing until one of the following conditions occurs:
It identifies an error
The stack contains the start symbol of the grammar and the input buffer becomes empty.
To perform the shift-reduce parsing we have to perform the following four functions.
Four Functions of Shift Reduce Parsing:
1. Shift: This action shifts the next input symbol present on the input buffer onto the top of
the stack.
2. Reduce: This action is performed when the top of the stack holds the right end of a substring
whose left end is also within the stack, and that substring matches the production body of some
non-terminal. The reduce action replaces the entire substring with the appropriate non-terminal.
3. Accept: When, at the end of parsing, the input buffer becomes empty and the stack is
left with only the start symbol of the grammar, the parser announces the successful
completion of parsing.
4. Error: This action identifies the error and performs an error recovery routine.
Let us take an example of the shift-reduce parser. Consider that we have a string id * id +
id and the grammar for the input string is:
E ->E + T | T
T -> T * F | F
F -> ( E ) | id
Note: In the shift-reduce parsing a handle always appears on the top of the stack. The
handle is a substring that matches the body of production in the grammar. The handle
must never appear inside the stack.
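To make the example above concrete, here is a small illustrative trace script (an addition to these
notes, not a general parser): the sequence of shift/reduce actions for id * id + id is supplied by hand,
and the script only applies and prints each step, checking every reduction against the grammar given above.

grammar = {"E": [["E", "+", "T"], ["T"]],
           "T": [["T", "*", "F"], ["F"]],
           "F": [["(", "E", ")"], ["id"]]}
tokens = ["id", "*", "id", "+", "id", "$"]
actions = ["shift", ("reduce", "F", 1), ("reduce", "T", 1), "shift", "shift",
           ("reduce", "F", 1), ("reduce", "T", 3), ("reduce", "E", 1), "shift",
           "shift", ("reduce", "F", 1), ("reduce", "T", 1), ("reduce", "E", 3),
           "accept"]
stack, ip = ["$"], 0
for act in actions:
    print(f"{' '.join(stack):<16} {' '.join(tokens[ip:]):<18} {act}")
    if act == "shift":
        stack.append(tokens[ip])
        ip += 1
    elif act == "accept":
        break
    else:                                  # ("reduce", lhs, length of the handle)
        _, lhs, n = act
        handle = stack[-n:]
        assert handle in grammar[lhs], f"{handle} is not a handle for {lhs}"
        del stack[-n:]
        stack.append(lhs)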
Bottom-Up Parsing in LR Parser
The bottom-up parsing is based on the concept of LR(K) parsing.
1. L stands for the left-to-right scanning of the input.
2. R stands for constructing the rightmost derivation in reverse.
3. K stands for the number of input symbols that the parser will look ahead for making a
parsing decision.
The input buffer has the input string that has to be parsed.
The stack maintains the sequence of grammar symbols while parsing the input string.
The parsing table is a two-dimensional array that has two entries ‘Go To’ and ‘Action’.
1. LR(0) Parsing
All kinds of LR parsers work in the same way; they only differ in the construction of their parsing
tables. We will discuss each of them in brief in our future contents.
So, this is all about the Bottom-Up parsing. We have discussed the most general form of
bottom-up parsing i.e., shift-reduce bottom-up parsing along with an example. We have
discussed the structure of the LR parser and its types.
It's the opposite of top-down parsing, which begins from the start symbol and tries to derive the string.
🧱 Key Concepts
1. Shift-Reduce Parsing
This is the core mechanism of bottom-up parsing. The parser uses a stack to hold grammar symbols.
2. Handle
A handle is a substring that matches the right-hand side of a production rule and can be reduced to the left-hand side.
✅ Advantages
❌ Disadvantages
Given Grammar:
E → E + T | T
T → id
Input: id + id
Parsing Steps:
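The parsing steps themselves appear to have been lost in extraction; under this grammar, a typical
shift-reduce sequence for id + id would be:

Stack        Input        Action
$            id + id $    Shift id
$ id         + id $       Reduce T → id
$ T          + id $       Reduce E → T
$ E          + id $       Shift +
$ E +        id $         Shift id
$ E + id     $            Reduce T → id
$ E + T      $            Reduce E → E + T
$ E          $            Accept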
Tool | Type of Bottom-Up Parser | Used For
YACC (Yet Another Compiler Compiler) | LALR(1) | Widely used to generate parsers for programming languages like C, SQL
Bison (GNU version of YACC) | LALR(1) | C/C++ compiler components, interpreters
Menhir (for OCaml) | LR(1) | Compiling OCaml and other functional languages
ANTLR (older versions) | SLR | Earlier versions used bottom-up parsing (later switched to top-down LL(*) parsers)
GCC (GNU Compiler Collection): Uses parsers generated by Bison, a bottom-up LALR parser.
Rust Compiler: Initially used Bison; though it's evolving, it started with LALR parsing.
SQL Parsers: Many SQL engines use bottom-up parsers for complex query grammar parsing.
3. Scripting Languages
Languages like Lua, Bash, and parts of Perl use parser generators that often build on LALR-based techniques.
In modern Integrated Development Environments (IDEs), real-time syntax checking or auto-completion engines often
use partial bottom-up parsing to:
Some use lightweight LALR-based parsing engines for this purpose due to their speed and precision.
Databases like MySQL, PostgreSQL, and SQLite use bottom-up parsers to interpret and compile complex SQL queries
efficiently.
✅ Fast and deterministic — suitable for compilers and tools that require real-time parsing
✅ Powerful enough to handle complex, ambiguous, or left-recursive grammar (which is common in real
languages)
✅ Can be precompiled into efficient table-driven parsers
Bottom-up parsing is a syntax analysis technique where parsing starts from the input tokens (leaves) and proceeds
upward to the start symbol (root) by applying grammar rules in reverse (reductions).
A shift-reduce parser is a bottom-up parser that uses a stack and performs two main operations:
Parsing tables guide the parser on whether to shift, reduce, accept, or throw an error based on the current state and
next input symbol. These tables are generated from the grammar.
Bottom-up parsers can handle a wider range of context-free grammars, including left-recursive and more ambiguous
grammars, which top-down parsers often cannot.
A. Top-down derivation
B. Leftmost derivation
C. Rightmost derivation in reverse
D. Rightmost derivation
A. Leftmost non-terminal
B. The next token to be parsed
C. A substring matching the LHS of a rule
D. A substring matching the RHS of a production to be reduced
A. LL(1)
B. Recursive Descent
C. SLR(1)
D. Predictive Parser
✅ Answer: C. SLR(1)
A. LR(0)
B. SLR(1)
C. LALR(1)
D. LL(1)
✅ Answer: D. LL(1)
A. Lexical analysis
B. Reduction
C. Shifting
D. Code generation
✅ Answer: B. Reduction
A. Lex
B. YACC
C. ANTLR (modern version)
D. Flex
✅ Answer: B. YACC
Answer:
Bottom-up parsing is a syntax analysis technique where parsing starts from the input symbols (tokens) and works upward
by reducing substrings to non-terminals until the start symbol is reached.
Answer:
A shift-reduce parser is a bottom-up parser that uses a stack to shift input tokens and reduce them by applying grammar
rules in reverse.
Answer:
A handle is a substring that matches the right-hand side of a production rule and can be reduced to a non-terminal during
parsing.
Answer:
LR(0)
SLR(1)
LALR(1)
Canonical LR(1)
Answer:
It follows rightmost derivation in reverse.
Answer:
Yes, bottom-up parsers can handle left-recursive grammars.
Answer:
Shift
Reduce
Accept
Error
Answer:
It can parse a wider range of grammars, including left-recursive grammars.
Answer:
Parsing tables guide the parser on when to shift, reduce, accept, or report an error based on the current state and input.
Answer:
YACC (Yet Another Compiler Compiler) uses bottom-up (LALR) parsing.
What is LR Parsing?
It is a bottom-up parsing method used for deterministic context-free languages and is very efficient for programming
languages.
📘 Types of LR Parsers
📊 Comparison Table
Feature | LR(0) | SLR(1) | LALR(1) | Canonical LR(1)
Lookahead Used | None | 1 token | 1 token | 1 token
Grammar Coverage | Low | Medium | High | Highest
Table Size | Small | Medium | Smaller than LR(1) | Very Large
Practical Use | Rare | Educational | Very common (e.g., YACC, Bison) | Less common due to size
SLR Parser
SLR represents "Simple LR Parser". It is very easy and cost-effective to implement. The SLR
method constructs the parsing action and goto functions from the deterministic finite automaton that
recognizes viable prefixes. It will not produce uniquely defined parsing action tables for all grammars,
but it does succeed on several grammars for programming languages. Given a grammar G, it
augments G to make G', and from G' it constructs C, the canonical collection of sets of
items for G'. It constructs ACTION, the parsing action function, and GOTO, the goto
function, from C using the simple LR parsing table construction technique. This
requires us to know FOLLOW(A) for each non-terminal A of the grammar.
CLR Parser
CLR refers to the canonical LR parser with lookahead. CLR parsing uses the canonical collection of LR (1) items
to construct the CLR (1) parsing table. The CLR (1) parsing table produces more states as
compared to SLR (1) parsing. In CLR (1), reduce entries are placed only under the
lookahead symbols.
LALR Parser
LALR Parser is Look Ahead LR Parser. It is intermediate in power between SLR and CLR
parser. It is the compaction of CLR Parser, and hence tables obtained in this will be smaller
than CLR Parsing Table.
For constructing the LALR (1) parsing table, the canonical collection of LR (1) items is used.
In LALR (1) parsing, the LR (1) items that have the same productions but different look-
aheads are grouped together to form a single set of items. It is frequently the same as CLR
(1) parsing, except for one difference, which is the parsing table.
The overall structure of all these LR Parsers is the same. There are some common factors
such as size, class of context-free grammar, which they support, and cost in terms of time
and space in which they differ.
Let us see the comparison between SLR, CLR, and LALR Parser.
SLR Parser | LALR Parser | CLR Parser
It requires less time and space | It requires more time and space | It also requires more time and space
LALR Parser :
LALR Parser is lookahead LR parser. It is the most powerful parser which can handle
large classes of grammar. The size of CLR parsing table is quite large as compared to
other parsing tables. LALR reduces the size of this table. LALR works similarly to CLR; the
only difference is that it combines the similar states of the CLR parsing table into one single
state.
The general syntax becomes [A → α.B, a]
where A → α.B is a production and a is a terminal or the right end marker $
LR(1) items = LR(0) items + lookahead
How to add lookahead with the production?
CASE 1 –
A → α.BC, a
Suppose this is the 0th production. Now, since ‘ . ‘ precedes B, we have to write B’s
productions as well.
B → .D [1st production]
Suppose this is B’s production. The lookahead of this production is found by looking at the
previous production, i.e. the 0th production: whatever is after B, we find FIRST (of that
value), and that is the lookahead of the 1st production. So, here in the 0th production, after B, C is
there. Assume FIRST(C) = d; then the 1st production becomes
B → .D, d
CASE 2 –
Now if the 0th production was like this,
A → α.B, a
Here, we can see there is nothing after B. So the lookahead of the 0th production will be the
lookahead of the 1st production, i.e.
B → .D, a
CASE 3 –
Assume a production A → a | b
A → a, $ [0th production]
A → b, $ [1st production]
Here, the 1st production is a part of the previous production, so the lookahead will be the
same as that of its previous production.
Steps for constructing the LALR parsing table :
1. Writing augmented grammar
2. LR(1) collection of items to be found
3. Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the
LALR parsing table
EXAMPLE
Construct CLR parsing table for the given context free grammar
S-->AA
A-->aA|b
Solution:
STEP1- Find augmented grammar
The augmented grammar of the given grammar is:-
S'-->.S ,$ [0th production]
S-->.AA ,$ [1st production]
A-->.aA ,a|b [2nd production]
A-->.b ,a|b [3rd production]
Let’s apply the rule of lookahead to the above productions.
The initial look ahead is always $
Now,the 1st production came into existence because of ‘ . ‘ before ‘S’ in 0th
production.There is nothing after ‘S’, so the lookahead of 0th production will be the
lookahead of 1st production. i.e. : S–>.AA ,$
Now,the 2nd production came into existence because of ‘ . ‘ before ‘A’ in the 1st
production.
After ‘A’, there’s ‘A’. So, FIRST(A) is a,b. Therefore, the lookahead of the 2nd
production becomes a|b.
Now,the 3rd production is a part of the 2nd production.So, the look ahead will be the
same.
STEP2 – Find the LR(1) collection of items
Below is the figure showing the LR(1) collection of items. We will understand everything
one by one.
The terminals of this grammar are {a,b}
The non-terminals of this grammar are {S,A}
RULES –
1. If any non-terminal has ‘ . ‘ preceding it, we have to write all its production and add ‘ . ‘
preceding each of its production.
2. from each state to the next state, the ‘ . ‘ shifts to one place to the right.
In the figure, I0 consists of augmented grammar.
I0 goes to I1 when ‘ . ‘ of 0th production is shifted towards the right of S(S’->S.). This
state is the accept state . S is seen by the compiler. Since I1 is a part of the 0th
production, the lookahead is same i.e. $
I0 goes to I2 when ‘ . ‘ of 1st production is shifted towards right (S->A.A) . A is seen
by the compiler. Since I2 is a part of the 1st production, the lookahead is same i.e. $.
I0 goes to I3 when ‘ . ‘ of 2nd production is shifted towards the right (A->a.A) . a is
seen by the compiler.since I3 is a part of 2nd production, the lookahead is same i.e.
a|b.
I0 goes to I4 when ‘ . ‘ of 3rd production is shifted towards right (A->b.) . b is seen by
the compiler. Since I4 is a part of 3rd production, the lookahead is same i.e. a|b.
I2 goes to I5 when ‘ . ‘ of 1st production is shifted towards right (S->AA.) . A is seen
by the compiler. Since I5 is a part of the 1st production, the lookahead is same i.e. $.
I2 goes to I6 when ‘ . ‘ of 2nd production is shifted towards the right (A->a.A) . A is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is same
i.e. $.
I2 goes to I7 when ‘ . ‘ of 3rd production is shifted towards right (A->b.) . b is seen by
the compiler. Since I7 is a part of the 3rd production, the lookahead is same i.e. $.
I3 goes to I3 when ‘ . ‘ of the 2nd production is shifted towards right (A->a.A) . a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is same
i.e. a|b.
I3 goes to I8 when ‘ . ‘ of 2nd production is shifted towards the right (A->aA.) . A is
seen by the compiler. Since I8 is a part of the 2nd production, the lookahead is same
i.e. a|b.
I6 goes to I9 when ‘ . ‘ of 2nd production is shifted towards the right (A->aA.) . A is
seen by the compiler. Since I9 is a part of the 2nd production, the lookahead is same
i.e. $.
I6 goes to I6 when ‘ . ‘ of the 2nd production is shifted towards right (A->a.A) . a is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is same
i.e. $.
I6 goes to I7 when ‘ . ‘ of the 3rd production is shifted towards right (A->b.) . b is seen
by the compiler. Since I7 is a part of the 3rd production, the lookahead is same i.e. $.
STEP 3 –
Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the parsing
table. Below is the CLR parsing table.
Once we make a CLR parsing table, we can easily make a LALR parsing table from it.
In the step2 diagram, we can see that
I3 and I6 are similar except their lookaheads.
I4 and I7 are similar except their lookaheads.
I8 and I9 are similar except their lookaheads.
In LALR parsing table construction , we merge these similar states.
Wherever there is 3 or 6, make it 36(combined form)
Wherever there is 4 or 7, make it 47(combined form)
Wherever there is 8 or 9, make it 89(combined form)
Below is the LALR parsing table.
LR parsers :
It is an efficient bottom-up syntax analysis technique that can be used to parse large
classes of context-free grammar is called LR(k) parsing.
L stands for the left to right scanning
R stands for rightmost derivation in reverse
k stands for no. of input symbols of lookahead
Advantages of LR parsing :
It recognizes virtually all programming language constructs for which CFG can be
written
It is able to detect syntactic errors
It is an efficient, non-backtracking, shift-reduce parsing method.
Types of LR parsing methods :
1. SLR
2. CLR
3. LALR
CLR Parser :
The CLR parser stands for canonical LR parser. It is a more powerful LR parser. It makes
use of lookahead symbols. This method uses a large set of items called LR(1) items. The
main difference between LR(0) and LR(1) items is that, in LR(1) items, it is possible to
carry more information in a state, which will rule out useless reduction states. This extra
information is incorporated into the state by the lookahead symbol. The general syntax
becomes [A → α.B, a]
where A → α.B is the production and a is a terminal or the right end marker $
LR(1) items = LR(0) items + lookahead
How to add lookahead with the production?
CASE 1 –
A → α.BC, a
Suppose this is the 0th production.Now, since ‘ . ‘ precedes B,so we have to write B’s
productions as well.
B->.D [1st production]
Suppose this is B’s production. The lookahead of this production is found by looking at the
previous production, i.e. the 0th production: whatever is after B, we find FIRST (of that
value), and that is the lookahead of the 1st production. So, here in the 0th production, after B, C is
there. Assume FIRST(C) = d; then the 1st production becomes
B->.D, d
CASE 2 –
Now if the 0th production was like this,
A → α.B, a
Here, we can see there’s nothing after B. So the lookahead of 0th production will be the
lookahead of 1st production. ie-
B->.D, a
CASE 3 –
Assume a production A → a | b
A → a, $ [0th production]
A → b, $ [1st production]
Here, the 1st production is a part of the previous production, so the lookahead will be the
same as that of its previous production.
These are the 2 rules of look ahead.
Steps for constructing CLR parsing table :
1. Writing augmented grammar
2. LR(1) collection of items to be found
3. Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the CLR
parsing table
EXAMPLE
Construct a CLR parsing table for the given context-free grammar
S-->AA
A-->aA|b
Solution :
STEP 1 – Find augmented grammar
The augmented grammar of the given grammar is:-
S'-->.S ,$ [0th production]
S-->.AA ,$ [1st production]
A-->.aA ,a|b [2nd production]
A-->.b ,a|b [3rd production]
Let’s apply the rule of lookahead to the above productions
The initial look ahead is always $
Now, the 1st production came into existence because of ‘ . ‘ Before ‘S’ in 0th
production.There is nothing after ‘S’, so the lookahead of 0th production will be the
lookahead of 1st production. ie: S–>.AA ,$
Now, the 2nd production came into existence because of ‘ . ‘ Before ‘A’ in the 1st
production.After ‘A’, there’s ‘A’. So, FIRST(A) is a,b
Therefore,the look ahead for the 2nd production becomes a|b.
Now, the 3rd production is a part of the 2nd production.So, the look ahead will be the
same.
STEP 2 – Find LR(1) collection of items
Below is the figure showing the LR(1) collection of items. We will understand everything
one by one.
The terminals of this grammar are {a,b}
The non-terminals of this grammar are {S,A}
RULE-
1. If any non-terminal has ‘ . ‘ preceding it, we have to write all its production and add ‘ . ‘
preceding each of its production.
2. from each state to the next state, the ‘ . ‘ shifts to one place to the right.
3. All the rules of lookahead apply here.
In the figure, I0 consists of augmented grammar.
I0 goes to I1 when ‘ . ‘ of 0th production is shifted towards the right of S(S’->S.). This
state is the accept state . S is seen by the compiler. Since I1 is a part of the 0th
production, the lookahead is the same ie $
I0 goes to I2 when ‘ . ‘ of 1st production is shifted towards right (S->A.A) . A is seen
by the compiler. Since I2 is a part of the 1st production, the lookahead is the same i.e.
$.
I0 goes to I3 when ‘ . ‘ of the 2nd production is shifted towards right (A->a.A) . a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is the
same ie a|b.
I0 goes to I4 when ‘ . ‘ of the 3rd production is shifted towards right (A->b.) . b is seen
by the compiler. Since I4 is a part of the 3rd production, the lookahead is the same
i.e. a | b.
I2 goes to I5 when ‘ . ‘ of 1st production is shifted towards right (S->AA.) . A is seen
by the compiler. Since I5 is a part of the 1st production, the lookahead is the same i.e.
$.
I2 goes to I6 when ‘ . ‘ of 2nd production is shifted towards the right (A->a.A) . A is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is the
same i.e. $.
I2 goes to I7 when ‘ . ‘ of 3rd production is shifted towards right (A->b.) . b is seen by
the compiler. Since I7 is a part of the 3rd production, the lookahead is the same i.e. $.
I3 goes to I3 when ‘ . ‘ of the 2nd production is shifted towards right (A->a.A) . a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is the
same i.e. a|b.
I3 goes to I8 when ‘ . ‘ of 2nd production is shifted towards the right (A->aA.) . A is
seen by the compiler. Since I8 is a part of the 2nd production, the lookahead is the
same i.e. a|b.
I6 goes to I9 when ‘ . ‘ of 2nd production is shifted towards the right (A->aA.) . A is
seen by the compiler. Since I9 is a part of the 2nd production, the lookahead is the
same i.e. $.
I6 goes to I6 when ‘ . ‘ of the 2nd production is shifted towards right (A->a.A) . a is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is the
same i.e. $.
I6 goes to I7 when ‘ . ‘ of the 3rd production is shifted towards right (A->b.) . b is seen
by the compiler. Since I7 is a part of the 3rd production, the lookahead is the same i.e.
$.
STEP 3 – Defining 2 functions: action[list of terminals] and goto[list of non-terminals] in the
parsing table. Below is the CLR parsing table.
$ is the end-of-input marker; the accept action is placed in the $ column of the accepting state.
0,1,2,3,4,5,6,7,8,9 denote I0,I1,I2,I3,I4,I5,I6,I7,I8,I9
I0 gives A in I2, so 2 is added to the A column and 0 row.
I0 gives S in I1, so 1 is added to the S column and 0 row.
similarly 5 is written in A column and 2nd row, 8 is written in A column and 3rd row,
9 is written in A column and 6th row.
I0 gives a in I3, so S3(shift 3) is added to a column and 0 row.
I0 gives b in I4, so S4(shift 4) is added to the b column and 0 row.
Similarly, S6(shift 6) is added on ‘a’ column and 2,6 row ,S7(shift 7) is added on b
column and 2,6 row,S3(shift 3) is added on ‘a’ column and 3 row ,S4(shift 4) is added
on b column and 3 row.
I4 is reduced as ‘ . ‘ is at the end. I4 is the 3rd production of grammar. So write
r3(reduce 3) in lookahead columns. The lookahead of I4 are a and b, so write R3 in a
and b column.
I5 is reduced as ‘ . ‘ is at the end. I5 is the 1st production of grammar. So write
r1(reduce 1) in lookahead columns. The lookahead of I5 is $ so write R1 in $ column.
Similarly, write R2 in a,b column and 8th row, write R2 in $ column and 9th row.
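Putting the entries described above into code, the following sketch (an illustration added to these
notes, reconstructed from the description above) encodes the resulting CLR(1) ACTION and GOTO tables
as Python dictionaries and runs the standard LR driver loop on the string aabb:

ACTION = {
    (0, 'a'): ('s', 3), (0, 'b'): ('s', 4),
    (1, '$'): ('acc',),
    (2, 'a'): ('s', 6), (2, 'b'): ('s', 7),
    (3, 'a'): ('s', 3), (3, 'b'): ('s', 4),
    (4, 'a'): ('r', 3), (4, 'b'): ('r', 3),
    (5, '$'): ('r', 1),
    (6, 'a'): ('s', 6), (6, 'b'): ('s', 7),
    (7, '$'): ('r', 3),
    (8, 'a'): ('r', 2), (8, 'b'): ('r', 2),
    (9, '$'): ('r', 2),
}
GOTO = {(0, 'S'): 1, (0, 'A'): 2, (2, 'A'): 5, (3, 'A'): 8, (6, 'A'): 9}
# Productions: 1: S -> A A, 2: A -> a A, 3: A -> b   (0 is the augmented S' -> S)
PRODUCTIONS = {1: ('S', 2), 2: ('A', 2), 3: ('A', 1)}   # (LHS, length of RHS)

def lr_parse(tokens):
    stack = [0]                                # stack of states
    i = 0
    while True:
        state, a = stack[-1], tokens[i]
        act = ACTION.get((state, a))
        if act is None:
            raise SyntaxError(f"unexpected {a!r} in state {state}")
        if act[0] == 's':                      # shift: push the new state, consume the token
            stack.append(act[1])
            i += 1
        elif act[0] == 'r':                    # reduce by the numbered production
            lhs, length = PRODUCTIONS[act[1]]
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])
            print("reduce by production", act[1])
        else:                                  # accept
            return True

lr_parse(list("aabb") + ['$'])                 # accepted; prints reductions 3, 2, 2, 3, 1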
1. What does "LR" stand for in LR parsing?
Answer:
L stands for scanning the input Left-to-right, and R stands for constructing a Rightmost derivation in reverse.
Answer:
LR parsing is used for bottom-up syntax analysis in compilers to parse deterministic context-free languages efficiently.
Answer:
LR(0) – No lookahead
SLR(1) – Uses FOLLOW sets
LALR(1) – Optimized, used in practice (e.g., YACC)
Canonical LR(1) – Most powerful, uses full context
Answer:
A parsing table guides the parser on what action to take (shift, reduce, accept, error) based on the current state and
lookahead symbol.
Answer:
It occurs when the parser table can't decide whether to shift the next input symbol or reduce the symbols on the stack.
This usually indicates a grammar ambiguity or a limitation of the LR parser type.
7. Which LR parser type is most commonly used in practice?
Answer:
LALR(1) is most commonly used in practice because it balances power and efficiency. Tools like YACC and Bison use
it.
Answer:
No, LR parsers require unambiguous grammars. Ambiguities must be resolved manually or by rewriting the grammar.
Answer:
Canonical LR(1) uses full lookahead and context, making it able to resolve more conflicts and handle more grammars
than LALR(1), but at the cost of much larger parsing tables.
Answer:
Lookahead helps the parser decide the correct action (shift or reduce) when multiple options are available — especially
important in SLR(1), LALR(1), and LR(1).