Input Buffering

Input buffering is a crucial technique in compiler design that improves performance by reducing the number of I/O operations during lexical analysis. It involves using a two-buffer system to manage large lookaheads and employing sentinels to mark buffer ends, allowing for efficient token identification. The lexical analyzer reads input characters, arranges them into lexemes, and outputs tokens, while also performing pre-processing tasks such as removing comments and whitespace.


INPUT BUFFERING

 We often have to look one or more characters beyond the next lexeme before we can be sure we have
the right lexeme.
 As characters are read from left to right, each character is stored in the buffer to form a meaningful
token. We introduce a two-buffer scheme that handles large lookaheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs:
 A buffer is divided into two N-character halves.
 Each buffer is of the same size N, and N is usually the number of characters on one disk block. E.g.,
1024 or 4096 bytes.
 Using one system read command we can read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character, represented by eof, marks
the end of the source file.

Two pointers to the input are maintained:


 Pointer lexeme_beginning marks the beginning of the current lexeme, whose extent we are attempting
to determine.
 Pointer forward scans ahead until a pattern match is found.
Sentinels:
 For each character read, we make two tests: one for the end of the buffer, and one to determine what
character is read. We can combine the buffer-end test with the test for the current character if we
extend each buffer to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source program, and a natural choice is the
character eof.
 The sentinel arrangement places an eof at the end of each buffer half (the original figure is not reproduced); a C sketch of the resulting scanning loop follows.
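The two-buffer scheme with sentinels can be sketched in C as follows. This is a minimal sketch, not the notes' own code: the names (fill, next_char), the choice of '\0' as the sentinel, and the use of fread are illustrative, and it assumes the sentinel byte never occurs in the source text.

#include <stdio.h>

#define N 4096            /* size of each buffer half, e.g. one disk block */
#define SENTINEL '\0'     /* stands in for eof; assumed absent from the source text */

static FILE *src;                  /* opened elsewhere: src = fopen(...), then fill(buf) */
static char buf[2 * (N + 1)];      /* two halves, each followed by a sentinel slot */
static char *forward = buf;        /* the scanning pointer */

/* Read up to N characters into one half and plant the sentinel after them. */
static void fill(char *half) {
    size_t got = fread(half, 1, N, src);
    half[got] = SENTINEL;          /* end of the half, or end of input if got < N */
}

/* Return the next source character, reloading a half when its sentinel is hit. */
static int next_char(void) {
    char c = *forward++;
    if (c != SENTINEL)             /* the one test made on almost every character */
        return (unsigned char)c;
    if (forward == buf + N + 1) {          /* ran off the first half */
        fill(buf + N + 1);                 /* reload the second half */
        return next_char();
    }
    if (forward == buf + 2 * (N + 1)) {    /* ran off the second half */
        fill(buf);                         /* reload the first half */
        forward = buf;
        return next_char();
    }
    forward--;                             /* sentinel inside a half: real end of input */
    return EOF;
}

The common path makes exactly one comparison per character; the buffer-end tests run only when a sentinel is actually seen, which is the point of the arrangement.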

SPECIFICATION OF TOKENS:
 Regular expressions are a notation for specifying patterns.
 Each pattern matches a set of strings.
 Regular expressions will serve as names for sets of strings.
Strings and Languages:
 A string is a finite sequence of symbols.
For example
computer ( c, o, m, p, u, t, e, r)
CS6660 ( C, S, 6, 6, 6, 0)
101001 (1, 0)
 Symbols are drawn from an alphabet, which is a finite set of symbols.
 The term alphabet or character class denotes any finite set of symbols; e.g., the set {0,1} is the binary
alphabet.
 The terms sentence and word are often used as synonyms for the term string.
 The length of a string s, written |s|, is the number of occurrences of symbols in s.
 The empty string, denoted ε, has length zero.
 The term language denotes any set of strings over some fixed alphabet.
Operations on Languages:
There are several operations that can be applied to languages:
Definitions of operations on languages L and M:

Union:            L ∪ M = { s | s is in L or s is in M }
Concatenation:    L M = { st | s is in L and t is in M }
Kleene closure:   L* = the union of L^i for i ≥ 0 (L concatenated zero or more times)
Positive closure: L+ = the union of L^i for i ≥ 1 (L concatenated one or more times)

Regular Expressions:
 Regular expressions allow us to define the sets of strings that form tokens.
 For example, a Pascal identifier is formed by a letter followed by zero or more letters or digits:
letter ( letter | digit )*
 A regular expression is formed using a set of defining rules.
 Each regular expression r denotes a language L(r).
Order of evaluation of regular expressions:
As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain pairs
of parentheses if we adopt the conventions that:
 The unary operator * has highest precedence and is left associative.
 Concatenation has second highest precedence and is left associative.
 | has lowest precedence and is left associative.
Under these conventions, for example, (a) | ((b)*(c)) may be written as a | b*c.
RECOGNITION OF TOKENS:
We have learned how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix
that is a lexeme matching one of the patterns.
Grammar for branching statements:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

The terminals of the grammar, which are if, then, else, relop, id, and num, are the names of tokens as far
as the lexical analyzer is concerned; the patterns for these tokens are described using regular definitions.

Introduction

Input buffering is an important technique in compiler design that helps improve performance and reduce
overhead, but it must be applied carefully and appropriately to avoid problems.

To understand the input buffering in compiler design in detail, a few terms need to be understood. We will
discuss those terms before moving to input buffering in compiler design.

Lexical Analyser
The lexical analyzer's main purpose is to read the source program's input characters, arrange them into lexemes,
and output a sequence of tokens for each lexeme in the source program.
When the lexical analyzer finds a lexeme that constitutes an identifier, it must enter that lexeme into the symbol table.

The lexical analyzer not only recognizes lexemes but also does pre-processing on the source text, such as
deleting comments and white spaces.

Lexeme
A lexeme is a sequence of characters in the source program that fits the pattern for a token and is recognized as
an instance of that token by the lexical analyzer.

Token
A token is a pair consisting of a token name and an optional attribute value.

The token name is a symbol that denotes the type of lexical unit. Some example lexemes and their tokens are listed below (a minimal C sketch of a token pair follows the table):

Lexeme    Token
=         EQUAL_OP
*         MULT_OP
,         COMMA
(         LEFT_PAREN
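For illustration, such a pair can be represented as a small C struct; the type and field names below are illustrative, not taken from these notes:

/* A token as a <name, attribute> pair (illustrative names). */
enum token_name { ID, NUM, RELOP, EQUAL_OP, MULT_OP, COMMA, LEFT_PAREN };

struct token {
    enum token_name name;   /* the type of lexical unit */
    int attribute;          /* optional attribute, e.g. a symbol-table index for ID */
};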
Pattern
A pattern is a description of the various forms that a token's lexemes can take. The pattern for a keyword as a
token is just the sequence of characters that makes up the keyword.

What is Input Buffering?

One or more characters beyond the next lexeme must be examined to guarantee that the correct lexeme
is identified.

As a result, a two-buffer system is used to safely manage huge lookaheads.

Using sentinels to mark the buffer end has been embraced as a technique for speeding up the lexical analyzer
process.

Three general techniques for implementing a lexical analyzer


 With the help of a lexical-analyzer generator: the generator supplies routines for
reading and buffering the input.
 By building the lexical analyzer in a traditional systems programming language and reading the input with
that language's I/O tools.
 Programming the lexical analyzer in assembly language and explicitly controlling the input reading.
Buffer Pairs

Specialized buffering techniques decrease the overhead required to process an input character, such as the cost of moving characters.

 It consists of two buffers, each of which has an N-character size and is alternately reloaded.
 There are two pointers: lexemeBegin and forward.
 lexemeBegin marks the start of the current lexeme, whose extent has yet to be determined.
 forward scans ahead until it finds a match for a pattern.
 When a lexeme has been found, forward is set to the character at its right end, and lexemeBegin is then
set to the character immediately after the lexeme just found.
 The collection of characters between the two pointers is the current lexeme (see the sketch after this list).
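A minimal sketch of copying out the lexeme that lies between the two pointers, assuming for simplicity that the lexeme does not straddle the boundary between the two buffer halves (all names are illustrative):

#include <string.h>

#define MAX_LEXEME 128

static char *lexemeBegin;    /* start of the current lexeme */
static char *forward;        /* one past its right end once matching stops */

/* Copy the lexeme just recognized into 'out' and step past it. */
static void take_lexeme(char out[MAX_LEXEME]) {
    size_t len = (size_t)(forward - lexemeBegin);
    if (len >= MAX_LEXEME)
        len = MAX_LEXEME - 1;    /* truncate overly long lexemes */
    memcpy(out, lexemeBegin, len);
    out[len] = '\0';
    lexemeBegin = forward;       /* the next lexeme starts right after this one */
}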

Sentinels
Without sentinels, a check must be made each time the forward pointer is moved, to guarantee
that it has not run off one half of the buffer; if it has, the other half must be reloaded.

As a result, each advance of the forward pointer would require two tests at the ends of the
buffer halves. Test 1: check for the buffer's end. Test 2: determine which character is being
read. By extending each buffer half to hold a sentinel character at the end, the sentinel reduces
the two tests to one.

The sentinel is a special character that cannot be part of the source program (the character eof
serves as the sentinel).
What is Input Buffering in compiler design?

Without buffering, the lexical analyzer would have to visit secondary memory each time it needs a
character to identify tokens, which takes a long time and costs a lot. As a result, the input strings
are buffered before being examined by the lexical analyzer.

Lexical analysis reads the input string one character at a time from left to right to detect tokens.
To scan tokens, it employs two pointers.

 The Begin Pointer (bptr) points to the start of the string to be read.
 The Look Ahead Pointer (lptr) moves ahead in search of the end of the token.

Sample Example
Example: For the statement int a,b;

 Both pointers begin at the start of the string that is saved in the buffer.

 The Look Ahead Pointer examines the buffer until it finds the token.

 Before the token ("int") can be identified, the character ("blank space") beyond the token ("int") must be
checked.

 Both pointers will be set to the next token ('a') after processing the token ("int"), and this procedure will be
continued throughout the program.

The buffer is divided into two halves. When the lookahead pointer crosses the end of the first
half, the second half is filled with fresh characters to read; when it crosses the right end of the
second half, the first half is refilled with new characters, and so on.

Advantages of Input Buffering


 With sentinels, usually only one test is needed to determine whether the forward pointer has reached an eof.
 Further tests are performed only when the forward pointer reaches the end of a buffer half or an actual eof.
 Since N input characters are encountered between eofs, the average number of tests per input character is
very close to 1.
Disadvantages of Input Buffering
 Most of the time this approach works effectively; however, the amount of lookahead is restricted.
 In circumstances where the forward pointer must travel a distance greater than the buffer length, this
restricted lookahead may make it impossible to recognize tokens.
 For example, the analyzer must see the character that follows the right parenthesis before it can decide
whether DECLARE is a keyword or an array name; if that character lies too far ahead, the restricted
lookahead becomes a problem.
Frequently Asked Questions
What is the need for input buffering in compiler design?
The purpose of input buffering is to reduce the overhead of I/O operations. Input buffering allows the compiler
to read input characters in chunks and store them in a buffer, which can be accessed by the parser as needed.

What is the concept of input buffering?


Input buffering in compiler design involves reading and storing a portion of the source code in memory before
processing. It reduces the number of slow I/O operations, improving efficiency.

What is input buffering in compiler design?


Input buffering in compiler design involves reading characters or symbols from the source code into a buffer to
reduce the number of I/O operations, improving efficiency during lexical analysis.

What is the role of sentinel in input buffering?


In input buffering, a sentinel is a special character or value used to signify the end of a buffer, facilitating
efficient processing by indicating when to stop reading data without needing to check the buffer size constantly.

What is lexical analysis in compiler design code?


Lexical analysis in compiler design is the initial phase that scans the source code, breaking it into tokens or
lexemes. It identifies keywords, operators, and literals for further processing.

In the context of compiler design, input buffering is a crucial technique that optimizes the process of reading
and processing source code (or other input) efficiently. A compiler typically works by reading source code and
breaking it down into tokens, which are then analyzed and processed to generate the output. Efficiently reading
the input stream, especially when dealing with large programs, is important to avoid performance bottlenecks.

Role of Input Buffering in Compiler Design:

In a compiler, input buffering refers to the process of reading the source code (or input) in a way that
minimizes the number of I/O operations, making the lexical analysis (tokenization) phase more efficient.

The typical stages in a compiler include:


1. Lexical Analysis (Scanner): This stage reads the raw input, tokenizes it into a sequence of tokens (e.g.,
keywords, operators, identifiers), and removes unnecessary whitespace or comments.
2. Syntax Analysis (Parser): The parser analyzes the token stream generated by the lexical analyzer to
build a syntax tree or abstract syntax tree (AST).
3. Semantic Analysis and further stages for optimization, code generation, etc.

Input buffering is especially important in the lexical analysis phase, where large amounts of source code are
read and tokenized.

Why Input Buffering is Needed:

 Performance: Reading each character individually (e.g., via system calls or directly from the input
stream) is much slower than reading chunks of data at once.
 Minimizing I/O Operations: Accessing memory or disk is relatively slow, so minimizing the frequency
of these accesses is essential for efficient processing.
 Buffering Reduces Overhead: By reading data into a buffer and working with it in chunks, the
compiler can reduce the overhead of I/O operations, making the lexical analysis more efficient.

Techniques for Input Buffering in Compilers:

1. Double Buffering:
o Double buffering is a common technique used in input buffering. It involves maintaining two
buffers of input data.
o One buffer is used to feed the lexical analyzer (scanner) while the other buffer is being filled
with more input data. When one buffer is exhausted, the roles of the buffers are switched.
o This allows the compiler to continuously read the input without needing to wait for new data,
thereby improving speed.
2. Buffering with a Lookahead Mechanism:
o Lexical analyzers often use lookahead techniques to help identify tokens more efficiently. For
example, if a language construct requires looking ahead at one or more characters (e.g.,
distinguishing between if and an identifier), the buffer can store additional characters in advance to
aid this lookahead (see the sketch after this list).
o A typical technique is to use a fixed-size buffer and implement a lookahead pointer that moves
across the buffer to identify the next token.
3. Buffer Size and Chunking:
o A common practice is to read in chunks (often 1024 or 4096 bytes) from the input file into the
buffer.
o The lexical analyzer then processes these chunks one character at a time, moving through the
buffer, until it reaches the end, at which point it refills the buffer with more data from the input
stream.
4. Finite State Machines (FSMs) with Buffering:
o The lexical analyzer can be implemented using a finite state machine (FSM) where each state
corresponds to a specific part of the lexing process. As the FSM moves through its states, it reads
characters from the input buffer and recognizes tokens. This allows it to efficiently handle
tokenization while minimizing the need for frequent I/O operations.
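As a minimal illustration of such a lookahead test, the following C sketch distinguishes the keyword if from an identifier that merely begins with "if" (the function name is illustrative, and the buffer is assumed to hold at least one character past the candidate keyword):

#include <ctype.h>

/* 'forward' points into the input buffer at a candidate keyword. */
static int is_if_keyword(const char *forward) {
    /* peek one character past "if" without consuming any input */
    return forward[0] == 'i' && forward[1] == 'f'
        && !isalnum((unsigned char)forward[2])
        && forward[2] != '_';
}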

Example of Input Buffering in Lexical Analysis:


Consider a simple example where a source program is being tokenized. Suppose the program contains the code:

int x = 10;

The steps the lexical analyzer might follow using input buffering could look like this:

1. Buffer Fill: The input is read into a buffer, say BUFFER[1024], with the first few characters (e.g., int x =
10;).
2. Token Identification:
o The lexical analyzer starts reading from the buffer one character at a time.
o It first identifies the token int, then skips whitespace to identify x, and so on.
3. Buffer Refill: When the lexical analyzer reaches the end of the buffer, it refills it with new data from the
input, and the process continues.

Example: Double Buffering

Here is a more concrete explanation of double buffering:

 Buffer 1 holds the first portion of the input (say, the first 1024 characters).
 Buffer 2 is initially empty. As the lexical analyzer consumes characters from Buffer 1, Buffer 2 is filled
with more input data (typically the next 1024 characters).
 When Buffer 1 is exhausted, the analyzer switches to Buffer 2 and starts consuming from it, while
Buffer 1 is refilled.
 This technique ensures that the lexer can continuously work without waiting for more data.

Benefits of Input Buffering in Compiler Design:

1. Efficiency: Reduces the number of system calls and input/output operations, leading to faster
compilation times.
2. Smooth Tokenization: Helps the lexical analyzer work efficiently by ensuring that there is always data
available for tokenization, avoiding delays when input is being read from slower sources like files or
networks.
3. Lookahead Capability: Enables the compiler to use techniques like lookahead without slowing down
input handling.

Conclusion:

In compiler design, input buffering is a critical optimization technique that helps the lexical analyzer (scanner)
process source code efficiently. It reduces the overhead of frequent I/O operations, speeds up tokenization, and
provides a smooth flow of input data for further processing. Through methods like double buffering and
chunking, the compiler can efficiently handle large amounts of source code and deliver high performance
during the compilation process.
Pros of Input Buffering in Compiler Design:

1. Improved Performance:
o Faster Input Handling: By reducing the number of I/O operations (such as system calls to read
one character at a time), input buffering enables the lexical analyzer to process larger chunks of
input at once. This significantly speeds up the process of tokenizing the source code.
o Reduced Latency: Since data is read into the buffer in larger chunks, the compiler can
immediately start processing the data, reducing the wait time between reading the input and
performing analysis.
2. Optimized Use of Resources:
o Efficient Memory Use: Input buffering allows the compiler to store multiple characters in a
buffer at once, rather than performing frequent I/O operations, which helps in better utilizing
memory resources.
o CPU Efficiency: By allowing the CPU to work on processing tokens while waiting for more
input to be buffered, the overall efficiency of the compilation process is improved.
3. Continuous Data Flow:
o Prevents I/O Bottlenecks: Input buffering allows the lexical analyzer to have a steady stream of
input data without the need to frequently pause and refill from the input source (file, network,
etc.). This is especially important for large programs, where pausing for every token or character
could lead to significant performance hits.
4. Enables Lookahead Techniques:
o Lookahead for Tokenization: Buffers can store extra characters that are useful for lookahead
operations. In lexical analysis, lookahead is often required to distinguish between similar tokens
(e.g., distinguishing if from an identifier). Input buffering facilitates this lookahead by holding
additional characters in memory.
5. Minimized Disk/Network Access:
o Less Frequent Disk Access: When reading from a file or network, input buffering reduces the
number of accesses to slower storage media. This minimizes the cost of accessing data from disk
or remote servers, which can be significantly slower than working with data already loaded into
memory.

Cons of Input Buffering in Compiler Design:

1. Increased Complexity:
o Buffer Management Overhead: Implementing and managing buffering schemes, such as
double buffering or handling lookahead buffers, adds complexity to the design of the lexical
analyzer. It may require additional logic to handle cases like buffer exhaustion, switching
between buffers, or dealing with incomplete tokens at the buffer boundary.
o Error Handling Complexity: When errors occur near the boundary of buffers (e.g., at the end
of one buffer or the beginning of another), handling them appropriately can be tricky. For
example, if an incomplete token is encountered at the end of one buffer, it must be carefully
processed when the next buffer is read in.
2. Memory Overhead:
o Increased Memory Usage: Input buffers require additional memory to store the buffered data,
and this can be a concern in systems with limited memory resources. While the memory
requirement is typically small (a few kilobytes to megabytes), it can still become significant in
embedded systems or low-memory environments.
o Buffer Size Trade-Offs: The buffer size needs to be tuned for optimal performance. A larger
buffer may improve performance but increase memory consumption. A smaller buffer may
reduce memory usage but may result in more frequent buffer refills, potentially negating the
benefits of buffering.
3. Potential Overhead in Small Programs:
o Unnecessary for Small Input: In cases where the source code is small or the program being
compiled is simple, the overhead of buffering may not offer significant performance
improvements. The complexity added by the buffering mechanism might not justify its use if the
input size is small enough that frequent I/O operations are not a concern.
4. Buffering Latency in Certain Scenarios:
o Initial Delay for Buffer Refill: In scenarios where buffers need to be refilled frequently, there
might be a slight delay in the initial phase of reading input (especially if the input is being read
from disk or a network). This latency could cause a temporary slowdown, although it is often
compensated for by the long-term performance benefits of buffering.
o Buffering Overhead: The process of refilling buffers and managing them can add a slight
overhead to the system. While this overhead is generally small, it could become noticeable in
systems where every microsecond counts (e.g., real-time compilers or embedded systems).
5. Complicated Token Recognition at Buffer Boundaries:
o Difficulty at Buffer Edges: When a token is split across two buffers (i.e., part of the token is at
the end of one buffer and the rest is at the beginning of the next), extra care must be taken to
correctly identify the token. This can complicate the logic and make the lexical analyzer more
prone to errors.

Summary:

Pros:

 Significant performance improvements by reducing I/O operations.


 Efficient use of CPU and memory resources.
 Enables continuous data flow and lookahead for better tokenization.
 Reduces disk/network access, improving efficiency.

Cons:

 Increased complexity in buffer management and error handling.


 Additional memory overhead for storing buffered data.
 Potential for unnecessary overhead in small programs or environments with limited resources.
 Possible delays in buffer refill and managing token boundaries across buffers.

Conclusion:

Input buffering is a valuable technique for improving the efficiency of lexical analysis in a compiler, especially
when dealing with large input files. However, it comes with trade-offs in terms of complexity and memory
usage. Whether the benefits outweigh the drawbacks depends on the specific use case, the size of the input, and
the target environment (e.g., embedded systems vs. high-performance servers)

1. What is input buffering in compiler design?

Input buffering in compiler design refers to the technique of reading the source code (or input data) into
memory in chunks (i.e., buffers) rather than reading it character by character. This technique is used to optimize
the performance of the lexical analysis phase of a compiler, reducing the number of I/O operations and speeding
up the tokenization process.

2. Why is input buffering important in compiler design?

Input buffering is crucial because it helps improve the efficiency of the lexical analyzer (scanner). Without
input buffering, the lexical analyzer would need to read one character at a time from the input, which can be
very slow, especially for large programs. Buffering allows the compiler to work with chunks of data,
significantly speeding up the process and reducing the number of I/O operations.

3. What are the advantages of input buffering in compilers?

 Improved performance: By reducing I/O operations, input buffering speeds up the process of reading
and tokenizing source code.
 Efficient resource usage: It minimizes the overhead of frequently accessing the disk or network.
 Continuous data flow: The compiler can continue processing without waiting for more data to be read
from input.
 Lookahead capabilities: Buffering supports techniques like lookahead, which are important for
distinguishing between similar tokens (e.g., if vs. an identifier).

4. What are the different types of input buffering techniques used in compilers?

 Single Buffering: A single buffer is used to hold a chunk of input. When it is exhausted, the next chunk
is read into the buffer.
 Double Buffering: Two buffers are used. While one buffer is being processed, the other is being filled
with new data. This allows continuous processing without waiting for new data to be read in.
 Circular Buffering: A buffer is treated as a circular structure, allowing the analyzer to wrap around and
reuse space, which can be helpful for optimizing memory usage.

5. How does double buffering work in a compiler?

In double buffering, two buffers are maintained:

 Buffer 1 is filled with data from the input source (e.g., file, network).
 The lexical analyzer processes characters from Buffer 1.
 While Buffer 1 is being processed, Buffer 2 is filled with new data from the input.
 When Buffer 1 is exhausted, the analyzer switches to Buffer 2, and the process repeats, ensuring
continuous input processing.

6. What is the role of lookahead in input buffering?

Lookahead refers to the ability of the lexical analyzer to examine future characters in the input stream to decide
how to tokenize the current character. Input buffering helps by holding additional characters in the buffer,
allowing the lexer to look ahead at the next character(s) to make decisions about token boundaries (e.g.,
distinguishing between if and an identifier).

7. What challenges are associated with input buffering in compilers?

 Buffer boundary issues: When a token is split across two buffers, additional logic is required to
properly identify the token when it crosses buffer boundaries.
 Memory overhead: Buffers consume memory, and larger buffers may require more memory resources.
This can be a concern for systems with limited memory.
 Complexity: Managing buffers, especially when implementing techniques like double buffering or
circular buffering, increases the complexity of the lexical analyzer and the overall compiler design.

8. Does input buffering work the same way for all types of input sources?

Not necessarily. Input buffering is typically used for files or streams where the input is sequential and can be
read in chunks. However, if the input source involves real-time data (such as interactive input from a user or
live data streams), the buffering strategy might need to be adjusted to accommodate those constraints. For
example, buffering might be less relevant when the input is being processed in real-time.

9. Can input buffering be skipped for small programs or projects?

For small programs or when the source code is relatively small, the overhead of setting up input buffering might
not provide a noticeable performance benefit. In such cases, the compiler may read the input character by
character without significant impact on performance. However, for larger projects, input buffering is essential
for efficiency.

10. What are the trade-offs when choosing buffer sizes for input buffering?

 Larger buffers: Larger buffers can lead to fewer I/O operations and more efficient processing, but they
consume more memory.
 Smaller buffers: Smaller buffers consume less memory but result in more frequent buffer refills, which
can reduce performance. Finding an optimal buffer size is often a balance between memory usage and
performance.

11. Can input buffering cause any problems in compiler performance?

While input buffering generally improves performance, it can cause issues in specific scenarios:

 Memory Usage: For very large programs or systems with limited memory, large buffers might lead to
excessive memory consumption.
 Latency: If the buffer is large, it might take longer to fill the buffer, which could cause delays before
processing begins, although this is usually compensated by long-term performance gains.

12. How is input buffering handled in different programming languages or compilers?

Different compilers and languages implement input buffering in various ways. For instance:

 In C/C++, standard I/O functions like fgetc(), fgets(), or getc() work with buffered streams by
default (a short C sketch follows this list).
 In Python, the built-in input() function uses input buffering, while file operations like open() can use
buffered or unbuffered modes.
 In Lex/Flex (a tool for lexical analysis), input buffering is handled explicitly through buffer size
configuration and built-in buffer management.
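As a concrete C illustration, the standard library lets a program control stdio buffering explicitly via setvbuf. The sketch below (the file name source.c is hypothetical) requests a fully buffered 8 KiB stream, so each fgetc() call normally reads from memory rather than from the operating system:

#include <stdio.h>

int main(void) {
    FILE *f = fopen("source.c", "r");   /* hypothetical input file */
    if (!f) return 1;

    /* Request full buffering with an 8 KiB buffer: fgetc() now reads from
       memory and only occasionally triggers a real read from the OS. */
    static char buf[8192];
    setvbuf(f, buf, _IOFBF, sizeof buf);

    int c, count = 0;
    while ((c = fgetc(f)) != EOF)
        count++;

    printf("%d characters\n", count);
    fclose(f);
    return 0;
}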

13. Is input buffering always necessary for a compiler?

While input buffering is highly beneficial for performance, it is not always necessary. For very small programs
or in environments where resources are limited (e.g., embedded systems), it might be acceptable to read input
directly without buffering. However, for most real-world compilers, especially those processing large
codebases, input buffering is crucial.

Conclusion

Input buffering in compiler design is a powerful optimization that significantly improves the speed and
efficiency of the lexical analysis phase. It helps minimize I/O operations and supports more efficient processing
of source code. However, it does introduce some complexity, memory usage concerns, and boundary handling
challenges that need to be carefully managed. The trade-offs and implementation choices largely depend on the
size of the program, the target environment, and the performance requirements of the compiler.
Specification of Tokens
Specification of tokens depends on the pattern of the lexeme. Here we will
be using regular expressions to specify the different types of patterns that
can actually form tokens.
Although regular expressions cannot efficiently specify every pattern that can form a token,
they can describe almost all of the types of patterns that do form tokens.
There are 3 specifications of tokens:
1. String
2. Language
3. Regular Expression

1. String

 An alphabet or character class is a finite set of symbols.


 A string over an alphabet is a finite sequence of symbols drawn from that
alphabet.
 In language theory, the term "word" is often used as a synonym for "string".
 The length of a string s, usually written |s|, is the number of occurrences of
symbols in s. For example, "banana" is a string of length six.
 The empty string, denoted ε, is the string of length zero.
Operations on String:
1. Prefix of String
A prefix of a string is formed by the leading symbols of the string; the prefixes include ∈ and
the string itself.

For Example: s = abcd

Prefixes of the string abcd: ∈, a, ab, abc, abcd

2. Suffix of String
A suffix of a string is formed by the trailing symbols of the string; the suffixes include ∈ and
the string itself.
For Example: s = abcd
Suffixes of the string abcd: ∈, d, cd, bcd, abcd

3. Proper Prefix of String


The proper prefixes of a string are all its prefixes excluding ∈ and the string itself.

Proper Prefixes of the string abcd: a, ab, abc

4. Proper Suffix of String


The proper suffixes of a string are all its suffixes excluding ∈ and the string itself.

Proper Suffixes of the string abcd: d, cd, bcd

5. Substring of String
A substring of a string s is obtained by deleting any prefix and any suffix from s.

Substrings of the string abcd: ∈, abcd, bcd, abc, …

6. Proper Substring of String


The proper substrings of a string s are all the substrings of s excluding ∈ and s itself.

Proper Substrings of the string abcd: bcd, abc, cd, ab, …

7. Subsequence of String
The subsequence of the string is obtained by eliminating zero or more (not
necessarily consecutive) symbols from the string.

A subsequence of the string abcd: abd, bcd, bd, …

8. Concatenation of String
If s and t are strings, then st denotes their concatenation, formed by appending t to s.

s = abc, t = def

Concatenation of strings s and t: st = abcdef


2. Language

A language is any countable set of strings over some fixed alphabet.


Operations on Languages
As we have learned, a language is a set of strings constructed over some fixed alphabet.
The operations that can be performed on languages are:

1. Union
Union is the most common set operation. Consider the two languages L
and M. Then the union of these two languages is denoted by:

L ∪ M = { s | s is in L or s is in M}

That means the string s from the union of two languages can either be
from language L or from language M.
If L = {a, b} and M = {c, d} Then L ∪ M = {a, b, c, d}

2. Concatenation
Concatenation links the string from one language to the string of another
language in a series in all possible ways. The concatenation of two
different languages is denoted by:

L ⋅ M = {st | s is in L and t is in M} If L = {a, b} and M = {c, d}

Then L ⋅ M = {ac, ad, bc, bd}

3. Kleene Closure
The Kleene closure of a language L is the set of strings obtained by concatenating L zero or
more times. It is denoted by:

L* = L0 ∪ L1 ∪ L2 ∪ … (where L0 = {∈})

If L = {a, b} then L* = {∈, a, b, aa, ab, ba, bb, aaa, …}

4. Positive Closure
The positive closure of a language L is the set of strings obtained by concatenating L one or
more times. It is denoted by:

L+ = L1 ∪ L2 ∪ L3 ∪ …

It is the same as the Kleene closure except that the term L0 is omitted; i.e., L+ excludes ∈
unless ∈ is in L itself.

If L = {a, b} then L+ = {a, b, aa, ab, ba, bb, aaa, …}

So, these are the four operations that can be performed on the languages in
the lexical analysis phase.
3. Regular Expression
A regular expression is a sequence of symbols used to specify lexeme
patterns. A regular expression is helpful in describing the languages that
can be built using operators such as union, concatenation, and closure over
the symbols.

A regular expression r that denotes a language L(r) is built recursively from smaller regular
expressions using the rules given below.

The following rules define the regular expression over some alphabet Σ
and the languages denoted by these regular expressions.
1. ∈ is a regular expression that denotes a language L(∈). The language L(∈)
has a set of strings {∈} which means that this language has a single empty
string.
2. If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a
language L(a). The language L(a) = {a} i.e. the language has only one string
of length one and the string holds ‘a’ in the first position.
3. Consider two regular expressions r and s. Then (a worked example follows this list):

 r|s denotes the language L(r) ∪ L(s).
 (r)(s) denotes the language L(r) ⋅ L(s).
 (r)* denotes the language (L(r))*.
 (r)+ denotes the language (L(r))+.
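For example, over the alphabet Σ = {a, b}: a|b denotes {a, b}; (a|b)(a|b) denotes {aa, ab, ba, bb}; a* denotes {∈, a, aa, aaa, …}; and (a|b)* denotes the set of all strings of a's and b's, including ∈.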
Regular Definitions:
Giving names to regular expressions is referred to as a Regular definition.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence
of definitions of the form.

d1 → r1
d2 → r2
………
dn → rn
 Each di is a distinct name.
 Each ri is a regular expression over the alphabet Σ U {d1, d2,. . . , di-1}.

Example: Identifiers are the set of strings of letters and digits beginning with a letter.
A regular definition for this set (a Lex-style sketch follows):

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9

id → letter ( letter | digit )*
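The same definition can be written in Lex/Flex notation; the following is a minimal sketch, with an illustrative action:

letter   [A-Za-z]
digit    [0-9]

%%
{letter}({letter}|{digit})*   { printf("id\n"); }
%%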

Recognition of Tokens in Compiler Design

Recognition of Tokens in Compiler Design is a crucial step. It breaks down the source code into
understandable parts, like words and symbols. This helps the compiler understand the code's
structure and meaning.

Recognition of Tokens

 Tokens obtained during lexical analysis are recognized by Finite Automata.


 Finite Automata (FA) is a simple idealized machine that can be used to recognize patterns
within input taken from a character set or alphabet (denoted as C). The primary task of an
FA is to accept or reject an input based on whether the defined pattern occurs within the
input.
 There are two notations for representing Finite Automata. They are:

1. Transition Table
2. Transition Diagram

1. Transition Table

It is a tabular representation that lists all possible transitions for each state and input symbol
combination.
EXAMPLE
Assume the following grammar fragment to generate a specific language,
where the terminals if, then, else, relop, id and num generate sets of strings given by the
regular definitions that follow.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term | term
term → id | num

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?


 where letter and digit are defined as: letter → [A-Z a-z] and digit → [0-9]
 For this language, the lexical analyzer will recognize the keywords if, then, and else, as
well as lexemes that match the patterns for relop, id, and num.
 To simplify matters, we make the common assumption that keywords are also reserved
words: that is, they cannot be used as identifiers.
 num represents the unsigned integers and real numbers of Pascal.
 In addition, we assume lexemes are separated by white space, consisting of nonnull
sequences of blanks, tabs, and newlines.
 Our lexical analyzer will strip out white space by comparing strings against the regular
definition ws below:

ws → ( blank | tab | newline )+

 If a match for ws is found, the lexical analyzer does not return a token to the parser.
 It is the following token that gets returned to the parser.

2. Transition Diagram

It is a directed labeled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.
Components of Transition Diagram

1. One state is labelled the Start State. It is the initial state of the transition diagram, where
control resides when we begin to recognize a token.
2. Positions in a transition diagram are drawn as circles and are called states.
3. The states are connected by arrows called edges. Labels on edges indicate the
input characters.
4. Zero or more final states, or accepting states, are represented by double circles; reaching
one means a token has been found.
5. Example: a diagram in which state 1 is the initial state and state 3 is the final state
(figure not reproduced).


The transition diagrams of the Finite Automata that recognize the lexemes matching the token
relop, the identifiers and keywords, and white space are not reproduced here; a C sketch of the
relop diagram is given below.
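The relop diagram can be rendered directly in code. The following is a minimal C sketch for the usual relop set (<, <=, <>, =, >, >=); the function and enum names are illustrative, not part of the original notes.

enum relop_tok { LT, LE, NE, EQ, GT, GE, NOT_RELOP };

/* Walk the relop transition diagram over a NUL-terminated string.
   On success, *p is left just past the lexeme; where the diagram calls
   for it, the extra character examined is not consumed ("retract"). */
enum relop_tok relop_token(const char **p) {
    const char *s = *p;
    switch (*s++) {
    case '<':
        if (*s == '=') { *p = s + 1; return LE; }
        if (*s == '>') { *p = s + 1; return NE; }
        *p = s; return LT;    /* looked one character ahead, then retracted */
    case '=':
        *p = s; return EQ;
    case '>':
        if (*s == '=') { *p = s + 1; return GE; }
        *p = s; return GT;    /* retract */
    default:
        return NOT_RELOP;     /* input does not start with a relop */
    }
}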
Note:
These Finite Automata can be constructed using either the transition diagram or the transition
table representation. Both transition diagrams and transition tables serve the same purpose of
defining and representing the behavior of an FA. They provide different visual and structural
representations, allowing designers to choose the format that best suits their preferences or
requirements.
Lex in compiler design is a program used to generate scanners or lexical analyzers, also called
tokenizers. These tokenizers identify the lexical pattern in the input program and convert the input text
into the sequence of tokens. It is used with the YACC parser generator.

In this article, we will learn what Lex in compiler design is and its role in the compilation process. We
will also be learning about lexical analysis and terminologies related to it.
What is Lexical Analysis?
Compilation happens in multiple phases, and lexical analysis is the first phase of the compiler. It
gathers preprocessed source code that comes as the preprocessor's output.
The lexical analyzer generates a stream of tokens from the preprocessed source code by removing
whitespace and comments, and it generates an error if it encounters an invalid token. It reads the
stream of characters, seeks the legal tokens, and passes the data to the syntax analyzer when
asked for it.
Uses of lexical analyzer
The following tasks are performed by a lexical analyzer-
 The lexical analyzer removes white space and comments from the source program.
 It correlates error messages with the source program.
 It helps in the identification of tokens.
 It reads the input characters from the source code.
Advantages of Lexical Analysis
Lexical analysis has the following advantages-
 The lexical analysis allows the browsers to format and display a web page with the help of parsed data.
 It is responsible for generating a compiled binary executable code.
 The lexical analyzer generates a more efficient and specialized processor for the task.
Disadvantages of Lexical Analysis
Lexical analysis has the following disadvantages-
 Additional runtime is required to create the symbol table and construct the token, generating overhead.
 Debugging and developing the lexer and token description requires much effort.
 A lot of time is required to read the source code and break it into tokens as the analyzer reads the
characters individually.
What is LEX in Compiler Design?
Lex is a tool that creates a lexical analyzer, helping us perform the task of lexical
analysis (converting a character stream into tokens). The Lex tool is itself a compiler: it
transforms the patterns in its input specification into a lexical analyzer.
Eric Schmidt and Mike Lesk initially developed the code for Lex, which was intended for Unix-based
systems.
Function of Lex
1. Lexical Analyzer Creation: The process begins by creating a program called lex.l in the Lex language.
This program defines the rules and patterns for recognizing tokens in the source code
2. Lex Compiler Execution: The lex.l program is then run through the Lex compiler. This step
generates a C program named lex.yy.c
3. C Compiler Execution: The C compiler is then used to compile the generated lex.yy.c program. The
result is an object program referred to as a.out
4. Lexical Analysis: The a.out object program is essentially a lexical analyzer. When this program is run, it
takes an input stream of source code and transforms it into a sequence of tokens based on the rules
defined in the original lex.l program
LEX Source Program
A LEX source program is a collection of instructions and patterns. These instructions and patterns are
written in the LEX programming language. LEX is a tool used for generating lexical analyzers. It
tokenizes input source code into meaningful units called tokens.
The LEX source program defines how these tokens are recognized and processed. It consists of
regular expressions that describe the patterns of tokens and corresponding actions to be taken when
those patterns are encountered in the source code. This program serves as a set of rules that
instructs LEX on how to break down a stream of characters from the input source code into tokens.
These tokens can represent keywords, identifiers, operators, and other language constructs. There
are two main components of this Lex source program:
1. Auxiliary Definitions: These are often located in the "Definitions Section" of the LEX source program.
They are user-defined macros or variables that simplify the expression of regular expressions and
actions in the program

2. Translation Rules: These are commonly found in the "Rules Section" of the LEX source program. They
establish the mapping between patterns and actions. Each translation rule consists of a regular
expression pattern followed by the action to be executed when that pattern is matched in the input source
code
Working of Lex
The working of lex in compiler design as a lexical analysis takes place in multiple steps. Firstly we
create a file that describes the generation of the lex analyzer. This file is written in Lex language and
has a .l extension. The lex compiler converts this program into a C file called lex.yy.c. The C
compiler then runs this C file, and it is compiled into a.out file. This a.out file is our working Lexical
Analyzer which will produce the stream of tokens based on the input text.

Lex File Format


Any lex file consists of the following three parts-
 Definitions: This section of lex files contains declarations of constant, variable, and regular definitions.
 Rules: This section of the lex file defines the rules in the form of regular expressions and
corresponding actions that the lexer should take to identify tokens in the input stream. Each rule consists
of a regular expression followed by a code block that specifies the action to take when the regular
expression matches. The rules have the form: p1 {action1} p2 {action2} ... pn {actionn}, where pi is the
i-th regular expression and actioni is the action that must be taken when that regular expression matches.
 User Subroutines: This section contains the user-defined functions that can be used by the action code
block in the rules section of the lex file.
The formal lex file has the following format-
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Regular Expressions
As discussed, we use regular expressions to define the rules in Lex that decide whether an input
string matches a pattern. A regular expression represents a finite pattern over a finite set of
symbols. Regular expressions define a regular grammar, and a regular language is defined by a
regular grammar.
Conflicts in Lex
Conflicts arise in Lex when a given string matches two or more different rules, or when an input
string shares a common prefix with two or more rules. This causes ambiguity for the lexical analyzer, so
it must be resolved.
Conflict resolution
Conflicts in Lex are resolved by the following two rules-
 The longest matching prefix is preferred over a shorter one.
 If the longest possible prefix matches two or more patterns, the pattern listed first in the Lex
program is picked.
The Architecture of Lexical Analyzer
The task of the lexical analyzer is to read the input characters in the source code and produce tokens
one by one. The scanner produces tokens when it is requested by the parser. The lexical analyzer
also avoids any whitespace and comments while creating tokens. If any error occurs, the analyzer
correlates these errors with the source file and the line number.

Lookahead Operator
Lex reads an additional character beyond the valid lexeme to differentiate between patterns for a
token, and then retracts before producing the token.
However, sometimes we want a particular pattern to be matched against the input only when certain other
characters follow it. Then we may use a slash in a pattern to indicate the end of the part of the
pattern that matches the lexeme.
For example:
In some languages, keywords are not reserved. So the statements
IF (A, B) = 10 and IF (condition) THEN
result in a conflict over whether IF is an array name or a keyword. To resolve this, the
Lex rule for the keyword IF can be written as
IF / \( .* \) {letter}
which matches IF only when it is followed by a parenthesized expression and then a letter
(as in IF (condition) THEN).
An example implementation of a Lexical Analyser using Lex
The following Lex Program counts the number of words-
Code
%{
#include <stdio.h>
#include <string.h>
int count = 0;
%}

/* Rules Section */
%%
 /* Rule for counting words: '+' (not '*') so the pattern never matches the empty string */
[a-zA-Z0-9]+ {count++;}

"\n" {printf("Total Number of Words : %d\n", count); count = 0;}


%%

/* yywrap must return a nonzero value to signal that input is exhausted */
int yywrap(void){ return 1; }

int main()
{
    // The function that starts the analysis
    yylex();

    return 0;
}
How to execute
Type lex lexfile.l

Type gcc lex.yy.c

Type ./a.out (on Windows ports, ./a.exe)

Output
For each input line, the program prints the total number of words on that line (screenshot not reproduced).
Explanation
In the definition section of the program, we have included the standard library for input-output
operations and string operations and a variable count to keep count of the number of words.
In the rules section, we have specified the regular expression [a-zA-Z0-9]+, which matches any
string formed of letters and numeric digits, such as "AYUSHI28", "BOND007", etc.
There is a rule for a newline too. When a newline character is encountered, the current count of
words is printed, and the counter is set to zero.
Frequently Asked Questions
How to make a compiler with lex?
Creating a compiler with Lex involves a few key steps. First, we need to define regular expressions
that describe the patterns of tokens in the source code. Then, we need to write a Lex specification file
where you associate these regular expressions with corresponding actions.
What is lex in compiler design with example?
Lex is a lexical analyzer generator tool used in compiler design to generate lexical analyzers. It takes
regular expressions as input and produces code for scanning input streams. For example, in lex,
defining rules to recognize keywords in a programming language.
What is the function of lex?
Lex generates lexical analyzers, which are programs that break input into tokens or lexemes based
on specified patterns. These lexemes serve as input for the parser. The function of lex is to automate
the generation of lexical analyzers from regular expressions.
What is the purpose of the lex tool?
The purpose of the lex tool is to simplify the process of creating lexical analyzers for compilers and
interpreters. It achieves this by allowing developers to specify token recognition patterns using regular
expressions, which lex then translates into efficient, executable code for tokenizing input streams.
What are the three basic sections of the lex program?
The three basic sections of the lex program are:
1. Definition Section: Defines macros and regular expression patterns.
2. Rules Section: Specifies patterns and corresponding actions.
3. C Code Section: Contains user-defined C code to be executed alongside lex-generated code.

What is Lex?

It is a tool or software which automatically generates a lexical analyzer (finite automata). It
takes as its input a LEX source program and produces a lexical analyzer as its output. The lexical
analyzer converts the input string entered by the user into tokens as its output.
LEX is a program generator designed for the lexical processing of character input streams. It can
build anything from a simple text-search program that looks for patterns in its input file to a C
compiler that transforms a program into optimized code.
In programs with structured input, two tasks occur over and over: dividing the
input into meaningful units, and then discovering the relationships among the units.
For a C program, the units are variable names, constants, and strings. This division into
units (called tokens) is known as lexical analysis, or LEXING. LEX helps by taking a set of
descriptions of possible tokens and producing a routine called a lexical analyzer, LEXER, or
scanner.
LEX Source Program
It is a language used for specifying or representing Lexical Analyzer.
There are two parts of the LEX source program −
 Auxiliary Definitions
 Translation Rules

 Auxiliary Definitions
These denote regular expressions of the form:

D1 = R1
D2 = R2
...
Dn = Rn

Where
 Distinct Name (Di) → a shorthand name for a regular expression
 Regular Expression (Ri) → a notation to represent a collection of input symbols
Example
Auxiliary Definition for Identifiers −

letter = A | B | ... | Z | a | b | ... | z
digit = 0 | 1 | ... | 9
identifier = letter (letter | digit)*

Auxiliary Definition for Signed Numbers


integer=digit digit*
sign = + | -
signedinteger = sign integer
Auxiliary Definition for Decimal Numbers
decimal = signedinteger . integer | sign.integer
Auxiliary Definition for Exponential Numbers
Exponential-No = (decimal | signedinteger) E signedinteger
Auxiliary Definition for Real Numbers
Real-No = decimal | Exponential-No
 Translation Rules
It is a set of rules or actions which tells Lexical Analyzer what it has to do or what it has to
return to parser on encountering the token.
It consists of statements of the form −
P1 {Action1}
P2 {Action2}
.
.
.
Pn {Actionn}
Where
Pi → a pattern or regular expression consisting of input alphabets and auxiliary definition
names.
Actioni → a piece of code that gets executed whenever the corresponding pattern Pi matches
the input string; each Actioni specifies a set of statements to be executed when a token is
recognized.
Example
Translation Rules for "Keywords"

begin {return 1;}

We can see that if the Lexical Analyzer is given the input "begin", it will recognize the token
"begin" and return 1 as the integer code to the parser.
Translation Rules for "Identifiers"
letter (letter | digit)* {Install(); return 6}

If the Lexical Analyzer is given a token which is an "identifier", the action taken is to
install (store) the name in the symbol table and return the value 6 as the integer code to
the parser.

In compiler design, Lex (short for Lexical Analyzer) tools are used to perform lexical analysis, which is the first phase of
a compiler. This phase involves reading the source code and breaking it into tokens, such as keywords, identifiers, literals,
and symbols.

🔧 What is Lex?

Lex is a lexical analyzer generator developed in the 1970s. You provide Lex with rules using regular expressions for
recognizing tokens, and it generates C code for a lexical analyzer.

🧩 How Lex Works:

1. Input: Lex specification file (.l file), which contains:


o Definitions
o Rules (patterns and actions)
o User subroutines (optional C code)
2. Processing: Lex translates this into a C file (usually lex.yy.c) with a function yylex().
3. Output: A compiled lexer that reads input and identifies tokens based on your rules.

🧱 Structure of a Lex Program

A Lex program is divided into three sections:

%{
/* Definitions and header code (e.g., includes) */
%}

%%
 /* Rules section: "if" comes before the identifier rule so that, on an
    equal-length match, the keyword wins (the first-listed rule is picked) */
"if" { printf("IF KEYWORD\n"); }
[0-9]+ { printf("NUMBER\n"); }
[a-zA-Z]+ { printf("IDENTIFIER\n"); }
%%
/* User code (main function etc.) */
int main() {
    yylex();
    return 0;
}

📦 Tokens and Patterns

 Lex uses regular expressions to define patterns.


 Actions (usually written in C) specify what to do when a pattern is matched.

🔄 Lex and Yacc

Lex is often used with Yacc (Yet Another Compiler Compiler), which performs syntax analysis (parsing). Lex handles
tokenization, and Yacc builds a parse tree or abstract syntax tree.

⚙️Alternatives to Lex

 Flex: Fast Lexical Analyzer, an open-source replacement for Lex.


 ANTLR: Supports lexer and parser generation for multiple languages.
 JLex: Lex for Java-based compilers.

✅ Summary

Tool    Purpose
Lex     Generates lexical analyzers
Flex    Faster, improved Lex
Yacc    Parser generator
ANTLR   Lexer + parser for Java and other languages

✅ Pros of Lex Tools

Advantage: Details

🔧 Automation of Lexer Generation: Saves time by automatically generating code for token recognition based on regex patterns.
⚙️ Integration with Yacc/Bison: Works seamlessly with parser generators like Yacc and Bison, making compiler development smoother.
🚀 Efficiency: Lex-generated scanners are fast and optimized for performance in handling large inputs.
🧩 Regular Expression Support: Makes it easy to define complex token patterns using standard regex syntax.
📦 Modularity: Keeps lexical analysis separate from parsing and other compiler stages, promoting clean design.
🆓 Open Source Versions (like Flex): Tools like Flex are free, widely supported, and often faster than the original Lex.

❌ Cons of Lex Tools

📚 C Language Dependency: Lex outputs C code, so it ties your project to C/C++ unless additional wrappers are used.
🧠 Steep Learning Curve: Beginners may struggle with Lex's syntax and concepts like regular expressions and states.
🔄 Limited Error Handling: Lex is not ideal for handling complex lexical errors or ambiguous inputs on its own.
💬 Not Very Flexible for Context-sensitive Lexing: Difficult to handle languages where token meaning depends heavily on context.
🧱 Old and Low-level: Compared to modern tools like ANTLR, Lex feels outdated and lower-level.
🧪 Debugging Can Be Tricky: Generated C code can be hard to debug, especially when combined with Yacc/Bison.

📝 Summary

Lex Tools Verdict

Best for: Quick and efficient generation of lexers for C/C++ compilers, especially in academic or legacy projects.
Not ideal for: Projects needing language flexibility, object-oriented design, or deep context-aware tokenization.

❓1. What is Lex?

Lex is a lexical analyzer generator that takes a set of token patterns (written as regular expressions) and
produces C code that can recognize those tokens in source code.

❓2. What is the main function generated by Lex?

Lex generates a function called yylex(), which reads the input stream, matches it against the defined patterns, and
returns tokens to the parser.

❓3. What are the main sections of a Lex program?

A Lex program is divided into three parts:

%{
C declarations (headers, includes)
%}

%%
Rules section (regex patterns and actions)
%%
User code (main function, helper functions)

❓4. How does Lex recognize tokens?

Lex uses regular expressions to define token patterns. When input text matches a pattern, Lex executes the
corresponding action (usually C code).

❓5. Can Lex handle context-sensitive tokens?

No. Lex is based on finite automata, so it works best with regular languages. For context-sensitive analysis, you usually
pass the token to a parser like Yacc.

❓6. What file does Lex generate?

Lex generates a C file, usually named lex.yy.c, which contains the yylex() function and supporting code.

❓7. What’s the difference between Lex and Flex?

 Lex: Original Unix tool.


 Flex (Fast Lex): A modern, open-source, and faster version of Lex, widely used today.

❓8. How is Lex used with Yacc or Bison?

Lex handles lexical analysis (tokenizing the input), and Yacc/Bison handles syntax analysis (parsing the token stream).
They work together in compiler front-ends.

❓9. What are common tokens recognized by Lex?

 Keywords: if, while, return


 Identifiers: variable names like x, total
 Operators: +, -, =
 Literals: numbers like 123, strings like "hello"

❓10. What language is used for actions in Lex?

The action part is written in C (or C++), which is inserted directly into the generated yylex() function.

❓11. Can Lex be used outside of compiler design?

Yes! Lex (or Flex) can be used in:

 Interpreters
 Code formatters
 Log file analyzers
 Text processors
 Scripting language front-ends
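
As an illustration of the log-analyzer use, a tiny Lex spec for a (hypothetical) log format that reports only error lines might look like this sketch:

%{
#include <stdio.h>
%}
%%
"ERROR".*   { printf("error line: %s\n", yytext); }  /* report lines of interest */
.|\n        ;                                        /* silently ignore everything else */
%%
int main(void) { yylex(); return 0; }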

❓12. What are the limitations of Lex?

 Only handles regular languages (not suitable for parsing nested structures).
 Not good at handling complex, context-sensitive input.
 Tied to C/C++ unless wrappers are used.
 What is the main purpose of Lex in compiler design?
 A) Syntax analysis
B) Semantic analysis
C) Lexical analysis
D) Code optimization
 ✅ Answer: C) Lexical analysis

 2. What function is automatically generated by Lex?


 A) main()
B) parse()
C) yylex()
D) scanf()
 ✅ Answer: C) yylex()

 3. Lex is most commonly used with which other tool for parsing?
 A) Lint
B) GCC
C) ANTLR
D) Yacc
 ✅ Answer: D) Yacc

 4. Which of the following does Lex use to define token patterns?
 A) Context-free grammar
B) Backus-Naur Form (BNF)
C) Regular expressions
D) Finite state machines
 ✅ Answer: C) Regular expressions

 5. What file is produced after running Lex on a .l file?


 A) lex.yy.cpp
B) yylex.c
C) lex.yy.c
D) y.tab.c
 ✅ Answer: C) lex.yy.c

 6. Which programming language is primarily used in Lex actions?


 A) Java
B) C
C) Python
D) Assembly
 ✅ Answer: B) C

 7. Which type of grammar can Lex handle?


 A) Regular grammar
B) Context-free grammar
C) Context-sensitive grammar
D) Unrestricted grammar
 ✅ Answer: A) Regular grammar

 8. Which tool is considered the modern replacement of Lex?


 A) Yacc
B) Bison
C) Flex
D) LALR
 ✅ Answer: C) Flex

 9. What does the %% symbol represent in a Lex file?


 A) Start of user code
B) End of file
C) Separator between sections
D) End of main function
 ✅ Answer: C) Separator between sections

 10. Which of the following cannot be recognized by Lex?


 A) Identifiers
B) Nested parentheses
C) Integer constants
D) Operators
 ✅ Answer: B) Nested parentheses
(Lex can’t recognize nested patterns because it deals only with regular languages.)

LEX is a lexical analyzer generator tool used in compiler design to automatically generate a lexical analyzer (lexer or
scanner) from a set of regular expressions and actions. It's often used in conjunction with YACC (Yet Another Compiler
Compiler) for parsing.

🔧 Real-Time Applications of LEX in Compiler Design

LEX plays a crucial role in the front end of a compiler. Here are its key real-time applications:

1. Tokenization (Lexical Analysis)

LEX is used to break down source code into tokens like keywords, identifiers, operators, literals, etc.

Example:

int x = 10;

LEX identifies:

 int → Keyword
 x → Identifier
 = → Assignment operator
 10 → Integer literal
 ; → Delimiter

2. Syntax Highlighting in IDEs

LEX-based tools are used in IDEs and editors (like VS Code, Eclipse, etc.) to tokenize code for syntax coloring.

3. Code Preprocessing

LEX can process macros, remove comments, handle conditional compilation (#ifdef, #endif) in C/C++
preprocessors.
4. Source Code Translation

In source-to-source compilers or transpilers, LEX helps read and reformat code for another language, e.g.,
translating Java to C++.

5. Log File Analyzers / Text Processing Tools

Beyond compilers: LEX is used in creating tools to parse log files, config files, or structured text (like JSON, CSV).

6. Interpreters & Scripting Languages

Languages like Python or Shell script interpreters use lexer generators like LEX to scan input code and feed it to
the parser.

7. Custom Domain-Specific Languages (DSLs)

When building DSLs (e.g., a language for describing data flows or AI models), LEX is used to implement the lexer
quickly.

8. Error Detection & Reporting

LEX can detect lexical errors (invalid tokens or characters) and report them with line and column numbers—vital
for debugging.

9. Language Analysis Tools

Tools that analyze or manipulate code (e.g., static analyzers, linters) often rely on a lexer built with LEX to read
source code.

Example LEX Specification:


%%
[0-9]+ { printf("NUMBER: %s\n", yytext); }
[a-zA-Z]+ { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n] ; // Ignore whitespace
. { printf("UNKNOWN CHARACTER: %s\n", yytext); }
%%
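
To try this specification (assuming flex and gcc are installed), a typical build is: run flex on the .l file to produce lex.yy.c, compile it with gcc lex.yy.c -lfl, and run the resulting executable, typing input on standard input.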


1. What is LEX in compiler design?

Answer:
LEX is a lexical analyzer generator used to create programs (lexers or scanners) that recognize tokens in source code
using regular expressions.

2. What is the output of a LEX program?

Answer:
LEX generates a C program (lex.yy.c) that performs lexical analysis — it reads input text and identifies tokens defined
in the LEX specification file.

3. How does LEX work?

Answer:
LEX takes a .l file (LEX source code), processes it to create lex.yy.c, which can then be compiled with gcc or cc to
produce an executable that performs scanning.

4. What is yytext in LEX?

Answer:
yytext is a built-in LEX variable that holds the current matched text — the token that the lexer has just recognized.

5. What is the role of yylex()?

Answer:
yylex() is the main function generated by LEX. It reads input and returns the next token to the parser or calling
function.

6. How do you define tokens in LEX?


Answer:
Tokens are defined using regular expressions in the rules section of the LEX file. For example:

[0-9]+ { printf("NUMBER\n"); }

7. What is the structure of a LEX file?

Answer: A LEX file has 3 sections:

Definitions (optional)
%%
Rules (pattern-action pairs)
%%
User-defined C code (optional)

8. Can LEX be used with YACC?

Answer:
Yes! LEX and YACC are often used together:

 LEX does lexical analysis (tokenization)


 YACC handles parsing (syntax analysis)

9. What is lex.yy.c?

Answer:
It’s the C source file automatically generated by LEX after processing your .l file. This file contains the yylex()
function and all the logic to tokenize input.

10. Is LEX limited to C language?

Answer:
Traditional LEX generates C code. However, there are modern equivalents for other languages:

 Flex (Fast LEX) → C/C++


 ANTLR → Java, Python, etc.
11. What’s the difference between LEX and Flex?

Answer:

 LEX is the original tool.


 Flex is a faster, more modern version that’s widely used today.

12. What types of languages can use LEX?

Answer:
Any language with well-defined token patterns — C, Java, Python, SQL, even custom DSLs.

13. What are some alternatives to LEX?

Answer:

 Flex (most common alternative)


 ANTLR (lexing + parsing)
 JFlex (for Java)
 PLY (for Python)

14. Can LEX handle whitespace and comments?

Answer:
Yes. You can write rules in LEX to match and either ignore or handle whitespace, comments, and other non-token
content.

15. Is LEX still relevant today?

Answer:
Yes, especially in academic learning, compiler projects, interpreters, and text processing tools. While more modern
tools exist, LEX concepts remain foundational.

Compiler Design - Syntax Analysis


Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn
the basic concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions
and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to
the limitations of the regular expressions. Regular expressions cannot check balancing
tokens, such as parenthesis. Therefore, this phase uses context-free grammar (CFG), which
is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar.

It implies that every Regular Grammar is also context-free, but there exist some problems
which are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the
syntax of programming languages.

Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce
terminologies used in parsing technology.

A context-free grammar has four components:

 A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of
strings. The non-terminals define sets of strings that help define the language generated by
the grammar.
 A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.
 A set of productions (P). The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings. Each production consists of
a non-terminal called the left side of the production, an arrow, and a sequence of tokens
and/or non-terminals, called the right side of the production.
 One of the non-terminals is designated as the start symbol (S); from where the production
begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal
(initially the start symbol) by the right side of a production, for that non-terminal.

Example
We take the problem of palindrome language, which cannot be described by means of
Regular Expression. That is, L = { w | w = w R } is not a regular language. But it can be
described by means of CFG, as illustrated below:

G = ( V, Σ, P, S )

Where:

V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → ε | Z → 0Q0 | N → 1Q1 }
S = { Q }

This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101,
11111, etc.


Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token
streams. The parser analyzes the source code (token stream) against the production rules to
detect any errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and
generating a parse tree as the output of the phase.

Parsers are expected to parse the whole code even if some errors exist in the program.
Parsers use error recovering strategies, which we will learn later in this chapter.

Derivation
A derivation is basically a sequence of production rules, in order to get the input string.
During parsing, we take two decisions for some sentential form of input:

 Deciding the non-terminal which is to be replaced.


 Deciding the production rule, by which, the non-terminal will be replaced.

To decide which non-terminal to be replaced with production rule, we can have two options.

Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-
most derivation. The sentential form derived by the left-most derivation is called the left-
sentential form.

Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-
most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.

Example

Production rules:

E → E + E
E → E * E
E → id

Input string: id + id * id

The left-most derivation is:

E ⇒ E * E
E ⇒ E + E * E
E ⇒ id + E * E
E ⇒ id + id * E
E ⇒ id + id * id

Notice that the left-most side non-terminal is always processed first.

The right-most derivation is:

E ⇒ E + E
E ⇒ E + E * E
E ⇒ E + E * id
E ⇒ E + id * id
E ⇒ id + id * id

Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the
parse tree. Let us see this by an example from the last topic.

We take the left-most derivation of a + b * c (reading each identifier as the token id)

The left-most derivation is:

E ⇒ E * E
E ⇒ E + E * E
E ⇒ id + E * E
E ⇒ id + id * E
E ⇒ id + id * id

Step 1:

E ⇒ E * E

Step 2:

E ⇒ E + E * E

Step 3:

E ⇒ id + E * E

Step 4:

E ⇒ id + id * E

Step 5:

E ⇒ id + id * id

In a parse tree:

 All leaf nodes are terminals.


 All interior nodes are non-terminals.
 In-order traversal gives original input string.

A parse tree depicts associativity and precedence of operators. The deepest sub-tree is
traversed first, therefore the operator in that sub-tree gets precedence over the operator
which is in the parent nodes.

Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.

Example

E → E + E
E → E – E
E → id

For the string id + id – id, the above grammar generates two parse trees:
A language is said to be inherently ambiguous if every grammar that generates it is ambiguous.
Ambiguity in a grammar is not good for compiler construction. No method can detect and
remove ambiguity automatically, but it can be removed by either re-writing the whole
grammar without ambiguity, or by setting and following associativity and precedence
constraints.

Associativity
If an operand has operators on both sides, the side on which the operator takes this operand
is decided by the associativity of those operators. If the operation is left-associative, then the
operand will be taken by the left operator; if the operation is right-associative, the right
operator will take the operand.

Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If
the expression contains:

id op id op id

it will be evaluated as:

(id op id) op id

For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:

id op (id op id)

For example, id ^ (id ^ id)


Precedence
If two different operators share a common operand, the precedence of operators decides
which will take the operand. That is, 2+3*4 can have two different parse trees, one
corresponding to (2+3)*4 and another corresponding to 2+(3*4). By setting precedence
among operators, this problem can be easily removed. As in the previous example,
mathematically * (multiplication) has precedence over + (addition), so the expression 2+3*4
will always be interpreted as:

2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.

Left Recursion
A grammar becomes left-recursive if it has any non-terminal A whose derivation contains A
itself as the left-most symbol. Left-recursive grammar is considered to be a problematic
situation for top-down parsers. Top-down parsers start parsing from the Start symbol, which
in itself is non-terminal. So, when the parser encounters the same non-terminal in its
derivation, it becomes hard for it to judge when to stop parsing the left non-terminal and it
goes into an infinite loop.

Example:

(1) A => Aα | β

(2) S => Aα | β
A => Sd

(1) is an example of immediate left recursion, where A is any non-terminal symbol and α
represents a string of non-terminals.

(2) is an example of indirect left recursion.


A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself
and the parser may go into a loop forever.

Removal of Left Recursion


One way to remove left recursion is to use the following technique:

The production

A => Aα | β

is converted into following productions

A => βA'
A' => αA' | ε

This does not impact the strings derived from the grammar, but it removes immediate left
recursion.
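
For instance, instantiating A = E, α = + T and β = T, the left-recursive production E → E + T | T becomes E → T E' with E' → + T E' | ε, which derives exactly the same strings without left recursion.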

Second method is to use the following algorithm, which should eliminate all direct and
indirect left recursions.

START

Arrange non-terminals in some order like A1, A2, A3, …, An

for each i from 1 to n
{
   for each j from 1 to i-1
   {
      replace each production of form Ai => Ajγ
      with Ai => δ1γ | δ2γ | δ3γ | … | δnγ
      where Aj => δ1 | δ2 | … | δn are current Aj productions
   }
}
eliminate immediate left-recursion

END

Example

The production set


S => Aα | β
A => Sd

after applying the above algorithm, should become

S => Aα | β
A => Aαd | βd

and then, remove immediate left recursion using the first technique.

A => βdA'
A' => αdA' | ε

Now none of the productions has either direct or indirect left recursion.

Left Factoring
If more than one grammar production rules has a common prefix string, then the top-down
parser cannot make a choice as to which of the production it should take to parse the string
in hand.

Example

If a top-down parser encounters a production like

A => αβ | αγ | αδ

Then it cannot determine which production to follow to parse the string in hand, as all the
alternatives start with the same prefix α. To remove this confusion, we use a
technique called left factoring.

Left factoring transforms the grammar to make it useful for top-down parsers. In this
technique, we make one production for each common prefixes and the rest of the derivation
is added by new productions.

Example

The above productions can be written as

A => αA'
A' => β | γ | δ

Now the parser has only one production per prefix which makes it easier to take decisions.
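A classic concrete instance is the conditional statement: stmt → if expr then stmt else stmt | if expr then stmt left-factors into stmt → if expr then stmt S' with S' → else stmt | ε.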
First and Follow Sets
An important part of parser table construction is to create FIRST and FOLLOW sets. These sets
give the possible positions of terminals in derivations. They are used to build the
parsing table, where each entry T[A, t] is filled with the appropriate production rule.

First Set
This set is created to know what terminal symbol is derived in the first position by a non-
terminal. For example,

α → t β

That is, α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set


Look at the definition of the FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.
 if α is a non-terminal and α → ε is a production, then ε is in FIRST(α).
 if α is a non-terminal and α → γ1 γ2 γ3 … γn, and t is in FIRST(γi) where γ1 … γi-1 all derive ε, then t is in FIRST(α).

The First set can be seen as: FIRST(α) = { t | α ⇒* t β }

Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal in
production rules. We do not consider what the non-terminal can generate but instead, we
see what would be the next terminal symbol that follows the productions of a non-terminal.

Algorithm for calculating Follow set:

 if α is the start symbol, then FOLLOW(α) = { $ }
 if α is a non-terminal and has a production α → AB, then FIRST(B) is in FOLLOW(A), except for ε.
 if α is a non-terminal and has a production α → AB, where B derives ε, then FOLLOW(α) is in FOLLOW(A).

The Follow set can be seen as: FOLLOW(A) = { t | S ⇒* α A t β }
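
As a small worked example (the grammar is chosen here purely for illustration), take E → T E', E' → + T E' | ε, T → id. Then:

FIRST(T) = { id }
FIRST(E) = FIRST(T) = { id }
FIRST(E') = { +, ε }

FOLLOW(E) = { $ }
FOLLOW(E') = FOLLOW(E) = { $ }
FOLLOW(T) = (FIRST(E') except ε) ∪ FOLLOW(E') = { +, $ }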

Limitations of Syntax Analyzers


Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical
analyzers are responsible for the validity of a token supplied to the syntax analyzer. Syntax
analyzers have the following drawbacks -

 it cannot determine if a token is valid,


 it cannot determine if a token is declared before it is being used,
 it cannot determine if a token is initialized before it is being used,
 it cannot determine if an operation performed on a token type is valid or not.

These tasks are accomplished by the semantic analyzer, which we shall study in Semantic
Analysis.

Syntax Analysis: Compiler Top Down &


Bottom Up Parsing Types
What is Syntax Analysis?
Syntax Analysis is the second phase of the compiler design process, in which the
given input string is checked for conformance to the rules and structure of the
formal grammar. It analyses the syntactical structure and checks if the given
input is in the correct syntax of the programming language or not.
Syntax Analysis in the compiler design process comes after the Lexical Analysis
phase. Its output is known as the Parse Tree or Syntax Tree. The Parse Tree is
developed with the help of the pre-defined grammar of the language. The syntax
analyser also checks whether a given program fulfills the rules implied by a
context-free grammar. If it satisfies them, the parser then creates the parse tree of that
source program. Otherwise, it will display error messages.

Why do you need Syntax Analyser?


 Check if the code is valid grammatically
 The syntactical analyser helps you to apply rules to the code
 Helps you to make sure that each opening brace has a corresponding closing brace
 Each declaration has a type, and that type must exist

Important Syntax Analyser Terminology


Important terminologies used in syntax analysis process:

 Sentence: A sentence is a group of characters over some alphabet.

 Lexeme: A lexeme is the lowest level syntactic unit of a language (e.g., total,
start).

 Token: A token is just a category of lexemes.

 Keywords and reserved words – It is an identifier which is used as a fixed part


of the syntax of a statement. It is a reserved word which you can’t use as a
variable name or identifier.
 Noise words – Noise words are optional which are inserted in a statement to
enhance the readability of the sentence.

 Comments – It is a very important part of the documentation. It is mostly displayed
by /* */ or //.
 Blank (spaces)

 Delimiters – It is a syntactic element which marks the start or end of some


syntactic unit. Like a statement or expression, “begin”…”end”, or {}.

 Character set – ASCII, Unicode

 Identifiers – Names used for variables and other entities; restrictions on their length can
reduce the readability of the sentence.

 Operator symbols – + and – performs two basic arithmetic operations.


 Syntactic elements of the Language
Why do we need Parsing?

A parser also checks that the input string is well-formed, and if not, rejects it.
The following are important tasks performed by the parser in compiler design:

 Helps you to detect all types of Syntax errors


 Find the position at which error has occurred
 Clear & accurate description of the error.
 Recovery from an error to continue and find further errors in the code.
 Should not affect compilation of “correct” programs.
 The parse must reject invalid texts by reporting syntax errors
Parsing Techniques
Parsing techniques are divided into two different groups:

 Top-Down Parsing,
 Bottom-Up Parsing

Top-Down Parsing
In the top-down parsing construction of the parse tree starts at the root and then
proceeds towards the leaves.

Two types of Top-down parsing are:

1. Predictive Parsing:

A predictive parser can predict which production should be used to replace the
specific input string. The predictive parser uses a look-ahead pointer, which points
towards the next input symbols. Backtracking is not an issue with this parsing
technique. It is known as the LL(1) Parser.

2. Recursive Descent Parsing:

This parsing technique recursively parses the input to make a parse tree. It
consists of several small functions, one for each nonterminal in the grammar.

Bottom-Up Parsing
In bottom-up parsing in compiler design, the construction of the parse tree
starts with the leaves, and then it proceeds towards the root. It is also called
shift-reduce parsing. This type of parsing in compiler design is usually created with the
help of software tools.

Error – Recovery Methods


Common Errors that occur in Parsing in System Software
 Lexical: Name of an incorrectly typed identifier
 Syntactical: unbalanced parenthesis or a missing semicolon
 Semantical: incompatible value assignment
 Logical: Infinite loop and not reachable code

A parser should be able to detect and report any error found in the program.
Whenever an error occurs, the parser should be able to handle it and carry on
parsing the remaining input. A program can have the following types of errors at
various compilation process stages. There are five common error-recovery
methods which can be implemented in the parser.

Statement mode recovery


 In the case when the parser encounters an error, it helps you to take corrective
steps. This allows the rest of the inputs and states to be parsed ahead.
 For example, adding a missing semicolon comes under the statement mode recovery
method. However, the parser designer needs to be careful while making these
changes, as one wrong correction may lead to an infinite loop.

Panic-Mode recovery
 In the case when the parser encounters an error, this mode ignores the rest of
the statement and does not process input from the erroneous point up to a delimiter, such as a
semi-colon. This is a simple error recovery method.
 In this type of recovery method, the parser rejects input symbols one by one until
a single designated group of synchronizing tokens is found. The synchronizing
tokens are generally delimiters, such as the semicolon.

Phrase-Level Recovery
 Compiler corrects the program by inserting or deleting tokens. This allows it to
proceed to parse from where it was. It performs correction on the remaining
input. It can replace a prefix of the remaining input with some string this helps the
parser to continue the process.

Error Productions
 Error production recovery expands the grammar for the language which
generates the erroneous constructs. The parser then performs error diagnostic
about that construct.

Global Correction
 The compiler should make as few changes as possible while processing
an incorrect input string. Given an incorrect input string a and grammar c, algorithms
will search for a parse tree for a related string b, such that the number of insertions, deletions,
and modifications of tokens needed to transform a into b is as small as
possible.

Grammar
A grammar is a set of structural rules which describe a language. Grammars
assign structure to any sentence. The term also refers to the study of these
rules, and this field includes morphology, phonology, and syntax. Grammar is capable of
describing much of the syntax of programming languages.

Rules of Form Grammar

 The non-terminal symbol should appear to the left of the ::= in at least one production
 The goal symbol should never be displayed to the right of the ::= of any
production
 A rule is recursive if its LHS appears in its RHS

Notational Conventions
An optional element may be indicated by enclosing the element in
square brackets. An arbitrary sequence of instances of an element
can be indicated by enclosing the element in braces followed by an asterisk
symbol, { … }*.

A choice between alternatives may be indicated by using the symbol | within a single rule. It
may be enclosed by parentheses ( ) when needed.

Two types of Notational conventions area Terminal and Non-terminals


1.Terminals:

 Lower-case letters in the alphabet such as a, b, c,


 Operator symbols such as +,-, *, etc.
 Punctuation symbols such as parentheses, hash, comma
 0, 1, …, 9 digits
 Boldface strings like id or if, anything which represents a single terminal symbol

2.Nonterminals:

 Upper-case letters such as A, B, C


 Lower-case italic names such as expr or stmt

Context Free Grammar


A CFG is a grammar whose rules are mainly recursive; the expression grammar below, for
example, is also left-recursive, since it has productions whose right-hand sides begin with the
non-terminal of the left-hand side. A syntax analyser
checks whether a specific program satisfies all the rules of the context-free grammar or
not. If it does meet these rules, the syntax analyser may create a parse tree for that
programme.

expression -> expression + term

expression -> expression – term

expression -> term

term -> term * factor

term -> term / factor

term -> factor

factor -> ( expression )

factor -> id
Grammar Derivation
Grammar derivation is a sequence of grammar rule which transforms the start
symbol into the string. A derivation proves that the string belongs to the
grammar’s language.

Left-most Derivation

When the sentential form of input is scanned and replaced in left to right
sequence, it is known as left-most derivation. The sentential form which is
derived by the left-most derivation is called the left-sentential form.

Right-most Derivation

Right-most derivation scans and replaces the input with production rules, from right
to left. It is known as right-most derivation. The sentential form which is
derived from the right-most derivation is known as the right-sentential form.

Syntax vs. Lexical Analyser


Syntax Analyser:
 Mainly deals with recursive constructs of the language.
 Works on tokens in a source program to recognize meaningful structures in the programming language.
 Receives inputs, in the form of tokens, from the lexical analyser.

Lexical Analyser:
 Eases the task of the syntax analyser.
 Recognizes the tokens in a source program.
 Is responsible for the validity of each token supplied to the syntax analyser.

Disadvantages of using Syntax Analysers


 It will never determine whether a token is valid or not
 It does not help you determine whether an operation performed on a token type is valid or
not
 It cannot decide whether a token is declared and initialized before it is used

Summary
 Syntax analysis is a second phase of the compiler design process that comes
after lexical analysis
 The syntactical analyser helps you to apply rules to the code
 Sentence, Lexeme, Token, Keywords and reserved words, Noise words,
Comments, Delimiters, Character set, Identifiers are some important terms used
in the Syntax Analysis in Compiler construction
 Parse checks that the input string is well-formed, and if not, reject it
 Parsing techniques are divided into two different groups: Top-Down Parsing,
Bottom-Up Parsing
 Lexical, Syntactical, Semantical, and logical are some common errors occurs
during parsing method
 A grammar is a set of structural rules which describe a language
 Notational conventions symbol may be indicated by enclosing the element in
square brackets
 A CFG is a left-recursive grammar that has at least one production of the type
 Grammar derivation is a sequence of grammar rule which transforms the start
symbol into the string
 The syntax analyser mainly deals with recursive constructs of the language, while
the lexical analyser eases the task of the syntax analyser
 The drawback of Syntax analyser method is that it will never determine if a token
is valid or not

Types of Parsers in Compiler Design



Compiler design has many functional modules; one of them is the parser, which takes the
output of the lexical analyzer (often a list of tokens) and builds a parse tree. The main
responsibility of the parser is to confirm whether the grammar can generate
the input string, and it helps in the analysis of syntax.
What is a Parser?
The parser is one of the phases of the compiler which takes a stream of tokens as input
and converts it into the corresponding Intermediate Representation (IR) with the help
of an existing grammar. The parser is also known as the Syntax Analyzer.

Classification of Parser

Types of Parsers
The parser is mainly classified into two categories.
1. Top-down Parser
2. Bottom-up Parser
Top-Down Parser
Top-down parser is the parser that generates the parse tree for the given input string with
the help of grammar productions by expanding the non-terminals. It starts from the start
symbol and ends at the terminals. It uses leftmost derivation.
Further Top-down parser is classified into 2 types: 1) Recursive descent parser and 2)
non-recursive descent parser.
1. Recursive descent parser is also known as the Brute force parser or the backtracking
parser. It basically generates the parse tree by using brute force and backtracking
techniques.
2. Non-recursive descent parser is also known as LL(1) parser or predictive parser or
without backtracking parser or dynamic parser. It uses a parsing table to generate the
parse tree instead of backtracking.
Bottom-Up Parser
Bottom-up Parser is the parser that generates the parse tree for the given input string
with the help of grammar productions by compressing the terminals. It starts from
terminals and ends upon the start symbol. It uses the rightmost derivation in reverse
order.
Bottom-up parser is classified into two types:
LR parser: This is a bottom-up parser that generates the parse tree for the given string
by using unambiguous grammar. It follows the reverse of the rightmost derivation. LR
parser is classified into four types:
 LR(0)
 SLR(1)
 LALR(1)
 CLR(1)
Operator precedence parser: The Operator Precedence Parser generates the parse tree from
the given grammar and string, but it has the condition that two consecutive non-terminals or an
epsilon will never appear on the right-hand side of any production. The operator
precedence parsing technique is applied to operator grammars.
What is Operator Grammar?
An operator precedence grammar is a context-free grammar that has the property that
no production has either:
 An ε (the empty string) on its right-hand side
 Two adjacent non-terminals on its right-hand side.
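For illustration, E → E + E | E * E | ( E ) | id is an operator grammar, whereas a grammar containing E → E N E with N → + | * is not one, because the right-hand side E N E places two non-terminals next to each other.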
Conclusion
The parser is one of the important phases in compiler design; it takes a stream of tokens as
input and converts it into the intermediate representation. If the programming code does
not follow the rules, then the compiler will generate an error message and the
compilation process will be stopped. Parsers are classified into two types, namely the Top-down
Parser and the Bottom-up Parser. Based on the requirements and the goals of the
programming language, the appropriate parsing technique is used.

Syntax analyzers follow production rules defined by means of context-free grammar. The
way the production rules are implemented (derivation) divides parsing into two types : top-
down parsing and bottom-up parsing.

Top-down Parsing

When the parser starts constructing the parse tree from the start symbol and then tries to
transform the start symbol to the input, it is called top-down parsing.
 Recursive descent parsing : It is a common form of top-down parsing. It is called
recursive as it uses recursive procedures to process the input. Recursive descent parsing
suffers from backtracking.
 Backtracking : It means, if one derivation of a production fails, the syntax analyzer restarts
the process using different rules of same production. This technique may process the input
string more than once to determine the right production.

Bottom-up Parsing

As the name suggests, bottom-up parsing starts with the input symbols and tries to
construct the parse tree up to the start symbol.

Example
Input string : a + b * c

Production rules:

S → E
E → E + T
E → E * T
E → T
T → id

Let us start bottom-up parsing

a+b*c

Read the input and check if any production matches with the input:

a+b*c
T+b*c
E+b*c
E+T*c
E*c
E*T
E
S
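
Each step above reduces one handle: a, b and c are each reduced by T → id; T is reduced to E by E → T; E + T collapses by E → E + T; E * T collapses by E → E * T; and finally S → E accepts the string.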
What is Parsing in Compiler Design?
The process of transforming the data from one format to another is called Parsing. This process can
be accomplished by the parser. The parser is a component of the translator that helps to organise
linear text structure following the set of defined rules which is known as grammar.

Types of Parsing:

There are two types of Parsing:

 The Top-down Parsing


 The Bottom-up Parsing

 Top-down Parsing: When the parser generates the parse tree by top-down expansion, tracing out the left-
most derivation of the input, it is called top-down parsing. Top-down parsing initiates with the start symbol and
ends on the terminals. Such parsing is also known as predictive parsing.


o

 Recursive Descent Parsing: Recursive descent parsing is a type of top-down parsing
technique. This technique follows the process for every terminal and non-terminal entity. It
reads the input from left to right and constructs the parse tree from the top down. As the
technique works recursively, it is called recursive descent parsing.
 Back-tracking: The parsing technique that starts from the initial pointer, the root node. If the
derivation fails, then it restarts the process with different rules.

 Bottom-up Parsing: The bottom-up parsing works just the reverse of the top-down parsing. It first traces
the rightmost derivation of the input until it reaches the start symbol.


o

 Shift-Reduce Parsing: Shift-reduce parsing works in two steps: the Shift step and the Reduce step.

 Shift step: The shift step indicates the increment of the input pointer to the next input
symbol; the shifted symbol is pushed onto the stack.
 Reduce Step: When the parser has a complete grammar rule's right-hand side on the stack,
it replaces it with the corresponding LHS non-terminal.
 LR Parsing: The LR parser is one of the most efficient syntax analysis techniques, as it works
with context-free grammar. In LR parsing, L stands for left-to-right scanning of the input, and R stands
for constructing a rightmost derivation in reverse.

Why is parsing useful in compiler designing?


In the world of software, every entity has its own criteria for the data to be processed. So parsing
is the process that transforms the data in such a way that it can be understood by a specific piece of
software.

The Technologies Use Parsers:

 The programming languages like Java.


 The database languages like SQL.
 The Scripting languages.
 The protocols like HTTP.
 The XML and HTML.

Practice Problem Related To Parsing in Compiler Design


Q. Which one of the following kinds of derivation is used by LR parsers?

 (A) Leftmost
 (B) Leftmost in reverse
 (C) Rightmost
 (D) Rightmost in reverse

Q. Which of the following statements about parser is/are CORRECT?

I. Canonical LR is more powerful than SLR

II. SLR is more powerful than LALR.

III. SLR is more powerful than Canonical LR.

 (A) I only
 (B) II only
 (C) III only
 (D) II and III only

Frequently Asked Questions on Parsing in Compiler Design


Q1

What is LR Parsing?
The LR parser is one of the most efficient syntax analysis techniques, as it works with context-free
grammar. In LR parsing, L stands for left-to-right scanning of the input, and R stands for constructing a rightmost derivation in reverse.
Q2

What is parsing and its types?

A parser is a component of the compiler that is used to break the data coming from the lexical analysis
phase into smaller elements.
Q3

What is parsing in programming?

To parse, in computer science, is to separate a string of commands (usually a program)
into more easily processed components, which are analyzed for correct syntax and then attached to
tags that define each component.

In compiler design, parsers are responsible for analyzing the syntax of the source code to ensure it follows the grammar
of the programming language. There are two main types of parsers based on the direction of parsing:

1. Top-Down Parsers

These start parsing from the start symbol and try to derive the input string by applying production rules.

Common types:

 Recursive Descent Parser


o Hand-written, recursive functions for each non-terminal.
o Simple but may not handle left recursion.
 Predictive Parser
o A type of recursive descent parser without backtracking.
o Uses lookahead (typically 1 token) and parsing table.
o Grammar must be LL(1).

2. Bottom-Up Parsers

These start from the input string and try to reduce it to the start symbol using production rules.

Common types:

 Shift-Reduce Parser
o Shifts input symbols onto a stack and reduces them to grammar rules.
 Operator Precedence Parser
o Handles expressions with operators (like +, *, etc.) based on precedence and associativity.
 LR Parser (Left-to-right, Rightmost derivation in reverse)
o Most powerful type of bottom-up parser.
o Several variants:
 SLR (Simple LR)
 LALR (Look-Ahead LR)
 Canonical LR

Comparison at a Glance:

Parser Type Parsing Direction Method Grammar Type Power


Recursive Descent Top-down Recursive funcs LL (non-left rec.) Low-Medium
Predictive (LL(1)) Top-down Table-driven LL(1) Medium
SLR Bottom-up Table-driven LR(0) Medium
LALR Bottom-up Table-driven LR(1) simplified High
Canonical LR Bottom-up Table-driven Full LR(1) Very High

Top-down parsing is a strategy used in syntax analysis (parsing) of programming languages and formal
grammars. It starts from the start symbol and works down the grammar rules, attempting to produce the input
string. It's like trying to "guess" the structure of the input from the top of the grammar.

🧠 Key Concepts in Top-Down Parsing

1. Start at the Start Symbol


You begin parsing from the root (start symbol) of the grammar.
2. Apply Grammar Rules
You apply production rules to expand non-terminals.
3. Match Input
Try to match the terminal symbols with the input string from left to right.
4. Backtracking (optional)
In basic top-down parsers, if a rule doesn't work, it backtracks and tries a different one.

🔄 Types of Top-Down Parsers

Recursive Descent: Uses a set of recursive functions; straightforward to implement but may involve backtracking.
Predictive Parser: A type of recursive descent parser without backtracking; uses lookahead (usually one token) to decide which rule to apply.
LL(1) Parser: A common predictive parser that reads Left-to-right, produces a Leftmost derivation, and uses 1 token of lookahead.
✅ Example

Given the grammar:

E → T + E | T
T → int

Input: int + int

Top-down parsing would look like:

1. Start at E
2. Try rule E → T + E
3. T → int, matches int
4. Match +
5. E → T, T → int, matches the second int

✔️Success!
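
To make this concrete, here is a minimal recursive-descent sketch in C for exactly this grammar (a sketch only: the token stream is faked as a string in which 'i' stands for the int token):

/* Recursive-descent parser for: E -> T + E | T,  T -> int */
#include <stdio.h>
#include <stdlib.h>

static const char *input = "i+i";   /* stands for: int + int */
static char lookahead;

static void next(void) { lookahead = *input ? *input++ : '$'; }

static void fail(const char *msg) { printf("syntax error: %s\n", msg); exit(1); }

static void parseT(void) {
    if (lookahead == 'i') next();   /* T -> int */
    else fail("expected int");
}

static void parseE(void) {
    parseT();                       /* E -> T ... */
    if (lookahead == '+') {         /* ... + E */
        next();
        parseE();
    }
}

int main(void) {
    next();                         /* load the first token */
    parseE();
    if (lookahead == '$') printf("accepted\n");
    else fail("trailing input");
    return 0;
}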

❌ Problems in Top-Down Parsing

 Left Recursion: Causes infinite loops in top-down parsers.


o For example, E → E + T is left-recursive and must be eliminated.
 Ambiguity: If a grammar has multiple valid parse trees, top-down parsers may struggle.

Top-down parsing is intuitive and commonly taught first in compiler design courses. Here is a
breakdown of its pros and cons:
✅ Pros of Top-Down Parsing

Simple to Implement: Especially recursive descent parsers; they map naturally to the grammar rules using recursive functions.
Readable & Intuitive: Easy to understand and follow the logic, especially for smaller or simpler grammars.
No Need for an Explicit Stack (in Recursive Descent): Uses the program's call stack, making it simpler than some bottom-up approaches.
Leftmost Derivation: Produces leftmost derivations, which are helpful in certain theoretical and practical contexts.
Easier Error Detection: Since parsing is done left-to-right and top-down, it is easier to detect errors early.

❌ Cons of Top-Down Parsing

Can't Handle Left Recursion: Left-recursive rules (e.g., E → E + T) cause infinite recursion. These must be rewritten.
Backtracking Can Be Inefficient: Basic top-down parsers may try many rules before finding the correct one. This can be slow.
Limited Grammar Support: Not all context-free grammars are suitable; only those that are LL(k) (often LL(1)).
Lookahead Limits: Most top-down parsers use a single-token lookahead, which can't distinguish between complex cases.
More Manual Work for Large Grammars: Predictive parsers may require FIRST/FOLLOW set calculations and grammar rewriting to fit the LL(1) format.

🔍 Summary

Feature Top-Down Parsing


Parsing Direction Left to right
Derivation Leftmost
Suitable Grammar LL(k) (often LL(1))
Handles Left Recursion? ❌ No
Backtracking? ✅ Sometimes (unless predictive)
Implementation ✅ Simple (especially recursive descent)


📘 Top-Down Parsing – FAQs

1. What is top-down parsing?

Answer:
Top-down parsing is a parsing strategy that starts from the start symbol and tries to rewrite it into the input
string by applying grammar rules. It builds the parse tree from the top (start symbol) to the leaves
(terminals).

2. What types of parsers use top-down parsing?

Answer:

 Recursive Descent Parser (can use backtracking)


 Predictive Parser (no backtracking, uses lookahead)
 LL(1) Parser (a type of predictive parser with 1-token lookahead)

3. What is left recursion and why is it a problem?

Answer:
Left recursion is when a non-terminal calls itself on the left side of a production, like E → E + T.
Top-down parsers can go into infinite recursion if left recursion isn't eliminated.

4. How do you eliminate left recursion?

Answer:
By rewriting the grammar.
Example:

Original: A → Aα | β
Fixed: A → βA'
A' → αA' | ε

5. What is lookahead in top-down parsing?

Answer:
Lookahead is how many tokens the parser checks in advance to decide which rule to apply.
In LL(1) parsing, it only uses 1 token of lookahead.

6. What does LL(1) stand for?

Answer:

 L: Left-to-right scan of input


 L: Leftmost derivation
 1: One-token lookahead

7. What are FIRST and FOLLOW sets used for?

Answer:
They help build parsing tables for predictive parsers:

 FIRST: What terminals can begin a derivation from a non-terminal.


 FOLLOW: What terminals can appear immediately after a non-terminal.

8. Can top-down parsers handle all context-free grammars?

Answer:
No. Only LL(k) grammars (often just LL(1)) are suitable for top-down parsing.

9. What are the advantages of top-down parsing?

Answer:

 Simple to implement (especially recursive descent)


 Intuitive and easy to debug
 Good for small grammars or educational use
10. What are the disadvantages?

Answer:

 Can’t handle left recursion


 Limited to LL grammars
 Backtracking (if used) can be inefficient


📘 Top-Down Parsing MCQs

1. Top-down parsing starts from:

A) Leaf nodes
B) Bottom of the tree
C) Start symbol of the grammar
D) Input string

✅ Answer: C) Start symbol of the grammar

2. Which of the following is NOT a type of top-down parser?

A) Recursive descent parser


B) Predictive parser
C) LL parser
D) LR parser

✅ Answer: D) LR parser
(LR parsers are bottom-up)
3. What kind of derivation does a top-down parser produce?

A) Rightmost derivation
B) Leftmost derivation
C) Bottom-up derivation
D) None

✅ Answer: B) Leftmost derivation

4. What is a major limitation of top-down parsing?

A) It is hard to implement
B) Cannot handle right recursion
C) Cannot handle left recursion
D) Requires full input in memory

✅ Answer: C) Cannot handle left recursion

5. In LL(1) parsing, the '1' represents:

A) One line of code


B) One grammar rule
C) One lookahead symbol
D) One level of recursion

✅ Answer: C) One lookahead symbol

6. Which of the following is used to remove left recursion?

A) First and Follow sets


B) Left factoring
C) Grammar rewriting
D) Backtracking

✅ Answer: C) Grammar rewriting

7. Predictive parsing requires the grammar to be:


A) Left-recursive
B) Ambiguous
C) LL(1)
D) Non-deterministic

✅ Answer: C) LL(1)

8. Recursive descent parsing can involve:

A) Only leftmost derivation


B) No backtracking
C) Backtracking
D) Stack-based parsing

✅ Answer: C) Backtracking
(Unless it's predictive)

9. Which data structure is implicitly used in recursive descent parsing?

A) Stack
B) Queue
C) Tree
D) Hash Table

✅ Answer: A) Stack
(via function calls and call stack)

10. Top-down parsing reads the input:

A) Right to left
B) Left to right
C) Bottom-up
D) Randomly

✅ Answer: B) Left to right

While top-down parsing is a theoretical concept from compiler design, it does have
practical, real-world applications, especially in systems that analyze, interpret, or validate structured
input.

🚀 Real-Time Applications of Top-Down Parsing

1. 🔧 Compilers & Interpreters

 Application: Parsing source code (e.g., C, Java, Python) into a syntax tree.
 How it uses top-down parsing:
o Recursive descent or predictive parsers process code by walking through grammar rules.
o Often used in small or educational compilers due to simplicity.
 ✅ Languages with LL grammars (e.g., Pascal, early Java subsets) can be parsed this way.

2. 📝 Expression Evaluators / Calculators

 Application: Evaluating mathematical expressions (like 3 + 4 * (2 - 1)).


 How:
o Top-down parsers are great for defining and evaluating grammar-based math expressions
recursively.
 ✅ Used in many command-line tools and scripting engines.

3. 🧠 Natural Language Processing (NLP)

 Application: Parsing sentences to understand structure and meaning.


 How:
o Syntax trees in NLP (like in context-free grammars for English) are sometimes built using top-
down strategies.
 ✅ Found in older NLP systems and educational parsers.

4. 🌐 Query Parsers (SQL / Search Engines)

 Application: Parsing SQL queries or search input.


 How:
o SQL queries follow a structured grammar. Top-down parsers can process and validate them.
 ✅ Used in some database front-ends or lightweight SQL engines.

5. 📜 Config File Parsers

 Application: Reading and validating config files (like .ini, .toml, or DSLs).
 How:
o Small domain-specific languages are often parsed using recursive descent (top-down).
 ✅ Great fit for hand-written parsers in embedded systems or tools.

6. 🧪 Educational Tools / Teaching Compilers

 Application: Teaching compiler construction in CS courses.


 Why:
o Easy to implement, debug, and demonstrate syntax errors.
 ✅ Widely used in university projects.

7. 🔍 Syntax Highlighting in Editors

 Application: Identifying code structure for highlighting.


 How:
o Some lightweight editors use top-down parsers (esp. for smaller languages).
 ✅ Can work in real time with fast recursive descent parsing.

🧠 Summary Table:

Application Area: Use of Top-Down Parsing
Compilers: Parsing and syntax checking
Calculators: Evaluating arithmetic expressions
NLP: Grammar-based sentence parsing
SQL/Query Engines: Validating structured query syntax
Config Parsers: Processing structured config files
Education: Teaching language parsing and compiler design
Text Editors/IDEs: Real-time syntax parsing for features like linting

Compiler Design - Top-Down Parser



We have learnt in the last chapter that the top-down parsing technique parses the
input, and starts constructing a parse tree from the root node gradually moving
down to the leaf nodes. The types of top-down parsing are depicted below:

Recursive Descent Parsing

Recursive descent is a top-down parsing technique that constructs the parse tree
from the top and the input is read from left to right. It uses procedures for every
terminal and non-terminal entity. This parsing technique recursively parses the
input to make a parse tree, which may or may not require back-tracking. But the
grammar associated with it (if not left factored) cannot avoid back-tracking. A form
of recursive-descent parsing that does not require any back-tracking is known
as predictive parsing.

This parsing technique is regarded recursive as it uses context-free grammar


which is recursive in nature.

Back-tracking

Top-down parsers start from the root node (start symbol) and match the input
string against the production rules to replace them (if matched). To understand
this, take the following example of CFG:

S → rXd | rZd
X → oa | ea
Z → ai

For the input string read, a top-down parser will behave like this:

It will start with S from the production rules and will match its yield to the left-most
letter of the input, i.e. r. The very first production of S (S → rXd) matches with it. So the
top-down parser advances to the next input letter (i.e. e). The parser tries to
expand non-terminal X and checks its first production from the left (X → oa). It does not
match with the next input symbol. So the top-down parser backtracks to obtain the
next production rule of X, (X → ea).

Now the parser matches all the input letters in an ordered manner. The string is
accepted.

Predictive Parser

Predictive parser is a recursive descent parser, which has the capability to predict
which production is to be used to replace the input string. The predictive parser
does not suffer from backtracking.

To accomplish its tasks, the predictive parser uses a look-ahead pointer, which
points to the next input symbols. To make the parser back-tracking free, the
predictive parser puts some constraints on the grammar and accepts only a class
of grammar known as LL(k) grammar.

Predictive parsing uses a stack and a parsing table to parse the input and generate
a parse tree. Both the stack and the input contains an end symbol $ to denote that
the stack is empty and the input is consumed. The parser refers to the parsing
table to take any decision on the input and stack element combination.

In recursive descent parsing, the parser may have more than one production to
choose from for a single instance of input, whereas in predictive parser, each step
has at most one production to choose. There might be instances where there is no
production matching the input string, making the parsing procedure to fail.

LL Parser

An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar


but with some restrictions to get the simplified version, in order to achieve easy
implementation. LL grammar can be implemented by means of both algorithms
namely, recursive-descent or table-driven.

LL parser is denoted as LL(k). The first L in LL(k) is parsing the input from left to
right, the second L in LL(k) stands for left-most derivation and k itself represents
the number of look aheads. Generally k = 1, so LL(k) may also be written as LL(1).

LL Parsing Algorithm

We may stick to deterministic LL(1) for the parser explanation, as the size of the table
grows exponentially with the value of k. Secondly, if a given grammar is not LL(1), it
often fails to be LL(k) for larger k as well.

Given below is an algorithm for LL(1) Parsing:

Input:
  string ω
  parsing table M for grammar G

Output:
  If ω is in L(G), a left-most derivation of ω;
  error otherwise.

Initial State: $S on stack (with S being the start symbol)
               ω$ in the input buffer

SET ip to point to the first symbol of ω$.

repeat
  let X be the top stack symbol and a the symbol pointed to by ip.

  if X ∈ Vt or X = $
    if X = a
      POP X and advance ip.
    else
      error()
    endif
  else /* X is non-terminal */
    if M[X,a] = X → Y1 Y2 ... Yk
      POP X
      PUSH Yk, Yk-1, ..., Y1 /* Y1 on top */
      Output the production X → Y1 Y2 ... Yk
    else
      error()
    endif
  endif
until X = $ /* empty stack */

A grammar G is LL(1) if, whenever A → α | β are two distinct productions of G:

for no terminal a do both α and β derive strings beginning with a.

at most one of α and β can derive the empty string.

if β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
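
As a concrete illustration of the algorithm above, here is a minimal table-driven LL(1) parser in Python. The grammar (E → T E', E' → + T E' | ε, T → id), its hand-built table, and the token names are assumptions made for this sketch.

# Table-driven LL(1) parsing sketch for the assumed grammar:
#   E -> T E'    E' -> + T E' | epsilon    T -> id
TABLE = {
    ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],                       # E' -> epsilon
    ("T", "id"): ["id"],
}
TERMINALS = {"id", "+", "$"}

def ll1_parse(tokens):
    stack = ["$", "E"]                     # $S: end marker below the start symbol
    tokens = tokens + ["$"]                # ω$ in the input buffer
    ip = 0
    while stack:
        X, a = stack.pop(), tokens[ip]
        if X in TERMINALS:
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            ip += 1                        # POP X and advance ip
        else:
            body = TABLE.get((X, a))
            if body is None:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            print(X, "->", " ".join(body) or "epsilon")   # output the production
            stack.extend(reversed(body))   # PUSH Yk ... Y1, with Y1 on top
    return ip == len(tokens)

print(ll1_parse(["id", "+", "id"]))        # True, printing a leftmost derivation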

Compiler Design - Bottom-Up Parser



Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it
reaches the root node. Here, we start from a sentence and then apply production rules in
reverse manner in order to reach the start symbol. The image given below depicts the
bottom-up parsers available.
Shift-Reduce Parsing

Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as
shift-step and reduce-step.

 Shift step: The shift step refers to the advancement of the input pointer to the next input
symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted
symbol is treated as a single node of the parse tree.
 Reduce step: When the parser finds a complete grammar rule (RHS) on the stack and replaces it with its
(LHS), it is known as the reduce-step. This occurs when the top of the stack contains a handle.
To reduce, a POP function is performed on the stack, which pops off the handle and replaces
it with the LHS non-terminal symbol.

LR Parser

The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of
context-free grammars, which makes it one of the most efficient syntax analysis techniques. LR
parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input
stream; R stands for the construction of right-most derivation in reverse, and k denotes the
number of lookahead symbols to make decisions.

There are three widely used algorithms available for constructing an LR parser:

 SLR(1) Simple LR Parser:


o Works on smallest class of grammar
o Few number of states, hence very small table
o Simple and fast construction
 LR(1) LR Parser:
o Works on complete set of LR(1) Grammar
o Generates large table and large number of states
o Slow construction
 LALR(1) Look-Ahead LR Parser:
o Works on an intermediate size of grammar
o The number of states is the same as in SLR(1)

LR Parsing Algorithm

Here we describe a skeleton algorithm of an LR parser:

token = next_token()

repeat forever
  s = top of stack

  if action[s, token] = shift si then
    PUSH token
    PUSH si
    token = next_token()

  else if action[s, token] = reduce A ::= β then
    POP 2 * |β| symbols
    s = top of stack
    PUSH A
    PUSH goto[s, A]

  else if action[s, token] = accept then
    return

  else
    error()
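
Below is a minimal Python sketch of this skeleton, using a hand-built SLR(1) ACTION/GOTO table for the grammar S → AA, A → aA | b (the same grammar whose table construction is worked through later in this document). Encoding the table as dictionaries, and pushing states only rather than symbol/state pairs, are simplifications made for this sketch.

# Skeleton LR driver with a hand-built SLR(1) table for:
#   0: S' -> S   1: S -> AA   2: A -> aA   3: A -> b
ACTION = {
    (0, "a"): ("s", 3), (0, "b"): ("s", 4),
    (1, "$"): ("acc",),
    (2, "a"): ("s", 3), (2, "b"): ("s", 4),
    (3, "a"): ("s", 3), (3, "b"): ("s", 4),
    (4, "a"): ("r", "A", 1), (4, "b"): ("r", "A", 1), (4, "$"): ("r", "A", 1),
    (5, "$"): ("r", "S", 2),
    (6, "a"): ("r", "A", 2), (6, "b"): ("r", "A", 2), (6, "$"): ("r", "A", 2),
}
GOTO = {(0, "S"): 1, (0, "A"): 2, (2, "A"): 5, (3, "A"): 6}

def lr_parse(tokens):
    stack = [0]                              # stack of states only
    tokens = tokens + ["$"]
    ip = 0
    while True:
        act = ACTION.get((stack[-1], tokens[ip]))
        if act is None:
            raise SyntaxError(f"unexpected {tokens[ip]!r}")
        if act[0] == "s":                    # shift: consume token, push state
            stack.append(act[1])
            ip += 1
        elif act[0] == "r":                  # reduce A -> β: pop |β| states,
            _, lhs, length = act             # then push goto[s, A]
            del stack[len(stack) - length:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                                # accept
            return True

print(lr_parse(["a", "b", "b"]))             # True: abb is derived from S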

LL vs. LR

| LL | LR |
| Does a leftmost derivation. | Does a rightmost derivation in reverse. |
| Starts with the root nonterminal on the stack. | Ends with the root nonterminal on the stack. |
| Ends when the stack is empty. | Starts with an empty stack. |
| Uses the stack for designating what is still to be expected. | Uses the stack for designating what is already seen. |
| Builds the parse tree top-down. | Builds the parse tree bottom-up. |
| Continuously pops a nonterminal off the stack, and pushes the corresponding right hand side. | Tries to recognize a right hand side on the stack, pops it, and pushes the corresponding nonterminal. |
| Expands the non-terminals. | Reduces the non-terminals. |
| Reads the terminals when it pops one off the stack. | Reads the terminals while it pushes them on the stack. |
| Pre-order traversal of the parse tree. | Post-order traversal of the parse tree. |
Bottom-up parsing begins at the terminal symbols (leaf nodes) of a parse tree and ascends towards
the root node, combining smaller units into larger ones until the entire structure is formed. This
method is commonly used in syntax analysis to construct parse trees for input strings.
This blog covers the concept of bottom-up parsing, Shift Reduce Parsing, Operator Precedence
Parsing, and table-driven LR Parsing.
What is Bottom-up parsing?
A bottom-up parser starts from the input string and works backward to the start symbol
of the parse tree, applying grammar productions in reverse to build the input string's parse tree.
 In this technique, parsing starts from the leaf nodes of the parse tree and works up to the start symbol of the parse tree in
a bottom-up manner.
 Bottom-up parsing attempts to construct a parse tree by reducing the input string, tracing out a rightmost
derivation in reverse.
 Bottom-up parsing starts with the input symbols and constructs the parse tree up to the start symbol, using
production rules to reduce the string to the starting symbol.
Example
1. E → T
2. T → T * F
3. T → F
4. F → id
Parse Tree representation of input string "id * id" is as follows:

Reduction Process in Bottom-Up Parsing


The reduction process in bottom-up parsing is a key operation where the parser works from input
tokens towards the start symbol by reducing a sequence of symbols to a non-terminal using
grammar rules. This process is essential in constructing a parse tree for a given input.
How Reduction Works in Bottom-Up Parsing?
1. Shift – Read the next input symbol and push it onto the stack.
2. Reduce – Replace a sequence of symbols on the stack with a non-terminal if it matches a production rule
in the grammar.
3. Repeat – Continue shifting and reducing until the entire input is reduced to the start symbol of the
grammar.
Classification of Bottom-up parsing
Bottom-up parsing has been classified into various parsing. These are as follows:
1. Shift-Reduce Parsing
2. Operator Precedence Parsing
3. Table Driven LR Parsing

Classification of Bottom-up parsing


Shift Reduce Parsing
Shift reduce parsing is a process to reduce a string to the grammar's start symbol. It uses a stack to
hold grammar symbols and an input tape to hold the string. Shift reduce parsing performs two actions:
shift and reduce. This is why it is known as shift-reduce parsing. On a shift action, the current symbol in the input string
is pushed onto the stack. On each reduction, the symbols matching the right side of a
production are replaced by the non-terminal on its left side.
Example
Grammar:
1. S → S+S
2. S → S-S
3. S → (S)
4. S → a
Input string: a1-(a2+a3)
Parsing table

| Stack contents | Input String | Actions |
| $ | a1-(a2+a3)$ | Shift a1 |
| $a1 | -(a2+a3)$ | Reduce by S → a |
| $S | -(a2+a3)$ | Shift - |
| $S- | (a2+a3)$ | Shift ( |
| $S-( | a2+a3)$ | Shift a2 |
| $S-(a2 | +a3)$ | Reduce by S → a |
| $S-(S | +a3)$ | Shift + |
| $S-(S+ | a3)$ | Shift a3 |
| $S-(S+a3 | )$ | Reduce by S → a |
| $S-(S+S | )$ | Shift ) |
| $S-(S+S) | $ | Reduce by S → S+S |
| $S-(S) | $ | Reduce by S → (S) |
| $S-S | $ | Reduce by S → S-S |
| $S | $ | Accept |
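
For illustration, here is a naive Python sketch that reproduces the trace above (with a1, a2, a3 written as the terminal a). It looks for a handle greedily, reducing whenever the top of the stack matches the right side of some production; this simple strategy happens to work for this grammar but is not a general handle-finding method.

# Naive shift-reduce sketch for:  S -> S+S | S-S | (S) | a
RULES = [("S", ["S", "+", "S"]), ("S", ["S", "-", "S"]),
         ("S", ["(", "S", ")"]), ("S", ["a"])]

def shift_reduce(tokens):
    stack, ip = [], 0
    while True:
        for lhs, rhs in RULES:                  # reduce if the stack top matches an RHS
            if len(stack) >= len(rhs) and stack[len(stack) - len(rhs):] == rhs:
                del stack[len(stack) - len(rhs):]
                stack.append(lhs)
                print("reduce by", lhs, "->", " ".join(rhs), "  stack:", stack)
                break
        else:
            if ip < len(tokens):                # otherwise shift the next symbol
                stack.append(tokens[ip])
                ip += 1
                print("shift", stack[-1], "  stack:", stack)
            else:
                return stack == ["S"]           # accept iff only S remains

print(shift_reduce(["a", "-", "(", "a", "+", "a", ")"]))   # True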
There are two main categories in shift-reduce parsing as follows:
1. Operator-Precedence Parsing
2. LR-Parser
Operator Precedence Parsing
Operator precedence grammar is a category in the shift-reduce method of parsing. It is applied to a
class of grammars called operator grammars. An operator precedence grammar must have the following two properties:
 No production has an ε (empty string) on its right-hand side.
 No two non-terminals are adjacent in any right-hand side.
Operator precedence can only be fixed between the terminals of the grammar; non-terminals
are generally ignored.

There are three operator precedence relations:

a ⋗ b means that terminal "a" has higher precedence than terminal "b".
a ⋖ b means that terminal "a" has lower precedence than terminal "b".
a ≐ b means that terminals "a" and "b" have the same precedence.
Precedence table
Following is the precedence table according to which an operator precedence
parser works.
Parsing Action
The sequence in which parsing actions are performed in operator precedence parsing is:

 At first, the $ symbol is added to both ends of the string.
 Then we scan the input string from left to right until the first ⋗ is encountered.
 Scanning is then done towards the left over all equal precedences until the leftmost ⋖ is encountered.
 Everything between the leftmost ⋖ and the rightmost ⋗ is a handle.
 If it is $ on $, then it means parsing is successful.
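
Below is a small Python sketch of this loop, using an assumed precedence table for the terminals id, + and * (the relations and the simplified reduction step are illustrative choices for this sketch, not the document's exact table).

# Operator-precedence parsing sketch for terminals id, +, * and the marker $.
# '<' means shift; '>' means a handle ends at the stack top and we reduce.
PREC = {
    ("id", "+"): ">", ("id", "*"): ">", ("id", "$"): ">",
    ("+", "id"): "<", ("+", "+"): ">", ("+", "*"): "<", ("+", "$"): ">",
    ("*", "id"): "<", ("*", "+"): ">", ("*", "*"): ">", ("*", "$"): ">",
    ("$", "id"): "<", ("$", "+"): "<", ("$", "*"): "<",
}

def top_terminal(stack):
    for sym in reversed(stack):       # precedence ignores non-terminals (N)
        if sym != "N":
            return sym

def op_precedence_parse(tokens):
    stack = ["$"]                     # $ added to both ends of the string
    tokens = tokens + ["$"]
    ip = 0
    while not (stack == ["$", "N"] and tokens[ip] == "$"):   # $ on $: success
        rel = PREC.get((top_terminal(stack), tokens[ip]))
        if rel == "<":
            stack.append(tokens[ip])  # shift the lookahead
            ip += 1
        elif rel == ">":              # pop the handle and reduce it to N
            if stack[-1] == "N":
                stack.pop()
            stack.pop()               # the terminal that ends the handle
            if stack and stack[-1] == "N":
                stack.pop()
            stack.append("N")
        else:
            raise SyntaxError("no precedence relation")
    return True

print(op_precedence_parse(["id", "+", "id", "*", "id"]))     # True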
Table-driven LR Parsing
LR parsing is also a category of bottom-up parsing. It is generally used to parse large
classes of grammars. In LR parsing, "L" stands for scanning the input left-to-right, and
"R" stands for constructing a rightmost derivation in reverse.
"K" stands for the count of lookahead input symbols that are used to make parsing
decisions.
Types of LR Parsers:
1. LR(0)
2. SLR(1)
3. CLR(1)
4. LALR(1)
LR bottom-up parsers have several advantages:
 They are used to parse many programming languages, such as Perl and C.
 They have efficient implementations.
 They quickly detect any syntactic errors present in the input.
LR parsing is divided into four categories:
 LR(0) parsing
 SLR parsing
 CLR parsing
 LALR parsing

Types of LR parser
LR algorithm
The LR algorithm requires input, output, a stack, and parsing tables. In all types of LR parsing, the input,
output, and stack are the same, but the parsing table differs. The input buffer holds the string to be
parsed and indicates the end of the input data; the string is followed by the $ symbol. A stack is used to
hold the sequence of grammar symbols, with the symbol $ at the bottom of the stack.
A parsing table can be defined as a two-dimensional array. It usually contains two parts: the action
part and the goto part.
LR (1) Parsing
The various steps involved in LR(1) Parsing are as follows:
 At first, a context-free grammar is written for the given input.
 The ambiguity of the grammar is checked.
 An augment production is added to the given grammar.
 A canonical collection of LR(1) items is created.
 The DFA over these item sets (the goto graph) is drawn.
 An LR(1) parsing table is created.
Augment Grammar
An augmented grammar is generated if we add one more production to the given grammar G. It helps
the parser identify when to stop parsing and announce the acceptance of the input.
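
A tiny Python sketch of this step (the dict-of-lists grammar encoding is an assumption shared with the other sketches in this document, not a standard API):

# Augmenting a grammar: add S' -> S, so that reducing by this production
# tells the parser to stop and announce acceptance.
def augment(grammar, start):
    g = dict(grammar)
    g[start + "'"] = [[start]]        # the new start production S' -> S
    return g

print(augment({"S": [["A", "A"]], "A": [["a", "A"], ["b"]]}, "S"))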
LL vs. LR
LL (Left-to-right scan, Leftmost Derivation) and LR (Left-to-right scan, Rightmost Derivation in reverse) are two
families of parsing techniques used in compiler construction and formal language theory. Here's a comparison
between LL and LR parsing:
| Feature | LL Parsing | LR Parsing |
| Type | Top-down parsing | Bottom-up parsing |
| Derivation | Leftmost | Rightmost |
| Strategy | Predictive | Shift-reduce |
| Complexity | Simpler, easier to understand and implement | More complex, requires parsing tables |
| Power | Limited power, handles simpler grammars | More powerful, handles wider range of grammars |
| Lookahead | Often requires more lookahead symbols | Often requires fewer lookahead symbols |
| Error Handling | Less efficient at error recovery | More efficient at error recovery |
| Example | Recursive descent parsing | LR(0), SLR(1), LALR(1), CLR(1) parsing |
Frequently Asked Questions
How does bottom-up parsing work?
Bottom-up parsers work by "shifting" symbols onto a stack till the stack top contains a right-hand side
of a production, and then the stack is reduced by replacing the right-hand side of production with the
left-hand side.
Write a drawback of the LR parser?
A major drawback of an LR parser is that it is too much work to construct an LR parser by hand.
What are the algorithms for constructing an LR parser?
There are mainly three such algorithms:
SLR(1) - Simple LR Parser
LR(1) - LR Parser
LALR(1) - Look-Ahead LR Parser
What is the reduce step in shift-reduce parsing?
When the parser finds a complete grammar rule (RHS) on the stack and replaces it with its LHS, that is the reduce step.
What are the features of a simple LR Parser?
It works on the smallest class of grammars and has a simple and fast construction.
Conclusion
In this article, we have discussed the concepts of bottom-up parsing, shift-reduce Parsing, operator
precedence parsing, and table-driven LR parsing and frequently asked questions.
Recommended Reading:

Difference Between Top Down and Bottom Up Parsing


Below is a table differentiating between top-down and bottom-up parsing.
| Feature | Top-Down Parsing | Bottom-Up Parsing |
| Definition | Begins parsing from the start symbol and works down to the terminals by applying production rules. | Begins with the input symbols (terminals) and works to the start symbol by recognizing and reducing patterns. |
| Approach | Tries to derive the input string from the grammar's start symbol using production rules. | Constructs a parse tree by starting with the input and attempting to reach the grammar's start symbol. |
| Parsing Method | Predictive or Recursive Descent Parsing. | Shift-Reduce Parsing (e.g., LR, SLR, LALR, or CLR parsing). |
| Parse Tree Construction | Constructed from the top (root) down to the leaves. | Constructed from the leaves (input symbols) up to the root. |
| Ease of Implementation | Generally easier to implement, especially for simple grammars. Recursive descent parsers can be written manually. | More complex to implement. Typically relies on parser generators like Yacc or Bison. |
| Efficiency | Can be less efficient due to backtracking. However, LL parsers (a subset of top-down parsers) are efficient for LL grammars. | Usually more efficient and can handle a broader range of grammars, including all LR grammars. |
| Handling Left Recursion | Cannot directly handle left-recursive grammars without transformation or augmentation. | Can naturally handle left-recursive grammars without needing to modify the grammar. |
| Lookahead | May use lookahead tokens to decide which production rule to apply next. LL parsers, for instance, use 1 or more lookahead tokens. | Often uses lookahead tokens to decide when to shift and when to reduce. The number of lookahead tokens varies by the type of parser (LR(0), SLR(1), LALR(1), LR(1)). |
| Ambiguity Resolution | Can struggle with ambiguous grammars. Modifications to the parser or grammar may be required to resolve ambiguities. | Better equipped to handle ambiguities through its pattern recognition and reduction steps, though conflicts may still require resolution. |
Bottom-Up Parsing in Compiler Design

Bottom-Up parsing is applied in the syntax analysis phase of the compiler. Bottom-up
parsing parses the stream of tokens from the lexical analyzer. And after parsing the input
string it generates a parse tree.
The bottom-up parser builds a parse tree from the leaf nodes and proceeds towards the
root node of the tree. In this section, we will be discussing bottom-up parsing along with its
types.
What is Bottom-Up Parsing?
Bottom-up parsing parses the input string from the lexical analyzer, and if there is no
error in the input string, it constructs a parse tree as output. Bottom-up parsing starts
building the parse tree from the leaf nodes, i.e., from the bottom of the tree,
and gradually proceeds upwards to the root of the parse tree.
Bottom-up parsers can be created for the large class of LR grammars. Bottom-up
parsing corresponds to the process of reducing the string to the starting symbol of the
grammar.
Step of Reducing a String:

1. A specific substring from the input string is identified.


2. A non-terminal in a grammar whose production body matches the substring is identified.
3. The substring in the input string is replaced by the non-terminal identified in step 2.

The main problem of bottom-up parsing is to decide:

 When to reduce which substring from the input string.


 Which production from the grammar must be applied.

The reduction process is just the reverse of the derivation that we have seen in top-down
parsing. Thus, bottom-up parsing derives the input string in reverse.
Shift Reduce Parsing
Shift reduce parsing is the most general form of bottom-up parsing. Here we have an input
buffer that holds the input string that is scanned by the parser from left to right. There is
also a stack that is used to hold the grammar symbols during parsing.

The bottom of the stack and the right end of the input buffer are marked with $. Initially,
before the parsing starts:

 The input buffer holds the input string provided by the lexical analyzer.
 The stack is empty.

As the parser parses the string from left to right, it shifts zero or more input symbols
onto the stack.
The parser continues to shift input symbols onto the stack until the top of the stack holds a
substring that matches the production body of a non-terminal in the grammar.
Then the substring is replaced, or reduced, by the appropriate non-terminal.
The parser continues shift-reducing until either of the following conditions occurs:

 It identifies an error
 The stack contains the start symbol of the grammar and the input buffer becomes empty.

To perform the shift-reduce parsing we have to perform the following four functions.
Four Functions of Shift Reduce Parsing:
1. Shift: This action shifts the next input symbol present on the input buffer onto the top of
the stack.
2. Reduce: This action is performed when the top of the stack holds the right end of a substring
whose left end is also within the stack. The reduce action replaces the entire substring with the
appropriate non-terminal; the production body of this non-terminal matches the replaced substring.
3. Accept: When at the end of parsing the input buffer becomes empty and the stack is
left with only the start symbol of the grammar, the parser announces the successful
completion of parsing.
4. Error: This action identifies an error and performs an error recovery routine.

Let us take an example of the shift-reduce parser. Consider that we have a string id * id +
id and the grammar for the input string is:
E → E + T | T
T → T * F | F
F → ( E ) | id

Note: In shift-reduce parsing, a handle always appears on the top of the stack. The
handle is a substring that matches the body of a production in the grammar. The handle
never appears inside the stack.
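
For reference, one possible handle-pruning trace for id * id + id with this grammar is shown below (each reduce replaces the handle on top of the stack, following the note above):

| Stack | Input | Action |
| $ | id * id + id $ | Shift id |
| $ id | * id + id $ | Reduce by F → id |
| $ F | * id + id $ | Reduce by T → F |
| $ T | * id + id $ | Shift * |
| $ T * | id + id $ | Shift id |
| $ T * id | + id $ | Reduce by F → id |
| $ T * F | + id $ | Reduce by T → T * F |
| $ T | + id $ | Reduce by E → T |
| $ E | + id $ | Shift + |
| $ E + | id $ | Shift id |
| $ E + id | $ | Reduce by F → id |
| $ E + F | $ | Reduce by T → F |
| $ E + T | $ | Reduce by E → E + T |
| $ E | $ | Accept |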
Bottom-Up Parsing in LR Parser
The bottom-up parsing is based on the concept of LR(K) parsing.

1. L stands for left-to-right parsing of the input string.

2. R stands for construction of the rightmost derivation in reverse.

3. K stands for the number of input symbols that the parser will look ahead for making a
parsing decision.

The LR parser has the following structure:


Similar to predictive parsing, the end of the input buffer and the bottom of the stack are marked with $.

 The input buffer has the input string that has to be parsed.
 The stack maintains the sequence of grammar symbols while parsing the input string.
 The parsing table is a two-dimensional array that has two entries ‘Go To’ and ‘Action’.

The LR parsing can be classified as:

1. LR(0) Parsing

2. Simple LR Parsing (SLR)

3. Canonical LR Parsing (CLR)

4. Look-Ahead LR Parsing (LALR)

All kinds of LR parsers are the same; they differ only in the construction of their parsing
table. We will discuss each of them briefly in future content.
So, this is all about the Bottom-Up parsing. We have discussed the most general form of
bottom-up parsing i.e., shift-reduce bottom-up parsing along with an example. We have
discussed the structure of the LR parser and its types.

What is Bottom-Up Parsing?

In bottom-up parsing, the parser:

 Starts with the input string


 Reduces the string to the start symbol by inverting productions of the grammar
 Uses a method called "shift-reduce parsing"

It's the opposite of top-down parsing, which begins from the start symbol and tries to derive the string.

🧱 Key Concepts

1. Shift-Reduce Parsing

This is the core mechanism of bottom-up parsing. The parser uses a stack to hold grammar symbols.

 Shift: Push the next input token onto the stack


 Reduce: Replace the symbols on the stack that match the RHS of a grammar rule with the LHS (i.e., apply a
production in reverse)
 Accept: If the stack contains only the start symbol and input is consumed
 Error: If no rule applies and input can’t be parsed

2. Handle

A handle is a substring that matches the right-hand side of a production rule and can be reduced to the left-hand side.

🔧 Types of Bottom-Up Parsers


| Parser Type | Description |
| LR(0) | Basic type, no lookahead |
| SLR(1) | Simple LR, uses FOLLOW sets |
| LALR(1) | Look-Ahead LR, more efficient, used in tools like YACC |
| Canonical LR(1) | Most powerful, uses full lookahead |

✅ Advantages

 Works with a larger class of grammars than top-down parsers


 Can parse left-recursive grammars
 Efficient for programming languages

❌ Disadvantages

 More complex to implement


 Error reporting is less intuitive compared to top-down parsing

⚙️Example: Shift-Reduce Steps

Given Grammar:

E → E + T | T
T → id

Input: id + id

Parsing Steps:

| Stack | Input | Action |
| [] | id + id | Shift |
| [id] | + id | Reduce T → id |
| [T] | + id | Reduce E → T |
| [E] | + id | Shift |
| [E +] | id | Shift |
| [E + id] | (empty) | Reduce T → id |
| [E + T] | (empty) | Reduce E → E + T |
| [E] | (empty) | Accept |

Pros of Bottom-Up Parsing


| Advantage | Description |
| Handles Complex Grammars | Can parse a larger class of grammars, including left-recursive grammars, which top-down parsers struggle with. |
| Efficient for Programming Languages | Most real-world programming language grammars are more naturally parsed bottom-up. Tools like YACC use this approach. |
| Systematic and Table-Driven | Many bottom-up parsers (e.g., LR parsers) use parsing tables for decisions, making them more structured and automatable. |
| Deterministic Parsing | With proper tables, LR parsers never backtrack, making parsing predictable and efficient. |
| Accurate Syntax Error Detection | Because of the rich state information, bottom-up parsers can often pinpoint syntax errors more precisely during parsing. |

❌ Cons of Bottom-Up Parsing

| Disadvantage | Description |
| Complex Implementation | Constructing LR parsing tables (especially Canonical LR) is difficult and time-consuming. |
| Harder to Understand | Conceptually more challenging for beginners compared to top-down parsers. |
| Poor Error Recovery | Although error detection is good, error recovery (i.e., continuing after an error) is harder to implement cleanly. |
| Less Intuitive Parse Trees | Parse trees are constructed in reverse order, making them less intuitive compared to top-down parsing. |
| Large Table Size | Some types (e.g., Canonical LR(1)) require huge parsing tables, which may be inefficient in terms of memory. |

Real-Time Applications of Bottom-Up Parsing

1. Compiler Construction Tools (Parser Generators)

Many well-known parser generators use bottom-up parsing:

| Tool | Type of Bottom-Up Parser | Used For |
| YACC (Yet Another Compiler Compiler) | LALR(1) | Widely used to generate parsers for programming languages like C, SQL |
| Bison (GNU version of YACC) | LALR(1) | C/C++ compiler components, interpreters |
| Menhir (for OCaml) | LR(1) | Compiling OCaml and other functional languages |
| ANTLR (older versions) | SLR | Earlier versions used bottom-up parsing (later switched to top-down LL(*) parsers) |

2. Programming Language Compilers

Many real compilers internally use bottom-up parsing techniques:

 GCC (GNU Compiler Collection): historically used parsers generated by Bison, a bottom-up LALR parser (more recent versions use hand-written parsers).
 Rust Compiler: Initially used Bison; though it's evolving, it started with LALR parsing.
 SQL Parsers: Many SQL engines use bottom-up parsers for complex query grammar parsing.

3. Scripting Languages

Languages like Lua, Bash, and parts of Perl use parser generators that often build on LALR-based techniques.

4. IDE Syntax Checking & Highlighting

In modern Integrated Development Environments (IDEs), real-time syntax checking or auto-completion engines often
use partial bottom-up parsing to:

 Validate user code on the fly


 Highlight syntax errors
 Suggest corrections or completions

Some use lightweight LALR-based parsing engines for this purpose due to their speed and precision.

5. Database Query Parsing

Databases like MySQL, PostgreSQL, and SQLite use bottom-up parsers to interpret and compile complex SQL queries
efficiently.

🎯 Why It's Used in Real-Time Systems

 ✅ Fast and deterministic — suitable for compilers and tools that require real-time parsing
 ✅ Powerful enough to handle complex, ambiguous, or left-recursive grammar (which is common in real
languages)
 ✅ Can be precompiled into efficient table-driven parsers

FAQs on Bottom-Up Parsing in Compiler Design

1. What is bottom-up parsing?

Bottom-up parsing is a syntax analysis technique where parsing starts from the input tokens (leaves) and proceeds
upward to the start symbol (root) by applying grammar rules in reverse (reductions).

2. How is bottom-up parsing different from top-down parsing?

Bottom-Up Parsing Top-Down Parsing


Builds parse tree from leaves to root Builds parse tree from root to leaves
Uses shift-reduce mechanism Uses recursive descent or predictive parsing
Can handle left recursion Cannot handle left recursion
More complex to implement Easier to implement

3. What is a shift-reduce parser?

A shift-reduce parser is a bottom-up parser that uses a stack and performs two main operations:

 Shift: Push the next input token onto the stack


 Reduce: Replace a group of symbols on the stack with a non-terminal based on a grammar rule

4. What is a handle in bottom-up parsing?


A handle is a substring that matches the right-hand side (RHS) of a grammar production and is the next reducible part
of the input during parsing.

5. What types of bottom-up parsers exist?

 LR(0): Basic form without lookahead


 SLR(1): Simple LR using FOLLOW sets
 LALR(1): Look-Ahead LR, optimized and widely used
 Canonical LR(1): Most powerful but large table size

6. What are parsing tables in bottom-up parsing?

Parsing tables guide the parser on whether to shift, reduce, accept, or throw an error based on the current state and
next input symbol. These tables are generated from the grammar.

7. Why is bottom-up parsing more powerful than top-down parsing?

Bottom-up parsers can handle a wider range of context-free grammars, including left-recursive and more ambiguous
grammars, which top-down parsers often cannot.

8. What are common tools that use bottom-up parsing?

 YACC (Yet Another Compiler Compiler)


 Bison
 Menhir
 GCC (C Compiler)

9. What are the advantages of bottom-up parsing?

 Handles complex grammars


 Can parse real programming languages efficiently
 Good for compiler-level syntax checking
 Deterministic parsing with no backtracking

10. What are the disadvantages?

 Complex to implement and understand


 Large parsing tables (especially in LR(1))
 Less intuitive for error handling
 Parse tree construction is in reverse

1. Which of the following is a bottom-up parser?

A. Recursive Descent Parser


B. Predictive Parser
C. Shift-Reduce Parser
D. LL(1) Parser

✅ Answer: C. Shift-Reduce Parser

2. Bottom-up parsing is also known as:

A. Top-down derivation
B. Leftmost derivation
C. Rightmost derivation in reverse
D. Rightmost derivation

✅ Answer: C. Rightmost derivation in reverse

3. What is a handle in bottom-up parsing?

A. Leftmost non-terminal
B. The next token to be parsed
C. A substring matching the LHS of a rule
D. A substring matching the RHS of a production to be reduced

✅ Answer: D. A substring matching the RHS of a production to be reduced

4. Which of the following parsers can handle left-recursive grammars?

A. LL(1)
B. Recursive Descent
C. SLR(1)
D. Predictive Parser

✅ Answer: C. SLR(1)

5. The action 'shift' in shift-reduce parsing means:


A. Replace a symbol with a rule
B. Move input symbol to the output
C. Move next input symbol to the stack
D. Pop symbol from the stack

✅ Answer: C. Move next input symbol to the stack

6. Which parsing technique uses a parsing table?

A. Top-down recursive parsing


B. Shift-reduce parsing
C. Operator precedence parsing
D. Both B and C

✅ Answer: D. Both B and C

7. Which of the following is NOT a type of LR parser?

A. LR(0)
B. SLR(1)
C. LALR(1)
D. LL(1)

✅ Answer: D. LL(1)

8. In which step does a parser apply grammar rules in reverse?

A. Lexical analysis
B. Reduction
C. Shifting
D. Code generation

✅ Answer: B. Reduction

9. Bottom-up parsers construct the parse tree:

A. From root to leaves


B. In parallel
C. From leaves to root
D. Randomly

✅ Answer: C. From leaves to root


10. Which of the following tools use bottom-up parsing?

A. Lex
B. YACC
C. ANTLR (modern version)
D. Flex

✅ Answer: B. YACC

1. What is bottom-up parsing?

Answer:
Bottom-up parsing is a syntax analysis technique where parsing starts from the input symbols (tokens) and works upward
by reducing substrings to non-terminals until the start symbol is reached.

2. What is a shift-reduce parser?

Answer:
A shift-reduce parser is a bottom-up parser that uses a stack to shift input tokens and reduce them by applying grammar
rules in reverse.

3. What is a handle in bottom-up parsing?

Answer:
A handle is a substring that matches the right-hand side of a production rule and can be reduced to a non-terminal during
parsing.

4. Name four types of bottom-up parsers.

Answer:

 LR(0)
 SLR(1)
 LALR(1)
 Canonical LR(1)

5. Which type of derivation does bottom-up parsing follow?

Answer:
It follows rightmost derivation in reverse.

6. Can bottom-up parsing handle left-recursive grammars?

Answer:
Yes, bottom-up parsers can handle left-recursive grammars.

7. What are the main actions in shift-reduce parsing?

Answer:

 Shift
 Reduce
 Accept
 Error

8. Mention one advantage of bottom-up parsing.

Answer:
It can parse a wider range of grammars, including left-recursive grammars.

9. What is the purpose of parsing tables in LR parsers?

Answer:
Parsing tables guide the parser on when to shift, reduce, accept, or report an error based on the current state and input.

10. Name one real-world tool that uses bottom-up parsing.

Answer:
YACC (Yet Another Compiler Compiler) uses bottom-up (LALR) parsing.

What is LR Parsing?

LR parsing stands for:

 L → Left-to-right scan of the input


 R → Producing a Rightmost derivation in reverse

It is a bottom-up parsing method used for deterministic context-free languages and is very efficient for programming
languages.

📘 Types of LR Parsers

There are four main types of LR parsers:

| Type | Full Name | Description |
| LR(0) | LR parser with no lookahead | No lookahead symbol. Uses only the current state. Simple, but supports fewer grammars. |
| SLR(1) | Simple LR with 1 Lookahead | Adds 1-token lookahead using FOLLOW sets. Easier to implement than full LR(1). |
| LALR(1) | Look-Ahead LR parser | Combines states of LR(1) to reduce table size; widely used in tools like YACC. |
| Canonical LR(1) | Full LR parser | Most powerful and precise. Uses 1-token lookahead and complete context, but has large parsing tables. |

📊 Comparison Table

| Feature | LR(0) | SLR(1) | LALR(1) | Canonical LR(1) |
| Lookahead Used | None | 1 token | 1 token | 1 token |
| Grammar Coverage | Low | Medium | High | Highest |
| Table Size | Small | Medium | Smaller than LR(1) | Very Large |
| Practical Use | Rare | Educational | Very common (e.g., YACC, Bison) | Less common due to size |

🧠 Quick Use Case Examples:

 LR(0): Mostly academic use, good for understanding basics.


 SLR(1): Slightly more powerful, used in educational tools.
 LALR(1): Real-world tools like YACC, Bison use this due to good balance.
 Canonical LR(1): Used when maximum grammar precision is needed despite large table size

SLR Parser
SLR stands for "Simple LR Parser". It is easy and cost-effective to implement. The SLR
method builds the parsing action and goto functions from the deterministic finite automaton that recognizes
viable prefixes. It does not produce uniquely defined parsing action tables for all grammars,
but it does succeed on several grammars for programming languages. Given a grammar G, we
augment G to make G', and from G' we construct C, the canonical collection of sets of
items for G'. We construct ACTION, the parsing action function, and GOTO, the goto
function, from C using the simple LR parsing table construction technique sketched below. This
requires us to know FOLLOW(A) for each non-terminal A of the grammar.
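
A minimal Python sketch of this construction, assuming precomputed item sets, state transitions and FOLLOW sets (the encodings and names here are illustrative, matching the other sketches in this document):

# SLR(1) table construction from the canonical LR(0) collection.
# states: list of item sets; an item is (head, body_tuple, dot_position).
# transitions: dict mapping (state_index, symbol) -> state_index.
# follow: dict mapping non-terminal -> set of terminals.
def build_slr_table(states, transitions, grammar, follow, start="S'"):
    action, goto_table = {}, {}
    for i, state in enumerate(states):
        for head, body, dot in state:
            if dot < len(body):                    # dot before a terminal: shift
                sym = body[dot]
                if sym not in grammar and (i, sym) in transitions:
                    action[(i, sym)] = ("shift", transitions[(i, sym)])
            elif head == start:                    # S' -> S. : accept on $
                action[(i, "$")] = ("accept",)
            else:                                  # reduce on every FOLLOW(head)
                for t in follow[head]:
                    action[(i, t)] = ("reduce", head, body)
        for nt in grammar:                         # goto entries for non-terminals
            if (i, nt) in transitions:
                goto_table[(i, nt)] = transitions[(i, nt)]
    return action, goto_table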

CLR Parser
CLR stands for canonical LR. CLR parsing uses the canonical collection of LR(1) items
to construct the CLR(1) parsing table. The CLR(1) parsing table produces more states than
SLR(1) parsing. In CLR(1), reduce entries are placed only in the columns of the
lookahead symbols.


LALR Parser
LALR Parser is the Look-Ahead LR Parser. It is intermediate in power between the SLR and CLR
parsers. It is a compaction of the CLR parser, and hence the tables obtained are smaller
than the CLR parsing table.
For constructing the LALR(1) parsing table, the canonical collection of LR(1) items is used.
In LALR(1) parsing, the LR(1) items that have the same productions but different look-
aheads are grouped to form a single set of items. It is frequently the same as CLR(1)
parsing, except for one difference: the parsing table.
The overall structure of all these LR parsers is the same. They differ in common factors
such as size, the class of context-free grammars they support, and their cost in terms of time
and space.
Let us see the comparison between SLR, CLR, and LALR Parser.
| SLR Parser | LALR Parser | CLR Parser |
| It is very easy and cheap to implement. | It is also easy and cheap to implement. | It is expensive and difficult to implement. |
| SLR Parser is the smallest in size. | LALR and SLR have the same size, as they have fewer states. | CLR Parser is the largest, as the number of states is very large. |
| Error detection is not immediate in SLR. | Error detection is not immediate in LALR. | Error detection can be done immediately in the CLR Parser. |
| SLR fails to produce a parsing table for a certain class of grammars. | It is intermediate in power between SLR and CLR, i.e., SLR ≤ LALR ≤ CLR. | It is very powerful and works on a large class of grammars. |
| It requires less time and space complexity. | It requires more time and space complexity. | It also requires more time and space complexity. |

SLR Parser (with Examples)





LR parsing is an efficient bottom-up syntax analysis technique that can be used to
parse large classes of context-free grammars; it is called LR(k) parsing.
L stands for left-to-right scanning
R stands for rightmost derivation in reverse
k is the number of lookahead input symbols; when k is omitted, k is assumed to be 1.
Advantages of LR parsing
 LR parsers handle context-free grammars. These grammars describe the structure of
programming languages: how statements, expressions, and other language constructs
fit together.
 LR parsers ensure that your code adheres to these rules.
 They are able to detect syntactic errors.
 LR parsing is an efficient, non-backtracking, shift-reduce parsing method.
Types of LR Parsing Methods
 SLR parser
 LALR parser
 Canonical LR parser
SLR Parser
 SLR stands for Simple LR parser
 It is the weakest of the three methods but the easiest to implement
 A grammar for which an SLR parser can be constructed is called an SLR grammar
Steps for constructing the SLR parsing table
1. Writing the augmented grammar
2. Finding the LR(0) collection of items (a sketch of this computation follows this list)
3. Finding FOLLOW of the LHS of each production
4. Defining 2 functions, action[list of terminals] and goto[list of non-terminals], in the
parsing table
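
Before the worked example, here is a minimal Python sketch of the LR(0) closure and goto computations for the grammar used below (the item encoding and helper names are choices made for this sketch):

# LR(0) items for:  S -> AA,  A -> aA | b  (augmented with S' -> S).
# An item is (head, body_tuple, dot_position).
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A", "A")],
    "A": [("a", "A"), ("b",)],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:   # dot before a non-terminal
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], prod, 0)            # add its productions with a
                    if item not in items:                  # leading ' . '
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    moved = {(h, b, d + 1) for (h, b, d) in items if d < len(b) and b[d] == symbol}
    return closure(moved)                                  # shift ' . ' one place right

I0 = closure({("S'", ("S",), 0)})
I3 = goto(I0, "a")    # contains A -> a.A together with A -> .aA and A -> .b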
EXAMPLE – Construct LR parsing table for the given context-free grammar
S–>AA
A–>aA|b
Solution:
STEP 1: Find the augmented grammar
The augmented grammar of the given grammar is:
S' -> .S [0th production]
S -> .AA [1st production]
A -> .aA [2nd production]
A -> .b [3rd production]
STEP 2: Find the LR(0) collection of items
Below is the figure showing the LR(0) collection of items. We will understand everything
one by one.

The terminals of this grammar are {a, b}.
The non-terminals of this grammar are {S, A}.
RULE –
If any non-terminal has ' . ' preceding it, we have to write all its productions and add ' . '
preceding each of its productions.
RULE –
From each state to the next state, the ' . ' shifts one place to the right.
 In the figure, I0 consists of the augmented grammar.
 I0 goes to I1 when ' . ' of the 0th production is shifted towards the right of S (S' -> S.). This
state is the accept state. S is seen by the compiler.
 I0 goes to I2 when ' . ' of the 1st production is shifted towards the right (S -> A.A). A is seen
by the compiler.
 I0 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler.
 I0 goes to I4 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen
by the compiler.
 I2 goes to I5 when ' . ' of the 1st production is shifted towards the right (S -> AA.). A is seen
by the compiler.
 I2 goes to I4 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen by
the compiler.
 I2 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler.
 I3 goes to I4 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen
by the compiler.
 I3 goes to I6 when ' . ' of the 2nd production is shifted towards the right (A -> aA.). A is
seen by the compiler.
 I3 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler.
STEP 3: Find FOLLOW of the LHS of each production
FOLLOW(S) = {$}
FOLLOW(A) = {a, b, $}
To find FOLLOW of non-terminals, please read follow set in syntax analysis.
STEP 4: Defining 2 functions, action[list of terminals] and goto[list of non-terminals], in the
parsing table. Below is the SLR parsing table.
 $ is the end-of-input marker; the accept entry is placed in its column.
 0,1,2,3,4,5,6 denote I0,I1,I2,I3,I4,I5,I6.
 I0 gives A in I2, so 2 is added to the A column and 0 row.
 I0 gives S in I1, so 1 is added to the S column and 0 row.
 Similarly, 5 is written in the A column and 2 row, and 6 is written in the A column and 3 row.
 I0 gives a in I3, so S3 (shift 3) is added to the a column and 0 row.
 I0 gives b in I4, so S4 (shift 4) is added to the b column and 0 row.
 Similarly, S3 (shift 3) is added in the a column and 2,3 rows, and S4 (shift 4) is added in the b
column and 2,3 rows.
 I4 is a reduce state as ' . ' is at the end. I4 corresponds to the 3rd production of the grammar (A -> b).
The LHS of this production is A. FOLLOW(A) = {a, b, $}, so write r3 (reduce 3) in the
a, b, and $ columns of the 4th row.
 I5 is a reduce state as ' . ' is at the end. I5 corresponds to the 1st production of the grammar (S -> AA).
The LHS of this production is S.
FOLLOW(S) = {$}, so write r1 (reduce 1) in the $ column of the 5th row.
 I6 is a reduce state as ' . ' is at the end. I6 corresponds to the 2nd production of the grammar (A -> aA).
The LHS of this production is A.
FOLLOW(A) = {a, b, $}, so write r2 (reduce 2) in the a, b, and $ columns of the 6th row.
APPLICATIONS GALORE:
 Compiler
 Data Validation
 Natural Language Processing(NLP)
 Protocol Parsing
LALR Parser (with Examples)



LALR Parser :
LALR Parser is the look-ahead LR parser. It is a powerful parser which can handle
large classes of grammars. The size of the CLR parsing table is quite large as compared to
other parsing tables; LALR reduces the size of this table. LALR works similarly to CLR. The
only difference is that it combines the similar states of the CLR parsing table into one single
state.
The general syntax becomes [A -> α.B, a]
where A -> α.B is the production and a is a terminal or the right-end marker $.
LR(1) items = LR(0) items + lookahead
How to add lookahead with the production?
CASE 1 –
A -> α.BC, a
Suppose this is the 0th production. Now, since ' . ' precedes B, we have to write B's
productions as well.
B -> .D [1st production]
Suppose this is B's production. The lookahead of this production is given as follows: we look at the
previous production, i.e. the 0th production. Whatever is after B, we find FIRST(of that
value); that is the lookahead of the 1st production. So, here in the 0th production, after B, C is
there. Assume FIRST(C) = d; then the 1st production becomes
B -> .D, d
CASE 2 –
Now if the 0th production was like this,
A -> α.B, a
Here, we can see there's nothing after B. So the lookahead of the 0th production will be the
lookahead of the 1st production, i.e.:
B -> .D, a
CASE 3 –
Assume a production A -> a|b
A -> a, $ [0th production]
A -> b, $ [1st production]
Here, the 1st production is a part of the previous production, so the lookahead will be the
same as that of its previous production.
Steps for constructing the LALR parsing table:
1. Writing the augmented grammar
2. Finding the LR(1) collection of items
3. Defining 2 functions, action[list of terminals] and goto[list of non-terminals], in the
LALR parsing table
EXAMPLE
Construct the CLR parsing table (from which the LALR table will be derived) for the given context-free grammar:
S --> AA
A --> aA | b
Solution:
STEP 1: Find the augmented grammar
The augmented grammar of the given grammar is:
S' --> .S, $ [0th production]
S --> .AA, $ [1st production]
A --> .aA, a|b [2nd production]
A --> .b, a|b [3rd production]
Let's apply the rule of lookahead to the above productions.
 The initial lookahead is always $.
 Now, the 1st production came into existence because of ' . ' before 'S' in the 0th
production. There is nothing after 'S', so the lookahead of the 0th production will be the
lookahead of the 1st production, i.e.: S --> .AA, $
 Now, the 2nd production came into existence because of ' . ' before 'A' in the 1st
production.
After 'A', there's 'A'. So, FIRST(A) is a, b. Therefore, the lookahead of the 2nd
production becomes a|b.
 Now, the 3rd production is a part of the 2nd production, so the lookahead will be the
same.
STEP 2: Find the LR(1) collection of items
Below is the figure showing the LR(1) collection of items. We will understand everything
one by one.
The terminals of this grammar are {a, b}.
The non-terminals of this grammar are {S, A}.
RULES –
1. If any non-terminal has ‘ . ‘ preceding it, we have to write all its production and add ‘ . ‘
preceding each of its production.
2. from each state to the next state, the ‘ . ‘ shifts to one place to the right.
 In the figure, I0 consists of the augmented grammar.
 I0 goes to I1 when ' . ' of the 0th production is shifted towards the right of S (S' -> S.). This
state is the accept state. S is seen by the compiler. Since I1 is a part of the 0th
production, the lookahead is the same, i.e. $.
 I0 goes to I2 when ' . ' of the 1st production is shifted towards the right (S -> A.A). A is seen
by the compiler. Since I2 is a part of the 1st production, the lookahead is the same, i.e. $.
 I0 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is the same, i.e.
a|b.
 I0 goes to I4 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen by
the compiler. Since I4 is a part of the 3rd production, the lookahead is the same, i.e. a|b.
 I2 goes to I5 when ' . ' of the 1st production is shifted towards the right (S -> AA.). A is seen
by the compiler. Since I5 is a part of the 1st production, the lookahead is the same, i.e. $.
 I2 goes to I6 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is the same,
i.e. $.
 I2 goes to I7 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen by
the compiler. Since I7 is a part of the 3rd production, the lookahead is the same, i.e. $.
 I3 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is the same,
i.e. a|b.
 I3 goes to I8 when ' . ' of the 2nd production is shifted towards the right (A -> aA.). A is
seen by the compiler. Since I8 is a part of the 2nd production, the lookahead is the same,
i.e. a|b.
 I6 goes to I9 when ' . ' of the 2nd production is shifted towards the right (A -> aA.). A is
seen by the compiler. Since I9 is a part of the 2nd production, the lookahead is the same,
i.e. $.
 I6 goes to I6 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is the same,
i.e. $.
 I6 goes to I7 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen
by the compiler. Since I7 is a part of the 3rd production, the lookahead is the same, i.e. $.
STEP 3:
Defining 2 functions, action[list of terminals] and goto[list of non-terminals], in the parsing
table. Below is the CLR parsing table.

Once we make a CLR parsing table, we can easily make an LALR parsing table from it.
In the STEP 2 diagram, we can see that
 I3 and I6 are similar except for their lookaheads.
 I4 and I7 are similar except for their lookaheads.
 I8 and I9 are similar except for their lookaheads.
In LALR parsing table construction, we merge these similar states.
 Wherever there is 3 or 6, make it 36 (combined form).
 Wherever there is 4 or 7, make it 47 (combined form).
 Wherever there is 8 or 9, make it 89 (combined form).
Below is the LALR parsing table.

Now we have to remove the duplicate rows:
 As we can see, the 36 row has the same data twice, so we delete one row.
 We combine the two 47 rows into one by merging each value into a single 47 row.
 We combine the two 89 rows into one by merging each value into a single 89 row.
The final LALR table looks like the one below.
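
A tiny Python sketch of this merging step, assuming the LR(1) item encoding (head, body, dot position, lookahead) used in the sketches earlier in this document:

# Merge LR(1) states that share the same core (their LR(0) items without
# lookaheads), as done when deriving the LALR(1) table from the CLR(1) states.
def core(state):
    return frozenset((head, body, dot) for (head, body, dot, _la) in state)

def merge_states(states):
    merged = {}
    for state in states:                  # group states by identical core
        merged.setdefault(core(state), set()).update(state)
    return [frozenset(items) for items in merged.values()]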
CLR Parser (with Examples)



LR parsers :
LR parsing is an efficient bottom-up syntax analysis technique that can be used to parse large
classes of context-free grammars; it is called LR(k) parsing.
L stands for left-to-right scanning
R stands for rightmost derivation in reverse
k stands for the number of lookahead input symbols

Advantages of LR parsing :
 It recognizes virtually all programming language constructs for which CFG can be
written
 It is able to detect syntactic errors
 It is an efficient non-backtracking shift reducing parsing method.
Types of LR parsing methods :
1. SLR
2. CLR
3. LALR
CLR Parser :
The CLR parser stands for canonical LR parser. It is a more powerful LR parser. It makes
use of lookahead symbols. This method uses a large set of items called LR(1) items. The
main difference between LR(0) and LR(1) items is that, in LR(1) items, it is possible to
carry more information in a state, which rules out useless reduction states. This extra
information is incorporated into the state by the lookahead symbol. The general syntax
becomes [A -> α.B, a]
where A -> α.B is the production and a is a terminal or the right-end marker $.
LR(1) items = LR(0) items + lookahead
How to add lookahead with the production?
CASE 1 –
A -> α.BC, a
Suppose this is the 0th production. Now, since ' . ' precedes B, we have to write B's
productions as well.
B -> .D [1st production]
Suppose this is B's production. The lookahead of this production is given as follows: we look at the
previous production, i.e. the 0th production. Whatever is after B, we find FIRST(of that
value); that is the lookahead of the 1st production. So, here in the 0th production, after B, C is
there. Assume FIRST(C) = d; then the 1st production becomes
B -> .D, d
CASE 2 –
Now if the 0th production was like this,
A -> α.B, a
Here, we can see there's nothing after B. So the lookahead of the 0th production will be the
lookahead of the 1st production, i.e.:
B -> .D, a
CASE 3 –
Assume a production A -> a|b
A -> a, $ [0th production]
A -> b, $ [1st production]
Here, the 1st production is a part of the previous production, so the lookahead will be the
same as that of its previous production.
These are the 2 rules of look ahead.
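
These lookahead rules are exactly what an LR(1) closure computes. Below is a minimal Python sketch for the grammar used in the example that follows (the FIRST sets are hard-coded because this grammar has no ε-productions; the item encoding is a choice made for this sketch):

# LR(1) closure for:  S -> AA,  A -> aA | b  (augmented with S' -> S).
# An item is (head, body_tuple, dot_position, lookahead).
GRAMMAR = {"S'": [("S",)], "S": [("A", "A")], "A": [("a", "A"), ("b",)]}
FIRST = {"S": {"a", "b"}, "A": {"a", "b"}, "a": {"a"}, "b": {"b"}}

def lookaheads(rest, la):
    # FIRST of the symbols after B, or the inherited lookahead if nothing follows
    return FIRST[rest[0]] if rest else {la}

def closure1(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot, la in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:
                for prod in GRAMMAR[body[dot]]:
                    for b in lookaheads(body[dot + 1:], la):
                        item = (body[dot], prod, 0, b)
                        if item not in items:
                            items.add(item)
                            changed = True
    return frozenset(items)

I0 = closure1({("S'", ("S",), 0, "$")})
# I0 contains A -> .aA and A -> .b with lookaheads a and b, matching CASE 1 and 2.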
Steps for constructing the CLR parsing table:
1. Writing the augmented grammar
2. Finding the LR(1) collection of items
3. Defining 2 functions, action[list of terminals] and goto[list of non-terminals], in the CLR
parsing table
EXAMPLE
Construct a CLR parsing table for the given context-free grammar
S-->AA
A-->aA|b
Solution :
STEP 1 – Find augmented grammar
The augmented grammar of the given grammar is:-
S'-->.S ,$ [0th production]
S-->.AA ,$ [1st production]
A-->.aA ,a|b [2nd production]
A-->.b ,a|b [3rd production]
Let's apply the rule of lookahead to the above productions.
 The initial lookahead is always $.
 Now, the 1st production came into existence because of ' . ' before 'S' in the 0th
production. There is nothing after 'S', so the lookahead of the 0th production will be the
lookahead of the 1st production, i.e.: S --> .AA, $
 Now, the 2nd production came into existence because of ' . ' before 'A' in the 1st
production. After 'A', there's 'A'. So, FIRST(A) is a, b.
Therefore, the lookahead for the 2nd production becomes a|b.
 Now, the 3rd production is a part of the 2nd production, so the lookahead will be the
same.
STEP 2 – Find LR(1) collection of items
Below is the figure showing the LR(1) collection of items. We will understand everything
one by one.
The terminals of this grammar are {a,b}
The non-terminals of this grammar are {S,A}
RULE-
1. If any non-terminal has ‘ . ‘ preceding it, we have to write all its production and add ‘ . ‘
preceding each of its production.
2. from each state to the next state, the ‘ . ‘ shifts to one place to the right.
3. All the rules of lookahead apply here.
 In the figure, I0 consists of the augmented grammar.
 I0 goes to I1 when ' . ' of the 0th production is shifted towards the right of S (S' -> S.). This
state is the accept state. S is seen by the compiler. Since I1 is a part of the 0th
production, the lookahead is the same, i.e. $.
 I0 goes to I2 when ' . ' of the 1st production is shifted towards the right (S -> A.A). A is seen
by the compiler. Since I2 is a part of the 1st production, the lookahead is the same, i.e.
$.
 I0 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is the
same, i.e. a|b.
 I0 goes to I4 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen
by the compiler. Since I4 is a part of the 3rd production, the lookahead is the same,
i.e. a|b.
 I2 goes to I5 when ' . ' of the 1st production is shifted towards the right (S -> AA.). A is seen
by the compiler. Since I5 is a part of the 1st production, the lookahead is the same, i.e.
$.
 I2 goes to I6 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is the
same, i.e. $.
 I2 goes to I7 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen by
the compiler. Since I7 is a part of the 3rd production, the lookahead is the same, i.e. $.
 I3 goes to I3 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I3 is a part of the 2nd production, the lookahead is the
same, i.e. a|b.
 I3 goes to I8 when ' . ' of the 2nd production is shifted towards the right (A -> aA.). A is
seen by the compiler. Since I8 is a part of the 2nd production, the lookahead is the
same, i.e. a|b.
 I6 goes to I9 when ' . ' of the 2nd production is shifted towards the right (A -> aA.). A is
seen by the compiler. Since I9 is a part of the 2nd production, the lookahead is the
same, i.e. $.
 I6 goes to I6 when ' . ' of the 2nd production is shifted towards the right (A -> a.A). a is
seen by the compiler. Since I6 is a part of the 2nd production, the lookahead is the
same, i.e. $.
 I6 goes to I7 when ' . ' of the 3rd production is shifted towards the right (A -> b.). b is seen
by the compiler. Since I7 is a part of the 3rd production, the lookahead is the same, i.e.
$.
STEP 3: Defining 2 functions, action[list of terminals] and goto[list of non-terminals], in the
parsing table. Below is the CLR parsing table.
 $ is the end-of-input marker; the accept entry is placed in its column.
 0,1,2,3,4,5,6,7,8,9 denote I0,I1,I2,I3,I4,I5,I6,I7,I8,I9.
 I0 gives A in I2, so 2 is added to the A column and 0 row.
 I0 gives S in I1, so 1 is added to the S column and 0 row.
 Similarly, 5 is written in the A column and 2nd row, 8 is written in the A column and 3rd row,
and 9 is written in the A column and 6th row.
 I0 gives a in I3, so S3 (shift 3) is added to the a column and 0 row.
 I0 gives b in I4, so S4 (shift 4) is added to the b column and 0 row.
 Similarly, S6 (shift 6) is added in the 'a' column and 2,6 rows, S7 (shift 7) is added in the b
column and 2,6 rows, S3 (shift 3) is added in the 'a' column and 3rd row, and S4 (shift 4) is
added in the b column and 3rd row.
 I4 is reduced as ' . ' is at the end. I4 corresponds to the 3rd production of the grammar, so write
r3 (reduce 3) in the lookahead columns. The lookaheads of I4 are a and b, so write r3 in the a
and b columns of the 4th row.
 I5 is reduced as ' . ' is at the end. I5 corresponds to the 1st production of the grammar, so write
r1 (reduce 1) in the lookahead columns. The lookahead of I5 is $, so write r1 in the $ column of the 5th row.
 Similarly, write r2 in the a and b columns of the 8th row, and write r2 in the $ column of the 9th row.
1. What does "LR" stand for in LR parsing?

Answer:
L stands for scanning the input Left-to-right, and R stands for constructing a Rightmost derivation in reverse.

2. What is LR parsing used for?

Answer:
LR parsing is used for bottom-up syntax analysis in compilers to parse deterministic context-free languages efficiently.

3. How is LR parsing different from LL parsing?

| Feature | LR Parsing | LL Parsing |
| Approach | Bottom-up | Top-down |
| Derivation | Rightmost (in reverse) | Leftmost |
| Can handle left recursion? | ✅ Yes | ❌ No |
| Grammar coverage | More powerful | Less powerful |

4. What are the types of LR parsers?

Answer:

 LR(0) – No lookahead
 SLR(1) – Uses FOLLOW sets
 LALR(1) – Optimized, used in practice (e.g., YACC)
 Canonical LR(1) – Most powerful, uses full context

5. What is a parsing table in LR parsing?

Answer:
A parsing table guides the parser on what action to take (shift, reduce, accept, error) based on the current state and
lookahead symbol.

6. What is a shift-reduce conflict in LR parsing?

Answer:
It occurs when the parser table can't decide whether to shift the next input symbol or reduce the symbols on the stack.
This usually indicates a grammar ambiguity or a limitation of the LR parser type.
7. Which LR parser type is most commonly used in practice?

Answer:
LALR(1) is most commonly used in practice because it balances power and efficiency. Tools like YACC and Bison use
it.

8. Can LR parsers handle ambiguous grammars?

Answer:
No, LR parsers require unambiguous grammars. Ambiguities must be resolved manually or by rewriting the grammar.

9. What makes Canonical LR(1) more powerful than LALR(1)?

Answer:
Canonical LR(1) uses full lookahead and context, making it able to resolve more conflicts and handle more grammars
than LALR(1), but at the cost of much larger parsing tables.

10. What is the role of lookahead in LR parsers?

Answer:
Lookahead helps the parser decide the correct action (shift or reduce) when multiple options are available — especially
important in SLR(1), LALR(1), and LR(1).
