NLP Chapter 3
Syntax parsing:
Syntax parsing, also known as syntactic parsing or simply parsing, is a fundamental task in
Natural Language Processing (NLP) that involves analyzing the grammatical structure of
sentences. The goal is to identify the syntactic structure of a sentence according to a given
grammar, typically representing it in a tree-like structure called a parse tree.
Syntax parsing is the process of determining the syntactic structure of a sentence by identifying
relationships between words, such as which words function as subjects, objects, predicates, and
how they combine to form phrases. This process is essential for understanding the meaning of a
sentence and is used in various NLP applications such as machine translation, information
extraction, question answering, and more.
1. Constituency Parsing
Constituency Parsing identifies the hierarchical structure of a sentence, breaking it down into
sub-phrases (constituents) that belong together. These constituents are typically represented in a
parse tree where each node represents a phrase, and the leaves are the individual words.
Phrase Structure Grammar: Constituency parsing is based on phrase structure grammars, such
as Context-Free Grammar (CFG). These grammars consist of a set of production rules that define
how sentences can be generated.
Parse Tree: In the parse tree, each internal node corresponds to a phrase or constituent, and the
edges connect nodes to their constituent parts. The tree is rooted in the start symbol (often S for
sentence) and spans out to cover the entire sentence.
Example: For the sentence "The cat sat on the mat," the parse tree can be written in bracketed form as:
(S
  (NP (Det The) (N cat))
  (VP (V sat)
      (PP (P on)
          (NP (Det the) (N mat)))))
Here, "S" is the sentence, "NP" is the noun phrase ("The cat"), "VP" is the verb phrase ("sat on
the mat"), "Det" is the determiner, "V" is the verb, and "PP" is the prepositional phrase.
2. Dependency Parsing
Dependency parsing analyzes the grammatical structure of a sentence by establishing binary head-dependent relations between words, rather than grouping words into nested constituents.
Dependency Tree: In a dependency tree, each word is connected to its governor (or head)
through a directed edge, forming a tree structure rooted at the main verb or another central word.
The tree shows which words are directly dependent on others.
Example:
For the sentence "The cat sat on the mat," a possible dependency tree could be:
        sat
       /    \
     cat     on
      |       |
     The     mat
              |
             the
Here, "sat" is the root, and "cat" is its subject, "on" is the prepositional modifier, "mat" is the
object of "on," and "the" modifies "mat."
Parsing Techniques
Several techniques and algorithms are used for syntactic parsing, both for constituency and
dependency parsing.
1. Top-Down Parsing
Method: Begins with the start symbol and tries to derive the sentence by recursively applying
production rules.
2. Bottom-Up Parsing
Method: Starts with the input sentence and attempts to construct the parse tree by combining
constituents until the start symbol is derived.
3. Transition-Based Parsing
Method: Uses a series of transitions to build a parse tree incrementally. Often used in
dependency parsing.
4. Neural Parsing
Method: Uses neural networks, particularly Recurrent Neural Networks (RNNs) and
Transformer models, to predict the structure of sentences.
Advantage: High accuracy and ability to learn from large corpora without relying on hand-
crafted grammars.
Evaluation of Parsers
Parse Accuracy: Measures how accurately the parser identifies the correct structure.
Precision, Recall, and F1-Score: Used to evaluate the correctness of the parser's predictions
compared to a gold standard.
UAS (Unlabeled Attachment Score): In dependency parsing, UAS measures the percentage of
words attached to the correct head, ignoring the label of the relation (a small computation sketch follows this list).
Speed: How quickly the parser can analyze sentences, important for real-time applications.
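The UAS computation mentioned above reduces to comparing predicted head indices with gold-standard head indices. A minimal sketch, with a made-up gold and predicted analysis for "The cat sat on the mat":

def uas(gold_heads, pred_heads):
    """Unlabeled Attachment Score: fraction of words whose predicted head
    matches the gold-standard head (relation labels are ignored)."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Heads given as 1-based token indices, 0 = root (hypothetical analyses).
# Tokens:  1 The  2 cat  3 sat  4 on  5 the  6 mat
gold = [2, 3, 0, 3, 6, 4]   # The->cat, cat->sat, sat->ROOT, on->sat, the->mat, mat->on
pred = [2, 3, 0, 3, 6, 3]   # the parser wrongly attaches "mat" to "sat"
print(uas(gold, pred))      # 5/6, approximately 0.83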
Challenges in Parsing
Ambiguity: Sentences can often be parsed in multiple valid ways. Ambiguity is a significant
challenge, especially in natural language with complex sentences.
Complex Sentences: Long and complex sentences with multiple clauses or nested structures can
be difficult to parse accurately.
Data Sparsity: In probabilistic models, rare or unseen constructions in the training data can lead
to poor performance.
Dependency on Annotated Data: High-quality parsing often requires large annotated corpora,
which can be costly to produce.
Applications
Grammar Checking: Used in tools that check and correct grammatical errors in text.
Conclusion
Syntax parsing is a critical component of many NLP systems, providing a foundation for
understanding the structure and meaning of language. Whether through traditional methods like
constituency and dependency parsing or modern neural approaches, syntax parsing enables a
deeper analysis of language, which is essential for advanced NLP applications.
Grammar Formalism
Grammar formalism refers to a set of rules and structures that define the syntactic (and
sometimes semantic) structure of sentences in a language. These formalisms are essential for
tasks such as parsing, where the goal is to analyze the structure of sentences according to a
specific grammar.
1. Context-Free Grammar (CFG)
Definition: A CFG defines sentence structure through production rules that expand non-terminal symbols (such as S, NP, VP) into sequences of non-terminals and terminal symbols (words).
Example:
S → NP VP
NP → Det N
VP → V NP
Det → "the"
N → "cat" | "mat"
V → "sat"
Parse Tree: Sentences are parsed into a hierarchical tree structure using these rules.
2. Probabilistic Context-Free Grammar (PCFG)
Definition: A PCFG is an extension of CFG that assigns probabilities to each production rule,
allowing for the representation of uncertainty and preference in syntactic structures.
Application: PCFGs are used to model the likelihood of different parse trees for a given
sentence.
Example:
S → NP VP [0.9]
NP → Det N [0.5]
VP → V NP [0.8]
Benefit: PCFGs help in resolving ambiguities by choosing the most probable parse tree.
3. Dependency Grammar
Definition: Dependency grammar focuses on the relationships between words in a sentence,
where words are connected by directed edges (dependencies) indicating which words depend on
others.
Structure: The grammar consists of dependency rules that specify how words relate to each
other (e.g., subject-verb, verb-object).
Example: In the sentence "The cat sat on the mat," "sat" might be the root verb, with "cat" as the
subject and "on the mat" as a prepositional phrase dependent on "sat."
Dependency Tree: The structure is represented in a dependency tree rather than a constituency
tree.
4. Lexical Functional Grammar (LFG)
Definition: LFG is a grammar formalism that separates syntactic structure (constituent structure)
from functional structure (grammatical functions like subject and object).
Structure: Pairs a constituent structure (c-structure), a conventional phrase-structure tree, with a functional structure (f-structure) that records grammatical functions such as subject, object, and tense.
Application: LFG is used to capture the syntactic structure and its associated grammatical
functions.
5. Head-Driven Phrase Structure Grammar (HPSG)
Definition: HPSG is a constraint-based grammar formalism in which rich grammatical information is associated with the heads of phrases.
Structure: It uses feature structures (complex attribute-value matrices) to represent syntactic and
semantic properties of phrases.
Application: HPSG is used in syntactic parsing and in capturing rich linguistic information.
6. Tree-Adjoining Grammar (TAG)
Definition: TAG is a highly structured grammar formalism where elementary trees represent
basic syntactic structures, and complex sentences are derived by combining these trees.
Structure: Elementary trees (initial and auxiliary trees) are combined through two operations, substitution and adjunction.
Treebanks
Treebanks are annotated corpora that provide a collection of sentences paired with their syntactic
(and sometimes semantic) structures, typically represented as parse trees. They are essential
resources for training and evaluating syntactic parsers and other NLP tools.
Importance of Treebanks
Treebanks supply gold-standard syntactic annotations that are used to train statistical and neural parsers, to evaluate parser output against a common reference, and to support empirical linguistic research.
Types of Treebanks
1. Penn Treebank
Description: One of the most famous treebanks, the Penn Treebank, contains syntactic
annotations for English sentences based on a CFG formalism.
Contents: Includes parse trees for over 4.5 million words of text from sources like the
Wall Street Journal.
Usage: Widely used for training and evaluating syntactic parsers.
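For reference, NLTK ships with a small sample of the Penn Treebank. A sketch for inspecting one annotated sentence (assuming NLTK and its "treebank" data package are installed):

import nltk
nltk.download("treebank")          # fetches the Penn Treebank sample bundled with NLTK
from nltk.corpus import treebank

tree = treebank.parsed_sents()[0]  # first parsed Wall Street Journal sentence in the sample
tree.pretty_print()                # draw the annotated parse tree
print(tree.productions()[:5])      # the first few CFG productions used in this parse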
2. Chinese Treebank
Description: A treebank for Mandarin Chinese, providing syntactic annotations similar
to those in the Penn Treebank.
Contents: Contains thousands of parsed sentences, primarily from news sources.
Usage: Used for training Chinese syntactic parsers and for linguistic studies on Mandarin
Chinese.
3. TIGER Treebank
Description: A German treebank that uses both phrase structure and dependency
annotations.
Contents: Provides annotations for sentences from German newspapers.
Usage: Used for research in German syntax and for training German parsers.
4. OntoNotes
Description: A large-scale corpus that includes syntactic trees, predicate-argument
structures, coreference chains, and word senses.
Contents: Covers multiple languages and genres, including newswire, broadcast
conversation, and weblogs.
Usage: Used in a wide range of NLP tasks, including parsing, coreference resolution, and
semantic role labeling.
Challenges in Annotation
Ambiguity:
Natural language is often ambiguous, meaning that a sentence can have multiple valid
parses. Annotators must decide on the most appropriate parse, which can be subjective.
Consistency:
Annotation guidelines must be applied uniformly across thousands of sentences; disagreements between annotators reduce the reliability of the treebank.
Complexity:
Sentences can be syntactically complex, with nested structures and long dependencies,
making annotation difficult.
Formalisms need to be expressive enough to capture these complexities without
becoming unwieldy.
Language Variation:
Languages (and domains within a language) differ in their syntactic structures, so annotation schemes and formalisms often need to be adapted rather than reused unchanged.
Conclusion
Grammar formalism and treebanks are central to syntactic analysis in NLP. Grammar formalisms
provide the theoretical foundation for understanding and analyzing sentence structure, while
treebanks offer the practical data needed to train and evaluate models. Together, they enable a
deeper understanding of language, supporting a wide range of NLP applications from parsing to
machine translation.
Features and Unification
Features are attributes or properties associated with linguistic elements such as words, phrases,
or syntactic categories. These features can encode various types of grammatical information,
such as:
Syntactic Features: Part of speech, number, person, tense, and case.
Semantic Features: Properties such as animacy or gender that constrain how words can combine.
Lexical Features: Word-specific properties, such as a verb's argument requirements.
Subcategorization: Information about the types of complements a verb can take (e.g.,
whether a verb requires a direct object).
Head Features: Features that govern the behavior of phrases (e.g., the main verb in a
verb phrase).
"She":
POS: Pronoun
Number: Singular
Person: Third
Gender: Feminine
"sings":
POS: Verb
Tense: Present
Number: Singular
Person: Third
In this case, the verb "sings" must agree with the subject "She" in both number (singular) and
person (third person). These features help ensure that the sentence is grammatically correct.
Unification is a computational process used in syntax parsing to ensure that features match or
combine correctly across different elements in a sentence. Unification checks that the features of
syntactic constituents are compatible and merges them when they are.
Unification operates on feature structures, which are sets of attribute-value pairs associated with
linguistic elements. These structures can be represented as:
[ Subject: [ Number: Singular, Person: Third ],
  Tense: Present ]
Unification involves comparing two feature structures and merging them if they are compatible.
If there is a conflict (e.g., one structure has "Number: Singular" and the other has "Number:
Plural"), unification fails, indicating a grammatical error or incompatibility.
Example of Unification
The subject "She" has the features [ Number: Singular, Person: Third ].
The verb "sings" requires a subject with matching features.
During parsing, unification checks that the features of "She" match the requirements of "sings":
1. Subject Features: [ Number: Singular, Person: Third ]
2. Verb Features (required of the subject): [ Number: Singular, Person: Third ]
If the subject and verb features are compatible (which they are in this case), unification succeeds,
and the sentence is considered grammatically correct.
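As a concrete sketch, NLTK's FeatStruct objects support exactly this kind of unification; the feature names and values below are illustrative:

import nltk

# Feature structure for the subject "She" and the features "sings" requires of its subject
subject = nltk.FeatStruct("[NUMBER='singular', PERSON='third']")
required = nltk.FeatStruct("[NUMBER='singular', PERSON='third']")

print(subject.unify(required))          # compatible -> merged feature structure (unification succeeds)

plural_subject = nltk.FeatStruct("[NUMBER='plural', PERSON='third']")
print(plural_subject.unify(required))   # conflicting NUMBER values -> None (unification fails)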
Advantages of Unification
Constraint Handling: Unification can enforce linguistic constraints (such as agreement) during
parsing, ensuring that only grammatically correct structures are generated.
Modularity: Feature structures can be easily extended or modified, allowing for modular
grammar development and adaptation to different languages or linguistic theories.
Challenges of Unification
Ambiguity Handling: In cases where multiple feature structures can unify with a given input,
ambiguity arises, requiring additional mechanisms to select the most appropriate parse.
Error Handling: When unification fails, it can be challenging to diagnose and recover from the
error, especially in natural language with its inherent variability and exceptions.
Conclusion
Features and unification are powerful tools in syntax parsing, allowing for detailed and flexible
representations of linguistic information. By encoding grammatical features and using
unification to ensure consistency, these mechanisms support accurate and expressive parsing in a
wide range of grammar formalisms.
Parsing with Context-Free Grammar (CFG)
Components of a CFG
Non-Terminal Symbols (Variables): These are syntactic categories like Sentence (S), Noun
Phrase (NP), Verb Phrase (VP), etc. They represent abstract grammatical structures that can be
expanded into more specific forms.
Terminal Symbols: These are the actual words or tokens in the language. For example, "cat,"
"sat," and "mat" are terminal symbols.
Production Rules: These rules define how non-terminal symbols can be expanded into
sequences of non-terminal and/or terminal symbols. For example, a production rule might
specify that a sentence (S) consists of a noun phrase (NP) followed by a verb phrase (VP):
S → NP VP
Start Symbol: This is the initial non-terminal symbol from which the parsing process begins,
typically denoted as S (for Sentence).
Example of a CFG
S → NP VP
NP → Det N
VP → V NP | V
Det → "the"
N → "cat" | "mat"
V → "sat" | "saw"
Parsing with CFG involves determining whether a given sentence can be generated by the
grammar and, if so, constructing a parse tree that represents the syntactic structure of the
sentence.
Parse Trees
A parse tree is a tree structure that represents the syntactic derivation of a sentence according to a
CFG. Each internal node of the tree corresponds to a non-terminal symbol, and the leaves
correspond to terminal symbols (words).
For the sentence "The cat sat," the parse tree would look like this:
        S
       / \
     NP    VP
    /  \    |
  Det   N   V
   |    |   |
  The  cat sat
This tree shows that the sentence "The cat sat" is parsed into a noun phrase (NP) "The cat" and a
verb phrase (VP) "sat."
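The same tree can also be built programmatically from its bracketed form, for example with NLTK (a small sketch):

import nltk

# Bracketed form of the parse tree above
tree = nltk.Tree.fromstring("(S (NP (Det The) (N cat)) (VP (V sat)))")
tree.pretty_print()              # draws the tree as ASCII art
print(tree.leaves())             # ['The', 'cat', 'sat']
print(tree.label())              # 'S'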
Parsing Algorithms
1. Top-Down Parsing:
Approach: Starts with the start symbol (S) and attempts to rewrite it into the sentence by
recursively applying production rules.
Method: This approach tries to match the input sentence by expanding the non-terminals
according to the rules.
Limitation: Top-down parsing may generate many unnecessary trees and struggle with left
recursion (where a non-terminal leads to itself without consuming input).
2. Bottom-Up Parsing:
Approach: Starts with the input sentence (terminal symbols) and attempts to reduce it to the
start symbol (S) by applying production rules in reverse.
Method: This approach tries to construct the parse tree from the leaves (terminals) up to the root
(start symbol).
Limitation: Bottom-up parsing may generate many partial trees that do not lead to a complete
parse.
3. CYK Algorithm (Cocke-Younger-Kasami):
Approach: A dynamic programming algorithm that uses a tabular method to efficiently parse
sentences with a CFG.
Method: It breaks the sentence into smaller parts and checks if these parts can be generated by
the grammar, filling a table with possible non-terminal symbols for each substring.
Limitation: The CYK algorithm requires the grammar to be in Chomsky Normal Form (CNF),
where each production rule has at most two non-terminal symbols on the right-hand side.
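To make the tabular idea concrete, here is a minimal CYK recognizer sketch in Python. The grammar representation (a dict from right-hand sides to left-hand sides) is an assumption made for this example, and the grammar must already be in CNF:

from collections import defaultdict

def cyk_recognize(words, grammar, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar.
    `grammar` maps a right-hand-side tuple to the set of non-terminals producing it,
    e.g. {("NP", "VP"): {"S"}, ("the",): {"Det"}}."""
    n = len(words)
    table = defaultdict(set)                      # table[(i, j)] = non-terminals spanning words[i:j]
    for i, w in enumerate(words):                 # fill length-1 spans from lexical rules
        table[(i, i + 1)] |= grammar.get((w,), set())
    for span in range(2, n + 1):                  # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # try every split point
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        table[(i, j)] |= grammar.get((B, C), set())
    return start in table[(0, n)]

# Toy CNF grammar for "the cat sat" (illustrative)
toy = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"},
       ("the",): {"Det"}, ("cat",): {"N"}, ("sat",): {"VP"}}
print(cyk_recognize("the cat sat".split(), toy))  # True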
4. Earley Parser:
Approach: A top-down parsing algorithm that efficiently handles all CFGs, including those with
left recursion.
Method: It incrementally processes the input sentence, using a state machine to track which
parts of the grammar rules have been matched.
Advantage: The Earley parser is particularly useful for ambiguous or complex grammars: it
handles left recursion, runs in at most cubic time in the sentence length, and is faster on unambiguous or deterministic grammars.
Ambiguity in CFG Parsing
Ambiguity arises when a sentence can be parsed in more than one way, resulting in multiple
valid parse trees. This is a common issue in natural languages. For example, the sentence "I saw
the man with the telescope" can be interpreted as:
NP attachment: "I saw [the man with the telescope]" (the man has a telescope)
VP attachment: "I saw [the man] [with the telescope]" (I used the telescope)
Applications of CFG Parsing
1. Syntactic Parsing:
CFG parsing is used in syntactic parsers to analyze the grammatical structure of sentences in
applications like machine translation, information extraction, and question answering.
2. Speech Recognition:
CFGs are employed in speech recognition systems to help determine the most likely sequence of
words from an input audio signal.
3. Natural Language Understanding:
CFG parsing helps in understanding the structure of natural language inputs, which is crucial for
dialogue systems, chatbots, and other NLP applications.
4. Programming Languages:
CFGs are also used in compilers and interpreters to parse the syntax of programming languages.
Limitations of CFG Parsing
1. Limited Expressiveness:
CFGs cannot capture all the syntactic phenomena of natural languages, such as cross-serial
dependencies or certain agreement patterns, which require more powerful formalisms.
2. Ambiguity Handling:
CFGs often lead to ambiguous parses, and without additional mechanisms like probabilistic
parsing, it can be challenging to select the correct parse.
3. Context-Sensitivity:
CFGs do not account for context-sensitive aspects of language, such as word meanings that
depend on context, requiring additional models or rules to handle these cases.
Conclusion
Parsing with Context-Free Grammar is a fundamental technique in NLP for analyzing the
syntactic structure of sentences. Despite its limitations, CFG is widely used due to its simplicity
and the efficiency of its associated parsing algorithms. Understanding CFG parsing is essential
for many NLP tasks, as it forms the basis for more advanced syntactic and semantic analysis.
Parsing with Probabilistic Context-Free Grammar (PCFG)
Components of a PCFG
Non-Terminal Symbols (Variables): Similar to CFG, these represent syntactic categories like
Sentence (S), Noun Phrase (NP), Verb Phrase (VP), etc.
Terminal Symbols: These are the actual words or tokens in the language.
Start Symbol: The initial non-terminal symbol from which parsing begins.
Rule Probabilities: Each production rule A → α is assigned a probability P(A → α), and the probabilities of all rules for a given non-terminal A must sum to 1:
∑_α P(A → α) = 1
Example of a PCFG
S → NP VP [1.0]
NP → Det N [0.8]
NP → N [0.2]
VP → V NP [0.7]
VP → V [0.3]
Det → "the" [1.0]
N → "cat" [0.5]
N → "mat" [0.5]
V → "sat" [0.6]
V → "saw" [0.4]
Parsing with a PCFG involves finding the most likely parse tree for a given sentence. This is
done by calculating the probability of each possible parse tree and selecting the one with the
highest probability.
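Using the example PCFG above, NLTK's Viterbi parser returns the most probable parse directly. A minimal sketch:

import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP  [1.0]
    NP -> Det N  [0.8]
    NP -> N      [0.2]
    VP -> V NP   [0.7]
    VP -> V      [0.3]
    Det -> 'the' [1.0]
    N  -> 'cat'  [0.5]
    N  -> 'mat'  [0.5]
    V  -> 'sat'  [0.6]
    V  -> 'saw'  [0.4]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the cat sat".split()):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability: 1.0 * 0.8 * 1.0 * 0.5 * 0.3 * 0.6 = 0.072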
For the sentence "The cat sat," there might be different possible parse trees, each with an
associated probability. For example:
Tree 1 (using NP → Det N [0.8]):
         S [1.0]
        /       \
   NP [0.8]   VP [0.3]
    /    \        |
  Det     N    V [0.6]
   |      |       |
  The    cat     sat
Tree 2 (using NP → N [0.2], covering only "cat sat"):
         S [1.0]
        /       \
   NP [0.2]   VP [0.3]
      |           |
      N        V [0.6]
      |           |
     cat         sat
The parser would choose the tree with the higher probability (Tree 1 in this case).
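Concretely, the probability of Tree 1 is the product of the probabilities of the rules it uses, taking the rule probabilities from the example grammar above:
P(Tree 1) = P(S → NP VP) × P(NP → Det N) × P(Det → "the") × P(N → "cat") × P(VP → V) × P(V → "sat")
          = 1.0 × 0.8 × 1.0 × 0.5 × 0.3 × 0.6 = 0.072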
The algorithms used for parsing with PCFGs are similar to those used with traditional CFGs, but
they are adapted to handle probabilities:
1. Probabilistic CYK (Viterbi) Algorithm:
A bottom-up dynamic programming algorithm that extends the CYK table with probabilities, keeping the most probable analysis for each span and thereby finding the most likely parse tree.
2. Earley Parser:
A top-down algorithm that can be extended to handle probabilities, allowing it to
efficiently parse sentences with PCFGs, even in the presence of left recursion.
3. Inside-Outside Algorithm:
Used to compute the probabilities of subtrees in a parse forest, often applied in training
PCFGs to estimate rule probabilities from data.
Training PCFGs
The probabilities in a PCFG can be estimated from a treebank (a corpus of parsed sentences)
using various methods:
Maximum Likelihood Estimation: Each rule's probability is the count of that rule in the treebank divided by the count of all rules expanding the same non-terminal (a small sketch follows this list).
Inside-Outside Algorithm: An expectation-maximization method that re-estimates rule probabilities when full parse trees are not available.
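A minimal sketch of the maximum-likelihood approach, using NLTK's bundled Penn Treebank sample and its induce_pcfg helper (rule probabilities become relative frequencies of productions):

import nltk
from nltk.corpus import treebank

nltk.download("treebank")                     # Penn Treebank sample shipped with NLTK

productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()         # collect every rule used in the gold trees

# Relative-frequency (maximum likelihood) estimates of rule probabilities
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
print(grammar.productions()[:5])              # a few estimated rules with their probabilities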
Advantages of PCFGs
1. Handling Ambiguity:
PCFGs allow the parser to choose the most likely parse tree among many possible trees, which is
essential for disambiguating sentences with multiple interpretations.
2. Scalability:
PCFGs can be efficiently trained and used for parsing large corpora, making them suitable for
real-world NLP applications.
3. Probabilistic Reasoning:
By incorporating probabilities, PCFGs can model the inherent variability and uncertainty in
natural language, leading to more accurate parsing results.
Limitations of PCFGs
1. Independence Assumptions:
PCFGs assume that the choice of a production rule depends only on the current non-terminal and
not on the rest of the parse tree, which can lead to oversimplifications.
2. Limited Expressiveness:
While PCFGs are more powerful than CFGs, they still struggle with certain linguistic
phenomena, such as long-distance dependencies and complex agreement patterns.
3. Data Sparsity:
Estimating probabilities for rare rules can be challenging, especially in domains with limited
training data, leading to sparsity issues.
Applications of PCFGs
1. Syntactic Parsing:
PCFGs are widely used in syntactic parsers, such as the Stanford Parser, to analyze the structure
of sentences in tasks like machine translation, information extraction, and question answering.
2. Speech Recognition:
In speech recognition, PCFGs can help model the syntactic structure of spoken language,
improving the accuracy of word sequence predictions.
3. Natural Language Understanding:
PCFG-based parsers are used in applications that require deep understanding of sentence
structure, such as dialogue systems and automated essay scoring.
Conclusion
PCFGs extend CFGs with rule probabilities, giving parsers a principled way to rank competing analyses and select the most probable parse, despite the formalism's simplifying independence assumptions.
Lexicalized PCFGs
1. Context-Free Grammars (CFGs):
CFGs are formal grammars used to define the syntactic structure of sentences in a
language.
They consist of a set of production rules that describe how sentences can be generated
from a start symbol by recursively replacing symbols with sequences of other symbols.
2. Probabilistic Context-Free Grammars (PCFGs):
PCFGs are an extension of CFGs where each production rule is associated with a
probability.
These probabilities are used to model the likelihood of different syntactic structures,
allowing the grammar to choose the most probable parse tree for a given sentence.
The probability of a parse tree is the product of the probabilities of the rules used to
generate the tree.
3. Lexicalization:
In a standard PCFG, the rules are purely syntactic and do not consider specific words
(lexical items) in a sentence.
Lexicalization involves enriching the CFG by attaching a specific word (called a head
word) to each non-terminal symbol in the grammar.
The head word is a key word in the phrase that influences its syntactic behavior and
semantic interpretation (e.g., the verb in a verb phrase).
4. Lexicalized PCFGs:
A lexicalized PCFG attaches a head word to each non-terminal (for example, VP(sat) instead of VP) and conditions rule probabilities on these head words (a small head-word annotation sketch follows this list).
5. Advantages:
Improved Parsing Accuracy: By incorporating lexical information, the parser can more
accurately reflect the syntactic structure of a sentence, especially in cases where syntax is
closely tied to specific word choices.
Handling Ambiguity: Lexicalized PCFGs are better at resolving syntactic ambiguities,
as they consider the influence of particular words on the structure of the sentence.
6. Challenges:
Data Sparsity: Since lexicalized rules are more specific, they require more data to
estimate reliable probabilities. This can lead to issues with sparse data, especially in cases
where certain word combinations are rare.
Computational Complexity: The addition of lexical information increases the size of
the grammar, making parsing more computationally expensive.
7. Applications:
Lexicalized PCFGs are commonly used in syntactic parsers for natural language
understanding tasks, such as machine translation, speech recognition, and information
extraction.
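As a rough illustration of the lexicalization step described in points 3 and 4, the sketch below annotates each non-terminal of a parse tree with a head word, using a few made-up head-finding rules (real parsers use much richer head tables):

import nltk

# Hypothetical head-finding rules: which child category supplies the head of each phrase
HEAD_CHILD = {"S": "VP", "VP": "V", "NP": "N", "PP": "P"}

def lexicalize(tree):
    """Annotate every non-terminal with the head word of the phrase it dominates
    (e.g. VP becomes VP(sat)). Returns the head word of `tree`."""
    if isinstance(tree, str):                       # a leaf word is its own head
        return tree
    heads = [lexicalize(child) for child in tree]   # lexicalize children first
    label = tree.label()
    head = heads[0]                                 # default: head of the first child
    wanted = HEAD_CHILD.get(label)
    if wanted is not None:
        for child, h in zip(tree, heads):
            if not isinstance(child, str) and child.label().split("(")[0] == wanted:
                head = h                            # take the head of the designated head child
    tree.set_label(f"{label}({head})")
    return head

t = nltk.Tree.fromstring("(S (NP (Det The) (N cat)) (VP (V sat)))")
lexicalize(t)
print(t)   # (S(sat) (NP(cat) (Det(The) The) (N(cat) cat)) (VP(sat) (V(sat) sat)))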
In summary, lexicalized PCFGs represent a sophisticated approach to parsing that integrates both
syntactic and lexical information, leading to more accurate and context-sensitive language
models.